Judgment, Independence, and the Role of Testing in a Black-Box World
The Appraisal Standards Board’s proposed Advisory Opinion 41 (AO-41), Use of Technology in an Appraisal or Appraisal Review Assignment, has generated thoughtful—and in some cases pointed—discussion across the appraisal and collateral-risk communities. Much of that discussion centers on what AO-41 does not do: it does not define “technology,” it does not distinguish sharply between process tools and product tools, and it does not resolve long-standing tensions in USPAP between established practice and emerging methods. Those critiques are valid. But they also risk missing what AO-41 is really trying to accomplish. In our view, AO-41 is not about endorsing new technology, nor is it about forcing appraisers to become data scientists or software engineers. It is about how appraisers demonstrate professional judgment and competency when technology—especially opaque, third-party technology—becomes unavoidable. That problem is not new. What is new is its scale.
We’ve Seen This Movie Before
Many appraisers will recognize the pattern. When multiple regression analysis (MRA) entered mainstream appraisal education, it was often presented as a way to produce mathematically precise, “market-supported” adjustments. In practice, MRA worked well in some markets and poorly in others. The issue was not regression itself—it was that appraisers were encouraged to use it without sufficient conceptual grounding in when its results were meaningful and when they were not. The result was often false confidence rather than better judgment. AO-41 reflects a similar inflection point—this time driven by AVMs, machine learning, computer vision, and generative AI. The tools are more powerful, more opaque, and far more client-driven than before. But the professional obligation has not changed: only the appraiser produces assignment results.
AVMs, AI, and the Accountability Gap
One criticism raised in recent commentary is that AVMs are not subject to USPAP, are not transparent, and operate based on lender-defined scope and inputs. All of that is true. But it is precisely why AO-41 exists. AO-41 does not attempt to pull AVMs under USPAP. Instead, it forces an uncomfortable but necessary question: What does competent reliance look like when the mechanics of the tool are outside the appraiser’s control? AO-41 answers that question indirectly. It makes clear that appraisers are not required to understand or replicate algorithms—but they are required to understand enough to evaluate relevance, limitations, and credibility for the intended use. That is a judgment problem, not a coding problem.
Independent Testing as a Competency Enabler
This is where the industry conversation needs to mature. For opaque tools, competency cannot reasonably come from inside the model. It must come from external, objective evidence of how the tool behaves. Independent, third-party testing—conducted outside the appraisal assignment—can provide exactly that context:
* historical accuracy and dispersion,
* stability across markets, price tiers, and property types,
* known limitations or failure modes, and
* awareness of differential performance that may raise fair housing concerns.
Importantly, independent testing does not replace appraisal analysis or judgment. It produces informational evidence, not assignment results. It helps appraisers answer a practical AO-41 question: Is reliance on this tool reasonable here, or should it be limited—or avoided altogether? Or, as our motto here at AVMetrics puts it: “The best thing an AVM can tell you is when NOT to use it.” This framing is fully consistent with AO-41’s core principles and with the Interagency AVM Quality Control Standards, which emphasize ongoing monitoring of AVM accuracy, reliability, and potential bias. Appraisers are not being asked to perform fair lending analysis—but awareness of model behavior across market segments is now inseparable from credibility.
Education, Not Enforcement
Another concern raised in recent commentary is that AO-41 risks merging new tools into old expectations and legacy education. That concern is well taken. In our opinion, USPAP has always struggled to balance encouragement of new methods with deference to established practice. The path forward is not more prescriptive rules. It is better education and clearer boundaries. Appraisers do not need to know how an AVM or AI model works internally. But they should be able to explain, in plain language:
* why a tool was appropriate (or not) for a specific assignment,
* how its output was evaluated for reasonableness, and
* why reliance was full, limited, or declined.
If that explanation cannot be made clearly—“to a sixth grader,” as one educator recently put it—then reliance probably wasn’t appropriate.
What AO-41 Is Really Signaling
AO-41 is not a referendum on technology. It is a signal that the profession needs:
* clearer educational pathways,
* shared reference points for evaluating opaque tools, and
* realistic expectations about what appraisers are—and are not—being asked to understand.
If the exposure process leads to broader recognition that independent testing and education are necessary supports for professional judgment—not substitutes for it—then AO-41 will have served a useful purpose, even as its language continues to evolve. That conversation is exactly what the exposure draft process is meant to surface. And it is one the appraisal and collateral-risk communities should continue—carefully, constructively, and with judgment front and center.
The Fair Housing analyses published by AVM vendors such as Veros and Clear Capital represent important early efforts to evaluate potential disparate impact in automated valuation models. These studies contribute useful perspective to an evolving area of the industry, but they are inherently constrained by scope, methodology, and—most importantly—objectivity. Their findings are self-assessments rather than independent evaluations: each vendor analyzes only its own model, using its own data and assumptions, and typically concludes that little to no bias exists, which limits their usefulness for broader risk management and supervisory purposes.
Regulated institutions, however, must operate under much more rigorous expectations. The new Interagency AVM Quality Control Standards require lenders to demonstrate that AVMs used in credit decisions are independently validated and fairly applied. This standard cannot be meaningfully satisfied by vendor-authored whitepapers alone.
AVMetrics’ methodology is designed specifically to meet these supervisory needs. Rather than focusing on individual model performance within internally defined samples, AVMetrics conducts standardized, national-level testing across 700,000 to 1 million transactions each quarter. This approach ensures that fairness conclusions reflect real-world market diversity and enables consistent evaluation across models, markets, and time.
AVMetrics independently tests eight different dimensions in which AVMs could potentially disadvantage protected classes, including coverage rates (hit rate), accuracy, precision, and other core performance measures. To support statistically meaningful comparisons, AVMetrics has invested in neighborhood-level demographic data, enabling analysis across comparison neighborhoods, avoiding the masking effects of county-level aggregation while preserving sufficient sample size beyond census-tract granularity.
Further, AVMetrics applies Standardized Mean Difference (SMD)—the same effect-size metric commonly used in fair-lending analytics—providing a clear measure of whether disparities are material, not simply detectable. In contrast, many model-specific analyses typically use raw accuracy differences or simple correlations, which offer no interpretive scale for examiners assessing practical significance. AVMetrics’ approach produces metrics that are grounded in established methodology, interpretable, and defensible.
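To make the metric concrete, here is a minimal sketch of an SMD calculation for two groups of AVM percentage errors. The variable names and simulated data are purely illustrative, and reading the result against Cohen’s conventional thresholds (roughly 0.2 small, 0.5 medium, 0.8 large) is our assumption about interpretation, not a statement of AVMetrics’ internal cutoffs:

```python
import numpy as np

def standardized_mean_difference(errors_a, errors_b):
    """Cohen's-d-style SMD: difference in group means scaled by the
    pooled standard deviation, yielding a unitless effect size."""
    a, b = np.asarray(errors_a, float), np.asarray(errors_b, float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) +
                  (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical example: AVM percentage errors in two comparison neighborhoods.
rng = np.random.default_rng(0)
group_a = rng.normal(0.02, 0.10, 5_000)  # mean error +2%
group_b = rng.normal(0.05, 0.10, 5_000)  # mean error +5%
print(standardized_mean_difference(group_a, group_b))  # about -0.3
```

Unlike a raw difference in error rates, the SMD is on a scale that is comparable across metrics and sample sizes, which is what gives examiners an interpretive yardstick for practical significance.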
As the next generation of AVMs incorporates increasingly complex machine learning and generative AI techniques, vendor-driven testing becomes even less transparent. AVMetrics’ methodology is intentionally model-agnostic: we can evaluate the fairness and performance of traditional hedonic models, GBDT-based systems, deep learning models, or hybrid AI architectures with equal rigor. As models become more opaque, a neutral, independent evaluator becomes increasingly essential.
In contrast to analyses intended to provide general assurance around individual models, AVMetrics delivers regulatory-grade evidence. By identifying how model risk and policy risk can interact to generate disproportionate impacts—an expectation embedded in the new regulatory framework—our testing equips lenders with the actionable intelligence needed to inform, calibrate, and justify their risk-policy decisioning.
As regulatory expectations around AVM fairness continue to mature, institutions must move beyond model-specific assurances toward independent, repeatable, and scalable evaluation frameworks. AVMetrics’ fair housing methodology is purpose-built to meet these expectations, providing lenders with nationally consistent, statistically rigorous, and model-agnostic evidence of AVM performance and potential disparate impact. By aligning testing design with supervisory standards and real-world production environments, AVMetrics enables institutions not only to identify and manage fair-lending risk, but also to demonstrate compliance with confidence in increasingly complex valuation ecosystems.
The Department of Justice’s recent move to eliminate disparate-impact liability under its Title VI regulations has raised understandable questions across housing and credit markets. But for lenders, GSE partners, and valuation providers preparing for the AVM Quality Control Standards, one thing is clear:
The obligations around AVM fairness haven’t gone away.
The interagency AVM rule—effective October 1, 2025—explicitly requires institutions to establish policies, practices, procedures, and control systems to ensure AVMs comply with applicable nondiscrimination laws. That requirement remains fully intact. So do the supervisory expectations of prudential regulators, FHFA, and CFPB around managing fair lending and bias risk in automated systems, whether or not DOJ narrows its enforcement tools under Title VI.
Even with political shifts, the industry continues to operate under:
The Fair Housing Act, where disparate-impact liability is still recognized by the Supreme Court.
ECOA/Reg B fair lending expectations, which continue to incorporate statistical evidence of adverse outcomes.
Growing scrutiny of AI and automated valuation, highlighted by recent GAO recommendations urging clearer guidance on emerging technology risks.
In short: Regulatory pendulums swing—but AVM fairness risk remains.
Institutions still need independent, statistically rigorous testing to understand whether their AVMs or cascades produce unjustified disparities, and to document business justification and alternatives when they arise. That’s where AVMetrics’ fifth-factor validation fits the bill. Our analysis is national, extensive, independent, thorough, examiner-ready and tested for significance.
The Challenge of AVM Testing in Quality Control
Automated Valuation Models (AVMs) have become a cornerstone of the appraisal quality control process, with many lenders and Appraisal Management Companies (AMCs) implementing what is commonly known as the “15% rule.” Under this framework, if an appraised value falls within 15% of the AVM estimate, the appraisal undergoes only a cursory review. Conversely, variances exceeding 15% trigger more thorough scrutiny or require review by another appraiser.
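As a sketch, the routing logic of the 15% rule reduces to a single variance check (the function name and threshold handling here are illustrative, not any particular lender’s implementation):

```python
def review_tier(appraised_value: float, avm_value: float,
                tolerance: float = 0.15) -> str:
    """Route an appraisal for review based on its variance from the AVM estimate."""
    variance = abs(appraised_value - avm_value) / avm_value
    return "cursory review" if variance <= tolerance else "escalated review"

# e.g., a $480,000 appraisal against a $500,000 AVM is a 4% variance:
print(review_tier(480_000, 500_000))  # -> "cursory review"
```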
While this approach offers operational efficiency, it fundamentally misunderstands the distinction between precision and accuracy. This article examines why using appraisals as benchmarks for AVM testing creates a dangerous circular logic and proposes a more robust methodology centered on actual sale transactions.
Understanding Precision Versus Accuracy
Comparing AVM values to appraised values measures consistency—or precision—not accuracy. This distinction is critical yet often overlooked in the industry. While understanding the consistency between different valuation methods has operational value, particularly for ensuring that credit decisions are based on reliable estimates regardless of methodology, it serves a fundamentally different purpose than accuracy testing.
True accuracy in property valuation refers to how closely an estimate reflects the actual market value at a specific point in time. Market value, regardless of the valuation method employed, is universally defined as the most probable price negotiated between a willing buyer and seller in an arm’s length transaction. Therefore, the only meaningful measure of accuracy is the comparison to the actual negotiated sale price—the endpoint of the price negotiation process.
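The distinction can be made explicit in code: the same error statistic means two different things depending on the benchmark it is computed against. A minimal sketch using median absolute percentage error (function and variable names are illustrative):

```python
import numpy as np

def median_abs_pct_error(estimates, benchmark):
    """Median absolute percentage deviation of estimates from a benchmark."""
    e = np.asarray(estimates, float)
    b = np.asarray(benchmark, float)
    return float(np.median(np.abs(e / b - 1.0)))

# Accuracy: AVM versus the negotiated sale price (the market's endpoint).
#   accuracy = median_abs_pct_error(avm_values, sale_prices)
# Precision/consistency: AVM versus appraised values, i.e., agreement
# between two estimates, neither of which is ground truth.
#   consistency = median_abs_pct_error(avm_values, appraised_values)
```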
The Circular Logic Problem
The practice of using appraisals as benchmarks for AVM testing raises a fundamental question: Are we testing whether the model provides an accurate value consistent with actual sale prices, or merely a precise value consistent with appraisal estimates? This distinction has significant implications.
If AVMs tested against appraisals are mistakenly labeled as “accurate” rather than simply “consistent with appraisal estimates,” the quality control process becomes streamlined in a potentially dangerous way. More appraisals pass through without complications or delays from further review, creating a self-reinforcing cycle. This circular process—using AVMs in quality control to validate appraisals while simultaneously using appraisals as benchmarks to test AVMs—is fraught with risk and undermines the integrity of both valuation methods.
The Critical Role of Blind Testing
Determining the accuracy of any valuation estimate requires blind testing—the property’s sale price must be unknown when the estimate is produced. While implementing blind testing for appraisals presents significant challenges due to USPAP standards requiring appraisers to review and consider the terms of sale contracts, AVMs can and should be tested blindly with relative ease.
Most AVM vendors already conduct blind testing routinely as part of their quality assurance programs. This process allows modelers to understand model accuracy before a sale price becomes known—a critical capability for calibrating models and improving their predictive accuracy.
The Importance of MLS Data Suppression
The blind testing process must extend to MLS data, particularly listing prices. While listing prices occasionally match final sale prices, they more typically represent the starting point of buyer-seller negotiations. However, the listing price itself is not arbitrary—considerable analysis has already been completed to determine it.
Real estate professionals conduct thorough studies of the property, neighborhood, and relevant market activity—many of the same activities required for property valuation—before establishing a listing price. When AVMs have access to this information, they gain a significant advantage that may not reflect their true predictive capabilities. Therefore, proper testing must ensure that listing prices remain unknown when estimates are produced, maintaining the integrity of the blind testing process.
MLS data suppression also aligns with real-world use cases. In prominent AVM applications such as mortgage refinancing or home equity loans, the properties serving as collateral are rarely listed on MLS. Most lenders actually prohibit refinancing properties that are currently or recently listed. Testing AVMs without MLS data therefore provides a more realistic assessment of model performance in actual lending scenarios.
Regulatory Guidance and Compliance
The breadth and scope of AVM testing—including methodology selection and resource allocation—must be tailored to each institution’s unique circumstances. Regulatory guidance provides flexibility for organizations to demonstrate compliance within the context of their specific risk assessments.
While no explicit prohibition exists against using appraisal values as benchmarks, the Interagency Appraisal and Evaluation Guidance offers clear direction in Appendix B: “To ensure unbiased test results, an institution should compare the results of an AVM to actual sales data in a specified trade area or market prior to the information being available to the model.” This affirmative statement strongly supports the use of sales transactions as the foundation of AVM testing methodology.
Conclusion: Building a Robust Testing Framework
The distinction between measuring precision and accuracy in AVM testing is not merely academic—it has practical implications for lending decisions, risk management, and regulatory compliance. While comparing AVMs to appraisals may offer insights into consistency between valuation methods, it cannot and should not be mistaken for accuracy testing.
A robust AVM testing methodology must incorporate the following core principles:
• Use actual sale transactions as the primary benchmark for accuracy testing
• Implement truly blind testing protocols that exclude sale prices during estimate generation
• Suppress MLS listing price data for the benchmark property only, to ensure unbiased testing; all other MLS-derived information—such as property characteristics, comparable sales, and broader market metrics—remains fully available
• Recognize the different purposes served by precision and accuracy measurements
As the regulatory guidance makes clear, institutions have the flexibility to design testing programs that suit their specific needs and risk profiles. However, this flexibility should not obscure the fundamental requirement: AVM testing must measure actual predictive accuracy against real market outcomes, not merely consistency with other valuation estimates. Only through rigorous, unbiased testing against actual sale prices can institutions truly understand and rely upon the accuracy of their automated valuation models.
The Automated Valuation Model (AVM) industry is entering a critical phase—one where regulatory oversight is increasing, use cases are expanding, and performance analysis is under sharper scrutiny. In this environment, testing methodologies must evolve to ensure transparency, fairness, and real-world relevance. A recent whitepaper from Veros Real Estate Solutions, “Optimizing AVM Testing Methodologies,” advances flawed logic that risks reversing progress in predictive model testing and validation.
This op-ed offers an affirmation of the core tenets of what is becoming the industry-standard testing framework: a data-driven testing methodology grounded in sound and prudent validation principles. While Veros challenges this approach, the broader AVM ecosystem—including regulators, lenders, and nearly all major AVM providers—has embraced a process that prioritizes objective, real-world performance measurement over now-antiquated methods that allow data leakage into the Automated Valuation Model.
The Listing Price Issue
The whitepaper in question should be understood as a salvo in the battle over listing prices and their influence on AVMs. Several industry participants have published analyses showing that when AVMs incorporate listing price data into their models, they perform much better in tests, but those test results are likely to be a poor reflection of real-world model performance. This is because in most use cases for AVMs there is no listing price available – think of refinance and HELOC transactions, portfolio risk analyses, or marketing (see AVMetrics’ August 29, 2024 whitepaper[1] or the AEI Housing Center’s Study of AVM Providers[2]).
Specific Issues in Veros’ “Optimizing…”
Below are seven points made in the aforementioned paper that don’t stand up to scrutiny. Let’s break them down one at a time.
Mischaracterization of the Listing Price Concern
Whitepaper Claim: “Knowing the list price doesn’t necessarily equate to knowing the final sale price.” The paper not only puts forward the strawman that critics claim listing prices are equal to sale prices; it also, rather awkwardly, asserts that listing prices are not very useful to AVMs.
Response: This argument overlooks a key behavioral phenomenon: anchoring. When listing prices are published, they tend to pull sale prices and valuations toward the listed figure[3]. When listing prices become available to AVMs during testing, model outputs shift—often sharply—toward those prices. Look no further than one of the most prominent AVMs on the market, Zillow. The company is unusually transparent: it publishes its accuracy statistics monthly, measured both with and without listing data available, because the results are strikingly different. As of August 2025, Zillow’s self-reported median error rate is 1.8% when listing prices are available and 7.0% when they are not.[4]
AEI noted this phenomenon in its recent analysis of multiple AVMs from 2024, “Results on the AEI Housing Center’s Evaluation of AVM Providers[5].” AEI referred to it as “springiness” because graphs of price estimates “spring” to the listing price when that data becomes available. The result is inflated performance metrics that don’t reflect true, unassisted predictive ability. Finally, this issue has been empirically documented in AVMetrics’ internal studies and external publications.
When AVMs are tested with access to listing prices, vendors can tune their models to excel under known test conditions rather than perform reliably across real-world scenarios. This undermines model governance, especially for regulated entities, and conflicts with both OCC and IAEG guidance emphasizing model transparency, durability, and independence.
The solution being adopted as the emerging standard is simple but powerful: only use valuations generated before the listing price becomes known. This ensures unanchored estimates and mirrors real-world scenarios where listing prices are unavailable—a more accurate reflection of likely outcomes for use cases such as refinance, home equity, and portfolio surveillance.
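In code, the core of that standard is a single predicate over the valuation archive (a sketch with assumed date fields; a production implementation involves much more, such as matching, deduplication, and sampling):

```python
from datetime import date

def eligible_for_unanchored_test(value_date: date, listing_date: date) -> bool:
    """Keep only AVM estimates generated before the property was listed,
    so the model could not have seen, or been anchored to, the list price."""
    return value_date < listing_date

# A value from 2025-02-20 qualifies against a 2025-03-01 listing:
assert eligible_for_unanchored_test(date(2025, 2, 20), date(2025, 3, 1))
```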
Refinance Testing and the Fallacy of Appraisal Benchmarks
Whitepaper Claim: “Appraised values are the best (and often only) choice of benchmarks in this lending space currently as they are the default valuation approach used to make these lending decisions.”
Response: Appraisals are opinion-based and highly variable. In fact, Veros’ own whitepaper acknowledges that appraisals exhibit high variance, a concession that undermines their validity as testing benchmarks. Appraisal opinions are not standardized enough to provide consistent benchmarks for measuring AVM accuracy.
Regulatory guidance also emphasizes the superiority of transactions over appraisals for AVM testing. Appendix B of the Interagency Appraisal and Evaluation Guidance, December 2010[7], still the most current guidance on AVM testing, specifically states, “To ensure unbiased test results, an institution should compare the results of an AVM to actual sales data in a specified trade area or market prior to the information being available to the model.”
Mischaracterization of Pre-Listing Valuations as “Outdated”
Whitepaper Claim: The whitepaper asserts that validation results using pre-listing AVM values are artificially low, arguing that these values are outdated and fail to reflect current market conditions. While Veros stops short of using the phrase “outdated and unfair,” that is the unmistakable thrust of the argument: that pre-listing AVM estimates do not reflect real-world usage and disadvantage high-performing models. In the webinar discussion of the whitepaper, Veros repeatedly suggested that “Pre-MLS” testing might use AVM value estimates that were 9 months old.
Response: This claim is both overstated and analytically misleading.
PTM testing never uses values that are 9 months old, and industry participants know that, because they are familiar with the methodology and AVMetrics’ paper describing it[8]. In reality, almost all AVM values used in PTM testing were created mere weeks, or a month or two, before the relevant date, which is the contract date. The Veros paper exploits confusion over the different dates in a real estate transaction to muddy the waters. The timeline below shows why the “median DOM” referred to in the paper, and commonly published in business articles, is not representative.
The typical transaction takes 50 days from listing to completion, but typically only 19 days from listing to contract. (April 2025)
In the real estate industry, Days on Market (DOM) is often defined as the number of days from Listing Date to Closing/Recording Date. Sources like FRED and Realtor.com report median DOM this way, which for April 2025 was about 50 days.
However, for valuation relevance, the more important measure is the time from Listing Date to Contract/Pending Sale Date—the point when the actual price agreement is made. This is typically much shorter—our April 2025 Zillow data show a median of 19 days nationally.
This matters because AVM predictions made just before the listing date are often only weeks ahead of the market decision point, not months. By contrast, the “closing” date used in some public stats is just a paperwork formality that lags well behind the actual market valuation event.
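A small worked example, using dates chosen to match the April 2025 medians cited above, shows how far apart the two definitions sit:

```python
from datetime import date

listing  = date(2025, 4, 1)
contract = date(2025, 4, 20)  # price agreed ~19 days after listing
closing  = date(2025, 5, 21)  # paperwork completes ~50 days after listing

dom_reported  = (closing - listing).days   # 50: what FRED/Realtor.com publish
dom_valuation = (contract - listing).days  # 19: the actual market decision point
```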
Furthermore, residential real estate markets do not shift dramatically week to week. The suggestion that valuations generated days or a few weeks prior to the listing date are best characterized as outdated misunderstands the pace of market change and misrepresents the data.
Using pre-listing AVM values does not disadvantage models, nor are those values meaningfully outdated. On the contrary, PTM removes a long-standing bias—early access to listing prices—and holds all AVMs to the same fair standard. The result is a more objective, transparent, and predictive test that rewards modeling performance rather than data timing advantage.
Key Points:
Veros’ “9 months” claim is unrealistic—typical contract timing is closer to 2–4 weeks after listing.
Residential markets move slowly: 1–2% change over several months, often less.
Any slight “age” in pre-listing AVM estimates is minimal, consistent across all models, and far outweighed by the benefit of removing listing price bias.
When tested properly, AVMs show robust performance even when limited to pre-listing data, proving that predictive strength—not access to post-listing artifacts—is the proper basis for fair evaluation.
The Flawed Analogy to Appraisers
Whitepaper Claim (Paraphrased): Veros argues that AVMs should be allowed to use listing data in testing because appraisers do. The whitepaper pleads for AVMs to be allowed to operate like appraisers with access to listing data in order to compete with appraisers on a level playing field.
Response: This argument confuses different points. First, appraisers and AVMs are not equals competing on a level playing field; they are different processes for estimating market value. Appraisers must develop and report their value opinions under the Uniform Standards of Professional Appraisal Practice. No comparable standards exist for AVMs. Perhaps to compensate for that lack of standards on the production side, model estimates are tested on the back end to evaluate accuracy and meet regulatory expectations. Appraisers aren’t subjected to the rigorous testing that AVMs go through, though appraisal users typically have review processes in place at both the transactional and portfolio levels.
Second, there are several different “uses of appraisal data” being conflated in this claim. AVMs are able to use many different types of data from listings in their models without objection. They often ingest pictures and text descriptions and they’ve developed very sophisticated AI techniques to tease out information from those descriptions.
But there is one specific issue under debate: the use of listing price information when AVMs are being tested. Users of AVMs need to understand how accurate a model will be when listing data is not available, as it is not available in most AVM applications (e.g., refinances, HELOCs, portfolio valuation and risk assessment). For testing to be most applicable to those situations and uses, AVM testing must be done on value estimates not “anchored” to listing prices.
AVMs are evaluated by statistical comparison to a benchmark. Injecting listing prices into the models contaminates the experiment, especially when that price closely tracks the final sale. Appraisers aren’t ranked side by side using controlled benchmarks. That difference is why AVMs should not be tested with access to listing prices, but they certainly should be able to use listing data.
False Equivalency with Assessed Values
Whitepaper Claim: “If we eliminate the use of MLS list prices, should we also argue for excluding other potentially useful data, such as that from a county property tax assessor?” The paper claims that other estimates of value available in the marketplace are not excluded by PTM testing, so it asks why listing prices should be singled out for exclusion.
Response: This argument is a strawman set up to be knocked down easily. Assessed values are stale and generally unrelated to current market value. They also tend to cover every property, meaning they don’t single out the small percentage of properties that will be used as benchmarks in a way that would invalidate accuracy testing. Most importantly, they do not create the anchoring distortion that listing prices do. For these reasons, no one has suggested excluding assessor values; it wouldn’t make sense. Later in the whitepaper, Veros answers its own rhetorical question by saying that it is “absurd” to consider eliminating access to assessor data. We wholeheartedly agree. It was, in fact, absurd to even suggest it.
Alternative Proposal: Measure Anchoring
Whitepaper Suggestion: The paper proposes using statistical techniques to measure how much each AVM adjusts in response to listing prices.
Response: This suggestion is interesting for exploratory research, but it is not a viable alternative. It fails to address the basic question: how well does this model predict value when no listing price is available? The Predictive Testing Methodology (PTM) answers that question in a scalable, repeatable, and unbiased way. Simply calculating how much an AVM responds to listing prices does not accomplish that goal.
The Flaws of “Loan Application Testing”
Whitepaper Proposal: Veros suggests a new AVM testing approach based on pulling values at the time of loan application—arguing that this better reflects how AVMs are used in production, especially in purchase and refinance transactions.
Response: While this may sound pragmatic, in practice, “loan application testing” is deeply flawed as a validation methodology. It introduces bias, undermines statistical validity, and fails to meet regulatory expectations for model risk governance. Here’s why:
Not Anchoring-Proof
If an AVM runs after the property is listed (as many do at loan application), it may already have ingested the list price or be influenced by it. This reintroduces anchoring bias—precisely what PTM is designed to eliminate.
Biased Sample and Survivorship Distortion
Loan applications represent a non-random, self-selecting subset of properties. They exclude properties for which there is no loan application (about 1/3 of all sales are for cash and don’t involve a loan) as well as those that are quickly denied, withdrawn, or canceled. This sampling would severely bias testing.
Inappropriate Appraisal Benchmarks
The mix of AVM testing benchmarks would vacillate between appraisals for refinance loan applications and sales for purchase applications. Depending on market conditions, refinance applications can make up 80+% of loan originations, which would mean that the vast majority of AVM testing would be based on appraisals, which are subjective and inappropriate as a benchmark.
Non-Standardized Collection & Timing
There is no consistent, auditable national timestamp for “application date” across lenders. This creates operational inconsistency, poor reproducibility, and potential for cherry-picking.
Veros’ proposal is not a viable alternative to PTM. It lacks the rigor, scalability, and objectivity that predictive testing delivers—and it would fall short of the new federal Quality Control Standards requiring random sampling, conflict-free execution, and protections against data manipulation.
About the Author and the Need for Independent Testing
It is also important to acknowledge that the Veros whitepaper was authored by a model vendor—evaluating methodologies that directly affect its own model’s competitive standing. This is not an independent or objective critique. Veros is an active participant in the AVM space with commercial interests tied to model performance rankings. By contrast, Predictive Testing Methodology (PTM) is conducted by an independent third party, is openly adopted by nearly all major AVM vendors, and has become a trusted standard among lenders seeking impartial performance assessment.
Conclusion: Clarity Over Convenience
At its core, AVM testing is about one thing: accurately establishing an expectation of a model’s ability to predict the most probable sale price of a property. To achieve this, we must rely on objective benchmarks, control for data contamination, and apply consistent standards across models.
The Predictive Testing Methodology (PTM)—already adopted by nearly all major AVM providers—meets these criteria. It has been embraced by lenders and validated through years of use and peer-reviewed research. Anchored in OCC 2011-12 model validation guidance, IAEG principles, and the newly codified 2024 Final Rule on AVM Quality Control Standards, PTM ensures that AVMs are tested as they are used—in real-world, data-constrained conditions. These new federal standards require AVM quality control programs to:
Protect against data manipulation, such as anchoring to listing prices;
Avoid conflicts of interest, emphasizing the importance of independent testing providers;
Conduct random sample testing and reviews, ruling out cherry-picked case studies or selectively favorable data;
And comply with fair lending laws, requiring AVM frameworks to be broadly equitable and empirically validated.
Veros’ whitepaper makes the case for a less rigorous framework. But flimsy frameworks serve vendors, not users, and especially not regulated users. They inflate performance, mask limitations, and misguide deployment. The industry would do well to resist this regression, as such approaches would fall short of the standards now required by law.
The industry should reaffirm its commitment to testing that is transparent, unbiased, and fit for purpose. That is how to build AVM systems worthy of trust and to meet both the expectations of regulators and the needs of a fair, stable housing finance system.
AVMetrics is an independent AVM testing firm specializing in performance analytics, regulatory compliance, and model risk management.
Testing AVMs is a complex challenge, and when faced with difficult problems, creative solutions often arise. One such solution that has gained attention is using appraised values as benchmarks for AVM testing. The reasoning seems straightforward: both an AVM and an appraisal aim to estimate the market value of a property. So, in cases where a true market value from an arm’s-length transaction isn’t available, appraised values may seem like the next best option.
Appraisals are frequently conducted for refinances, home equity loans, and other non-sales transactions, meaning many properties that don’t transact on the open market still have appraised values available. This availability might seem advantageous, and one might reason that if an AVM—designed to be quicker and cheaper—can match these appraisal estimates, it would be sufficient for testing.
On the surface, this reasoning seems sound: why not test AVMs against appraisals if they’re both providing value estimates? However, this approach is based on flawed assumptions, and the problems with using appraised values as benchmarks for AVM testing begin at the fundamental level.
Appraised Values as Benchmarks
Inconsistency: Appraisals are conducted using structured processes with standardized guidelines, but the significant variability in the types of appraisals (e.g., desktop, drive-by, hybrid, interior inspections) and their purposes (e.g., non-mortgage lending, relocation, litigation, HELOC, refinancing) undermines consistency. This inconsistency makes appraised values less reliable as they may not fully reflect “Market Value,” potentially violating the uniformity required in AVM benchmark testing.
Appraisal Bias Study (quantifiable, documented, systemic biases): Evidence shows that appraised values do not exhibit a normal distribution around actual sales prices, suggesting inherent bias*. In particular, appraisers often adjust values upwards to align with expected sales prices to avoid friction with clients. This artificial inflation undermines the accuracy of appraised values as benchmarks, illustrating that they may lead to inaccurate AVM testing.
Appraisal-Derived Data: Appraisal data has inherent limitations due to its sourcing methods, geographic constraints, and the relatively small number of observations. Additionally, it often focuses on specific assignment types and relies on limited-scope appraisals—such as desktop, drive-by, or hybrid data collection—which may not accurately capture the true market value of a property. This variability, coupled with the subjective nature of appraisals, makes them unreliable and impractical for AVM testing. Their limited availability by source and geography and potential for imprecise estimates further reduce their effectiveness as a comprehensive benchmark.
Subjectivity (individual judgment and professional biases): Despite adherence to guidelines, appraisals are subject to the judgment of individual appraisers, introducing bias**. Research supports that appraisals, particularly in certain residential contexts, are performed not to independently assess market value, but to justify predetermined loan amounts. This subjectivity compromises their reliability as benchmarks for AVM testing.
Measuring Error: Using appraised values introduces compounded errors. Testing an AVM against an estimate (appraised values) rather than the true standard (actual market transactions) results in a conflation of two errors—those of the AVM and those inherent in the appraisal. This distorts the AVM’s measured accuracy, as any bias or error in the appraisal will be reflected in the overall AVM testing results.
Lag in Market Response: Appraisals are often backward-looking, based on historical closed sales, and thus do not capture rapid market fluctuations. This is particularly problematic in volatile markets where real estate prices can change quickly. The lag in appraisals reflecting current market conditions can further distort AVM testing when outdated data is used.
Arm’s-Length Transactions as Benchmarks
Reflects True Market Value: Arm’s-length transactions occur between unrelated, independent parties, ensuring that both buyer and seller act in their best interests. As a result, these transactions accurately reflect the true market value of the property, making them the most reliable benchmarks for AVM testing.
Up-to-Date Information: Arm’s-length sales represent current market conditions, providing real-time data that ensures AVMs are tested against accurate, relevant information. This minimizes the risk of outdated or inaccurate benchmarks, improving the reliability of the test results.
Data Availability: Arm’s-length transactions are abundant, with approximately 4-6 million sales recorded annually across the U.S. This wealth of comparable sales data enables AVM testing in a variety of markets, including less active markets. The breadth of this data makes it a more practical and reliable option for establishing benchmarks.
Conclusion:
The use of appraised values as benchmarks for AVM testing is inherently flawed due to inconsistency, subjectivity, and the lag in reflecting current market trends. These limitations introduce bias, compounded error, and misrepresentation of accuracy in the testing process. This approach can lead to skewed and unreliable results, potentially violating new AVM quality control standards that emphasize the need for market-reflective testing.
In contrast, arm’s-length transactions provide the most reliable benchmarks, as they reflect true market values at the time of sale, accounting for factors like property exposure and typical days on market. To ensure objectivity and compliance with quality standards, AVM testing must prioritize arm’s-length transactions and ensure that the AVMs are blind to the most recent sale and listing prices***. This safeguards the integrity of the results by eliminating any recent influence of pre-existing market values or pricing.
Ultimately, appraised values should be avoided as benchmarks wherever possible, recognizing the inherent methodological limitations and risks to accuracy in testing results. For rigorous AVM validation, the reliance on arm’s-length transactions ensures more reliable, market-aligned outcomes.
*Systemic Risks in Residential Property Valuations: Perceptions and Reality, CATC, June 2005, p. 13, “Full Appraisal Bias – Purchase Transactions.”
**Other research corroborates the notion that certain residential appraisals are performed NOT to establish an independent and objective market value estimate, but to justify a loan amount. (Ferguson, J.T., After-Sale Evaluations: Appraisals or Justifications? Journal of Real Estate Research, 1988, 3, 19-26).
*** Current AVM Testing Methodologies versus our new Predictive Testing Methodology (PTM™) OR How Listing Prices Made Current AVM Testing Obsolete and How to Fix It (AVMNews September 2024)
Every quarter we analyze all the top independently tested AVMs and compile the results. Click on this GIF to see the top AVM in each county for each quarter. As you watch the quarters change, you can see that the colors representing the top honors change frequently.
The main point is how frequently AVM performance changes. That should be no surprise, since market conditions change, and AVMs have different strengths and tendencies. In Q4, AVMetrics independently tested 25 models; however, the GIF only highlights the 14 models that ranked in the top position of the MPTs. At least 11 AVMs shouldn’t be anyone’s first choice ANYWHERE, but they still have customers, presumably customers who don’t know the real performance of their AVMs. AVM vendors and resellers are not independent referees.
Independent testing is the only way to know how AVMs perform. Every model is constantly being improved as builders add new data feeds and use new techniques to get better results (with respect to new techniques, over at the AVMNews, we curate articles about AVMs, and we highlight several dozen new research articles about AVMs every year).
Q4 Change Highlights: Quarterly Trends Coast to Coast
As ever, if you watch any part of the map, you’ll see several changes. The Q4 dynamics highlighted how hard it is to achieve sustained stability at higher interest rate levels: even as markets worked toward equilibrium, the financial landscape remained fluid, and just when things seemed settled, new variables came into play. Here are some places to watch:
In the Golden State, one of the highest-priority counties in the state has a new king: Los Angeles. Ventura, Orange, and San Diego counties also changed hands (just to name a few).
Several less-populated states had almost wholesale changes, such as the Dakotas, Alaska, Wyoming, Nebraska, Kansas, and New Mexico.
In the Sunshine State, models changed hands in several counties, including Palm Beach, Lee, Volusia, Flagler, and St. Johns.
Takeaways
Things change—a lot. Don’t rely on results from last year, earlier this year, or even last quarter! Markets evolve quickly, and often, 3 months of data are required to gather a large enough sample in smaller regions. But we can slice and analyze data in many different ways to get a clearer picture.
Use more than one AVM. It’s not always obvious from a map showing just one AVM per county, but if you consider what goes into producing those results, you’ll see that AVMs have different strengths. There are many competing to climb to the top of the rankings, so when valuing a property, you can’t be sure which AVM will be the best fit. When an AVM produces a result with low confidence, there’s a very good chance another AVM will provide a more reasonable estimate.
Use the right AVM for each use case and keep testing. Things change quickly and frequently, so staying adaptable and testing different models will help ensure you’re getting the best results.
Every quarter we analyze all the top independently tested AVMs and compile the results. Click on this GIF to see the top AVM in each county for each quarter. As you watch the quarters change, you can see that the colors representing the top honors change frequently.
The main point is how frequently AVM performance changes. That should be no surprise, since market conditions change, and AVMs have different strengths and tendencies. In Q3, AVMetrics independently tested 24 models; however, the GIF only highlights the 13 models that ranked in the top position of the MPTs. At least 11 AVMs shouldn’t be anyone’s first choice ANYWHERE, but they still have customers, presumably customers who don’t know the real performance of their AVMs. AVM vendors and resellers are not independent referees.
Independent testing is the only way to know how AVMs perform. This past quarter we saw several models retire while whole new models were introduced. Every model is constantly being improved as builders add new data feeds and use new techniques to get better results (with respect to new techniques, over at the AVMNews, we curate articles about AVMs, and we highlight several dozen new research articles about AVMs every year).
Q3 Change Highlights: Quarterly Trends Coast to Coast
As ever, if you watch a part of the map, you’ll see several changes. But, in Q3, as markets stabilized at higher interest rate levels, we saw a changing of the guard. Here are some places to watch:
In the Golden State, one of the highest-priority counties in the state has a new king: Los Angeles. Inyo, Imperial, Kings, and Tulare counties also changed hands (just to name a few).
Several less-populated states had almost wholesale changes, such as the Dakotas, Alaska, Montana, Wyoming, Nebraska, Oklahoma, and Kansas.
In the Sunshine State, models were able to value several smaller counties that were not captured in Q2, including Collier, Columbia, Okeechobee, Suwannee, and Union.
Takeaways
Things change – a lot. Don’t rely on results from last year or earlier this year. Heck, you can’t even trust last quarter! Often, 3 months of data are required to get a large enough sample in smaller regions, but we can slice the data every way imaginable.
Use more than one AVM. It’s not obvious from a map showing just one AVM per county, but if you think about what goes into producing these results, you’ll realize that AVMs have different strengths, and there are a lot of them climbing all over each other to get to the top of the rankings. So when you’re valuing a particular property, you just don’t know whether it will be a good candidate for even the best AVM. When that AVM produces a result with low confidence, there’s a very good chance another AVM will produce a reasonable estimate.
Use the right AVM for each use case and keep testing, because things change a lot and often.
1. First American has specifically enhanced its AVM, its testing, and some of its tools in anticipation of the new rules. For example, FA has invested in explainable AI (xAI) in order to address fairness concerns.
Newer AVMs, like our Procision™ AVM Suite, were designed to comply with current AVM guidelines and in anticipation of the new guidelines.
2. First American expects AVM users to take on their own testing responsibility, and this doesn’t just apply to banks.
…the new guidelines, Quality Control Standards for Automated Valuation Models, require mortgage originators and secondary market issuers “to maintain policies, practices, procedures, and control systems to ensure that automated valuation models used in these transactions adhere to quality control standards.”
3. AEI’s recent AVM study has drawn attention to the biggest issues with AVM testing, and our new testing techniques are advancing testing beyond any other innovation in a decade.
For several years, AVMetrics has been developing a blind testing system that it will roll out later this year. Rather than sending the same addresses to various providers each month and getting back their valuations, AVM providers will now value every property in the U.S. — more than 100 million valuations each month — and send this data to AVMetrics. The testing company will ingest this data and then blind test it against future sales and listing prices as they transact. As you would expect, this is a massive undertaking for AVM vendors and AVMetrics, but it will separate the AVMs that test well from those that actually perform well in real-world conditions.
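Under assumed schemas (all column names here are hypothetical), the matching step such a system implies might look like the sketch below: archived valuations are joined to later transactions and scored only where the value predates the listing.

```python
import pandas as pd

def blind_match(valuations: pd.DataFrame, sales: pd.DataFrame) -> pd.DataFrame:
    """valuations: property_id, value_date, avm_value
       sales:      property_id, listing_date, sale_price"""
    m = valuations.merge(sales, on="property_id")
    m = m[m["value_date"] < m["listing_date"]]       # blind: pre-listing values only
    m = (m.sort_values("value_date")
          .groupby(["property_id", "listing_date"])
          .tail(1))                                  # most recent qualifying value
    m["pct_error"] = m["avm_value"] / m["sale_price"] - 1.0
    return m
```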
Wierks’ conclusions are right on target with our belief that improving AVM accuracy, precision, and confidence scoring is making AVMs more useful to the industry, and that appropriate testing is a prerequisite to their widespread adoption.
Every quarter we analyze all the top AVMs and compile the results. Click on this GIF to see the top AVM in each county for each quarter. As you watch the quarters change, you can see that the colors representing the top honors change frequently.
The main point is how frequently AVM performance changes. That should be no surprise, since market conditions change and AVMs have different strengths and tendencies. Phoenix has more tract housing, and some AVMs are optimized for that. Cities in the Northeast have more row housing, and some models are better there. But AVMs also change – a lot. Whole new models are introduced, but every model is constantly being improved as builders add new data feeds and use new techniques to get better results (with respect to new techniques, over at the AVMNews, we curate articles about AVMs, and we highlight several hundred new research articles about AVMs every year).
Q2 Change Highlights
As ever, if you watch a part of the map, you’ll see several changes. In Q2, we saw a changing of the guard. Here are some places to watch:
In Texas, most counties changed leadership: the counties that include Austin and its suburbs; not Dallas itself, but most of the counties around Dallas; and not Houston (Harris County), but most of the counties around it.
Much of Alaska and the West Coast changed leadership.
Some less-populated areas had almost wholesale changes, such as Colorado, Nevada, New Mexico, the Dakotas, rural Michigan, Illinois, Missouri, Arkansas, Louisiana, Mississippi, and more.
Takeaways
Things change – a lot. Don’t rely on the results from last year. Heck, you can’t even trust last quarter! We compile these results quarterly, but our testing is non-stop, and we can produce new optimizations monthly based on a rolling 3 months or any other time period. Often, 3 months of data are required to get a large enough sample in smaller regions, but we can slice the data every way imaginable.
Use more than one AVM. It’s not obvious from a map showing just one AVM in each county, but if you think about what’s going on to produce these results, you’ll realize that AVMs are climbing all over each other to get to the top of the ranking. So, when you’re valuing a particular property, you just don’t know if it will be a good candidate for even the best AVM. When that AVM produces a result with low confidence, there’s a very good chance that another AVM will produce a reasonable estimate. Why not be able to take three, four or five bites at the apple?