Tag: AVMs

MISMO’s Step in the Right Direction, The Common Confidence Score, Is Just One Step

To appreciate what MISMO’s new AVM Common Confidence Score Standard accomplishes, it helps to know where confidence scores have been.

First-generation confidence score schemas were all over the map: 1–10, 1–100, A–B–C–D, High–Medium–Low. In 2005, Doug Gordon, PhD, then chief modeler at Freddie Mac, published a paper called Metrics Matter[i] in an attempt to explain and defend Freddie Mac’s High–Medium–Low schema. That paper ignited a broader industry effort to create a second generation: getting every AVM to calculate and publish a Forecast Standard Deviation (FSD) alongside its value estimate, providing a mathematically grounded and theoretically consistent measure of uncertainty. It was a genuine improvement. Unfortunately, independent testing revealed that not all models were calibrating FSD correctly — their projections did not match actual outcomes.
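What made FSD "mathematically grounded" is that, under the usual assumption of normally distributed valuation errors (our simplification for illustration, not any vendor's published spec), an FSD converts directly into a probability of landing within plus or minus ten percent of market value. A quick sketch:

```python
from math import erf, sqrt

def prob_within_10pct(fsd: float) -> float:
    """Probability that |valuation error| <= 10%, assuming errors are
    distributed Normal(0, FSD).

    P(|e| <= 0.10) = 2 * Phi(0.10 / FSD) - 1, where Phi is the
    standard normal CDF (computed here via math.erf).
    """
    z = 0.10 / fsd
    phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return 2.0 * phi - 1.0

# A model reporting FSD = 0.13 implies roughly a 56% chance of
# landing within +/-10% of market value; a tighter FSD implies more.
print(round(prob_within_10pct(0.13), 3))
```

This is also why miscalibration matters: if a model's published FSDs are too small, every probability derived from them overstates the model's reliability.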

The AVM Quality Control Standards[ii] rule renewed attention on the problem[iii]. End users still couldn’t reliably compare one model’s confidence output to another’s. That frustration, combined with regulatory momentum, drove the third-generation effort: a truly common confidence score schema with the potential for industry-wide adoption. MISMO’s Common Confidence Score[iv] is that effort. Whether it becomes the enduring standard will depend on what happens next.

“Well Calibrated”

For the first time, there is a shared definition of what a confidence score should mean: the estimated probability that a valuation falls within plus or minus ten percent of the property’s actual market value. Until now, three AVM providers could report confidence scores of “92,” “B,” and “High” for the same property with no way to compare them or hold any of them to a consistent standard.[v] That clarity is genuinely valuable.

But defining what a confidence score should mean is not the same as knowing whether any particular score actually delivers on that definition. The MISMO guidance makes this plain: “for a Common Confidence Score model which is well calibrated, 85% of the AVM values with a Common Confidence Score of .85 will be within 10% of the market value.”

That leaves just one question: how do we know if a model is well calibrated? That is where the real work begins.

Calibration is distinct from accuracy. A model can produce reasonably accurate values overall while still assigning confidence scores that systematically overstate or understate the probability of being within the expected range. An AVM provider assigning a Common Confidence Score of 0.85 is making a specific, testable claim: that across all valuations carrying that score, roughly 85% should fall within ten percent of true market value. Whether that claim holds is not a matter of definition. It is a matter of evidence — evidence that requires testing.

The MISMO standard acknowledges this directly, noting that the Common Confidence Score “should be subject to regular testing to verify alignment with the above definition.” The standard identifies the need for testing. It does not supply it.
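The testing the standard calls for is conceptually straightforward: bucket valuations by reported score and compare each bucket's claimed probability against its observed hit rate. A minimal sketch, assuming a simple (avm_value, sale_price, score) record layout and 5-point score bands of our own choosing:

```python
from collections import defaultdict

def calibration_table(records, band_width=5):
    """Compare claimed vs. observed hit rates by confidence band.

    records: iterable of (avm_value, sale_price, score) tuples, where
    score is the claimed probability (0-1) that the value is within
    +/-10% of market value. Bands are keyed by their lower edge in
    percentage points (e.g. 85 for scores in [0.85, 0.90)).
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for avm_value, sale_price, score in records:
        band = (int(round(score * 100)) // band_width) * band_width
        counts[band] += 1
        if abs(avm_value - sale_price) <= 0.10 * sale_price:
            hits[band] += 1
    return {band: (counts[band], hits[band] / counts[band])
            for band in sorted(counts)}
```

For a well-calibrated model, the observed rate in the 85 band should sit near 0.85; a persistent gap in either direction is a calibration failure, whatever the model's overall accuracy.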

The Scale Problem

Calibration testing is not a simple accuracy check. To verify that a confidence score is well calibrated, you need to observe performance across many thousands of valuations, segmented by score range. You need enough observations in each confidence band to draw statistically meaningful conclusions. Too few data points in the 0.80–0.85 band, for instance, and you cannot determine whether the score is performing as claimed — you are left with approximation rather than verification.[vi]
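The sample-size requirement footnoted here can be made concrete with Cochran's formula, n = z^2 p(1 - p) / e^2. A sketch, using an illustrative 2-point margin of error at 95% confidence (both figures are our assumptions, chosen for illustration):

```python
from math import ceil

def cochran_n(p: float, margin: float, z: float = 1.96) -> int:
    """Minimum sample size to estimate a proportion p within the given
    margin of error, at the confidence level implied by z
    (Cochran, Sampling Techniques)."""
    return ceil(z * z * p * (1.0 - p) / (margin * margin))

# Verifying a 0.85 claim to within +/-2 points at 95% confidence
# takes over 1,200 observations, in that single score band alone,
# and again in every geography tested separately.
print(cochran_n(0.85, 0.02))
```

Multiply that requirement across ten or more score bands, and the volume of outcome data needed becomes clear.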

For most lenders, assembling that volume of test data is genuinely out of reach. Community banks, credit unions, and mid-sized regional lenders may not originate enough loans in a given quarter to support a properly powered calibration test. The only institutions potentially capable of approaching this problem with adequate sample sizes are the largest in the country.

Geography Compounds the Problem

AVM performance is not uniform across geography, and confidence score calibration is no exception. A model that is well calibrated nationally may perform poorly in specific markets — rural counties with low transaction volumes, rapidly appreciating urban submarkets, or areas with heterogeneous housing stock.

A lender operating in those markets needs to know how confidence scores perform there, not just on average. Geographic segmentation requires substantially more data than aggregate testing. Assessing calibration at the county or MSA level requires a dataset large enough to support meaningful analysis in each geography and each score cohort separately — a threshold well beyond the reach of most institutions. A confidence score that performs well nationally but poorly in the counties where a lender does business is not protecting that lender.

Who Verifies the Score?

The MISMO standard represents a voluntary industry commitment to a shared definition. It does not require AVM providers to submit their confidence score models to independent external testing. The verification of calibration is left to the users of those scores — institutions that often lack the data volume and infrastructure to do it rigorously.

This is the structural problem. AVM providers have the ability to test their own confidence scoring, and they certainly do, but they cannot be independent in that assessment. Very few AVM users have the capability to perform that testing themselves. That gap is what independent testing is designed to address.

The AVM Quality Control Standards that took effect in October 2025 require institutions to maintain policies and procedures ensuring a high level of confidence in AVM estimates. Verifying that confidence scores mean what they claim to mean is precisely what that requirement demands.

MISMO has done the work of defining the target. Independently verifying that AVM providers are hitting it — across the full range of markets and conditions on which lenders depend — is the work that follows.

AVMetrics is an independent AVM testing and validation firm serving banks, credit unions, nonbank lenders, and AVM providers. AVMetrics tests AVM performance against actual sale prices across a national dataset of 500,000 to 700,000+ records per quarter, covering 1,700 counties representing 96+% of the U.S. population. Join our community at avmetrics.net to stay up-to-date.

[i] Douglas Gordon, Metrics Matter, The Thomson Corporation and National Mortgage News at 1 (2005).

[ii] Quality Control Standards for Automated Valuation Models, 89 Fed. Reg. 64538 (Aug. 7, 2024), https://www.federalregister.gov/documents/2024/08/07/2024-16197/quality-control-standards-for-automated-valuation-models

[iii] The Appraisal Foundation Industry Advisory Council (IAC) Automated Valuation Model (AVM) Task Force Report, Phase 2: A Report on the Use of AVMs in the Valuation of Residential Real Estate (2023), https://appraisalfoundation.org/pages/resources-b/the-appraisal-foundation-industry-advisory-council-automated-valuation-model-avm-task-force-report-phase-2

[iv] AVM Common Confidence Score Standard & Guidance, Version 1.0 (July 2025), Mortgage Industry Standards Maintenance Organization, Inc. (MISMO).

[v] While AVM providers’ native confidence scores have not been comparable, independent testing methodologies (e.g., AVMetrics’ Predictive Testing Methodology (PTM™)) have provided a consistent, empirically grounded basis for cross-model comparison and validation.

[vi] See William G. Cochran, Sampling Techniques (3rd ed. 1977). Cochran’s formula establishes the minimum sample size required to achieve a desired confidence level and margin of error, underscoring the need for sufficiently large sample sizes within each segmented group.

Best Practices for AVM Testing: Why Sale Prices Matter More Than Appraisal Values

The Challenge of AVM Testing in Quality Control
Automated Valuation Models (AVMs) have become a cornerstone of the appraisal quality control process, with many lenders and Appraisal Management Companies (AMCs) implementing what is commonly known as the “15% rule.” Under this framework, if an appraised value falls within 15% of the AVM estimate, the appraisal undergoes only a cursory review. Conversely, variances exceeding 15% trigger more thorough scrutiny or require review by another appraiser.
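In code, the rule reduces to a single threshold check. A hypothetical sketch (the function name and the choice of the AVM value as the denominator are our assumptions):

```python
def needs_full_review(appraised_value: float, avm_value: float,
                      threshold: float = 0.15) -> bool:
    """Sketch of the '15% rule': flag an appraisal for thorough review
    when it diverges from the AVM estimate by more than the threshold
    (variance measured against the AVM value)."""
    variance = abs(appraised_value - avm_value) / avm_value
    return variance > threshold

# A $460k appraisal against a $400k AVM is 15% apart: cursory review.
# A $480k appraisal against a $400k AVM is 20% apart: flagged.
```

The simplicity of the check is part of its appeal, and, as argued below, part of its danger: everything hinges on what the AVM value actually represents.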

While this approach offers operational efficiency, it fundamentally misunderstands the distinction between precision and accuracy. This article examines why using appraisals as benchmarks for AVM testing creates a dangerous circular logic and proposes a more robust methodology centered on actual sale transactions.

Understanding Precision Versus Accuracy
Comparing AVM values to appraised values measures consistency—or precision—not accuracy. This distinction is critical yet often overlooked in the industry. While understanding the consistency between different valuation methods has operational value, particularly for ensuring that credit decisions are based on reliable estimates regardless of methodology, it serves a fundamentally different purpose than accuracy testing.

True accuracy in property valuation refers to how closely an estimate reflects the actual market value at a specific point in time. Market value, regardless of the valuation method employed, is universally defined as the most probable price negotiated between a willing buyer and seller in an arm’s length transaction. Therefore, the only meaningful measure of accuracy is the comparison to the actual negotiated sale price—the endpoint of the price negotiation process.
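Measured against sale prices, accuracy reduces to the distribution of valuation errors. A minimal sketch of the common summary statistics (the record layout is our assumption; PPE10, the share of values within 10% of sale price, is a standard AVM testing metric):

```python
from statistics import median

def accuracy_metrics(pairs):
    """Summarize AVM accuracy against actual sale prices.

    pairs: iterable of (avm_value, sale_price) for arm's-length sales
    whose prices were unknown to the model at estimation time.
    """
    errors = [(avm - sale) / sale for avm, sale in pairs]
    return {
        "median_error": median(errors),                # direction of bias
        "median_abs_error": median(abs(e) for e in errors),
        "ppe10": sum(abs(e) <= 0.10 for e in errors) / len(errors),
    }
```

Note that every term here is defined relative to the sale price; substituting an appraised value for the sale price changes what is being measured, which is exactly the problem examined next.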

The Circular Logic Problem
The practice of using appraisals as benchmarks for AVM testing raises a fundamental question: Are we testing whether the model provides an accurate value consistent with actual sale prices, or merely a precise value consistent with appraisal estimates? This distinction has significant implications.

If AVMs tested against appraisals are mistakenly labeled as “accurate” rather than simply “consistent with appraisal estimates,” the quality control process becomes streamlined in a potentially dangerous way. More appraisals pass through without complications or delays from further review, creating a self-reinforcing cycle. This circular process—using AVMs in quality control to validate appraisals while simultaneously using appraisals as benchmarks to test AVMs—is fraught with risk and undermines the integrity of both valuation methods.

The Critical Role of Blind Testing
Determining the accuracy of any valuation estimate requires blind testing—the property’s sale price must be unknown when the estimate is produced. While implementing blind testing for appraisals presents significant challenges due to USPAP standards requiring appraisers to review and consider the terms of sale contracts, AVMs can and should be tested blindly with relative ease.

Most AVM vendors already conduct blind testing routinely as part of their quality assurance programs. This process allows modelers to understand model accuracy before a sale price becomes known—a critical capability for calibrating models and improving their predictive accuracy.

The Importance of MLS Data Suppression
The blind testing process must extend to MLS data, particularly listing prices. While listing prices occasionally match final sale prices, they more typically represent the starting point of buyer-seller negotiations. However, the listing price itself is not arbitrary—considerable analysis has already been completed to determine it.

Real estate professionals conduct thorough studies of the property, neighborhood, and relevant market activity—many of the same activities required for property valuation—before establishing a listing price. When AVMs have access to this information, they gain a significant advantage that may not reflect their true predictive capabilities. Therefore, proper testing must ensure that listing prices remain unknown when estimates are produced, maintaining the integrity of the blind testing process.

MLS data suppression also aligns with real-world use cases. In prominent AVM applications such as mortgage refinancing or home equity loans, the properties serving as collateral are rarely listed on MLS. Most lenders actually prohibit refinancing properties that are currently or recently listed. Testing AVMs without MLS data therefore provides a more realistic assessment of model performance in actual lending scenarios.

Regulatory Guidance and Compliance
The breadth and scope of AVM testing—including methodology selection and resource allocation—must be tailored to each institution’s unique circumstances. Regulatory guidance provides flexibility for organizations to demonstrate compliance within the context of their specific risk assessments.

While no explicit prohibition exists against using appraisal values as benchmarks, the Interagency Appraisal and Evaluation Guidance offers clear direction in Appendix B: “To ensure unbiased test results, an institution should compare the results of an AVM to actual sales data in a specified trade area or market prior to the information being available to the model.” This affirmative statement strongly supports the use of sales transactions as the foundation of AVM testing methodology.

Conclusion: Building a Robust Testing Framework
The distinction between measuring precision and accuracy in AVM testing is not merely academic—it has practical implications for lending decisions, risk management, and regulatory compliance. While comparing AVMs to appraisals may offer insights into consistency between valuation methods, it cannot and should not be mistaken for accuracy testing.

A robust AVM testing methodology must incorporate the following core principles:

• Use actual sale transactions as the primary benchmark for accuracy testing

• Implement truly blind testing protocols that exclude sale prices during estimate generation

• Suppress MLS listing price data for the benchmark property only, to ensure unbiased testing. All other MLS-derived information—such as property characteristics, comparable sales, and broader market metrics—should remain fully available.

• Recognize the different purposes served by precision and accuracy measurements

As the regulatory guidance makes clear, institutions have the flexibility to design testing programs that suit their specific needs and risk profiles. However, this flexibility should not obscure the fundamental requirement: AVM testing must measure actual predictive accuracy against real market outcomes, not merely consistency with other valuation estimates. Only through rigorous, unbiased testing against actual sale prices can institutions truly understand and rely upon the accuracy of their automated valuation models.

Using Appraised Values vs. Arm’s-Length Transactions for Testing AVMs

Testing AVMs is a complex challenge, and when faced with difficult problems, creative solutions often arise. One such solution that has gained attention is using appraised values as benchmarks for AVM testing. The reasoning seems straightforward: both an AVM and an appraisal aim to estimate the market value of a property. So, in cases where a true market value from an arm’s-length transaction isn’t available, appraised values may seem like the next best option.

Appraisals are frequently conducted for refinances, home equity loans, and other non-sales transactions, meaning many properties that don’t transact on the open market still have appraised values available. This availability might seem advantageous, and one might reason that if an AVM—designed to be quicker and cheaper—can match these appraisal estimates, it would be sufficient for testing.

On the surface, this reasoning seems sound: why not test AVMs against appraisals if they’re both providing value estimates? However, this approach is based on flawed assumptions, and the problems with using appraised values as benchmarks for AVM testing begin at the fundamental level.

Appraised Values as Benchmarks

  1. Inconsistency: Appraisals are conducted using structured processes with standardized guidelines, but the significant variability in the types of appraisals (e.g., desktop, drive-by, hybrid, interior inspections) and their purposes (e.g., non-mortgage lending, relocation, litigation, HELOC, refinancing) undermines consistency. This inconsistency makes appraised values less reliable as they may not fully reflect “Market Value,” potentially violating the uniformity required in AVM benchmark testing.
  2. Appraisal Bias Study (quantifiable, documented, systemic biases): Evidence shows that appraised values do not exhibit a normal distribution around actual sales prices, suggesting inherent bias*. In particular, appraisers often adjust values upwards to align with expected sales prices to avoid friction with clients. This artificial inflation undermines the accuracy of appraised values as benchmarks, illustrating that they may lead to inaccurate AVM testing.
  3. Appraisal-Derived Data: Appraisal data has inherent limitations due to its sourcing methods, geographic constraints, and the relatively small number of observations. Additionally, it often focuses on specific assignment types and relies on limited-scope appraisals—such as desktop, drive-by, or hybrid data collection—which may not accurately capture the true market value of a property. This variability, coupled with the subjective nature of appraisals, makes them unreliable and impractical for AVM testing. Their limited availability by source and geography and potential for imprecise estimates further reduce their effectiveness as a comprehensive benchmark.
  4. Subjectivity (individual judgment and professional biases): Despite adherence to guidelines, appraisals are subject to the judgment of individual appraisers, introducing bias**. Research supports that appraisals, particularly in certain residential contexts, are performed not to independently assess market value, but to justify predetermined loan amounts. This subjectivity compromises their reliability as benchmarks for AVM testing.
  5. Measuring Error: Using appraised values introduces compounded errors. Testing an AVM against an estimate (appraised values) rather than the true standard (actual market transactions) results in a conflation of two errors—those of the AVM and those inherent in the appraisal. This distorts the AVM’s accuracy, as any bias or error in the appraisal will be reflected in the overall AVM Testing results.
  6. Lag in Market Response: Appraisals are often backward-looking, based on historical closed sales, and thus do not capture rapid market fluctuations. This is particularly problematic in volatile markets where real estate prices can change quickly. The lag in appraisals reflecting current market conditions can further distort AVM testing when outdated data is used.

Arm’s-Length Transactions as Benchmarks

  1. Reflects True Market Value: Arm’s-length transactions occur between unrelated, independent parties, ensuring that both buyer and seller act in their best interests. As a result, these transactions accurately reflect the true market value of the property, making them the most reliable benchmarks for AVM testing.
  2. Up-to-Date Information: Arm’s-length sales represent current market conditions, providing real-time data that ensures AVMs are tested against accurate, relevant information. This minimizes the risk of outdated or inaccurate benchmarks, improving the reliability of the test results.
  3. Data Availability: Arm’s-length transactions are abundant, with approximately 4-6 million sales recorded annually across the U.S. This wealth of comparable sales data enables AVM testing in a variety of markets, including less active markets. The breadth of this data makes it a more practical and reliable option for establishing benchmarks.

Conclusion:

The use of appraised values as benchmarks for AVM testing is inherently flawed due to inconsistency, subjectivity, and the lag in reflecting current market trends. These limitations introduce bias, compounded error, and misrepresentation of accuracy in the testing process. This approach can lead to skewed and unreliable results, potentially violating new AVM quality control standards that emphasize the need for market-reflective testing.

In contrast, arm’s-length transactions provide the most reliable benchmarks, as they reflect true market values at the time of sale, accounting for factors like property exposure and typical days on market. To ensure objectivity and compliance with quality standards, AVM testing must prioritize arm’s-length transactions and ensure that the AVMs are blind to the most recent sale and listing prices***. This safeguards the integrity of the results by eliminating any recent influence of pre-existing market values or pricing.

Ultimately, appraised values should be avoided as benchmarks wherever possible, recognizing the inherent methodological limitations and risks to accuracy in testing results. For rigorous AVM validation, the reliance on arm’s-length transactions ensures more reliable, market-aligned outcomes.

 

*Systemic Risks in Residential Property Valuations: Perceptions and Reality, CATC June 2005 Page 13, “Full Appraisal Bias- Purchase Transactions”

**Other research corroborates the notion that certain residential appraisals are performed NOT to establish an independent and objective market value estimate, but to justify a loan amount. (Ferguson, J.T., After-Sale Evaluations: Appraisals or Justifications? Journal of Real Estate Research, 1988, 3, 19-26).

*** Current AVM Testing Methodologies versus our new Predictive Testing Methodology (PTM™) OR How Listing Prices Made Current AVM Testing Obsolete and How to Fix It (AVMNews September 2024)

#1 AVM in Each County Updated for Q4 2024

Every quarter we analyze all the top independently tested AVMs and compile the results. Click on this GIF to see the top AVM in each county for each quarter. As you watch the quarters change, you can see that the colors representing the top honors change frequently.

 

The main point is how frequently AVM performance changes. That should be no surprise, since market conditions change, and AVMs have different strengths and tendencies. In Q4, AVMetrics independently tested 25 models; however, the GIF only highlights the 14 models that ranked in the top position of the MPTs. At least 11 AVMs shouldn’t be anyone’s first choice ANYWHERE, but they still have customers, presumably customers who don’t know the real performance of their AVMs. AVM vendors and resellers are not independent referees.

Independent testing is the only way to know how AVMs perform.  Every model is constantly being improved as builders add new data feeds and use new techniques to get better results (with respect to new techniques, over at the AVMNews, we curate articles about AVMs, and we highlight several dozen new research articles about AVMs every year).

Q4 Change Highlights: Quarterly Trends Across the Coast

As ever, if you watch a part of the map, you’ll see several changes. In Q4, markets struggled to find sustained stability at higher interest rate levels, and that volatility showed up in the rankings: just when you think things have settled, new variables come into play. Here are some places to watch:

  1. In the Golden State, one of the highest-priority counties in the state has a new king: Los Angeles. Ventura, Orange, and San Diego counties also changed hands (just to name a few).
  2. Several less-populated states had almost wholesale changes, such as the Dakotas, Alaska, Wyoming, Nebraska, Kansas, and New Mexico.
  3. In the Sunshine State, models changed hands in several counties, including Palm Beach, Lee, Volusia, Flagler, and St. Johns.

Takeaways

Things change—a lot. Don’t rely on results from last year, earlier this year, or even last quarter! Markets evolve quickly, and often, 3 months of data are required to gather a large enough sample in smaller regions. But we can slice and analyze data in many different ways to get a clearer picture.

Use more than one AVM. It’s not always obvious from a map showing just one AVM per county, but if you consider what goes into producing those results, you’ll see that AVMs have different strengths. There are many competing to climb to the top of the rankings, so when valuing a property, you can’t be sure which AVM will be the best fit. When an AVM produces a result with low confidence, there’s a very good chance another AVM will provide a more reasonable estimate.

Use the right AVM for each use case and keep testing. Things change quickly and frequently, so staying adaptable and testing different models will help ensure you’re getting the best results.

#1 AVM in Each County Updated for Q3 2024

Every quarter we analyze all the top independently tested AVMs and compile the results. Click on this GIF to see the top AVM in each county for each quarter. As you watch the quarters change, you can see that the colors representing the top honors change frequently.

The main point is how frequently AVM performance changes. That should be no surprise, since market conditions change, and AVMs have different strengths and tendencies. In Q3, AVMetrics independently tested 24 models; however, the GIF only highlights the 13 models that ranked in the top position of the MPTs. At least 11 AVMs shouldn’t be anyone’s first choice ANYWHERE, but they still have customers, presumably customers who don’t know the real performance of their AVMs. AVM vendors and resellers are not independent referees.

Independent testing is the only way to know how AVMs perform. This past quarter we saw several models retire while whole new models were introduced. Every model is constantly being improved as builders add new data feeds and use new techniques to get better results (with respect to new techniques, over at the AVMNews, we curate articles about AVMs, and we highlight several dozen new research articles about AVMs every year).

Q3 Change Highlights: Quarterly Trends Across the Coast

As ever, if you watch a part of the map, you’ll see several changes. But, in Q3, as markets stabilized at higher interest rate levels, we saw a changing of the guard. Here are some places to watch:

  1. In the Golden State, one of the highest-priority counties in the state has a new king: Los Angeles. Inyo, Imperial, Kings, and Tulare counties also changed hands (just to name a few).
  2. Several less-populated states had almost wholesale changes, such as the Dakotas, Alaska, Montana, Wyoming, Nebraska, Oklahoma, and Kansas.
  3. In the Sunshine State, models were able to value several smaller counties that were not captured by models in Q2, including Collier, Columbia, Okeechobee, Suwannee, and Union.

Takeaways

Things change – a lot. Don’t rely on the results from last year or earlier this year; you can’t even trust last quarter! Often, three months of data are required to get a large enough sample in smaller regions, but we can slice the data every way imaginable.

Use more than one AVM. It’s not obvious from a map showing just one AVM in each county, but if you think about what goes into producing these results, you’ll realize that AVMs have different strengths, and there are a lot of them climbing all over each other to get to the top of the rankings. So, when you’re valuing a particular property, you just don’t know if it will be a good candidate for even the best AVM. When that AVM produces a result with low confidence, there’s a very good chance that another AVM will produce a reasonable estimate.

Use the right AVM for each use case and keep testing, because things change a lot and often.