Blog - AVMetrics

Using Appraised Values vs. Arm’s-Length Transactions for Testing AVMs

Testing AVMs is a complex challenge, and difficult problems often give rise to creative solutions. One such solution that has gained attention is using appraised values as benchmarks for AVM testing. The reasoning seems straightforward: both an AVM and an appraisal aim to estimate the market value of a property. So, in cases where a true market value from an arm’s-length transaction isn’t available, appraised values may seem like the next best option.

Appraisals are frequently conducted for refinances, home equity loans, and other non-sales transactions, meaning many properties that don’t transact on the open market still have appraised values available. This availability might seem advantageous, and one might reason that if an AVM, designed to be quicker and cheaper, can match these appraisal estimates, it would be sufficient for testing.

On the surface, this reasoning seems sound: why not test AVMs against appraisals if they’re both providing value estimates? However, this approach is based on flawed assumptions, and the problems with using appraised values as benchmarks for AVM testing begin at the fundamental level.

Appraised Values as Benchmarks

Inconsistency: Appraisals are conducted using structured processes with standardized guidelines, but the significant variability in the types of appraisals (e.g., desktop, drive-by, hybrid, interior inspections) and their purposes (e.g., non-mortgage lending, relocation, litigation, HELOC, refinancing) undermines consistency. This inconsistency makes appraised values less reliable as they may not fully reflect “Market Value,” potentially violating the uniformity required in AVM benchmark testing.
Appraisal Bias Study (quantifiable, documented, systemic biases): Evidence shows that appraised values do not exhibit a normal distribution around actual sales prices, suggesting inherent bias*. In particular, appraisers often adjust values upwards to align with expected sales prices to avoid friction with clients. This artificial inflation undermines the accuracy of appraised values as benchmarks, illustrating that they may lead to inaccurate AVM testing.
Appraisal-Derived Data: Appraisal data has inherent limitations due to its sourcing methods, geographic constraints, and the relatively small number of observations. Additionally, it often focuses on specific assignment types and relies on limited-scope appraisals (such as desktop, drive-by, or hybrid data collection), which may not accurately capture the true market value of a property. This variability, coupled with the subjective nature of appraisals, makes them unreliable and impractical for AVM testing. Their limited availability by source and geography and potential for imprecise estimates further reduce their effectiveness as a comprehensive benchmark.
Subjectivity (individual judgment and professional biases): Despite adherence to guidelines, appraisals are subject to the judgment of individual appraisers, introducing bias**. Research supports that appraisals, particularly in certain residential contexts, are performed not to independently assess market value, but to justify predetermined loan amounts. This subjectivity compromises their reliability as benchmarks for AVM testing.
Measuring Error: Using appraised values introduces compounded errors. Testing an AVM against an estimate (appraised values) rather than the true standard (actual market transactions) results in a conflation of two errors: those of the AVM and those inherent in the appraisal. This distorts the AVM’s measured accuracy, as any bias or error in the appraisal will be reflected in the overall AVM testing results.
Lag in Market Response: Appraisals are often “rear-looking,” based on historical closed sales, and thus do not capture rapid market fluctuations. This is particularly problematic in volatile markets where real estate prices can change quickly. The lag in appraisals reflecting current market conditions can further distort AVM testing when outdated data is used.
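
The compounding described under Measuring Error can be illustrated with a small simulation. This is a minimal sketch under invented assumptions: the function name `simulate`, the error spreads, and the 3% upward appraisal bias are illustrative values chosen for the example, not figures taken from the cited studies.

```python
import random
import statistics


def simulate(n=20000, avm_sd=0.08, appraisal_sd=0.06, appraisal_bias=0.03, seed=0):
    """Score one simulated AVM against sale prices and against appraisals.

    All parameters are illustrative assumptions. Errors are expressed as a
    fraction of the benchmark value they are measured against.
    """
    rng = random.Random(seed)
    vs_sale, vs_appraisal = [], []
    for _ in range(n):
        sale = 300_000.0  # arm's-length price: the benchmark we actually care about
        avm = sale * (1 + rng.gauss(0, avm_sd))  # unbiased but noisy AVM
        appraisal = sale * (1 + rng.gauss(appraisal_bias, appraisal_sd))  # biased, noisy appraisal
        vs_sale.append((avm - sale) / sale)
        vs_appraisal.append((avm - appraisal) / appraisal)
    return {
        "mean_error_vs_sale": statistics.fmean(vs_sale),
        "sd_vs_sale": statistics.pstdev(vs_sale),
        "mean_error_vs_appraisal": statistics.fmean(vs_appraisal),
        "sd_vs_appraisal": statistics.pstdev(vs_appraisal),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:+.4f}")
```

Under these assumptions, the same AVM shows a wider error spread and an apparent tendency to undervalue when scored against appraisals, even though its error against sale prices is centered near zero. The appraisal's own noise and bias are being charged to the AVM.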

Arm’s-Length Transactions as Benchmarks

Reflects True Market Value: Arm’s-length transactions occur between unrelated, independent parties, ensuring that both buyer and seller act in their best interests. As a result, these transactions accurately reflect the true market value of the property, making them the most reliable benchmarks for AVM testing.
Up-to-Date Information: Arm’s-length sales represent current market conditions, providing real-time data that ensures AVMs are tested against accurate, relevant information. This minimizes the risk of outdated or inaccurate benchmarks, improving the reliability of the test results.
Data Availability: Arm’s-length transactions are abundant, with approximately 4-6 million sales recorded annually across the U.S. This wealth of comparable sales data enables AVM testing in a variety of markets, including less active markets. The breadth of this data makes it a more practical and reliable option for establishing benchmarks.

Conclusion

The use of appraised values as benchmarks for AVM testing is inherently flawed due to inconsistency, subjectivity, and the lag in reflecting current market trends. These limitations introduce bias, compounded error, and misrepresentation of accuracy in the testing process. This approach can lead to skewed and unreliable results, potentially violating new AVM quality control standards that emphasize the need for market-reflective testing.

In contrast, arm’s-length transactions provide the most reliable benchmarks, as they reflect true market values at the time of sale, accounting for factors like property exposure and typical days on market. To ensure objectivity and compliance with quality standards, AVM testing must prioritize arm’s-length transactions and ensure that the AVMs are blind to the most recent sale and listing prices***. This safeguards the integrity of the results by eliminating any recent influence of pre-existing market values or pricing.

Ultimately, appraised values should be avoided as benchmarks wherever possible, recognizing the inherent methodological limitations and risks to accuracy in testing results. For rigorous AVM validation, the reliance on arm’s-length transactions ensures more reliable, market-aligned outcomes.

*Systemic Risks in Residential Property Valuations: Perceptions and Reality, CATC June 2005 Page 13, “Full Appraisal Bias- Purchase Transactions”

**Other research corroborates the notion that certain residential appraisals are performed NOT to establish an independent and objective market value estimate, but to justify a loan amount. (Ferguson, J.T., After-Sale Evaluations: Appraisals or Justifications? Journal of Real Estate Research, 1988, 3, 19-26).

*** Current AVM Testing Methodologies versus our new Predictive Testing Methodology (PTM™) OR How Listing Prices Made Current AVM Testing Obsolete and How to Fix It (AVMNews September 2024)

#1 AVM in Each County Updated for Q4 2024

Every quarter we analyze all the top independently tested AVMs and compile the results. Click on this GIF to see the top AVM in each county for each quarter. As you watch the quarters change, you can see that the colors representing the top honors change frequently.

The main point is how frequently AVM performance changes. That should be no surprise, since market conditions change, and AVMs have different strengths and tendencies. In Q4, AVMetrics independently tested 25 models; however, the GIF only highlights the 14 models that ranked in the top position of the MPTs. At least 11 AVMs shouldn’t be anyone’s first choice ANYWHERE, but they still have customers, presumably customers who don’t know the real performance of their AVMs. AVM vendors and resellers are not independent referees.
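
The tally above (models tested versus models that rank first somewhere) amounts to picking a winner per county and then checking which models never win. The sketch below is a hypothetical illustration only: the `score` field, the county labels, and the model names are invented, and the post does not describe the actual ranking criteria.

```python
def top_model_by_county(results):
    """Return {county: best model} from (county, model, score) rows.

    `score` is a stand-in where higher means better; it is an assumption
    made for this example, not the metric used in the actual rankings.
    """
    best = {}
    for county, model, score in results:
        if county not in best or score > best[county][1]:
            best[county] = (model, score)
    return {county: model for county, (model, _) in best.items()}


def never_first(results):
    """Return the set of tested models that are not the top model in any county."""
    tested = {model for _, model, _ in results}
    winners = set(top_model_by_county(results).values())
    return tested - winners


if __name__ == "__main__":
    # Invented sample rows: (county, model, score)
    sample = [
        ("County A", "model_1", 0.91),
        ("County A", "model_2", 0.88),
        ("County B", "model_2", 0.85),
        ("County B", "model_3", 0.80),
    ]
    print(top_model_by_county(sample))
    print(never_first(sample))
```

Rerunning the same selection on each quarter's results is what makes the changes in leadership visible from one quarter to the next.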

Independent testing is the only way to know how AVMs perform. Every model is constantly being improved as builders add new data feeds and use new techniques to get better results (with respect to new techniques, over at the AVMNews, we curate articles about AVMs, and we highlight several dozen new research articles about AVMs every year).

Q4 Change Highlights- Quarterly Trends Across the Coast

As ever, if you watch a part of the map, you’ll see several changes. The dynamics in Q4 really highlighted that achieving sustained stability at higher interest rate levels was a complex challenge. The heightened volatility we saw was a clear indicator that even as efforts were made to stabilize markets, the financial landscape remained fluid. It’s like watching a constantly shifting map—just when you think things have settled, new variables come into play. Here are some places to watch:

In the Golden State, one of the highest-priority counties in the state, Los Angeles, has a new king. Ventura, Orange, and San Diego counties also changed hands (to name just a few).
Several less-populated states had almost wholesale changes, such as the Dakotas, Alaska, Wyoming, Nebraska, Kansas and New Mexico.
In the Sunshine State, models changed hands in several counties, including Palm Beach, Lee, Volusia, Flagler, and St. Johns.

Takeaways

Things change—a lot. Don’t rely on results from last year, earlier this year, or even last quarter! Markets evolve quickly, and often, 3 months of data are required to gather a large enough sample in smaller regions. But we can slice and analyze data in many different ways to get a clearer picture.

Use more than one AVM. It’s not always obvious from a map showing just one AVM per county, but if you consider what goes into producing those results, you’ll see that AVMs have different strengths. There are many competing to climb to the top of the rankings, so when valuing a property, you can’t be sure which AVM will be the best fit. When an AVM produces a result with low confidence, there’s a very good chance another AVM will provide a more reasonable estimate.

Use the right AVM for each use case and keep testing. Things change quickly and frequently, so staying adaptable and testing different models will help ensure you’re getting the best results.
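
The "use more than one AVM" advice is often implemented as a fallback chain: try models in a preferred order and accept the first result that is confident enough. The following is a minimal sketch under stated assumptions: the `Estimate` type, the 0-to-1 `confidence` scale, and the 0.8 threshold are hypothetical, since each AVM reports confidence in its own way.

```python
from typing import NamedTuple, Optional, Sequence


class Estimate(NamedTuple):
    model: str
    value: float
    confidence: float  # hypothetical 0-1 scale, assumed comparable across models


def cascade(
    estimates: Sequence[Optional[Estimate]],
    min_confidence: float = 0.8,
) -> Optional[Estimate]:
    """Return the first estimate, in preference order, that meets the confidence floor.

    A `None` entry stands for a model that returned no value for the property.
    Returns None when no model is confident enough.
    """
    for estimate in estimates:
        if estimate is not None and estimate.confidence >= min_confidence:
            return estimate
    return None


if __name__ == "__main__":
    # Invented results for one property, ordered by the model's rank in its county.
    results = [
        Estimate("model_a", 410_000.0, 0.55),  # top-ranked model, but low confidence here
        None,                                  # second model returned no value
        Estimate("model_c", 402_000.0, 0.90),
    ]
    print(cascade(results))
```

The preference order would come from independent test results for the county and use case, which is why it needs to be refreshed as rankings change.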

AVM Testing Schedule Announced for 2025

In our menu above, under “AVM Information” we always have the latest version of our testing schedule. 2025 AVM Validation Testing Dates have been published there as of today.

#1 AVM in Each County Updated for Q3 2024

The main point is how frequently AVM performance changes. That should be no surprise, since market conditions change, and AVMs have different strengths and tendencies. In Q3, AVMetrics independently tested 24 models; however, the GIF only highlights the 13 models that ranked in the top position of the MPTs. At least 11 AVMs shouldn’t be anyone’s first choice ANYWHERE, but they still have customers, presumably customers who don’t know the real performance of their AVMs. AVM vendors and resellers are not independent referees.

Independent testing is the only way to know how AVMs perform. This past quarter we saw several models retire while whole new models were introduced. Every model is constantly being improved as builders add new data feeds and use new techniques to get better results (with respect to new techniques, over at the AVMNews, we curate articles about AVMs, and we highlight several dozen new research articles about AVMs every year).

Q3 Change Highlights- Quarterly Trends Across the Coast

As ever, if you watch a part of the map, you’ll see several changes. But, in Q3, as markets stabilized at higher interest rate levels, we saw a changing of the guard. Here are some places to watch:

In the Golden State, one of the highest-priority counties in the state, Los Angeles, has a new king. Inyo, Imperial, Kings, and Tulare counties also changed hands (to name just a few).
Several less-populated states had almost wholesale changes, such as the Dakotas, Alaska, Montana, Wyoming, Nebraska, Oklahoma and Kansas.
In the Sunshine State, models were able to value several smaller counties that were not captured by models in Q2, including Collier, Columbia, Okeechobee, Suwannee, and Union.

Takeaways

Things change – a lot. Don’t rely on the results from last year or earlier this year. Heck, you can’t even trust last quarter! Often, 3 months of data are required to get a large enough sample in smaller regions, but we can slice it every way imaginable.

Use more than one AVM. It’s not obvious from a map showing just one AVM in each county, but if you think about what’s going on to produce these results, you’ll realize that AVMs have different strengths and there are a lot of them climbing all over each other to get to the top of the ranking. So, when you’re valuing a particular property, you just don’t know if it will be a good candidate for even the best AVM. When that AVM produces a result with low confidence, there’s a very good chance that another AVM will produce a reasonable estimate.

Use the right AVM for each use case and keep testing, because things change a lot and often.

Study: AVMetrics’ Predictive Testing Methodology

This content is restricted to subscribers

AVM Testing and Evaluation using AVM Performance Metrics

This content is restricted to subscribers

Principles for Calculating AVM Performance Metrics

This content is restricted to subscribers

Study: AVMetrics New Testing Methodology

This content is restricted to subscribers

AVMs React to New Final AVM Rules

On August 16th, Jon Wierks from First American penned an article about how First American is reacting to the new AVM Final Ruling. The article made several interesting points:

1. First American has specifically enhanced its AVM, its testing, and some of its tools in anticipation of the new rules. For example, FA has invested in explainable AI (xAI) in order to address fairness concerns.

Newer AVMs, like our Procision™ AVM Suite, were designed to comply with current AVM guidelines and in anticipation of the new guidelines.

2. First American expects AVM users to take on their own testing responsibility, and this doesn’t just apply to banks.

…new guidelines, Quality Control Standards for Automated Valuation Models, requires mortgage originators and secondary market issuers “to maintain policies, practices, procedures, and control systems to ensure that automated valuation models used in these transactions adhere to quality control standards

3. AEI’s recent AVM study has drawn attention to the biggest issues with AVM testing, and our new testing techniques are advancing testing beyond any other innovation in a decade.

For several years, AVMetrics has been developing a blind testing system that it will roll out later this year. Rather than sending the same addresses to various providers each month and getting back their valuations, AVM providers will now value every property in the U.S. — more than 100 million valuations each month — and send this data to AVMetrics. The testing company will ingest this data and then blind test it against future sales and listing prices as they transact. As you would expect, this is a massive undertaking for AVM vendors and AVMetrics, but it will separate the AVMs that test well from those that actually perform well in real-world conditions.

Wierks’ conclusions are right on target with our beliefs: improvements in AVM accuracy, precision and confidence scoring are making AVMs more useful to the industry, and appropriate testing is a prerequisite to their widespread adoption.
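
The blind-testing idea in the quoted passage (collect valuations first, then score them against sales that happen later) can be sketched in a few lines. This is an illustration of the general idea only, not AVMetrics’ actual methodology: the function name `blind_test_pairs`, the tuple layouts, and the sample records are all hypothetical.

```python
from datetime import date


def blind_test_pairs(valuations, sales):
    """Pair each sale with the latest valuation produced strictly before it.

    valuations: list of (property_id, valuation_date, value)
    sales:      list of (property_id, sale_date, price)
    Returns a list of (property_id, value, price, signed_error), where
    signed_error = (value - price) / price. Valuations dated on or after
    the sale are ignored, so the sale price cannot have informed them.
    """
    by_property = {}
    for pid, vdate, value in valuations:
        by_property.setdefault(pid, []).append((vdate, value))

    pairs = []
    for pid, sdate, price in sales:
        prior = [(d, v) for d, v in by_property.get(pid, []) if d < sdate]
        if prior:
            _, value = max(prior, key=lambda dv: dv[0])  # most recent prior valuation
            pairs.append((pid, value, price, (value - price) / price))
    return pairs


if __name__ == "__main__":
    # Invented records for illustration.
    valuations = [
        ("p1", date(2024, 1, 1), 300_000.0),
        ("p1", date(2024, 2, 1), 310_000.0),
        ("p1", date(2024, 3, 15), 330_000.0),  # after the sale: excluded
        ("p2", date(2024, 4, 1), 200_000.0),   # after the sale: excluded
    ]
    sales = [
        ("p1", date(2024, 3, 1), 320_000.0),
        ("p2", date(2024, 3, 1), 210_000.0),
    ]
    for row in blind_test_pairs(valuations, sales):
        print(row)
```

A fuller version would also need to exclude valuations made after the property was listed, since the passage above mentions listing prices as well as sales; that step is left out here because the source does not describe how it is done.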

#1 AVM in Each County Updated for Q2 2024

Every quarter we analyze all the top AVMs and compile the results. Click on this GIF to see the top AVM in each county for each quarter. As you watch the quarters change, you can see that the colors representing the top honors change frequently.

The main point is how frequently AVM performance changes. That should be no surprise, since market conditions change and AVMs have different strengths and tendencies. Phoenix has more tract housing, and some AVMs are optimized for that. Cities in the northeast have more row housing, and some models are better there. But AVMs also change – a lot. Whole new models are introduced, but every model is constantly being improved as builders add new data feeds and use new techniques to get better results (with respect to new techniques, over at the AVMNews, we curate articles about AVMs, and we highlight several hundred new research articles about AVMs every year).

Q2 Change Highlights

As ever, if you watch a part of the map, you’ll see several changes. In Q2, we saw a changing of the guard. Here are some places to watch:

In Texas, most counties changed leadership. The counties that include Austin and its suburbs changed leadership. Not Dallas, but most of the counties around Dallas, and not Houston (Harris), but most of the counties around Harris County changed leadership.
Much of Alaska and the West Coast changed leadership.
Some less-populated areas had almost wholesale changes, such as Colorado, Nevada, New Mexico, the Dakotas, rural Michigan, Illinois, Missouri, Arkansas, Louisiana, Mississippi and more.

Takeaways

Things change – a lot. Don’t rely on the results from last year. Heck, you can’t even trust last quarter! We compile these results quarterly, but our testing is non-stop, and we can produce new optimizations monthly based on a rolling 3 months or any other time period. Often, 3 months of data are required to get a large enough sample in smaller regions, but we can slice it every way imaginable.
Use more than one AVM. It’s not obvious from a map showing just one AVM in each county, but if you think about what’s going on to produce these results, you’ll realize that AVMs are climbing all over each other to get to the top of the ranking. So, when you’re valuing a particular property, you just don’t know if it will be a good candidate for even the best AVM. When that AVM produces a result with low confidence, there’s a very good chance that another AVM will produce a reasonable estimate. Why not be able to take three, four or five bites at the apple?
