Engineering

How We Achieved 99.2% Scraping Accuracy

Jan 14, 2026 · 14 min read

When we started building HostFeeds, accuracy was our north star. Not uptime, not speed, not total volume — accuracy. Rental data is only useful if you can trust it, and there is no middle ground: either your nightly rate, occupancy estimate, and review count are correct, or the downstream decision (pricing, acquisition, portfolio rebalancing) is worse than having no data at all. This article is a behind-the-scenes walkthrough of how we built an extraction pipeline that maintains 99.2% accuracy across five platforms and 208+ normalized data fields, including the mistakes we made along the way.

The short version: there is no single trick. Our accuracy comes from four independent layers — multi-path extraction, validation pipelines, platform-specific adapters, and continuous automated benchmarking — that together catch the errors a naive scraper would ship. Removing any single layer costs 3-8 percentage points of accuracy. You need all four.

Layer 1: Multi-path extraction

Every serious scraping system eventually learns the same lesson: never rely on a single extraction method. The web is too messy. A page that renders perfectly today can break tomorrow because an engineer at the host platform renamed a CSS class. A JSON endpoint that returns clean data for 10,000 listings can suddenly return truncated results for the 10,001st. A regex that worked for two years can silently fail on Unicode characters it never saw before.

To protect against this, every listing in our pipeline goes through three independent parsing paths:

  1. Structured data extraction — we look for JSON-LD, microdata, and Open Graph tags embedded in the page. When these are present and complete, they give us the cleanest possible data.
  2. DOM analysis — we parse the rendered HTML and extract fields from known selectors. This is the workhorse path and handles 70%+ of all fields.
  3. API response capture — for platforms that expose internal JSON APIs (Evolve, for example, uses Algolia), we intercept those responses directly. This gives us the richest data but is the most fragile to platform changes.

When all three paths agree on a value, we have high confidence and ship the listing. When they disagree, the record gets flagged for a second pass through our reconciliation logic, which weights each source based on its historical reliability for that specific field. If reconciliation still produces a conflict, the listing is marked "needs review" and excluded from the export rather than silently shipping wrong data.
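In Python-flavored pseudocode, the reconciliation step might look something like this. The weights and the tie-breaking threshold here are illustrative, not our production values, and the field names are hypothetical:

```python
# Hypothetical per-field reliability weights; in production these would be
# learned from each path's historical agreement with ground truth.
RELIABILITY = {
    "nightly_rate": {"structured": 0.9, "dom": 0.8, "api": 0.95},
}

def reconcile(field, candidates, weights=RELIABILITY):
    """Pick a value when the extraction paths disagree.

    candidates maps path name -> extracted value (None if the path
    produced nothing). Returns (value, status) where status is one of
    "agreed", "reconciled", or "needs_review".
    """
    values = {p: v for p, v in candidates.items() if v is not None}
    if not values:
        return None, "needs_review"
    distinct = set(values.values())
    if len(distinct) == 1:                      # every non-null path agrees
        return distinct.pop(), "agreed"
    # Weight each distinct value by the historical reliability of the
    # paths that produced it, for this specific field.
    w = weights.get(field, {})
    scores = {}
    for path, value in values.items():
        scores[value] = scores.get(value, 0.0) + w.get(path, 0.5)
    best, runner_up = sorted(scores.values(), reverse=True)[:2]
    if best - runner_up < 0.1:                  # too close to call
        return None, "needs_review"             # exclude, don't guess
    return max(scores, key=scores.get), "reconciled"
```

The key design choice is the last branch: a near-tie produces `needs_review`, never a coin flip.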

The single best decision we made early was to fail loudly instead of shipping wrong numbers. Excluded listings are obvious and fixable. Wrong listings poison downstream analysis silently for months.

Layer 2: Validation pipelines

Multi-path extraction gets you most of the way to accuracy, but some errors slip through even when all three paths agree. This is where validation pipelines take over. Every data point passes through three independent validators before it's allowed into the export:

Type validation

Every field has a declared type — integer, float, ISO date, currency code, URL, enum. If a nightly rate parses as a string instead of a float, the field is rejected. If a check-in date doesn't parse as a valid ISO 8601 date, it's rejected. Type validation catches silent encoding errors — the kind where a scraper accidentally captures a price with a non-breaking space before the number and ships $ 250 instead of $250.
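A minimal sketch of two such type validators (function names are ours, for illustration). Note how the rate parser explicitly handles the non-breaking-space case:

```python
import re
from datetime import date

NBSP = "\u00a0"

def parse_rate(raw):
    """Parse a nightly-rate string into a float, or None if it can't.

    Strips currency symbols and thousands separators, including the
    non-breaking spaces that sneak into scraped prices (e.g. "$\u00a0250").
    """
    cleaned = raw.replace(NBSP, " ")
    match = re.search(r"\d[\d,]*(?:\.\d+)?", cleaned)
    if not match:
        return None          # unparseable -> null, never a fake zero
    return float(match.group().replace(",", ""))

def parse_checkin(raw):
    """Accept only ISO 8601 dates; return None for anything else."""
    try:
        return date.fromisoformat(raw.strip())
    except ValueError:
        return None
```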

Range validation

Each field has a reasonable value range. A nightly rate of $0 gets flagged (probably a free cancellation block, not a real rate). A nightly rate above $50,000 gets flagged (probably a misparsed total-stay price). A bedroom count above 15 gets flagged (probably a corporate housing block, not a single listing). These rules are mostly common sense, but they catch hundreds of edge cases per million records.
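Range rules are cheap to express as data. A sketch mirroring the examples above (bounds are illustrative, not our production thresholds):

```python
# Each entry is (min_inclusive, max_inclusive); out-of-range values are
# flagged for review, not silently dropped.
RANGE_RULES = {
    "nightly_rate": (1.0, 50_000.0),  # $0 and >$50k are both suspicious
    "bedrooms": (0, 15),              # above 15 is likely a block listing
    "rating": (1.0, 5.0),
}

def range_flags(record):
    """Return the names of fields whose values fall outside their range."""
    flagged = []
    for field, (lo, hi) in RANGE_RULES.items():
        value = record.get(field)
        if value is not None and not (lo <= value <= hi):
            flagged.append(field)
    return flagged
```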

Cross-field consistency

Fields must be consistent with each other. A 6-bedroom property with 1 bathroom is almost certainly a parsing error. A listing with 200 reviews but a 4.0 rating is plausible; a listing with 200 reviews and a 2.1 rating is suspicious. A listing claiming "sleeps 12" with only 1 bed is clearly broken. We have hundreds of these cross-field rules and we add new ones every time we find a new failure mode in production.
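Cross-field rules fit the same pattern: a list of named predicates, each encoding one failure mode. The three below correspond to the examples in this section (names and thresholds are illustrative):

```python
# (description, predicate) pairs; a record that trips any predicate is
# flagged for review.
CROSS_FIELD_RULES = [
    ("bedrooms vs bathrooms",
     lambda r: r.get("bedrooms", 0) >= 6 and r.get("bathrooms", 99) <= 1),
    ("sleeps vs beds",
     lambda r: r.get("sleeps", 0) >= 12 and r.get("beds", 99) <= 1),
    ("reviews vs rating",
     lambda r: r.get("review_count", 0) >= 200 and r.get("rating", 5.0) < 2.5),
]

def cross_field_flags(record):
    """Return the descriptions of every cross-field rule the record trips."""
    return [name for name, tripped in CROSS_FIELD_RULES if tripped(record)]
```

Keeping rules as data rather than scattered `if` statements is what makes "we add new ones every time we find a new failure mode" a one-line change.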

Layer 3: Platform-specific adapters

Every platform structures data differently. Airbnb's listing page layout is nothing like VRBO's. Booking.com embeds JSON-LD that is mostly complete; Vacasa renders almost nothing server-side and requires waiting for JavaScript to execute. Evolve captures data from Algolia search responses. Writing one generic scraper to handle all of these is how you get 85% accuracy and a lot of angry customers.

Instead, we maintain dedicated adapters for each platform. Each adapter understands the specific DOM structure, API response shape, rate limiting behavior, and edge cases of one platform. The adapter exposes a normalized interface to the rest of the pipeline — which means upstream code never has to know whether it's dealing with Airbnb or Booking.com. Normalization is where accuracy is born.
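The adapter contract can be sketched as a small interface. Everything here — the method names, the toy payload, the `FakeEvolveAdapter` — is hypothetical, shown only to illustrate the shape of the boundary between adapters and the rest of the pipeline:

```python
from typing import Protocol

class PlatformAdapter(Protocol):
    """Every adapter exposes the same two methods, so upstream code
    never knows which platform it's talking to."""
    platform: str

    def fetch(self, listing_id: str) -> dict:
        """Retrieve the raw payload for one listing."""
        ...

    def extract(self, raw: dict) -> dict:
        """Map the raw payload onto the normalized field names."""
        ...

class FakeEvolveAdapter:
    """Toy adapter with a canned payload, satisfying the protocol."""
    platform = "evolve"

    def fetch(self, listing_id: str) -> dict:
        return {"id": listing_id, "hits": {"rate": 189.0, "br": 3}}

    def extract(self, raw: dict) -> dict:
        return {"nightly_rate": raw["hits"]["rate"],
                "bedrooms": raw["hits"]["br"]}
```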

Example: the cleaning fee problem

Airbnb stores cleaning fees in one place. VRBO stores them in another. Booking.com sometimes embeds them in the base rate and sometimes lists them separately. Without platform-specific adapters, you end up with a "cleaning fee" column that means different things depending on which row you're looking at — which is worse than having no column at all. Our adapters normalize every fee into a consistent schema: base_rate, cleaning_fee, service_fee, taxes, total_before_fees, total_after_fees. Same meaning everywhere.
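To make the normalization concrete, here is a sketch of two fee adapters converging on the same schema. The raw payload shapes are invented stand-ins, not real platform responses; the six output keys are the schema named above:

```python
# The six normalized fee fields; every adapter must emit exactly these
# keys, whatever the platform's raw shape is.
FEE_FIELDS = ("base_rate", "cleaning_fee", "service_fee",
              "taxes", "total_before_fees", "total_after_fees")

def normalize_itemized(raw):
    """Hypothetical platform that lists every fee separately."""
    base = raw["nightly"]
    cleaning = raw.get("cleaning", 0.0)
    service = raw.get("service", 0.0)
    taxes = raw.get("taxes", 0.0)
    return {"base_rate": base, "cleaning_fee": cleaning,
            "service_fee": service, "taxes": taxes,
            "total_before_fees": base,
            "total_after_fees": base + cleaning + service + taxes}

def normalize_bundled(raw):
    """Hypothetical platform that folds cleaning into the quoted rate:
    split it back out so cleaning_fee means the same thing everywhere."""
    cleaning = raw.get("cleaning_component", 0.0)
    base = raw["rate_with_cleaning"] - cleaning
    return normalize_itemized({"nightly": base, "cleaning": cleaning,
                               "service": raw.get("service", 0.0),
                               "taxes": raw.get("taxes", 0.0)})
```

Two very different raw shapes, one identical output record — that's the whole point of the adapter layer.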

Layer 4: Continuous automated benchmarking

Platforms change their layouts frequently — sometimes silently, sometimes in ways that break one scraper but not the others. We run automated accuracy benchmarks every 6 hours against a known dataset of 500 canonical listings across all five platforms. Each benchmark compares current scrape output against a verified ground truth, calculates a per-field accuracy score, and alerts our on-call engineer if any platform drops below 99.2% overall.
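The core of the benchmark is a straightforward field-by-field diff against ground truth. A minimal sketch (record shapes and function names are illustrative; the 99.2% threshold is the one from this post):

```python
THRESHOLD = 0.992

def per_field_accuracy(scraped, ground_truth):
    """Compare scraped records against verified ground truth.

    Both arguments map listing id -> {field: value}. Returns
    {field: fraction of listings where the scraped value matched}.
    """
    correct, total = {}, {}
    for listing_id, truth in ground_truth.items():
        got = scraped.get(listing_id, {})
        for field, expected in truth.items():
            total[field] = total.get(field, 0) + 1
            if got.get(field) == expected:
                correct[field] = correct.get(field, 0) + 1
    return {f: correct.get(f, 0) / total[f] for f in total}

def should_alert(scraped, ground_truth):
    """True if mean per-field accuracy falls below the threshold."""
    scores = per_field_accuracy(scraped, ground_truth)
    return sum(scores.values()) / len(scores) < THRESHOLD
```

The per-field breakdown matters as much as the overall score: it's what lets step 1 of the playbook below pinpoint which field broke.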

When an alert fires, we have a standard playbook:

  1. Identify which field broke by diffing the benchmark output against ground truth.
  2. Determine whether the platform changed its markup (most common) or whether we hit an edge case we had never seen before.
  3. Patch the adapter, push the fix, rerun the benchmark to confirm accuracy is restored.
  4. Write a regression test so the same edge case can never silently pass again.

The typical cycle from alert to deployed fix is 2-6 hours. In 2025 we had exactly 14 incidents where accuracy dropped below threshold; 11 were caught by automated benchmarks before any customer noticed, 3 were reported by customers first. We treat the three customer-reported incidents as the most important data points — they represent gaps in our benchmark coverage, and each one generated new test cases.

What we learned the hard way

A few lessons we paid tuition for:

  • Silent failures are worse than loud failures. We used to return "0" for unparseable numeric fields. That turned out to be catastrophic because downstream dashboards treated $0 rates as real. Now unparseable fields return null and the listing is flagged.
  • Your test data goes stale. Our original benchmark dataset was 100 listings. We grew it to 500 after a platform change caused a localized regression that affected property types we hadn't tested for.
  • Speed is a distraction until accuracy is solved. Early on we optimized for throughput. That was a mistake. Accuracy and throughput are in tension, and you cannot compensate for wrong data by having more of it faster.
  • Customer reports are gold. The 3 customer-reported incidents in 2025 taught us more than the 11 automated alerts combined. Make it easy for customers to flag suspicious data and reply to every single report.

The result: data you can actually trust

This multi-layered approach is why you can trust the data you export from HostFeeds. Whether you're making a $500K investment decision or setting next week's nightly rate, the numbers are solid. Not perfect — 99.2% means 8 in 1,000 records still have some minor issue, and we're transparent about that — but materially better than any single-layer scraper we've ever benchmarked.

Accuracy is not a feature we ship and then forget about. It's the constraint that shapes every engineering decision we make. When we add a new platform, the first question is always "how do we hit 99.2% on this one too?" — not "how fast can we launch?" That discipline is the actual moat.