Why Sending Raw HTML to an LLM for Web Scraping Is a Mistake (and What to Do Instead)
The Hidden Cost of Large DOM Inputs
When developers first attempt to build a web scraper using large language models (LLMs), the natural instinct is to feed the entire page's HTML into the model and ask it to extract relevant data. This seems straightforward, but it quickly reveals a major inefficiency: a typical product listing page contains 500–700 KB of raw DOM markup. Processing that much input means paying for approximately 150,000 tokens per request, enduring 15–30 seconds of latency, and frequently hitting context limits—especially for complex pages. Many projects stall at this first hurdle.
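The token figure follows from the common rule of thumb of roughly four characters per token; the exact ratio varies by tokenizer and model, so treat this as a back-of-the-envelope sketch rather than a precise count:

```python
# Rough token estimate for a raw HTML payload.
# Assumes ~4 characters per token, a common heuristic for English-heavy text;
# real tokenizers (BPE variants) will give somewhat different counts.
def estimate_tokens(num_bytes: int, chars_per_token: float = 4.0) -> int:
    return round(num_bytes / chars_per_token)

print(estimate_tokens(600 * 1024))  # a ~600 KB page -> about 153,600 tokens
```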

The Reality Check: 15 Models, Consistent Performance
Over a four-month period, an exhaustive evaluation was conducted across 15 different models, including GPT-4, GPT-4o, Gemini 1.5 Pro, Gemini Ultra, Claude 3 Opus, Claude 3.5 Sonnet, Mistral Large, Llama 3 70B, Cohere Command R+, and several smaller fine-tuned variants. The results fell into a predictable pattern:
- GPT-4 and Gemini Ultra delivered high accuracy but required 25–35 seconds per page.
- Claude 3.5 Sonnet offered the best accuracy-to-latency trade-off but still needed 5–10 seconds.
- Smaller models were faster but frequently hallucinated field names or produced inconsistent output.
No model solved the core latency problem because the fundamental approach—sending massive, unprocessed HTML—was flawed from the start.
The Breakthrough: Pre-Processing the DOM
The real bottleneck was not the model's reasoning capability but the sheer volume of input data. To address this, a DOM pre-processor was developed with the following steps:
- Strip all `<script>`, `<style>`, and tracking-pixel elements.
- Remove navigation, footer, and sidebar components.
- Collapse deeply nested wrappers that carry no semantic meaning.
- Apply SimHash to deduplicate structurally identical subtrees.
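The stripping and deduplication steps above can be sketched in a few lines. This is a minimal illustration, not the production pre-processor: it uses regexes where a real implementation would walk a parsed DOM, and it fingerprints a subtree by its tag sequence only.

```python
import hashlib
import re

def strip_noise(html: str) -> str:
    """Drop <script>/<style> blocks and HTML comments. A regex sketch only;
    a real pre-processor would operate on a parsed DOM tree."""
    html = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    return re.sub(r"<!--.*?-->", "", html, flags=re.S)

def simhash(tokens, bits: int = 64) -> int:
    """Classic SimHash: each token's hash votes on every output bit."""
    weights = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    """Bit distance between two fingerprints; small distance = near-duplicate."""
    return bin(a ^ b).count("1")

# Two product cards with different text but identical structure:
card_a = '<div class="card"><h3>Widget A</h3><span>$9</span></div>'
card_b = '<div class="card"><h3>Widget B</h3><span>$12</span></div>'
sig_a = simhash(re.findall(r"<(\w+)", card_a))
sig_b = simhash(re.findall(r"<(\w+)", card_b))
print(hamming(sig_a, sig_b))  # 0 -- identical tag sequences collapse together
```

Because structurally identical subtrees produce (near-)identical fingerprints, only one representative of each repeated pattern needs to reach the model.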
The result was a dramatic reduction from 580 KB to just 4.2 KB—a 99.3% decrease in input size. With a 4 KB input, every model became fast. More importantly, the reduced input made repeating structural patterns obvious: product cards, directory rows, and search results repeated 20, 50, or 100 times. This insight led to a fundamental shift in the architecture.
The Architecture Decision: Heuristics Before AI
Once the structural patterns were visible, it became clear that paying an LLM to detect those patterns was unnecessary. Instead, a heuristic detector was designed to:

- Identify elements with three or more structurally identical siblings.
- Score candidate lists based on depth, child count uniformity, and text density.
- Return ranked list candidates in under 0.2 milliseconds.
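The detector can be sketched as a tree walk that fingerprints each child by its shape and scores parents whose children repeat that shape. The `Node` type and the scoring formula here are illustrative assumptions; the real detector also weighs depth and text density, as noted above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""

def shape(node: Node) -> tuple:
    """Structural fingerprint: the node's tag plus its children's tags."""
    return (node.tag, tuple(c.tag for c in node.children))

def score_lists(root: Node, min_siblings: int = 3) -> List[Tuple[float, Node]]:
    """Rank parents whose children repeat one shape >= min_siblings times,
    scored by repeat count weighted by child-shape uniformity."""
    candidates = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        shapes = [shape(c) for c in node.children]
        if not shapes:
            continue
        best = max(set(shapes), key=shapes.count)
        repeats = shapes.count(best)
        if repeats >= min_siblings:
            uniformity = repeats / len(shapes)
            candidates.append((repeats * uniformity, node))
    return sorted(candidates, key=lambda c: -c[0])

# A toy page: one <ul> holding four structurally identical product cards.
card = lambda name: Node("li", [Node("h3", text=name), Node("span", text="$9")])
page = Node("body", [Node("ul", [card("A"), card("B"), card("C"), card("D")])])
ranked = score_lists(page)
print(ranked[0][1].tag)  # ul -- the repeating list wins
```

A single linear pass like this is why detection stays in the sub-millisecond range: no network call, no model inference, just tree traversal and counting.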
The LLM enters only after detection: not to identify the list, but to label fields and structure the output. This shrinks the model's input from roughly 150,000 tokens to approximately 200. The resulting performance is dramatic:
| Step | Approach | Latency |
|---|---|---|
| List detection | Heuristics | 0.2 ms |
| Field labeling | LLM (small input) | ~2 s |
| Total | | ~2 s |
Compare this to the naive LLM approach, which takes 25–35 seconds per page.
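To make the ~200-token figure concrete, here is a hypothetical labeling prompt built from a couple of sample records. The prompt wording and field schema are illustrative assumptions, not Clura's actual prompt, and no API call is made in this sketch:

```python
import json

def build_labeling_prompt(sample_records: list) -> str:
    """Assemble the small post-detection prompt: a couple of sample rows is
    enough context for the model to name each column."""
    sample = json.dumps(sample_records[:2], indent=2)
    return (
        "These records were extracted from repeating list items on a web page.\n"
        "Assign a short snake_case field name to each column and return JSON.\n\n"
        f"Sample records:\n{sample}\n"
    )

records = [
    {"col_0": "Widget A", "col_1": "$9.99"},
    {"col_0": "Widget B", "col_1": "$12.50"},
    {"col_0": "Widget C", "col_1": "$7.00"},
]
prompt = build_labeling_prompt(records)
print(prompt)  # a few dozen words -- on the order of 200 tokens, not 150,000
```

Because the model sees only representative rows rather than the whole page, the call stays fast and cheap regardless of how many records the list contains.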
What Was Actually Shipped
This architecture became the foundation for Clura, a heuristic-first AI web scraper Chrome extension. On any page, Clura automatically detects every list using the heuristic engine. Users simply pick the desired list and the fields to extract; all records are retrieved in seconds. There are no prompts to describe data, no training phase, and no long waits. The heuristic layer handles detection; AI handles labeling.
The Lesson: LLMs Excel at Meaning, Not Scanning HTML
Large language models are exceptional at understanding what something means. They are terrible at scanning 600 KB of HTML to find where something is. That is a structural pattern problem—and structural pattern problems are what algorithms are built for. By combining fast, cheap heuristics for pattern detection with small, targeted LLM calls for semantic labeling, you can achieve speeds and accuracy that neither method can reach alone.