From Rules to Reasoning: Building a B2B Document Extractor with OCR and LLMs
Overview
Extracting structured data from B2B documents like purchase orders, invoices, or delivery notes is a common challenge in enterprise automation. Traditionally, rule-based systems using OCR (Optical Character Recognition) have been the go‑to solution. However, with the rise of large language models (LLMs), a new approach emerges: using a local LLM via Ollama to parse document content with natural language understanding. This tutorial rebuilds the same B2B document extractor twice—once with pytesseract and rule-based parsing, and once with an LLM (LLaMA 3) hosted on Ollama—using a realistic order scenario. You’ll learn the strengths, weaknesses, and practical implementation of both methods.

Prerequisites
Before diving in, ensure your environment is ready:
- Python 3.9+ installed
- Tesseract OCR engine (installation guide)
- Ollama (download) and the LLaMA 3 model pulled (ollama pull llama3)
- Python packages: pytesseract, Pillow, opencv-python, pandas, requests (for the Ollama API)
Install dependencies with:
pip install pytesseract pillow opencv-python pandas requests
Step-by-Step Instructions
1. Sample Document Preparation
Create a sample document image (or use a real scanned order) containing typical fields: Purchase Order Number, Vendor Name, Item Description, Quantity, Unit Price, and Total Amount. For demonstration, we’ll use a clean image of an order form. Save it as order.png.
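If you don’t have a scanned order handy, you can synthesize a minimal one with Pillow. This is a sketch with made-up field values, just enough for the parsers below to have something to chew on:

```python
from PIL import Image, ImageDraw

# Hypothetical sample order: render labeled fields onto a blank white canvas.
lines = [
    "PO#: PO12345",
    "Vendor: Acme Industrial Supply",
    "Item: Steel Bracket Qty: 40 Price: 3.25",
    "Item: Hex Bolt M8 Qty: 500 Price: 0.12",
]

img = Image.new("RGB", (800, 300), "white")
draw = ImageDraw.Draw(img)
for i, line in enumerate(lines):
    draw.text((40, 40 + 50 * i), line, fill="black")  # uses PIL's default font

img.save("order.png")
```

A real scan will be noisier than this, so treat it as a smoke-test input rather than a benchmark.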
2. Rule-Based Extraction with pytesseract
This approach relies on precise template matching and regex patterns. We’ll break it into steps:
2.1 OCR the image
import pytesseract
from PIL import Image
image = Image.open('order.png')
text = pytesseract.image_to_string(image)
print(text)
2.2 Parse key fields using regular expressions
Assume the document has known labels like “PO#:”, “Vendor:”, etc.
import re

def extract_rule_based(text):
    data = {}
    # PO number
    po_match = re.search(r'PO#:\s*(\w+)', text)
    if po_match:
        data['po_number'] = po_match.group(1)
    # Vendor
    vendor_match = re.search(r'Vendor:\s*(.+)', text)
    if vendor_match:
        data['vendor'] = vendor_match.group(1).strip()
    # Items: assume a table with lines like "Item: ... Qty: ... Price: ..."
    item_pattern = r'Item:\s*(.+?)\s*Qty:\s*(\d+)\s*Price:\s*(\d+\.?\d*)'
    items = re.findall(item_pattern, text)
    data['items'] = [{'desc': i[0], 'qty': int(i[1]), 'price': float(i[2])} for i in items]
    return data

result_rules = extract_rule_based(text)
print(result_rules)
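Before wiring this up to real OCR output, it is worth checking the regexes against a hand-written sample. The snippet below reproduces the parser so it runs standalone; the field values are invented for illustration:

```python
import re

def extract_rule_based(text):
    # Same rule-based parser as in the tutorial, reproduced for a standalone check.
    data = {}
    po_match = re.search(r'PO#:\s*(\w+)', text)
    if po_match:
        data['po_number'] = po_match.group(1)
    vendor_match = re.search(r'Vendor:\s*(.+)', text)
    if vendor_match:
        data['vendor'] = vendor_match.group(1).strip()
    item_pattern = r'Item:\s*(.+?)\s*Qty:\s*(\d+)\s*Price:\s*(\d+\.?\d*)'
    items = re.findall(item_pattern, text)
    data['items'] = [{'desc': i[0], 'qty': int(i[1]), 'price': float(i[2])} for i in items]
    return data

sample = """PO#: PO12345
Vendor: Acme Industrial Supply
Item: Steel Bracket Qty: 40 Price: 3.25
Item: Hex Bolt M8 Qty: 500 Price: 0.12
"""
result = extract_rule_based(sample)
print(result)
```

If any label drifts even slightly (say, "P.O. Number:" instead of "PO#:"), the corresponding field silently disappears, which is exactly the brittleness discussed later.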
2.3 Handle alignment issues
Use OpenCV to deskew the image before OCR if needed.
3. LLM-Based Extraction with Ollama & LLaMA 3
Instead of rigid patterns, we let the LLM interpret the raw text. The local LLM runs via Ollama’s API.
3.1 Send OCR text to Ollama
import requests
import json

def extract_with_llm(raw_text):
    prompt = f"""Extract the following fields from this order document and return them in JSON format with keys: po_number, vendor, items (list of objects with desc, qty, price). Document text:
{raw_text}
"""
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'llama3',
            'prompt': prompt,
            'stream': False,
        },
    )
    response.raise_for_status()
    result = response.json()['response']
    # The model may wrap the JSON in prose; pull out the outermost {...} block
    try:
        start = result.index('{')
        end = result.rindex('}') + 1
        return json.loads(result[start:end])
    except ValueError:  # covers both a missing brace and malformed JSON
        return {'error': 'Failed to parse LLM output', 'raw': result}
3.2 Process the image
# Same OCR step as before
raw_text = pytesseract.image_to_string(image)
result_llm = extract_with_llm(raw_text)
print(result_llm)
4. Testing Both on a Realistic B2B Order
Use a document with varying layouts (e.g., handwritten numbers, rotated text). The rule-based method struggles if labels move; the LLM is more resilient but may hallucinate.
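One way to make this comparison concrete is a small field-level scorer against hand-labeled ground truth. The values below are illustrative, including a typical OCR confusion of "I" and "l":

```python
def field_accuracy(predicted: dict, truth: dict) -> float:
    """Fraction of ground-truth fields the extractor got exactly right."""
    if not truth:
        return 0.0
    correct = sum(1 for key, value in truth.items() if predicted.get(key) == value)
    return correct / len(truth)

truth = {'po_number': 'PO12345', 'vendor': 'Acme Industrial Supply'}
rules_out = {'po_number': 'PO12345', 'vendor': 'Acme lndustrial Supply'}  # OCR read 'I' as 'l'
llm_out = {'po_number': 'PO12345', 'vendor': 'Acme Industrial Supply'}

print(field_accuracy(rules_out, truth))  # 0.5
print(field_accuracy(llm_out, truth))    # 1.0
```

Exact-match scoring is deliberately strict; for production evaluation you would likely add normalization (case, whitespace) and per-item scoring for the line-item list.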

Common Mistakes & Pitfalls
- Ignoring OCR quality: Garbage in, garbage out. Both methods require clean OCR. Preprocess images (binarization, contrast adjustment).
- Over‑fitting regex patterns: Rule-based parsers break with tiny layout changes. Always test on multiple document variants.
- Prompt engineering failures: LLMs need explicit instructions. Forgetting to restrict output format can yield verbose or malformed responses.
- Not handling edge cases: Missing fields, multiple tables, or non‑English text. Rule systems need fallbacks; LLMs may need retry logic.
- Performance overhead: LLM inference is slower and more resource‑intensive than regex. For high‑volume processing, rules are faster.
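Two of these pitfalls, malformed output and missing retry logic, can be addressed together. Ollama's generate endpoint supports a format: 'json' option that constrains the model to emit valid JSON, and a bounded retry loop catches the remaining failures. The post parameter below is injectable purely so the sketch can be exercised without a live server; everything else follows the API call from section 3:

```python
import json
import requests

def extract_with_retries(raw_text, retries=3, post=requests.post):
    """Ask the model for JSON-constrained output; retry on malformed responses."""
    payload = {
        'model': 'llama3',
        'prompt': f"Extract po_number, vendor, and items as JSON.\nDocument:\n{raw_text}",
        'format': 'json',   # ask Ollama to constrain output to valid JSON
        'stream': False,
    }
    last = None
    for _ in range(retries):
        last = post('http://localhost:11434/api/generate', json=payload).json()['response']
        try:
            return json.loads(last)
        except json.JSONDecodeError:
            continue  # malformed output: ask again
    return {'error': 'Failed after retries', 'raw': last}
```

Retries are not free: each attempt costs a full inference pass, so keep the cap low and log failures for prompt tuning.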
Summary
Both rule‑based and LLM‑based document extractors have their place. The rule‑based approach is deterministic, fast, and cheap, but brittle. The LLM approach offers flexibility and genuine language understanding, but requires careful prompt design, local model management, and more compute. For B2B documents with stable layouts, stick with rules. For variable or messy documents, an LLM can save countless hours of manual rule maintenance. This tutorial gave you a working foundation for both—choose wisely based on your use case.