The Problem: Your AI Can't Read Technical Documents
You've tried GPT-4 Vision on your technical drawings. The results? Garbage.
Standard AI approaches fail on specialized documents because:
- No domain knowledge: The model doesn't understand switchboard layouts, tier structures, or component conventions
- No visual anchors: It can't distinguish what's important from what's noise
- Inconsistent outputs: Every response has a different format, breaking your downstream processing
- Missed sections: Critical data gets overlooked entirely
Our client faced exactly this. They needed to extract tier numbers, widths, ventilation status, and component values from electrical switchboard drawings.
First attempt with zero-shot prompting: 18.2% accuracy.
That's not a typo. The AI got it right less than 1 in 5 times. Production-ready? Not even close.
What if you could triple that accuracy without changing models or increasing costs?
What We Built: Few-Shot + Computer Vision Pipeline
We combined two techniques that individually help, but together create something powerful:
The Approach
| Technique | What It Does | Impact |
|---|---|---|
| CV Preprocessing | Highlights key areas before LLM sees the image | Focuses attention |
| Few-Shot Examples | Shows the model exactly what success looks like | Teaches patterns |
| Structured Output | Enforces JSON schema with Pydantic | Guarantees valid data |
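To make the flow concrete, here is a minimal sketch of how the three pieces chain together. The helper names (preprocess_with_cv, build_few_shot_messages, example_pairs) and the SwitchboardExtraction schema are placeholders that Steps 1-3 below flesh out; this is a sketch of the shape of the pipeline, not the exact client code.

# Minimal pipeline sketch; the placeholder helpers are detailed in Steps 1-3
from langchain_google_genai import ChatGoogleGenerativeAI

def extract_switchboard(image_path: str) -> "SwitchboardExtraction":
    # Step 1: CV preprocessing - highlight key regions before the LLM sees the image
    highlighted_url = preprocess_with_cv(image_path)

    # Step 2: few-shot prompt - system message + 4 worked examples + the new drawing
    messages = build_few_shot_messages(example_pairs, highlighted_url)

    # Step 3: structured output - schema-validated object, not free-form text
    llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro")
    structured_llm = llm.with_structured_output(SwitchboardExtraction)
    return structured_llm.invoke(messages)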
The Results
| Metric | Zero-Shot | Few-Shot (4 Examples) | Improvement |
|---|---|---|---|
| Exact Match Rate | 18.2% | 54.5% | +36.3 pts |
| Field-Level Accuracy | 78.7% | 92.6% | +13.9 pts |
| Tier Count Accuracy | 72.7% | 100.0% | +27.3 pts |
Tier count accuracy went from 72.7% to 100%. The zero-shot model fundamentally misunderstood the document structure; with examples, it got it right every single time.
Step 1: Computer Vision Preprocessing
Before the LLM ever sees your document, prepare it visually.
What We Did
We trained a lightweight detection model (Roboflow) to identify key components:
- MCMP plates: Yellow overlay + green border
- Metering units: Blue border outline
Why This Works
The LLM receives a "cheat sheet" image. Instead of scanning the entire complex drawing, it knows exactly where to focus.
Think of it like highlighting a textbook before an exam. The content is the same, but the important parts are marked.
The Code
# Run CV model on original image
detections = cv_model.predict(original_image)

# Create highlighted overlay
highlighted = original_image.copy()
for detection in detections:
    if detection.class_name == "mcmp_plate":
        draw_overlay(highlighted, detection.bbox, color="yellow", border="green")
    elif detection.class_name == "metering_unit":
        draw_border(highlighted, detection.bbox, color="blue")

# Now send highlighted image to LLM
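The snippet above leaves draw_overlay and draw_border abstract. Here is one possible way to implement them with OpenCV; this is our own sketch (the BGR color map and the corner-based bbox format are assumptions), not the client's exact code.

# Possible OpenCV implementations of the drawing helpers (assumed, not the client's exact code)
import cv2

COLORS = {"yellow": (0, 255, 255), "green": (0, 255, 0), "blue": (255, 0, 0)}  # BGR

def draw_overlay(image, bbox, color="yellow", border="green", alpha=0.3):
    # Fill the box with a translucent patch, then outline it
    x1, y1, x2, y2 = map(int, bbox)
    overlay = image.copy()
    cv2.rectangle(overlay, (x1, y1), (x2, y2), COLORS[color], -1)
    cv2.addWeighted(overlay, alpha, image, 1 - alpha, 0, image)
    cv2.rectangle(image, (x1, y1), (x2, y2), COLORS[border], 3)

def draw_border(image, bbox, color="blue"):
    # Outline the box without filling it
    x1, y1, x2, y2 = map(int, bbox)
    cv2.rectangle(image, (x1, y1), (x2, y2), COLORS[color], 3)

Note that Roboflow's hosted inference typically returns center-based box coordinates (x, y, width, height), so you may need to convert them to corner coordinates before drawing.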
Step 2: Few-Shot Examples That Actually Work
The difference between good and great few-shot prompting is example selection.
Bad Examples
- All similar complexity
- Same document type
- No edge cases
Good Examples (What We Used)
- Example 1: Simple layout, minimal components
- Example 2: Heavy ventilation, complex tier structure
- Example 3: Mixed components, unusual widths
- Example 4: Edge case with missing data
The Conversation Structure
import json

from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

messages = [
    # System prompt with domain expertise
    SystemMessage(content="""
        You are an expert at analyzing electrical switchboard drawings.
        Extract: tier_count, widths, ventilation, components.
        Output valid JSON matching the provided schema.
    """),

    # Example 1: Simple case
    HumanMessage(content=[
        {"type": "text", "text": "Analyze this switchboard drawing."},
        {"type": "image_url", "image_url": {"url": example_1_url}}
    ]),
    AIMessage(content=json.dumps(example_1_output)),

    # Example 2: Complex ventilation
    HumanMessage(content=[
        {"type": "image_url", "image_url": {"url": example_2_url}}
    ]),
    AIMessage(content=json.dumps(example_2_output)),

    # Example 3: Mixed components
    HumanMessage(content=[
        {"type": "image_url", "image_url": {"url": example_3_url}}
    ]),
    AIMessage(content=json.dumps(example_3_output)),

    # Example 4: Edge case
    HumanMessage(content=[
        {"type": "image_url", "image_url": {"url": example_4_url}}
    ]),
    AIMessage(content=json.dumps(example_4_output)),

    # Now the actual task
    HumanMessage(content=[
        {"type": "text", "text": "Analyze this new drawing."},
        {"type": "image_url", "image_url": {"url": actual_image_url}}
    ])
]
Key insight: 4 diverse examples beat 20 similar ones. Quality and coverage matter more than quantity.
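If you keep the examples as data instead of hard-coding them, the message list above can be built in a loop. A small sketch, assuming example_pairs is a list of (image_url, expected_output) tuples; that structure is our own convention, not a LangChain requirement.

import json

from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

SYSTEM_PROMPT = """You are an expert at analyzing electrical switchboard drawings.
Extract: tier_count, widths, ventilation, components.
Output valid JSON matching the provided schema."""

def build_few_shot_messages(example_pairs, actual_image_url):
    # example_pairs: list of (image_url, expected_output_dict), ordered simple -> edge case
    messages = [SystemMessage(content=SYSTEM_PROMPT)]
    for image_url, expected_output in example_pairs:
        messages.append(HumanMessage(content=[
            {"type": "text", "text": "Analyze this switchboard drawing."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]))
        messages.append(AIMessage(content=json.dumps(expected_output)))
    # The actual task comes last
    messages.append(HumanMessage(content=[
        {"type": "text", "text": "Analyze this new drawing."},
        {"type": "image_url", "image_url": {"url": actual_image_url}},
    ]))
    return messages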
Step 3: Enforce Structure with Pydantic
Even with great examples, LLMs can output malformed JSON. We use Pydantic to guarantee valid outputs.
The Schema
from pydantic import BaseModel
from typing import List, Optional

class ComponentSpec(BaseModel):
    type: str
    value: Optional[float]
    unit: str

class TierData(BaseModel):
    tier_number: int
    width_mm: int
    has_ventilation: bool
    components: List[ComponentSpec]

class SwitchboardExtraction(BaseModel):
    tier_count: int
    tiers: List[TierData]
    total_width_mm: int
    extraction_confidence: float
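To see what that guarantee looks like in practice, here is a quick check you can run against the schema; the sample payload is made up for illustration.

from pydantic import ValidationError

good = {
    "tier_count": 1,
    "tiers": [{"tier_number": 1, "width_mm": 600, "has_ventilation": True, "components": []}],
    "total_width_mm": 600,
    "extraction_confidence": 0.93,
}
SwitchboardExtraction.model_validate(good)  # passes

bad = dict(good, tier_count="two")  # wrong type for tier_count
try:
    SwitchboardExtraction.model_validate(bad)
except ValidationError as e:
    print(e)  # pinpoints the offending field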
Structured Output with LangChain
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro")
structured_llm = llm.with_structured_output(SwitchboardExtraction)
result = structured_llm.invoke(messages)
# result is guaranteed to be a valid SwitchboardExtraction object
No more JSON parsing errors. No more missing fields. No more type mismatches.
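Because result is a typed object rather than a raw string, downstream code can rely on attribute access and serialize it cleanly. For example (field names from the schema above):

# Typed access: no .get() chains or KeyError guards
print(result.tier_count, result.total_width_mm)
for tier in result.tiers:
    print(tier.tier_number, tier.width_mm, tier.has_ventilation)

# Serialize for storage or an API response (Pydantic v2)
payload = result.model_dump_json()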
Why This Works: The Psychology of AI Learning
1. Pattern Recognition Over Instruction Following
Telling an LLM "extract tier numbers" is vague. Showing it 4 examples of tier extraction teaches the pattern implicitly.
Analogy: Teaching a child to ride a bike by showing them, not by describing the physics of balance.
2. Visual Priming Reduces Cognitive Load
The CV preprocessing acts like selective attention. The model doesn't waste capacity parsing irrelevant diagram elements.
Analogy: A highlighted textbook vs. a wall of unmarked text.
3. Structured Output Eliminates Variability
Without schema enforcement, every response is a surprise. With Pydantic, you know exactly what you're getting.
Analogy: A tax form vs. a blank page that says "describe your income."
What You Can Apply Today
For Any Document Extraction Project
1. Don't start with the LLM. Use CV to preprocess and highlight key regions first.
2. Build a diverse example set. Cover edge cases, not just happy paths. 4-6 high-quality examples beat 20 mediocre ones.
3. Enforce structure. Use Pydantic, Zod, or JSON Schema. Never trust free-form LLM output in production.
4. Measure properly. Track exact match rate AND field-level accuracy. They tell different stories (see the sketch after this list).
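Here is a minimal sketch of those two metrics, assuming predictions and ground truth are flat dicts of field -> value (adapt as needed for nested schemas like the one above):

def exact_match_rate(preds, targets):
    # Fraction of documents where every field matches the ground truth
    hits = sum(1 for p, t in zip(preds, targets) if p == t)
    return hits / len(targets)

def field_level_accuracy(preds, targets):
    # Fraction of individual fields that match, pooled across documents
    correct = total = 0
    for p, t in zip(preds, targets):
        for field, expected in t.items():
            total += 1
            correct += (p.get(field) == expected)
    return correct / total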
When to Use This Approach
| Use Case | Expected Improvement |
|---|---|
| Technical drawings | 30-40% accuracy boost |
| Medical forms | 20-35% accuracy boost |
| Financial documents | 25-40% accuracy boost |
| Handwritten forms | 15-30% accuracy boost |
The Technical Stack
| Component | Tool | Why |
|---|---|---|
| LLM | Google Gemini 2.5 Pro | Best multimodal performance for technical docs |
| CV Model | Roboflow (custom trained) | Fast inference, easy annotation |
| Framework | LangChain | Structured output support |
| Validation | Pydantic v2 | Strict mode, great error messages |
| Deployment | FastAPI + Modal | Serverless, auto-scaling |
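For reference, a stripped-down version of the serving layer might look like the FastAPI endpoint below. This is a sketch: the Modal wrapping is omitted, and extract_switchboard is the hypothetical pipeline entry point sketched earlier, not production code.

import tempfile

from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/extract")
async def extract(file: UploadFile):
    # Persist the upload, run the CV + few-shot + structured-output pipeline,
    # and return the schema-validated result as JSON.
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        tmp.write(await file.read())
        image_path = tmp.name
    result = extract_switchboard(image_path)  # hypothetical entry point from the sketch above
    return result.model_dump()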
Results Summary
| Before | After | Impact |
|---|---|---|
| 18.2% exact match | 54.5% exact match | 3x improvement |
| Inconsistent JSON | Guaranteed schema | Zero parsing errors |
| Manual review required | Automated pipeline | Hours saved daily |
| Prototype quality | Production ready | Deployed to client |
This approach transformed an experimental failure into a production system processing hundreds of documents daily.

