The Problem: Your AI Can't Read Technical Documents
You've tried GPT-4 Vision on your technical drawings. The results? Garbage.
Standard AI approaches fail on specialized documents because:
- No domain knowledge: The model doesn't understand switchboard layouts, tier structures, or component conventions
- No visual anchors: It can't distinguish what's important from what's noise
- Inconsistent outputs: Every response has a different format, breaking your downstream processing
- Missed sections: Critical data gets overlooked entirely
Our client faced exactly this. They needed to extract tier numbers, widths, ventilation status, and component values from electrical switchboard drawings.
First attempt with zero-shot prompting: 18.2% accuracy.
That's not a typo. The AI got it right less than 1 in 5 times. Production-ready? Not even close.
What if you could triple that accuracy without changing models or increasing costs?
What We Built: Few-Shot + Computer Vision Pipeline
We combined two techniques that individually help, but together create something powerful:
The Approach
| Technique | What It Does | Impact |
|---|---|---|
| CV Preprocessing | Highlights key areas before LLM sees the image | Focuses attention |
| Few-Shot Examples | Shows the model exactly what success looks like | Teaches patterns |
| Structured Output | Enforces JSON schema with Pydantic | Guarantees valid data |
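To make the flow concrete, here is a minimal sketch of how the three pieces chain together. The helper names (preprocess_with_cv, build_few_shot_messages, example_pairs) and the SwitchboardExtraction schema are placeholders that Steps 1-3 below flesh out; this is a sketch of the shape of the pipeline, not the exact client code.

# Minimal pipeline sketch; the placeholder helpers are detailed in Steps 1-3
from langchain_google_genai import ChatGoogleGenerativeAI

def extract_switchboard(image_path: str) -> "SwitchboardExtraction":
    # Step 1: CV preprocessing - highlight key regions before the LLM sees the image
    highlighted_url = preprocess_with_cv(image_path)

    # Step 2: few-shot prompt - system message + 4 worked examples + the new drawing
    messages = build_few_shot_messages(example_pairs, highlighted_url)

    # Step 3: structured output - schema-validated object, not free-form text
    llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro")
    structured_llm = llm.with_structured_output(SwitchboardExtraction)
    return structured_llm.invoke(messages)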
The Results
| Metric | Zero-Shot | Few-Shot (4 Examples) | Improvement |
|---|---|---|---|
| Exact Match Rate | 18.2% | 54.5% | +36.3 pts |
| Field-Level Accuracy | 78.7% | 92.6% | +13.9 pts |
| Tier Count Accuracy | 72.7% | 100.0% | +27.3 pts |
Tier count accuracy went from 72.7% to 100%. The zero-shot model fundamentally misunderstood the document structure; with examples, it got it right every single time.
Step 1: Computer Vision Preprocessing
Before the LLM ever sees your document, prepare it visually.
What We Did
We trained a lightweight detection model (Roboflow) to identify key components:
- MCMP plates: Yellow overlay + green border
- Metering units: Blue border outline
Why This Works
The LLM receives a "cheat sheet" image. Instead of scanning the entire complex drawing, it knows exactly where to focus.
Think of it like highlighting a textbook before an exam. The content is the same, but the important parts are marked.
The Code
# Run CV model on original image
detections = cv_model.predict(original_image)

# Create highlighted overlay
highlighted = original_image.copy()
for detection in detections:
    if detection.class_name == "mcmp_plate":
        draw_overlay(highlighted, detection.bbox, color="yellow", border="green")
    elif detection.class_name == "metering_unit":
        draw_border(highlighted, detection.bbox, color="blue")

# Now send highlighted image to LLM
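The snippet above leaves draw_overlay and draw_border abstract. Here is one possible way to implement them with OpenCV; this is our own sketch (the BGR color map and the corner-based bbox format are assumptions), not the client's exact code.

# Possible OpenCV implementations of the drawing helpers (assumed, not the client's exact code)
import cv2

COLORS = {"yellow": (0, 255, 255), "green": (0, 255, 0), "blue": (255, 0, 0)}  # BGR

def draw_overlay(image, bbox, color="yellow", border="green", alpha=0.3):
    # Fill the box with a translucent patch, then outline it
    x1, y1, x2, y2 = map(int, bbox)
    overlay = image.copy()
    cv2.rectangle(overlay, (x1, y1), (x2, y2), COLORS[color], -1)
    cv2.addWeighted(overlay, alpha, image, 1 - alpha, 0, image)
    cv2.rectangle(image, (x1, y1), (x2, y2), COLORS[border], 3)

def draw_border(image, bbox, color="blue"):
    # Outline the box without filling it
    x1, y1, x2, y2 = map(int, bbox)
    cv2.rectangle(image, (x1, y1), (x2, y2), COLORS[color], 3)

Note that Roboflow's hosted inference typically returns center-based box coordinates (x, y, width, height), so you may need to convert them to corner coordinates before drawing.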
Step 2: Few-Shot Examples That Actually Work
The difference between good and great few-shot prompting is example selection.
Bad Examples
- All similar complexity
- Same document type
- No edge cases
Good Examples (What We Used)
- Example 1: Simple layout, minimal components
- Example 2: Heavy ventilation, complex tier structure
- Example 3: Mixed components, unusual widths
- Example 4: Edge case with missing data
The Conversation Structure
import json

from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

messages = [
    # System prompt with domain expertise
    SystemMessage(content="""
        You are an expert at analyzing electrical switchboard drawings.
        Extract: tier_count, widths, ventilation, components.
        Output valid JSON matching the provided schema.
    """),

    # Example 1: Simple case
    HumanMessage(content=[
        {"type": "text", "text": "Analyze this switchboard drawing."},
        {"type": "image_url", "image_url": {"url": example_1_url}}
    ]),
    AIMessage(content=json.dumps(example_1_output)),

    # Example 2: Complex ventilation
    HumanMessage(content=[
        {"type": "image_url", "image_url": {"url": example_2_url}}
    ]),
    AIMessage(content=json.dumps(example_2_output)),

    # Example 3: Mixed components
    HumanMessage(content=[
        {"type": "image_url", "image_url": {"url": example_3_url}}
    ]),
    AIMessage(content=json.dumps(example_3_output)),

    # Example 4: Edge case
    HumanMessage(content=[
        {"type": "image_url", "image_url": {"url": example_4_url}}
    ]),
    AIMessage(content=json.dumps(example_4_output)),

    # Now the actual task
    HumanMessage(content=[
        {"type": "text", "text": "Analyze this new drawing."},
        {"type": "image_url", "image_url": {"url": actual_image_url}}
    ])
]
Key insight: 4 diverse examples beat 20 similar ones. Quality and coverage matter more than quantity.
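If you keep the examples as data instead of hard-coding them, the message list above can be built in a loop. A small sketch, assuming example_pairs is a list of (image_url, expected_output) tuples; that structure is our own convention, not a LangChain requirement.

import json

from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

SYSTEM_PROMPT = """You are an expert at analyzing electrical switchboard drawings.
Extract: tier_count, widths, ventilation, components.
Output valid JSON matching the provided schema."""

def build_few_shot_messages(example_pairs, actual_image_url):
    # example_pairs: list of (image_url, expected_output_dict), ordered simple -> edge case
    messages = [SystemMessage(content=SYSTEM_PROMPT)]
    for image_url, expected_output in example_pairs:
        messages.append(HumanMessage(content=[
            {"type": "text", "text": "Analyze this switchboard drawing."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]))
        messages.append(AIMessage(content=json.dumps(expected_output)))
    # The actual task comes last
    messages.append(HumanMessage(content=[
        {"type": "text", "text": "Analyze this new drawing."},
        {"type": "image_url", "image_url": {"url": actual_image_url}},
    ]))
    return messages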
Step 3: Enforce Structure with Pydantic
Even with great examples, LLMs can output malformed JSON. We use Pydantic to guarantee valid outputs.
The Schema
from pydantic import BaseModel
from typing import List, Optional

class ComponentSpec(BaseModel):
    type: str
    value: Optional[float]
    unit: str

class TierData(BaseModel):
    tier_number: int
    width_mm: int
    has_ventilation: bool
    components: List[ComponentSpec]

class SwitchboardExtraction(BaseModel):
    tier_count: int
    tiers: List[TierData]
    total_width_mm: int
    extraction_confidence: float
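To see what that guarantee looks like in practice, here is a quick check you can run against the schema; the sample payload is made up for illustration.

from pydantic import ValidationError

good = {
    "tier_count": 1,
    "tiers": [{"tier_number": 1, "width_mm": 600, "has_ventilation": True, "components": []}],
    "total_width_mm": 600,
    "extraction_confidence": 0.93,
}
SwitchboardExtraction.model_validate(good)  # passes

bad = dict(good, tier_count="two")  # wrong type for tier_count
try:
    SwitchboardExtraction.model_validate(bad)
except ValidationError as e:
    print(e)  # pinpoints the offending field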
Structured Output with LangChain
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro")
structured_llm = llm.with_structured_output(SwitchboardExtraction)
result = structured_llm.invoke(messages)
# result is guaranteed to be a valid SwitchboardExtraction object
No more JSON parsing errors. No more missing fields. No more type mismatches.
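Because result is a typed object rather than a raw string, downstream code can rely on attribute access and serialize it cleanly. For example (field names from the schema above):

# Typed access: no .get() chains or KeyError guards
print(result.tier_count, result.total_width_mm)
for tier in result.tiers:
    print(tier.tier_number, tier.width_mm, tier.has_ventilation)

# Serialize for storage or an API response (Pydantic v2)
payload = result.model_dump_json()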
Why This Works: The Psychology of AI Learning
1. Pattern Recognition Over Instruction Following
Telling an LLM "extract tier numbers" is vague. Showing it 4 examples of tier extraction teaches the pattern implicitly.
Analogy: Teaching a child to ride a bike by showing them, not by describing the physics of balance.
2. Visual Priming Reduces Cognitive Load
The CV preprocessing acts like selective attention. The model doesn't waste capacity parsing irrelevant diagram elements.
Analogy: A highlighted textbook vs. a wall of unmarked text.
3. Structured Output Eliminates Variability
Without schema enforcement, every response is a surprise. With Pydantic, you know exactly what you're getting.
Analogy: A tax form vs. a blank page that says "describe your income."
What You Can Apply Today
For Any Document Extraction Project
1. Don't start with the LLM. Use CV to preprocess and highlight key regions first.
2. Build a diverse example set. Cover edge cases, not just happy paths. 4-6 high-quality examples beat 20 mediocre ones.
3. Enforce structure. Use Pydantic, Zod, or JSON Schema. Never trust free-form LLM output in production.
4. Measure properly. Track exact match rate AND field-level accuracy. They tell different stories (see the sketch after this list).
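Here is a minimal sketch of those two metrics, assuming predictions and ground truth are flat dicts of field -> value (adapt as needed for nested schemas like the one above):

def exact_match_rate(preds, targets):
    # Fraction of documents where every field matches the ground truth
    hits = sum(1 for p, t in zip(preds, targets) if p == t)
    return hits / len(targets)

def field_level_accuracy(preds, targets):
    # Fraction of individual fields that match, pooled across documents
    correct = total = 0
    for p, t in zip(preds, targets):
        for field, expected in t.items():
            total += 1
            correct += (p.get(field) == expected)
    return correct / total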
When to Use This Approach
| Use Case | Expected Improvement |
|---|---|
| Technical drawings | 30-40% accuracy boost |
| Medical forms | 20-35% accuracy boost |
| Financial documents | 25-40% accuracy boost |
| Handwritten forms | 15-30% accuracy boost |
The Technical Stack
| Component | Tool | Why |
|---|---|---|
| LLM | Google Gemini 2.5 Pro | Best multimodal performance for technical docs |
| CV Model | Roboflow (custom trained) | Fast inference, easy annotation |
| Framework | LangChain | Structured output support |
| Validation | Pydantic v2 | Strict mode, great error messages |
| Deployment | FastAPI + Modal | Serverless, auto-scaling |
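For reference, a stripped-down version of the serving layer might look like the FastAPI endpoint below. This is a sketch: the Modal wrapping is omitted, and extract_switchboard is the hypothetical pipeline entry point sketched earlier, not production code.

import tempfile

from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/extract")
async def extract(file: UploadFile):
    # Persist the upload, run the CV + few-shot + structured-output pipeline,
    # and return the schema-validated result as JSON.
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        tmp.write(await file.read())
        image_path = tmp.name
    result = extract_switchboard(image_path)  # hypothetical entry point from the sketch above
    return result.model_dump()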
Results Summary
| Before | After | Impact |
|---|---|---|
| 18.2% exact match | 54.5% exact match | 3x improvement |
| Inconsistent JSON | Guaranteed schema | Zero parsing errors |
| Manual review required | Automated pipeline | Hours saved daily |
| Prototype quality | Production ready | Deployed to client |
This approach transformed an experimental failure into a production system processing hundreds of documents daily.

