We are working on PDF data extraction and doing a POC with a few tools like Claude, Copilot, Power Automate, Textract, etc. Machine-printed PDFs are fairly trivial and the LLM extraction is accurate, but we are running into a challenge with handwritten PDFs, where the image clarity is often very poor. Even OCRing them first and then parsing the output has not worked well. Does anyone have a better suggestion or solution for this?
Take a screenshot and feed it to the LLMs; they do a better job that way. Also try Google's Document AI, I had a good experience with it.
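If you go the Document AI route, a minimal sketch of sending a scanned PDF to it looks roughly like this. It assumes you already have a GCP project with a Document OCR processor set up; the project ID, location, processor ID, and file name below are all placeholders.

```python
# Sketch: OCR a scanned/handwritten PDF with Google Document AI
from google.cloud import documentai

PROJECT_ID = "my-project"   # placeholder
LOCATION = "us"             # processor region
PROCESSOR_ID = "abc123"     # placeholder processor ID

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

with open("handwritten_form.pdf", "rb") as f:
    raw_document = documentai.RawDocument(
        content=f.read(), mime_type="application/pdf"
    )

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)
print(result.document.text)  # full extracted text, including handwriting
```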
We had a good experience using the Docling library (https://docling-project.github.io/docling/). It was able to extract accurate data even in low-quality scenarios.
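For reference, the basic Docling usage is just a couple of lines; this is a minimal sketch (the file name is a placeholder), and its default pipeline applies OCR to scanned pages before exporting structured output you can feed to an LLM.

```python
# Sketch: convert a scanned PDF with Docling and export to Markdown
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("handwritten_scan.pdf")

print(result.document.export_to_markdown())
```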
We have found quite a bit of success using Microsoft AI Document Intelligence. It's worth checking out.
*With human in the loop for validation
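A rough sketch of calling Document Intelligence from Python, using the azure-ai-formrecognizer SDK (the same service is also exposed through the newer azure-ai-documentintelligence package); the endpoint, key, and file name are placeholders, and the prebuilt read model handles handwriting.

```python
# Sketch: read handwritten text with Azure AI Document Intelligence
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                      # placeholder
)

with open("handwritten_scan.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-read", document=f)
result = poller.result()

for page in result.pages:
    for line in page.lines:
        print(line.content)
```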
Handwritten PDFs can be tricky, especially when the scans are low quality.
- Clean up the images first (contrast, noise removal, deskewing)
- Use OCR that's designed for handwriting (like Azure Read or TrOCR)
- Or even let a vision-capable LLM look at the image directly to extract info
- For tricky parts, a quick human check can save a lot of headaches

This approach usually works much better than just running standard OCR; there's a rough sketch of the first two steps after this list.
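Here is a minimal sketch of the cleanup + handwriting-OCR steps, assuming OpenCV for preprocessing and the Hugging Face TrOCR handwritten checkpoint for recognition. It works on a single pre-cropped line image ("line_crop.png" is a placeholder); full pages need line segmentation first, and deskewing is omitted here.

```python
# Sketch: denoise/binarize a scanned line image, then run TrOCR on it
import cv2
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# 1. Clean up the scan: grayscale, denoise, then adaptive binarization
img = cv2.imread("line_crop.png", cv2.IMREAD_GRAYSCALE)
img = cv2.fastNlMeansDenoising(img, h=15)
img = cv2.adaptiveThreshold(
    img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
)

# 2. Handwriting OCR with TrOCR (expects an RGB PIL image)
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

pixel_values = processor(
    images=Image.fromarray(img).convert("RGB"), return_tensors="pt"
).pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```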

So we got this challenge from one of our customers and solved it with a custom solution. If you share your use case/scenario, I am happy to see whether we can support you with any insights, because we've spent a good amount of time on this.