We are working on PDF data extraction and running a POC with a few LLMs and tools such as Claude, Copilot, Power Automate, and Textract. Extraction from machine-printed PDFs is fairly trivial and accurate, but we are running into a challenge with handwritten PDFs because the image clarity is often very poor. Running OCR first and then parsing the output has not worked well. Does anyone have a better suggestion or solution for this?

5.9k views, 2 upvotes, 12 comments
Chief Information Officer, 21 hours ago

We got this challenge from one of our customers and solved it with a custom solution. If you share your use case or scenario, I am happy to see whether we can support you with any insights, because we've spent a good amount of time on this.

Manager, Data Science, 4 months ago

Take a screenshot and feed it to an LLM; they do a better job that way. Also try Google's Document AI. I had a good experience with it.

IT Manager, 4 months ago

We had a good experience using the Docling library (https://docling-project.github.io/docling/). It was able to extract accurate data even in low-quality scenarios.

Program Director, Intelligent Automation + Entrepreneur in Healthcare and Biotech, 4 months ago

We have found quite a bit of success using Microsoft AI Document Intelligence. It's worth checking out.

1 Reply
no title, 4 months ago

*With human in the loop for validation
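One way to picture the human-in-the-loop step is a confidence gate: fields the extractor is unsure about go to a reviewer, the rest pass through. The sketch below assumes the extraction service returns a per-field confidence score between 0 and 1; the `Field` class, the `split_for_review` function, and the 0.85 threshold are all hypothetical names and values chosen for illustration, not part of any product's API.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """One extracted value. All names here are illustrative assumptions."""
    name: str
    value: str
    confidence: float  # assumed to be in [0.0, 1.0], as reported by the extractor


def split_for_review(fields, threshold=0.85):
    """Separate fields into auto-accepted and needs-human-review lists.

    The threshold is an arbitrary starting point and would need tuning
    against a sample of manually checked documents.
    """
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Example with made-up values
fields = [
    Field("patient_name", "J. Smith", 0.97),
    Field("dose", "5 mg", 0.62),
]
accepted, needs_review = split_for_review(fields)
print("auto-accepted:", [f.name for f in accepted])
print("send to reviewer:", [f.name for f in needs_review])
```

Only the low-confidence fields reach a person, which keeps the review workload proportional to how bad the scan is.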

Team Leader, 4 months ago

Handwritten PDFs can be tricky, especially when the scans are low quality. A few things that help:

Clean up the images first (contrast, noise removal, deskewing)

Use OCR that’s designed for handwriting (like Azure Read or TrOCR)

Or even let a vision-capable LLM look at the image directly to extract info

For tricky parts, a quick human check can save a lot of headaches

This approach usually works much better than just running standard OCR.
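As a rough illustration of the image clean-up step, here is a minimal, dependency-free sketch of contrast stretching and noise removal on a grayscale page stored as a list of rows of 0-255 integers. The function names are made up for this example, deskewing is left out, and in practice an image library such as OpenCV or Pillow would be the usual choice; this only shows what the operations do.

```python
from statistics import median_low


def stretch_contrast(img):
    """Rescale pixels so the darkest value maps to 0 and the brightest to 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    # Integer arithmetic keeps the result deterministic.
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


def median_denoise(img):
    """3x3 median filter; pixels at the border use the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(median_low(window))
        out.append(row)
    return out


def binarize(img, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if p < threshold else 255 for p in row] for row in img]


def clean_scan(img):
    """Contrast stretch, then denoise, then binarize."""
    return binarize(median_denoise(stretch_contrast(img)))


# A faint grey page with a single dark speck in the corner
page = [[180] * 4 for _ in range(4)]
page[0][0] = 100
print(clean_scan(page))
```

One caveat: a 3x3 median filter also erases pen strokes that are only one pixel wide, so on thin handwriting a gentler filter or a higher scan resolution may be needed before this step.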

Gartner Peer Community