Extracting Unstructured Data from Thousands of PDFs Using Automation and OCR


A U.S.-based company that markets and underwrites specialty insurance products and programs for a variety of niche markets required a solution to extract unstructured data from thousands of policies in various file formats, such as PDF and Word documents.

Client Challenges and Requirements

  • Information had to be read and extracted manually from various file formats such as PDF, Excel, email, and images.
  • Identify whether each PDF is a scanned document with unstructured data or a digital PDF, and apply the appropriate extraction method.
  • Upload the extracted data to the data system in a usable format.
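
The second requirement, telling scanned PDFs apart from digital ones, can be approximated by checking how much text each page's text layer yields. The following is a minimal Python sketch under stated assumptions: the text layer has already been extracted page by page with some PDF library, and the function name, the `min_chars_per_page` threshold, and the "half of the pages" rule are illustrative choices, not details taken from the project.

```python
from typing import List


def classify_pdf(page_texts: List[str], min_chars_per_page: int = 50) -> str:
    """Return "digital" or "scanned" based on text-layer content per page.

    page_texts holds the text-layer output for each page, as produced by
    whatever PDF library is in use (an assumption of this sketch).
    """
    if not page_texts:
        return "scanned"  # nothing extractable: treat as needing OCR

    # Count pages whose text layer holds a meaningful amount of text.
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )

    # Hypothetical rule: at least half of the pages must carry real text.
    if pages_with_text * 2 >= len(page_texts):
        return "digital"
    return "scanned"
```

A document classified as "scanned" would then go through OCR, while a "digital" one can have its text layer read directly.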

Bitwise Solution

Bitwise delivered an end-to-end solution that addressed key pain areas and showed value quickly. The solution covered three phases:

  • Strategy and Assessment – identify and prioritize file types and pain areas
  • Solution Development – develop the best extraction option using Bitwise reusable modular utilities and third-party tools to maximize automation, and configure scripts to extract the data
  • Validation – ensure accuracy on highly critical files and provide a search feature for looking up the original document

Reusable ‘modular’ utilities used:

  • Email extraction
  • Reading the contents of a PDF to identify whether it is digital or scanned (requiring OCR)
  • Routing utility to direct each document to automated or manual extraction
  • Script to auto-extract the identified data points
  • Script that pushes JSON, CSV, or another preferred file type to the data system
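
To illustrate how the routing and export utilities listed above might fit together, here is a minimal Python sketch using only the standard library. Everything named in it is hypothetical: the field names in `REQUIRED_FIELDS`, the `ocr_confidence` score, and the 0.9 threshold are assumptions made for illustration, not the actual Bitwise utilities.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical data points that must be present for automated processing.
REQUIRED_FIELDS = ["policy_number", "insured_name", "effective_date"]


def route(record: Dict[str, str], ocr_confidence: float, threshold: float = 0.9) -> str:
    """Send a record to "auto" or "manual" handling.

    A record goes to manual review when a required field is missing or
    empty, or when the (assumed) OCR confidence score is below threshold.
    """
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing or ocr_confidence < threshold:
        return "manual"
    return "auto"


def to_json(records: List[Dict[str, str]]) -> str:
    """Serialize extracted records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records: List[Dict[str, str]]) -> str:
    """Serialize extracted records as CSV, one column per required field."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys outside REQUIRED_FIELDS.
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED_FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Records routed to "auto" could be serialized with `to_json` or `to_csv` and pushed to the data system, while "manual" records would wait for a reviewer. For digital PDFs, where no OCR is involved, a caller could simply pass a confidence of 1.0.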

Tools & Technologies We Used

Open-source tools:

  • Tesseract for OCR of scanned PDFs
  • iText for digital PDFs

Key Results

Reduced data entry work by over 60%, resulting in more efficient use of resources

Ability to achieve 100% accuracy on highly critical files

Modular application allows for easy re-use
