Extracting Unstructured Data from 1000s of PDFs using Automation and OCR
A U.S.-based company that markets and underwrites specialty insurance products and programs for a variety of niche markets needed a solution to extract unstructured data from thousands of policies in various file formats, such as PDF and Word documents.
Client Challenges and Requirements
- Manual effort to read and extract information from various file formats such as PDF, Excel, email, image, etc.
- Identify whether each document is a scanned PDF with unstructured data or a digital PDF, and apply the appropriate extraction method.
- Upload the extracted data to the data system in a usable format.
Bitwise delivered an end-to-end solution to address key pain areas and show value quickly. The solution covered three phases:
- Strategy and Assessment – identify and prioritize file types and pain areas
- Solution Development – develop the best extraction option using Bitwise reusable modular utilities and third-party tools to provide the maximum level of automation, and configure scripts to extract the data
- Validation – ensure accuracy on highly critical files and provide a search feature for looking up the original document
Reusable ‘modular’ utilities used:
- Email extraction
- Reading the contents of a PDF to identify whether it is digital or scanned (requiring OCR)
- Routing utility to direct each document to automated or manual extraction
- Script to automatically extract identified data points
- Script that pushes JSON, CSV or other preferred file type to data system
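The utilities above can be pictured as a small pipeline. The following is a minimal Python sketch, not the Bitwise implementation: the character threshold, the field names (`policy_number`, `effective_date`), the regular expressions, and the rule that a document goes to manual review when any field is missing are all assumptions made for illustration. It takes page text that has already been read from the PDF text layer (or produced by OCR) and shows classification, field extraction, routing, and export to JSON or CSV.

```python
import csv
import io
import json
import re
from dataclasses import dataclass

# Assumed threshold: a page whose text layer yields fewer characters than
# this is treated as a scanned image that needs OCR.
MIN_CHARS_PER_PAGE = 50

# Hypothetical data points and patterns; real policies would need their own.
FIELD_PATTERNS = {
    "policy_number": re.compile(
        r"Policy\s*(?:No\.?|Number)\s*[:#]?\s*([A-Z0-9-]+)", re.I
    ),
    "effective_date": re.compile(
        r"Effective\s*Date\s*:?\s*(\d{2}/\d{2}/\d{4})", re.I
    ),
}


@dataclass
class Document:
    name: str
    page_texts: list[str]  # text-layer content per page; empty for scanned pages


def classify(doc: Document) -> str:
    """Return 'digital' if at least half the pages carry a text layer."""
    if not doc.page_texts:
        return "scanned"
    rich = sum(1 for t in doc.page_texts if len(t.strip()) >= MIN_CHARS_PER_PAGE)
    return "digital" if rich * 2 >= len(doc.page_texts) else "scanned"


def extract_fields(text: str) -> dict:
    """Pull each configured data point from the text, or None if not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def route(fields: dict) -> str:
    """Assumed rule: send to manual review when any data point is missing."""
    return "auto" if all(v is not None for v in fields.values()) else "manual"


def to_json(records: list) -> str:
    """Serialize extracted records as JSON for the target data system."""
    return json.dumps(records, indent=2)


def to_csv(records: list) -> str:
    """Serialize extracted records as CSV with one column per data point."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(FIELD_PATTERNS), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

In this sketch, a document classified as `scanned` would first be passed through OCR and the resulting text handed to `extract_fields`; records routed to `auto` could then be written out with `to_json` or `to_csv`, while the rest are queued for a person to review.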
Tools & Technologies We Used
Open source tools
- Tesseract for OCR of scanned PDFs
- iText for digital PDFs
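As a rough illustration of the OCR side, the sketch below calls the Tesseract command-line tool from Python. It assumes the `tesseract` executable is installed and on the PATH, and that each scanned PDF page has already been converted to an image file, since Tesseract reads images rather than PDFs. The function names and the example file name are hypothetical. iText is a Java and .NET library, so the digital-PDF path is not shown here.

```python
import shutil
import subprocess
from typing import Optional


def build_tesseract_command(image_path: str, lang: str = "eng", psm: int = 3) -> list:
    """Build the command line for one page image.

    Passing "stdout" as the output base makes Tesseract print the recognized
    text instead of writing a file. "--psm" sets the page segmentation mode
    (3 is fully automatic); very old Tesseract versions spell it "-psm".
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path: str, lang: str = "eng") -> Optional[str]:
    """Run OCR on one page image, or return None if Tesseract is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract reports a failure
    )
    return result.stdout
```

For example, `ocr_image("policy-page-1.png")` would return the recognized text for that page, which could then be searched for the required data points.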
Reduced data entry work by over 60%, resulting in more efficient use of resources
Ability to achieve 100% accuracy on highly critical files
Modular application allows for easy re-use