SoKat has worked on multiple projects where we used Optical Character Recognition (OCR), alongside Natural Language Processing tools, to capture and extract data from scanned images, PDF files and handwritten documents into a structured data warehouse. In one project, we extracted financial compensation and personal data from the IRS Form 990 (handwritten and scanned before 2014, after which the IRS mandated computer-filled PDF forms). Our team used computer vision AI models to extract data from over 500,000 scanned forms from 2008 through 2013.
Project: Electronic Healthcare Records
Problem: Our Client had a variety of unstructured medical records (e.g., phone pictures and handwritten notes) that needed to be converted into a more usable format. In other words, they needed a system that could ingest the healthcare records and extract all text data in a structured format.
Solution: We built an automated, cloud-based pipeline for the complete digitization of all of the Client’s scanned healthcare records. Because the records came in many different types, we built the system to be versatile in what it can read. Our use of advanced NLP tools improved data quality and surfaced new insights into the data itself, and because we built the system in the cloud, it was scalable and flexible.
Lessons Learned: Medical professionals are too busy to clean data, and their time is better spent on actual patient care. We also found that the current set of AI tools still needs some tweaking when applied to healthcare data.
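To make the "structured format" step from the Solution concrete, here is a minimal Python sketch of how text that has already been extracted by an OCR engine might be turned into a structured record. It is an illustration only, not the pipeline we deployed: the `PatientRecord` fields, the `FIELD_PATTERNS` labels and the `structure_record` function are all hypothetical names chosen for this example, and real healthcare records would need far more robust parsing.

```python
import re
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PatientRecord:
    """Hypothetical target schema; the field names are assumptions."""
    patient_name: Optional[str]
    date_of_birth: Optional[str]
    diagnosis: Optional[str]


# Assumed "Label: value" lines in the OCR output. Each pattern captures
# the value that follows a known label on a single line.
FIELD_PATTERNS = {
    "patient_name": re.compile(
        r"^\s*(?:patient\s+)?name\s*[:\-]\s*(.+?)\s*$", re.I | re.M),
    "date_of_birth": re.compile(
        r"^\s*(?:dob|date\s+of\s+birth)\s*[:\-]\s*(.+?)\s*$", re.I | re.M),
    "diagnosis": re.compile(
        r"^\s*diagnosis\s*[:\-]\s*(.+?)\s*$", re.I | re.M),
}


def structure_record(ocr_text: str) -> PatientRecord:
    """Map raw OCR text to a PatientRecord; missing fields become None."""
    values = {}
    for field_name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        values[field_name] = match.group(1) if match else None
    return PatientRecord(**values)


if __name__ == "__main__":
    sample = "Patient Name: Jane Doe  \nDOB: 01/02/1980\nNotes: follow up\n"
    print(asdict(structure_record(sample)))
```

Fields that cannot be found are left as `None` rather than guessed, so incomplete records can be flagged for review instead of silently filled in.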