Salary Certificate NER
The Salary Certificate Named Entity Recognizer is an application based on the Natural Language Processing (NLP) field of Machine Learning (ML). It extracts key information from scanned images of salary certificates written in the Kuwaiti Arabic dialect. The extracted fields are:
- Person’s Name
- Nationality
- Company Name
- Civil ID Number
- Salary
The Salary Certificate NER application uses a hybrid approach, i.e., it combines ML with a rule-based approach in a pipelined process. The pipeline consists of the following steps:
- Applying OCR to the scanned document.
- Preparing the extracted text for further processing.
- Tokenizing the processed text.
- Tagging the tokenized text with Part-of-Speech (POS) tags.
- Passing the POS-tagged text to the Named Entity Recognizer (NER).
- Using a rule-based approach to extract any missed fields.
- Generating output in the required format (JSON, CSV, TSV, etc.).
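The text-preparation, rule-based fallback, and output steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the field keys in `FIELDS`, the function names `prepare_text`, `fill_missing_fields`, and `render`, and the rule that a Civil ID is a standalone run of exactly 12 digits are all assumptions made for the example. The OCR, tokenization, POS tagging, and NER steps are omitted; the sketch assumes they have already produced the raw text and a partially filled record.

```python
import csv
import io
import json
import re

# Hypothetical field keys; the real application may name them differently.
FIELDS = ["name", "nationality", "company", "civil_id", "salary"]

# Translation table from Arabic-Indic digits (U+0660..U+0669) to ASCII digits.
ARABIC_INDIC_DIGITS = {0x0660 + i: str(i) for i in range(10)}

# Assumption: a Civil ID appears as a standalone run of exactly 12 digits.
CIVIL_ID_RE = re.compile(r"(?<!\d)\d{12}(?!\d)")


def prepare_text(raw: str) -> str:
    """Normalize digits and collapse whitespace in the OCR output."""
    return " ".join(raw.translate(ARABIC_INDIC_DIGITS).split())


def fill_missing_fields(record: dict, text: str) -> dict:
    """Rule-based fallback: fill fields the NER step left empty."""
    result = {field: record.get(field) for field in FIELDS}
    if not result["civil_id"]:
        match = CIVIL_ID_RE.search(text)
        if match:
            result["civil_id"] = match.group(0)
    return result


def render(record: dict, fmt: str = "json") -> str:
    """Serialize one extracted record as JSON, CSV, or TSV."""
    row = {field: record.get(field) for field in FIELDS}
    if fmt == "json":
        # ensure_ascii=False keeps Arabic names readable in the output.
        return json.dumps(row, ensure_ascii=False)
    if fmt in ("csv", "tsv"):
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer,
            fieldnames=FIELDS,
            delimiter="," if fmt == "csv" else "\t",
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerow(row)  # None values are written as empty cells
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

In this sketch the fallback only runs for fields that are still empty, so a value found by the NER step is never overwritten by a regular-expression match. Similar rules for the other fields could be added to `fill_missing_fields` in the same way.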
Technology Stack:
- Python 3.7
- Natural Language Processing (NLP)
- NLTK Stanford – Stanza
- Tesseract OCR
- Python Flask Framework (for web interface)
Benefits:
- Extracts Significant Information Quickly
- Faster than Manual Data Entry
- Usable as a Component in Process Automation