Recognition of Named Entities on Invoices for IxorDocs

As part of the automated document flow in IxorDocs, an invoice in PDF format needs to be transformed into UBL (Universal Business Language), a standard format for digital invoices.

IxorDocs connects businesses with clients, employees and the government. Invoices are secure, HR documents are optimised, and businesses are connected to the international PEPPOL platform.

To avoid manual intervention during document processing, machine learning is the way to go: it makes it possible to analyse a PDF invoice automatically. To handle the document flow in a fully automated way, all fields of interest, such as the invoice number, order number, date and VAT number, have to be detected.

The IxorThink team created and trained a named entity recognition (NER) model to correctly analyse new invoices. This NER model uses the Java tool PDFBox, N-gram techniques, the hashing trick and an XGBoost classifier.
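
To make the combination of N-grams and the hashing trick more concrete, here is a minimal Python sketch. It is not the IxorThink implementation: the function names, the character-level N-grams, the N-gram range and the bucket count are all assumptions chosen for illustration. It turns a single token extracted from an invoice into a fixed-length count vector, which is the kind of input a classifier such as XGBoost can be trained on.

```python
import hashlib


def char_ngrams(token, n_min=1, n_max=3):
    """Return all character N-grams of the token (hypothetical helper).

    The token is wrapped in '<' and '>' so that N-grams at the start and
    end of the token differ from those in the middle.
    """
    padded = f"<{token}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def hashed_features(token, n_buckets=1024):
    """Map a token to a fixed-length count vector using the hashing trick.

    Each N-gram is hashed to a bucket index, so no vocabulary has to be
    stored. Different N-grams may collide in the same bucket. The bucket
    count of 1024 is an assumption, not a value from the article.
    """
    vector = [0] * n_buckets
    for gram in char_ngrams(token):
        # hashlib gives a hash that is stable across runs, unlike the
        # built-in hash(), which is salted per process for strings.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1
    return vector


if __name__ == "__main__":
    features = hashed_features("BE0123456789")
    print(len(features), sum(features))
```

In a full pipeline, one such vector per token (possibly combined with position or context features) would be stacked into a matrix and passed, together with labels such as "invoice number" or "VAT number", to a gradient-boosted classifier. That last step is omitted here to keep the sketch dependency-free.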