About two years ago, we started developing a machine learning model for named entity recognition (NER) on invoices. We did not have a large invoice dataset at the time, so we built a detection model using an XGBoost classifier. Now that more data is available, we believe a big leap in performance is possible by changing our AI model. Achieving state-of-the-art performance requires combining information from both content and layout.
To tackle this, we use an artificial neural network trained on our invoice dataset. Specifically, a convolutional neural network (CNN) is applied to a grid of text, which preserves the document structure, while words are embedded as features that carry their semantic meaning. We further improved accuracy by adding several types of data augmentation and an attention module inside the CNN.
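To make the "grid of text" idea concrete, here is a minimal sketch of how OCR words could be rasterised onto a coarse grid before being fed to a CNN. Everything in it is an assumption made for illustration, not our production code: the `Word` class, the `build_text_grid` function, pixel bounding boxes as input, and the choice to store token ids (with 0 meaning "empty" and 1 meaning "unknown word") that an embedding lookup would later replace with vectors.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR output: a word and its bounding box in pixels."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def build_text_grid(words, page_w, page_h, rows, cols, vocab, unk_id=1):
    """Rasterise words onto a rows x cols grid of token ids (0 = empty cell)."""
    grid = [[0] * cols for _ in range(rows)]
    for w in words:
        token_id = vocab.get(w.text.lower(), unk_id)
        # Map the pixel box to grid cells: floor for the start, ceiling for the end.
        c0 = min(w.x0 * cols // page_w, cols - 1)
        r0 = min(w.y0 * rows // page_h, rows - 1)
        c1 = max(c0 + 1, min(-(-w.x1 * cols // page_w), cols))
        r1 = max(r0 + 1, min(-(-w.y1 * rows // page_h), rows))
        for r in range(r0, r1):
            for c in range(c0, c1):
                grid[r][c] = token_id
    return grid


# Illustrative data only: a 1000 x 500 px page with two words.
vocab = {"invoice": 2, "total": 3}
words = [
    Word("Invoice", 100, 0, 400, 100),
    Word("Total", 500, 400, 700, 500),
]
grid = build_text_grid(words, page_w=1000, page_h=500, rows=5, cols=10, vocab=vocab)
for row in grid:
    print(row)
```

Each cell keeps the identity of the word that covers it, so the network sees both what a word is and where it sits on the page. In a real pipeline the token ids would be swapped for embedding vectors, which gives a rows x cols x embedding-size tensor.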
To transform words into embeddings that capture their semantic meaning, we used Word2Vec. Embedding models of this kind turn text into a numerical form that neural nets can work with, using the context in which every word appears. To improve the quality of our custom embeddings, we transform all numeric fields into patterns before training.
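As an illustration of turning numeric fields into patterns, the sketch below replaces every digit with a placeholder, so that all dates or amounts with the same format collapse into one token. The exact scheme is an assumption: the function names `to_pattern` and `normalize_tokens`, the `0` placeholder, and the lowercasing step are all hypothetical choices made for this example.

```python
import re


def to_pattern(token: str) -> str:
    """Lowercase the token and replace every digit with '0' (assumed scheme)."""
    return re.sub(r"\d", "0", token.lower())


def normalize_tokens(tokens):
    """Apply the pattern transform to every token of one invoice line."""
    return [to_pattern(t) for t in tokens]


# Illustrative tokens only, not taken from a real invoice.
line = ["Invoice", "INV-00123", "Date", "2019-03-12", "Total", "1.250,00"]
print(normalize_tokens(line))
```

Token lists normalised this way could then serve as the training sentences for a Word2Vec implementation such as the one in gensim. The embedding vocabulary stays small, because the model learns one vector per number format instead of one per distinct value.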
Even with a relatively small dataset (around 600 unique invoice templates), our model performs surprisingly well. Take a look at the results to see the big leap we made compared with our previous invoice model from one year ago (on a mixed test set of seen and unseen invoice templates). It gets even more interesting when we look at the recognition scores on new, unseen templates, a complex task that was certainly not possible with the previous version. These are very competitive scores, and training on client-specific datasets can raise them further, to almost fully correct field detection.
Of course, we keep improving our techniques for document analysis. At the moment we are developing techniques to improve our model for a specific client without running into the problem of “catastrophic forgetting” (a neural net trained on new data tends to forget what it learned before). Furthermore, we are expanding the IxorDocs AI toolkit so that it can do NER and classification on all kinds of structured and unstructured documents.
For a more in-depth explanation, take a look at our Medium Blog.