
OCR Correction with ByT5

Simon De Gheselle
Dec 15, 2021

We trained a Dutch OCR-correction seq2seq model that identifies and corrects OCR mistakes.

OCR, say what? 🤔

OCR, or optical character recognition, is used to transform scanned documents into digitized text. The sheer volume of documents that need to be digitized explains why OCR solutions are so popular. However, OCR solutions sometimes get it wrong and produce text that differs from the source.

For example, the letter ‘D’ is often misread as a ‘0’. For a reader with some knowledge of Dutch, it is easy to deduce that ‘0it’ is incorrect and should have been recognized as ‘Dit’.

There are a lot of factors that influence the accuracy of the OCR process such as the paper quality, the typesetting, the quality of the scanner, and the degradation of the original paper source.

Often a manual post-correction phase is required. Maybe we can automate this step with a layer of machine learning? ✨ Let’s see!

ByT5 model

Google recently introduced a model called ‘ByT5’, a token-free model that uses the raw bytes of the text instead of a tokenized representation. The authors concluded that the ByT5 architecture should be more resistant to noisy data than token-based models, so we think it is a good fit for this use case.

A word on tokenization 💡

Tokenization of the input text is considered a fundamental step in traditional NLP methods. A tokenized representation chunks the raw text into smaller units, called ‘tokens’.

These tokens can be constructed at the word, subword, or character level. The tokens are used to build a vocabulary, which is the unique set of tokens in a given corpus or language.

Models that use tokenization have the downside that they are not robust to out-of-vocabulary (OOV) words. For example, typos, spelling variants, and capitalization cause the tokenized representation of a word to change drastically, making it harder for the model to find similarities with the intended word.

An example of word, subword, and character-based tokenization
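
To make the difference concrete, here is a minimal sketch of the three granularities. The Dutch example sentence and the multilingual BERT tokenizer are illustrative choices, not part of the original post; the exact subword pieces depend on the vocabulary of the tokenizer you load.

```python
# Sketch: word-, subword-, and character-level tokenization of the same sentence.
from transformers import AutoTokenizer

text = "Dit is een voorbeeldzin"

# Word level: split on whitespace
word_tokens = text.split()      # ['Dit', 'is', 'een', 'voorbeeldzin']

# Character level: every character becomes a token
char_tokens = list(text)        # ['D', 'i', 't', ' ', 'i', 's', ...]

# Subword level: pieces from a pretrained (multilingual) vocabulary
subword_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
subword_tokens = subword_tokenizer.tokenize(text)  # e.g. ['Dit', 'is', 'een', 'voorbeeld', '##zin']

print(word_tokens)
print(subword_tokens)
print(char_tokens)
```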

Token-free models to the rescue 🚀

Models such as ByT5 that operate directly on the raw bytes of the text have the following advantages compared to models that use a dedicated tokenizer:

  • More robust to noise in the data
  • Do not require language-specific tokenizers

The ByT5 model operates directly on the raw bytes of the textual data. Source: ByT5: Towards a token-free future with pre-trained byte-to-byte models

Token-free models align really well with the end-to-end learning approach, which aims to train models that directly map raw data into meaningful predictions.

In keeping with the ‘no free lunch’ theorem, token-free models also have a few drawbacks. The main drawback is an increase in computational cost, since the byte-level representation is longer than a subword- or word-level tokenized representation. The self-attention mechanism at the core of the Transformer architecture has quadratic time complexity with respect to the sequence length, so processing byte sequences results in a higher computational cost.
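
As a rough back-of-the-envelope illustration (the numbers below are assumed for the example, not taken from the paper): a sentence of about 20 subword tokens is roughly 100 bytes long, so self-attention over bytes costs on the order of 25 times more.

```latex
% Self-attention cost grows quadratically with the sequence length n.
% Illustrative numbers: ~20 subword tokens vs. ~100 bytes for the same sentence.
\[
\text{cost} \propto n^{2}
\quad\Rightarrow\quad
\frac{n_{\text{bytes}}^{2}}{n_{\text{subwords}}^{2}}
= \left(\frac{100}{20}\right)^{2}
= 25
\]
```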

The data, our fuel ⛽

Our strategy to tackle this problem is to finetune the ByT5 model on a large Dutch dataset. The OSCAR corpus is a publicly available multilingual dataset obtained by performing language classification, filtering, and cleaning of the Common Crawl corpus, which is an aggregation of web crawl data.

Additionally, we’ll need to simulate OCR mistakes. For this task, we’ll use nlpaug, a library that provides a toolset for various data augmentation techniques on textual data.

Implementation 🔨

Let’s get our hands dirty and look at how we implemented our OCR corrector.

Argument parsing 🔎

To allow our network to be retrained with modified arguments at runtime, we use the argument parsing functionality of the HuggingFace API.

Argument parsing snippet
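
The original snippet is an image in the post; below is a minimal sketch of what such argument parsing could look like with HfArgumentParser. The dataclass fields (model name, dataset sizes) are assumptions for illustration, not the exact arguments from the article.

```python
# Sketch: argument parsing with the HuggingFace API.
# The fields below are illustrative assumptions.
from dataclasses import dataclass, field

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class ModelArguments:
    model_name_or_path: str = field(default="google/byt5-small")
    max_length: int = field(default=128)


@dataclass
class DataArguments:
    dataset_name: str = field(default="oscar")
    dataset_config: str = field(default="unshuffled_deduplicated_nl")
    n_train_samples: int = field(default=100_000)
    n_test_samples: int = field(default=10_000)


# Parses --model_name_or_path, --n_train_samples, ... from the command line.
# TrainingArguments additionally requires --output_dir on the command line.
parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
```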

Loading the dataset and performing preprocessing 💾

We train the model on the Dutch section of the OSCAR dataset. To simulate common OCR mistakes, we use the nlpaug library. We also limit the dataset to 100k training samples and 10k test samples; for even better results, this limitation can be removed.

Loading the dataset and augmenting with common OCR errors
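
Since the original snippet is an image, here is a sketch of how this could look. The OSCAR config name ("unshuffled_deduplicated_nl") and the nlpaug OcrAug settings are assumptions for illustration and may differ from the exact code used in the article.

```python
# Sketch: load the Dutch part of OSCAR and simulate OCR errors with nlpaug.
from datasets import load_dataset
import nlpaug.augmenter.char as nac

# Dutch portion of the OSCAR corpus (large; consider streaming or slicing)
dataset = load_dataset("oscar", "unshuffled_deduplicated_nl", split="train")

# Keep 100k training samples and 10k test samples
dataset = dataset.shuffle(seed=42).select(range(110_000))
dataset = dataset.train_test_split(test_size=10_000)

# OCR augmenter: substitutes visually confusable characters (o/0, i/l, ...)
ocr_aug = nac.OcrAug()

def add_ocr_noise(example):
    noisy = ocr_aug.augment(example["text"])
    # Recent nlpaug versions return a list, older ones a string
    example["input_text"] = noisy[0] if isinstance(noisy, list) else noisy
    example["target_text"] = example["text"]
    return example

dataset = dataset.map(add_ocr_noise)
```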

Loading the model 🚀

The HuggingFace API provides an easy way to load a pretrained model; to load the ByT5 model, you can use the following snippet.

Loading the model and the byte tokenizer
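
A minimal sketch of this step could look as follows. The checkpoint name ("google/byt5-small") and the max length of 128 are assumptions for illustration.

```python
# Sketch: load the pretrained ByT5 model and its byte-level tokenizer.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/byt5-small"

# Explicitly set model_max_length: the byte tokenizer does not ship a sensible
# default, so without it long inputs would not be truncated.
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=128)
model = T5ForConditionalGeneration.from_pretrained(model_name)
```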

☝️ Note the need to explicitly set the model max_length in the snippet above!

Next, we’ll process the dataset so that it’s in the required format to feed into the model.

Reformatting both the train and test dataset
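
Continuing the sketches above (the input_text/target_text column names come from the assumed augmentation step), the reformatting could look roughly like this: the noisy text becomes the model input and the clean text becomes the labels.

```python
# Sketch: turn (noisy input, clean target) pairs into model-ready features.
max_length = 128

def preprocess(batch):
    model_inputs = tokenizer(
        batch["input_text"],
        max_length=max_length, truncation=True, padding="max_length",
    )
    targets = tokenizer(
        batch["target_text"],
        max_length=max_length, truncation=True, padding="max_length",
    )
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs

train_dataset = dataset["train"].map(preprocess, batched=True)
test_dataset = dataset["test"].map(preprocess, batched=True)
```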

Training the model 💪

To train the model, we use the generic Trainer object provided by HuggingFace.
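
A minimal training setup could look like the sketch below; the hyperparameters are illustrative assumptions, not the values used for the published model.

```python
# Sketch: train ByT5 with the generic HuggingFace Trainer.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="byt5-ocr-correction",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    logging_steps=500,
    save_steps=5_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()
```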

Time to do inference 🧪

Let’s fire up this bad boy 🧨! After training the model, we should be able to test it out by initializing and executing a pipeline object.

Constructing a pipeline to do inference
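
A sketch of the inference step, reusing the model and tokenizer from the training session; the example sentence is an illustrative one with typical OCR errors, not taken from the article.

```python
# Sketch: wrap the fine-tuned model in a text2text-generation pipeline.
from transformers import pipeline

ocr_corrector = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

result = ocr_corrector("0it is een zin met 0CR fouten.")
print(result[0]["generated_text"])
# Hoped-for output, roughly: "Dit is een zin met OCR fouten."
```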

Results

The ByT5 model performs really well on the OCR-correction task thanks to its noise-resistant capabilities; the following table shows a few examples. Can you guess this well-known Dutch song 🎵?

Results of the OCR-correction pipeline

Conclusion

The token-free ByT5 model turned out to work surprisingly well on our OCR-correction task. If you have a downstream task that processes very noisy, small to medium-sized sentences, the ByT5 model is a great fit and is likely to outperform token-based models.

If you want to test out the model, you can either train it yourself with the snippets we’ve provided or load the model directly from the HuggingFace hub. We’ve also made a small demo; feel free to test it out 🔥.

Screencapture of our demo application

About ML6

We are a team of AI experts and the fastest-growing AI company in Belgium. With offices in Ghent, Amsterdam, Berlin, and London, we build and implement self-learning systems across different sectors to help our clients operate more efficiently. We do this by staying on top of research and innovation and by applying our expertise in practice. To find out more, please visit www.ml6.eu

Sources

ByT5: Towards a token-free future with pre-trained byte-to-byte models, Xue et al., 2021 (https://arxiv.org/abs/2105.13626)