NashTech Insights

ID Card Extraction with Donut and How to Support Specific Languages

Hung Nguyen Dinh

Key takeaways:
Donut is suitable for parsing document images that do not contain too much information, such as receipts and ID cards. It is built on the Transformer architecture, the same state-of-the-art architecture behind ChatGPT. We can also train the model for languages other than English on on-premise infrastructure/GPUs fairly easily, thanks to its simple labelling requirements and efficient GPU memory usage.

What is Donut?

Donut stands for Document Understanding Transformer. It is an open-source deep learning model for understanding and extracting information from document images. Donut is built on the Transformer, the well-known state-of-the-art deep learning architecture behind ChatGPT.

Understanding document images such as invoices, receipts and ID cards is a challenging task because it requires both analysing the input image and natural language processing capabilities to extract the desired information from it.

One approach to this kind of task is to use an existing OCR model that consists of several modules (text detection, text recognition) to output the entire text of the document image. The useful information is then extracted with techniques such as text/location matching or another language model.

Image from the original paper

The OCR-dependent approach has some limitations:

  • Expensive training costs and large-scale datasets are required when training an OCR model.
  • Limited flexibility in dealing with different languages.

In contrast, Donut is an end-to-end model that handles both document image processing and natural language processing itself rather than relying on an existing OCR model. In other words, Donut covers both the OCR and the natural language processing capabilities.

Architecture of Donut

Image from the original paper

Donut follows the popular encoder-decoder architecture that has been widely used in tasks such as machine translation and image captioning.

It has two main components:

  • Visual Encoder: a Swin Transformer extracts features from the input image.
  • Text Decoder: a BART decoder generates the output text conditioned on the image features produced by the encoder.

In fact, CNN-based models can also be used as the visual encoder, but in the study the Swin Transformer was adopted because it showed the best performance in document image parsing. The text decoder works similarly to GPT models: it generates the next token given the previous tokens.
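The generation loop described above can be sketched with a toy stand-in for the decoder. Nothing here is Donut code: `toy_next_token` and `TOY_TRANSITIONS` are hypothetical names, the lookup table only plays the role of the BART decoder, and the image features are left out entirely. The point is just that each step looks at what has been generated so far and emits one more token until an end token appears.

```python
from typing import Callable, List

# Hypothetical stand-in for the decoder: maps the last token to the next one.
# A real decoder would attend to all previous tokens and to the image features.
TOY_TRANSITIONS = {
    "<parsing>": "<name>",
    "<name>": "item 01",
    "item 01": "</name>",
    "</name>": "</parsing>",
}
END_TOKEN = "</parsing>"


def toy_next_token(prefix: List[str]) -> str:
    """Return the next token given the sequence generated so far."""
    return TOY_TRANSITIONS.get(prefix[-1], END_TOKEN)


def greedy_decode(prompt: str,
                  next_token: Callable[[List[str]], str],
                  max_steps: int = 16) -> List[str]:
    """Generate tokens one at a time until END_TOKEN or max_steps is reached."""
    sequence = [prompt]
    for _ in range(max_steps):
        token = next_token(sequence)
        sequence.append(token)
        if token == END_TOKEN:
            break
    return sequence


print(greedy_decode("<parsing>", toy_next_token))
```

The number of loop iterations equals the number of generated tokens, so longer outputs need proportionally more decoder steps.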

Tasks that Donut can do

Because Donut uses a GPT-like (auto-regressive) training style, it can solve several kinds of tasks:

  • Document classification
  • Visual Document Question Answering
  • Visual Document Parsing

Document classification

This task is to determine which category a document belongs to. How do we use Donut to classify a document image? We pass the text <classification> to the decoder and, given the input image, the decoder should output, for example, <class>receipt</class></classification>

Visual Document Question Answering

Given a receipt image, we ask Donut a question. For example, we pass the text <vqa><question>what is the price of choco mochi?</question><answer> to the decoder, and the decoder should output 14,000</answer></vqa>

Visual Document Parsing

With this task, we don’t need to provide much information to the decoder. We just provide a tag (we can decide which tag to use when training), for example <parsing>, and Donut should generate all the required information in the form of XML-like tags: <item><name>item 01</name> other properties…</item></parsing>
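The three task prompts above differ only in the text handed to the decoder, which a small helper makes explicit. This is a minimal sketch: `build_prompt` is a hypothetical function, and the tag names simply mirror the simplified examples in this article, since the real tags are whatever was chosen at training time.

```python
from typing import Optional


def build_prompt(task: str, question: Optional[str] = None) -> str:
    """Return the text passed to the decoder to select a task.

    Tag names are placeholders taken from the examples in this article.
    """
    if task == "classification":
        return "<classification>"
    if task == "vqa":
        if not question:
            raise ValueError("a question is required for the vqa task")
        # The prompt ends with an open <answer> tag for the decoder to complete.
        return f"<vqa><question>{question}</question><answer>"
    if task == "parsing":
        return "<parsing>"
    raise ValueError(f"unknown task: {task}")


print(build_prompt("vqa", "what is the price of choco mochi?"))
```

The same model weights serve all three tasks; only the prompt changes.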

How to train Donut?


The dataset for training Donut is quite simple in structure: it requires only document images and the corresponding texts, and does not require location annotations for the text. This makes the labels much more convenient to prepare.

Training method

Training Donut can be done in two phases:

  • Pre-training
  • Fine-tuning


Pre-training teaches the model how to read and extract text from many types of document images with varied layouts, font styles, etc. In this phase, Donut is not trained for a specific task; the objective is simply an overall understanding of the image and its mapping to text.

This process may require a large amount of data and a lot of time to prepare the dataset. To reduce that effort, the authors propose a tool for generating synthetic datasets called SynthDoG (Synthetic Document Generator). With this tool, we can generate datasets in various languages such as Vietnamese, Japanese and Chinese, which saves a lot of time.

Documents generated by SynthDoG


After the model learns “how to read”, in the fine-tuning phase we teach it how to understand the document image by learning one of the downstream tasks listed above (classification, visual question answering, document parsing).

The decoder is trained to generate a text sequence in the form of XML-like tags that can easily be converted to JSON. For example, in the document classification task, the decoder is trained to generate the token sequence <class>memo</class>, which can be converted to JSON as {"class": "memo"}
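The tag-to-JSON conversion mentioned above can be approximated with a short recursive function. This is only a sketch under simplifying assumptions, not the converter shipped with Donut: `tags_to_dict` is a hypothetical name, repeated sibling tags overwrite each other, and a tag nested inside a tag of the same name is not handled.

```python
import json
import re
from typing import Union

# Matches "<key>...</key>" where the closing tag repeats the opening name.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def tags_to_dict(sequence: str) -> Union[str, dict]:
    """Convert '<key>value</key>' pairs into nested dictionaries.

    Text with no tags is returned as a stripped string (a leaf value).
    """
    matches = list(TAG_PATTERN.finditer(sequence))
    if not matches:
        return sequence.strip()
    return {m.group(1): tags_to_dict(m.group(2)) for m in matches}


print(json.dumps(tags_to_dict("<class>memo</class>")))
print(json.dumps(tags_to_dict("<item><name>item 01</name></item>")))
```

Because every value is wrapped in a named tag, the generated sequence carries its own structure and no separate post-processing rules per field are needed.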

Train Donut for Vietnamese ID Card Extraction

At NashTech, we have been developing various Accelerator platforms, including the Software Development Accelerator and the Data/AI Accelerator, that help our clients accelerate their development. ID Card Extraction is one of the modules in the AI Accelerator.

In this section, I will share our experience researching and applying Donut to Vietnamese ID Card Extraction. Of course, the model can be adapted to other languages such as Japanese, Chinese and Korean.

ID Card exploration

As you know, training Donut means covering the OCR part as well. Furthermore, ID cards contain personal information and are very hard to collect. So what should we do? Inspired by the idea of generating synthetic data in the original Donut paper, we created a custom tool for generating synthetic Vietnamese ID cards.

We did not use the existing SynthDoG tool because SynthDoG is meant for generating many kinds of documents with different layouts, which is quite different from the fixed layout of the Vietnamese ID card. So we decided to create a simple tool ourselves, which only needs a small amount of OpenCV and Python code.

An ID card contains the information below:

  • Number
  • Full name
  • Date of birth
  • Sex
  • Nationality
  • Place of origin
  • Place of residence
  • Expiry date

The most tedious and time-consuming work was collecting all the addresses (place of origin and place of residence) in Vietnam from Wikipedia. The collected list of addresses is then used by the tool to generate synthetic ID cards.

The synthetic ID card looks like this:

Fake ID card generated with OpenCV

New idea for generating ID Card

After collecting the addresses of some cities/provinces, we tried training the model with this limited data, and it worked. Nice! But going back to the address-collecting process was tedious again, so we decided to pause and think about whether there was a faster way.

An idea came to mind: instead of rendering each address in its true order, how about mixing the address units (ward, district, city/province) in random order to speed up address collection? That way we only need a dictionary of all possible address units, which can be collected much more quickly than complete addresses in their true order. After all, the goal of the model is to learn how to map the pixels in the image to the corresponding words, so the text label can be anything as long as it matches the text in the image.
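The random mixing of address units can be sketched in a few lines. `ADDRESS_UNITS` and `random_address` are hypothetical names, and the handful of place names is only a sample standing in for the full dictionary of units.

```python
import random
from typing import List

# Illustrative sample only; the real dictionary would hold every ward,
# district and city/province name that was collected.
ADDRESS_UNITS = ["Phúc Xá", "Ba Đình", "Hà Nội", "Hải Châu", "Đà Nẵng", "Thạch Thang"]


def random_address(rng: random.Random, units: List[str], n_units: int = 3) -> str:
    """Join n_units distinct, randomly chosen units, ignoring real geography."""
    return ", ".join(rng.sample(units, n_units))


print(random_address(random.Random(1), ADDRESS_UNITS))
```

The resulting string is not a valid address, but it is rendered on the card and stored in the label unchanged, so image and text still agree.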

The modified version of the ID card with that idea looks like:

Fake ID card generated with OpenCV

The corresponding label is as follows:

{"file_name": "image.jpg", "ground_truth": "{\"gt_parse\": {\"number\": \"651344185039\",\"name\": \"Tiêu<sp>Như<sp>Hồng\",\"dob\": \"27/08/1932\",\"sex\": \"Nữ\",\"nation\": \"Việt<sp>Nam\",\"a1\": \"Cuối<sp>Pơng<sp>Thải<sp>Mồ<sp>Động<sp>Lũng\",\"a2\": \"P.<sp>Cuối<sp>Pơng<sp>Thải<sp>Hảo,<sp>Bính,<sp>Hừa,<sp>Tựu,<sp>Dề,<sp>Tụ\",\"expiry\": \"16/02/1949\"}}"}
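A label line in this shape can be produced with the standard `json` module. The sketch below assumes, based on the sample label, that spaces are replaced with the `<sp>` token and that `ground_truth` is itself a JSON string nested inside the outer object; `to_label_line` and `sample_fields` are hypothetical names.

```python
import json


def to_label_line(file_name: str, fields: dict) -> str:
    """Serialize one sample as a JSON line with a nested ground_truth string."""
    # Replace spaces with the <sp> token, as in the sample label.
    encoded = {key: value.replace(" ", "<sp>") for key, value in fields.items()}
    # ground_truth is serialized first, then embedded as a plain string.
    ground_truth = json.dumps({"gt_parse": encoded}, ensure_ascii=False)
    return json.dumps({"file_name": file_name, "ground_truth": ground_truth},
                      ensure_ascii=False)


sample_fields = {"name": "Tiêu Như Hồng", "nation": "Việt Nam"}
line = to_label_line("image.jpg", sample_fields)
print(line)
```

`ensure_ascii=False` keeps the Vietnamese characters readable in the file instead of escaping them as `\uXXXX` sequences.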


One of the problems we needed to solve is that the tokenizer in the original implementation does not fully support Vietnamese: some words are returned as “unknown” after tokenizing.

One idea was to use another tokenizer, such as the BARTpho tokenizer on Hugging Face, which can tokenize Vietnamese. However, its vocabulary is so large that it caused a GPU out-of-memory error.

After diving into the code, we finally decided to keep the current tokenizer and do some extra work by adding special Vietnamese tokens to it. This really helps because it does not require changing much code, and the tokenizer's vocabulary stays small enough to run on our GPU. Overall, the whole model has about 160 million parameters and occupies about 14 GB of GPU memory, which is acceptable.

decoder.add_special_tokens([
    "ả", "á", "ã", "à", "ạ", "ă", "ằ", "ặ", "ẳ", "ẵ",
    "â", "ầ", "ấ", "ậ", "ẫ", "ẩ", "ơ", "ớ", "ờ", "ợ", "ở", "ỡ",
    "ò", "ó", "ọ", "ỏ", "õ", "è", "é", "ẹ", "ẻ", "ẽ",
    "ê", "ề", "ế", "ể", "ễ", "ệ", "ô", "ồ", "ố", "ổ", "ỗ", "ộ",
    "ú", "ù", "ụ", "ủ", "ũ", "ư", "ừ", "ứ", "ữ", "ử", "ự",
    "í", "ì", "ị", "ỉ", "ĩ", "ý", "ỵ", "ỷ", "ỳ", "ỹ", "đ", "<sp>",
])
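One way to check such a list for gaps is to scan the label texts for characters outside a known set. This is a character-level approximation under stated assumptions, not how the tokenizer itself decides what is unknown: `missing_characters` is a hypothetical helper, and `ASCII_KNOWN` assumes for illustration that only printable ASCII is already covered.

```python
import unicodedata
from typing import Iterable, List, Set


def missing_characters(texts: Iterable[str], known: Set[str]) -> List[str]:
    """Return the sorted characters that occur in texts but not in known."""
    seen = set()
    for text in texts:
        # Normalize to NFC so each accented letter is a single code point.
        normalized = unicodedata.normalize("NFC", text)
        seen.update(ch for ch in normalized if not ch.isspace())
    return sorted(seen - known)


# Assumption for illustration: the base vocabulary covers printable ASCII only.
ASCII_KNOWN = {chr(code) for code in range(32, 127)}

print(missing_characters(["Việt Nam", "Nữ"], ASCII_KNOWN))
```

Running this over all generated labels, with the already added tokens included in `known`, shows which characters (including uppercase accented letters) would still be missing.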

After several days of training the model with the modified dataset, it also works! Amazing!


Because real ID cards are extremely hard to obtain, we only tested the model on a very limited dataset. Overall, the model produces good results even though it was trained on a synthetic/fake dataset. Inference takes around 4 seconds (including the inference time of a separate deep learning model used for alignment).

We believe the model has a lot of potential if it is trained with higher-quality datasets instead of synthetic datasets only.

Inference results; sensitive information is partially hidden


Like GPT models, the decoder is auto-regressive, so the output sequence is generated token by token. As a result, it can take much longer when outputting a whole long document (the case where we use Donut purely as an OCR engine). So I think Donut is best suited to parsing document images that do not contain too much information, such as receipts and ID cards.

If you are interested in this topic and would like a deep-dive session on applying this AI technology in production, please don’t hesitate to book a demo to discuss with our AI experts.

Hung Nguyen Dinh

I am an AI engineer at NashTech Vietnam. I have been with the company for over 8 years. I am passionate about exploring new technologies and knowledge in software development and the AI field.
