July 13, 2022 | OCR

Document Layout Analysis, the Key for Document Understanding

In today’s article, we will show why a strong document layout analysis system is crucial for document understanding and Intelligent Document Processing solutions.

Definitions
What are the benefits of a good layout analysis system?
The layout analysis system of Tesseract

In our latest blog article, we described our new Key-Value Pair extractor.

To give you better insight into how it works, we will talk a bit more about the technology essential for key-value pair extraction: document layout analysis. Together with a strong OCR engine, document layout analysis is the first level of any document understanding process.

Definitions

Document Layout Analysis

Document layout analysis (DLA) is the identification (or detection) and categorization (or decoding) of the regions of interest in a document image.

DLA involves both a geometric layout analysis (detecting tables, pictures, equations, and barcodes) and a logical layout analysis (paragraphs, lines, words, and characters) of the document.
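To make the two levels more concrete, here is a minimal Python sketch of how the output of a layout analysis could be represented. All class and field names (`Region`, `RegionType`, `Box`, and so on) are our own illustrative assumptions and not the data model of GdPicture.NET or any other engine.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RegionType(Enum):
    """Geometric level: what kind of zone was detected on the page."""
    TEXT = "text"
    TABLE = "table"
    PICTURE = "picture"
    EQUATION = "equation"
    BARCODE = "barcode"


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical convention)."""
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Paragraph:
    lines: List[Line] = field(default_factory=list)


@dataclass
class Region:
    """One detected zone, with its logical content nested inside."""
    kind: RegionType
    box: Box
    paragraphs: List[Paragraph] = field(default_factory=list)


def count_words(region: Region) -> int:
    """Walk the logical hierarchy: region -> paragraphs -> lines -> words."""
    return sum(len(line.words) for p in region.paragraphs for line in p.lines)


# A made-up text region containing a single two-word line.
header = Region(
    kind=RegionType.TEXT,
    box=Box(40, 30, 560, 80),
    paragraphs=[
        Paragraph(lines=[
            Line(words=[
                Word("Invoice", Box(40, 30, 140, 55)),
                Word("No.", Box(150, 30, 190, 55)),
            ])
        ])
    ],
)
```

The region type carries the geometric categorization, while the nested paragraphs, lines, and words carry the logical layout.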

DLA and OCR

An OCR solution is a complex system that combines several engines which intervene at different stages of the process.

A standard OCR process includes:

Preprocessing (the cleanup phase);
Thresholding (segmentation);
Layout analysis;
Recognition;
Post-processing.

All OCR processes include these steps, with more or less success. This is why you obtain different results with the same document when testing solutions from different vendors.
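The five steps listed above can be sketched as a chain of functions. This is a toy illustration under our own assumptions: the image is a small grid of grayscale values, the layout step only finds horizontal bands of ink, and `recognize` is a placeholder that returns labels instead of real text. It is not how any production engine is implemented.

```python
from typing import List, Tuple

Gray = List[List[int]]    # 0 (black) .. 255 (white)
Binary = List[List[int]]  # 1 = ink, 0 = background


def preprocess(image: Gray) -> Gray:
    """Cleanup: stretch contrast so the darkest pixel is 0 and the brightest 255."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:
        return [[255 for _ in row] for row in image]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in image]


def threshold(image: Gray, cutoff: int = 128) -> Binary:
    """Segmentation: mark every pixel darker than the cutoff as ink."""
    return [[1 if p < cutoff else 0 for p in row] for row in image]


def analyze_layout(binary: Binary) -> List[Tuple[int, int]]:
    """Layout: return (first_row, last_row) for each band of rows containing ink."""
    bands = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(binary) - 1))
    return bands


def recognize(binary: Binary, bands: List[Tuple[int, int]]) -> List[str]:
    """Placeholder: a real engine would classify the glyphs in each band."""
    return [f"<line rows {a}-{b}> " for a, b in bands]


def postprocess(lines: List[str]) -> List[str]:
    """Final cleanup of the recognized strings."""
    return [line.strip() for line in lines]


def run_ocr(image: Gray) -> List[str]:
    cleaned = preprocess(image)
    binary = threshold(cleaned)
    bands = analyze_layout(binary)
    return postprocess(recognize(binary, bands))


# A made-up 5x4 "page": dark pixels (60) on a light background (200).
page = [
    [200, 200, 200, 200],
    [60, 200, 60, 200],
    [60, 60, 60, 200],
    [200, 200, 200, 200],
    [200, 60, 200, 200],
]

print(run_ocr(page))  # ['<line rows 1-2>', '<line rows 4-4>']
```

Because each stage consumes the output of the previous one, an error made early (for example, a wrong band in the layout step) cannot be repaired by the recognizer later.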

Document Understanding

When you combine layout analysis with a recognition engine, you obtain the first level of document understanding.

Document understanding is at the core of any Intelligent Document Processing solution.

If you add Natural Language Processing (NLP) capabilities, as we do in GdPicture.NET, you obtain reinforced document understanding with accurate results.

What are the benefits of a good layout analysis system?

A performant layout analysis system will, of course, help with OCR, but the benefits go far beyond that:

It improves the OCR recognition step, especially with LSTM-based recognizers.
It also makes it possible to move beyond simple OCR processes and toward document understanding.
It provides better accessibility support by making the conversion to PDF/UA easier.
It improves the conversion of a fixed layout into an editable one (e.g., PDF to Office).

The layout analysis system of Tesseract

Hewlett-Packard first developed Tesseract in the 1980s. Originally proprietary software, it became open source in 2005. Since 2006, Google has been sponsoring its development.

At first, Tesseract was developed to OCR scanned books.
This is why it is excellent at detecting two-column pages of text.

However, this engine has three main limitations:

It aggressively aggregates words into lines.
Results are poor with business documents.
Tesseract is almost impossible to maintain.

Even a good pre-processing phase does little to improve OCR results with an engine that relies on Tesseract only.
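To illustrate why aggressive line aggregation hurts on business documents, here is a small Python sketch. It is not Tesseract's actual algorithm: the `group_into_lines` function, the `max_gap` parameter, and the word coordinates are all assumptions made up for this example. With no gap limit, every word on the same row ends up in the same line, even when the words belong to separate fields of a form.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    left: int
    right: int
    top: int
    bottom: int


def same_row(a: Word, b: Word) -> bool:
    """True when the two boxes overlap vertically."""
    return min(a.bottom, b.bottom) > max(a.top, b.top)


def group_into_lines(words: List[Word], max_gap: Optional[int] = None) -> List[str]:
    """Greedy grouping of words into lines.

    max_gap=None merges every word on the same row (the aggressive behavior);
    a pixel value starts a new line when the horizontal gap is larger.
    """
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        for line in lines:
            last = line[-1]
            close_enough = max_gap is None or word.left - last.right <= max_gap
            if same_row(last, word) and close_enough:
                line.append(word)
                break
        else:
            lines.append([word])
    return [" ".join(w.text for w in line) for line in lines]


# Made-up form with two fields side by side, each with a label and a value.
form = [
    Word("Invoice", 40, 110, 10, 30),
    Word("date:", 115, 160, 10, 30),
    Word("Ship", 400, 440, 10, 30),
    Word("to:", 445, 470, 10, 30),
    Word("2022-07-13", 40, 140, 40, 60),
    Word("ACME", 400, 450, 40, 60),
]

print(group_into_lines(form))
# ['Invoice date: Ship to:', '2022-07-13 ACME']
print(group_into_lines(form, max_gap=20))
# ['Invoice date:', 'Ship to:', '2022-07-13', 'ACME']
```

In the first result, the labels and values of two unrelated fields are fused, which makes any later key-value pairing unreliable. The second result keeps each field separate, at the cost of having to choose a sensible gap.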

For GdPicture.NET, we chose a hybrid approach that includes heuristics, mathematics, and ML capabilities.

The GdPicture.NET layout analysis system is much stronger, and together with the other processes (pre/post-processing and segmentation engines), it gets better results than Tesseract, especially in difficult cases such as:

Skewed documents
Text on colored background
Documents with a lot of noise
Underlined text
Text in graphics and tables

In a future blog article, we will show examples highlighting the differences between Tesseract and GdPicture.NET.
Stay tuned!

Cheers,

Elodie & Loïc