Document AI reduces, and in some cases replaces, the need for humans to convert documents into digital form. It uses Natural Language Processing (NLP) and Machine Learning (ML) models that are trained on example documents. Once trained, it can process the various types of information contained in a document.
The variety of document formats makes this a challenging task. A single document can mix several kinds of content, such as images, tables, barcodes, handwritten text, and logos, and each of these varies from one layout to the next. On top of this, the quality of the document images can affect processing.
Data volumes are growing quickly. It was estimated that by 2023 unstructured data would make up over 80% of enterprise data, and that organizations would generate 73,000 exabytes of data in 2023 alone.
By 2028, about 70% of data is expected to be stored in unstructured formats. This trend makes machine learning solutions for handling such data a necessity.
Accessibility can quickly become the greatest barrier to wider adoption of Document AI. While Amazon AWS, Google, and Microsoft Azure all offer powerful Document AI tools backed by their cloud services, the costs can escalate rapidly. Charges are most often levied per page or per thousand characters processed.
This can put advanced document processing out of reach for smaller businesses or individual practitioners who have few users but a large volume of documents to process. In the following sections, we take a look at state-of-the-art models that allow us to build custom Document AI pipelines.
In a nutshell, how Document AI works
Document AI leverages Machine Learning (ML) and Natural Language Processing (NLP) to extract actionable information from free-form documents.
I'll explain the process in steps:
Ingest: The first step is to ingest the PDF. This can be done manually by uploading the PDF to the Document AI system.
Preprocess: Once the PDF has been ingested, it is preprocessed to prepare it for analysis. This may include steps such as image quality detection and noise removal, although powerful multimodal models can tolerate noisy input to a certain extent.
Some systems also try to improve image quality or de-skew the page for better performance.
Document Layout Analysis (DLA): DLA is performed to understand the structure of the document, which includes detecting and categorizing text blocks, images, tables, and other layout elements.
Optical Character Recognition (OCR): After DLA, OCR is applied to the structured layout to accurately recognize and convert the text within each identified block into machine-readable text.
Extraction: With a structured layout and recognized text in hand, the system then extracts entities and the relationships between them.
For instance, a multimodal Transformer trained on a large-scale dataset of documents may accept text and visual features directly, in place of traditional OCR. Such multimodal models can also be fine-tuned to learn specific layouts and data types within documents.
Analysis: The Document AI system then analyzes the textual and visual information and interprets the content. It can evaluate sentiment, discern intent, map relationships between entities, and classify documents by type. This might include sophisticated operations like semantic analysis, understanding of context, and applying domain-specific rules for content review.
Output: The extracted information is then output in a format that can be used by downstream applications, such as data analytics tools, customer relationship management (CRM) systems, or other enterprise software.
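The steps above can be sketched as a chain of small functions. This is only a minimal skeleton under stated assumptions: every function name, the `Block` type, and the hard-coded page content are hypothetical stand-ins, and a real system would plug an actual PDF reader, layout detector, and OCR engine into the same slots.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One layout element on a page (illustrative fields only)."""
    kind: str        # e.g. "title", "text", "table"
    bbox: tuple      # (x0, y0, x1, y1) in pixels
    text: str = ""   # filled in by the OCR stage


def ingest(path):
    # Stand-in for loading a PDF; a real system would rasterize pages here.
    return {"path": path, "pages": ["<page-1 pixels>"]}


def preprocess(doc):
    # Stand-in for de-skewing and noise removal.
    doc["preprocessed"] = True
    return doc


def analyze_layout(doc):
    # Stand-in for a layout detector; returns hard-coded blocks.
    return [Block("title", (50, 40, 550, 80)), Block("text", (50, 100, 550, 300))]


def run_ocr(blocks):
    # Stand-in for OCR: attach placeholder text to each detected block.
    fake_text = {"title": "Invoice 0001", "text": "Total due: 120.00"}
    for block in blocks:
        block.text = fake_text.get(block.kind, "")
    return blocks


def extract(blocks):
    # Toy extraction: split "key: value" strings found in text blocks.
    fields = {}
    for block in blocks:
        if block.kind == "text" and ":" in block.text:
            key, value = block.text.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def classify(blocks):
    # Toy analysis step: label the document from its title text.
    titles = " ".join(b.text.lower() for b in blocks if b.kind == "title")
    return "invoice" if "invoice" in titles else "unknown"


def run_pipeline(path):
    doc = preprocess(ingest(path))
    blocks = run_ocr(analyze_layout(doc))
    # Output: a plain dict that a downstream application could consume.
    return {"source": path, "doc_type": classify(blocks), "fields": extract(blocks)}


result = run_pipeline("invoice.pdf")
print(result)
```

Keeping each stage behind its own function makes it easier to swap a cloud OCR call for an open-source model later without touching the rest of the pipeline.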
Transitioning to Newer Models (RNN, CNN, Transformers)
The move to newer models such as RNNs, CNNs, and above all Transformers reflects how quickly Document AI is evolving. RNNs are applied to sequential data, and CNNs to spatial pattern recognition.
Transformers, a more recent deep learning architecture, use self-attention mechanisms to capture context more effectively than either.
RNNs are particularly suited to sequential data, which is the norm in text-based documents. They capture context from the sequence of words and are useful in tasks that depend on the flow of text, such as sentiment analysis or content classification.
CNNs are adept at dealing with spatial data and can be used to extract features from images, including document images. They can detect typical patterns in the way a document is laid out, such as headers, footers, or more general paragraph structures, so they are also useful for partitioning a document into logical sections or when the visual formatting carries helpful discriminative information.
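To make the idea of a localized filter concrete, here is a minimal pure-Python sketch of a single convolution pass (strictly speaking a cross-correlation, which is what deep learning frameworks usually compute). The toy page and the hand-written kernel are assumptions for illustration; a trained CNN learns its kernels from data instead.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Toy 5x6 "page": 1 = ink, 0 = background. Row 2 is a horizontal rule,
# like the line that often separates a header from the body.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hand-written filter that responds to ink with blank space above and below.
line_kernel = [
    [-1, -1, -1],
    [2, 2, 2],
    [-1, -1, -1],
]

response = conv2d_valid(page, line_kernel)
for row in response:
    print(row)
```

The middle row of the response map is strongly positive where the filter is centred on the rule and negative just above and below it, which is the kind of local layout cue a CNN can pick up.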
Transformers, the most recent major shift in neural network architecture, have outperformed both RNNs and CNNs on a variety of natural language processing tasks. Unlike RNNs and CNNs, which process data serially or through localized filters, Transformers use self-attention to weigh every part of the input against every other part, irrespective of position. This enables a more sophisticated understanding of context and relationships within the document, which is critical for complex text analysis tasks. In Document AI, this translates into the following benefits:
- Enhanced Feature Extraction: Transformers can process and understand the entire content of a document in one go, capturing intricate relationships between different parts of the text and layout.
- Improved Accuracy: These models bring a higher level of accuracy in tasks like entity recognition, information extraction, and content categorization, thanks to their ability to understand context deeply.
- Efficient Processing: While RNNs process data sequentially, which can be time-consuming for long documents, Transformers handle sequences in parallel, significantly speeding up the analysis.
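The self-attention idea can be illustrated with a few lines of pure Python. This is a deliberately simplified sketch: it uses the token vectors themselves as queries, keys, and values, whereas a real Transformer applies learned projection matrices and multiple heads. The three toy token vectors are made up for the example.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Score this token against every token, wherever it sits in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of all token vectors.
        outputs.append([sum(w[t] * x[t][j] for t in range(len(x))) for j in range(d)])
    return outputs, weights


# Tokens 0 and 2 are identical but separated by an unrelated token 1.
tokens = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
]

outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

Token 0 attends to token 2 exactly as much as it attends to itself, even though they are not adjacent, and every row of scores can be computed independently of the others, which is what allows parallel processing.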
By moving towards these state-of-the-art models, businesses and individual practitioners can benefit from more nuanced and efficient processing of documents. This transition is particularly relevant for those seeking to break free from the high costs associated with cloud-based Document AI services, as there are open-source implementations of RNNs, CNNs, and Transformers available that can be trained and customized for specific use cases.
PubLayNet: A Comprehensive Dataset for Training and Benchmarking Document AI Models
Recognizing the layout of unstructured digital documents is an important step when parsing the documents into a structured machine-readable format for downstream applications. Deep neural networks developed for computer vision have proven to be an effective method for analyzing the layout of document images. However, the document layout datasets that are publicly available have typically been several orders of magnitude smaller than established computer vision datasets. As a result, models have had to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset.
The PubLayNet dataset for document layout analysis was built by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central™. Its size is comparable to established computer vision datasets: it contains over 360 thousand document images in which typical document layout elements are annotated.
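PubLayNet ships its annotations in COCO-style JSON, so a first look at the data usually means counting boxes per layout category. The sketch below runs on a tiny hand-made sample rather than the real files; the file name, image size, box coordinates, and category ids are illustrative assumptions.

```python
from collections import Counter

# Hand-made sample in COCO layout: "bbox" is [x, y, width, height].
sample = {
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
    "images": [
        {"id": 10, "file_name": "page_0001.jpg", "width": 612, "height": 792},
    ],
    "annotations": [
        {"id": 1, "image_id": 10, "category_id": 2, "bbox": [50, 40, 500, 30]},
        {"id": 2, "image_id": 10, "category_id": 1, "bbox": [50, 90, 500, 200]},
        {"id": 3, "image_id": 10, "category_id": 1, "bbox": [50, 300, 500, 150]},
        {"id": 4, "image_id": 10, "category_id": 4, "bbox": [50, 470, 500, 180]},
    ],
}


def count_by_category(coco):
    """Return how many annotated boxes each layout category has."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])


counts = count_by_category(sample)
print(dict(counts))
```

For the real dataset, the same function could be applied to the dictionary obtained from `json.load` on the annotation file.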
The state-of-the-art models in the next section were selected based on their results on the PubLayNet dataset.
Deep Dive into Top 5 State-of-the-Art Models
The image below shows the state-of-the-art models that achieved significant results on the PubLayNet dataset.
1. VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations
A Holistic Approach: The VSR (Vision, Semantics, Relations) framework ranked third with a mAP of 95.7% and embodies a holistic approach to document layout analysis. By integrating visual cues with semantic context and inter-component relations, VSR provides a comprehensive understanding of document structure.
Semantic Precision: VSR reaches 96.7% on text detection, which reflects its ability to parse dense textual layouts and recover meaningful information, a testament to its robust semantic feature extraction.
Visual Recognition: VSR performs creditably on structured elements, scoring 94.7% on list detection and 97.4% on table recognition, which indicates that it handles structured document elements effectively.
Understanding Document Dynamics: Also of note is VSR's score of 96.4% on figures, which shows a good ability to integrate visual understanding into the analysis.
Innovative Fusion of Modalities: Unlike traditional CV-based and NLP-based methods, the model architecture adaptively fuses visual and semantic features to overcome the limitations of each technique on its own. This lets VSR not only identify the relations between different components but also understand them, a pertinent requirement for comprehending intricate layouts.
The Backbone of Relational Understanding: VSR models the relations between layout components with a Graph Neural Network (GNN) on top of a ResNeXt-101 backbone, which suits the demanding requirements of document layout analysis.
The image below shows a document analyzed with the VSR model.
2. YOLO: Fast and Accurate Document Layout Detection
Next is the YOLO model, which has been widely used for object detection. According to a research paper, the YOLO model still takes the lead in document layout analysis.
What Makes YOLO Stand Out? The YOLO (You Only Look Once) model has been a revolutionary force in object detection tasks due to its speed and accuracy. Its application in document layout analysis is no exception. YOLO's ability to process images in real time allows it to quickly analyze document layouts, making it an exceptional tool for businesses that need to process large volumes of documents.
Recent Progress: There have been great strides in the architecture. The Multi Convolutional Deformable Separation (MCDS) module improves the model's feature extraction capability. The difference is most visible in text recognition, where YOLO reaches 95.1%, a distinguishing result in document layout analysis.
Integration of Multiple Scale Features: The introduction of the Decouple Fusion Head (DFH) gives YOLO multi-scale feature fusion, which results in sharper predictions. This shows in YOLO's score of 98.6% on table detection, reflecting its nuanced understanding of complex document structures.
Performance Credentials: With these upgrades, YOLO achieves a high mean Average Precision (mAP) of 95.1%. Its per-class results underscore its robustness in detecting and classifying layout elements: 95.1% for text, 87.6% for titles, 96% for lists, 98.6% for tables, and 98.2% for figures. Such precise analysis across components is invaluable to businesses that rely on comprehensive document analysis.
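As a quick sanity check on how the overall figure relates to the per-class ones, the short sketch below averages the five class scores quoted above. It assumes the reported mAP is the unweighted mean of the per-class average precisions, which is consistent with these particular numbers; the variable and function names are ours.

```python
# Per-class scores for YOLO on PubLayNet, as quoted in the text (percent).
yolo_ap = {
    "text": 95.1,
    "title": 87.6,
    "list": 96.0,
    "table": 98.6,
    "figure": 98.2,
}


def mean_ap(per_class_ap):
    """Unweighted mean of per-class average precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


yolo_map = mean_ap(yolo_ap)
print(round(yolo_map, 1))  # 95.1
```

The same helper makes it easy to see which class pulls a model's average down; here it is title detection.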
The image below shows the model's visualization results on the PubLayNet dataset.
3. YOLOv8: Adapting Object Detection Excellence to Document Layout Analysis
Beyond Objects: Traditionally renowned for its object detection supremacy, YOLOv8 demonstrates that its core competencies can extend beyond typical use cases. With an mAP of 93.6%, it adapts seamlessly to the realm of document layout analysis, underscoring its versatility.
Text and Title Detection: Despite its roots in object detection, YOLOv8 excels in text recognition with a high accuracy rate of 93.9%, affirming that it can capture textual nuances within documents. While title detection may seem less impressive at 85.1%, it's important to note that this is in comparison to its outstanding performance in other categories, and it still represents a strong capability in detecting key document headings.
Mastering Structure with Tables and Lists: Where YOLOv8 truly shines is in its ability to recognize and differentiate structured elements such as lists and tables, scoring 98.3% on each. This is particularly noteworthy, as these elements are crucial for data extraction and subsequent analysis.
Figuring Out Figures: The model also boasts a 97.6% accuracy in figure detection, which is essential for parsing and understanding the graphical content that is often pivotal in documents.
Speed Meets Accuracy: YOLOv8 uses a CSPDarknet-style backbone, a successor to the Darknet-53 backbone of YOLOv3, and provides a balance of speed and precision, making it well suited to applications that demand real-time processing without sacrificing accuracy.
Adaptable and Robust: YOLOv8's robust performance across varied document elements suggests its adaptability to document layout analysis—a task that shares many similarities with object detection in terms of feature recognition and classification. This adaptability makes YOLOv8 a powerful ally for businesses and individuals looking for efficient document AI solutions.
4. BEiT: BERT Pre-Training of Image Transformers
Revolutionizing Vision with Transformers: Ranking fourth with a mAP score of 93.5%, BEiT is a state-of-the-art model that brings self-supervised learning from natural language processing to the visual domain. In a nod to its NLP sibling BERT, BEiT stands for Bidirectional Encoder representation from Image Transformers.
Novel Dual-View Representation: Whereas NLP models process input text as a sequence of word tokens, BEiT pre-training gives each image two views: a sequence of local image patches and a set of discrete visual tokens. These dual perspectives help BEiT capture more of the richness of visual data, improving image understanding in a manner analogous to language understanding.
Masked Image Modeling (MIM): BEiT is pre-trained with Masked Image Modeling (MIM): some image patches are randomly masked, and the model predicts the original visual tokens from the corrupted image. This helps the model learn the important visual patterns rather than focusing on minute pixel-level details.
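The data preparation side of masked image modeling can be sketched in a few lines. This is a simplified illustration rather than BEiT's actual procedure: the toy 8x8 "image", the patch size, the uniform random masking, and the 40% mask ratio are all assumptions chosen for the example, and the image tokenizer that produces the visual tokens to predict is left out entirely.

```python
import random


def patchify(image, patch):
    """Split a 2D grid (list of rows) into non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    tiles = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            tiles.append([row[c:c + patch] for row in image[r:r + patch]])
    return tiles


def choose_masked(num_patches, ratio, rng):
    """Pick the indices of the patches that will be hidden from the model."""
    k = round(num_patches * ratio)
    return sorted(rng.sample(range(num_patches), k))


# Toy 8x8 "image" whose pixel values are just their row-major index.
image = [[r * 8 + c for c in range(8)] for r in range(8)]

patches = patchify(image, patch=2)   # 16 patches of 2x2 pixels
rng = random.Random(0)               # seeded for repeatability
masked = choose_masked(len(patches), ratio=0.4, rng=rng)

# During pre-training the model would see the unmasked patches and be asked
# to predict the visual token of each masked one.
visible = [i for i in range(len(patches)) if i not in masked]
print(len(patches), len(masked), len(visible))
```

With 16 patches and a 40% ratio, 6 patches are hidden and 10 remain visible as context for the prediction.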
Vision for the Future: BEiT achieves 93.4% in text detection and similarly strong numbers for title (87.4%), list (92.1%), table (97.3%), and figure (95.7%) detection, but it is about more than these numbers. The BEiT approach points toward more general foundation models that can underlie a wide variety of visual tasks, much as BERT reshaped the NLP community.
5. SRRV: A Novel Document Object Detector Based on Spatial-Related Relation and Vision
Integrating Space and Vision: SRRV, ranked fifth with a mAP of 92.2%, introduces a novel approach to document object detection by focusing on the spatial-related relationships among document objects. This method acknowledges the complexity and diversity of layouts and aims to capture the structural information and contextual dependencies that are often overlooked.
Strengthening Vision with Spatial Context: While the model's text detection stands at a solid 92.3%, it is the integration of spatial context that enhances its performance, allowing it to understand the layout with a more holistic perspective.
Rich Feature Extraction: The vision feature extraction network is at the core of SRRV, which augments the hierarchical feature pyramid to enhance information propagation. This process is critical for improving the precision in detecting various elements like lists (91.3%) and tables (93.4%).
Innovative Relation Feature Aggregation: SRRV's relation feature aggregation network takes document analysis further by employing a graph construction module to encode spatial relations and a graph learning module to aggregate these relations on a global scale. This dual approach allows SRRV to consider the document's structure more comprehensively, as reflected in its ability to detect figures with a 90.6% score.
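To give a feel for what encoding spatial relations in a graph can look like, here is a minimal sketch that links each layout box to its nearest neighbour and stores the relative offset on the edge. It is not SRRV's actual graph construction module: the nearest-neighbour rule, the edge features, and the example boxes are all assumptions made for illustration.

```python
import math


def center(box):
    """Centre point of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def build_knn_graph(boxes, k):
    """Connect each box to its k nearest neighbours by centre distance."""
    edges = []
    for i, a in enumerate(boxes):
        ax, ay = center(a)
        candidates = []
        for j, b in enumerate(boxes):
            if i == j:
                continue
            bx, by = center(b)
            dx, dy = bx - ax, by - ay
            candidates.append((math.hypot(dx, dy), j, dx, dy))
        for dist, j, dx, dy in sorted(candidates)[:k]:
            # The relative offset is the spatial relation stored on the edge.
            edges.append({"src": i, "dst": j, "dx": dx, "dy": dy, "dist": dist})
    return edges


boxes = [
    (50, 40, 550, 80),     # 0: title
    (50, 100, 550, 300),   # 1: paragraph
    (50, 320, 550, 500),   # 2: table
]

edges = build_knn_graph(boxes, k=1)
for edge in edges:
    print(edge["src"], "->", edge["dst"], "dy =", edge["dy"])
```

A graph learning module would then pass information along these edges so that, for example, a box directly below a title is more readily interpreted as body text.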
Result Refinement for Precision: The innovative result refinement network in SRRV fuses vision and relation features for better feature distribution and relational reasoning, leading to more accurate detection and bounding box predictions. This refinement is essential for the precise detection of document objects, including the intricate details of tables and figures.
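Bounding box quality is usually measured with intersection over union (IoU), the overlap between a predicted box and the ground-truth box divided by the area they cover together. Detection scores such as mAP are built on this measure. The sketch below is a generic IoU helper with made-up example boxes, not code from the SRRV paper.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


ground_truth = (100, 100, 300, 200)   # annotated table region
prediction = (110, 100, 300, 200)     # predicted box, 10 px short on the left

print(iou(ground_truth, prediction))                 # 0.95
print(iou(ground_truth, (400, 400, 500, 500)))       # 0.0
```

A prediction only counts as a correct detection when its IoU with a ground-truth box of the same class exceeds a chosen threshold, so small improvements in box refinement can change the reported scores.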
A Comprehensive Document Understanding: SRRV represents a significant step in document object detection by not just looking at the visual features but also by considering the spatial relationships between objects. This approach ensures a more comprehensive understanding of the document layout, which is vital for applications like information retrieval and document editing.
The images below are examples of document object detection tasks using SRRV. (red: title, green: text, blue: figure, yellow: table, cyan: list)
Industry-Leading Document AI Services and their Costs:
Document AI services streamline the extraction and analysis of data from documents. Here's how three major cloud services stack up:
Please note, the figures below are simplified and do not capture all pricing details or scenarios. For the most accurate and comprehensive pricing, consult the respective service's pricing page.
Google Document AI
Google Document AI is a suite of tools that use AI to process and analyze documents. It includes a variety of features, such as document classification, optical character recognition, and document redaction. For high-volume needs, processing 6 million pages with the Form Parser could cost $375,000—$325,000 for the first 5 million, and $50,000 for the additional million.
Amazon Textract
Amazon Textract is a cloud-based service that uses AI to extract text, key-value pairs, and tables from documents. For 2 million pages using its Tables + Forms + Queries feature, the service may cost $125,000: $70.00 per 1,000 pages for the first million, then $55.00 per 1,000 pages thereafter.
Microsoft Form Recognizer
Azure's Form Recognizer supports various formats like PDFs and images, focusing on text and table extraction. A volume of 500K pages processed through Custom Field Extraction would cost a flat rate of $12,500.
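The three examples above all follow the same tiered, per-1,000-pages pattern, so they can be reproduced with one small helper. Treat this as a rough sketch: the Amazon rates are the ones quoted above, while the Google ($65 and $50 per 1,000 pages) and Azure ($25 per 1,000 pages) rates are back-calculated from the totals in the text and are assumptions rather than official price lists.

```python
def tiered_cost(pages, tiers):
    """Cost in USD for `pages`, given tiers of (tier_size or None, usd_per_1000_pages).

    A tier size of None means "all remaining pages".
    """
    remaining, total = pages, 0.0
    for size, rate in tiers:
        in_tier = remaining if size is None else min(remaining, size)
        total += in_tier / 1000 * rate
        remaining -= in_tier
        if remaining <= 0:
            break
    return total


# Rates per 1,000 pages; see the caveats in the paragraph above.
google_form_parser = [(5_000_000, 65.0), (None, 50.0)]
textract_tables_forms_queries = [(1_000_000, 70.0), (None, 55.0)]
azure_custom_extraction = [(None, 25.0)]

google_cost = tiered_cost(6_000_000, google_form_parser)
textract_cost = tiered_cost(2_000_000, textract_tables_forms_queries)
azure_cost = tiered_cost(500_000, azure_custom_extraction)

print(google_cost, textract_cost, azure_cost)  # 375000.0 125000.0 12500.0
```

Plugging in your own monthly page count gives a quick estimate to compare against the cost of running an open-source model yourself.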
In conclusion, our exploration into Document AI has revealed a diverse landscape of services and state-of-the-art models, each with unique strengths tailored to different business needs.
While industry giants like Google, AWS, and Azure offer powerful, scalable solutions, they come with a price that may not align with every budget, especially when full accuracy is not a prerequisite.
As we navigate through a data-dense future, it's clear that machine learning and AI are invaluable in managing the vast sea of unstructured data, which dominates the digital realm.
The trade-offs between cost and precision call for a strategic approach to selecting Document AI tools, where open-source alternatives and models like YOLO and VSR present viable options. These models potentially offer a more tailored and cost-efficient path for businesses seeking to harness the power of Document AI without significant investment.