Version: V3.16

Document Understand

Introduction

Document Understand is a comprehensive AI capability provided by Laiye IDP.

It builds on the existing OCR and NLP atomic capabilities and the deep learning models of the Laiye IDP platform to help robots understand documents, for example by classifying them and extracting key information.

Document type

We classify business data into three categories: structured, semi-structured, and unstructured. Their main characteristics are:

  • Structured
    • The document layout is fixed, and there is no layout difference between samples
    • For example, an information collection form used when handling services has a fixed style, and users fill in the blanks
  • Semi-structured
    • The document layout is relatively fixed. The same content needs to be extracted from every sample, but its location may differ from one sample to another
    • For example, when a company purchases goods from different suppliers, each supplier uses its own format, but every delivery order contains the order number, product information, and so on
  • Unstructured
    • The document has no distinctive layout, and the same content to be extracted may be expressed in different ways across samples
    • For example, contracts, resumes, and similar documents are almost always written as plain running text

Function introduction

Based on the document processing task, we divide the capabilities in Document Understand into three types: Classification, Extraction, and Comparison. The applicable scenarios of each AI capability are as follows:

  • Document Classification

    • A Document Classification model can be trained by annotating a small amount of data, and it then recommends a category for each document.
    • It is suitable for classifying an entire document, and it also supports category recommendations for each page of a document.
  • Intelligent Template Recognition

    • An Intelligent Template Recognition model can be trained by labeling a small amount of data, and the model then extracts key information from documents.

    • It is ideal for structured or semi-structured documents, such as delivery notes and non-standardized notes.

    • A document extraction model can be trained by annotating only the key information in the documents.

    • We offer two types of training:

      • Template self-training suits scenarios with little training data and structured or semi-structured documents, such as delivery orders and non-standardized bills.

      • Document self-training suits scenarios with a large amount of training data, and it is also well suited to unstructured documents, such as contracts, tender announcements, and resumes.
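
The guidance on the two training types can be read as a simple decision rule. The Python sketch below is only an illustration of that rule and is not part of Laiye IDP: the names `DocumentType`, `SMALL_DATASET_LIMIT`, and `suggest_training_type` are hypothetical, and the threshold of 50 samples is an assumed placeholder, because the documentation does not say how many samples count as "little" or "a large amount" of training data.

```python
from enum import Enum


class DocumentType(Enum):
    """The three document categories described under "Document type"."""
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Assumed cut-off for illustration only; the documentation gives no number.
SMALL_DATASET_LIMIT = 50


def suggest_training_type(doc_type: DocumentType, sample_count: int) -> str:
    """Suggest a training type from the document category and sample count."""
    # Unstructured documents (contracts, resumes, ...) point to document self-training.
    if doc_type is DocumentType.UNSTRUCTURED:
        return "document self-training"
    # Structured or semi-structured documents with little data point to template self-training.
    if sample_count <= SMALL_DATASET_LIMIT:
        return "template self-training"
    # With a large amount of training data, document self-training is the suggested option.
    return "document self-training"


if __name__ == "__main__":
    print(suggest_training_type(DocumentType.SEMI_STRUCTURED, 20))
    print(suggest_training_type(DocumentType.UNSTRUCTURED, 20))
```

In practice the choice also depends on how much the layouts vary between samples, so the result is best treated as a starting point rather than a fixed rule.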