Document Extraction

Business scenario description

Document extraction is an out-of-the-box intelligent document understanding capability provided by Laiye IDP. It helps users extract key information from documents and turn unstructured long text into structured data.

RPA robots can simulate the mechanical, repetitive tasks in human processes according to pre-designed rules, and can take over a large volume of work with fixed rules, high repeatability, and low added value. Processes that involve document processing, however, still require a lot of human involvement. Take the following three business scenarios as examples:

In the red-headed file archiving scenario of a government body or state-owned enterprise, archivists need to scan the files into electronic versions and enter information about each red-headed file (such as file title, issuing number, and issuing authority) into the archival system according to the archiving requirements.

In the bidding monitoring scenario of an equipment sales enterprise, the business development specialist needs to browse a variety of bidding websites every day, collect the latest tender announcements, carry out a preliminary screening of the bidding information, and record the tender announcement information in the CRM system. If the enterprise's qualifications meet the requirements of a tender announcement, the sales staff then contact the customer and proceed with the bid.

In the corporate recruitment scenario, the HR specialist needs to enter the candidate's information into the talent pool after receiving a resume.

In the three business scenarios above, RPA can handle the mechanical, repetitive tasks of the process, but it cannot intelligently extract key information from the documents. Take the tender announcement as an example. An RPA robot can capture tender announcements from different websites, but without the assistance of AI, business personnel still need to read every captured announcement and manually enter its information into the CRM system.

Adding the document extraction AI capability to the process helps business personnel process business documents quickly, improves work efficiency, and frees them from repetitive reading and typing so that they can focus on higher-value work.

Characteristics

Document extraction has the following characteristics:

  • Intelligent extraction: Extraction results do not come entirely from the original document text, and the model treats different fields differently. For example, the model structures the address in the extraction result for the place of attribution, and the business type in a bidding announcement comes from a classification model.
  • Easy to use: Extraction results are annotated in different colors. If an extraction result comes from the original text, you can click it to quickly locate it in the document.
  • Multiple formats: Supports the jpeg, jpg, png, bmp, tiff, pdf, and docx/doc formats.
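Before uploading a batch of files, it can help to filter out those whose format is not in the list above. The following is a minimal sketch in Python, assuming that the format can be judged from the file extension alone; the function names are illustrative and are not part of Laiye IDP.

```python
from pathlib import Path

# Formats listed in this document as supported by document extraction.
SUPPORTED_EXTENSIONS = {
    ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".pdf", ".docx", ".doc",
}


def is_supported_document(path: str) -> bool:
    """Return True if the file extension is one of the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(paths):
    """Partition paths into (supported, unsupported) lists, keeping order."""
    supported, unsupported = [], []
    for p in paths:
        (supported if is_supported_document(p) else unsupported).append(p)
    return supported, unsupported
```

The check is case-insensitive, so a file named tender.PDF is accepted. It looks only at the file name, not at the file contents.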

Instructions

Text version

Create a new model

1 Log in to the Laiye IDP platform and go to the document extraction model page via Pretrained Capability/Text Understand/Document Extraction.

2 Create a document understanding model and select the OCR engine and document type based on service requirements.

Test

1 Click Test on the document extraction model to open the model's test page.

2 If you need a test sample, click Obtain the test sample; otherwise, skip this step.

3 Upload a document, click Start testing, and obtain the extraction results.

Note: If the document has many pages, extraction may take a while. The page displays the extraction progress in real time.

Extract results

Different document types correspond to different document extraction models. After the test, the extraction results of all fields supported by the current model will be displayed on the visual page.

There are three main types of result:

  • Extraction of the original text
    • The result is extracted from the original text of the test document.
    • Clicking the field content in the list updates the document preview and highlights the corresponding annotation area.
  • Non-textual extraction
    • The result is derived from the model's understanding of the test document, possibly using classification models, normalization, and so on.
    • There is no mark in the document preview area, and clicking the field content in the list does not update the document preview.
  • Not extracted
    • The extraction result is -
    • The model did not extract the current field from the test document.
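The three result types can be told apart by two observations: whether the value is the - placeholder, and whether the value has an annotation area in the document preview. The following is a minimal sketch in Python of that decision. The has_annotation flag and the function name are assumptions made for illustration; this document does not describe how Laiye IDP represents results programmatically.

```python
from enum import Enum
from typing import Optional


class ResultKind(Enum):
    ORIGINAL_TEXT = "extraction of the original text"
    NON_TEXTUAL = "non-textual extraction"
    NOT_EXTRACTED = "not extracted"


# Placeholder shown on the results page for a field that was not extracted.
NOT_EXTRACTED_MARK = "-"


def classify_result(value: Optional[str], has_annotation: bool) -> ResultKind:
    """Classify one field result.

    has_annotation is a hypothetical flag meaning "this value is marked
    in the document preview"; it is not a documented Laiye IDP attribute.
    """
    if value is None or value.strip() in ("", NOT_EXTRACTED_MARK):
        return ResultKind.NOT_EXTRACTED
    if has_annotation:
        return ResultKind.ORIGINAL_TEXT
    return ResultKind.NON_TEXTUAL
```

A downstream process might, for example, route only non-textual results to manual review, since they cannot be checked against a highlighted region of the source document.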

Extraction fields by document type

Type: Invoice

Index  Field name           Key
1      Invoice Number       invoice_number
2      Vendor Name          vendor_name
3      Vendor Address       vendor_address
4      Invoice Issued Date  invoice_issued_date
5      Invoice Due Date     invoice_due_date
6      Payment Terms        payment_terms
7      Description          description
8      Quantity             quantity
9      Unit Price           unit_price
10     Subtotal             subtotal
11     Currency             currency
12     Tax Amount           tax_amount
13     Total Amount Due     total_amount_due
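When invoice results are passed to another system, it is often convenient to have one record that contains every key, with unextracted fields made explicit. The following is a minimal sketch in Python. It assumes that the results are available as a flat dictionary keyed by the keys in the table above and that an unextracted field carries the - placeholder; both are assumptions for illustration, not a documented Laiye IDP output format.

```python
# Keys of the Invoice document type, in table order.
INVOICE_KEYS = [
    "invoice_number", "vendor_name", "vendor_address",
    "invoice_issued_date", "invoice_due_date", "payment_terms",
    "description", "quantity", "unit_price", "subtotal",
    "currency", "tax_amount", "total_amount_due",
]


def normalize_invoice(raw: dict) -> dict:
    """Return a record with every invoice key; unextracted fields become None."""
    record = {}
    for key in INVOICE_KEYS:
        value = raw.get(key)
        # Treat a missing key, an empty string, and "-" as "not extracted".
        record[key] = None if value in (None, "", "-") else value
    return record


def missing_fields(record: dict) -> list:
    """List the invoice keys that have no extracted value."""
    return [key for key in INVOICE_KEYS if record[key] is None]
```

For example, normalize_invoice({"invoice_number": "INV-001", "currency": "-"}) yields a record with 13 keys in which only invoice_number has a value.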

Type: Purchase Order

Index  Field name                 Key
1      PO Num                     po_num
2      PO Date                    po_date
3      Delivery Date              delivery_date
4      Vendor Code                vendor_code
5      Vendor Name                vendor_name
6      Vendor Address             vendor_address
7      Vendor E-mail              vendor_email
8      Vendor Phone               vendor_phone
9      Customer Code              customer_code
10     Customer Name              customer_name
11     Customer Address           customer_address
12     Customer Buyer Name        customer_buyer_name
13     Customer Delivery Name     customer_delivery_name
14     Customer Delivery Address  customer_delivery_address
15     Customer Delivery E-mail   customer_delivery_email
16     Customer Delivery Phone    customer_delivery_phone
17     Customer Billing Name      customer_billing_name
18     Customer Billing Address   customer_billing_address
19     Customer Billing E-mail    customer_billing_email
20     Customer Billing Phone     customer_billing_phone
21     Currency                   currency
22     Term of Payment            payment_term
23     Line Number                line_number
24     item Name                  item_name
25     item Code                  item_code
26     item type                  item_type
27     item description           item_description
28     item Unit                  item_unit
29     item Quantity              item_quantity
30     item Unit Price            item_unit_price
31     item Discount              item_discount
32     item Amount                item_amount
33     item Delivery Date         item_delivery_date
34     Total                      total
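The purchase order keys appear to fall into two groups: order-level fields (such as po_num and total) and line-level fields (line_number and the item_ keys). The following is a minimal sketch in Python that separates the two. The grouping is an interpretation of the table above, and the flat dictionary input is an assumption for illustration; neither is a documented Laiye IDP structure.

```python
# Keys 23 to 33 of the Purchase Order table, read here as line-level fields.
# This grouping is an assumption based on the key names.
PO_LINE_ITEM_KEYS = {
    "line_number", "item_name", "item_code", "item_type",
    "item_description", "item_unit", "item_quantity",
    "item_unit_price", "item_discount", "item_amount",
    "item_delivery_date",
}


def split_po_fields(flat: dict):
    """Split a flat {key: value} result into (order_fields, line_fields)."""
    order_fields = {k: v for k, v in flat.items() if k not in PO_LINE_ITEM_KEYS}
    line_fields = {k: v for k, v in flat.items() if k in PO_LINE_ITEM_KEYS}
    return order_fields, line_fields
```

Keeping the two groups apart makes it easier to write the order-level fields once and the line-level fields once per order line when loading the data into a target system.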