Version: latest

Multi page extraction model

Introduction

The multi page extraction model is suitable for processing unstructured, multi-page documents such as contracts, bidding announcements, and resumes. Creating a multi page extraction model lets you:

  • Evaluate how well the pre-trained model provided by Document Extraction performs on your real business data
  • Define verification rules based on business requirements to automate document approval

Business scenario description

For example, the telemarketing team of a construction company screens business opportunities from various channels every day and reviews whether each one is worth bidding on. Suppose a bidding announcement is collected and the following conclusion is expected:

A project located in Beijing, with an investment of more than 2.5 million yuan and a required construction period of more than 30 days, is suitable for further tracking.

It sounds simple, but a lot happens behind it. A quick analysis breaks the work into four steps:

  1. Obtain the latest bidding announcements in real time from various government open platforms and upload them to the enterprise knowledge base
  2. After a new bidding announcement arrives, business person A enters its key information into the system
  3. Business person B reviews whether the project is suitable for further investment based on the key information
  4. Push the information to the responsible salesperson, who contacts the bidding enterprise and completes the subsequent bidding work

We can use RPA to complete steps 1 and 4, and document understanding to complete steps 2 and 3: configure the extraction model and the verification rules for machine review in the multi page extraction model, and perform the manual second review in The Hub or another business system.

Characteristics

The multi page extraction model has the following characteristics:

  • Easy to use: each step comes with built-in guidance. The whole process of "data management -> annotation -> evaluation -> verification configuration -> release" is completed without code, so you can learn hands-on how to create a document understanding agent that is ready for production.
  • Flexible configuration: verification rules can be configured through presets or code, so users can implement personalized verification logic according to business needs.

Usage

Next, taking the bidding announcement screening described above as an example, we create an intelligent review model for bidding announcements. It uses the pre-trained bidding announcement model provided by the Document Extraction AI Capability for extraction, and adds user-defined rules to check that:

  1. The project is in Beijing
  2. The investment amount is more than 2.5 million yuan
  3. The construction period is more than 30 days

If business personnel confirm that a bidding announcement meets all three conditions, they can mark it as suitable for tracking.

Create model

1) After logging in to the platform, click Document understanding in the left navigation bar and enter Multi page extraction model.

2) Click New model and create a model named Intelligent review of bidding announcement.

  • OCR engine: because the model's input is non-plain-text content such as images and PDFs, OCR recognition is required. The quality of the OCR result affects the quality of the subsequent extraction, so please evaluate the OCR engine's recognition before selecting it.
  • Pre-trained model: the platform does not currently support training your own model. If you need extraction, please select the corresponding pre-trained model.

semanticModel1

3) Click Start or Configure to enter the model configuration.

semanticModel2

4) After entering the model configuration, you will see the Work progress guide in the upper right corner. Please complete the following steps by following it.

semanticModel3

New field

1) Click step 1 (new field) in the work progress and click Go to new to enter the Fields tab.

2) Create the fields you want the model to extract from your documents. Here, we add a field named Suitable for tracking.

  • If a pre-trained model is selected, all fields supported by that model are created automatically.
  • A field name cannot exceed 20 characters.
  • The field type can be string or array; the type cannot be modified after the field is created.
  • Be careful when deleting fields created from the model, because the pre-trained model returns its results by field name. If you accidentally delete the field [project name] created by the model, you can create a new field with the same name [project name].

semanticModel4

Upload data

1) Click step 2 (upload data) in the work progress and click Go upload to enter the Data tab.

2) Upload relevant business data to data management. After the data is uploaded, OCR recognition runs automatically; annotation can start only after recognition completes.

  • The data in data management can be used both to evaluate the pre-trained model and to configure verification rules
  • If you want to try the feature but have no suitable documents, you can obtain test samples for the pre-trained model via Document Extraction -> Document Extraction test -> obtain test samples

semanticModel5

Label data and build evaluation set

1) Click step 3 (annotate data) in the work progress and click Go label to enter the Data tab.

2) Click Tagging on any piece of data to enter the annotation page.

  • If a pre-trained model is selected for the model, the data is pre-labeled by that model after uploading

3) The annotation page provides two annotation methods: word marking and box selection. After you select the area containing a field value, an annotation pop-up appears automatically, where you can correct the result and choose the field. Finally, click OK to save the annotation.

  • If the field type is array, you can label multiple values
  • If the field type is string, a second annotation overwrites the previous one
  • If a field does not appear in the document, mark it as Not present
  • After all fields of a piece of data are labeled, its status changes to Marked

4) On the annotation page, add the annotated data to the evaluation set, which is used to evaluate the model.

semanticModel6

New version

1) Click step 5 (new version) in the work progress and click Go to new to enter the Versions tab.

2) Click New version to create a version named V1.

semanticModel7

Evaluate the pre-trained model

1) Click step 6 (evaluation) in the work progress and click Go evaluate to enter the Versions tab.

2) Click Evaluating on version V1 to start the model evaluation.

  • This evaluates the pre-trained model provided by Document Extraction

3) After the evaluation completes, click the version's Last evaluation F1 value to download the evaluation report.

  • The evaluation report has four sheets: result overview, field extraction statistics, all extraction results, and Document Extraction details, so you can view the model's performance from different dimensions

semanticModel8
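As background, the F1 value shown on the version combines extraction precision and recall; F1 is conventionally defined as their harmonic mean. The sketch below is illustrative only (it is not the platform's reporting code):

```javascript
// F1 = 2 * precision * recall / (precision + recall)
// Returns 0 when both inputs are 0 to avoid dividing by zero.
function f1(precision, recall) {
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}
```

A model that extracts fields with 0.8 precision and 0.8 recall therefore reports an F1 of 0.8, while a model that never extracts anything correctly reports 0.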

Configure verification rules

1) Click step 7 (verification) in the work progress and click Go configure to enter the Versions tab.

2) Click Check on version V1 to enter the verification rule configuration page.

semanticModel9

3) On the verification rule configuration page, the left side is the document preview area and the right side has three tabs:

  • Recognition result: displays the document's OCR recognition result
  • Rule configuration: configures the verification rules
  • Test result: displays the test results of the verification rules
  • The document previewed here comes from the model's Data tab; it may be in either the Marked or Not marked state

4) The recognition result tab shows the output of OCR recognition.

semanticModel10

5) The field test values in the rule configuration come from the extraction results returned by the pre-trained model when the data was uploaded.

  • If no pre-trained model is selected, the fields' test values are empty
  • You can use the existing test values directly, or modify them on the page
  • Changes here do not affect the data's annotations

semanticModel11

6) Create preset rules.

  • Tick Required on a field, and a verification rule is generated automatically

semanticModel12

7) Create custom rules.

  • Click Add rule, enter the rule's name, select the field to be verified, and click Next
  • In the verification content, write the verification code according to the comments, then click OK to submit
  • The verification code is written in JavaScript; refer to a JavaScript tutorial if needed
  • Example verification rules are provided in the FAQ section for reference

semanticModel13
semanticModel14
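A verification rule is a single JavaScript function named `check` that receives the extracted field values as an object of strings and returns a `{success, message}` result. Below is a minimal skeleton consistent with the examples in the FAQ section; the field name "FIELD_NAME" is a placeholder, not a real field:

```javascript
/**
 * Minimal verification-rule skeleton. "FIELD_NAME" is a placeholder;
 * replace it with the field actually selected for this rule.
 * @param data - object mapping field names to extracted string values
 * @returns {success: boolean, message: string} result
 */
const check = function (data) {
  // Initialize the returned result as a failure
  const result = { success: false, message: "Undefined" };

  // Reject an empty extraction before running business checks
  if (data["FIELD_NAME"] == "") {
    result.message = "Error: this field must not be empty";
    return result;
  }

  // Add business-specific checks on data["FIELD_NAME"] here
  result.success = true;
  result.message = "Passed";
  return result;
};
```

The skeleton only checks for an empty value; the FAQ examples show how to add numeric and substring checks on top of it.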

8) Click Start testing. The model takes the field test values from the rule configuration page as input and tests whether the verification rules pass.

semanticModel15

9) Change a field's test value to verify that the rule configuration behaves as expected.

semanticModel16
semanticModel17

Release version

1) Click step 8 (release) in the work progress and click Go publish to enter the Versions tab.

2) Click Release on version V1 to publish the current version.

3) After publishing, click Test effect on the model in the model list to check whether the results meet expectations.

FAQ

Deduction logic

Document understanding is a composite AI Capability that may call other AI Capabilities such as OCR character recognition, OCR table recognition, and pre-trained extraction models. The platform deducts document understanding usage as number of pages x number of AI Capabilities called. The deductions related to model management are:

  • Data management: after data is uploaded, the model automatically runs OCR character recognition + OCR table recognition + pre-trained extraction, deducting Pages x3 times
  • Versions > evaluation: after clicking Evaluate, the model calls the extraction model on every file in the evaluation set, deducting Pages x1 times
  • Model > test: after clicking Test, the model runs OCR character recognition + OCR table recognition + the pre-trained model, deducting Pages x3 times
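The deduction arithmetic above can be sketched as a small lookup. This is an illustration of the pages x calls rule only; the stage names used as keys are ours, not platform identifiers:

```javascript
// AI Capability calls made per page at each stage (from the list above)
const callsPerPage = {
  upload: 3,   // OCR character + OCR table + pre-trained extraction
  evaluate: 1, // pre-trained extraction only
  test: 3      // OCR character + OCR table + pre-trained extraction
};

// Deducted times = number of pages x number of AI Capability calls
function deductedTimes(stage, pages) {
  return pages * callsPerPage[stage];
}
```

For example, uploading a 10-page document deducts 30 times, while evaluating the same document deducts 10.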

What is the difference between this and Custom Template Recognition?

Custom Template Recognition handles single-page, fixed-layout structured or semi-structured documents. Document understanding handles multi-page, unstructured documents without a fixed layout.

Verification rule examples

Example 1: the investment amount is more than 2.5 million

/**
* @param {
* "投资额": string;
* } data
* @returns {success: boolean, message: string} result
*/
const check = function (data) {

//初始化返回结果
const result = {
success: false, message: "未定义"
}

if(data["投资额"] == ""){
result.success = false;
result.message = "错误,该字段不能为空"
return result
}
let temp = parseFloat(data["投资额"])
if (temp > 250) {
result.success = true;
}else{
result.success = false;
result.message = "错误,投资额<250万"
}

return result
};

Example 2: the project must be in Beijing

```javascript
/**
 * @param {
 *   "归属地": string;
 * } data
 * @returns {success: boolean, message: string} result
 */
const check = function (data) {
  // Initialize the returned result
  const result = {
    success: false,
    message: "Undefined"
  };

  if (data["归属地"] == "") {
    result.success = false;
    result.message = "Error: this field must not be empty";
    return result;
  }

  if (data["归属地"].includes("北京")) {
    result.success = true;
    result.message = "Passed";
  } else {
    result.success = false;
    result.message = "Error: the project is not in Beijing";
  }
  return result;
};
```
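Example 3: the construction period is more than 30 days. This sketch covers the third condition from the usage section; it assumes the extraction returns the construction period in a field named "工期" whose value starts with the number of days (e.g. "45天"). Both the field name and the value format are assumptions to adapt for your own model:

```javascript
/**
 * @param {
 *   "工期": string;   // assumed field name for the construction period
 * } data
 * @returns {success: boolean, message: string} result
 */
const check = function (data) {
  // Initialize the returned result
  const result = {
    success: false,
    message: "Undefined"
  };

  if (data["工期"] == "") {
    result.success = false;
    result.message = "Error: this field must not be empty";
    return result;
  }

  // parseFloat reads the leading number, so "45天" becomes 45;
  // a non-numeric value yields NaN and fails the comparison below
  const days = parseFloat(data["工期"]);
  if (days > 30) {
    result.success = true;
    result.message = "Passed";
  } else {
    result.success = false;
    result.message = "Error: construction period is not more than 30 days";
  }
  return result;
};
```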