ABBYY has long worked in the field of natural language processing (NLP). These technologies are at the heart of many of ABBYY's solutions for analyzing and extracting data. We use them to help industrial giants like Npos search document archives accumulated over a century, and one of Russia's largest banks uses our technology to monitor news feeds and manage risk.

In this article, we will explain how NLP techniques extract information from text. Rather than data in tables and well-structured forms, we will focus on multi-page unstructured documents, such as lease agreements, medical records, and the like.

Then we'll show some practical applications: how to extract X entities from 200 pages of banking protocols in X minutes, how to verify the accuracy of legal contracts, or how to quickly pull information about rare adverse reactions from a large number of medical articles. Our experience shows that companies need this data to be both accurate and fast, because businesses and people's well-being depend on it.

At the end of this article, we'll discuss some of the difficulties we've encountered in such projects and the solutions we've adopted.

So, what did we do?

Natural language processing and analysis are ubiquitous: they filter spam from email inboxes, power machine translation systems, recognize speech, and train chatbots. With ABBYY's NLP technology, banks, industrial companies, and other organizations can quickly extract and structure large amounts of information from business documents. Large companies have long tried to automate, or at least reduce, routine operations such as searching paper documents for dates, names, and invoice numbers, and then feeding that data into corporate information systems for verification.

Note that initially we extracted data from documents based on geometric features, chiefly the structure and arrangement of lines and fields. This approach is still convenient for documents with a structured layout: forms, questionnaires, application forms, census tables, and so on.

However, important information is not stored only in forms, so we have trained our NLP solution to extract data from unstructured or extremely complex documents. You may remember ABBYY Compreno, a technology created for analyzing and understanding natural language. Developed and refined over the years, it now forms the basis of many of our NLP solutions.

In the rest of this article, we'll discuss extracting information from unstructured documents such as contracts, medical records, and news feeds.

How NLP technology works

Conceptually, a document's path from input to data extraction looks like this:

Suppose you want to extract, from a 50-page contract, the date and place of signing and the name of the company that signed it. How do you find the page the customer is looking for? With this technique, we proceed in the following stages.

Segmentation stage

By segmenting the document, we narrow down the search: instead of dealing with all 50 pages, we work only with the parts of it, say the five paragraphs, that might contain the date we are looking for. This makes the algorithm's job easier and makes it simpler to distinguish the desired date from all the other dates.

All the stages to the right of segmentation in the figure make up the operational flow of the NLP algorithm proper: the detailed reading and understanding of the text. These processes take 10-20 times longer than sorting and segmentation, so it makes no sense to run them on an entire multi-page document; they work better on a lightweight slice of the text.
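As a rough illustration of the idea (a minimal sketch, not ABBYY's actual implementation), here is how a long contract could be filtered down to a handful of candidate paragraphs before any heavy analysis runs; the hint keywords and the date pattern below are assumptions chosen for the example.

```python
import re

# Keep only paragraphs that look like they could mention the signing date,
# so the expensive NLP stages run on a few paragraphs instead of 50 pages.
DATE_HINTS = ("dated", "entered into", "as of", "effective")  # assumed keywords
DATE_PATTERN = re.compile(r"\b\d{1,2}\s+\w+\s+\d{4}\b|\b\w+\s+\d{1,2},\s*\d{4}\b")

def candidate_paragraphs(document_text: str, limit: int = 5) -> list[str]:
    paragraphs = [p.strip() for p in document_text.split("\n\n") if p.strip()]
    scored = []
    for p in paragraphs:
        score = sum(hint in p.lower() for hint in DATE_HINTS)
        if DATE_PATTERN.search(p):
            score += 1
        if score:
            scored.append((score, p))
    # Highest-scoring paragraphs first; only these go to the heavy NLP stages.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [p for _, p in scored[:limit]]
```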

How the NLP parser and BiLSTM work

The NLP parser and the BiLSTM extract attributes (features) from each sentence of the text. This is possible thanks to ABBYY Compreno technology, which is part of FlexiCapture. The engine reads the text in detail and extracts a large number of general features from it: it captures not only the facts stated in a sentence, but also what the sentence actually means.
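To make the BiLSTM part more concrete, here is a minimal PyTorch sketch of a bidirectional LSTM tagger over token embeddings. It is an illustrative assumption, not ABBYY's engine, and the layer sizes and tag count are made up for the example.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Toy BiLSTM: token embeddings pass through a bidirectional LSTM, and the
    concatenated forward/backward states become per-token features for tagging."""
    def __init__(self, vocab_size: int, embed_dim: int = 100,
                 hidden_dim: int = 128, num_tags: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> tag scores: (batch, seq_len, num_tags)
        embedded = self.embed(token_ids)
        features, _ = self.lstm(embedded)  # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(features)

# Example: tag scores for a batch of one 12-token "sentence" of random ids.
scores = BiLSTMTagger(vocab_size=20000)(torch.randint(0, 20000, (1, 12)))
```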

Feature extraction is a lengthy step because these are high-level attributes. Roughly speaking, they indicate that in this fragment, something that looks like a name performs something that looks like an action on an object belonging to some semantic class. Next, a fairly simple and traditional machine learning method, a gradient boosting machine (GBM), is applied to these extracted high-level features. This is an ensemble of decision trees that produces the final decision and highlights the extracted fields. For GBM to learn quickly and extract high-quality information, there must be enough documents for training. The fewer the documents, the worse the quality of extraction: with few examples, the model is less able to generalize (that is, to distinguish one-off cases from frequent ones).
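A hedged sketch of this second stage might look as follows: a gradient-boosted tree ensemble (here scikit-learn's GradientBoostingClassifier) scores candidate spans using high-level features from the previous stage. The feature layout and labels are invented for illustration.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Each row describes one candidate text span:
# [looks_like_name, looks_like_action, semantic_class_id, position_in_paragraph].
# Label 1 means "this span is the field we want" (e.g. the signing date).
X_train = [
    [1, 0, 3, 0.1],
    [0, 1, 7, 0.4],
    [0, 0, 2, 0.9],
    [1, 1, 3, 0.2],
]
y_train = [1, 0, 0, 1]

model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)

# At extraction time every candidate span gets a score; high-scoring
# spans are highlighted as the extracted field.
print(model.predict_proba([[1, 0, 3, 0.15]]))
```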

Application areas of NLP

Here are some examples from our practice: implemented projects, pilots, and proof-of-concept cases.

NLP for financial organizations

Customers often ask us to process invoices and purchase orders. Some fields and text blocks can be extracted by keywords using traditional methods, but to look inside a block of text, you need NLP. Perhaps you don't need the entire address block, only parts of it: street, state, city, zip code, and country. Perhaps the invoice is discounted if paid before a certain date, and you need to parse out that date and the discount rate the offer applies to. Our technology helps account for variables such as how much an order costs if paid in advance or in bulk, or how much it will cost if payment is delayed.
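As a toy illustration of pulling such terms out of free text (a sketch only; the wording pattern is an assumption, and the real extraction model is far more flexible), one could look for an early-payment discount and its deadline like this:

```python
import re

# Assumed phrasing: "<rate>% discount if paid by/before <date>".
DISCOUNT_RE = re.compile(
    r"(?P<rate>\d+(?:\.\d+)?)\s*%\s*discount\s+if\s+paid\s+(?:by|before)\s+"
    r"(?P<deadline>\w+\s+\d{1,2},?\s*\d{4})",
    re.IGNORECASE,
)

def parse_discount_terms(invoice_text: str):
    match = DISCOUNT_RE.search(invoice_text)
    if not match:
        return None
    return {"rate_percent": float(match.group("rate")),
            "deadline": match.group("deadline")}

print(parse_discount_terms("A 2% discount if paid before June 1, 2021 applies."))
# -> {'rate_percent': 2.0, 'deadline': 'June 1, 2021'}
```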

We also help the legal departments of large companies extract important data from service contracts, progress reports, and confidentiality agreements. One of our projects involved a commercial lease agreement made up of multiple 30-page documents that the legal team had previously processed manually. Processing one by hand usually takes about an hour; with FlexiCapture, it takes less than two minutes, and according to our clients, we've saved them 5,000 man-hours per year.

Another area is loan agreements. Loans are made not only to individuals but also to large corporations, which can mean a $100 million mortgage. To obtain the loan, a company provides the bank with a large number of documents, and the bank then needs to extract 50-70 entities or conditions from each 250-page document. Done manually, each document takes 2-3 hours; with FlexiCapture, it takes nine minutes: not instant (the text is dense), but much faster.

Loan applications, the preliminary questionnaires banks send to customers, often need to be processed too. The larger the loan, the more questions the questionnaire contains. Regarding your workplace, for example, the bank may ask for the employer's identification number and legal address. Banks often ask about marital status, existing loans, utility debts, alimony, and other matters that might interfere with loan repayment. Sometimes the questions in loan applications are so convoluted that some companies help customers translate them from “legalese” into plain English.

The main difficulty in working with such documents is the large number of fields (105 in our case). Bank employees can easily get lost or make mistakes, but for the technology this is a piece of cake. FlexiCapture processes such a document in five minutes, compared with two to three hours by hand. Feel the difference!

Health care

Many of ABBYY’s projects involve extracting data from medical records.

With NLP, you can process abstracts of medical articles. One branch of pharmacology, pharmacovigilance, investigates possible side effects of new drugs. Medical institutions collect information on critical cases from patients and draw up individual case safety reports (ICSRs). If a new drug harms people, manufacturers must report it to regulators quickly or face hefty fines. To avoid this, pharmaceutical companies have highly qualified employees read ICSRs at length, which is rather tedious work.

With technology, the job is much easier. In one pharmacovigilance project, our technology was used to extract data from medical articles: patient gender and age, side effects, and drug names. Most of this is extracted with machine learning, but for drug names we take a simpler approach, a dictionary (which is also part of FlexiCapture NLP). The client asked that only certain drugs be extracted, from a list of 80 names. In this case, dictionary matching comes in handy: the drug name can be looked up with morphology taken into account, and letter case doesn't matter (especially since English has little inflection to begin with).
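A minimal sketch of dictionary-based drug-name matching, assuming a stand-in drug list and a crude normalization in place of FlexiCapture's actual dictionary and morphology engine:

```python
# Stand-in for the client's 80-name drug list.
DRUG_LIST = {"ibuprofen", "paracetamol", "metformin"}

def normalize(token: str) -> str:
    # Case-insensitive lookup plus a crude plural/possessive trim
    # in place of real morphological analysis.
    token = token.lower().strip(".,;:()")
    for suffix in ("'s", "s"):
        if token.endswith(suffix) and token[: -len(suffix)] in DRUG_LIST:
            return token[: -len(suffix)]
    return token

def find_drugs(text: str) -> set[str]:
    return {normalize(tok) for tok in text.split() if normalize(tok) in DRUG_LIST}

print(find_drugs("The patient took Ibuprofen; Metformin was continued."))
# -> {'ibuprofen', 'metformin'}
```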

Because there is so much information to review, medical records are processed automatically, including forms, receipts, and insurance company decisions. One American regulator, for example, accepts insurance complaints from patients. Insurance companies sometimes refuse to pay for treatment; the reasons vary, but patients have the right to contest the decision and appeal it to a government agency. The regulator then has to analyse the case and decide: was it really reasonable for the insurance company to withhold payment?

While you can easily process tables through FlexiLayout, blocks of text with insurance decisions are more difficult to parse. To extract the decision itself and the reasons behind it, we used NLP.

When patients are transferred from one hospital to another, their medical records need to be carefully analyzed. Given that we are in the midst of a pandemic and such patients are sometimes numerous, hospital staff struggle to cope with the extra paperwork. The cases we've worked on are not COVID-19 related, but the experience is still potentially valuable.

Real estate

One of our potential clients rents a lot of land for construction and office use. Accordingly, the company signs many lease agreements and needs automated processing: extracting dates to monitor payments, contract renewals, and overall costs.

Construction companies also have specialists called portfolio analysts. They analyze contracts and assess the costs and profitability of specific properties, much like credit scoring at a bank. Information in these contracts can be extracted both with and without NLP: table data is extracted with FlexiLayout, while the remaining fields are either paragraphs found by the splitter or fields within paragraphs found by the extraction model.

The advantage of NLP technology is that it adds another mechanism for handling more types of fields and documents.

One of our clients is a homeowners' association. When new owners join, they receive nine types of documents, including deeds and sale-and-purchase agreements. Once filled in, the data in these documents must be processed, verified, and entered into the information system. Of the nine document types, eight are structured and can be processed with FlexiLayout, but the ninth is tricky: to complete the project, we also had to handle unstructured documents.

This is where NLP comes in. On the one hand, the documents themselves are not very large, only 1-2 pages; on the other hand, they vary in content and are of poor quality. Nevertheless, our solution was able to extract the required information. What made the project interesting is that the NLP component was small, yet critical: without it, the project could not have been completed.

NLP is also critical to automating contract approval. Companies often sign master agreements, or framework contracts, that set general conditions for many future business operations, such as time frames, performance requirements, and penalties for delays.

The contract approval automation process looks like this: we extract a certain number of fields and clauses (terms) from the document. A field is one or a few words, while a clause is one or more paragraphs, possibly with lengthy wording. Companies need the fields for indexing, archiving, and future search. The technique then compares the extracted clauses with those in the master agreement. If everything matches, there is no risk to the company, and the contract can be approved automatically and filed in the database. This makes lawyers' jobs easier: they no longer have to wade through contracts of the same type and can move on to more important tasks. A contract goes to review only if the system finds inconsistencies with the governing documentation.
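To show the comparison step in miniature (a simplified sketch under assumed data; the clause texts, the threshold, and the similarity measure are all illustrative, not how FlexiCapture actually matches clauses):

```python
from difflib import SequenceMatcher

# Reference clauses from the (invented) master agreement.
MASTER_CLAUSES = {
    "payment_terms": "Payment is due within 30 calendar days of invoice receipt.",
    "delay_penalty": "A penalty of 0.1% of the contract value applies per day of delay.",
}

def check_contract(extracted_clauses: dict[str, str], threshold: float = 0.9) -> list[str]:
    """Return the names of clauses that deviate from the master agreement."""
    mismatches = []
    for name, master_text in MASTER_CLAUSES.items():
        contract_text = extracted_clauses.get(name, "")
        similarity = SequenceMatcher(None, master_text.lower(),
                                     contract_text.lower()).ratio()
        if similarity < threshold:
            mismatches.append(name)
    return mismatches

extracted = {
    "payment_terms": "Payment is due within 30 calendar days of invoice receipt.",
    "delay_penalty": "Penalties for delay are capped at 2% of the contract value in total.",
}
# An empty list would mean the contract can be auto-approved;
# otherwise it is routed to a lawyer for review.
print(check_contract(extracted))  # -> ['delay_penalty']
```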

How we approach NLP projects

In a structured document, you can quickly find the required fields, and the document itself is only one or a few pages long. In an unstructured document, it can be difficult to determine what data to extract and where it resides (this is where NLP helps), and the document itself can be 100-200 pages long. In the requirements phase, we first ask the customer to compile a list of the dozens of fields that need to be retrieved. Such projects require subject-matter experts to answer questions about what needs to be extracted from the document and which nuances need attention.

Sometimes customers want hundreds of fields extracted from a document right away. This approach is not constructive, and discussing the project requirements takes a long time. So, as a rule, we don't start with hundreds of fields; we start with about 10, which helps us clarify requirements and show how everything works. That way both the client and we learn more about the project's phases and milestones.

In addition, any machine learning project, including NLP, requires a representative sample to train the system. The bigger the sample, the better.

Conclusion

The examples above show how NLP technology helps save valuable time. It's a win-win situation: robots take on the repetitive tasks, while employees work on smarter and more interesting projects. By handing routine tasks over to machines, companies can improve customer interactions, avoid processing errors, and grow their profits faster.