Introduction: Riding the wave of digitalization, pioneers in the financial asset management industry are actively exploring artificial intelligence, big data and other advanced technologies to build future-oriented intelligent investment research platforms. Starting from the demand for data intelligence in financial asset management, this paper introduces typical applications of natural language processing technology in this field. For information mining over massive text, we combine recent research results such as Transformer and CNN with techniques developed by our team, such as Tag2Vec, to build an end-to-end text big data analysis system that covers the whole process from intelligent collection of massive text and text structuring to supporting investment decisions, collecting and rapidly analyzing tens of millions of text documents and helping customers make fast, accurate industry analyses and investment decisions. For text monitoring scenarios with few samples, we provide a layered technical architecture based on Entropy Simple's NLP stack and use text enhancement, few-shot learning and transfer learning to build an efficient financial public opinion monitoring system in small-sample scenarios, helping financial institutions leap from data liabilities to data assets, gain forward-looking business insight and seize opportunities first.
The main contents of this paper include:
1. Background and technical architecture
2. End-to-end bidding text analysis system
3. Financial public opinion monitoring system in a small sample scenario
4. Summary and outlook
01 Background and Technical Framework
1. Rapid growth of unstructured data
Information asymmetry is the essential feature of the financial industry and the focus of competition. A report from IDC shows that 80 percent of the new data added globally in recent years is unstructured. A large amount of timely and effective information is therefore distributed across unstructured text such as research reports, news and Twitter posts. Financial institutions need natural language processing technology to efficiently and accurately mine structured information from this text and obtain forward-looking business insights.
Using the latest ideas and technologies in artificial intelligence, such as transfer learning, few-shot learning and unsupervised learning, our team has built a solid natural language processing technology framework and provides end-to-end systems for analyzing and monitoring massive text, helping financial asset management clients bridge the gap from unstructured text to structured data and then supporting rapid industry analysis and investment decisions.
Let’s take a look at how NLP technology can be embedded in industry analysis and investment decisions:
2. Intelligent investment and research process
Intelligent investment and research process includes:
Data layer: The core task of this phase is data acquisition. It includes structured and unstructured data, among which unstructured data includes research reports, news and information, etc.
Data center: The core task of this stage is to transform the original data into index data that can be directly used in investment research. On the one hand, the system uses NLP technology to transform unstructured text data into structured data. On this basis, the system uses big data, machine learning and other technologies to model and analyze the structured NLP data and other originally structured data, and further refine the data into knowledge.
Knowledge graph: The core task of this stage is to translate the knowledge and facts from the previous step into investment advice. The machine uses a knowledge graph that encodes the analyst's research framework to analyze and reason over the large amount of knowledge acquired in the previous stage, applying logical reasoning and risk control, and finally produces research information of reference value for decision making.
Together, these three stages form a complete chain from data acquisition -> data processing -> data modeling -> logical reasoning. This chain constitutes a fully automated, industrialized, 24-hour intelligent investment research system.
To support this intelligent research system, let's take a look at our natural language processing technology architecture:
3. Natural language processing technology architecture
Our natural language processing technology architecture is divided into an application layer, a component layer and a corpus layer.
Application layer: Directly connects with business logic. At present, Entropy Simple Technology's end-to-end text analysis systems serve 20+ institutions in financial asset management and consulting, covering 30+ business application scenarios.
Component layer: Provides the basic algorithm components of natural language processing, including intelligent word segmentation, part-of-speech tagging, dependency parsing, word vectors, semantic similarity and named entity recognition.
Corpus layer: Provides training and testing corpora for each algorithm component in the component layer and each algorithm module in the application layer.
Universal corpus of basic components, such as universal text corpus, universal named entity recognition corpus, etc.
Domain-related corpora, such as a financial dictionary, a research report classification corpus and a listed company information database.
Natural language processing architectures built this way have two obvious benefits:
We can quickly build upper-level business systems by reusing the isolated common components
It is well organized, with each component performing its own duties, which makes it friendly to both technical and business staff and easy to use
Next, two typical application scenarios are introduced: bidding text analysis system and financial public opinion monitoring system.
Among them:
The bidding text analysis system is characterized by end-to-end processing of massive text
The financial public opinion monitoring system mainly addresses small-sample scenarios
Through these two typical financial application scenarios, we share some of the problems encountered in practice and our solutions.
02 End-to-end bidding text analysis system
What is bidding data?
When purchasing hardware and software, a company generally issues a bidding announcement. After suppliers see the announcement, they write and submit their own bid documents. After evaluation, Party A issues a bid-winning announcement to inform everyone who has won the bid.
Why is bidding data important?
For a listed company whose main business follows a toB model, we can predict the company's operating income through bidding data. For example, if the company wins a large order, we can anticipate its operating income in advance from the bid-winning data.
In the case above:
On the left is a bid-winning announcement disclosed by a listed company, with a bid amount of 650 million yuan, released on October 17, 2017. On the right is the corresponding bidding announcement that we collected from public data on the Internet. The project name, winning unit and bid amount are all consistent with the content on the left; the only difference is the time: our collected data was available 16 days earlier than the company's disclosure, which gives us an advantage in obtaining key information.
1. Technical architecture diagram of bidding big data analysis system
To monitor bidding data across the whole web, we developed an end-to-end intelligent bidding text analysis system that processes tens of millions of bidding documents in a streaming fashion. It mainly includes an intelligent web page extraction system, a bidding text analysis service and data display. First, the system collects original bidding and bid-winning announcements from external bidding websites; then the bidding text analysis service structures these documents and extracts the most critical information; finally, the extracted data is aggregated, analyzed and displayed in dashboards that are convenient for business users.
Here are the two core algorithm components: the intelligent web page extraction system and the bidding information extraction model.
2. Intelligent webpage extraction system
Routine data acquisition steps include:
Write page collection rules
Dispatch collection tasks and execute downloads
Extract the content according to the rules
Because a large number of websites need to be collected, this approach requires a great deal of manpower, resulting in very high cost and low efficiency. We therefore need an intelligent information extraction engine that can automatically extract text fragments of a specific region and specific purpose from massive web page data, that is, extract the bidding title and the body text of the bidding document from bidding website data.
Difficult points:
There are more than 100,000 domestic information websites, with a wide variety of webpage types and countless templates, which cannot be processed by unified rules;
Web content is organized and laid out as a tree (two-dimensional) based on HTML tags, whereas traditional text is a one-dimensional sequence.
Mathematical model of web page extraction:
Every web page is equivalent to a tree carrying various kinds of information. The text, pictures and hyperlinks of the news body are distributed over certain nodes of the tree (the red nodes in the figure). Irrelevant nodes therefore need to be removed, and the remaining nodes serialized according to their position information.
Build Tag embedding:
The first problem we have to solve is the numerical representation of HTML tags and attributes in web pages. To solve this problem, and inspired by the skip-gram idea of Word2Vec, we proposed the Tag embedding algorithm, whose objective function is shown above. The key idea is to use the current node's tag to predict the tags of its parent node and its child nodes.
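To make the idea concrete, here is a minimal, self-contained sketch of the Tag embedding objective, assuming a skip-gram-style setup in which each node's tag predicts its parent and child tags. The tiny HTML snippet, the full-softmax objective and all hyperparameters are illustrative stand-ins for the production training pipeline.

```python
import torch
import torch.nn as nn
from bs4 import BeautifulSoup, Tag

def tag_context_pairs(html):
    """Yield (center_tag, context_tag) pairs: a node's context is its parent tag and child tags."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(True):                                   # every element node
        if isinstance(node.parent, Tag) and node.parent.name != "[document]":
            yield node.name, node.parent.name
        for child in node.find_all(True, recursive=False):
            yield node.name, child.name

class TagEmbedding(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.center = nn.Embedding(vocab_size, dim)    # the tag vectors we keep afterwards
        self.out = nn.Linear(dim, vocab_size)          # predicts the context tag (full softmax is fine: tag vocab is small)

    def forward(self, center_ids):
        return self.out(self.center(center_ids))

# toy corpus and training loop
html = "<html><body><div><h1>t</h1><p>a<a href='#'>x</a></p></div></body></html>"
pairs = list(tag_context_pairs(html))
vocab = {t: i for i, t in enumerate(sorted({t for pair in pairs for t in pair}))}
centers = torch.tensor([vocab[c] for c, _ in pairs])
contexts = torch.tensor([vocab[c] for _, c in pairs])

model = TagEmbedding(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    opt.step()
tag_vectors = model.center.weight.detach()             # one dense vector per HTML tag
```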
Features of Tag embedding model:
Unsupervised training can be carried out on large-scale datasets to learn the semantic associations and hierarchical relationships between tags
10 million+ original web pages participated in the training
The generalization ability of subsequent classification models is significantly improved
The amount of annotated data required by the classification model is significantly reduced; only tens of thousands of annotated samples are needed to achieve high accuracy
Binary classifiers based on fully connected networks:
After Tag embedding, we further propose a binary classifier based on three-layer feedforward neural network, which is mainly used to judge whether the node is retained.
As shown in the figure above, the input features mainly include the label information of the parent node, the label information of the child node, the label information of the current node, and other features of the current node, such as the length of text contained by the current node and the number of hyperlinks.
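The following is a minimal sketch of such a node-retention classifier. The 32-dimensional tag vectors, mean-pooled child embeddings and two numeric features are illustrative assumptions, not the production feature layout.

```python
import torch
import torch.nn as nn

class NodeClassifier(nn.Module):
    """Keep/drop decision for a single DOM node (sizes are illustrative)."""
    def __init__(self, tag_dim=32, n_numeric=2, hidden=64):
        super().__init__()
        in_dim = 3 * tag_dim + n_numeric      # parent + current + pooled-children tag vectors + numeric features
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, parent_vec, current_vec, children_vec, numeric_feats):
        # children_vec is assumed to be the mean of the child nodes' tag vectors;
        # numeric_feats could be, e.g., [text_length, hyperlink_count]
        x = torch.cat([parent_vec, current_vec, children_vec, numeric_feats], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)   # probability of keeping the node

# usage: keep-probabilities for a batch of 4 nodes
model = NodeClassifier()
p_keep = model(torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 2))
```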
Model performance:
Amount of training data: 40,000 labeled data from 100 bidding websites
Tests were carried out on 1,000 websites; the accuracy of title extraction was 98% and that of body text extraction was 96%.
The relatively simple three-layer feed-forward neural network is mainly used for the following reasons:
Our application scenario requires real-time processing of massive web data, so computational efficiency must be very high
Thanks to the unsupervised large-scale pre-training of Tag embedding, a three-layer neural network is sufficient to achieve good performance
At the same time, the idea of this model can be extended to other tasks:
Web page type determination: catalog page, text page, advertising page, picture page
Other key information: directory link extraction, author information extraction, etc
At present, we have realized the collection of massive bidding texts. Next, we need to structure the text data to get the data fields we want.
3. Bidding information extraction model
① Extraction target:
The target of our bidding information extraction model is to extract key information, such as tendering unit, winning unit, winning amount, product type and so on, from massive bidding documents.
The difficulty is that bidding documents are drafted entirely by their writers, with no standardized, unified format, so they cannot be processed by uniform rules:
The expressions of bid-winning units are varied: contractors, suppliers, etc.;
The first bid-winning bidder gives up;
The tendering unit appears in the title;
Multiple winning prices coexist;
…
② Specific entity class extraction scheme:
After abstraction, this task is very similar to named entity recognition. In our processing framework it is defined as specific entity class extraction, and its structure includes a pre-processing layer, an entity extraction layer, an entity discrimination layer and a voting decision layer. The entity extraction layer and the entity discrimination layer are the main focus here:
Entity extraction layer: integrates extractors based on an external entity library, the named entity recognition component and regular-expression extractors to extract institution and amount entities, producing an entity set together with context features.
Entity discrimination layer: determines whether an entity is the target entity according to its context features. Here we integrate entity judgments based on manual rules, keyword matching and a convolutional neural network.
Through this two-stage process, multiple models are fused. The first stage does not rely on a domain corpus and is trained on a general named entity recognition corpus; the second stage can be trained on a small amount of professional bidding corpus. In this way, high recall and high precision are achieved at the same time.
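As an illustration of the two-stage design, here is a schematic sketch in Python. The helper names (`ner_model`, `cnn_scorer`, `rule_fns`) and the simple majority-vote fusion are hypothetical placeholders, not the production logic; the stubs in the usage example only show how the pieces fit together.

```python
import re
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    left_context: str
    right_context: str

AMOUNT_RE = re.compile(r"[\d,.]+\s*(万元|亿元|元)")      # regex extractor for amounts

def extract_candidates(doc, gazetteer, ner_model):
    """Stage 1: union of gazetteer hits, NER spans and regex amounts (high recall)."""
    spans = set()
    spans |= {m.span() for m in AMOUNT_RE.finditer(doc)}
    spans |= {(doc.find(name), doc.find(name) + len(name)) for name in gazetteer if name in doc}
    spans |= set(ner_model(doc))                          # assumed to return (start, end) spans
    return [Candidate(doc[s:e], doc[max(0, s - 30):s], doc[e:e + 30]) for s, e in spans]

def is_target_entity(c, rule_fns, keyword_set, cnn_scorer):
    """Stage 2: keep a candidate only if rules, keywords or the CNN agree (high precision)."""
    votes = [fn(c) for fn in rule_fns]
    votes.append(any(k in c.left_context for k in keyword_set))
    votes.append(cnn_scorer(c) > 0.5)                     # assumed to return a probability
    return sum(votes) >= 2                                # toy majority vote for illustration

# usage with stub components
doc = "中标单位：某某科技有限公司，中标金额：650,000,000 元。"
cands = extract_candidates(doc, gazetteer={"某某科技有限公司"}, ner_model=lambda d: [])
kept = [c for c in cands if is_target_entity(
    c, rule_fns=[lambda c: "中标单位" in c.left_context], keyword_set={"中标"}, cnn_scorer=lambda c: 0.0)]
```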
Next, the core modules of the two stages, general named entity recognition and CNN discriminator, are introduced in detail.
③ Named entity recognition based on improved Transformer
For the general named entity recognition component, our team has iterated through several versions; the latest scheme follows the TENER model proposed by Professor Qiu's team at Fudan University in 2019. In this model we take the improved Transformer as the main feature extractor and combine it with a CRF layer, introducing global constraint information to realize the named entity recognition task. The figure on the left shows the structure of the whole solution, while the figure on the right shows the native Transformer structure for comparison.
Our solution has two major improvements over native Transformer:
Embedding: the native Transformer uses only token embedding, while our scheme uses both single-character embedding and bigram embedding, effectively increasing the expressive power of the input text. Many papers have shown that bigram embeddings generally improve the performance of named entity recognition models; for detailed experiments, refer to the 2018 article mentioned above. A sketch of this dual-embedding input follows this list.
Attention: the original self-attention layer of the Transformer is improved by adjusting the position encoding, so that direction information and relative position information can be captured simultaneously.
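Below is a minimal sketch of the dual-embedding input described in the first improvement. A vanilla `nn.TransformerEncoder` stands in for TENER's direction-aware relative-position attention, and the CRF layer is omitted, so this only illustrates how character and bigram embeddings are combined; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class DualEmbeddingEncoder(nn.Module):
    def __init__(self, n_chars, n_bigrams, dim=128, n_tags=9):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)         # unigram (single-character) embedding
        self.bigram_emb = nn.Embedding(n_bigrams, dim)     # bigram embedding
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.emission = nn.Linear(2 * dim, n_tags)          # per-token BIO tag scores; a CRF layer would sit on top

    def forward(self, char_ids, bigram_ids):
        x = torch.cat([self.char_emb(char_ids), self.bigram_emb(bigram_ids)], dim=-1)
        return self.emission(self.encoder(x))

# usage: a batch of 2 sentences of length 20
model = DualEmbeddingEncoder(n_chars=6000, n_bigrams=50000)
scores = model(torch.randint(0, 6000, (2, 20)), torch.randint(0, 50000, (2, 20)))
```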
The specific experimental results are as follows:
④ Entity decision based on convolutional neural network
TextCNN is adopted as the core component here; the whole network consists of an embedding layer, a convolution layer and a feed-forward layer.
Embedding layer: word embeddings represent the semantic information of the vocabulary, and position embeddings represent the relative position of each word, effectively capturing contextual location information.
Convolution layer: convolution windows of different sizes are used to capture features at different distances. Meanwhile, we replace max pooling with top-k pooling, retaining some of the model's weaker features to ensure robustness.
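A compact sketch of this discriminator network is given below, with illustrative sizes; the combination of word plus position embeddings, multi-width convolutions and top-k pooling follows the description above.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, max_len=128, dim=128, windows=(2, 3, 4), n_filters=64, k=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)            # position of each token
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, n_filters, kernel_size=w, padding=w // 2) for w in windows
        )
        self.k = k
        self.fc = nn.Sequential(nn.Linear(len(windows) * n_filters * k, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (self.word_emb(token_ids) + self.pos_emb(pos)).transpose(1, 2)   # (B, dim, L)
        feats = []
        for conv in self.convs:
            h = torch.relu(conv(x))                          # (B, n_filters, L')
            feats.append(h.topk(self.k, dim=-1).values.flatten(1))   # top-k pooling keeps weaker features too
        return self.fc(torch.cat(feats, dim=1))              # 2-way decision: target entity or not

# usage: logits for 4 candidate contexts of length 50
model = TextCNN(vocab_size=20000)
logits = model(torch.randint(0, 20000, (4, 50)))
```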
Test results of bidding information extraction model:
Our test results on 5000 bidding data are as follows:
High-recall entity extractor: a hybrid architecture of TENER-based named entity recognition, extraction based on an external information base and a disambiguation model brings the average recall of the three entity types to 0.97.
High-precision entity discriminator: a voting architecture combining CNN, manual rules and keyword matching achieves high robustness, with an average precision of 0.96.
Parallel computation, lightweight model, high efficiency: under the same hardware conditions, the prediction speed is about 20 times that of BERT.
4. End-to-end big data analysis system for bidding
Based on the previous results, we can build the bidding big data analysis system. The system covers the whole process from intelligent collection of massive bidding documents and structuring of the text data to supporting investment decisions, realizing collection and rapid analysis of tens of millions of text documents and helping customers forecast and track the development and competitive landscape of toB industries and companies.
Intelligent collection of massive bidding documents: the system covers 700+ bidding websites and 50 million+ bidding documents; about 60% of the sources are government websites, 20% are websites of central and state-owned enterprises, and 20% are bidding publicity platforms of hospitals, schools and other public institutions in sub-sectors.
Text data structuring: Real-time processing of massive bidding documents, extracting key information such as bid-winning amount, tendering unit and bid-winning unit, providing multi-dimensional analysis of customers, regions and time.
5. Display of some functions of bidding big data analysis system
The example shows how bidding data can be used to analyze Hikvision's development and forecast its performance. For instance, historical backtesting shows that the bid-winning data is highly correlated with the quarterly revenue in the company's periodic reports, so this data can serve as an important reference for performance forecasts. In addition, regional analysis reveals Hikvision's competitive position and revenue distribution across regions, giving a deeper understanding of the company's operating status.
6. Summary
We proposed the Tag embedding algorithm to realize a distributed representation of HTML tags. On this basis, combined with other web page features, we built an automatic web content extraction system based on a feed-forward neural network, realizing automatic collection from 700+ bidding websites covering tens of millions of bidding documents.
We constructed a two-stage bidding information extraction system. In the first stage, with the improved Transformer network at its core, entity extraction reaches an F1 value of up to 0.97; in the second stage, with the position-embedding CNN at its core, the overall system reaches an F1 value close to 0.96.
We built an end-to-end bidding big data analysis system on top of automatic web content extraction and bidding information extraction, collecting and rapidly analyzing 50 million+ text documents and helping customers predict and track the development status and competitive landscape of toB industries and companies.
03 Financial public opinion monitoring system in a small sample scenario
1. Financial public opinion monitoring system
In finance there are two types of institutions: the buy side and the sell side. The buy side, such as public funds and private funds, generally buys and sells stocks directly. The sell side, mainly brokerages and independent research institutions, conducts stock analysis and research and provides consultation and advice to the buy side. Usually one buy-side organization is served by several sell-side organizations. Since WeChat has become a working platform, WeChat groups have become an important channel for sell-side services: an analyst often belongs to dozens of sell-side service groups and may receive messages from them at any moment. The main pain points of this scenario are:
Message omission: there are so many WeChat groups that messages cannot be checked in time, and some files expire before they can be opened
Excessive noise: the groups contain many different types of messages, and heavy information noise makes useful information hard to find
Fragmented information: it is impossible to aggregate all the information to understand the overall trend of sell-side views
In view of these pain points, we propose the financial public opinion monitoring system as a solution, which achieves:
No omission: automatically summarizes all research materials in the sell-side groups, including research invitations, shared articles, WeChat messages and PDF files
High efficiency: information can be filtered along multiple dimensions such as industry, company and information category to accurately locate useful content
Sustainable: information can be subscribed to by WeChat group or by speaker, so users can continuously follow specific brokerages and specific teams
Analyzable: Aggregate all information in a specific period of time, conduct multi-dimensional hot spot analysis, and push hot information to users
Process of financial public opinion monitoring system:
First, the information in the WeChat groups, such as text messages, links and documents, is tagged with three types of labels: company, industry and institution. It is then classified by business type; at present there are four top-level categories and 11 sub-categories. At the same time, the system extracts structured fields such as article author and meeting time. On this basis, many valuable applications can be built, such as hotspot tracking, categorized viewing, report retrieval, event discovery and a research calendar.
2. Technical architecture diagram of financial public opinion monitoring system
The technical architecture of the financial public opinion monitoring system includes three layers of services: financial public opinion text analysis service, data cleaning service and display service.
Among them, the three most critical components of financial public opinion text analysis service are: information type classification, first-level industry classification and specific entity extraction.
3. Small sample dilemma
In practice, many problems in the financial field are tied to specific scenarios, and the small-sample dilemma that financial companies usually face mainly includes:
High data collection cost: The amount of data that can be collected is small and the time cost of data collection is high.
High difficulty in data annotation: problems in the financial field require business personnel and even financial analysts to participate in annotation.
For the small-sample dilemma, commonly used approaches include transfer learning, data enhancement, unsupervised learning and semi-supervised learning. Next, we share our ideas for solving the small-sample problem by introducing the implementations of the two main algorithm components of the financial public opinion system.
4. WeChat information classification model
Objective: the WeChat information classification model classifies the text messages, files and link messages in WeChat groups into 11 categories: company in-depth reports, company reviews, industry in-depth reports, industry reviews, macro strategy reports, fixed income reports, research summaries, meeting minutes, survey invitations, meeting invitations and others.
The whole model uses TextCNN and FastText as two base models, which are then ensembled with XGBoost. The TextCNN used here is basically the same as the network in the bidding system, except that we remove the position vector from the embedding layer. The benefits of this design include the following (a minimal sketch of the ensemble follows the list):
High robustness: XGBoost integrates the multi-layer CNN network and the FastText network, combining the advantages of deep and shallow models to provide robustness.
Lightweight model with high computational efficiency.
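Below is a rough sketch of the stacking idea. `textcnn_proba` and `fasttext_proba` are placeholders for the two trained base models (stubbed with random outputs here so the snippet runs), and the XGBoost meta-classifier and its hyperparameters are illustrative.

```python
import numpy as np
import xgboost as xgb

N_CLASSES = 11

def textcnn_proba(texts):                # placeholder: the real model returns softmax probabilities
    return np.random.rand(len(texts), N_CLASSES)

def fasttext_proba(texts):               # placeholder for the FastText base model
    return np.random.rand(len(texts), N_CLASSES)

def stack_features(texts):
    """Concatenate the two base models' class probabilities into meta-features."""
    return np.hstack([textcnn_proba(texts), fasttext_proba(texts)])

# toy training data standing in for labeled WeChat messages
texts = [f"message {i}" for i in range(200)]
labels = np.random.randint(0, N_CLASSES, size=200)
meta = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
meta.fit(stack_features(texts), labels)
pred = meta.predict(stack_features(["新一期行业点评已发布"]))
```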
5. Text enhancement
Text enhancement is a low-cost data lever that can effectively improve model performance without introducing new data, and it is especially valuable in small-sample scenarios.
Common scenarios include:
Small sample scenario: expand the original sample set, cost-effective, fast and economical.
Sample imbalance: expand under-represented categories to balance the sample set and improve model performance.
Improved robustness: Noise was introduced into the training set to improve the robustness of the model.
Semi-supervised training: Used to construct sample pairs of semi-supervised training without label data.
In general, because text enhancement improves the robustness of the model, it is usually worth trying unless the data volume is already very rich, and it generally has a positive effect.
Typical text enhancement techniques are:
Back-translation: the basic idea is original language 1 -> language 2 -> language 3 -> original language 1, i.e. translating the text into one or more other languages and then back into the original language.
EDA: The basic idea is to perform four kinds of random operations on the original text, including synonym replacement, random insertion, random exchange and random deletion respectively.
Non-core word replacement: The basic idea is to use TF-IDF to evaluate the importance of each word in the sentence and replace the non-core words using a dictionary (toy sketches of EDA and non-core word replacement are given below).
For a detailed introduction to text enhancement, please refer to our team’s previous article:
https://zhuanlan.zhihu.com/p/111882970
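For concreteness, here are toy sketches of EDA and non-core word replacement. The synonym table, TF-IDF weights and replacement vocabulary are made up for illustration, and random insertion is omitted from the EDA sketch.

```python
import random

SYNONYMS = {"利润": ["盈利", "收益"], "增长": ["上升", "提升"]}   # toy synonym dictionary

def eda_augment(tokens, n_ops=2):
    """EDA sketch: a few random synonym replacements, swaps or deletions (insertion omitted)."""
    tokens = list(tokens)
    for _ in range(n_ops):
        op = random.choice(["replace", "swap", "delete"])
        i = random.randrange(len(tokens))
        if op == "replace" and tokens[i] in SYNONYMS:
            tokens[i] = random.choice(SYNONYMS[tokens[i]])
        elif op == "swap" and len(tokens) > 1:
            j = random.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]
        elif op == "delete" and len(tokens) > 2:
            tokens.pop(i)
    return tokens

def replace_non_core(tokens, tfidf, vocab, top_keep=3):
    """Non-core word replacement: keep the highest TF-IDF tokens, resample the rest from a vocabulary."""
    core = set(sorted(tokens, key=lambda t: tfidf.get(t, 0), reverse=True)[:top_keep])
    return [t if t in core else random.choice(vocab) for t in tokens]

sent = ["公司", "三季度", "利润", "增长", "显著"]
print(eda_augment(sent))
print(replace_non_core(sent, {"利润": 0.9, "增长": 0.8, "三季度": 0.7}, vocab=["近期", "数据", "表现"]))
```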
6. Experimental results of data enhancement
Sample set:
Sample set: the dataset contains about 2,200 valid samples. The company review category has the most samples, about 500; fixed income reports and survey invitations have the fewest, between 150 and 200.
Test set: about 100 samples randomly selected from each category, about 900 in total
Training set: the remaining roughly 1,300 samples, about 150 per category
The experimental results are shown on the right and summarized as follows:
Data leverage: back-translation, EDA and non-core word replacement each bring an improvement of 6 to 9 percentage points.
The effect is most significant in small-sample scenarios: with only 20% of the data (about 30 samples per category on average), text enhancement brings an improvement of about 9 percentage points.
The three methods perform similarly: all of them effectively improve model performance, each by about 5 percentage points on the full dataset.
The final experimental results are shown in the figure above. Through the enhancement technology and some other methods in this paper, we have basically solved the problem of small samples.
How can text enhancement work so well without introducing additional data?
Here’s our thinking:
Regularization: in essence, the designer expresses a model preference, or imposes a strong prior assumption on the model's distribution. For example, EDA expresses the preference that the model should be insensitive to local noise in the text.
Transfer learning: the effectiveness of back-translation comes from the externally trained translation model. It can be understood as transferring information or knowledge that the pre-trained model has learned elsewhere into the current task, increasing the information content of the overall data and thus better guiding the learning of the current model.
Improved robustness: besides adding semantic-level noise, EDA and non-core word replacement can also be seen as applying generalized, task-independent noise to the input data, playing a role similar to a dropout layer; various studies have shown that this improves model robustness to a certain extent.
Manifold: according to the manifold distribution hypothesis, texts with the same label tend to concentrate on a low-dimensional manifold in the high-dimensional text space, so effective text enhancement techniques should ensure that newly generated texts still lie on that manifold.
7. WeChat industry classification model
① Three-stage training method
Model objective: to classify the text messages, documents and link messages in WeChat groups by industry, using the CITIC level-1 industry classification as the benchmark, which includes 29 industry categories such as catering and tourism, trade and retail, textiles and apparel, agriculture, forestry, animal husbandry and fishery, construction, petroleum and petrochemicals, communications and computers. Here we still use TextCNN as the baseline; with enough samples it can achieve good results.
But the problem is that the sample size is too small:
Sample set: The dataset contains about 1200 valid samples
Test set: about 600 samples, 10 to 30 randomly selected per category
Training set: the remaining roughly 600 samples, 10 to 25 per category
To solve this problem, we propose a three-stage training model optimization algorithm. The whole process is as follows:
Step1: Word vector pre-training. Word vectors are pre-trained on billions of tokens of general corpus and then further trained on 100,000 research reports to improve the representation of domain-related vocabulary.
Step2: Train a preliminary model on the original samples and use it to classify the 100,000 research reports by industry, keeping the reports whose predicted category confidence exceeds a chosen threshold. These reports are then used to train the model a second time, and the process is iterated several times.
Step3: Take the model parameters from Step2 as initial values and train on the original samples to obtain the final model.
The whole process can be likened to education: the first stage is high school, where general knowledge is mastered; the second stage is undergraduate study, where the basics of a major are mastered; the third stage is graduate study, where specific problems in a sub-field of the specialty are solved.
The core idea is to use this three-stage training method to transfer distribution information about industry categories from massive external research reports to the WeChat task; a runnable toy version of the procedure is sketched below.
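The toy version below uses TF-IDF plus logistic regression as a stand-in for the word-vector/TextCNN pipeline, with synthetic data in place of real research reports. Note that in the real system Step3 warm-starts from the Step2 parameters; the final fit here starts fresh only to keep the snippet self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts = ["白酒 销量 回暖", "银行 息差 收窄", "芯片 产能 紧张"] * 10   # toy WeChat samples
labeled_y = np.array([0, 1, 2] * 10)                                      # toy industry labels
report_texts = ["白酒 提价", "银行 利润", "芯片 涨价"] * 50                  # unlabeled external "research reports"

# Step1 is approximated by the shared vectorizer; Step2 starts with a seed model on the small labeled set
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(labeled_texts, labeled_y)

for _ in range(3):                                       # Step2: iterative pseudo-labeling of external reports
    probs = model.predict_proba(report_texts)
    keep = probs.max(axis=1) >= 0.8                      # confidence threshold
    pseudo_y = probs.argmax(axis=1)[keep]
    pseudo_x = [t for t, k in zip(report_texts, keep) if k]
    model.fit(labeled_texts + pseudo_x, np.concatenate([labeled_y, pseudo_y]))

model.fit(labeled_texts, labeled_y)                      # Step3: final training on the original samples
```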
The results are shown on the right of the image above:
The three-stage training method is effective: it can significantly improve model performance, especially with very small sample sizes (about 5-8 samples per category), where transfer learning raises the model's F1 value by up to 48 percentage points.
Effective reduction in sample requirements: even with only 60% of the data, the model outperforms the baseline trained on 100% of the data by 3 percentage points.
② Industry memory network
Since the three-stage training method is effective, can we solidify the knowledge learned from external research reports into a standalone network? Based on this idea, we propose the industry memory network, which forms a hybrid model together with a TextCNN network. For an input text, one branch performs convolutional feature extraction, while the other branch feeds the text into the memory network, which looks up the pre-trained industry embedding of each word and applies a multi-layer attention mechanism to capture the industry attributes of the text. By training on sample sets from the different industry fields, the network captures the industry characteristics of the same word across multiple scenarios.
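A schematic sketch of this hybrid model is shown below. Dimensions are illustrative, the industry embedding is randomly initialized here (in practice it would be loaded from the pre-trained industry vectors described under "Implementation principle" below), and the attention is simplified to a single layer.

```python
import torch
import torch.nn as nn

class IndustryMemoryNet(nn.Module):
    def __init__(self, vocab_size, word_dim=128, ind_dim=29, n_classes=29):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.industry_emb = nn.Embedding(vocab_size, ind_dim)    # frozen; load pre-trained industry vectors here
        self.industry_emb.weight.requires_grad = False
        self.conv = nn.Conv1d(word_dim, 64, kernel_size=3, padding=1)   # TextCNN branch
        self.attn = nn.Linear(ind_dim, 1)                        # attention over the tokens' industry vectors
        self.fc = nn.Linear(64 + ind_dim, n_classes)

    def forward(self, ids):
        cnn = torch.relu(self.conv(self.word_emb(ids).transpose(1, 2))).max(dim=-1).values
        mem = self.industry_emb(ids)                              # (B, L, 29)
        weights = torch.softmax(self.attn(mem), dim=1)            # which tokens carry industry signal
        pooled = (weights * mem).sum(dim=1)                       # (B, 29) industry summary of the text
        return self.fc(torch.cat([cnn, pooled], dim=-1))

# usage: logits for a batch of 4 messages of length 30
model = IndustryMemoryNet(vocab_size=20000)
logits = model(torch.randint(0, 20000, (4, 30)))
```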
The transfer learning ideas of the whole hybrid model include:
Word embeddings pre-trained on massive general corpora learn word co-occurrence information, i.e. semantic information
Industry embeddings pre-trained on 100,000 research reports learn the industry information of words
Implementation principle:
The basic idea is to use massive external corpus to learn the industry and field information of vocabulary:
Step1: Divide the 100,000-report research corpus into 29 sample sets by industry. Each sample set takes one industry's research reports as positive examples and a random sample of the other 28 industries' reports as negative examples.
Step2: Train 29 SVM models, one per sample set; the task is to classify research reports and judge whether they belong to the target industry.
Step3: Extract each word's coefficient in the 29 SVM models to form the word's industry embedding (a toy construction is sketched below):
V_word = [w_1, w_2, w_3, …, w_29]
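Here is a toy illustration of this construction with three synthetic "industries" standing in for the 29 real ones; the data, vectorizer and SVM settings are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

reports = ["白酒 啤酒 食品", "银行 贷款 息差", "芯片 晶圆 代工"] * 20   # toy research-report corpus
industry = np.array([0, 1, 2] * 20)                                   # toy industry labels

vec = CountVectorizer()
X = vec.fit_transform(reports)

coef_rows = []
for k in range(3):                                       # 29 one-vs-rest SVMs in the real system
    svm = LinearSVC().fit(X, (industry == k).astype(int))
    coef_rows.append(svm.coef_.ravel())                  # each word's coefficient for industry k

industry_embedding = np.vstack(coef_rows).T              # shape: (vocab_size, n_industries)
word2row = vec.vocabulary_
print(industry_embedding[word2row["啤酒"]])              # V_word = [w_1, ..., w_k]
```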
Finally, a distributed representation of vocabulary industry information is obtained. The figure on the right shows the visualization of the industry embedding after dimensionality reduction. The words in the red circle at the bottom are Haitian, beer, and food & beverage. Anyone who cooks will know that Haitian is a well-known Chinese seasoning manufacturer, and in terms of industry attributes it should belong to the same industry as beer and food & beverage. The visualization shows that the industry embedding algorithm has indeed learned the industry information of the vocabulary.
Experimental results of industry memory network:
Finally, we use a combination of industry memory network and text enhancement. The details are as follows:
Sample set:
Sample set: The dataset contains about 1200 valid samples
Test set: about 600 samples, 10 to 30 randomly selected per category
Training set: the remaining roughly 600 samples, 10 to 25 per category
Experimental results (see left) :
The industry memory network is effective: it significantly improves model performance, especially with small samples (about 5-8 per category); combined with data enhancement, the model's F1 value improves by up to 50 percentage points.
Effective reduction in sample requirements: combining the industry memory network with data enhancement, the model outperforms the baseline trained on 100% of the data by 6 percentage points even with only 60% of the data
Parallel computing, lightweight model, high efficiency: the model consists of a CNN network and the industry memory network, both of which support parallel computation; the model is lightweight, so computation is efficient.
It is worth mentioning that, compared with the three-stage training algorithm, one advantage of the industry memory network is that it is task-independent: since it essentially stores industry information at the word level, it can easily be reused in other natural language processing tasks that involve industry analysis.
8. Specific entity extraction model
The specific entity extraction model here is basically the same as the one used in the bidding system; the only difference is that we introduce an entity disambiguation module into the entity extraction layer.
9. Function demonstration of financial public opinion monitoring system
Finally, we briefly show some functions of the financial public opinion monitoring system, such as hotspot tracking and the list view. On the hotspot tracking page, users can immediately see what institutions are focusing on and learn the sell side's research hotspots and company leads. On the list view page, historical information can be browsed along multiple dimensions: by event, company, industry, message type and so on.
10. Summary
We proposed a transfer learning algorithm based on the industry memory network to realize a distributed representation of vocabulary industry information. The algorithm helped the WeChat industry classification model achieve a 25 percentage point improvement in a small-sample scenario with only 200 samples.
We proposed a three-stage model optimization algorithm that transfers industry knowledge from massive external research reports to the WeChat industry classification task, achieving a 48 percentage point improvement with only about 120 samples.
We reviewed several typical text enhancement techniques and showed that text enhancement is a low-cost data lever that can effectively improve model performance on top of the original dataset. Based on text enhancement, we achieved a 6-9 percentage point improvement in the WeChat information classification model and a 3-30 percentage point improvement in the WeChat industry classification model.
Based on Entropy Simple's layered NLP technology architecture, we built a financial public opinion monitoring system for multi-dimensional extraction and analysis of sell-side WeChat group messages, helping financial asset management clients leap from data liabilities to data assets, gain forward-looking business insight and seize opportunities first.
04 Summary and Outlook
1. Summary
Information asymmetry is the focus of competition in financial industry. A large amount of timely and effective information is hidden in unstructured text. Financial institutions need to use NLP technology to understand content and mine information, so as to gain a competitive advantage in key information.
Natural language processing is a cognitive technology and a jewel in the crown of artificial intelligence. Many key problems and theories in this field have yet to be solved, and the technology is still far from what people expect. Our experience is that current NLP technology is not well suited to being used as a general-purpose tool; technical experts and business experts need to work together to find the business scenarios that best leverage its strengths, such as the bidding big data analysis system and the financial public opinion monitoring system.
2. Outlook
In-depth interaction with business experts to dig out more solid scenarios.
Try more cutting-edge technologies and ideas, for example text enhancement based on GPT-style generative models, and new ideas and algorithms introduced from the CV field.
05 Q&A
Q: In the auto consumer loan scenario, the sample size for post-loan collection warning is generally small. How can a small sample be learned reliably? (The baseline is rule-based.)
A:
Data enhancement: back-translation, EDA, non-core word replacement and context replacement have an obvious effect for text classification
Transfer learning: use corpora with similar text distributions from other scenarios, three-stage learning, domain vocabulary transfer and so on
Q: NLP now relies heavily on resource-hungry models such as BERT-large and XLNet. In actual deployment, should we pile on resources to guarantee the effect, or simplify the model?
A: For the vast majority of problems there is no need to pile on resources; performance comparable to, or better than, BERT-like models can be achieved with lightweight models.
The approximation theory of neural networks shows that, as long as the network is wide enough, a two-layer neural network can approximate any continuous function arbitrarily well. For most problems we do not really need a heavyweight model like BERT.
In fact, some financial customers require on-premises (private) deployment, for which BERT is a heavy burden
The results of BERT training can be used as baseline to optimize the lightweight model
That’s all for today’s sharing. Thank you.
Dr. Li
Entropy Simple Technology | Co-founder
He graduated from the Department of Electronic Engineering of Tsinghua University, has published more than 10 academic papers as first author, and has applied for 6 patents. He is committed to applying advanced natural language processing and deep learning technology to the financial asset management field and empowering the industry with technology. He is currently responsible for building the NLP technology center of Entropy Simple Technology, covering the layered technical architecture, the large-scale data acquisition system, the back-end support for continuous deployment, and the implementation of cutting-edge algorithms in the field, providing underlying technical support and deployable solutions for each major business line of Entropy Simple Technology.
For more technical articles, please follow Entropy Simple Technology's official WeChat account, "Entropy Jane Academy".
— the END —