I read the article “Template-free Prompt Tuning for Few-shot NER” from Fudan University, which proposes a new way of doing few-shot NER with prompts. Unfortunately, there is no open-source code, which left me with a few small questions that I only figured out by emailing the authors.
Contents:
1. The original mode of doing NER with Prompt and its disadvantages;
2. The idea of the article;
3. Solutions to the main problems in this paper;
4. My thoughts on the article.
1. The original mode of doing NER with Prompt and its disadvantages:
In the classic prompt mode, the original text fills one slot of a new text template, and the language model is asked to predict the other slot. The value of that slot is a pre-determined word that represents the label, usually a single English word, or a short word (say, 2 characters) in Chinese. Because the prediction is made by the language model and relies on its pre-training data, it is theoretically possible to do no training at all, or to train with very few samples. That is why this is called few-shot learning, or zero-shot learning.
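To make the classic mode concrete, here is a minimal sketch of prompt-based classification, assuming a BERT-style fill-mask model; the template, text, and label words are illustrative choices of mine, not from the article.

```python
# Minimal sketch of the classic prompt mode (illustrative, not the paper's setup).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

text = "The movie was wonderful from start to finish."
# The original text fills one slot of the template; the LM predicts the other.
template = f"{text} Overall it was [MASK]."

# Pre-determined label words that stand for each class.
label_words = {"positive": "great", "negative": "terrible"}

# Restrict the prediction to the label words and compare their scores.
for pred in fill(template, targets=list(label_words.values())):
    print(pred["token_str"], round(pred["score"], 4))
```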
The same holds when doing NER with prompts, as shown in the figure.
The differences from prompt-based classification are:
- Prompt-based NER needs to put each candidate span into the template as a slot, in order to determine which words in the text are entities;
- As a consequence of the previous item, every possible entity span is put into the template and predicted separately. (In n-gram terms, this amounts to trying every n-gram from 1 up to the length of the text; see the sketch below.) The enormous number of combinations leads to a huge amount of prediction, and the time cost would certainly be unbearable in real work. This is the main disadvantage.
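To see how quickly the span enumeration grows, here is a small sketch; the template wording in the comment is my own illustration.

```python
# Why classic prompt-based NER explodes: every candidate span
# (every "1 to length-of-text"-gram) becomes one prompt to predict.
def candidate_spans(tokens):
    """Enumerate every contiguous span of the token list."""
    return [tokens[i:j]
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)]

tokens = "Obama was born in Hawaii".split()
spans = candidate_spans(tokens)
# Each span would be filled into a template such as
# "<text>. <span> is a [MASK] entity."  (wording is illustrative)
print(len(tokens), "tokens ->", len(spans), "prompts")  # 5 tokens -> 15 prompts
# For n tokens there are n * (n + 1) / 2 spans, so the number of
# LM calls grows quadratically with sentence length.
```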
Therefore, any improvement has to abandon the idea of “putting the candidate span into the template as a slot”; in other words, the candidate span must not be a slot of the template.
2. The idea of the article:
- No template is used at all; the language model treats every word position as a mask and predicts each word individually;
- The predicted word is not the word from the original text but one of N preset label words, and each label word corresponds to one slot type.
- Time cost: n predictions for a text of n words (the article does not say this; it was my own reading, amended in section 5 below). Would that time be acceptable in practice?
- The example in the article is an entity made up of a single word. If several words make up one entity, how many mask positions should be predicted? After emailing the author, the reply: every position of a multi-word entity predicts the same label word (see the points of uncertainty below).
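Putting these points together, here is a hedged sketch of what the per-word prediction could look like, assuming a BERT-style masked LM that has already been few-shot fine-tuned as the article describes; the label words, entity types, and the simple “label word outscores the original word” decision rule are my own illustration, not the paper’s exact procedure.

```python
# Sketch of template-free, per-position prediction (illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One label word per entity type (choosing them is the topic of section 3).
label_words = {"PER": "person", "LOC": "location"}
label_ids = {t: tok.convert_tokens_to_ids(w) for t, w in label_words.items()}

enc = tok("obama was born in hawaii", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]  # (seq_len, vocab) -- a single forward pass

for pos, input_id in enumerate(enc["input_ids"][0]):
    scores = logits[pos]
    # Tag the position with the type whose label word outscores the
    # original word; otherwise treat it as a non-entity ("O").
    best_type, best_score = "O", scores[input_id]
    for typ, wid in label_ids.items():
        if scores[wid] > best_score:
            best_type, best_score = typ, scores[wid]
    print(tok.convert_ids_to_tokens([int(input_id)])[0], best_type)
```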
3. Solutions to the main problems in this paper:
Classic prompt tuning has two problems: how the template is built, and how the label words are chosen. This article uses no template, so only the choice of label words is involved.
Note that, since this is a few-shot setting, the task dataset is very small. To find more appropriate label words, words and texts of the “corresponding type” are therefore selected from open data. The consequence is that if the open data contains no such type, the method below cannot find a label word. For example, data for general types such as “location” and “person name” is easy to find; but if the entity type is specific to a vertical domain, or even a type defined by the business itself, there is no corresponding open data and the method below fails.
Open data here refers to large-scale open-source datasets outside the task’s own few-shot training set.
This paper selects label words by combining “dataset distribution” with “language model prediction”.
The general idea is:
- “Dataset distribution” means taking the words that appear most frequently under each entity type;
- “Language model prediction” means masking the positions of that slot type, letting the language model predict them, and selecting the high-frequency words among the predictions.
- The final choice combines the two: pick label words that make both of the above probabilities high. (I won’t write out the formula; a rough sketch follows below.)
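As a rough illustration of combining the two signals, here is a sketch; the tiny “open data”, the top-10 cutoff, and the add-one product score are my own simplifications, not the paper’s formula.

```python
# Sketch of label-word selection for one entity type (illustrative).
from collections import Counter

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# (sentence, position of a LOC entity) pairs drawn from open data.
samples = [("he flew to paris last week", 3),
           ("she lives in berlin now", 3)]

# Signal 1: dataset distribution -- frequency of the entity words themselves.
freq = Counter(sent.split()[i] for sent, i in samples)

# Signal 2: LM prediction -- mask each entity position and count the words
# the model puts into the slot.
lm_votes = Counter()
for sent, i in samples:
    words = sent.split()
    words[i] = tok.mask_token
    enc = tok(" ".join(words), return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]
    for tid in logits.topk(10).indices:
        lm_votes[tok.convert_ids_to_tokens([int(tid)])[0]] += 1

# Pick the word that scores high under both signals at once.
candidates = set(freq) | set(lm_votes)
best = max(candidates, key=lambda w: (freq[w] + 1) * (lm_votes[w] + 1))
print("chosen label word for LOC:", best)
```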
4. My thoughts on the article (amended below with a point I missed when first reading it):
Advantages:
- Each word needs to be predicted only once, so it can be used in practical work;
- No template needs to be built, so the workload drops accordingly;
- It leverages open-source data (depending on the slot category; see the “disadvantages” below);
- The idea for selecting label words is independent of the main idea of “template-free prompting”, so the selection scheme can be completely replaced, even set by hand.
Disadvantages:
- For vertical-domain tasks, the lack of relevant open-source entity data makes it impossible to select a good label word from large-scale open data. However, since label-word selection is independent of the main idea, it can be replaced by a similar scheme adapted to the vertical data.
Points of uncertainty:
- For an entity made up of multiple words, every mask position predicts the same label word. Is there a better way?
- Still on the previous point, what about Chinese? You can hardly predict a whole Chinese name from a single character, can you? Perhaps mask two or three characters instead; I’m not sure how the English solution carries over. Another way I can think of is to rely on word segmentation, but it can’t be used directly, because for a masked non-entity word, why should it be predicted at all? I haven’t worked this out yet.
5. What I missed when first reading the article
Following the language-model view, a whole sentence needs to be predicted only once: a single forward pass scores every position in parallel rather than making one masked prediction per word, which answers the time-cost question I raised in section 2.
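A quick way to see this, assuming a BERT-style model: one forward pass returns logits for every position at once, so decoding a whole sentence costs one LM call rather than one call per word.

```python
# One forward pass scores all positions simultaneously (illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

enc = tok("obama was born in hawaii", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(logits.shape)  # (1, seq_len, vocab_size): every position in one pass
```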