People who have just started doing algorithm work tend to make many mistakes, such as becoming obsessed with new models while ignoring the basics. I strongly encourage you to read this article by Dr. Jiwei Li of Shannon Technology carefully; it will be very helpful. The original text follows.

With more than 2,800 submissions, ACL 2019 is arguably the largest in the conference's history. Driven by deep learning, natural language processing has gradually been pushed to the forefront of artificial intelligence.

Recently, many students, especially those who have just entered the field, have written to me by email or on Zhihu about their confusion over how to do NLP research in the deep learning era.

Today, many models can be implemented in a few dozen lines of TensorFlow or PyTorch. People take great pains to push numbers on benchmark datasets, but because the bar for implementing a model has dropped, it is hard to reach SOTA. Even when a result is squeezed out, the model is often just incremental tinkering, so the paper's novelty is limited and the outcome is unsatisfying. And even papers that do get published often contain no really new idea and simply get lost in the flood of similar work.

The popularity of deep learning makes researchers pay too much attention to the algorithms themselves, and the endless tweaking of model architectures can leave us confused. **While talking about deep network architectures has become a cool thing to do, vanity makes it easy to overlook several important points.** Based on the detours and pitfalls I have encountered over the years, this article offers a small summary. I hope it will be helpful to students who have just entered the field of NLP.

1. Understand the basic knowledge of NLP

Jurafsky and Martin's Speech and Language Processing is a classic textbook in the field; it covers the fundamentals of NLP, basic linguistic background, and the core tasks and their solutions.

Reading it will expose you to many of the most basic NLP tasks and concepts, including tagging, the various kinds of parsing, coreference resolution, semantic role labeling, and so on.

This is extremely important for getting a global view of the NLP field. You do not need to memorize the book, but going through it once or twice gives you at least a basic understanding of the standard NLP tasks, so that the next time you encounter one you know where to look; that alone is very valuable.

In addition, Chris Manning's Introduction to Information Retrieval is another good book for filling in the basics. Of course, you do not need to remember every detail, but you should know the broad outline.

Many of the basic algorithms in IR overlap heavily with NLP. Let me share one of my own missteps. Part of the Stanford NLP qualification exam is to read certain chapters of Jurafsky's and Manning's books, after which the professors ask questions about them.

At first I was too lazy to read the material, so I kept putting off the qualification exam. When I could no longer delay it in the final year of my PhD, I realized that if I had learned these things earlier, I could have avoided many detours in my early years.

I’ll give you a few examples of why understanding the basics of NLP is important.

Recently I worked with some students on language modeling. Many of them can readily build language models with LSTMs or Transformers, yet get stuck for the better part of a day implementing a bigram or trigram language model (LM) because of OOV and smoothing issues (for those familiar with the topic: you need Laplace smoothing, or the more sophisticated Kneser-Ney smoothing).

Why do bigram and trigram LMs matter? For a language modeling problem, the first step before implementing any deep model should be to write a bigram or trigram LM. Why? Because these n-gram models are simple to implement and robust, and this simple baseline tells you the lower bound of LM performance on the dataset.

So we know that the neural network model should be at least as good as this model.

Because of issues such as hyperparameter sensitivity and exploding gradients, it is sometimes hard to tell whether a neural model is genuinely weak, poorly tuned, or simply buggy. With the lower bound given by the n-gram LM, we can quickly tell whether the neural network has a bug or mis-tuned hyperparameters.
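To make that baseline concrete, here is a minimal sketch (my own illustration, not code from the article) of a bigram LM with Laplace (add-one) smoothing, whose perplexity can serve as the lower bound mentioned above:

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences (lists of strings)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """P(w | w_prev) with add-one (Laplace) smoothing; unseen events get nonzero mass."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(sentences, unigrams, bigrams):
    vocab_size = len(unigrams)
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w_prev, w in zip(tokens[:-1], tokens[1:]):
            log_prob += math.log(bigram_prob(w_prev, w, unigrams, bigrams, vocab_size))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Usage: train on the corpus, then compare this perplexity with the neural LM's.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram_lm(corpus)
print(perplexity([["the", "cat", "sat"]], unigrams, bigrams))
```

If the LSTM or Transformer LM cannot beat this number, look for a bug or bad hyperparameters before anything else.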

The second example concerns publishing papers. I wonder whether any of you have thought about why randomly replacing tokens when training the LM in BERT improves results: what exactly is this random replacement doing, and why does it help?

In fact, before BERT, the paper Data Noising as Smoothing in Neural Network Language Models (ICLR 2017, arxiv.org/pdf/1703.0257) had already proposed this method and given a theoretical explanation: this random replacement is essentially an interpolation-based smoothing method for language modeling, and interpolation-based LM smoothing is described in Section 3.4.3 of Jurafsky's book.
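As a quick reminder of what interpolation-based smoothing looks like (a textbook formula along the lines of Jurafsky's book, not the exact derivation in the ICLR 2017 paper), a bigram probability is mixed with its lower-order unigram estimate:

$$
P_{\text{interp}}(w_i \mid w_{i-1}) = \lambda\, P_{\text{ML}}(w_i \mid w_{i-1}) + (1-\lambda)\, P_{\text{ML}}(w_i)
$$

Randomly replacing a context word during training means the model sometimes sees a corrupted, effectively lower-order context, which is roughly the intuition for why noising acts as smoothing.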

2. Understand the classic NLP models and papers of earlier years

Compared with simple, brute-force neural network models, early NLP algorithms are indeed more complicated, but they embody the distilled wisdom of earlier researchers working under much harsher hardware conditions.

If you are familiar with these models, their ideas can be integrated into today's neural networks. I gave a seminar at Renmin University last year, attended by about 30 to 40 students. I asked whether anyone knew what the IBM models in machine translation were, and about a fifth of the students raised their hands.

I then asked whether anyone could write out (or roughly write out) IBM Model 1. No one could. Yet in recent years there have been many highly cited papers based on the ideas of the IBM models and hierarchical phrase-based MT alone. The examples are endless:

  1. Incorporating Structural Alignment Biases into an Attentional Neural Translation Model (NAACL 2016): when translating English to French, if a French word in the target attends to a particular English word in the source, then when translating in the reverse direction, that English word in the target should also attend to the corresponding French word in the source. This idea is essentially the same as one of Percy Liang's most famous works, Alignment by Agreement, from as early as 2006; you can guess the content from the title alone: alignments in the forward and reverse translation directions should agree. I do not know whether the authors of this work had read Percy's paper.

  2. Using the reverse probability p(source | target) to rerank candidates and suppress boring replies in dialogue systems is now standard practice. Likewise, consider one of Rico Sennrich's early works, incorporating monolingual data into the seq2seq model. In fact, these ideas were widely used in phrase-based MT long ago. Before neural MT, a large n-best list had to be reranked with MERT, and the reverse probability p(source | target) and the language model probability p(target) were standard features in that reranking (a minimal sketch of such a reranking score follows this list).

  3. The EMNLP 2016 Best Paper runner-up from the Harvard NLP group, by Sam Wiseman and Alexander Rush, Sequence-to-Sequence Learning as Beam-Search Optimization, essentially inherits the LaSO model of Daumé III and Daniel Marcu (2005) and adapts its ideas to the neural setting.
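As a minimal sketch of the reranking described in item 2 (my own illustration; the feature names and weights here are made up), the n-best candidates are scored log-linearly, with the reverse probability and the LM probability as extra features:

```python
def rerank(nbest, lam_rev=0.5, lam_lm=0.3):
    """nbest: list of dicts holding log p(t|s), log p(s|t), and log p(t).
    Returns candidates sorted by the combined log-linear score."""
    def score(c):
        return (c["logp_t_given_s"]
                + lam_rev * c["logp_s_given_t"]   # reverse probability feature
                + lam_lm * c["logp_t"])           # language model (fluency) feature
    return sorted(nbest, key=score, reverse=True)

# Usage: candidates would come from beam search; in phrase-based MT the
# feature weights would be tuned with MERT on a dev set.
candidates = [
    {"text": "i don't know", "logp_t_given_s": -1.0, "logp_s_given_t": -9.0, "logp_t": -1.5},
    {"text": "the meeting is at noon", "logp_t_given_s": -1.8, "logp_s_given_t": -2.0, "logp_t": -3.0},
]
print(rerank(candidates)[0]["text"])  # the boring reply is pushed down
```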

Attention, which was born in neural MT, is essentially a neural network version of the IBM alignment models.
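To make the analogy concrete (a textbook sketch of attention, not a formal equivalence), the attention weights form a soft alignment distribution over source positions, much like the per-word alignment probabilities in IBM Model 1:

$$
\alpha_{ij} = \frac{\exp\big(\mathrm{score}(s_i, h_j)\big)}{\sum_{k} \exp\big(\mathrm{score}(s_i, h_k)\big)}, \qquad c_i = \sum_j \alpha_{ij}\, h_j
$$

where $h_j$ is the encoder state for source word $j$, $s_i$ is the decoder state at target position $i$, and $c_i$ is the context vector used to generate target word $i$.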

3. Understand basic models of machine learning

Neural networks are simple, brute-force, and effective. But from a scientific point of view, familiarity with basic machine learning algorithms is still required. For example, Andrew Ng's Machine Learning course is more or less a must.

I remember interviewing a young man some time ago: clearly a very smart student, and he already had a NAACL paper after only a short time in the field. I asked him what the EM algorithm was. He said he had never heard of EM and had no use for it in his research.

I think this is actually a big mistake. Thinking back on myself, I have suffered many similar losses. Because my mathematical foundation was weak in my early years and I lacked the determination to make up for it, every time I encountered algorithms involving variational inference I got lost. This one-sidedness lasted a long time and limited the scope of my research.

Compared with brute-force neural networks, inference in models such as CRFs is indeed relatively complex (back then I had to read it many times before I fully understood it). But understanding it is essential for an NLP researcher.

As for the book Pattern Recognition and Machine Learning, some sections are truly difficult (which again exposes a weak math foundation); even just getting through them takes a lot of stamina, let alone understanding them completely.

I have given up many times myself, and there are still many chapters I do not really understand. Even so, I think many of the basic chapters are worth reading. In practice, you can form a study group of two or three people, set modest goals, and spend a year or even two working through a few of the important chapters.

NLP is an applied science, not an especially mathematical one. But I think we need to understand the basic mathematics behind the algorithms we use every day: dropout, SGD, momentum, AdaBoost, Adagrad, and the various batch and layer normalizations. In practice this saves a lot of wasted time; sharpening the axe does not delay the cutting of firewood.

Over the years, while helping students debug their code, I have met at least three to five who, after training with dropout, failed to scale each unit by (1 - dropout) at test time (don't laugh, it's true). If you then plot performance against the dropout rate, the larger the dropout, the worse the result. During the discussion these students looked confused, not knowing that scaling is needed at test time; the root cause is not understanding the math behind dropout.
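Here is a minimal numpy sketch (my own illustration, not the author's code) of the convention being described: units are dropped with probability p during training, so at test time the activations must be scaled by (1 - p) to match their training-time expectation. Forgetting that scale is exactly the bug above. (The common alternative, "inverted" dropout, instead scales by 1/(1 - p) during training so test time needs no change.)

```python
import numpy as np

p = 0.5  # dropout rate: probability of zeroing a unit

def dropout_train(x, p, rng=np.random.default_rng(0)):
    """Training time: zero each unit with probability p."""
    mask = rng.random(x.shape) >= p
    return x * mask

def dropout_test_correct(x, p):
    """Test time: keep all units but scale by (1 - p),
    so the expected activation matches training."""
    return x * (1.0 - p)

def dropout_test_buggy(x, p):
    """The bug: no scaling, so activations are too large by a factor of 1/(1 - p)."""
    return x

x = np.ones(4)
print(dropout_test_correct(x, p))  # [0.5 0.5 0.5 0.5]
print(dropout_test_buggy(x, p))    # [1. 1. 1. 1.]  -> inflated at test time
```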

4. Read more papers in other NLP sub-areas

NLP has many sub-areas, including MT, information extraction, parsing, tagging, sentiment analysis, MRC, and many more. It is important to familiarize yourself with developments in sub-areas other than your own.

In fact, the models used in different sub-areas do not differ much. However, it can be a little difficult to approach problems in an unfamiliar area at first, because you do not yet understand how the problem is formalized. This may require spending more time and asking students who know the area better.

Understanding the formalization of different problems is also the best way to expand your domain knowledge.

5. Understand the basics and major advances in CV and data mining

After becoming familiar with the points above (which may of course take at least a year), I think it is very important to also become familiar with the basic tasks and algorithms of the CV field, to broaden your research horizons.

There is no denying, however, that because the field is different, the writing style and terminology are very different, and because of missing background knowledge (papers omit some basics on the assumption that everyone knows them, which readers from another field may not), reading papers across disciplines is not easy at first.

I once made this mistake myself: in a discussion class I explained Faster R-CNN, thought I understood it, and got it wrong. (To this day, Yu still teases me about it.) The important point, though, is that some important papers in NLP have more or less borrowed ideas from CV, and of course CV has borrowed from NLP as well.

Research on the visualization and interpretability of neural networks in NLP still lags behind CNN visualization in CV, so much of that work borrows heavily from similar work in CV. The use of GANs in NLP also follows CV. The two fields really are very similar.

For instance, leaving aside the question query, region proposal in visual object detection (finding a specific region against a large image background) is roughly the same problem as span extraction in MRC. And image caption generation and sequence-to-sequence models are hardly different in nature at all.

In generation tasks, reinforcement learning made its way first to MT (Ranzato et al., ICLR 2016), then to image caption generation, and then back to summarization; actor-critic models followed a similar path. The same goes for the many papers on generation diversity.

Because reading across fields is difficult, I recommend starting with tutorials, ideally ones that include pseudocode. Introductory lecture videos such as Stanford CS231n are also a good option. In addition, having someone in your NLP group with solid CV knowledge is important, and vice versa.

Graph embedding has risen in the data mining field over the last two years, and it will foreseeably be (or already is) widely used in many NLP tasks. Years ago, DeepWalk borrowed from word2vec, started out in data mining, and then seems to have circled back into NLP.

That is all for now; additions and criticism are welcome.

Finally, you are welcome to follow my WeChat official account, Duibainotes, which tracks frontier topics in machine learning such as NLP, recommender systems, and contrastive learning. I also share my entrepreneurial experience and reflections on life from time to time. Students who want to discuss further can add me on WeChat to talk about technical problems. Thank you!