“Take you to NLP” is a series of practical projects based on PaddleNLP. Carefully prepared by senior Baidu engineers, the series walks through the full workflow of word vectors, language model training, information extraction, sentiment analysis, structured-data question answering, text analysis, text translation, machine simultaneous translation, and dialogue systems. It aims to help developers gain a comprehensive grasp of how to use the Baidu PaddlePaddle framework in the NLP field, and to apply the Paddle framework and PaddleNLP flexibly to their own NLP deep learning practice.
In June, Baidu PaddlePaddle and its natural language processing team jointly launched 12 NLP video classes that explain these practical projects in detail. To watch the course replays, visit: aistudio.baidu.com/aistudio/co… You are also welcome to join our QQ group (group number: 758287592).
1. Chatbots’ Past and Present Lives
Between 1964 and 1966, Joseph Weizenbaum, a German-American computer scientist at MIT’s Artificial Intelligence Laboratory, developed the first chatbot in history, Eliza. Eliza was named after the character in George Bernard Shaw’s play Pygmalion, in which Eliza, a flower girl from a poor family, learns how to speak with high society and is taken for a Hungarian princess at an embassy ball. As the world’s first chatbot, Eliza was given dramatic connotations by its author.

Eliza was one of the first programs explicitly designed to interact with humans, although basic digital language generators (programs that could output somewhat coherent text) already existed. The user typed natural language on a typewriter and received a response from the machine. As Weizenbaum explained, Eliza made “a conversation between a man and a computer possible.”

Chatbots have grown smarter as deep learning technology continues to evolve. Robots can now handle mechanical question-and-answer work, and we can also chat with intelligent bots in our spare time; they make life more colorful. Today, a simple chatbot can be created by combining Paddle with Wechaty.

The picture below shows a WeChat chatbot demo based on PaddleHub + Wechaty. Wechaty obtains the messages received by WeChat, the PLATO-mini model from PaddleHub generates a reply according to the conversation context, and the reply is sent back as a WeChat message to realize chit-chat interaction.
Below is a WeChat emotion recognition bot demo based on PaddleNLP + Wechaty. Wechaty obtains the messages received by WeChat, the TextCNN model from PaddleNLP classifies the sentiment of the input text, and the result is returned as a WeChat message to realize text emotion recognition.
Interested readers can follow this demo to build an emotion recognition bot on their own WeChat account.
Isn’t that interesting? If that’s not enough for you, sign up for PaddlePaddle’s creative competition held jointly with the open-source chatbot framework Wechaty and the designer community MixLab. PaddleNLP provides you with rich deep learning pre-trained models, and Wechaty provides a convenient SDK for building a chatbot. Following the existing demos, you can use PaddleNLP to implement fun features such as automatic poetry writing, over-the-top compliment generation (“rainbow farts”), name generation, and automatic couplet completion. Competition registration link: aistudio.baidu.com/aistudio/co…

Today we will use PaddleNLP to continue a poem and build a simple chatbot. Let’s get started!

2. Quick Practice
PaddleNLP provides a generate() API for generation tasks, built into all of PaddleNLP’s generative models. It supports the Greedy Search, Beam Search, and Sampling decoding strategies. Users only need to specify the decoding strategy and its parameters to run predictive decoding and obtain the token IDs and probability scores of the generated sequences.
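To build intuition for what a decoding strategy actually does, here is a framework-free toy sketch of greedy search. The scoring function below is entirely made up for illustration; in the real API the next-token distribution comes from the model:

```python
# Toy greedy decoding over a hypothetical 5-token vocabulary.
# The "model" is a stand-in that always prefers token (last + 1) % 5.
def toy_next_token_probs(prefix):
    last = prefix[-1]
    probs = [0.1] * 5
    probs[(last + 1) % 5] = 0.6
    return probs

def greedy_decode(start_token, max_length):
    seq = [start_token]
    for _ in range(max_length):
        probs = toy_next_token_probs(seq)
        # Greedy search: always take the argmax token at each step.
        next_token = max(range(len(probs)), key=probs.__getitem__)
        seq.append(next_token)
    return seq

print(greedy_decode(0, 4))  # [0, 1, 2, 3, 4]
```

Greedy search is deterministic and fast, but because it commits to the locally best token it can miss globally better sequences; that is what Beam Search and Sampling address.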
2.1 A small example of using the generation API with the GPT model
1. Load paddlenlp.transformers.GPTChineseTokenizer for data processing
Before text is fed into the pre-trained model, it must be processed into features. This usually involves steps such as word segmentation, converting tokens to IDs, and adding special tokens.
PaddleNLP has built-in tokenizers for its various pre-trained models, which are loaded by specifying the name of the model you want to use. Calling the GPTChineseTokenizer’s __call__ method converts our input text into input the model can accept.
```python
import paddle
from paddlenlp.transformers import GPTChineseTokenizer

# Set the name of the model you want to use
model_name = 'gpt-cpm-small-cn-distill'
tokenizer = GPTChineseTokenizer.from_pretrained(model_name)

# "A pot of wine among the flowers, I drink alone with no companion.
#  Raising my cup, I invite the bright moon," (Li Bai)
user_input = "花间一壶酒，独酌无相亲。举杯邀明月，"

# Convert the text to token IDs
input_ids = tokenizer(user_input)['input_ids']
print(input_ids)

# Put the converted IDs into a tensor and add a batch dimension
input_ids = paddle.to_tensor(input_ids, dtype='int64').unsqueeze(0)
```
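The text-to-ID step the tokenizer performs can be illustrated with a toy character-level version. The vocabulary below is hand-made for this sketch and has nothing to do with the real GPT vocabulary:

```python
# Toy character-level tokenizer: map each character to an ID via a
# small hand-made vocabulary; unknown characters map to the [UNK] ID.
vocab = {'[UNK]': 0, '花': 1, '间': 2, '一': 3, '壶': 4, '酒': 5}
inv_vocab = {i: tok for tok, i in vocab.items()}

def toy_encode(text):
    return [vocab.get(ch, vocab['[UNK]']) for ch in text]

def toy_decode(ids):
    return ''.join(inv_vocab[i] for i in ids)

ids = toy_encode('花间一壶酒')
print(ids)              # [1, 2, 3, 4, 5]
print(toy_decode(ids))  # 花间一壶酒
```

Real tokenizers also handle subword segmentation and special tokens, but the core idea is the same dictionary lookup in both directions.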
2. Use PaddleNLP to load the pre-trained model with one click

PaddleNLP provides Chinese pre-trained models such as GPT and UnifiedTransformer, which can be loaded with one click by specifying the name of the pre-trained model.

GPT uses the Transformer decoder as its basic network component, with a unidirectional (left-to-right) attention mechanism, which makes it well suited to long text generation tasks.

PaddleNLP currently offers a variety of GPT pre-trained models in both Chinese and English. Here we use a small Chinese GPT pre-trained model.
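The “unidirectional” attention mentioned above means position i may only attend to positions up to i. A minimal framework-free sketch of building such a causal mask:

```python
# Build a causal (lower-triangular) attention mask: mask[i][j] is True
# when query position i is allowed to attend to key position j.
def causal_mask(seq_len):
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print([int(x) for x in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In a real decoder this mask is applied to the attention scores so that each generated token can only look at its predecessors, which is exactly what makes left-to-right generation possible.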
```python
from paddlenlp.transformers import GPTLMHeadModel

# One-click loading of the Chinese GPT model
model = GPTLMHeadModel.from_pretrained(model_name)
```
```python
# Call the generate() API to continue the text
ids, scores = model.generate(
    input_ids=input_ids,
    max_length=16,
    min_length=1,
    decode_strategy='greedy_search')

generated_ids = ids[0].numpy().tolist()

# Use the tokenizer to convert the generated IDs back to text
generated_text = tokenizer.convert_ids_to_string(generated_ids)
print(generated_text)
```

Output:

对影成三人。 (“…and my shadow makes us three.”)
As you can see, the generated continuation is quite good, and the generation API is very easy to use.
2.2 Using the UnifiedTransformer model and the generation API for chit-chat
1. Load paddlenlp.transformers.UnifiedTransformerTokenizer for data processing
UnifiedTransformerTokenizer is invoked in the same way as the GPT tokenizer, but its data processing API differs slightly. Calling the UnifiedTransformerTokenizer’s dialogue_encode method converts our input into input the model can accept.
```python
from paddlenlp.transformers import UnifiedTransformerTokenizer

# Set the name of the model you want to use
model_name = 'plato-mini'
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name)

# "Hello, how old are you?"
user_input = ['你好，你多大了？']

# Call the dialogue_encode method to generate the model input
encoded_input = tokenizer.dialogue_encode(
    user_input,
    add_start_token_as_response=True,
    return_tensors=True,
    is_split_into_words=False)
```
2. Use PaddleNLP to load the pre-trained model with one click

As with GPT, we can load the UnifiedTransformer pre-trained model with one click.

UnifiedTransformer uses the Transformer encoder as its basic network component with a flexible attention mechanism, and adds special tokens to the model input to identify different dialogue skills, so that a single model can support chit-chat, recommendation, and knowledge-grounded dialogue at the same time.
PaddleNLP currently provides three Chinese pre-trained models for UnifiedTransformer:

- unified_transformer-12L-cn: pre-trained on a large-scale Chinese conversation dataset.
- unified_transformer-12L-cn-luge: obtained by fine-tuning unified_transformer-12L-cn on the LUGE (千言) dialogue dataset.
- plato-mini: pre-trained on billions of Chinese chit-chat conversations.
```python
from paddlenlp.transformers import UnifiedTransformerLMHeadModel

model = UnifiedTransformerLMHeadModel.from_pretrained(model_name)
```
Next we pass the processed input into the generate() function and configure the decoding strategy. Here we use Top-k sampling: at each step, the next token is sampled from among the k tokens with the highest probability.
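The Top-k sampling idea can be sketched without any framework: keep only the k most probable tokens, renormalize their probabilities, and sample among them. This toy version operates on a made-up distribution:

```python
import random

def top_k_sample(probs, k, rng=random.random):
    # Keep the indices of the k highest-probability tokens.
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in top)
    # Renormalize and draw one survivor proportionally to its probability.
    r = rng() * total
    acc = 0.0
    for i in top:
        acc += probs[i]
        if r <= acc:
            return i
    return top[-1]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
print(top_k_sample(probs, k=2))  # always 0 or 1
```

Compared with greedy search, this keeps some randomness (replies vary between runs), while the k cutoff prevents the model from sampling very unlikely, incoherent tokens.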
```python
ids, scores = model.generate(
    input_ids=encoded_input['input_ids'],
    token_type_ids=encoded_input['token_type_ids'],
    position_ids=encoded_input['position_ids'],
    attention_mask=encoded_input['attention_mask'],
    max_length=64,
    min_length=1,
    decode_strategy='sampling',
    top_k=5,
    num_return_sequences=20)
```
```python
from utils import select_response

# Simply choose the best response according to the probability score
result = select_response(ids, scores, tokenizer, keep_space=False,
                         num_return_sequences=20)
print(result)
```

Output:

['我今年23岁了'] (“I’m 23 years old”)
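select_response comes from the project’s utils.py; its core idea, keeping the candidate with the best score, can be sketched as follows (the candidate strings and scores below are made up for illustration):

```python
# Toy response selection: given candidate responses and their scores,
# return the response whose score is highest.
def pick_best(responses, scores):
    best = max(range(len(responses)), key=lambda i: scores[i])
    return responses[best]

candidates = ['我今年23岁了', '不告诉你', '你猜']  # hypothetical candidates
scores = [-0.2, -1.5, -0.9]                       # higher (less negative) is better
print(pick_best(candidates, scores))  # 我今年23岁了
```

Generating num_return_sequences=20 candidates and then ranking them is a simple way to combine the diversity of sampling with the quality of score-based selection.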
The PaddleNLP examples also provide code for building a complete dialogue system with human-computer interaction; interested readers can try it in a terminal. Human-computer interaction example: github.com/PaddlePaddl… Wouldn’t it be fun to try it out? We strongly recommend that beginners type out the code above by hand, as this is the best way to deepen your understanding of it. Code for this project: aistudio.baidu.com/aistudio/pr… For more PaddleNLP information, welcome to star and bookmark the GitHub repo: github.com/PaddlePaddl…
The Baidu AI developer community (ai.baidu.com/forum) provides a platform for developers across the country to communicate, share, and answer questions, so that developers no longer “fight alone” on the road of research and development, and can find better technical solutions through continuous exchange and discussion. If you want to try all kinds of artificial intelligence technologies and explore application scenarios, join the Baidu AI community: everything you can imagine about AI can be realized here!