Background
A student who was about to graduate messaged me on WeChat asking how to use Postman, so I answered her question.
I knew her major was not computer related, so it seemed odd that she needed Postman.
It turned out that her graduation project was to analyze Weibo posts about housing prices from the last 10 years, train a model on them, and forecast the future housing price trend.
After some research, she decided to use Baidu's [language processing technology] to run semantic analysis on the raw data, that is, to get a sentiment polarity classification for each post: 0 negative, 1 neutral, 2 positive.
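The numeric labels can be turned into readable names with a small lookup. This is only a minimal sketch: the function name `sentiment_label` is hypothetical, and the mapping is the one stated above.

```php
<?php
// Hypothetical helper: map the sentiment code (0, 1, 2) to a readable label.
function sentiment_label(int $code): string
{
    $labels = [0 => 'negative', 1 => 'neutral', 2 => 'positive'];
    // Anything outside 0..2 is reported as unknown rather than guessed.
    return $labels[$code] ?? 'unknown';
}

echo sentiment_label(2), PHP_EOL;
```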
The official demo is based on Postman. It is simple for us professionals, but it is a real hurdle for a liberal arts student.
After teaching her how to use Postman, I asked her a question:
Now that you know how to use Postman, you can query the semantic analysis result for any single piece of data. But there are hundreds of thousands of Weibo posts about housing prices. You can't run them through Postman one by one, can you?
She was stumped.
A tech guy's moment to shine
I told her not to worry: a programming language such as Python, Java, or PHP solves this easily.
But after talking it over, I found that she was not interested in programming. Although she had studied it before, she could not implement the requirement in the short time she had.
It was time to show some real skill.
I helped her build a batch semantic recognition system on the Baidu AI open platform, which also served as an out-of-the-box test of Baidu's [language processing technology].
Analysis
Since she is not very familiar with computers, the requirement should be met in the simplest way possible.
Minimize code and use off-the-shelf tools wherever possible.
The development language is PHP, which is easy to learn.
The database tool is Navicat, which works out of the box.
The development environment is installed with the LNMP one-click installation package.
Just do it. Start right away.
Prepare the data source
She had obtained 200,000+ Weibo posts about housing prices from an online shopping site. What we need now is a sentiment judgment on the housing price trend for each of those 200,000+ posts, based on semantic analysis (we used the [language processing technology NLP] service provided by Baidu). The Excel data source is imported directly through Navicat.
- First, design the MySQL table structure according to the data source and the results returned by the Baidu semantic interface.
- Since there are 200,000+ rows, import them with the MySQL visualization tool [Navicat], which is also easy for her to operate.
Note: Match the source fields to the target fields of the table.
- Select "append" for the first import; select "update" when re-importing data later to refine the model.
- Click Start to import the Excel source data into the MySQL database.
- After the import completes, a count query in the Navicat console shows 231007 rows.
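The table design and the row count check might look roughly like the sketch below. The database and table names (`liuxx.semantic_analysis`) and the column names come from the batch script in this article; the column types, the index, the nullable `sentiment`, and the presence of a `ctime` column are assumptions, not the author's actual schema.

```php
<?php
// Hypothetical table definition. Column names follow the fields the batch
// script reads and writes; every column type here is an assumption.
$createTableSql = <<<SQL
CREATE TABLE IF NOT EXISTS liuxx.semantic_analysis (
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT,
    text          TEXT NOT NULL,
    positive_prob DECIMAL(6,4) NOT NULL DEFAULT 0,
    negative_prob DECIMAL(6,4) NOT NULL DEFAULT 0,
    confidence    DECIMAL(6,4) NOT NULL DEFAULT 0,
    sentiment     TINYINT NULL,
    ctime         INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (id),
    KEY idx_positive_prob (positive_prob)
) DEFAULT CHARSET = utf8mb4
SQL;

// Row count check to run after the import.
$countSql = 'SELECT COUNT(*) FROM liuxx.semantic_analysis';

echo $createTableSql, PHP_EOL, $countSql, PHP_EOL;
```

Defaulting `positive_prob` to 0 matters here, because the script treats `positive_prob = 0` as "not analyzed yet".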
Setting up the development environment
Since her ultimate goal is to train a model rather than to learn programming, the development environment should also be as simple as possible. I recommended the LNMP one-click installation package, and she had the LNMP environment running in about 10 minutes.
Writing the code
Key script code and ideas:
Field description:
liuxx is the database name and semantic_analysis is the table name.
Code design ideas:
Use a do-while loop to request the Baidu AI semantic analysis interface in batches. Each pass queries rows with positive_prob = 0 (that is, rows that have not been analyzed yet).
When the query returns no rows, every row has been sent to the Baidu semantic analysis interface and its result written back to the data table, so the loop ends.
Points to note:
The script sleeps for 1 second after each query, because the free version of the Baidu semantic analysis interface has a QPS limit; the pause avoids requests being rejected.
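The loop idea can be sketched independently of MySQL and the Baidu API. In the sketch below, `drain_pending` and its two callbacks are hypothetical names introduced for illustration, and an in-memory array stands in for the table.

```php
<?php
/**
 * Hypothetical sketch of the polling loop: keep fetching unprocessed rows
 * until none remain, pausing between batches to stay under a QPS limit.
 *
 * $fetchBatch returns an array of rows still waiting for analysis.
 * $processRow handles one row and must mark it done (or delete it);
 * otherwise the same row is fetched again on the next pass.
 */
function drain_pending(callable $fetchBatch, callable $processRow, int $pauseSeconds = 1): int
{
    $handled = 0;
    do {
        $rows = $fetchBatch();
        foreach ($rows as $row) {
            $processRow($row);
            $handled++;
        }
        if (!empty($rows) && $pauseSeconds > 0) {
            sleep($pauseSeconds); // throttle between batches
        }
    } while (!empty($rows));

    return $handled;
}

// Usage with an in-memory queue standing in for the MySQL table.
$pending = [['id' => 3], ['id' => 2], ['id' => 1]];
$done = [];
$count = drain_pending(
    function () use (&$pending) {
        return array_slice($pending, 0, 2); // batch size of 2 for the demo
    },
    function (array $row) use (&$pending, &$done) {
        $done[] = $row['id'];
        $pending = array_values(array_filter($pending, fn ($r) => $r['id'] !== $row['id']));
    },
    0 // no pause in this demo
);
echo $count, PHP_EOL;
```

The pause sits between batches, as in the article's script. If the QPS limit is lower than the batch size, the pause could instead move into the per-row callback.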
Implementation process
Query data:
- The query condition is positive_prob = 0 (the row has not yet been sent to the Baidu interface).
- Sort order: descending by id.
- Pagination: fetch 10 rows per query.
Process data and request the Baidu interface:
- Each queried row is encoded with json_encode() and then sent to the Baidu interface.
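Building the request for one row could look like the sketch below. The endpoint and the `charset` and `access_token` query parameters are taken from the script in this article; the helper name `build_sentiment_request` and the sample text are hypothetical, and the token is a placeholder.

```php
<?php
/**
 * Hypothetical helper: build the URL and JSON body for one text.
 * Returns [url, body].
 */
function build_sentiment_request(string $accessToken, string $text): array
{
    // Endpoint as used in the batch script.
    $endpoint = 'https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify';
    $url = $endpoint . '?' . http_build_query([
        'charset' => 'UTF-8',
        'access_token' => $accessToken,
    ]);
    // JSON_UNESCAPED_UNICODE keeps Chinese text readable in logs; the
    // default escaped form is equally valid JSON.
    $body = json_encode(['text' => $text], JSON_UNESCAPED_UNICODE);

    return [$url, $body];
}

[$url, $body] = build_sentiment_request('XXXXXXXXXXX', '房价会继续上涨吗');
echo $url, PHP_EOL, $body, PHP_EOL;
```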
Process the results returned by Baidu
- Exception handling: if the error_code returned by Baidu is 282131, the text is too long and exceeds the length limit of Baidu semantic analysis.
- Rows that exceed the limit are deleted from MySQL so that they are not requested again.
- The script prints its progress to help you trace results and locate problems.
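The decision made for each response can be isolated in one small function. This is a sketch under assumptions: `classify_response` and its three return values are hypothetical, while the error code 282131 and the `items[0].positive_prob` field come from the article.

```php
<?php
/**
 * Hypothetical helper: decide what to do with one decoded API response.
 * Returns 'too_long', 'ok', or 'error'.
 */
function classify_response($res): string
{
    if (!is_array($res)) {
        return 'error'; // request failed or the body was not valid JSON
    }
    if ((int) ($res['error_code'] ?? 0) === 282131) {
        return 'too_long'; // text exceeds the length limit: delete the row
    }
    if (isset($res['items'][0]['positive_prob'])) {
        return 'ok'; // sentiment scores are present
    }
    return 'error'; // any other failure: leave the row for a later pass
}

var_dump(classify_response(['error_code' => 282131]));
```

Keeping "too long" separate from other errors matters: only the former is permanent, so only those rows should be deleted.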
Update the returned results in the data table
- If the returned positive_prob is not 0, the semantic analysis succeeded and returned a result.
- Write the returned result to the MySQL table.
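The write-back can also be expressed as a parameterized statement instead of string concatenation. The sketch below assumes PDO is used in place of the article's own Db class; `build_update` is a hypothetical helper, and the table and column names are those of the article's script.

```php
<?php
/**
 * Hypothetical helper: build a parameterized UPDATE from one result item.
 * Returns [sql, params], suitable for PDO::prepare() and execute().
 */
function build_update(int $id, array $item): array
{
    $sql = 'UPDATE liuxx.semantic_analysis
            SET positive_prob = :positive_prob,
                negative_prob = :negative_prob,
                confidence    = :confidence,
                sentiment     = :sentiment
            WHERE id = :id';
    $params = [
        ':positive_prob' => (float) $item['positive_prob'],
        ':negative_prob' => (float) $item['negative_prob'],
        ':confidence'    => (float) $item['confidence'],
        ':sentiment'     => (int) $item['sentiment'],
        ':id'            => $id,
    ];

    return [$sql, $params];
}

[$sql, $params] = build_update(7, [
    'positive_prob' => 0.82,
    'negative_prob' => 0.18,
    'confidence'    => 0.6,
    'sentiment'     => 2,
]);
// With a PDO connection in $pdo (an assumption):
//   $pdo->prepare($sql)->execute($params);
var_dump($params);
```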
Core code of the batch script:
File name: batchProcessing.php
<?php
ini_set('memory_limit', '256M');        // raise the memory limit

include '../include/ConfigLiuxx.php';   // database configuration
include '../include/Db.php';            // database class
include '../include/Logger.php';        // logging
include '../include/Request.php';       // HTTP request helper

define('Index_table', 'semantic_analysis');

$db_liuxx = new Db($db_liuxx);

// Token provided by Baidu
$access_token = 'XXXXXXXXXXX';
$url = 'https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify?charset=UTF-8&access_token=' . $access_token;

$limit = 10;
$offset = 0;

do {
    // Fetch the next batch of rows that have not been analyzed yet.
    $datas = $db_liuxx->get_all('select * from liuxx.semantic_analysis WHERE positive_prob = 0 order by id desc limit ' . $offset . ',' . $limit);
    foreach ($datas as $key => $value) {
        $id = $value['id'];
        $text = $value['text'];
        $params = ['text' => $text];
        $bodys = json_encode($params);
        $response = request_post($url, $bodys);
        $res_data = json_decode($response, true);

        if (isset($res_data['error_code']) && $res_data['error_code'] == 282131) {
            // Text exceeds Baidu's length limit: delete the row so it is not requested again.
            $db_liuxx->query('delete from liuxx.semantic_analysis WHERE id = ' . $id);
            var_dump($id . ' text too long, deleted');
            continue; // this response has no items to read
        }

        echo 'id:';
        var_dump($id);
        var_dump($res_data);

        $item = $res_data['items'][0] ?? [];
        $data = [
            'positive_prob' => $item['positive_prob'] ?? 0,
            'confidence'    => $item['confidence'] ?? 0,
            'negative_prob' => $item['negative_prob'] ?? 0,
            'sentiment'     => $item['sentiment'] ?? 0,
            'ctime'         => time(),
        ];

        if ($data['positive_prob']) {
            var_dump($data);
            $res = $db_liuxx->query('update liuxx.semantic_analysis set positive_prob = ' . $data['positive_prob'] . ', confidence = ' . $data['confidence'] . ', negative_prob = ' . $data['negative_prob'] . ', sentiment = ' . $data['sentiment'] . ' where id = ' . $id);
            var_dump($res);
        } else {
            // The row keeps positive_prob = 0 and is picked up again on a later pass.
            var_dump('baidu did not return a result');
        }
    }
    sleep(1); // stay under the QPS limit
} while (!empty($datas)); // keep looping while rows are found

// HTTP POST helper called above as request_post()
/**
 * Make an HTTP POST request (REST API).
 * @param string $url
 * @param string $param
 * @return string|false HTTP response body on success, false otherwise
 */
function request_post($url = '', $param = '')
{
    if (empty($url) || empty($param)) {
        return false;
    }
    $postUrl = $url;
    $curlPost = $param;
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $postUrl);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_POST, 1);
    curl_setopt($curl, CURLOPT_POSTFIELDS, $curlPost);
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;
}
Run the batch script
nohup: keeps the script running after the terminal is closed; with this command, the script's printed output is written to nohup.out by default
&: runs the script in the background
nohup php batchProcessing.php &
Get the results
After the script finishes, the data processed by the Baidu semantic analysis interface can be queried in MySQL. The result is shown in the following figure:
Export data
With Navicat, she can easily export the MySQL results to Excel.
Conclusion
Programming really is useful. For those of us who write code every day, that may no longer feel remarkable.
But for someone who is not a computer major, this batch semantic analysis system was a great help.
In her own words, she no longer has to lose her hair over it.
Using programming to solve real-life problems in the simplest and most efficient way: that is the value of an engineer.
Finally
Thanks for reading. Likes, favorites, and follows are welcome!