Sentiment analysis examines subjective text that carries emotional coloring (positive or negative) to determine the opinions, preferences, and emotional tendencies it expresses. This article builds a model on hotel customer review data and uses it for prediction, demonstrating common operations in sentiment analysis: word segmentation, text vectorization, and modeling and prediction with logistic regression.

The link to the hotel review data set used is:

Raw.githubusercontent.com/SophonPlus/…

Each record contains a review and a label that indicates the sentiment. The label takes only two values: 1 for like and 0 for dislike. The following figure shows four records:

Next, we use Alink for analysis and modeling.

Python version Alink analysis example

Read the URL data using CsvSourceBatchOp as follows:

source = CsvSourceBatchOp()\
    .setFilePath('https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv')\
    .setSchemaStr('label long, review string')\
    .setIgnoreFirstLine(True)

The schema names the two columns label and review, with integer and string types respectively. Because the first line of the CSV file holds the column names, it must be skipped when reading the data.
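To make the effect of the schema string and the ignore-first-line setting concrete, here is a minimal plain-Python sketch that does not use Alink. The sample rows and the read_reviews helper are hypothetical and exist only for illustration; the sketch assumes the file has exactly two columns in the order label, review.

```python
import csv
import io

# Hypothetical sample in the same layout as the data set:
# a header line followed by "label,review" records.
SAMPLE_CSV = (
    "label,review\n"
    "1,Great location and friendly staff\n"
    "0,No hot water in the shower\n"
)


def read_reviews(text, ignore_first_line=True):
    """Parse CSV text into (label, review) tuples typed as (int, str)."""
    reader = csv.reader(io.StringIO(text))
    if ignore_first_line:
        next(reader, None)  # skip the header row that holds the column names
    rows = []
    for label, review in reader:
        rows.append((int(label), review))  # apply the "long, string" typing
    return rows


rows = read_reviews(SAMPLE_CSV)
print(rows)
```

If the header row is not skipped, the conversion of the text "label" to an integer fails, which is why the first line has to be ignored when the schema declares the column as an integer.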

Next, we print 5 records to check that the data source is read correctly:

source.firstN(5).print()

The results are as follows:

We then set up the Pipeline to encapsulate the entire processing and modeling process as follows:

pipeline = Pipeline(
    Imputer().setSelectedCols(["review"]).setOutputCols(["featureText"]).setStrategy("value").setFillValue("null"),
    Segment().setSelectedCol("featureText"),
    StopWordsRemover().setSelectedCol("featureText"),
    DocCountVectorizer().setFeatureType("TF").setSelectedCol("featureText").setOutputCol("featureVector"),
    LogisticRegression().setVectorCol("featureVector").setLabelCol("label").setPredictionCol("pred")
)

The role of each algorithm component is as follows:

  1. Imputer: Fills missing values in the review column with the string value "null" and writes the result to the featureText column.

  2. Segment: Breaks the original sentence into words separated by spaces. Since no output column is specified, the segmentation result directly replaces the value of the input column.

  3. StopWordsRemover: Removes stop words from the word segmentation result.

  4. DocCountVectorizer: Collects the words appearing in the "featureText" column and maps each sentence to a vector whose length is the number of distinct words, with entries given by the computed TF values. The vector is stored in the "featureVector" column.

  5. LogisticRegression: Trains a logistic regression classification model. Classification predictions are placed in the "pred" column.
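The first four steps can be sketched in plain Python without Alink. This is an illustration under stated assumptions, not Alink's implementation: the stop-word list is hypothetical, whitespace splitting stands in for real Chinese word segmentation (which needs a dictionary-based tokenizer), and TF is assumed to mean a word's count divided by the number of words in the document.

```python
from collections import Counter

# Hypothetical stop-word list used only for this illustration.
STOP_WORDS = {"the", "is", "a", "and"}


def impute(review, fill_value="null"):
    """Imputer stand-in: replace a missing review with a fill string."""
    return fill_value if review is None else review


def segment(text):
    """Segment stand-in: lowercase and join tokens with single spaces."""
    return " ".join(text.lower().split())


def remove_stop_words(segmented):
    """Drop stop words from a space-separated token string."""
    return " ".join(w for w in segmented.split() if w not in STOP_WORDS)


def build_vocabulary(docs):
    """Give every distinct word an index, in sorted order."""
    words = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}


def tf_vector(doc, vocab):
    """Assumed TF definition: word count divided by document length."""
    tokens = doc.split()
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:
            vec[vocab[word]] = count / len(tokens)
    return vec


reviews = ["The room is clean and the staff is friendly", None]
docs = [remove_stop_words(segment(impute(r))) for r in reviews]
vocab = build_vocabulary(docs)
vectors = [tf_vector(d, vocab) for d in docs]
print(docs)
print(vectors)
```

Each vector has one entry per distinct word in the corpus, so its length grows with the vocabulary rather than with the length of an individual review.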

Now we can move on to model training. Calling the Pipeline's fit() method returns a PipelineModel, stored here in the variable model:

model = pipeline.fit(source)
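The relationship between a pipeline and the model that fit() returns can be sketched in plain Python. This only illustrates the general fit/transform pattern and is not Alink's implementation; every class name below is hypothetical.

```python
class Stage:
    """A stage: fit() learns state from data, transform() applies it."""

    def fit(self, rows):
        return self  # stateless stages have nothing to learn

    def transform(self, rows):
        raise NotImplementedError


class Lowercase(Stage):
    """Stateless stage that lowercases every row."""

    def transform(self, rows):
        return [r.lower() for r in rows]


class VocabularyIndexer(Stage):
    """Learns a word-to-index map in fit(), then encodes rows with it."""

    def fit(self, rows):
        words = sorted({w for r in rows for w in r.split()})
        self.vocab = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, rows):
        # Words not seen during fit() are dropped.
        return [[self.vocab[w] for w in r.split() if w in self.vocab]
                for r in rows]


class SimplePipelineModel:
    """Holds fitted stages and applies them in order."""

    def __init__(self, fitted):
        self.fitted = fitted

    def transform(self, rows):
        for stage in self.fitted:
            rows = stage.transform(rows)
        return rows


class SimplePipeline:
    """Fits each stage on the output of the previous one."""

    def __init__(self, *stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)  # feed the next stage
            fitted.append(model)
        return SimplePipelineModel(fitted)


demo_model = SimplePipeline(Lowercase(), VocabularyIndexer()).fit(
    ["Good room", "Bad room"])
print(demo_model.transform(["Good bad unknown"]))
```

The point of the pattern is that everything learned during training, such as the vocabulary here, is captured in the returned model, so the same processing can later be applied to new data with a single transform() call.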

The model can then make predictions on batch or streaming data through its transform() method.

model.transform(source).select("pred", "label", "review").firstN(10).print()

The running results are as follows:

Java version Alink analysis example

First, we need a Java project for Alink with the relevant environment configured. The easiest way is to use the Alink examples project: download the Alink code from Git and open the project in a Java IDE, as shown below. Three examples are already provided: ALSExample, GBDTExample, and KMeansExample.

Let's create a new Java file under the com.alibaba.alink package:

package com.alibaba.alink;

public class SentimentHotelSimpleExample {

  public static void main(String[] args) throws Exception {

  }
  
}

Read the URL data using CsvSourceBatchOp as follows. The schema names the two columns label and review, with integer and string types respectively. Because the first line of the CSV file holds the column names, it must be skipped when reading the data.

CsvSourceBatchOp source = new CsvSourceBatchOp()
  .setFilePath("https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets"
    + "/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv")
  .setSchemaStr("label int, review string")
  .setIgnoreFirstLine(true);

source.firstN(5).print();

The last line of code prints 5 records; the result is as follows:

label | review
----- | ------
0 | ctrip order stated: "the room features: A 1.3 * 2, the other a 1.1 * 2 m, not all rooms have free broadband "extra bed, actually see the room after check-in is the minimum standard, I have seen two same size bed, on the opposite side of the bed against the wall, no extra space, nor even the luggage, not to mention what cabinet, and put forward with the order hotel rooms, They said the air conditioner in the room we booked was broken and we had to change it to this one for the same price. I don't know if this is the hotel's fraud or Ctrip's fraud, I want to be compensated. I have actual photos of the room if you need proof. In addition, it is recommended that we do not travel out of Taishan, almost half of the hotel will be irregular, without warning power failure.
0 | bathing unexpectedly has no hot water!!!!! Too depressed ~~ the next day to climb the mountain!! But speed can also be ~
0 | I request in writing to a quiet room.. Who knows that day to live in the 6th floor, outside the wind, sad blowing, loud voice, an hour to fall asleep. Ask the hotel to change rooms. There were no trees around the hotel and every room was loud, they said. Herr, is there a reason? Guess next time, we won't be able to stay when the wind blows.
0 | was for a long time before he remembered evaluation, remember to near the railway station is super, but convenient at the same time will feel more noisy. There are many Korean and Japanese tour groups staying here, but the reception service is cold. Two people living in a standard room, only given a room card, but also very provocative look at me. I'm not in the mood. Hotel feedback July 17, 2008: in view of the problems raised by the guests, the hotel has been seriously corrected, we hope that every one of you staying in The Bohai Pearl Hotel can be happy to stay, satisfied and return home.
0 | hotel near the railway and fire at night

We then set up the Pipeline to encapsulate the entire processing and modeling process as follows:

  Pipeline pipeline = new Pipeline(
    new Imputer()
      .setSelectedCols("review")
      .setOutputCols("featureText")
      .setStrategy("value")
      .setFillValue("null"),
    new Segment()
      .setSelectedCol("featureText"),
    new StopWordsRemover()
      .setSelectedCol("featureText"),
    new DocCountVectorizer()
      .setFeatureType("TF")
      .setSelectedCol("featureText")
      .setOutputCol("featureVector"),
    new LogisticRegression()
      .setVectorCol("featureVector")
      .setLabelCol("label")
      .setPredictionCol("pred")
  );

The role of each algorithm component is as follows:

  1. Imputer: Fills missing values in the review column with the string value "null" and writes the result to the featureText column.

  2. Segment: Breaks the original sentence into words separated by spaces. Since no output column is specified, the segmentation result directly replaces the value of the input column.

  3. StopWordsRemover: Removes stop words from the word segmentation result.

  4. DocCountVectorizer: Collects the words appearing in the "featureText" column and maps each sentence to a vector whose length is the number of distinct words, with entries given by the computed TF values. The vector is stored in the "featureVector" column.

  5. LogisticRegression: Trains a logistic regression classification model. Classification predictions are placed in the "pred" column.
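To show how the final step turns a feature vector into a 0/1 prediction, here is a short language-neutral illustration written in Python rather than Java. It is a sketch of the standard logistic regression decision rule, not Alink's code; the three-word vocabulary, the weights, and the bias are hypothetical values chosen by hand rather than learned from the hotel data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return (label, probability of the positive class)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return (1 if p >= 0.5 else 0), p


# Hypothetical weights for the vocabulary ["clean", "noisy", "friendly"].
weights = [2.0, -3.0, 1.5]
bias = 0.0

# TF vectors for the token lists ["clean", "friendly"] and ["noisy"].
positive_label, positive_prob = predict(weights, bias, [0.5, 0.0, 0.5])
negative_label, negative_prob = predict(weights, bias, [0.0, 1.0, 0.0])
print(positive_label, negative_label)  # prints "1 0"
```

Words with positive weights push the probability above 0.5 and toward label 1, while words with negative weights push it toward label 0; training consists of finding weights that make these predictions agree with the labels.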

Now we can move on to model training. Calling the Pipeline's fit() method returns a PipelineModel, stored here in the variable model:

  PipelineModel model = pipeline.fit(source);

The model can then make predictions on batch or streaming data through its transform() method.

  model.transform(source)
    .select(new String[]{"pred","label","review"})
    .firstN(10)
    .print();

The running results are as follows:

pred | label | review
---- | ----- | ------
1 | 1 | hotel service is really good, room is very clean and tidy. You can see the sea from the balcony, there are a lot of parked ships on the sea, quite a feeling. The breakfast buffet is very good. It's quite varied.
1 | 1 | washed the clothes hanging in the bathroom, wait for going out, has been put on AIRS hang on the balcony! Satisfied service attitude first-class, see all is smiling face! Happy room is very clean although the facilities are old! Peace of mind seascape is good! Great
 | 1 | very good geographical location, is a deluxe seaview room, open the window can see pier and seascape. I remember it was there a long time ago. It's been renovated. Overall satisfied, later will live
 | 1 | I also by comparing the user reviews the ctrip choose sea view, is the first time to live. The general feeling is very good, the room hardware is general but very clean, every day fruit, candy gift, the first day also sent dolls, children like it very much. In particular, the service of the hotel is very considerate, and the price of catering and chartered car is very reasonable. We had two dinners in the hotel and were very satisfied, with the average consumption of about 30 yuan (including drinks). With the friends are very satisfied, next time to Weihai will live seascape. The staff here are all polite and courteous, which is really valuable for a state-owned enterprise. For example, as soon as I sat down on the sofa in the lobby after breakfast, the waiter immediately brought me tea and a towel. We play cards in the public rest area on the second floor, the waiter see the sky will be dark, take the initiative to open the light......
 | 1 | again at seaview garden hotel is still feeling kind, hotel staff enthusiasm, service is in place, I stay in weihai several other four-star hotel compared with it, the service is not a class, seascape garden is really good, next time I will check in. I want to bring, hotel guest room air conditioning not refrigeration, services cannot be in these places have defects, hope to get improved.
 | 1 | very good service, although the 4 star hotel, but not as poor as 5 star service, the entire hotel service is very warm, you can see everyone can take the initiative to say hello to you, the hotel also did very well on the details of the services, such as: When you sit down in the lobby, the waiter will bring you tea and towel immediately, which are all free of charge. When you get out of the taxi, the attendant immediately hands you a card with the taxi number on it, in case you need to find it (weihai's taxi doesn't have a number on its invoice). If I had to look for some of the hotel's downfalls, it would be that the restaurant is slow to serve and sichuan food is not ordered because the chef is not very good at it. Anyway, next time I go to Weihai, I will check in at this hotel.
1 | 1 | really good! The location is better, the sea view room is opposite the trestle and Xiaoqingdao, do not go out of the door, do not crowd in the tourist crowd, you can enjoy the pleasant sea view; Breakfast is rich and the environment is good, relative to its price, cost-effective; The simple meal at the coffee bar on the first floor is inexpensive, tasty and full (especially sandwiches and hot dogs). It is a good choice for residents and non-resident tourists to have a simple meal or snack. The dessert in the hotel corridor is more popular, variety, high quality, and added in time, loved by children; The bed is a little small. Overall, the overall feeling is good, recommended stay.
 | 1 | location convenient, cheap price, good service, beautiful environment! Thank you very much!
 | 1 | nice hotel, every time go to yancheng to live here, cheap quantity foot again!
1 | 1 | is good hotel, dining and travel is very convenient. The service attitude is also better.

Appendix

The complete Java code is as follows:

package com.alibaba.alink;

import com.alibaba.alink.operator.batch.source.CsvSourceBatchOp;
import com.alibaba.alink.pipeline.Pipeline;
import com.alibaba.alink.pipeline.PipelineModel;
import com.alibaba.alink.pipeline.classification.LogisticRegression;
import com.alibaba.alink.pipeline.dataproc.Imputer;
import com.alibaba.alink.pipeline.nlp.DocCountVectorizer;
import com.alibaba.alink.pipeline.nlp.Segment;
import com.alibaba.alink.pipeline.nlp.StopWordsRemover;

public class SentimentHotelSimpleExample {

  public static void main(String[] args) throws Exception {

    CsvSourceBatchOp source = new CsvSourceBatchOp()
      .setFilePath("https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets"
        + "/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv")
      .setSchemaStr("label int, review string")
      .setIgnoreFirstLine(true);
    //source.firstN(5).print();

    Pipeline pipeline = new Pipeline(
      new Imputer()
        .setSelectedCols("review")
        .setOutputCols("featureText")
        .setStrategy("value")
        .setFillValue("null"),
      new Segment()
        .setSelectedCol("featureText"),
      new StopWordsRemover()
        .setSelectedCol("featureText"),
      new DocCountVectorizer()
        .setFeatureType("TF")
        .setSelectedCol("featureText")
        .setOutputCol("featureVector"),
      new LogisticRegression()
        .setVectorCol("featureVector")
        .setLabelCol("label")
        .setPredictionCol("pred")
    );

    PipelineModel model = pipeline.fit(source);

    model.transform(source)
      .select(new String[] {"pred", "label", "review"})
      .firstN(10)
      .print();
  }
}