Original link: tecdat.cn/?p=12203


Introduction

Everyone likes to save money. We all try to make the most of our money, and sometimes the simplest things make the biggest difference. People have long brought coupons to the supermarket to get discounts, but using them has never been easier, thanks to Groupon.

Groupon is a coupon recommendation service that broadcasts electronic coupons for restaurants and stores near you. Some of these coupons can offer substantial savings, especially when planning group activities, since discounts can be as high as 60%.

 

Data

The data was scraped from Groupon's New York City area. The site is organized into a listing page of all the different Groupons, followed by a detail page for each specific Groupon. The site looks like this:

 

 

Neither page layout is dynamic, so a custom Scrapy spider was built to quickly navigate all the pages and retrieve the information to be analyzed. However, the comments, an important piece of information, are rendered and loaded via JavaScript, so a Selenium script takes the Groupon URLs collected by Scrapy and essentially mimics a human clicking the "Next" button in the user comment section.

for url in url_list.url[0:50]:
    try:
        driver.get(url)
        time.sleep(2)
        # Close any pop-up window
        try:
            close = driver.find_element_by_xpath('//a[@id="nothx"]')
            close.click()
        except:
            pass
        time.sleep(1)
        # Open the full list of user tips (reviews)
        try:
            link = driver.find_element_by_xpath('//div[@id="all-tips-link"]')
            driver.execute_script("arguments[0].click();", link)
            time.sleep(2)
        except:
            continue
        i = 1
        print(url)
        # Page through the reviews until the "Next" button disappears
        while True:
            try:
                time.sleep(2)
                print("Scraping Page: " + str(i))
                reviews = driver.find_elements_by_xpath('//div[@class="tip-item classic-tip"]')
                next_bt = driver.find_element_by_link_text('Next')
                for review in reviews[3:]:
                    review_dict = {}
                    content = review.find_element_by_xpath('.//div[@class="twelve columns tip-text ugc-ellipsisable-tip ellipsis"]').text
                    author = review.find_element_by_xpath('.//div[@class="user-text"]/span[@class="tips-reviewer-name"]').text
                    date = review.find_element_by_xpath('.//div[@class="user-text"]/span[@class="reviewer-reviewed-date"]').text
                    review_dict['author'] = author
                    review_dict['date'] = date
                    review_dict['content'] = content
                    review_dict['url'] = url
                    writer.writerow(review_dict.values())
                i += 1
                next_bt.click()
            except:
                break
    except:
        continue

csv_file.close()
driver.close()

The data retrieved from each Groupon is shown below:

title
categories
deal features
location
total rating
url

There were about 89,000 user reviews. The data retrieved from each comment is shown below:

author
date
comment text
url

A few reviews turned out to have missing (NaN) content; the check below locates them:

# Find reviews whose content failed to scrape (a missing value is parsed as a float NaN)
print(all_groupon_reviews[all_groupon_reviews.content.apply(lambda x: isinstance(x, float))])
indx = [10096]
all_groupon_reviews.content.iloc[indx]
            author       date content  \
10096  Patricia D. 2017-02-15     NaN   
15846       Pat H. 2016-09-24     NaN   
19595      Tova F. 2012-12-20     NaN   
40328   Phyllis H. 2015-06-28     NaN   
80140     Andre A. 2013-03-26     NaN   

                                                 url  year  month  day  
10096  https://www.groupon.com/deals/statler-grill-9  2017      2   15  
15846         https://www.groupon.com/deals/impark-3  2016      9   24  
19595   https://www.groupon.com/deals/hair-bar-nyc-1  2012     12   20  
40328     https://www.groupon.com/deals/kumo-sushi-1  2015      6   28  
80140  https://www.groupon.com/deals/woodburybus-com  2013      3   26  

Exploratory data analysis

One interesting finding is that Groupon usage has grown considerably over the last few years, which we discovered by examining the dates attached to the reviews. This is apparent in the image below, where the X-axis represents month/year and the Y-axis represents the review count. The slight decline at the end is likely because some Groupons at the time were seasonal.
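Below is a minimal sketch of how such a count plot can be produced, assuming the 'year' and 'month' columns shown earlier; the exact styling of the original figure is not known:

import matplotlib.pyplot as plt

# Count reviews per (year, month) and plot the trend over time
monthly_counts = all_groupon_reviews.groupby(['year', 'month']).size()
monthly_counts.plot(kind='line')
plt.xlabel('Month/Year')
plt.ylabel('Count')
plt.show()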

 

 


The breakdown of Groupons by category was visualized with a pie chart:

pie_chart_df = groupons.groupby('categories').agg('count')
plt.rcParams['figure.figsize'] = (8, 8)
sizes = list(pie_chart_df.mini_info)
labels = pie_chart_df.index
plt.pie(sizes, shadow=True, labels=labels, autopct='%1.1f%%', startangle=140)
# plt.legend(labels, loc="best")
plt.axis('equal')

 

Finally, since most of the price data appears in the text as "Price (original price)", a regular expression was derived to parse out the price information, as well as the number of deals each merchant offered. This information is displayed in the following bar chart:
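Below is a minimal sketch of such a parse, assuming the raw text takes a form like '$59 ($120)'; the string and variable names here are hypothetical, not from the original:

import re

raw_price = '$59 ($120)'
# Capture the deal price and the original price in parentheses
match = re.search(r'\$(\d+(?:\.\d+)?)\s*\(\$(\d+(?:\.\d+)?)\)', raw_price)
if match:
    price = float(match.group(1))
    original = float(match.group(2))
    savings = 1 - price / original  # fractional discount, ~0.51 here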


# Bar chart of how many Groupons offer each discount tier
objects = list(offer_counts.keys())
y = list(offer_counts.values())
tst = np.arange(len(y))

plt.bar(tst, y, align='center')
plt.xticks(tst, objects)
plt.ylabel('Total Number of Groupons')
plt.xlabel('Different Discounts Offers')
plt.show()

 


The number of offerings can also be broken down by category and by the number of different discounts each merchant offers:

# p0 ... p10 are the handles returned by earlier plt.bar() calls, one per
# discount count; that plotting code is not shown in the original
plt.ylabel('Number of Offerings')
plt.xticks(ind, ('Auto', 'Beauty', 'Food', 'Health', 'Home', 'Personal', 'Things'))
plt.xlabel('Category of Groupon')
plt.legend((p0[0], p1[0], p2[0], p3[0], p4[0], p5[0], p6[0], p7[0], p10[0]),
           ('0', '1', '2', '3', '4', '5', '6', '7', '10'))

 

 

A violin plot shows the distribution of the parsed savings:

import seaborn as sns

sns.violinplot(data=savings_dataframe)

 

Finally, the user review data was used to generate a word cloud:

from wordcloud import WordCloud

plt.rcParams['figure.figsize'] = (20, 20)
wordcloud = WordCloud(width=4000, height=2000, max_words=150,
                      background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')


 

Topic modeling

The two most important packages used for topic modeling were Gensim and spaCy. The first step in creating the corpus is to remove all stop words, such as "the" and "a". Finally, trigrams are created.
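Below is a minimal sketch of that preprocessing, assuming the raw reviews are in a list of strings called texts; the variable names and thresholds are illustrative:

import spacy
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models.phrases import Phrases, Phraser

# Tokenize each review and drop stop words
tokenized = [[w for w in simple_preprocess(doc) if w not in STOPWORDS]
             for doc in texts]

# Build bigram, then trigram, phrase models over the token lists
bigram = Phraser(Phrases(tokenized, min_count=5, threshold=100))
trigram = Phraser(Phrases(bigram[tokenized], threshold=100))
docs = [trigram[bigram[doc]] for doc in tokenized]

# Lemmatize with spaCy, keeping only content words
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
docs = [[tok.lemma_ for tok in nlp(' '.join(doc))
         if tok.pos_ in ('NOUN', 'ADJ', 'VERB', 'ADV')]
        for doc in docs]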

The model of choice was Latent Dirichlet Allocation (LDA), both because of its ability to distinguish topics across different documents and because there is a package that visualizes its results clearly and effectively. Since the method is unsupervised, the number of topics must be chosen in advance; across 25 consecutive runs of the model, the optimal number was 3. The results are as follows:
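Below is a minimal sketch of the LDA fit and its visualization, assuming the preprocessed token lists from above and a recent version of pyLDAvis (where the gensim bridge is pyLDAvis.gensim_models); the parameters are illustrative:

from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Map tokens to ids and build the bag-of-words corpus
id2word = Dictionary(docs)
corpus = [id2word.doc2bow(doc) for doc in docs]

# Fit LDA with the chosen number of topics
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=3,
               random_state=42, passes=10)

# Interactive visualization of the topics
vis = gensimvis.prepare(lda, corpus, id2word)
pyLDAvis.save_html(vis, 'lda_topics.html')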

 

The visualization above projects the topics onto two components, so that similar topics appear closer together and dissimilar topics farther apart. The words on the right are the words that make up each topic, and the lambda parameter controls how exclusive those words are: a lambda of 0 shows the words most exclusive to each topic, while a lambda of 1 shows the words that appear most frequently within each topic.

The first topic represents the quality of service and reception. The second topic has words that describe exercise and physical activity. Finally, the third topic has words that belong to the food category.

Conclusion

Topic modeling is a form of unsupervised learning, and the scope of this project was to briefly examine its ability to discover the patterns behind the underlying words. While we like to think that our reviews of certain products and services are unique, the model makes it clear that, in fact, certain words are used across the entire population.