Original link: tecdat.cn/?p=12203
Introduction
Everyone likes to save money. We all try to make the most of what we have, and sometimes the simplest things make the biggest difference. Coupons have long been carried to supermarkets for discounts, but thanks to Groupon, using them has never been easier.
Groupon is a coupon recommendation service that broadcasts electronic coupons for restaurants and stores near you. Some of these coupons can be very significant, especially when planning group activities, since discounts can reach 60%.
Data
The data was taken from Groupon's New York City area. The site is laid out as a gallery-style search page listing all the different Groupons, followed by a detail page for each specific deal. The site looks like this:
Neither page layout is dynamic, so a custom Scrapy spider was built to quickly navigate all the pages and retrieve the information to analyze. However, the comments, an important piece of information, are rendered and loaded via JavaScript. A Selenium script therefore takes the Groupon URLs collected by Scrapy and essentially mimics a human clicking the "Next" button in the user comment section.
# Assumes `driver` (a Selenium WebDriver), `url_list` (the Groupon URLs
# collected by Scrapy), and `writer`/`csv_file` (a csv writer and its file)
# were initialized earlier.
for url in url_list.url[0:50]:
    try:
        driver.get(url)
        time.sleep(2)
        # Close any pop-up window
        # if(driver.switch_to_alert()):
        try:
            close = driver.find_element_by_xpath('//a[@id="nothx"]')
            close.click()
        except:
            pass
        time.sleep(1)
        # Expand the full review section
        try:
            link = driver.find_element_by_xpath('//div[@id="all-tips-link"]')
            driver.execute_script("arguments[0].click();", link)
            time.sleep(2)
        except:
            continue
        i = 1
        print(url)
        # Page through the reviews until there is no "Next" button left
        while True:
            try:
                time.sleep(2)
                print("Scraping Page: " + str(i))
                reviews = driver.find_elements_by_xpath('//div[@class="tip-item classic-tip"]')
                next_bt = driver.find_element_by_link_text('Next')
                for review in reviews[3:]:
                    review_dict = {}
                    content = review.find_element_by_xpath('.//div[@class="twelve columns tip-text ugc-ellipsisable-tip ellipsis"]').text
                    author = review.find_element_by_xpath('.//div[@class="user-text"]/span[@class="tips-reviewer-name"]').text
                    date = review.find_element_by_xpath('.//div[@class="user-text"]/span[@class="reviewer-reviewed-date"]').text
                    review_dict['author'] = author
                    review_dict['date'] = date
                    review_dict['content'] = content
                    review_dict['url'] = url
                    writer.writerow(review_dict.values())
                i += 1
                next_bt.click()
            except:
                break
    except:
        continue
csv_file.close()
driver.close()
The data retrieved from each Groupon is shown below:

- company title
- category information
- location
- deal features
- total rating
- website (URL)

There were about 89,000 user reviews. The data retrieved from each review is shown below:

- author
- date
- comment text
- website (URL)
# Reviews whose text failed to load come through as float NaN; find them
print(all_groupon_reviews[all_groupon_reviews.content.apply(lambda x: isinstance(x, float))])
indx = [10096]
all_groupon_reviews.content.iloc[indx]
author date content \
10096 Patricia D. 2017-02-15 NaN
15846 Pat H. 2016-09-24 NaN
19595 Tova F. 2012-12-20 NaN
40328 Phyllis H. 2015-06-28 NaN
80140 Andre A. 2013-03-26 NaN
url year month day
10096 https://www.groupon.com/deals/statler-grill-9 2017 2 15
15846 https://www.groupon.com/deals/impark-3 2016 9 24
19595 https://www.groupon.com/deals/hair-bar-nyc-1 2012 12 20
40328 https://www.groupon.com/deals/kumo-sushi-1 2015 6 28
80140 https://www.groupon.com/deals/woodburybus-com 2013 3 26
Exploratory Data Analysis
One interesting finding is that Groupon usage has grown considerably over the past few years, which we discovered by examining the dates attached to the reviews. The trend is apparent in the plot below, where the X-axis represents month/year and the Y-axis represents review counts. The slight decline at the end is likely because some Groupons at the time were seasonal.
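The article does not show the plotting code for this chart; the following is a minimal sketch, assuming the `all_groupon_reviews` frame with the parsed `year`/`month` columns shown earlier:

import matplotlib.pyplot as plt

# Count reviews per month/year and plot the trend over time
monthly_counts = all_groupon_reviews.groupby(['year', 'month']).size()
monthly_counts.plot(kind='bar', figsize=(16, 6))
plt.xlabel('Month/Year')
plt.ylabel('Number of Reviews')
plt.show()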
# Pie chart of deal counts by category
pie_chart_df = groupons.groupby('categories').agg('count')
plt.rcParams['figure.figsize'] = (8, 8)
sizes = list(pie_chart_df.mini_info)
labels = pie_chart_df.index
plt.pie(sizes, shadow=True, labels=labels, autopct='%1.1f%%', startangle=140)
# plt.legend(labels, loc="best")
plt.axis('equal')
Finally, since most deals describe their discount through text of the form "price: original price", a regular expression was derived to parse out the price information, as well as the number of offers in each deal; a sketch of such a pattern appears below, followed by a bar chart of the offer counts:
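The article does not show its exact pattern, so this is a minimal sketch of how the two dollar amounts could be pulled out of a deal's text; the example string and the `parse_prices` helper are hypothetical:

import re

# Match dollar amounts such as "$49" or "$120.50"
price_re = re.compile(r'\$(\d+(?:\.\d{2})?)')

def parse_prices(deal_text):
    # Return (discounted_price, original_price) when both are present
    prices = [float(p) for p in price_re.findall(deal_text)]
    if len(prices) >= 2:
        return prices[0], prices[1]
    return None

print(parse_prices("$49 for one 60-minute massage ($120 value)"))  # (49.0, 120.0)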
# Bar chart: number of Groupons per discount-offer count
objects = list(offer_counts.keys())
y = list(offer_counts.values())
tst = np.arange(len(y))
plt.bar(tst, y, align='center')
plt.xticks(tst, objects)
plt.ylabel('Total Number of Groupons')
plt.xlabel('Different Discount Offers')
plt.show()
# Label the stacked bar chart of discount counts per category
# (p0..p10 are the bar containers returned by earlier plt.bar calls)
plt.ylabel('Number of Offerings')
plt.xticks(ind, ('Auto', 'Beauty', 'Food', 'Health', 'Home', 'Personal', 'Things'))
plt.xlabel('Category of Groupon')
plt.legend((p0[0], p1[0], p2[0], p3[0], p4[0], p5[0], p6[0], p7[0], p10[0]),
           ('0', '1', '2', '3', '4', '5', '6', '7', '10'))
# Violin plot of the savings distribution (assumes seaborn imported as sns)
sns.violinplot(data=savings_dataframe)
Finally, the user review data was used to generate a word cloud:
from wordcloud import WordCloud

plt.rcParams['figure.figsize'] = (20, 20)
wordcloud = WordCloud(width=4000, height=2000, max_words=150,
                      background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
Topic Modeling
The two most important packages used for topic modeling are Gensim and spaCy. The first step in building the corpus is removing all stop words, such as "the" and "and". Finally, trigrams are created.
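The article does not include its preprocessing code; below is a minimal sketch of this pipeline using Gensim, assuming `reviews` is a list of raw review strings (parameter values are illustrative):

from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser
from gensim.parsing.preprocessing import STOPWORDS

# Tokenize each review and drop stop words
tokens = [[tok for tok in simple_preprocess(doc) if tok not in STOPWORDS]
          for doc in reviews]

# Two chained Phrases passes: frequent word pairs become bigrams,
# then frequent bigram+word pairs become trigrams
bigram = Phraser(Phrases(tokens, min_count=5, threshold=100))
trigram = Phraser(Phrases(bigram[tokens], threshold=100))
corpus_tokens = [trigram[bigram[doc]] for doc in tokens]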
The model of choice is Latent Dirichlet Allocation, because of its ability to distinguish topics across documents and the existence of a package that visualizes its results clearly and effectively. Since the method is unsupervised, the number of topics must be chosen in advance; across 25 consecutive runs of the model, the optimum was 3 topics. The results are as follows:
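A minimal sketch of the LDA step with Gensim, assuming the `corpus_tokens` produced by the preprocessing sketch above (pass count and random seed are illustrative):

from gensim import corpora
from gensim.models import LdaModel

# Map tokens to ids and build the bag-of-words corpus
dictionary = corpora.Dictionary(corpus_tokens)
bow_corpus = [dictionary.doc2bow(doc) for doc in corpus_tokens]

# Fit LDA with the article's optimal topic count of 3
lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary,
               passes=10, random_state=42)
for idx, topic in lda.print_topics(num_words=8):
    print(idx, topic)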
The visualization above projects the topics onto two components, so that similar topics sit closer together and dissimilar topics farther apart. The words on the right are the words that make up each topic, and the lambda parameter controls how exclusive those words are: a lambda near 0 surfaces the words most exclusive to each topic, while a lambda near 1 surfaces the words occurring most frequently within it.
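The interactive view described here (a two-component projection with a lambda relevance slider) matches what pyLDAvis produces, though the article does not name the package; a sketch assuming the model objects from the previous snippet (the module path is `pyLDAvis.gensim_models` in recent versions):

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Build the interactive topic visualization and save it to an HTML file
vis = gensimvis.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_vis.html')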
The first topic represents the quality of service and reception. The second topic has words that describe exercise and physical activity. Finally, the third topic has words that belong to the food category.
Conclusion
Topic modeling is a form of unsupervised learning, and the scope of this project was a brief look at its ability to discover the patterns underlying words. While we like to think our reviews of particular products and services are unique, the model makes it clear that certain words are, in fact, used across the entire population of reviewers.