Data extraction

In my previous article on the Scrapy crawler, we crawled a search for Python books and retrieved the titles, prices, comment counts, and links of 1,260 Python book items across 60+ pages, storing all the item data in a local .json file. The data is stored in the following format:


Product data retrieved by the crawler
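For reference, each record in getdata.json presumably looks something like the line below. The field names come from the code that follows; the values are invented for illustration. The code strips the first character of the price (evidently a currency symbol) and the last three characters of the comment count (evidently a unit suffix such as 条评论):

{"name": "Python编程从入门到实践", "price": "¥69.80", "comnum": "12345条评论", "link": "https://..."}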





import json
import numpy

# read the item data saved by the crawler
# (assuming the file holds one JSON array; adapt if it is one object per line)
f = open("D:/python/.../getdata.json", 'r', encoding='utf-8')
dic = json.load(f)
f.close()

tmp = ""
name, price, comnum, link = [], [], [], []
for i in range(0, 1260):
    # strip the leading currency symbol from the price string
    dic[i]['price'] = tmp + dic[i]['price'][1:]
    # strip the trailing three-character suffix from the comment count
    dic[i]['comnum'] = dic[i]['comnum'][:-3] + tmp
    price.append(float(dic[i]['price']))
    comnum.append(int(dic[i]['comnum']))
    name.append(dic[i]['name'])
    link.append(dic[i]['link'])
data = numpy.array([name, price, comnum, link]).T
print(data)

Here the data is cleaned and converted to a numpy.array, the core data structure of NumPy, Python's scientific computing library. Note that because the array mixes strings and numbers, NumPy stores every field as a string; that is fine here, since the array is only an intermediate step on the way to a CSV file. The output looks like this:




(screenshot: the array printed by print(data))


The data is then stored as a .csv file:

import csv

csvFile = open('D:/python/.../csvFile.csv', 'w')
writer = csv.writer(csvFile)
# header row
writer.writerow(['name', 'price', 'comnum', 'link'])
for i in range(0, 1260):
    writer.writerow(data[i])
csvFile.close()
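Since the cleaned data is already sitting in a NumPy array, the same file could also be written in one step with pandas. This is an equivalent alternative, not what the code above does:

import pandas
pandas.DataFrame(data, columns=['name', 'price', 'comnum', 'link']).to_csv('D:/python/.../csvFile.csv', index=False)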

You can now open the .csv file at that path; it is saved in the following format:




The CSV file



Data preprocessing

Missing value handling

The data in the .csv file is read with the pandas library and stored as a DataFrame. Many of the books in this dataset have 0 comments, so the first task is data cleaning: finding these missing values is an important part of any data analysis. There are two common ways to deal with missing data: fill the gaps, for example with the mean comment count, or delete the affected rows outright. Which one to use depends on the situation.

import pandas as pda

data = pda.read_csv("D:/python/.../csvfile.csv")
# treat a comment count of 0 as a missing value
data["comnum"][(data["comnum"] == 0)] = None
# option 1: fill the missing values with the mean comment count
#data.fillna(value=data["comnum"].mean(), inplace=True)
# option 2: drop the rows whose comment count is missing
data1 = data.dropna(axis=0, subset=["comnum"])

There is too much missing data, so deletion is adopted here.
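That judgment is easier to defend with a quick count of how many values are actually missing. A standard pandas check, added here for illustration:

# after the zeros were replaced with None above
print(data["comnum"].isnull().sum())  # number of missing comment counts
print(len(data))                      # total number of records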

Outlier handling

When handling outliers, the first step is to find them. A scatter plot of the data is a good way to observe its distribution; here we draw it with Python's data visualization library Matplotlib.

import matplotlib.pyplot as plt

# scatter plot (x axis: price, y axis: comment count)
fig = plt.figure(figsize=(10, 6))
plt.plot(data1['price'], data1['comnum'], "o")
plt.xlabel('price')
plt.ylabel('comnum')
plt.show()


Price – comment count scatter plot


Some points have a suspiciously high number of comments, perhaps because they are best-sellers or have padded reviews, and some prices are far too high, reaching as much as ¥700 even though ordinary books rarely cost more than ¥150. Such outliers are generally excluded from data analysis; we can simply delete or adjust them. First, let's look at box plots of the data to observe the distribution:

fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
ax1.boxplot(data1['price'].values)
ax1.set_xlabel('price')
ax2.boxplot(data1['comnum'].values)
ax2.set_xlabel('comnum')
# set the y-axis ranges
ax1.set_ylim(0, 150)
ax2.set_ylim(0, 1000)
plt.show()


Box plots
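Before interpreting the plot, the quartiles can also be computed directly rather than read off the chart. A quick verification using pandas:

# lower quartile, median, upper quartile of the price
print(data1['price'].quantile([0.25, 0.5, 0.75]))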


In the price box plot, the yellow line marks the median, around ¥50; the lower and upper quartiles sit at roughly ¥40 and ¥70, and the whiskers reach about ¥0 and ¥110. The comment-count distribution, by contrast, is concentrated at low values. The outliers clearly deviate from the other observations and would skew the analysis, so we drop the records priced above ¥120 or with more than 700 comments. The code is as follows:

# delete records priced above ¥120 or with more than 700 comments
data2 = data1[data1['price'] < 120]
data3 = data2[data2['comnum'] < 700]
# scatter plot of the cleaned data
plt.figure(figsize=(10, 6))
plt.plot(data3['price'], data3['comnum'], "o")
plt.xlabel('price')
plt.ylabel('comnum')
plt.show()
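Deletion is what we settle on here, but the "adjust them" option mentioned above is also easy: capping the extreme values with pandas' clip keeps every record. A sketch, assuming capping at the same thresholds is acceptable:

# cap extreme values instead of dropping the rows
data2c = data1.copy()
data2c['price'] = data2c['price'].clip(upper=120)
data2c['comnum'] = data2c['comnum'].clip(upper=700)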

After this processing about 500 records remain; the distribution of the new data looks like this:




Price – comment count scatter plot

Data visualization analysis

Histogram visualization analysis

Now we can analyze the data visually. Histograms of the price and of the comment count show how each is distributed.

# rebuild the column arrays in the same order as the earlier numpy array
# (0 = name, 1 = price, 2 = comnum, 3 = link); da2 was not defined above
da2 = [data3['name'].values, data3['price'].values,
       data3['comnum'].values, data3['link'].values]

pricemax = da2[1].max()
pricemin = da2[1].min()
commentmax = da2[2].max()
commentmin = da2[2].min()
# compute the ranges
pricerg = pricemax - pricemin
commentrg = commentmax - commentmin
# class width = range / number of bins
pricedst = pricerg / 13
commentdst = commentrg / 13

fig = plt.figure(figsize=(12, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
# price histogram
pricesty = numpy.arange(pricemin, pricemax, pricedst)
ax1.hist(da2[1], pricesty, rwidth=0.8)
ax1.set_title('price')
# comment-count histogram
commentsty = numpy.arange(commentmin, commentmax, commentdst)
ax2.hist(da2[2], commentsty, rwidth=0.8)
ax2.set_title('comnum')
plt.show()


Histograms


Two things stand out in the histograms:

1. Book prices are roughly normally distributed, with the largest group around ¥40, suggesting that Python books are mostly priced near ¥40.
2. Books with fewer than 50 comments are by far the most numerous (about 200 titles), and the count falls off steadily as the number of comments grows, suggesting that most titles sell modestly and the strong sellers are a handful of classics.
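The "roughly normal" reading of the price histogram can be sanity-checked with summary statistics; a quick look rather than a formal normality test:

# if the mean and the median (50%) are close, the distribution is at least symmetric
print(data3['price'].describe())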

K-means clustering visualization analysis

Finally, we analyze the data by clustering, using the k-means algorithm from machine learning. K-means is an unsupervised learning algorithm: it is simple, fast, and well suited to regular datasets. It works in three steps: 1. choose k initial cluster centers; 2. compute the distance from each sample to every center and assign the sample to the nearest one; 3. recompute each cluster center from its members and repeat until the centers stop changing. With this algorithm we can classify the products:

from sklearn.cluster import KMeans

# sample matrix: one row per book, columns = price, comnum
tmp = numpy.array([data3['price'], data3['comnum']]).T
# cluster into 3 classes
kms = KMeans(n_clusters=3)
y = kms.fit_predict(tmp)
# plot each cluster with its own colour and marker
plt.figure(figsize=(10, 6))
plt.xlabel('price')
plt.ylabel('comnum')
for i in range(0, len(y)):
    if y[i] == 0:
        plt.plot(tmp[i, 0], tmp[i, 1], "*r")
    elif y[i] == 1:
        plt.plot(tmp[i, 0], tmp[i, 1], "sy")
    elif y[i] == 2:
        plt.plot(tmp[i, 0], tmp[i, 1], "pb")
plt.show()


Clustering distribution diagram


The clustering result shows that k-means splits the books into three groups: those with fewer than about 100 comments, those with roughly 100 to 350 comments, and those with more than about 350, in effect grading the books by popularity. The separation between the clusters is clearly visible in the figure.
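For intuition about the three steps listed earlier, here is a toy NumPy version of k-means. This is an illustrative sketch only; the analysis above uses scikit-learn's implementation, which handles initialization and edge cases far more carefully:

import numpy

def toy_kmeans(points, k, iters=10):
    # step 1: choose k initial centers (here simply the first k points)
    centers = points[:k].astype(float).copy()
    for _ in range(iters):
        # step 2: assign each sample to its nearest center
        dists = numpy.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each center as the mean of its members
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# e.g. labels, centers = toy_kmeans(tmp, 3)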

Conclusion

Starting from freshly crawled data, this article has walked through the whole process: extracting the data into files, handling missing values and outliers, plotting the price and comment-count histograms, and visualizing a k-means clustering. This is the general workflow of a simple data analysis in Python. Once you have a dataset of your own, you can follow the code here to analyze it and play with the data. I will keep summarizing other Python crawler and data analysis techniques, learning as I write; some of it will be published on my blog, "Qiao don't chocolate". Welcome to visit!