Preface
A while back, friends in Guangzhou and Shenzhen were still in short sleeves, envying the snowy scenes up north. Then last week the cold snap reached Guangzhou and Shenzhen as well, and everyone joined the "cooling group chat".
To help everyone fight the cold, I went and scraped JD.com's down jacket data. Why not Tmall? Simple: its slider CAPTCHA is a bit of a hassle.
Data acquisition
JD.com loads its listings dynamically with Ajax, so the data can only be collected by calling the underlying interface or by driving the page with the Selenium automated-testing tool. Earlier posts on this public account cover dynamic-page crawling, and interested readers can look there for the details.
This time I used Selenium for the data collection. Because my Chrome browser updates quickly, the old ChromeDriver no longer matched it, so I turned off the browser's auto-update and downloaded the driver version that matches the current browser.
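If you would rather not manage the driver by hand, one way to sidestep the version mismatch is the third-party webdriver-manager package (not what I used here, shown with the Selenium 3-style positional path that the crawler below also uses); a minimal sketch:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads and caches a ChromeDriver that matches the installed Chrome,
# so a browser update no longer breaks the script
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.get('https://www.jd.com/')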
Then, with Selenium, I searched for down jackets on JD.com, logged in by scanning the QR code with my phone, and collected each product's name, price, store name, number of comments and other fields.
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from lxml import etree
import random
import json
import csv
import time

browser = webdriver.Chrome('.../chromedriver')  # path to the ChromeDriver that matches the browser
wait = WebDriverWait(browser, 50)               # set the explicit wait time
url = 'https://www.jd.com/'
data_list = []                                  # global list that stores the scraped data
keyword = "羽绒服"                              # search keyword: down jacket

def page_click(page_number):
    try:
        # scroll to the bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # wait for the "next page" button to become clickable, then click it
        button = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.p-next > em')))
        button.click()
        # wait until the first 30 goods have loaded
        wait.until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(30)")))
        # scroll again so the lazily loaded goods appear
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # wait until all 60 goods on the page have loaded
        wait.until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)")))
        # confirm the pager now shows the expected page number
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number)))
        html = browser.page_source  # grab the page source
        prase_html(html)            # call the parsing function to extract the data
    except TimeoutException:
        return page_click(page_number)
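The crawler calls prase_html to pull the fields out of the page source, but that function is not shown above. Below is a minimal sketch of what it could look like with lxml, plus a small helper that writes the collected rows to down.csv with the title/price/shop_name/comment columns used later in the cleaning step; the inner class names (p-name, p-price, p-shop, p-commit) are assumptions about JD's goods-list markup, not code from the original script.

def prase_html(html):
    # parse one result page and append every product's fields to data_list
    doc = etree.HTML(html)
    items = doc.xpath('//*[@id="J_goodsList"]/ul/li')
    for item in items:
        title = ''.join(item.xpath('.//div[contains(@class, "p-name")]//em//text()')).strip()
        price = ''.join(item.xpath('.//div[@class="p-price"]//i/text()')).strip()
        shop = ''.join(item.xpath('.//div[@class="p-shop"]//a/text()')).strip()
        comment = ''.join(item.xpath('.//div[@class="p-commit"]//a/text()')).strip()
        data_list.append([title, price, shop, comment])

def save_csv():
    # dump everything collected so far into a local CSV file
    with open('down.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'price', 'shop_name', 'comment'])
        writer.writerows(data_list)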
Data cleaning
Import data
import pandas as pd
import numpy as np

df = pd.read_csv("/Python/jd/down.csv")  # the CSV written by the crawler
df.sample(10)
Rename columns
df = df.rename(columns={"title": "product_name",
                        "price": "product_price",
                        "shop_name": "store_name",
                        "comment": "comment_count"})
Viewing Data Information
df.info()

"""
Three things to clean up:
1. possible duplicate rows
2. a missing value in store_name
3. comment_count is stored as text and needs cleaning

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4950 entries, 0 to 4949
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   product_name   4950 non-null   object
 1   product_price  4950 non-null   float64
 2   store_name     4949 non-null   object
 3   comment_count  4950 non-null   object
dtypes: float64(1), object(3)
memory usage: 154.8+ KB
"""
Deleting Duplicate Data
df = df.drop_duplicates()
Missing value handling
Df [" 表 名 "] = df[" 表 名 "].fillna(" 表 名 ")Copy the code
Product name cleaning
Thickness

tmp = []
for i in df["product_name"]:
    if "加厚" in i:        # "thickened"
        tmp.append("thick")
    elif "薄" in i:        # "thin"
        tmp.append("thin")
    else:
        tmp.append("other")
df["thickness"] = tmp
Cut

tmp = []
for i in df["product_name"]:
    if "修身" in i:        # slim fit
        tmp.append("slim")
    elif "宽松" in i:      # loose fit
        tmp.append("loose")
    else:
        tmp.append("other")
df["cut"] = tmp
Style

tmp = []
for i in df["product_name"]:
    if "时尚" in i:        # fashionable
        tmp.append("fashion")
    elif "休闲" in i:      # casual
        tmp.append("casual")
    elif "简约" in i:      # simple
        tmp.append("simple")
    else:
        tmp.append("other")
df["style"] = tmp
Price cleaning
Df [" price range "] = pd cut (df [" commodities "], [0, 100300, 500, 700, 1000000],labels=['100 yuan below ','100 yuan -300 yuan ','300 yuan -500 Yuan ','500 yuan -700 yuan ','700 yuan -1000 yuan ','1000 yuan above '],right=False)Copy the code
Comment count cleaning
import re

# JD shows counts such as "2万+" (万 = 10,000): extract the number and the unit, then multiply them
df['number'] = [re.findall(r'(\d+\.{0,1}\d*)', i)[0] for i in df['comment_count']]  # extract the numeric part
df['number'] = df['number'].astype('float')                                         # convert it to a number
df['unit'] = [''.join(re.findall(r'(万)', i)) for i in df['comment_count']]          # extract the unit (万)
df['unit'] = df['unit'].apply(lambda x: 10000 if x == '万' else 1)
df['comment_count'] = df['number'] * df['unit']                                      # compute the real comment count
df['comment_count'] = df['comment_count'].astype("int")
df.drop(['number', 'unit'], axis=1, inplace=True)
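As a quick sanity check of that conversion, here is how a hypothetical raw value of "2万+" goes through the same logic:

sample = "2万+"                                          # hypothetical raw value from the comments column
num = float(re.findall(r'(\d+\.{0,1}\d*)', sample)[0])   # -> 2.0
unit = 10000 if '万' in sample else 1                     # 万 means 10,000
print(int(num * unit))                                    # -> 20000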
Store name cleaning
If df[" STR "] = df[" STR "].Copy the code
Visualization
Import the visualization-related libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # set a font that can display Chinese characters
plt.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly
import jieba
import re
from pyecharts.charts import *
from pyecharts import options as opts
from pyecharts.globals import ThemeType
import stylecloud
from IPython.display import Image
Descriptive statistics
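A minimal sketch of this step, assuming the summary table is produced with pandas' describe() on the two numeric columns (the original code for it is not shown here):

# summary statistics for the numeric columns
df[['product_price', 'comment_count']].describe()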
Correlation analysis
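Likewise a minimal sketch for the correlation step, assuming a Pearson correlation between price and comment count drawn as a seaborn heatmap (not necessarily the exact chart used originally):

corr = df[['product_price', 'comment_count']].corr()  # Pearson correlation matrix
fig, axes = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='RdBu_r', vmin=-1, vmax=1)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)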
Histogram of the price distribution
Plot (df[" 小 plots "],color=" bins ",bins=10) Axes (fontsize=16) axes (fontsize=16) axes (fontsize=16)Copy the code
Histogram of the comment count distribution
Plot (df[" number of plots "],color="green",bins=10,rug=True) Axes (fontsize=16) axes (fontsize=16) axes (fontsize=16)Copy the code
Relationship between comment count and price
fig, axes = plt.subplots(figsize=(15, 8))
sns.scatterplot(x=df["comment_count"], y=df["product_price"])
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
Down jacket price distribution
Astype (" STR ").value_counts() print(df2) df2 = df2.sort_values(Ascending =False) Regions = df2.index.to_list() values = df2.to_list() c = ( Pie(init_opts=opts.InitOpts(theme=ThemeType.DARK)) .add("", list(zip(regions,values))) .set_global_opts(legend_opts = opts.LegendOpts(is_show = False),title_opts= opts.titLeopts (title=" down jacket price range distribution ",subtitle=" ",pos_top="0.5 ",pos_left = 'left') .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%",font_size=14)) ) c.render_notebook()Copy the code
Top 10 stores by comment count
df5 = df.groupby('store_name')['comment_count'].mean()
df5 = df5.sort_values(ascending=True)
df5 = df5.tail(10)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1100px", height="600px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # draw the bars horizontally
    .set_global_opts(title_opts=opts.TitleOpts(title="Top 10 stores by comment count", subtitle=""),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()
Cut

df5 = df.groupby('cut')['product_price'].mean()
df5 = df5.sort_values(ascending=True)[:2]
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # draw the bars horizontally
    .set_global_opts(title_opts=opts.TitleOpts(title="Average down jacket price by cut", subtitle=""),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()
Thickness

df5 = df.groupby('thickness')['product_price'].mean()
df5 = df5.sort_values(ascending=True)[:2]
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # draw the bars horizontally
    .set_global_opts(title_opts=opts.TitleOpts(title="Average down jacket price by thickness", subtitle=""),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()
Style

df5 = df.groupby('style')['product_price'].mean()
df5 = df5.sort_values(ascending=True)[:4]
# df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # draw the bars horizontally
    .set_global_opts(title_opts=opts.TitleOpts(title="Average down jacket price by style", subtitle=""),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()