
Having analyzed Ajax-loaded web pages by hand several times before, this time I return to Selenium automation to scrape web information.

Usually, for an asynchronously loaded page, we need to find the page's real request, construct the request parameters, and finally obtain the real request URL. With Selenium simulating browser operations, there is no need to worry about any of that: what you see in the browser is what you get.
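A minimal sketch of the idea (for illustration only; the wait target here is just JD's search box, not part of the actual crawl below): once Selenium has let the page render, the full HTML is one attribute away.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://www.jd.com/')
# Wait until an asynchronously loaded element is actually in the DOM,
# then read the fully rendered page source directly
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#key')))
html = browser.page_source  # what you see is what you get
browser.quit()
```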

Of course, the convenience comes with disadvantages, such as longer run time and lower efficiency. But for an amateur crawler, speed is not that important.
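One way to claw back some of that speed (not used in the crawl below, and assuming a Chrome version recent enough to support the flag) is to run the browser headless, so no window has to be drawn:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # render pages without opening a window
browser = webdriver.Chrome(options=options)
```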

Start by installing selenium in PyCharm, then download the ChromeDriver that matches the Google Chrome version on your computer. Because my macOS version is relatively new, I had to disable the Rootless kernel protection mechanism (SIP) before I could install the driver, so it took quite some trouble to get everything working.
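Once selenium and ChromeDriver are in place, a quick sanity check (assuming chromedriver is on your PATH) confirms the pairing works:

```python
from selenium import webdriver

browser = webdriver.Chrome()  # raises if ChromeDriver is missing or mismatched
browser.get('https://www.jd.com/')
print(browser.title)  # should print JD's page title
browser.quit()
```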

This time the target is JD Mall's notebook listing pages. Parsing the rendered page source alone is enough to get each notebook's price, title, number of comments, shop name, and shop type.

The crawl code is as follows:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import pymongo
import time

client = pymongo.MongoClient(host='localhost', port=27017)
db = client.JD_products
collection = db.products

# Start the browser
browser = webdriver.Chrome()
wait = WebDriverWait(browser, 50)


def to_mongodb(data):
    try:
        collection.insert_one(data)
        print("Insert The Data Successfully")
    except Exception:
        print('Insert The Data Failed')


def search():
    browser.get('https://www.jd.com/')
    try:
        # Find the search box and the search button, type the keyword, click
        input = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#key")))
        submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#search > div > div.form > button")))
        input[0].send_keys('笔记本')  # keyword: "notebook"
        submit.click()
        # Click the "notebook" category filter, then the sort button
        button_1 = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#J_selector > div:nth-child(2) > div > div.sl-value > div.sl-v-list > ul > li:nth-child(1) > a")))
        button_1.click()
        button_2 = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#J_filter > div.f-line.top > div.f-sort > a:nth-child(2)")))
        button_2.click()
        # Read the total number of result pages
        page = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > em:nth-child(1) > b')))
        return page[0].text
    except TimeoutException:
        return search()


def next_page(page_number):
    try:
        # Scroll to the bottom of the page so all commodity info gets loaded
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(10)
        html = browser.page_source
        parse_html(html)
        # After page 100 the "next" button no longer works, so end the program
        if page_number == 101:
            exit()
        # Find and click the "next page" button
        button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next > em')))
        button.click()
        # Wait until all 60 items are present and the page number has updated
        wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)")))
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number)))
    except TimeoutException:
        return next_page(page_number)


def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    goods_info = soup.select('.gl-item')
    quantity = len(goods_info)
    print(quantity)
    for info in goods_info:
        data = {}
        # Title; used as MongoDB's primary key (_id), so duplicates are rejected
        title = info.select('.p-name.p-name-type-2 a em')[0].text.strip()
        title = title.replace('\n', '')
        print("title: ", title)
        data['_id'] = title
        # Price
        price = info.select('.p-price i')[0].text.strip()
        price = int(float(price))
        print("price: ", price)
        data['price'] = price
        # Number of comments; "万" means ten thousand
        commit = info.select('.p-commit strong')[0].text.strip()
        commit = commit.replace('条', '')
        if '万' in commit:
            commit = commit.split('万')
            commit = int(float(commit[0]) * 10000)
        else:
            commit = int(float(commit.replace('+', '')))
        print("commit: ", commit)
        data['commit'] = commit
        # Shop name; fall back to "其他" ("other") when none is shown
        shop_name = info.select('.p-shop a')
        if len(shop_name) == 1:
            print("shop_name: ", shop_name[0].text.strip())
            data['shop_name'] = shop_name[0].text.strip()
        else:
            print("shop_name: ", '其他')
            data['shop_name'] = '其他'
        # Shop type: JD self-operated ("自营") or not ("非自营")
        shop_property = info.select('.p-icons i')
        if len(shop_property) >= 1:
            message = shop_property[0].text.strip()
            if message == '自营':
                print("shop_property: ", message)
                data['shop_property'] = message
            else:
                print("shop_property: ", '非自营')
                data['shop_property'] = '非自营'
        else:
            print("shop_property: ", '非自营')
            data['shop_property'] = '非自营'
        to_mongodb(data)
        print(data)
        print("\n\n")


def main():
    total = int(search())
    print(total)
    for i in range(2, total + 2):
        time.sleep(20)
        print("page", i - 1, ":")
        next_page(i)


if __name__ == "__main__":
    main()
```

Although I searched with the keyword "notebook" at the start, I still have to click the "notebook" category button afterwards. A direct search for "notebook" also returns the paper notebooks used for taking notes in class, which is useless information, so using JD's own finer-grained category gives exactly the products we want.

Each page holds 60 items of product information, so 100 pages should yield 6,000 items in total, yet only 5,992 were retrieved.

I estimate two reasons:

1. The commodity title is used as the primary key (_id) in MongoDB, and some titles are duplicated, so the duplicates were rejected (see the sketch after this list).

2. The website failed to load all the commodity information on some pages.
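Reason 1 is easy to reproduce in isolation. Since the crawl stores the title as MongoDB's `_id`, inserting a second item with the same title raises a duplicate-key error. A toy example, with a made-up title and a throwaway collection:

```python
import pymongo
from pymongo.errors import DuplicateKeyError

client = pymongo.MongoClient('localhost', 27017)
collection = client.demo_db.demo_products  # throwaway collection for the demo
collection.drop()

collection.insert_one({'_id': '某品牌笔记本 i5 8G 256G', 'price': 4999})
try:
    # Same title -> same _id -> rejected by MongoDB's unique index on _id
    collection.insert_one({'_id': '某品牌笔记本 i5 8G 256G', 'price': 5099})
except DuplicateKeyError:
    print('duplicate title skipped')
```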

In the end, the product information was captured successfully.

Next, read the data back from MongoDB for visual analysis.
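Before plotting, a quick sanity check with pymongo's count_documents (the collection names are the ones used by the crawl above) confirms how many records landed in MongoDB:

```python
import pymongo

client = pymongo.MongoClient('localhost', 27017)
table = client.JD_products.products
print(table.count_documents({}))  # 5992 for this crawl
```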

```python
from pyecharts import Bar
import pandas as pd
import numpy as np
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client.JD_products
table = db.products
df = pd.DataFrame(list(table.find()))

# Count products per shop and keep the 12 largest shops
shop_message = df.groupby(['shop_name'])
shop_com = shop_message['shop_name'].agg(['count'])
shop_com.reset_index(inplace=True)
shop_com_last = shop_com.sort_values('count', ascending=False)[:12]

attr = np.array(shop_com_last['shop_name'])
v1 = np.array(shop_com_last['count'])
# Shorten shop names by stripping boilerplate words such as "京东" (JD),
# "旗舰店" (flagship store), "自营" (self-operated), "官方" (official),
# "电脑" (computer), "专营店" (specialty store), "笔记本" (notebook)
attr = ["{}".format(i.replace('京东', '').replace('旗舰店', '').replace('自营', '')
        .replace('官方', '').replace('电脑', '').replace('专营店', '')
        .replace('笔记本', '')) for i in attr]
v1 = ["{}".format(i) for i in v1]

bar = Bar("Top 12 shops by product count", title_pos='center', title_top='18',
          width=800, height=400)
bar.add("", attr, v1, is_convert=True, xaxis_min=10, yaxis_label_textsize=12,
        is_yaxis_boundarygap=True, yaxis_interval=0, is_label_show=True,
        is_legend_show=False, label_pos='right', is_yaxis_inverse=True,
        is_splitline_show=False)
bar.render('shop_rank.html')
```

As you can see from the chart, ThinkPad sits at the top of the list, which also echoes the word cloud below: business and office, since ThinkPad is a notebook brand oriented mainly toward business office work. Lenovo, ASUS, Acer, and Huawei also make the list. Support domestic products!!

```python
from pyecharts import Bar
import pandas as pd
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client.JD_products
table = db.products
df = pd.DataFrame(list(table.find()))

# Bucket the prices into labelled ranges
price_info = df['price']
bins = [0, 2000, 2500, 3000, 3500, 4000, 5000, 6000, 7000, 8000, 9000,
        10000, 12000, 14000, 16000, 19000, 200000]
level = ['0-2000', '2000-2500', '2500-3000', '3000-3500', '3500-4000',
         '4000-5000', '5000-6000', '6000-7000', '7000-8000', '8000-9000',
         '9000-10000', '10000-12000', '12000-14000', '14000-16000',
         '16000-19000', '19000+']
price_stage = pd.cut(price_info, bins=bins, labels=level).value_counts().sort_index()

attr = price_stage.index
v1 = price_stage.values
bar = Bar("Price distribution", title_pos='center', title_top='10',
          width=800, height=400)
bar.add('', attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=0,
        xaxis_interval=0, is_splitline_show=False, is_label_show=True)
bar.render('price_bar.html')
```

Notebook prices are heavily concentrated in the 4000-6000 range, which to some extent reflects the mid-range price of notebooks today. I remember when I had just started college, a notebook at 5000+ already came with a good configuration and ran LOL with the effects turned all the way up.
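The binning in the chart above is done by pd.cut; a tiny standalone example with made-up prices shows how each price is mapped into a labelled range before value_counts tallies them:

```python
import pandas as pd

prices = pd.Series([1999, 4500, 5500, 13000])  # made-up sample prices
bins = [0, 2000, 4000, 6000, 16000]
level = ['0-2000', '2000-4000', '4000-6000', '6000-16000']
print(pd.cut(prices, bins=bins, labels=level).value_counts().sort_index())
# 0-2000        1
# 2000-4000     0
# 4000-6000     2
# 6000-16000    1
```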

```python
from pyecharts import Pie
import pandas as pd
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client.JD_products
table = db.products
df = pd.DataFrame(list(table.find()))

# Count self-operated vs. non-self-operated shops
shop_message = df.groupby(['shop_property'])
shop_com = shop_message['shop_property'].agg(['count'])
shop_com.reset_index(inplace=True)
shop_com_last = shop_com.sort_values('count', ascending=False)

attr = shop_com_last['shop_property']
v1 = shop_com_last['count']
pie = Pie("Shop property", title_pos='center', width=800, height=400)
pie.add('', attr, v1, radius=[40, 75], label_text_color=None,
        is_label_show=True, legend_orient='vertical', legend_pos='left')
pie.render('shop_property.html')
```

Counting up the self-operated and non-self-operated stores, the gap turns out not to be significant. The biggest difference between JD and Taobao is that JD has self-operated products and fast delivery. Self-operated stores can still sell fakes, but it is a small-probability event. When I buy electronic products such as mobile phones and computers, my first choice is the official site or a JD self-operated store; I will never go to an electronics mall to haggle with dishonest merchants, even if the price there may be lower. However, official-site delivery is usually slow, taking 3-5 days, while JD may take only 1-2 days, so JD is my best choice for purchases.

```python
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import pandas as pd
import pymongo
import jieba
import re

client = pymongo.MongoClient('localhost', 27017)
db = client.JD_products
table = db.products
data = pd.DataFrame(list(table.find()))
data = data[['_id']]  # the _id field holds the product title

text = ''
for line in data['_id']:
    # Strip letters, digits and punctuation (the configuration parameters),
    # plus the words "笔记本" (notebook) and "英寸" (inch)
    r = "[a-zA-Z0-9!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~，。？、…【】《》“”‘’！]+"
    line = re.sub(r, '', line.replace('笔记本', '').replace('英寸', ''))
    # Segment the remaining Chinese text with jieba
    text += ' '.join(jieba.cut(line, cut_all=False))

backgroud_Image = plt.imread('computer.jpeg')
wc = WordCloud(
    background_color='white',
    mask=backgroud_Image,
    font_path='msyh.ttf',  # a font that supports Chinese characters
    max_words=2000,
    stopwords=STOPWORDS,
    max_font_size=130,
    random_state=30
)
wc.generate_from_text(text)
# Recolor the cloud with the colors of the mask image
img_colors = ImageColorGenerator(backgroud_Image)
wc.recolor(color_func=img_colors)
plt.imshow(wc)
plt.axis('off')
wc.to_file("computer.jpg")
print("generated the word cloud successfully")
```
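The regular expression inside the loop above is what removes the configuration parameters from each title. A tiny standalone example (the title here is made up) shows the effect of the re.sub plus jieba segmentation:

```python
import re
import jieba

title = '联想小新 14英寸轻薄笔记本电脑 i5-8250U 8G 256G SSD 银色'  # made-up title
r = "[a-zA-Z0-9!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~]+"
cleaned = re.sub(r, '', title.replace('笔记本', '').replace('英寸', ''))
print(' '.join(jieba.cut(cleaned, cut_all=False)))
# prints roughly: 联想 小新 轻薄 电脑 银色 (the parameters are gone)
```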

Here, the notebook configuration parameters in the titles are filtered out with a regular expression. Although the parameters determine a notebook's performance, when actually buying one the most important thing is still your own needs and budget; only then consider the parameters, and finally pick the notebook that suits you. The usual notebook parameters are as follows:

CPU: Core series i3, i5, i7; standard-voltage (M suffix) and low-voltage (U suffix)

Hard disk: 500 GB, 1 TB, and 2 TB

Graphics card: AMD, NVIDIA

Memory: 4 GB, 8 GB