Disclaimer: The techniques and steps recorded in this article are intended only for learning and practicing crawler technology. We accept no responsibility for any act or omission by any person based on all or part of this article, or for any consequences arising from it.
Crawler requirement: crawl JD.com search results for a given keyword and collect each product's name, price, and cumulative review count.
Tools: Chrome, PyCharm
Python libraries: selenium, pyquery
01 Website structure analysis
Open the JD.com home page and search for "mobile phone":
Click through to a product's details page; all the information we need (name, price, review count) can be found there.
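The search URL carries the keyword as a percent-encoded query parameter (keyword=%E6%89%8B%E6%9C%BA is the UTF-8 encoding of "手机"). As an aside, a search URL for any keyword can be built like this (a minimal sketch; build_search_url is a hypothetical helper, and the pvid tracking parameter is left out):

from urllib.parse import quote

def build_search_url(keyword):
    # JD's search endpoint expects the UTF-8, percent-encoded keyword
    encoded = quote(keyword)
    return 'https://search.jd.com/Search?keyword={0}&enc=utf-8&wq={0}'.format(encoded)

print(build_search_url('手机'))
# https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&wq=%E6%89%8B%E6%9C%BA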
02 Create Selenium crawler
Open PyCharm, create selenium_Jd.py, and write the following code:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from pyquery import PyQuery

# hide the "Chrome is being controlled by automated test software" banner
options = Options()
options.add_experimental_option("excludeSwitches", ['enable-automation'])

# spoof a normal browser user agent
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
options.add_argument('--user-agent=%s' % user_agent)

browser = webdriver.Chrome(r'C:\chromedriver.exe', options=options)

# remove the navigator.webdriver flag before any page script runs,
# so the site cannot detect Selenium through it
browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})

jd_url = 'https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&wq=%E6%89%8B%E6%9C%BA&pvid=eb248c3d491144a99d93c073ef'


def start_page(page):
    current_page = 0
    while current_page < page:
        if current_page == 0:
            browser.get(jd_url)
        # (page turning for current_page > 0 is added in section 04)
        wait = WebDriverWait(browser, 30)
        browser.maximize_window()
        current_page = current_page + 1


start_page(1)
Run the code; the JD.com product search page opens successfully.
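As a quick sanity check that the CDP snippet above took effect, query the flag from Python (the expected None is how Selenium reports JavaScript's undefined):

# navigator.webdriver should now be undefined on every new page
print(browser.execute_script('return navigator.webdriver'))  # expected: None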
03 Crawl the product details page URLs
By analyzing the page source, we can locate each product's details page URL:
Write code to extract the details page URLs:
def get_detail_url(html_source):
    dom = PyQuery(html_source)
    # each search result is a .gl-item in the result list
    items = dom('.m-list .ml-wrap .gl-item').items()
    detail_urls = []
    for item in items:
        # the product link sits on the title anchor
        detail_urls.append(item.find('.p-name.p-name-type-2 a').attr('href'))
    return detail_urls
Call the details page parsing method; it runs successfully!
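For example (a sketch; it assumes, which is worth verifying against the live page, that JD returns protocol-relative links such as //item.jd.com/xxx.html, so https: is prepended before they are opened):

detail_url_list = get_detail_url(browser.page_source)
# normalize protocol-relative links so browser.get() can open them later
detail_url_list = ['https:' + url if url.startswith('//') else url
                   for url in detail_url_list]
print(len(detail_url_list), detail_url_list[:3])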
04 Page turning
The products further down the search page are lazy-loaded: they only render once the page has been scrolled to them. So we first drive the scroll bar down slowly, then crawl the details page URLs from the fully loaded page, and only then handle pagination:
# slowly drag the scroll bar down to the bottom so lazy-loaded products render
window_height = browser.execute_script('return document.body.scrollHeight')
current_height = 0
for i in range(current_height, window_height, 100):
    browser.execute_script('window.scrollTo(0, {})'.format(i))
    time.sleep(0.5)

# store the details page URLs of the current page
detail_url_list.extend(get_detail_url(browser.page_source))
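One caveat: document.body.scrollHeight can keep growing while products load, so the height measured up front may undershoot. A slightly more robust variant (my sketch, not from the original code) rescans the height until it stops changing:

last_height = 0
while True:
    browser.execute_script('window.scrollBy(0, 300)')
    time.sleep(0.5)
    new_height = browser.execute_script('return document.body.scrollHeight')
    at_bottom = browser.execute_script(
        'return window.pageYOffset + window.innerHeight >= document.body.scrollHeight')
    if at_bottom and new_height == last_height:
        break  # reached the bottom and nothing new loaded
    last_height = new_height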
Next, handle the paging controls and jump to the next page. To keep things simple, instead of building the next page's URL we simulate a mouse click on the "next page" button:
# find the "next page" button and click it through JavaScript
element = browser.find_element_by_class_name('pn-next')
browser.execute_script("arguments[0].click();", element)
Run the code; the page turns successfully!
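Putting the fragments together, the finished start_page loop might look like this sketch (my assembly of the pieces above; the fixed time.sleep(2) after navigation is an assumption, and an explicit WebDriverWait on the result list would be more reliable):

def start_page(page):
    detail_url_list = []
    current_page = 0
    while current_page < page:
        if current_page == 0:
            browser.get(jd_url)
        else:
            # jump to the next result page by clicking "next"
            element = browser.find_element_by_class_name('pn-next')
            browser.execute_script("arguments[0].click();", element)
        time.sleep(2)  # assumption: give the new page time to start rendering
        # scroll down slowly so lazy-loaded products render
        window_height = browser.execute_script('return document.body.scrollHeight')
        for i in range(0, window_height, 100):
            browser.execute_script('window.scrollTo(0, {})'.format(i))
            time.sleep(0.5)
        # collect this page's details page URLs
        detail_url_list.extend(get_detail_url(browser.page_source))
        current_page += 1
    return detail_url_list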
05 Process the product details pages
After crawling all the details page URLs, open each one and extract the product details:
def get_detail_info(html_source):
    dom = PyQuery(html_source)
    detail_dom = dom('.w .itemInfo-wrap')
    detail_info = {
        'name': detail_dom.find('.sku-name').text(),
        'price': detail_dom.find('.summary.summary-first').find('.summary-price-wrap').find('.dd').find('.p-price').text(),
        'comment_count': detail_dom.find('.summary.summary-first').find('.summary-price-wrap').find('.comment-count').text(),
    }
    print(detail_info)
    return detail_info
Iterate over the list of details page URLs crawled earlier and fetch the product details one by one:
def start_detail_page(detail_url_list):
    detail_info_list = []
    for detail_url in detail_url_list:
        browser.get(detail_url)
        time.sleep(1)  # brief pause so dynamically loaded fields (e.g. price) can render
        detail_info_list.append(get_detail_info(browser.page_source))
    print(detail_info_list)
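The entry point then just chains the two stages (a sketch that assumes the start_page variant above, which returns its URL list):

if __name__ == '__main__':
    detail_url_list = start_page(3)  # crawl the first 3 result pages
    # normalize protocol-relative links (see section 03) before opening them
    detail_url_list = ['https:' + url if url.startswith('//') else url
                       for url in detail_url_list]
    start_detail_page(detail_url_list)
    browser.quit()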
Run the code, and the product details print out one by one.
JD.com product information crawled successfully!