Disclaimer: The techniques and implementation process recorded in this article are intended solely for learning and practicing crawler technology. We accept no responsibility for anything arising from any act or omission by any person based on all or part of its contents.

Crawl requirements: search Taobao by keyword and scrape each product's name, price, and monthly sales.

Crawl tools: Chrome, PyCharm

Python libraries: selenium, pyquery

01 Website structure analysis

Open the Taobao home page and search for “mobile phone”:

Click through to a product details page; all the information we need can be found there.
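Note that the search results URL simply carries the keyword as a query parameter: `%E6%89%8B%E6%9C%BA` is the URL-encoded form of “手机”. A minimal sketch (standard library only, not part of the original script) for building the search URL for any keyword:

from urllib.parse import quote

def build_search_url(keyword):
    # Taobao's search endpoint takes the keyword in the "q" query parameter
    return 'https://s.taobao.com/search?q=' + quote(keyword)

print(build_search_url('手机'))  # https://s.taobao.com/search?q=%E6%89%8B%E6%9C%BA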

02 Create Selenium crawler

Open PyCharm, create selenium_tao.py, and write the following code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Hide the "Chrome is being controlled by automated test software" infobar
options = Options()
options.add_experimental_option("excludeSwitches", ['enable-automation'])

browser = webdriver.Chrome(r'C:\chromedriver.exe', options=options)

# Search results URL; %E6%89%8B%E6%9C%BA is the URL-encoded keyword "手机"
taobao_url = 'https://s.taobao.com/search?q=%E6%89%8B%E6%9C%BA'

def start_page(page):
    browser.get(taobao_url)
    browser.maximize_window()

start_page(1)
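Passing the chromedriver path positionally, as above, is the Selenium 3 style. If you are on Selenium 4 or newer, the path goes through a Service object instead; a minimal sketch:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_experimental_option("excludeSwitches", ['enable-automation'])
# Selenium 4 style: wrap the chromedriver path in a Service object
browser = webdriver.Chrome(service=Service(r'C:\chromedriver.exe'), options=options)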

Running this, Taobao detects the automated browser and the page errors out. Add the following code to remove the automation-control attribute (`navigator.webdriver`), which solves the problem:

# Inject a script that runs before any page script, hiding the webdriver flag
browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
  "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
  """
})
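To verify the patch took effect, you can read the property back (a quick check, not from the original article):

# Should print None once the CDP script is in place;
# under plain automation, Chrome reports True here.
browser.get(taobao_url)
print(browser.execute_script('return navigator.webdriver'))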

03 Crawl the product detail pages

Analysis of the search results page shows that the full product list and the pagination buttons are all rendered on the page, which is enough for this crawl. Wait for the list to load, then scroll to the bottom so the pagination area renders:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(browser, 30)
# Wait until the product list has rendered
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
# Drag the scroll bar to the bottom so the pagination area renders
browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
# Wait until the pagination buttons have rendered
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.wraper .items .item')))
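If some items only render as the page scrolls (lazy loading), a single jump to the bottom can miss them. A minimal sketch, under that assumption, that scrolls down in steps instead:

import time

def scroll_to_bottom(step_px=800, pause=0.5):
    # Scroll down step_px pixels at a time until the page height stops growing
    last_height = browser.execute_script('return document.body.scrollHeight')
    position = 0
    while position < last_height:
        position += step_px
        browser.execute_script(f'window.scrollTo(0, {position});')
        time.sleep(pause)  # give lazy-loaded content time to render
        last_height = browser.execute_script('return document.body.scrollHeight')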

Add code to collect and record the detail-page URL of every product on the page:

from pyquery import PyQuery

def get_detail_url():
    html = browser.page_source
    dom = PyQuery(html)
    items = dom('.m-itemlist .items .item').items()
    detail_urls = []
    for item in items:
        # Item links are protocol-relative ("//..."), so prepend "http:"
        detail_urls.append('http:' + item.find('.title a').attr('href'))
    print(detail_urls)
    return detail_urls
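Usage is straightforward: load a search page, then collect the URLs (a quick illustration, not from the original article):

start_page(1)
detail_urls = get_detail_url()
print(len(detail_urls), 'products found on this page')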

04 Page turning

After collecting all the product detail URLs on the current page, we need to turn the page: find the "next page" hyperlink and click it directly from code:

browser.find_element_by_css_selector('.wraper .items .item.next > a').click()

Running the code, the click throws an error, so switch to triggering the click through JavaScript instead:

element = browser.find_element_by_css_selector('.wraper .items .item.next > a')
browser.execute_script("arguments[0].click();", element)

Run the code again, and the page turns successfully!
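Putting steps 03 and 04 together, a crawl loop over several result pages might look like the sketch below (crawl_search_pages and page_count are illustrative names, built only from the helpers defined above):

def crawl_search_pages(page_count):
    # Collect detail URLs from page_count result pages, turning pages as we go
    all_detail_urls = []
    start_page(1)
    for page in range(page_count):
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        all_detail_urls.extend(get_detail_url())
        if page < page_count - 1:
            element = browser.find_element_by_css_selector('.wraper .items .item.next > a')
            browser.execute_script("arguments[0].click();", element)
    return all_detail_urls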

05 Processing the product details page

Analyze the HTML of the details page and extract its key information:

def get_detail_info(html_source):
    dom = PyQuery(html_source)
    # Scope all lookups to the detail block of the page
    detail_dom = dom('#J_DetailMeta')
    detail_info = {
        'name': detail_dom.find('.tb-detail-hd').find('a').text(),
        'price': detail_dom.find('.tm-fcs-panel').find('.tm-price-panel.tm-price-cur').find('.tm-price').text(),
        'sale_number': detail_dom.find('.tm-ind-panel').find('.tm-ind-item.tm-ind-sellCount').find('.tm-indcon').find('.tm-count').text()
    }
    return detail_info
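The scraped values come back as display strings. A hypothetical post-processing step (not part of the original code; the exact text formats on the live page may differ) could convert them to numbers:

def normalize_detail_info(detail_info):
    # Hypothetical cleanup, assuming price text like "5999.00" and sales
    # text like "3000"; adjust to whatever the live page actually shows
    price_text = detail_info['price'].replace('¥', '').strip()
    sales_text = detail_info['sale_number'].strip()
    try:
        price = float(price_text)
    except ValueError:
        price = price_text  # leave ranges such as "5999.00-6999.00" as-is
    return {
        'name': detail_info['name'],
        'price': price,
        'sale_number': int(sales_text) if sales_text.isdigit() else sales_text,
    }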

Open each of the previously collected detail URLs in turn and parse the details page:

def start_detail_page(detail_url_list):
    detail_info_list = []
    for detail_url in detail_url_list:
        browser.get(detail_url)
        # Set a maximum wait of 30 seconds for the detail header to render
        wait = WebDriverWait(browser, 30)
        wait.until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, '#J_DetailMeta .tb-detail-hd')
            )
        )
        detail_info_list.append(get_detail_info(browser.page_source))
    print(detail_info_list)
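Tying everything together, a full run might look like the sketch below (crawl_search_pages is the hypothetical loop from section 04; writing results to CSV is an illustrative extra, not part of the original script):

import csv

def main():
    detail_urls = crawl_search_pages(3)  # illustrative: first 3 result pages
    infos = []
    for url in detail_urls:
        browser.get(url)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_DetailMeta .tb-detail-hd')))
        infos.append(get_detail_info(browser.page_source))
    # Persist the results so they survive the browser session
    with open('taobao_phones.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'price', 'sale_number'])
        writer.writeheader()
        writer.writerows(infos)

main()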

All the sample code can be downloaded by replying with the keyword [pachong23] on the WeChat official account!