Disclaimer: The techniques and implementation steps recorded in this article are for learning about crawler technology only. We accept no responsibility for any consequences arising from any act or omission by any person based on all or part of its contents.
Crawling requirements: search Taobao for products by keyword and collect each product's name, price, and monthly sales;
Crawl tools: Chrome, PyCharm
Python library: selenium
01 Website structure analysis
Open the Taobao home page and search for "mobile phone":
Click through to a product's details page; all the required information can be found there.
02 Create Selenium crawler
Open the PyCharm development tool, create selenium_tao.py, and write the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option("excludeSwitches", ['enable-automation'])
browser = webdriver.Chrome(r'C:\chromedriver.exe', options=options)

taobao_url = 'https://s.taobao.com/search?q=%E6%89%8B%E6%9C%BA'

def start_page(page):
    browser.get(taobao_url)
    browser.maximize_window()

start_page(1)
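The query string in taobao_url is simply the URL-encoded keyword "手机" ("mobile phone"). To search for a different keyword, the standard library can build the URL; a minimal sketch (the alternate keyword is just an example):

```python
from urllib.parse import quote, unquote

keyword = '手机'  # "mobile phone"
encoded = quote(keyword)  # UTF-8 percent-encoding
print(encoded)            # %E6%89%8B%E6%9C%BA

# decode works in the other direction too
print(unquote('%E6%89%8B%E6%9C%BA'))  # 手机

# build a search URL for any keyword
search_url = 'https://s.taobao.com/search?q=' + encoded
```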
Running the code produces an error: Taobao detects the automated browser. Adding the following code removes the webdriver automation flag, which solves the problem:
browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})
03 Crawl the product detail pages
Analysis of the search results page shows that both the product list and the paging buttons are rendered on the page, which is enough for this crawl:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

wait = WebDriverWait(browser, 30)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
# drag the scroll bar to the bottom so the paging buttons load
browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.wraper .items .item')))
Add code to collect the detail-page URLs of every product on the page:
from pyquery import PyQuery

def get_detail_url():
    html = browser.page_source
    dom = PyQuery(html)
    items = dom('.m-itemlist .items .item').items()
    detail_urls = []
    for item in items:
        detail_urls.append('http:' + item.find('.title a').attr('href'))
    print(detail_urls)
    return detail_urls
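The same extraction can be done without PyQuery. As an offline sketch, here is a standard-library parser run against a hypothetical fragment of the search-page HTML (the sample markup and class names mirror the selectors above, but are assumptions, not a capture of the real page):

```python
from html.parser import HTMLParser

class ItemLinkParser(HTMLParser):
    """Collect hrefs of links inside elements with class 'title'."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'title' in attrs.get('class', '').split():
            self.in_title = True
        elif tag == 'a' and self.in_title and 'href' in attrs:
            href = attrs['href']
            # Taobao hrefs are protocol-relative, e.g. //item.taobao.com/...
            self.urls.append('http:' + href if href.startswith('//') else href)

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_title = False

# hypothetical sample of one search-result item
sample = ('<div class="item"><div class="title">'
          '<a href="//item.taobao.com/item.htm?id=1">phone</a>'
          '</div></div>')
parser = ItemLinkParser()
parser.feed(sample)
print(parser.urls)  # ['http://item.taobao.com/item.htm?id=1']
```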
04 Page turning
After collecting all the detail-page URLs on the current page, the crawler needs to turn the page: find the "next page" hyperlink and click it directly from code:
browser.find_element_by_css_selector('.wraper .items .item.next > a').click()
Running this, the page turn fails with an error, so switch to clicking the element via JavaScript instead:
element = browser.find_element_by_css_selector('.wraper .items .item.next > a')
browser.execute_script("arguments[0].click();", element)
Run the code again: the page turn succeeds!
05 Processing of commodity details page
Analyze the HTML of the detail page and extract its key information:
def get_detail_info(html_source):
    dom = PyQuery(html_source)
    detail_dom = dom('#J_DetailMeta')
    detail_info = {
        'name': detail_dom.find('.tb-detail-hd').find('a').text(),
        'price': detail_dom.find('.tm-fcs-panel')
                           .find('.tm-price-panel.tm-price-cur')
                           .find('.tm-price').text(),
        'sale_number': detail_dom.find('.tm-ind-panel')
                                 .find('.tm-ind-item.tm-ind-sellCount')
                                 .find('.tm-indcon').find('.tm-count').text()
    }
    return detail_info
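The extracted price is a string, and on Taobao it may be a range such as "5999.00-6999.00". If numeric values are needed for later analysis, a small helper can normalize it; this is a hypothetical sketch (the range format and currency symbol are assumptions, not confirmed by the page above):

```python
def parse_price(price_text):
    """Convert a Taobao price string to a float.

    Hypothetical helper: strips a leading yen/yuan sign and,
    for a range like '5999.00-6999.00', keeps the lower bound.
    """
    first = price_text.split('-')[0].replace('¥', '').strip()
    return float(first)

print(parse_price('5999.00'))           # 5999.0
print(parse_price('¥5999.00-6999.00'))  # 5999.0
```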
Open each of the previously collected detail URLs in turn and parse the detail-page information:
def start_detail_page(detail_url_list):
    detail_info_list = []
    for detail_url in detail_url_list:
        browser.get(detail_url)
        # set the maximum wait time
        wait = WebDriverWait(browser, 30)
        wait.until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, '#J_DetailMeta .tb-detail-hd')
            )
        )
        detail_info_list.append(get_detail_info(browser.page_source))
    print(detail_info_list)
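Like the price, the sale_number field arrives as text. Taobao commonly abbreviates large counts with "万" (10,000), e.g. "1.2万", and may append "+". A hypothetical normalizer, under the assumption that these are the only formats encountered:

```python
def parse_sale_count(text):
    """Parse a Taobao monthly-sales string into an integer.

    Hypothetical helper: handles plain numbers ('328'),
    '+' suffixes ('5000+'), and '万' abbreviations ('1.2万' = 12000).
    The exact display formats are an assumption.
    """
    text = text.replace('+', '').strip()
    if '万' in text:
        return int(float(text.replace('万', '')) * 10000)
    return int(text)

print(parse_sale_count('328'))    # 328
print(parse_sale_count('5000+'))  # 5000
print(parse_sale_count('1.2万'))  # 12000
```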
All sample code can be downloaded by replying with the keyword [pachong23] on the WeChat official account!