A colleague of mine recently started learning Python for work, but after half a month he still could not get his head around Selenium, and it was driving him up the wall. Finally, he asked me to help him out.

So I walked him through a Taobao crawler as an example, and he figured it out in less than an hour. It is a crawler project that beginners can understand.


There are a few concepts we need to understand before the crawler begins, since Selenium will be used in this crawler.

What is Selenium?

Selenium is a web testing automation tool that automates the operation of a browser. You need to install the corresponding driver for the browser you want to control; for example, to operate Chrome through Selenium, ChromeDriver must be installed and its version must match your Chrome version.

After that, install Selenium:

pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple

1. Import modules

First, we import the module:

from selenium import webdriver

We will use other modules later as well, so let me show them all here:
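The import list below is reconstructed from the modules used in the code later in the article (lxml and json come in when parsing and saving the data), so treat it as a reasonable approximation rather than the author's exact code:

import time
import sys
import json

from lxml import etree
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException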


2. Initialize the browser

Next comes the initialization of the browser:

browser = webdriver.Chrome()

You can use many browsers with it: Android, BlackBerry, Internet Explorer, and so on. To use a different browser, download the corresponding browser driver.

Since I only have the Chrome driver installed, I use Chrome here; you can download the driver yourself.

Driver for Chrome:

Npm.taobao.org/mirrors/chr…
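If the downloaded chromedriver executable is not on your PATH, you can point Selenium at it explicitly when creating the browser. This is a minimal sketch in the Selenium 3 style (matching the find_element_by_* calls used later); the path is a placeholder, not part of the original article:

from selenium import webdriver

# Path to the chromedriver you downloaded (placeholder; adjust to your own system)
browser = webdriver.Chrome(executable_path='/path/to/chromedriver')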


3. Login and obtain page

The first thing to solve is the login problem. Do not enter your account credentials directly when logging in, because Taobao's anti-crawling measures are particularly strict; if it detects that you are a crawler, it will not let you log in.

So I used another login method, Alipay QR-code login, and requested the URL of the Alipay QR-code login page.

def loginTB():
    browser.get(
        'https://auth.alipay.com/login/index.htm?loginScene=7&goto=https%3A%2F%2Fauth.alipay.com%2Flogin%2Ftaobao_trust_login.htm%3Ftarget%3Dhttps%253A%252F%252Flogin.taobao.com%252Fmember%252Falipay_sign_dispatcher.jhtml%253Ftg%253Dhttps%25253A%25252F%25252Fwww.taobao.com%25252F&params=VFBMX3JlZGlyZWN0X3VybD1odHRwcyUzQSUyRiUyRnd3dy50YW9iYW8uY29tJTJG')


This jumps to the Alipay QR-code login interface.

I set a wait time of 180 seconds for the search box to appear. It does not actually wait the full 180 seconds; it is an explicit wait, so as soon as the element appears, it stops waiting.

Then locate the search box, enter a keyword, and search.

    # Wait for the search box to appear
    wait = WebDriverWait(browser, 180)
    wait.until(EC.presence_of_element_located((By.ID, 'q')))

    # Find the search box, enter the search keyword and click Search
    text_input = browser.find_element_by_id('q')
    text_input.send_keys('food')
    btn = browser.find_element_by_xpath('//*[@id="J_TSearchForm"]/div[1]/button')
    btn.click()

4. Parse the data

After the web page is obtained, we parse the data. The required product data is extracted with the lxml parsing library, using XPath to select child nodes directly.
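The article does not show the parsing code itself, so this is only a minimal sketch of what parse_html() might look like, assuming the search-result items are selected with lxml and XPath. The XPath expressions and the shop_data list are illustrative placeholders, not the author's exact code:

shop_data = []  # collected product records, written to shop_data.json at the end

def parse_html(html):
    # Build an lxml tree from the page source returned by Selenium
    root = etree.HTML(html)
    # Placeholder XPath: each search-result item on the page
    items = root.xpath('//div[contains(@class, "item")]')
    for item in items:
        # Placeholder XPaths for the fields we want; adjust to the real page structure
        title = ''.join(item.xpath('.//a[contains(@class, "J_ClickStat")]//text()')).strip()
        price = ''.join(item.xpath('.//strong/text()')).strip()
        shop = ''.join(item.xpath('.//a[contains(@class, "shopname")]//text()')).strip()
        shop_data.append({'title': title, 'price': price, 'shop': shop})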


5. Crawl the pages

After searching, the required results page appears, but we do not want to crawl just one page; we want to keep moving to the next page and collect product information from multiple pages. An endless loop keeps crawling until there is no next page of products.

def loop_get_data() :
    page_index = 1
    while True:
        print("=================== is fetching page {} ===================".format(page_index))
        print("Current page URL:" + browser.current_url)
        # parse data
        parse_html(browser.page_source)

        # Set an explicit wait for the next-page button
        wait = WebDriverWait(browser, 60)
        wait.until(EC.presence_of_element_located((By.XPATH, '//a[@class="J_Ajax num icon-tag"]')))
        time.sleep(1)
        try:
            # Use an action chain to scroll to the next-page button element
            write = browser.find_element_by_xpath('//li[@class="item next"]')
            ActionChains(browser).move_to_element(write).perform()
        except NoSuchElementException as e:
            print("Climb complete, there is no next page of data!")
            print(e)
            sys.exit(0)
        time.sleep(0.2)
        # Click next
        a_href = browser.find_element_by_xpath('//li[@class="item next"]')
        a_href.click()
        page_index += 1

6. Crawler completion

loop_get_data() runs the while loop internally, so it only needs to be called once.

After the crawl is finished, the data is saved to a shop_data.json file.
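The saving step is not shown in the article either; here is a minimal sketch, assuming the parsed records were collected in the shop_data list from the parsing sketch above. The save_data() helper and the __main__ wiring are my own additions, not the author's code:

def save_data():
    # Write the collected records to shop_data.json (UTF-8 so Chinese text stays readable)
    with open('shop_data.json', 'w', encoding='utf-8') as f:
        json.dump(shop_data, f, ensure_ascii=False, indent=2)

if __name__ == '__main__':
    loginTB()            # log in via the Alipay QR code and run the search
    try:
        loop_get_data()  # crawl page after page until there is no next page
    finally:
        save_data()      # runs even when loop_get_data() exits via sys.exit(0)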

The result of the crawl is as follows:

All the web pages involved in this crawler can be replaced. If you need the source code, comment "Taobao" in the comment section and I will send it to you by private message. You can also ask me about any problems you run into while crawling.