Preface

“If you write your crawlers well, prison will feed you early” is a common industry joke used to tease crawler engineers. As a crawler engineer, you cannot casually scrape sensitive or important data for commercial use, or you may be “invited in for tea” by the authorities at any time. This year a number of internet-finance companies have been reported and investigated over problems with their crawlers.

However, the technology itself is not guilty, and it remains worth learning for us developers. Before we start, let's go over the concepts.

What is a crawler

Web crawler: also known as a web spider or web robot, a program or script that automatically scrapes information from the World Wide Web according to certain rules.

In the era of big data, the first thing any data analysis needs is a data source. Where do data sources come from? With a budget you can simply buy data; without one, you can only crawl it from other websites.

Broken down, the industry divides into two camps: crawlers and anti-crawlers.

Anti-crawler: as the name implies, measures that stop you from pointing your crawler at my website or app.

Crawler engineers and anti-crawler engineers are a quarrelsome couple: every move by one side forces the other to work overtime writing code to counter it.

The basic principles of crawlers
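In essence, every crawler repeats the same four-step cycle: send an HTTP request, receive the response, parse out the data you want, and store it. Here is a minimal sketch of that cycle using the Requests and Beautiful Soup libraries introduced below; the CSS selector reflects the Douban page structure at the time of writing, so treat it as an illustration rather than a guarantee:

import requests
from bs4 import BeautifulSoup

# 1. Send a request (a browser-like User-Agent avoids the most basic blocking)
resp = requests.get('https://movie.douban.com/top250',
                    headers={'User-Agent': 'Mozilla/5.0'})

# 2. Receive the response
html = resp.text

# 3. Parse out the data you want
soup = BeautifulSoup(html, 'html.parser')
titles = [span.text for span in soup.select('div.hd span.title')]

# 4. Store it (here we just print)
for title in titles:
    print(title)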

Crawler tools and language selection

1. Crawler tools

As we all know, good tools improve productivity. Here are the ones I recommend: Chrome, Charles, Postman, and XPath Helper.

2. Crawler languages

At present, mainstream languages such as Java, Node.js, C#, and Python can all be used to implement a crawler.

So when choosing a language, simply pick the one you are best at for writing crawler scripts.

In practice, most crawlers today are written in Python: its syntax is simple and easy to modify, it has plenty of crawler-related libraries that work out of the box, and there is a wealth of material online.

Use of the Selenium library for Python crawlers

1. Basic knowledge

To write crawlers in Python, we first need the basics of the Python language itself, plus some HTML, CSS, JS, and Ajax. Here is a list of crawling-related libraries and frameworks in Python:

1.1 urllib and urllib2
1.2 Requests
1.3 Beautiful Soup
1.4 XPath syntax and the lxml library
1.5 PhantomJS
1.6 Selenium
1.7 PyQuery
1.8 Scrapy

Due to limited space, this article only introduces crawling with the Selenium library; interested readers can study the other libraries and frameworks on their own.

2. Selenium basics

2.1 Selenium is an automated testing tool for testing websites. It supports a variety of browsers, including Chrome, Firefox, Safari, and PhantomJS.

2.2 Installation method

pip install selenium
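Once it is installed, you can do a quick sanity check. This is a minimal sketch, assuming Chrome and a matching chromedriver are on your PATH:

from selenium import webdriver

driver = webdriver.Chrome()  # launches a real Chrome instance
driver.get('https://movie.douban.com/top250')
print(driver.title)  # title of the fully rendered page
driver.quit()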

2.3 Eight ways for Selenium to locate elements
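The eight locator strategies are: id, name, class name, tag name, link text, partial link text, XPath, and CSS selector. Below is a sketch of all eight against the Douban page; the selectors themselves are illustrative and may not all match, but find_elements returns an empty list rather than raising when nothing is found:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://movie.douban.com/top250')

driver.find_elements(By.ID, 'content')                         # 1. by id attribute
driver.find_elements(By.NAME, 'q')                             # 2. by name attribute
driver.find_elements(By.CLASS_NAME, 'grid_view')               # 3. by class name
driver.find_elements(By.TAG_NAME, 'ol')                        # 4. by tag name
driver.find_elements(By.LINK_TEXT, '后页>')                     # 5. by exact link text
driver.find_elements(By.PARTIAL_LINK_TEXT, '后页')              # 6. by partial link text
driver.find_elements(By.XPATH, "//ol[@class='grid_view']/li")  # 7. by XPath expression
driver.find_elements(By.CSS_SELECTOR, 'ol.grid_view > li')     # 8. by CSS selector

driver.quit()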

Crawler Example Demonstration

The requirement for this example: crawl the Douban Top 250 movie information.

url: https://movie.douban.com/top250

Database table scripts:

CREATE TABLE Movies (
	Id INT PRIMARY KEY IDENTITY(1,1),
	Name NVARCHAR(20) NOT NULL DEFAULT ' ',
	EName NVARCHAR(50) NOT NULL DEFAULT ' ',
	OtherName NVARCHAR(50) NOT NULL DEFAULT ' ',
	Info NVARCHAR(600) NOT NULL DEFAULT ' ',
	Score NVARCHAR(5) NOT NULL DEFAULT '0',
	Number NVARCHAR(20) NOT NULL DEFAULT '0',
	Remark NVARCHAR(200) NOT NULL DEFAULT ' ',
	createUser INT NOT NULL DEFAULT 0,	
	createTime DATETIME DEFAULT GETDATE(),
	updateUser INT NOT NULL DEFAULT 0,	
	updateTime DATETIME DEFAULT GETDATE()
);

The first step of any crawler is to analyze the URL. After analysis, the Douban Top 250 page URLs turn out to follow a clear rule:

Each page displays 25 movies, and the start parameter advances by 25 per page: https://movie.douban.com/top250?start=0 for page 1, ?start=25 for page 2, ?start=50 for page 3, and so on.

import importlib
import random
import sys
import time
import pymssql
from selenium import webdriver
from selenium.webdriver.common.by import By

# Anti-crawler countermeasures: fake source IPs and request headers
ip = ['111.155.116.210', '115.223.217.216', '121.232.146.39', '221.229.18.230',
      '115.223.220.59', '115.223.244.146', '180.118.135.26', '121.232.199.197',
      '121.232.145.101', '121.31.139.221', '115.223.224.114']
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'X-Forwarded-For': ip[random.randint(0, 10)],
    'Host': ip[random.randint(0, 10)]
}

importlib.reload(sys)  # left over from the Python 2 reload(sys) encoding trick; effectively a no-op here

try:
    conn = pymssql.connect(host="127.0.0.1", user="sa", password="123", database="MySchool",charset="utf8")
except pymssql.OperationalError as msg:
    print("error: Could not Connection SQL Server! please check your dblink configure!")
    sys.exit()
else:
    cur = conn.cursor()

def main():
    for n in range(0, 10):  # the Top 250 spans 10 pages
        count = n * 25      # each page starts 25 movies further in
        url = 'https://movie.douban.com/top250?start=' + str(count)
        j = 1
        # To resume an interrupted run from a specific item:
        # if n == 7:
        #     j = 5
        for i in range(j, 26):  # 25 movies per page
            # PhantomJS needs the phantomjs binary and Selenium 3.x
            # (PhantomJS support was removed in Selenium 4)
            driver = webdriver.PhantomJS(desired_capabilities=headers)  # wrap the fake headers into the browser
            driver.set_page_load_timeout(15)
            driver.get(url)  # load the page
            # data = driver.page_source
            # driver.save_screenshot('1.png')  # screenshot for debugging

            spans = driver.find_elements(By.XPATH, "//ol/li[" + str(i) + "]/div/div/div/a/span")
            name = spans[0].text.replace('\'', '')
            ename = spans[1].text.replace("/", "").replace(" ", "").replace('\'', '')
            try:
                otherName = spans[2].text.lstrip('/').replace("/", "|").replace(" ", "").replace('\'', '')
            except IndexError:  # some movies have no third title
                otherName = ''
            info = driver.find_elements(By.XPATH, "//ol/li[" + str(i) + "]/div/div/div/p")[0].text \
                .replace("/", "|").replace(" ", "").replace('\'', '')
            score = driver.find_elements(By.XPATH, "//ol/li[" + str(i) + "]/div/div/div/div/span[2]")[0].text \
                .replace('\'', '')
            number = driver.find_elements(By.XPATH, "//ol/li[" + str(i) + "]/div/div/div/div/span[4]")[0].text \
                .replace("人评价", "").replace('\'', '')  # strip the "people rated" suffix
            remark = driver.find_elements(By.XPATH, "//ol/li[" + str(i) + "]/div/div/div/p/span")[0].text \
                .replace('\'', '')

            sql = "INSERT INTO Movies (Name, EName, OtherName, Info, Score, Number, Remark) VALUES ('" \
                  + name + "', '" + ename + "', '" + otherName + "', '" + info + "', '" \
                  + score + "', '" + number + "', '" + remark + "')"
            try:
                cur.execute(sql)
                conn.commit()
                print("inserted: page " + str(n) + ", item " + str(i))
            except Exception:
                conn.rollback()
                print("insert failed: " + sql)
            driver.quit()


if __name__ == '__main__':
    main()
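A design note that goes beyond the original script: concatenating scraped text into the SQL string is the reason every single quote has to be scrubbed above, and it leaves the script open to SQL injection. pymssql also supports parameterized queries with %s placeholders, which is the safer pattern:

# Safer variant of the insert: pymssql fills in the %s placeholders itself,
# so quotes in the scraped text no longer need to be stripped by hand.
sql = ('INSERT INTO Movies (Name, EName, OtherName, Info, Score, Number, Remark) '
       'VALUES (%s, %s, %s, %s, %s, %s, %s)')
cur.execute(sql, (name, ename, otherName, info, score, number, remark))
conn.commit()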

Results:

If this article helped you, please like it and share it with your friends. If you want more development knowledge, you can scan the QR code to follow my WeChat public account and get a member account and learning materials. Note: the data involved in this crawler example is for learning purposes only and not for any commercial use; if anything here is improper, please contact this public account.