What is a crawler
A crawler, also known as a "web crawler," is a program that automatically accesses the Internet and downloads web content. Crawlers are also the foundation of search engines: Baidu and Google rely on powerful web crawlers to retrieve massive amounts of Internet information and store it so they can provide high-quality search services to users.
What are crawlers used for
You might ask: apart from working at a search-engine company, what is the point of learning to write crawlers? Good question. Here is an example: company A runs a user forum where many users post comments about their experience with the product. A now wants to understand user needs and analyze user preferences to prepare for the next round of product iteration. How does it get that data? Naturally, with a crawler that collects it from the forum. So besides Baidu and Google, many companies are hiring crawler engineers at high salaries. Search for "crawler engineer" on any job site and look at the number of openings and the salary ranges to get an idea of how much demand there is.
How a crawler works
Initiate a request: send an HTTP request to the target site and wait for the target server to respond.
Get the response content: if the server responds normally, you get a response whose body is the page content you want. It may be HTML, a JSON string, binary data (such as images or videos), and so on.
Parse the content: HTML can be parsed with regular expressions or an HTML parsing library; JSON can be converted directly into a JSON object; binary data can be saved as-is or processed further.
Save the data: once parsed, the data is saved, either as a text file or in a database.
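To make the four steps concrete, here is a minimal, self-contained sketch of that flow; the URL and output file name are placeholders chosen for illustration, not part of the original article:

import requests                               # step 1: initiate the request
from bs4 import BeautifulSoup

url = 'https://example.com'                   # placeholder target site
response = requests.get(url)                  # step 2: get the response content
soup = BeautifulSoup(response.text, 'lxml')   # step 3: parse the content
title = soup.title.text if soup.title else ''

with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(title)                            # step 4: save the data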
A Python crawler example
The definition, purpose, and working principle of crawlers have been covered above, and I am sure many readers are now interested and ready to give it a try. So let's get to the good stuff: here is a simple Python crawler.
1. Preparation: install a Python environment, install PyCharm, install a MySQL database, create a database named exam, and create a table named house in exam to store the crawl results [SQL statement: create table house(price varchar(88),unit varchar(88),area varchar(88));]
2. Goal: crawl the price, unit, and area of the houses behind every link on the home page of a rental website, and store the results in the database.
3. Crawler source code: as follows
import requests                 # request URL page content
from bs4 import BeautifulSoup   # parse page elements
import pymysql                  # connect to the database
import time                     # time functions
import lxml                     # parsing library (supports HTML/XML parsing and XPath)

# get_page fetches the content of the URL with requests.get and wraps it in a
# BeautifulSoup object that the rest of the code can work with
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    return soup

# get_links retrieves all rental listing links from the list page
def get_links(link_url):
    soup = get_page(link_url)
    links_div = soup.find_all('div', class_="pic-panel")
    links = [div.a.get('href') for div in links_div]
    return links

# get_house_info extracts the information of one listing page: price, unit, area, etc.
def get_house_info(house_url):
    soup = get_page(house_url)
    price = soup.find('span', class_='total').text
    unit = soup.find('span', class_='unit').text.strip()
    area = 'test'   # placeholder test value (see the note at the end of the article)
    info = {
        'price': price,
        'unit': unit,
        'area': area
    }
    return info

# database configuration, collected in a dictionary
DataBase = {
    'host': '127.0.0.1',
    'database': 'exam',
    'user': 'root',
    'password': 'root',
    'charset': 'utf8mb4'
}

# connect to the database
def get_db(setting):
    return pymysql.connect(**setting)

# insert the crawled data into the database
def insert(db, house):
    values = "'{}'," * 2 + "'{}'"
    sql_values = values.format(house['price'], house['unit'], house['area'])
    sql = """
    insert into house(price,unit,area) values({})
    """.format(sql_values)
    cursor = db.cursor()
    cursor.execute(sql)
    db.commit()

# main flow: 1. connect to the database; 2. get the list of listing links;
# 3. insert them into the database one by one
db = get_db(DataBase)
links = get_links('https://bj.lianjia.com/zufang/')
for link in links:
    time.sleep(2)
    house = get_house_info(link)
    insert(db, house)
First of all, "to do a good job, one must first sharpen one's tools," and the same applies to writing a crawler in Python. While writing a crawler we need to import various libraries; it is these extremely useful libraries that do most of the crawler's work for us, and we only need to call the relevant interface functions. The import syntax is import <library name>. Note that in PyCharm you can install a library by placing the cursor on its name and pressing Ctrl+Alt, or from the command line with pip install. If a library fails to install or is missing, the crawler will definitely throw an error. In this code, the first five lines import the relevant libraries: requests fetches URL page content; BeautifulSoup parses page elements; pymysql connects to the database; time provides various time functions; lxml is a parsing library that supports HTML and XML parsing as well as XPath.
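As a quick sanity check, the snippet below simply imports the five libraries; the pip package names in the comment (requests, beautifulsoup4, PyMySQL, lxml) are the usual PyPI names and are not spelled out in the article itself:

# install first, if needed:  pip install requests beautifulsoup4 PyMySQL lxml
import requests
import bs4
import pymysql
import time
import lxml

print('all crawler dependencies import correctly')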
Next, let's start from the main program at the end of the code and walk through the whole crawler flow:
Connect to the database with the get_db function. get_db connects to the database by calling pymysql's connect function; **setting is Python's way of collecting keyword arguments, so the dictionary DataBase is passed to connect as a set of named parameters.
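If the ** syntax is unfamiliar, here is a small standalone illustration; connect_demo and its parameters are made up purely for the example, while the real code only calls pymysql.connect:

# the key/value pairs of the dict become keyword arguments of the call
def connect_demo(host, user, password, database, charset):
    print('connecting to', database, 'on', host, 'as', user)

settings = {'host': '127.0.0.1', 'user': 'root', 'password': 'root',
            'database': 'exam', 'charset': 'utf8mb4'}
connect_demo(**settings)   # same as connect_demo(host='127.0.0.1', user='root', ...)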
Get the links of all listings on the Lianjia home page with the get_links function; all of them end up in the list links. get_links requests the home page via requests, formats the content into something it can work with through BeautifulSoup, then uses find_all to find every div that contains a listing picture, and finally a for loop (written as a list comprehension) pulls out the href attribute of the <a> tag inside each of those divs.
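The self-contained snippet below shows the same find_all plus list-comprehension pattern on a made-up fragment of HTML; the class name pic-panel matches the code above, but the links themselves are fictitious:

from bs4 import BeautifulSoup

html = '''
<div class="pic-panel"><a href="https://example.com/zufang/1.html"><img src="1.jpg"/></a></div>
<div class="pic-panel"><a href="https://example.com/zufang/2.html"><img src="2.jpg"/></a></div>
'''
soup = BeautifulSoup(html, 'lxml')
links_div = soup.find_all('div', class_='pic-panel')   # every picture-panel div
links = [div.a.get('href') for div in links_div]       # href of the <a> inside each div
print(links)   # ['https://example.com/zufang/1.html', 'https://example.com/zufang/2.html']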
A for loop then walks through all the links in links (for example, one of them is bj.lianjia.com/zufang/1011…).
For each link, get_house_info works in the same way as get_links: it fetches and parses the listing page, then uses find to locate the target elements and writes the price, unit, and area into a dictionary info.
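Again with made-up HTML (the real page structure on the site may differ), this is what those find calls do:

from bs4 import BeautifulSoup

html = '<span class="total">5000</span><span class="unit"> yuan/month </span>'
soup = BeautifulSoup(html, 'lxml')
price = soup.find('span', class_='total').text         # '5000'
unit = soup.find('span', class_='unit').text.strip()   # 'yuan/month'
info = {'price': price, 'unit': unit, 'area': 'test'}  # area hard-coded, as in the article
print(info)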
insert is called to write the info from each link into the house table in the database. Looking into the insert function, we can see that it builds an SQL statement, executes it through the database cursor(), and then commits the transaction on the connection. The SQL statement is written in a slightly unusual way: the format function is used to assemble it, which makes the function easy to reuse.
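You can see how the statement gets assembled by running the formatting part on its own; the values here are made up:

values = "'{}'," * 2 + "'{}'"                      # produces "'{}','{}','{}'"
sql_values = values.format('5000', 'yuan/month', 'test')
sql = "insert into house(price,unit,area) values({})".format(sql_values)
print(sql)   # insert into house(price,unit,area) values('5000','yuan/month','test')

Note that building SQL with string formatting does not escape the values; pymysql also supports the parameterized form cursor.execute("insert into house(price,unit,area) values(%s,%s,%s)", (price, unit, area)), which is the safer choice when the scraped text cannot be trusted.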
Finally, if you run the crawler code, you will see that the information of all the listings on the Lianjia home page has been written into the database. (Note: test is a test string I filled in manually for the area field.)
Postscript: Python crawlers really are not difficult. Once you are familiar with the overall flow, what remains are details that need attention, such as how to locate page elements and how to build SQL statements. Don't panic when you hit a problem; follow the IDE's hints to eliminate the bugs one by one, and you will eventually get the results you want.
Author: QAtest · Source: Zhihu