WeChat official account: Operation and Maintenance Development Story; author: Su Xin
Here we explore the use of multithreading in a crawler through a concrete example, so we won't dwell on theory. For the details of concurrency, follow the link to the concurrency article.
Crawl an app store
Of course, before crawling, check for yourself whether the site's robots agreement allows it, and do not crawl data that it disallows.
To view the robots protocol, just append /robots.txt to the domain name.
For example: http://app.mi.com/robots.txt
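For a programmatic check, Python's built-in urllib.robotparser can read that file (a minimal sketch; the path checked is the category page we crawl below):

```python
from urllib import robotparser

# Parse the site's robots.txt and ask whether our crawler may fetch the target page
rp = robotparser.RobotFileParser()
rp.set_url("http://app.mi.com/robots.txt")
rp.read()

print(rp.can_fetch("*", "http://app.mi.com/category/15"))  # True means crawling is allowed
```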
1. Target
- URL: http://app.mi.com/category/15
- Get the name, introduction, and download link of every APP in the "Games" category
2. Analysis
2.1 Page properties
First, we need to determine whether the page is loaded dynamically.
When the query parameter page is 1, the site shows the second page. Let's write a small crawler to test it:
```python
import requests

url = "http://app.mi.com/category/15"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"}

html = requests.get(url=url, headers=headers).content.decode("utf-8")
print(html)
```
Searching the output HTML for "King of Glory" finds it without any problem. What about the second page? Change url = "http://app.mi.com/category/15" in the code above to url = "http://app.mi.com/category/15#page=1".
Searching again, this time for "Hearthstone", finds nothing, so the site is probably loaded dynamically.
- Packet-capture analysis

Open Chrome's built-in developer tools, switch to the Network tab, and click the next-page button on the site.
You can see that the GET request carries many query parameters.
After several tests we find:
- page is the page number, but it is zero-based: its value is the real page number minus 1
- categoryId indicates the application category
- pageSize is not obvious at first, so open the captured URL in the browser and take a look

It turns out pageSize is the number of APP entries returned per page, and the response is a JSON string.
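As a quick sanity check, the captured interface can be requested directly (a minimal sketch; the parameter values are the ones observed above, and the fields are those shown in the JSON below):

```python
import requests

# List API observed in the packet capture: page is zero-based, categoryId 15 is "Games"
api = "http://app.mi.com/categotyAllListApi?page=0&categoryId=15&pageSize=30"
headers = {"User-Agent": "Mozilla/5.0"}

data = requests.get(api, headers=headers).json()
print(data["count"])                    # total number of apps in the category
print(len(data["data"]))                # 30 entries per page
print(data["data"][0]["displayName"])   # name of the first app
```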
2.2 Analyzing the JSON
Copy part of the JSON:

```json
{
    "count": 2000,
    "data": [
        {
            "appId": 108048,
            "displayName": "King of Glory",
            "icon": "http://file.market.xiaomi.com/thumbnail/PNG/l62/AppStore/0eb7aa415046f4cb838cfe5b5d402a5efc34fbb25",
            "level1CategoryName": "Online-game RPG",
            "packageName": "com.tencent.tmgp.sgame"
        },
        {},
        ...
    ]
}
```
We don't yet know what every field is for, so just keep them in mind for now.
2.3 Second-level (detail) page
Click "King of Glory" to jump to the APP detail page and see what the URL looks like:
http://app.mi.com/details?id=com.tencent.tmgp.sgame
As it turns out, the query parameter id is exactly the packageName value above, so the detail-page URL can be built by simple concatenation.
2.4 Obtaining Information
- APP name

```html
<div class="intro-titles">
    <p>Shenzhen Tencent Computer Systems Co., Ltd.</p>
    <h3>King of Glory</h3>
    ...
</div>
```
- APP introduction

```html
<p class="pslide">King of Glory is Tencent's first 5V5 team-based fair-competition mobile game, the national MOBA mobile game masterpiece! 5V5 King Canyon, fair fights, restoring the classic MOBA experience; contract battles, five armies, border breakouts and more bring fancy combat fun! 10-second real-time cross-region matching, team up with friends, and charge toward the strongest King! Many heroes to choose from; first blood, penta kill, godlike - crush with strength and reap the rewards! The enemy is about to arrive at the battlefield. Summon your friends and prepare for battle in Honor of Kings!</p>
<h3 class="special-h3">New features</h3>
<p class="pslide">
    1. New hero - Ma Chao: the last hero of the Five Tiger Generals, whose "throw - pick" enhanced attacks let him move through the complex battlefield.<br />
    2. New gameplay - King Simulation Battle (coming soon): recruit heroes, arrange troops, and compete with seven other players in the sandbox!<br />
    3. New system - Vientiane: integrates all the previous entertainment-mode and adventure-journey gameplay. In the future, high-quality original gameplay created by users with the editor may also join Vientiane;<br />
    4. New feature - exclusive certification for professional players: more than 100 KPL professional players have official certification in the game;<br />
    5. New feature - "don't want to be on the same team": players below 50-star King rank can set this in the settlement screen;<br />
    6. New feature - system AI hosting: players can choose AI hosting after going AFK, but the AI will not carry the match;<br />
    7. New skin: Shen Mengxi - Shark Cannon Pirate Cat.
</p>
```
- APP download address

```html
<div class="app-info-down">
    <a href="/download/108048?id=com.tencent.tmgp.sgame&ref=appstore.mobile_download&nonce=4803361670017098198%3A26139170&appClientId=2882303761517485445&appSignature=66myrvedlh4rcytlpgkjcibdw_xgomyme9g39hf4f2g" class="download">Download directly</a>
</div>
```
2.5 Choosing the parsing technique
Based on the above analysis, extracting the data with lxml is a good choice. For XPath usage, follow the linked article.
The XPath expressions are as follows:
- Name: //div[@class="intro-titles"]/h3/text()
- Introduction: //p[@class="pslide"][1]/text()
- Download link: //a[@class="download"]/@href
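Before writing the full crawler, these expressions can be verified against a single detail page (a minimal sketch reusing the King of Glory URL from above):

```python
import requests
from lxml import etree

url = "http://app.mi.com/details?id=com.tencent.tmgp.sgame"
headers = {"User-Agent": "Mozilla/5.0"}

html = requests.get(url, headers=headers).content.decode("utf-8")
parse_html = etree.HTML(html)

# The three XPath expressions listed above
name = parse_html.xpath('//div[@class="intro-titles"]/h3/text()')[0].strip()
info = parse_html.xpath('//p[@class="pslide"][1]/text()')[0].strip()
link = "http://app.mi.com" + parse_html.xpath('//a[@class="download"]/@href')[0].strip()
print(name, link)
```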
3. Code implementation
```python
import requests
from lxml import etree


class MiSpider(object):
    def __init__(self):
        # URL of page 1 (page is zero-based)
        self.base_url = "http://app.mi.com/categotyAllListApi?page={}&categoryId=15&pageSize=30"
        self.headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)"}

    # Request a page
    def get_page(self, url):
        response = requests.get(url=url, headers=self.headers)
        return response

    # Parse one list page
    def parse_page(self, url):
        html = self.get_page(url).json()
        # two_url_list: [{"appId": 108048, "displayName": "..", ...}, {}, ...]
        two_url_list = html["data"]
        for two_url in two_url_list:
            # Build the detail-page link of each app
            two_url = "http://app.mi.com/details?id={}".format(two_url["packageName"])
            self.parse_info(two_url)

    # Parse the detail page
    def parse_info(self, two_url):
        html = self.get_page(two_url).content.decode("utf-8")
        parse_html = etree.HTML(html)
        # Extract the target information
        app_name = parse_html.xpath('//div[@class="intro-titles"]/h3/text()')[0].strip()
        app_info = parse_html.xpath('//p[@class="pslide"][1]/text()')[0].strip()
        app_url = "http://app.mi.com" + parse_html.xpath('//a[@class="download"]/@href')[0].strip()
        print(app_name, app_url, app_info)

    def main(self):
        for page in range(67):
            url = self.base_url.format(page)
            self.parse_page(url)


if __name__ == "__main__":
    spider = MiSpider()
    spider.main()
```
Next, the data is stored in a variety of ways: CSV, MySQL, and MongoDB
Data storage
Here we use MySQL to store the data.
Table-creation SQL:

```sql
/*
 Navicat MySQL Data Transfer

 Source Server         : xxx
 Source Server Type    : MySQL
 Source Server Version : 50727
 Source Host           : MySQL_ip:3306
 Source Schema         : MIAPP

 Target Server Type    : MySQL
 Target Server Version : 50727
 File Encoding         : 65001

 Date: 13/09/2019 14:33:38
*/

CREATE DATABASE MiApp CHARSET=UTF8;

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for app
-- ----------------------------
DROP TABLE IF EXISTS `app`;
CREATE TABLE `app` (
  `name` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'APP name',
  `url` text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'APP download link',
  `info` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL COMMENT 'APP introduction'
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;
```
1. pymysql
A brief introduction to pymysql. It is a third-party module that needs to be installed with pip; the installation steps are not repeated here.
1.1 Built-in Methods
pymysql module method:

- connect(): connect to the database with the connection info (host, port, user, password, charset)

Connection / cursor object methods:

- db.cursor(): create a cursor used to operate on the database
- cursor.execute(sql): execute an SQL statement
- db.commit(): commit the transaction
- cursor.close(): close the cursor
- db.close(): close the connection
1.2 Precautions
Whenever a data-modification operation is involved, the transaction must be committed to the database.
When querying the database, use the fetch methods (fetchone() / fetchall()) to obtain the results.
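A minimal usage sketch tying these methods together (the host, credentials, and sample row are placeholders; the app table is the one created above):

```python
import pymysql

# Connect with placeholder credentials
db = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                     password="123456", database="MIAPP", charset="utf8mb4")
cursor = db.cursor()

# Any modification must be committed
cursor.execute("insert into app(name,url,info) values (%s,%s,%s)",
               ["demo", "http://example.com", "demo info"])
db.commit()

# Queries are read back with the fetch methods
cursor.execute("select name, url from app")
print(cursor.fetchall())

cursor.close()
db.close()
```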
1.3 More details
For more details, see the pymysql documentation.
2. Storage
Create a configuration file (config.py):

```python
# Database connection info
HOST = "xxx.xxx.xxx.xxx"
PORT = 3306
USER = "xxxxx"
PASSWORD = "xxxxxxx"
DB = "MIAPP"
CHARSET = "utf8mb4"
```
Table structure
```
mysql> desc MIAPP.app;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| name  | varchar(20)  | YES  |     | NULL    |       |
| url   | varchar(255) | YES  |     | NULL    |       |
| info  | text         | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
```
The SQL statement:

```sql
insert into app values(%s, %s, %s);
```
The complete code
```python
import requests
from lxml import etree
import pymysql
from config import *


class MiSpider(object):
    def __init__(self):
        # URL of page 1 (page is zero-based)
        self.base_url = "http://app.mi.com/categotyAllListApi?page={}&categoryId=15&pageSize=30"
        self.headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)"}
        # Database connection and cursor
        self.db = pymysql.connect(host=HOST, port=PORT, user=USER, password=PASSWORD,
                                  database=DB, charset=CHARSET)
        self.cursor = self.db.cursor()
        # Counter of apps written
        self.i = 0

    # Request a page
    def get_page(self, url):
        response = requests.get(url=url, headers=self.headers)
        return response

    # Parse one list page
    def parse_page(self, url):
        html = self.get_page(url).json()
        # two_url_list: [{"appId": 108048, "displayName": "..", ...}, {}, ...]
        two_url_list = html["data"]
        for two_url in two_url_list:
            # Build the detail-page link of each app
            two_url = "http://app.mi.com/details?id={}".format(two_url["packageName"])
            self.parse_info(two_url)

    # Parse the detail page and write the record to MySQL
    def parse_info(self, two_url):
        html = self.get_page(two_url).content.decode("utf-8")
        parse_html = etree.HTML(html)
        # Extract the target information
        app_name = parse_html.xpath('//div[@class="intro-titles"]/h3/text()')[0].strip()
        app_info = parse_html.xpath('//p[@class="pslide"][1]/text()')[0].strip()
        app_url = "http://app.mi.com" + parse_html.xpath('//a[@class="download"]/@href')[0].strip()

        ins = "insert into app(name,url,info) values (%s,%s,%s)"
        self.cursor.execute(ins, [app_name, app_url, app_info])
        # Every modification must be committed
        self.db.commit()
        self.i += 1
        print("No.{} written: {}".format(self.i, app_name))

    def main(self):
        for page in range(67):
            url = self.base_url.format(page)
            self.parse_page(url)
        self.cursor.close()
        self.db.close()
        print("A total of {} apps were written successfully".format(self.i))


if __name__ == "__main__":
    spider = MiSpider()
    spider.main()
```
Multithreading
Crawling the information this way feels slow: with a lot of data it is time-consuming, and the computer's resources are under-utilized.
This is where multithreading comes in. Explanations of processes and threads are all over the web, so just to be brief:
A process can contain many threads; when the process dies, its threads cease to exist.
Take a train, for example: if you think of the train as a process, then each carriage is a thread, and it is these threads that together make up the process.
Python has the concept of multithreading
Suppose we now have two operations:
```python
n += 1
n -= 1
```
Inside Python, they are actually executed like this:

```python
# n += 1
x = n
x = n + 1
n = x

# n -= 1
x = n
x = n - 1
n = x
```
One property of threads is that they compete for the computer's resources. If one thread has just computed x = n when another thread runs n = x, things get mixed up: adding 1 a thousand times and then subtracting 1 a thousand times no longer guarantees the original value. That is why thread locks exist.
While running, threads scramble over shared data. If thread A is operating on a piece of data and thread B also needs to operate on it, the data may become inconsistent and affect the whole program.
Python therefore has a mechanism that, while one thread runs, locks the entire interpreter so that no other thread can access any resources. This lock is called the GIL (global interpreter lock). Because of it, nominal multithreading effectively becomes single-threaded, which is why many people call the GIL one of Python's weak points.
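A small sketch of the race condition described above and how threading.Lock prevents it (the loop counts are only illustrative; add_unsafe is shown for contrast and not run):

```python
import threading

n = 0
lock = threading.Lock()

def add_unsafe():
    # Unprotected read-modify-write: updates from different threads can interleave and get lost
    global n
    for _ in range(100000):
        n += 1

def add_safe():
    global n
    for _ in range(100000):
        with lock:          # only one thread at a time performs the update
            n += 1

threads = [threading.Thread(target=add_safe) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(n)   # with the lock this always prints 400000
```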
Because of this limitation, many standard-library and third-party modules were built around it, which makes genuinely improving multithreading in Python particularly difficult. In practical development, I currently use four ways to work around the problem:

- Replace Thread with multiprocessing (multiple processes)
- Replace CPython with Jython
- Add a synchronization lock: threading.Lock()
- Use a message queue: queue.Queue()
For a comprehensive understanding of concurrency, follow the link to the concurrent-programming article; here is just a brief introduction to usage.
1. Queue methods

```python
from queue import Queue

q = Queue()      # create a queue
q.put(url)       # put an item into the queue
q.get()          # take an item; blocks when the queue is empty
q.empty()        # check whether the queue is empty, returns True/False
```
2. Thread methods

```python
from threading import Thread

t = Thread(target=function_name)   # create a thread object
t.start()                          # start the thread

# create and start several threads
for i in range(5):
    t = Thread(target=function_name)
    t.start()
```
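Putting Queue and Thread together, a tiny sketch of the worker pattern used below (the page range and thread count are arbitrary):

```python
from threading import Thread
from queue import Queue

q = Queue()
for page in range(5):
    q.put("http://app.mi.com/categotyAllListApi?page={}&categoryId=15&pageSize=30".format(page))

def worker():
    # Each worker keeps taking URLs until the queue is empty
    while not q.empty():
        url = q.get()
        print("handling", url)   # a real crawler would request and parse the page here

threads = [Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```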
3. Rewriting the crawler
With the above understood, we can rewrite the original code with multithreading. Before rewriting, add timing code so we can compare how long each version takes.
Multithreading technology selection:
- The crawler involves many IO operations, so hastily switching to multiple processes would waste computer resources. Rejected.
- Switching to Jython is simply unnecessary. Rejected.
- Locking would work, but IO is still slow because file access has to be locked. Rejected.
- Using a message queue can effectively improve the crawler's speed. Adopted.
Thread pool design:
- Since there are 67 pages and 2,010 apps to crawl, consider putting the page URLs into a queue first:

```python
def url_in(self):
    for page in range(67):
        url = self.url.format(page)
        self.url_queue.put(url)
```
Below is the complete code
```python
import requests
from lxml import etree
import time
from threading import Thread
from queue import Queue
import json
import pymysql
from config import *


class MiSpider(object):
    def __init__(self):
        self.url = "http://app.mi.com/categotyAllListApi?page={}&categoryId=15&pageSize=30"
        self.headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"}
        # Queue of list-page URLs shared by the worker threads
        self.url_queue = Queue()
        # All collected app info: [(name, url, info), (), ...]
        self.app_list = []

    # Put the URL of every list page into the queue
    def url_in(self):
        for page in range(67):
            url = self.url.format(page)
            self.url_queue.put(url)

    # Worker thread: take URLs from the queue and parse them
    def get_data(self):
        while True:
            # empty() returns True when the queue is empty -> the worker exits
            if self.url_queue.empty():
                break
            url = self.url_queue.get()
            html = requests.get(url=url, headers=self.headers).content.decode("utf-8")
            html = json.loads(html)
            for app in html["data"]:
                # Detail-page link of each app
                app_link = "http://app.mi.com/details?id=" + app["packageName"]
                self.app_list.append(self.parse_two_page(app_link))

    # Parse one detail page
    def parse_two_page(self, app_link):
        html = requests.get(url=app_link, headers=self.headers).content.decode("utf-8")
        parse_html = etree.HTML(html)
        app_name = parse_html.xpath('//div[@class="intro-titles"]/h3/text()')[0].strip()
        app_url = "http://app.mi.com" + parse_html.xpath('//div[@class="app-info-down"]/a/@href')[0].strip()
        app_info = parse_html.xpath('//p[@class="pslide"][1]/text()')[0].strip()
        info = (app_name, app_url, app_info)
        print(app_name)
        return info

    def main(self):
        # Fill the queue, then start the worker threads
        self.url_in()
        t_list = []
        for i in range(67):
            t = Thread(target=self.get_data)
            t_list.append(t)
            t.start()
        for t in t_list:
            t.join()

        # Write everything collected by the threads into MySQL in one go
        db = pymysql.connect(host=HOST, user=USER, password=PASSWORD, database=DB, charset=CHARSET)
        cursor = db.cursor()
        ins = "insert into app values (%s, %s, %s)"
        print("Writing to the database")
        cursor.executemany(ins, self.app_list)
        db.commit()
        cursor.close()
        db.close()


if __name__ == "__main__":
    start = time.time()
    spider = MiSpider()
    spider.main()
    end = time.time()
    print("Elapsed time: %.2f s" % (end - start))
```
Of course, the idea here only puts the URLs into a queue; you could also move parsing and saving into the worker threads to further improve the program's efficiency.
For more crawler techniques, click the link to visit.