The text and images in this article come from the internet and are for learning and exchange only; they serve no commercial purpose. If you have any questions, please contact us.

The following article comes from Blue Lamp Programming; author: Breeze.

Python GUI programming: building an HD online movie-viewing platform, with free movies from across the whole web

https://www.bilibili.com/video/BV1tz4y1o7Yc/

To give you a taste of what downloading 2,000 records per second feels like.

Basic development environment

  • Python 3.6
  • PyCharm

Use of related modules

import csv
import time
import requests
import concurrent.futures

Target Page Analysis

There are 214 pages of data, 20 records per page, for a total of 4,280.

Open the browser's developer tools and click to the second page; the data request appears under XHR.

These are the request parameters. pn corresponds to the page number; the second page is selected, so pn: 2.

If you look carefully, you can see that the returned data is not valid JSON: it is wrapped in a jQuery callback.
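For illustration, the raw response has roughly this shape (the payload is abbreviated here, not real data):

jQuery1124036392017581464287_1608882113715({"data": {"diff": [...]}, ...});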

To extract anything from it, the response must first be converted to JSON. There are two ways to do this:

Method 1:

Take the cb: jQuery1124036392017581464287_1608882113715 parameter out of the request. Without the callback parameter, the server returns plain JSON, and the response can be parsed directly with response.json().

import requests

url = 'http://49.push2.eastmoney.com/api/qt/clist/get'
params = {
    # 'cb': 'jQuery1124036392017581464287_1608882113715',
    'pn': '2',
    'pz': '20',
    'po': '1',
    'np': '1',
    'ut': 'bd1d9ddb04089700cf9c27f6f7426281',
    'fltt': '2',
    'invt': '2',
    'fid': 'f3',
    'fs': 'm:0 t:6,m:0 t:13,m:0 t:80,m:1 t:2,m:1 t:23',
    'fields': 'f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152',
    '_': '1608882114076',  # timestamp parameter (value taken from the URL used later in this article)
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
html_data = response.json()
stock_data = html_data['data']['diff']

Method 2:

1. Request the page with the normal parameters (cb included) and take the raw text from response.text.

2. Use the regular expression jQuery1124036392017581464287_1608882113715\((.*?)\) to match the data in between.

3. Import the json module and deserialize the matched string with json.loads.

import pprint
import re
import requests
import json

url = 'http://49.push2.eastmoney.com/api/qt/clist/get'
params = {
    'cb': 'jQuery1124036392017581464287_1608882113715',
    'pn': '2',
    'pz': '20',
    'po': '1',
    'np': '1',
    'ut': 'bd1d9ddb04089700cf9c27f6f7426281',
    'fltt': '2',
    'invt': '2',
    'fid': 'f3',
    'fs': 'm:0 t:6,m:0 t:13,m:0 t:80,m:1 t:2,m:1 t:23',
    'fields': 'f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152',
    '_': '1608882114076',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
result = re.findall(r'jQuery1124036392017581464287_1608882113715\((.*?)\);', response.text)[0]
html_data = json.loads(result)
stock_data = html_data['data']['diff']
pprint.pprint(stock_data)

For this site, both methods work, but the second is generally recommended: the first is a shortcut that only works because this particular server tolerates a missing callback parameter.

The diff key holds a list of records. Loop over it, and pull each stock's fields out of the key-value pairs.

for i in stock_data:
    dit = {
        'code': i['f12'],
        'name': i['f14'],
        'latest price': i['f2'],
        'change percent': str(i['f3']) + '%',
        'change amount': i['f4'],
        'volume (lots)': i['f5'],
        'turnover': i['f6'],
        'amplitude': str(i['f7']) + '%',
        'highest': i['f15'],
        'lowest': i['f16'],
        'open today': i['f17'],
        'previous close': i['f18'],
        'volume ratio': i['f10'],
        'turnover rate': str(i['f8']) + '%',
        'P/E ratio (dynamic)': i['f9'],
        'price-to-book': i['f23'],
    }

Save the data with the csv module, as sketched below.
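The article doesn't show the saving step, so here is a minimal sketch, assuming the dit dictionary built above and a file name of my choosing (stocks.csv):

import csv

FIELDNAMES = ['code', 'name', 'latest price', 'change percent', 'change amount',
              'volume (lots)', 'turnover', 'amplitude', 'highest', 'lowest',
              'open today', 'previous close', 'volume ratio', 'turnover rate',
              'P/E ratio (dynamic)', 'price-to-book']

# Append mode lets multiple pages keep adding rows; utf-8-sig keeps Excel happy,
# and newline='' avoids blank lines between rows on Windows.
f = open('stocks.csv', mode='a', encoding='utf-8-sig', newline='')
csv_writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
csv_writer.writeheader()

def save(dit):
    # Write one stock record (the dict built in the loop above) as a CSV row.
    csv_writer.writerow(dit)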

How fast is crawling using multiple threads?
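The timing code below submits a main(url) task for every page, but the article never shows main itself. Here is a plausible sketch under the assumption that it simply combines the pieces above: fetch one page, build the dictionary for each stock, and save it (save is the hypothetical CSV helper from the previous sketch):

def main(url):
    # Fetch one page of the list API and save every stock record on it.
    response = requests.get(url=url, headers=headers)
    for i in response.json()['data']['diff']:
        dit = {'code': i['f12'], 'name': i['f14'], 'latest price': i['f2']}
        # ...plus the remaining fields from the mapping shown earlier...
        save(dit)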

Speed with 5 threads:

if __name__ == '__main__':
    start_time = time.time()
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
    for page in range(1, 215):
        url = f'http://49.push2.eastmoney.com/api/qt/clist/get?pn={page}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1608882114076'
        executor.submit(main, url)
    executor.shutdown()

Total time: 3.685572624206543 seconds

Total data: 4,279 records

That works out to 4,279 ÷ 3.69 ≈ 1,161 records scraped per second on average.

Speed with 10 threads:

if __name__ == '__main__':
    start_time = time.time()
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)
    for page in range(1, 215):
        url = f'http://49.push2.eastmoney.com/api/qt/clist/get?pn={page}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1608882114076'
        executor.submit(main, url)
    executor.shutdown()

Total time: 1.7553794384002686 seconds

Total data: 4,279 records

That works out to 4,279 ÷ 1.76 ≈ 2,437 records scraped per second on average.

Speed with 20 threads:

...

I can’t. The computer won’t hold up.