Life is short, I use Python.

The previous portal:

Python Crawler (1): The Beginning

Python Crawler (2): Installing the Basic Libraries

Python Crawler (3): Getting Started with Linux Basics

Python Crawler (4): A Basic Introduction to Docker

Python Crawler (5): Database Basics

Python Crawler (6): Installing the Crawler Frameworks

Python Crawler (7): HTTP Fundamentals

Python Crawler (8): Web Page Basics

Python Crawler (9): Crawler Basics

Python Crawler (10): Sessions and Cookies

Python Crawler (11)–(15): Urllib

Python Crawler (16): Crawling Girl Pictures with Urllib in Practice

Python Crawler (17): Requests Basics

Python Crawler (18): Advanced Requests Operations

Python Crawler (19): XPath Basic Exercises

Python Crawler (20): Advanced XPath

Python Crawler (21)–(22): Beautiful Soup

Python Crawler (23): Introduction to PyQuery, a Parsing Library

Python Crawler (24): 2019 Douban Movie Rankings

Introduction

The hands-on exercise in the last article did not use page element analysis, which I slightly regret, but the final movie list was well worth it; I really recommend taking a look.

This time the code is written to serve the article properly: it definitely uses page element analysis, and it also requires some analysis of how the site loads its data before the final data can be obtained. I have also found two data sources with no IP access restrictions and guaranteed quality, so this is absolutely the best choice for beginners to practice on.

For the record: this article is for study purposes only.

Analysis

To fetch stock data, we must first know which stocks there are. I found a website that carries a list of stock codes: https://hq.gucheng.com/gpdmylb.html.

Open Chrome developer tools and inspect the ticker symbols one by one. I won't post the specific steps here; try to work them out yourself.

We can store all the ticker symbols in one list. All that's left is to find a website and loop through the list to get the data for each stock.

I have already found such a site: Tonghuashun (10jqka), link: http://stockpage.10jqka.com.cn/000001/.

You smart students have probably noticed that the 000001 in this link is the stock code.

All we need to do is splice each stock code into the link, and we can keep getting the data we want, as the small sketch below shows.
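A minimal sketch of the splicing idea (the variable names are my own):

# Splice a stock code onto the base link
base_url = 'http://stockpage.10jqka.com.cn/'
code = '000001'
url = base_url + code + '/'  # -> http://stockpage.10jqka.com.cn/000001/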

Hands-on practice

First of all, let's introduce the request library and the parsing library that we will use in this exercise: Requests and PyQuery. The data will eventually be stored in MySQL.
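If these are not installed yet, all three can typically be installed from PyPI with pip install requests pyquery pymysql (assuming the standard PyPI package names).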

Getting the list of stock codes

The first step is to build the ticker list. Let’s define a method:

def get_stock_list(stockListURL):
    r = requests.get(stockListURL, headers=headers)
    doc = PyQuery(r.text)
    list = []
    # Pull the 6-digit stock code out of every link in the stock table
    for i in doc('.stocktable a').items():
        try:
            href = i.attr.href
            list.append(re.findall(r"\d{6}", href)[0])
        except:
            continue
    list = [item.lower() for item in list]
    return list

Pass the link above in as the parameter and you can run it to see the results. I won't post them here; the list is a little long…
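For reference, a hypothetical call might look like this (it assumes the headers dictionary defined in the complete code further down):

stock_list = get_stock_list('https://hq.gucheng.com/gpdmylb.html')
print(len(stock_list), stock_list[:5])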

Getting the detailed data

The detailed data looks like it is on the page, but in fact it is not retrieved from the page itself. It comes from a data interface:

http://qd.10jqka.com.cn/quote.php?cate=real&type=stock&callback=showStockDate&return=json&code=000001

As for how to find this interface, I won't explain it this time. Students who want to learn crawling should get hands-on and look for it themselves; after a few tries, you will naturally find the way.

Now that we have the data interface, let's take a look at the data it returns:

showStockDate({"info": {"000001": {"name": "\u5e73\u5b89\u94f6\u884c"}}, "data": {"000001": {"10": "16.13", "8": "16.14", "9": "15.87", "13": "78795234.00", "19": "1262802470.00", "7": "16.12", "15": "40225508.00", "14": "37528826.00", "69": "17.73", "70": "14.51", "12": "5", "17": "945400.00", "264648": "0.010", "199112": "0.062", "1968584": "0.406", "2034120": "9.939", "1378761": "16.026", "526792": "1.675", "395720": "948073.000", "461256": "39.763", "3475914": "313014790000.000", "1771976": "1.100", "6": "16.12", "11": ""}}})

Obviously, this result is not standard JSON data, but it is the standard format of data returned by JSONP. Here we first trim the beginning and the end to turn it into standard JSON, match the fields against the data shown on the page, and finally write the parsed values to the database.
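Since "showStockDate(" happens to be 14 characters long, the code below simply slices it off along with the trailing ")". As an aside, the wrapper can also be stripped with a small regular expression so the slice does not depend on the callback name's length; this is a minimal sketch of that idea (the function name jsonp_to_dict is my own):

import json
import re

def jsonp_to_dict(text):
    # Capture whatever sits between "callbackName(" and the final ")"
    match = re.search(r'^\w+\((.*)\)\s*$', text, re.S)
    return json.loads(match.group(1))

Calling jsonp_to_dict(r.text) would then return the same dictionary as the fixed-offset slice used below.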

def getStockInfo(list, stockInfoURL):
    count = 0
    for stock in list:
        try:
            r = requests.get(stockInfoURL + stock, headers=headers)
            # Strip the JSONP wrapper: 14 characters of "showStockDate(" at the front, ")" at the end
            dict1 = json.loads(r.text[14:int(len(r.text)) - 1])
            insert_data = {
                "code": stock,
                "name": dict1['info'][stock]['name'],
                "jinkai": dict1['data'][stock]['7'],
                "chengjiaoliang": dict1['data'][stock]['13'],
                "zhenfu": dict1['data'][stock]['526792'],
                "zuigao": dict1['data'][stock]['8'],
                "chengjiaoe": dict1['data'][stock]['19'],
                "huanshou": dict1['data'][stock]['1968584'],
                "zuidi": dict1['data'][stock]['9'],
                "zuoshou": dict1['data'][stock]['6'],
                "liutongshizhi": dict1['data'][stock]['3475914']
            }
            cursor.execute(sql_insert, insert_data)
            conn.commit()
            print(stock, ': write complete')
        except:
            print('write exception')
            continue

Here we add exception handling, because there is quite a lot of data this time and an exception is likely to be thrown for one reason or another. We certainly don't want the data fetching to be interrupted when an exception occurs, so we add exception handling here and continue fetching the rest.

The complete code

We wrap the code up with a little encapsulation to complete this exercise.

import requests
import re
import json
from pyquery import PyQuery
import pymysql

# Database connection
def connect():
    conn = pymysql.connect(host='localhost',
                           port=3306,
                           user='root',
                           password='password',
                           database='test',
                           charset='utf8mb4')
    # Fetch the cursor
    cursor = conn.cursor()
    return {"conn": conn, "cursor": cursor}

connection = connect()
conn, cursor = connection['conn'], connection['cursor']

sql_insert = "insert into stock(code, name, jinkai, chengjiaoliang, zhenfu, zuigao, chengjiaoe, huanshou, zuidi, zuoshou, liutongshizhi, create_date) values (%(code)s, %(name)s, %(jinkai)s, %(chengjiaoliang)s, %(zhenfu)s, %(zuigao)s, %(chengjiaoe)s, %(huanshou)s, %(zuidi)s, %(zuoshou)s, %(liutongshizhi)s, now())"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

def get_stock_list(stockListURL):
    r = requests.get(stockListURL, headers=headers)
    doc = PyQuery(r.text)
    list = []
    # Pull the 6-digit stock code out of every link in the stock table
    for i in doc('.stocktable a').items():
        try:
            href = i.attr.href
            list.append(re.findall(r"\d{6}", href)[0])
        except:
            continue
    list = [item.lower() for item in list]
    return list

def getStockInfo(list, stockInfoURL):
    count = 0
    for stock in list:
        try:
            r = requests.get(stockInfoURL + stock, headers=headers)
            # Strip the JSONP wrapper to get standard JSON
            dict1 = json.loads(r.text[14:int(len(r.text)) - 1])
            insert_data = {
                "code": stock,
                "name": dict1['info'][stock]['name'],
                "jinkai": dict1['data'][stock]['7'],
                "chengjiaoliang": dict1['data'][stock]['13'],
                "zhenfu": dict1['data'][stock]['526792'],
                "zuigao": dict1['data'][stock]['8'],
                "chengjiaoe": dict1['data'][stock]['19'],
                "huanshou": dict1['data'][stock]['1968584'],
                "zuidi": dict1['data'][stock]['9'],
                "zuoshou": dict1['data'][stock]['6'],
                "liutongshizhi": dict1['data'][stock]['3475914']
            }
            cursor.execute(sql_insert, insert_data)
            conn.commit()
            print(stock, ': write complete')
        except:
            print('write exception')
            continue

def main():
    stock_list_url = 'https://hq.gucheng.com/gpdmylb.html'
    stock_info_url = 'http://qd.10jqka.com.cn/quote.php?cate=real&type=stock&callback=showStockDate&return=json&code='
    list = get_stock_list(stock_list_url)
    # list = ['601766']
    getStockInfo(list, stock_info_url)

if __name__ == '__main__':
    main()
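The code above assumes that a stock table already exists in the test database. The original article does not show its schema, but based on the insert statement, a possible definition might look like the sketch below (the column types are my own assumptions; the interface returns every value as a string):

create_table_sql = """
create table if not exists stock (
    id int auto_increment primary key,
    code varchar(16),
    name varchar(64),
    jinkai varchar(32),
    chengjiaoliang varchar(32),
    zhenfu varchar(32),
    zuigao varchar(32),
    chengjiaoe varchar(32),
    huanshou varchar(32),
    zuidi varchar(32),
    zuoshou varchar(32),
    liutongshizhi varchar(32),
    create_date datetime
)
"""
cursor.execute(create_table_sql)
conn.commit()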

Results

In the end, it took me about 15 minutes to successfully capture 4600+ records. I won't display the results here.

Sample code

All of the code in this series is available in the GitHub and Gitee repositories for easy access.

Sample code - GitHub

Sample code - Gitee