Preface

I haven't updated in a long time; I wonder if any of you readers were getting worried. There was no way around it: the end of the year arrived, and I had to start preparing the annual work report. To this day I still don't know what that document is really for, but that's how it is, every company wants one. The part that gives me the biggest headache is the "future plans" section. If I could predict the future, I'd have made a fortune buying lottery tickets instead of sitting here writing plans. But when the boss asks, there's no way out, so I've been combing the internet for all sorts of material and data to write that future plan. Anyway, it's the end of the year and there isn't much going on, and since I was idle anyway, I picked up a few part-time gigs to earn some money for red envelopes for the juniors back home. I wonder if you've ever run into the same dilemma.

I went looking for material online again, and one way or another everything I've been finding lately, even the part-time gigs, is related to crawlers. So today I'll give you a brief introduction to crawlers; if you have nothing better to do, you can practice along for fun.

Public account: Java Architects Association

The development environment

Development tool: PyCharm

Development environment: Python 3.8.5

How it works

In fact, the simplest way to understand a crawler is this: we want to save some part of the content we view on a web page to our local machine. Why put it that way?

For example, as everyone knows, when we browse a website, the browser is actually requesting the corresponding data from a backend. The backend server receives that request and returns the data to the front end, which becomes the interface you can actually see. If that sounds convoluted, the flow below spells it out.


This process is colloquially known as:

The browser submits the request -> downloads the web page code -> parses/renders the page

What a crawler does is actually very simple and easy to understand: since viewing information in a browser is just one round of communication between you and a server, can't we simulate the browser's behavior ourselves and then save the returned data locally?

The process is as follows:

Simulate a browser to send a request -> download the web code -> extract only useful data -> store it in a database or file

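To make these four stages concrete, here is a minimal sketch of the whole flow; the requests library matches the full code at the end of this article, and example.com is only a stand-in URL for illustration:

```python
import re
import requests

# 1. simulate a browser and send the request (example.com is only a stand-in URL)
response = requests.get('https://example.com', headers={'user-agent': 'Mozilla/5.0'})

# 2. the downloaded web page source
html = response.text

# 3. extract only the useful data, e.g. the page title
titles = re.findall(r'<title>(.*?)</title>', html)

# 4. store it in a local file
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))
```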

The implementation process

Now that we've sorted out the working principle of a crawler, the rest is simple: let's look at the implementation process in code. Writing the code is where the real technical content is.

Step 1: Find the request URL for the data

Since we're simulating browser behavior, we should at least know the request address for the data, that is, the URL. The source code I'm sharing today crawls Bilibili (B station) danmaku (bullet-screen comments), so I'll use that as the example.

So how do we find that URL? Well, as we said earlier, we're going to simulate a request to the website, so what we're looking for is the Request URL.

Note: what is a Request?

We use an HTTP library to send a Request to the target site. A Request contains the request headers, the request body, and so on.
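With the requests library, those pieces of a Request map directly onto the arguments of the call; a small sketch with placeholder values:

```python
import requests

response = requests.get(
    'https://example.com/api',              # request line: method + URL (placeholder)
    headers={'user-agent': 'Mozilla/5.0'},  # request headers
    params={'page': 1},                     # query string appended to the URL
)
# a request that carries a body would use requests.post(url, data=...) or json=...
```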

Press F12 to open developer mode, then click the danmaku list on the page. After you select a date to view, an entry whose name starts with history appears in the panel below; click it and you will find the Request URL field.
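The Request URL found this way carries its arguments in the query string. requests can assemble the same URL from a params dict, which is easier to read; the type, oid, and date values below are the ones from the example URL used later in this article:

```python
import requests

# the same address as the Request URL found in developer mode:
# https://api.bilibili.com/x/v2/dm/history?type=1&oid=260575715&date=2020-12-20
url = 'https://api.bilibili.com/x/v2/dm/history'
params = {'type': 1, 'oid': 260575715, 'date': '2020-12-20'}

response = requests.get(url, params=params)
print(response.url)  # prints the fully assembled request URL
```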

Step 2: Simulate the user's request

This mainly involves two things: user simulation and login simulation.

User simulation

Because we're simulating a browser making a request to the server, it's a bit like a stranger knocking on your door: would you open it? No. But if we pose as a friend or relative, you will open the door. In a request, the user-agent is how we identify ourselves.

Note:

User Agent (UA for short) is a special string header that lets the server identify the operating system and version, CPU type, browser and version, browser rendering engine, browser language, and browser plug-ins used by the client.
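In practice, identifying ourselves just means putting a browser-like UA string into the request headers; a sketch using the same Chrome UA string as the full code at the end of the article:

```python
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/87.0.4280.88 Safari/537.36'
}
response = requests.get('https://api.bilibili.com/x/v2/dm/history', headers=headers)
print(response.status_code)
```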

Login simulation

With that problem solved, the next one appears: we checked this danmaku history through a logged-in account, and when you are not logged in you can't see the historical danmaku data at all. So how do we simulate the login behavior?

This is where another browser behavior comes in. When we log in to a website and visit it again later, we usually don't need to log in a second time. Why? Because when you log in, the browser generates a cookie locally. The cookie is associated with that site: it stores information about your session and is handed back to the server the next time you visit, which is what makes direct access possible.

In other words, as long as we have this cookie, we can access the backend server without any login step at all.
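Concretely, you can copy the cookie string from your own logged-in browser session (developer mode shows it on the same request) and send it along with the other headers; a sketch, with the cookie value reduced to placeholders you must replace with your own:

```python
import requests

headers = {
    'user-agent': 'Mozilla/5.0',
    # paste the cookie string from your own logged-in browser session here;
    # the value below is only a placeholder
    'cookie': 'SESSDATA=...; bili_jct=...',
}
url = 'https://api.bilibili.com/x/v2/dm/history?type=1&oid=260575715&date=2020-12-20'
response = requests.get(url, headers=headers)
```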

Step 3: Request data

After going through the steps above, we have successfully sent a request to the server. Print the response object, and a status code of 200 means we've successfully connected to the server.

Response status codes:

200: success
301: redirect
404: file does not exist
403: no permission (forbidden)
502: server error
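You can check that code programmatically before going any further; a minimal sketch:

```python
import requests

response = requests.get('https://api.bilibili.com/x/v2/dm/history?type=1&oid=260575715&date=2020-12-20')
print(response)  # e.g. <Response [200]>

if response.status_code == 200:
    print('successfully connected to the server')
else:
    print('request failed with status', response.status_code)
```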

Next, we pass in our user-agent and cookie so the server knows who we are and what data we need, and we fetch the data from the request URL.

The result is the raw danmaku data returned by the server.

Step 4: Parse the data

In step 3 we retrieved the corresponding data, but we only need part of it, so we parse the response and keep just the data we need.
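Each danmaku comment comes back wrapped in a <d> tag, so a regular expression can keep just the text between the tags; a sketch using a made-up comment in the same format as the real response:

```python
import re

# a sample <d> tag in the same format as the real danmaku response
xml = '<d p="25.25400,1,25,16777215,1608370925,0,aded156e,42587259146338307">some comment</d>'

# keep only the text between the opening <d ...> tag and the closing </d> tag
data = re.findall(r'<d p=".*?">(.*?)</d>', xml)
print(data)  # ['some comment']
```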

Step 5: Data storage

The parsed data is saved locally. At this point, the data crawl is complete
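A sketch of this step with pandas, matching the approach used in the full code below; the list and output path are just examples:

```python
import pandas as pd

data = ['comment one', 'comment two']  # the parsed danmaku list from step 4

df = pd.DataFrame(data)                     # one comment per row
df.to_csv('testcsv.csv', encoding='utf-8')  # write it to a local CSV file
```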

And with that, this data-crawling walkthrough is over. I hope you'll try it out yourself when you have some spare time. My code is below; feel free to use it as a reference.

Code implementation

To take care of readers who are new to this, I've written the code out in some detail, but I won't expand on every small knowledge point; after all, I still have that painfully helpless year-end summary to write.

# Crawl Bilibili danmaku
# What is a module: a module is a collection of code that implements some function,
# usually one or more functions in a single .py file
import requests

# Ways to request a path: urllib (ships with Python) or requests (a third-party module)
url = 'https://api.bilibili.com/x/v2/dm/history?type=1&oid=260575715&date=2020-12-20'

# A dictionary is a python data type whose elements are unordered and whose keys are
# unique; it is created as {key: value}
getdata = requests.get(url)
print(getdata)  # without headers, this only prints the bare response object

headers = {
    # user-agent: if you pretend to be someone the server knows, it will open the door for you
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    # cookie: we enter the username and password when we log in, and afterwards we don't
    # need to log in again, because the account session is cached in the cookie
    'cookie': "_uuid=326F06D1-FE8D-9191-42CE-DD309D14353C67633infoc; buvid3=33D869DB-6F2F-4BB0-B607-1B7B34F07CFD53925infoc; sid=4lmxv7lu; rpdid=|(u)Y|l|uJRk0J'ulmJY|kRR|; dy_spec_agreed=1; LIVE_BUVID=AUTO2815973097085458; blackside_state=1; CURRENT_FNVAL=80; bp_video_offset_26390853=467157764523510086; bp_t_offset_26390853=467157764523510086; fingerprint=073aaf7f9d22ae55cfafd954c7f31b26; buivd_fp=33D869DB-6F2F-4BB0-B607-1B7B34F07CFD53925infoc; buvid_fp_plain=BCE2280A-DF5C-4872-98E2-4002159A716F143082infoc; PVID=3; bfe_id=fdfaf33a01b88dd4692ca80f00c2de7f; buvid_fp=33D869DB-6F2F-4BB0-B607-1B7B34F07CFD53925infoc; DedeUserID=26390853; DedeUserID__ckMd5=8d24b1d50476c5e5; SESSDATA=c6386003%2C1624877887%2Ca501d*c1; bili_jct=704cf795ee7a134f74dd244b80c5107d"
}

resq = requests.get(url, headers=headers)
resq.encoding = 'utf-8'
# each comment comes back wrapped in a <d> tag, for example:
# <d p="25.25400,1,25,16777215,1608370925,0,aded156e,42587259146338307">don't ah, my happy</d>
print(resq.text)

# parse the response with a regular expression, keeping only the comment text
import re
data = re.findall(r'<d p=".*?">(.*?)</d>', resq.text)
print(data)

# save the parsed data to a local CSV file
import pandas as pd
test = pd.DataFrame(data)  # wrap the list in a DataFrame so it can be written out
test.to_csv('e:/testcsv.csv', encoding='utf-8')

If you're new to Python, you can print the data at every step to see what each call returns.
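For example, a few checkpoints you might print while following along; this snippet assumes the variable names (getdata, resq, data) from the code above:

```python
print(getdata)           # the bare response object, e.g. <Response [200]>
print(resq.status_code)  # confirm the request with headers also succeeded
print(resq.text[:200])   # the first part of the raw danmaku response
print(data[:5])          # the first few parsed comments
```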

Finally, I'm sharing my code cloud (Gitee) address; besides the materials, it also holds the learning code I share, which you can download.