Many students have sent me direct messages asking for Python crawler tutorials, so I decided to explain how to write a Python web crawler from scratch and walk through the common problems in web crawling, such as sites with anti-crawling measures or encryption that keep the crawler from getting data, login authentication, and so on, accompanied by plenty of hands-on examples against real sites.
The main reason we write web crawlers is to crawl the data we want and use crawlers to automate some of the things we want to do on the site.
Here I’ll start with the basics of how to do what you want with a web crawler.
Let’s start with a simple piece of code.
import requests # import requests package
url = 'https://www.cnblogs.com/LexMoon/'
strhtml = requests.get(url)  # get the page data
print(strhtml.text)
We first import the requests package for making network requests, then define a string url pointing at the target web page, and finally use requests to fetch the content of that page.
Here we call requests.get(url). This get is not "get" in the everyday sense of obtaining something; it is one of the HTTP request methods.
There are many request methods; the most common are GET and POST, while others such as PUT and DELETE are rarely seen.
requests.get(url) sends a GET request to the page at url, and a result is returned, which is the response to the request.
Response information is divided into response header and response content.
The response header tells you whether your visit succeeded, what type of data is being returned to you, and so on.
The response content is the source code of the web page you obtained.
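To make that split concrete, here is a minimal sketch (using the same blog URL as above) of how the response object from requests exposes both parts; the variable names are just mine:

import requests

url = 'https://www.cnblogs.com/LexMoon/'  # the same blog page as above
resp = requests.get(url)

print(resp.status_code)              # 200 means the request succeeded
print(resp.headers['Content-Type'])  # what kind of data came back, e.g. text/html
print(resp.text[:200])               # the first 200 characters of the page source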
Okay, so you’re getting started with a Python crawler, but there are still a lot of problems.
1. What is the difference between GET and POST requests?
2. Why do I get pages that don’t contain the data I want?
3. Why is the content I crawl down from some sites different from what I actually see?
What is the difference between GET and POST requests?
The difference between GET and POST mainly lies in where the parameters are placed. For example, when a website requires users to log in, where should the account and password go when we click the login button?
The most intuitive representation of a GET request is that the parameters of the request are placed in the URL.
For example, if you search Python, you’ll find its URL as follows:
https://www.baidu.com/s?wd=Python&rsv_spt=1
Everything after the question mark is the parameters: wd=Python is the keyword we searched for, and multiple parameters are separated by ampersands (&).
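A small sketch of building the same query with requests: passing a params dict appends the parameters to the URL for a GET request (the rsv_spt value is simply copied from the URL above):

import requests

# GET: the parameters are appended to the URL as a query string
resp = requests.get('https://www.baidu.com/s',
                    params={'wd': 'Python', 'rsv_spt': '1'})
print(resp.url)  # the query string is built for us: .../s?wd=Python&rsv_spt=1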
If we need to enter a password for a website that uses a GET request, our personal information is easily exposed, so we need a POST request.
In a POST request, the parameters are placed in the request body.
For example, if I log in to the W3C website, I can see in the developer tools that the Request Method is POST.
At the bottom of the request we can also see the login information we sent, the (encrypted) account and password, which go to the server for verification.
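As a rough sketch of what such a request looks like in code: with requests, POST parameters go into the request body via the data argument. The URL and form field names below are made up for illustration; a real site's fields have to be read from the Network tab:

import requests

# POST: the parameters travel in the request body, not in the URL
login_url = 'https://example.com/login'    # hypothetical login endpoint
payload = {'username': 'my_name',          # hypothetical form field names
           'password': 'my_password'}
resp = requests.post(login_url, data=payload)
print(resp.status_code)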
Why do I crawl pages that don’t contain the data I want?
Sometimes our crawler fetches a site and, when we look at the data inside, we find that we did get the target page, but the data we want is not in it.
For example, a classmate asked me a question a few days ago: when he crawled flight information from Ctrip, he could get everything on the page except the flight information itself.
Web address: https://flights.ctrip.com/itinerary/oneway/cgq-bjs?date=2019-09-14
This is a very common problem. When he calls requests.get, he fetches exactly the URL he passed in; but although the page lives at that URL, the data inside it is loaded from a different address.
That sounds odd, but think about it from the point of view of Ctrip's designers: the list of flight information can be very large, and if it were rendered directly into this page the user would have to wait a long time and might close the page before it finishes loading. So the designers put only the main frame of the page at this URL, letting the user see the page quickly, while the bulk of the flight data is loaded afterwards, so the user doesn't leave because of a long wait.
After all, what we do is for the user experience, so how should we solve this problem?
If you’ve learned about the front end, you probably know about Ajax asynchronous requests, but it’s okay if you don’t, because we’re not talking about the front end here.
We just need to know that when we request https://flights.ctrip.com/itinerary/oneway/cgq-bjs?date=2019-09-14, the page contains a JS script that runs after the page is loaded, and the job of that script is to request the flight information we want to crawl.
At this point we can open the browser console (Google Chrome or Firefox is recommended): press F12 to open the developer tools, then click Network.
This is where we can see all the network requests and responses that occur on this page.
In there we can find that the URL which actually requests the flight information is https://flights.ctrip.com/itinerary/api/12808/products.
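Once the real data URL is found in the Network tab, the crawler can call it directly. The sketch below assumes this endpoint takes a POST request with a JSON body; the payload fields shown are only placeholders, and the real body and headers should be copied from the request shown in the Network tab:

import requests

api_url = 'https://flights.ctrip.com/itinerary/api/12808/products'
# the request body below is only a placeholder; copy the real payload
# from the request you see in the Network tab
payload = {'flightWay': 'Oneway', 'dcity': 'CGQ', 'acity': 'BJS',
           'date': '2019-09-14'}

resp = requests.post(api_url, json=payload)
print(resp.json())  # the flight data comes back as JSON instead of HTML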
Why is the content I crawl from some sites different from what I actually see?
The main reason for this is that your crawler is not logged in.
Just as when we normally browse the web some information can only be accessed after logging in, the same applies to crawlers.
This brings us to an important concept: our everyday web browsing is based on HTTP requests, and HTTP is a stateless protocol.
What is statelessness? You can think of it as the server not recognizing faces: your request reaches the server, but the server has no idea who you really are.
So why can we continue to visit this page for a long time after we log in?
This is because although HTTP is stateless, the server issues us a kind of ID card, known as a cookie.
When we visit the page for the first time, if we have never been there before, the server gives us a cookie; after that, every request we make to the site carries that cookie, so the server can use it to tell who we are.
For example, the relevant cookies can be found on Zhihu after logging in.
For these sites, we can either take the existing cookies directly from the browser and use them in the code, e.g. requests.get(url, cookies={'cookie_name': 'aidnWinfawinf'}), or have the crawler simulate logging in to the site to obtain the cookies.
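Both approaches can be sketched like this; the cookie name and value, login URL, and form fields below are placeholders, and the real ones come from the developer tools of a logged-in browser tab or from the site's actual login request:

import requests

# Option 1: reuse a cookie copied from the browser
# (cookie name and value are placeholders; copy the real ones from the
#  developer tools of a logged-in browser tab)
cookies = {'cookie_name': 'aidnWinfawinf'}
resp = requests.get('https://www.zhihu.com/', cookies=cookies)

# Option 2: simulate logging in and let a Session keep the cookies for us
# (the login URL and form fields are hypothetical; read the real ones
#  from the Network tab when you log in by hand)
session = requests.Session()
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
resp = session.get('https://example.com/profile')  # carries the login cookie automatically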
More Python crawler tutorials will continue to be updated!