No offense meant by the title. I thought the ad was fun, and the mind map above is yours to take, since I can't memorize that much anyway.

Article directory

    • preface
      • Welcome to our circle
    • Cookies and Sessions to bypass login authentication
      • A disclaimer
      • What are cookies? What are sessions?
      • How to implement the “Remember my login status” function
      • A POST request
      • Practice is the first step
      • A turn for the better
      • Put the little cookies into the cookie jar
    • Automated implementation: Selenium
      • Let me show you
      • The code shown
      • Environment configuration
      • Selenium
        • Setting up the Browser Engine
        • What can Selenium do?
        • Why is Selenium so capable?

preface

Previous Review: I will Learn Python secretly (Day 10)

What about the last one? The last one wasn't so good, I know; I was sick. So for this one I prepared a lot of interesting things (evil grin). Hey hey, come work through it with me.

I can and you can too!!



A quick interruption for a plug: (if you're a complete beginner, take a look at the following paragraph)

Welcome to our circle

I created a Python learning Q&A group; interested friends can find out here: what is this group

There are more than 700 friends in the group!!

Direct portal to the group: Portal


This series of articles assumes you have some basic knowledge of C or C++, because I started Python after learning a little C++. Thank you, Qi Feng, for your support. This series also assumes you can use Baidu to search, and that you have learned the 'module' module; at the very least, I suggest you have your own editor and compiler, and the last article already gave you a recommendation for that. I don't ask for much, just a click. As for the catalogue of this series: to be honest, my preference is for the two Primer Plus books, so I'll stick with their catalogue structure. This series will also focus on cultivating your ability to do things on your own. After all, I cannot teach you every knowledge point, so the ability to solve your own needs yourself is particularly important. Therefore, please do not regard the holes I have deliberately buried in the articles as mere holes.

Cookies and Sessions to bypass login authentication

A disclaimer

Are you excited by this headline? Thinking we're going to steal accounts today? Hey hey, get the black hat ready... Hold on, hold on. How could law-abiding citizens like us do something like that?

I'll just show you how to bypass login authentication when someone has clicked "Remember account and password." As for how you get into that situation, that has nothing to do with me. Hereby declared, hahaha.

Readers who saw my article from two days ago, "Crawling my own photos," may remember the process, and may have had doubts: such troublesome operations, with human intervention everywhere, so how can a machine do it? If you don't log in, don't save anything, and don't hunt for URLs, how do you get cookies?

To the friends who ask this kind of question (there really are some): I can only say your brain is quite active, but don't run off sideways. Every one of the problems above has a technical solution. And once our crawler can log in to its own account, can't it do a great many more things? You have the tools.


What are cookies? What are sessions?

Cookie: HTTP requests are stateless; even after the first connection to the server and a successful login, the server still cannot tell which user is making the second request. Cookies solve this problem: when the browser accesses a web site, the site stores a small set of data on the client. When the user sends a second request, the browser automatically attaches the cookie data stored from the last request, and the server can identify the current user from the data the browser carries.

In general, a cookie is some local data for a web page, used for verification on the next visit; it is often used for login authentication and remembering state.

session: a Session is a hashtable-like structure stored on the server to hold user data. When the browser sends its first request, the server automatically generates a hash table and a Session ID that uniquely identifies it, and sends the Session ID to the browser in the response. When the browser sends a second request, it includes that Session ID from the previous response; the server extracts it, compares it with all the saved Session IDs, and finds the hash table corresponding to that user.

Similar to the client's local cookie, a session is the server's 'cookie': it performs the same function, and the two usually work together to authenticate logins and remember state.
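To make this concrete, here is a minimal sketch of the idea, using httpbin.org as a stand-in server (my own choice for illustration, not part of the original article):

import requests

# A bare request is stateless: nothing we received before is sent back automatically.
r1 = requests.get('https://httpbin.org/cookies/set/flavor/chocolate', allow_redirects=False)
print(r1.headers.get('Set-Cookie'))  # the server asks the client to store a cookie

# A Session object keeps a cookie jar and attaches it to every later request,
# which is exactly the "browser carries the data back" behavior described above.
s = requests.Session()
s.get('https://httpbin.org/cookies/set/flavor/chocolate')
r2 = s.get('https://httpbin.org/cookies')
print(r2.json())  # {'cookies': {'flavor': 'chocolate'}}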

How to implement the “Remember my login status” function

From this we can deduce how the feature works: if the server sets the validity period of the Session ID cookie to one year when it sends the cookie to the client, then the client will send that Session ID value to the server whenever it visits my website within the next year, and the server restores the hash table of key-value pairs from memory or the database based on this Session ID.

However, the server does not actually keep sessions forever. After a certain period of time, a session on the server is destroyed to relieve the server's access pressure. Once the data on the server is destroyed, there is no way to "remember my login state" even if the cookie is still stored on the client.

Therefore, the method in this article is only good for short-term reuse of cookies to skip login authentication, and the useful lifetime of the local cookie mainly depends on the session expiry time configured on the server.


A POST request

What is a POST request? If you haven't heard of POST requests, think back to GET requests.

In fact, both POST and GET can carry parameters, but a GET request's parameters are displayed right in the URL.

A POST request's parameters are hidden in the request body instead of being displayed directly. For private information like accounts and passwords, POST requests should be used.

Speaking of which, I remember the C++ group I run doesn't seem to have turned in the input-control function assignment.

Typically, GET requests are used to fetch web data, such as the requests.get() we learned earlier, while POST requests are used to submit data to a web page, such as form data (the account and password of a web form, for example).
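A quick illustration of the difference, again with httpbin.org as a harmless test server (my assumption, not from the article):

import requests

# GET: the parameters are appended to the URL, visible to anyone who sees the address.
r_get = requests.get('https://httpbin.org/get', params={'user': 'lion'})
print(r_get.url)  # https://httpbin.org/get?user=lion

# POST: the parameters travel in the request body, so the URL stays clean.
# That is why account and password forms use POST.
r_post = requests.post('https://httpbin.org/post', data={'user': 'lion', 'pwd': 'secret'})
print(r_post.url)             # https://httpbin.org/post  (nothing private in the URL)
print(r_post.json()['form'])  # {'pwd': 'secret', 'user': 'lion'}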


Practice is the first step

Open the CSDN login page and fill in your personal information: passport.csdn.net/login?code=…

Check whatever needs checking, fill in whatever needs filling, and click Login.

Guess which packet it is. Use a little cleverness: the login has already succeeded, yet the right side keeps loading more packets, so the login packet must be near the front. After you click login, the very first thing sent must be the login itself, so look at the first few packets and you will spot "doLogin" at a glance. Click it open.



See? POST.

What? You see a bunch of set-cookie headers? No offense, I'm just mentioning it, hahahaha.

Well, I've marked it out for you. I mentioned it once above, but what I actually want to say is: expand everything. Different websites are different, and your little cookie may be hiding in some small corner.

In fact, there is more here than cookies; the account and password are in there too:

A student who has been practicing Python for two weeks captured his "student ID photo" on CSDN

Let's try another login method: logging in with parameters.

import requests
# Import requests.

url = 'https://www.csdn.net/'
# Assign the login url to be requested to url.

headers = {
    'origin': 'https://passport.csdn.net',  # this parameter is not needed in this case, just for demonstration purposes
    'referer': 'https://passport.csdn.net/login',
    'User-Agent': 'government'  # placeholder; put a real User-Agent here
}
# Add a request header to simulate normal browser access and avoid anti-crawler measures.

data = {
    "loginType": "1",
    "pwdOrVerifyCode": "Password",
    "userIdentification": "Account"
}
# Encapsulate the login parameters into a dictionary and assign it to data.

login_in = requests.post(url, headers=headers, data=data)

print(login_in)

OK, it returned 403. Sloppy... It's okay. It's okay.


A turn for the better

Oh well, I tried again and again and finally logged in successfully:

import requests
from bs4 import BeautifulSoup

url = '...'  # the doLogin request address found in the packet capture above

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    'Connection': 'keep-alive',
    'accept': 'application/json, text/plain, */*',
    # 'Cookie': cookie,
    'referer': 'https://passport.csdn.net/login'  # the login page, same as before
}
data = {
    "loginType": "1",
    "pwdOrVerifyCode": "your password",
    "userIdentification": "your account"
}
# Encapsulate the login parameters into a dictionary and assign it to data.

login_in = requests.post(url, headers=header, data=data)

print(login_in)

Great, this time it’s 200.

What's next? Find a blog post to comment on, and that's it.

cookies = login_in.cookies
# Call the cookies attribute of the response object (login_in) to get the login cookies and assign them to the variable cookies.

url_1 = '...'  # the url of the article we want to comment on; find it yourself
data_1 = {
    'content': 'test',
    'articleId': '...'  # the article's id; fill in your own
}
# Encapsulate the comment parameters into a dictionary.

comment = requests.post(url_1, headers=header, data=data_1, cookies=cookies)
# POST the comment request with these parameters: url, headers, comment parameters, cookies; assign the response to comment.
# The way to use the cookies is to pass cookies=cookies in the POST request.

print(comment.status_code)
# Print comment's status code; if it is 200, our comment succeeded.

CSDN comments refresh slowly, so use the status code as the criterion; sometimes it takes a day before you can actually see the comment.

If you wait all day and still don't see the comment, it's fine. I'm telling you, it has probably been cut by the backend. It's okay. We'll find a better way later.

Put the little cookies into the cookie jar

Well, for the sake of intuition, I'll reuse the code excerpt from the earlier "student ID card" section.

import requests
from bs4 import BeautifulSoup

cookie = "* Paste cookie information copied from Chrome *"
header = {
    'User-Agent': 'Put your own',
    'Connection': 'keep-alive',
    'accept': 'Put your own',
    'Cookie': cookie,
    'referer': 'Put your own blog home page address'
}
url = 'https://me.csdn.net/api/user/show'  # CSDN personal center, the js address that loads the name
session = requests.session()
response = session.get(url, headers=header)
print(type(session.cookies))
# Print the type of the cookies; session.cookies are the login cookies

Great, you can see the result is: <class 'requests.cookies.RequestsCookieJar'>

I'm afraid this thing can't be saved into a text file as-is; somebody try it and see.

But take a closer look: doesn't this cookie look like a typical string?

Do it yourself. One hint: actually, you may not even have to convert it to a string; try it directly, and if that fails, convert it and try again.
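For those who want to try, here is a minimal sketch under the same assumptions as the code above (the file name cookies.json is my own placeholder). requests.utils can turn the RequestsCookieJar into a plain dict, which JSON can store:

import json
import requests

# Turn the RequestsCookieJar into an ordinary dict so it can be serialized.
cookies_dict = requests.utils.dict_from_cookiejar(session.cookies)

# Put the little cookies into the cookie jar: save them to a local file.
with open('cookies.json', 'w') as f:
    json.dump(cookies_dict, f)

# Later, read them back and rebuild a cookie jar for a fresh session.
with open('cookies.json') as f:
    new_session = requests.session()
    new_session.cookies = requests.utils.cookiejar_from_dict(json.load(f))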


Of course, there are other ways to get cookies, but this is the most straightforward.


Automated implementation: Selenium

Websites these days are no fools; which login page doesn't want a verification code from you? Very few. Then the captcha has to be typed in by hand. Of course, some people bring up machine learning and captcha cracking. Good idea; go try it.

And there are websites, you must have met them, whose pages crisscross like tangled roads, so intricate that crawling them is hopeless.

Not to mention sites that encrypt their URLs or simply ban crawlers.

Well, let's take a look at how many of these hurdles Selenium, the new technology we are about to touch, can help us jump over.

Let me show you

For a rough demonstration, we open a browser, open a blog post, and then close it. As for the other high-end operations, we'll show them in code later:

The code shown


# Local Chrome Settings
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://blog.csdn.net/qq_43762191')
time.sleep(2)

driver.get('https://lion-wu.blog.csdn.net/article/details/109244401')
time.sleep(2)

driver.close()

If I give you the code and you try it, it probably won’t work because you probably haven’t configured the environment.

Environment configuration

Well, there’s no need to worry. Everything will come.

First of all, you need Google Chrome. Second, check your browser version; this matters, because each driver version corresponds to a browser version, and a mismatch can cause problems. Next, download a driver: npm.taobao.org/mirrors/chr… and pick the version that matches yours.

Once you've downloaded it, unzip it and put it in the same directory as your Python installation. If you don't know which directory that is, put a copy in every directory that might be it, or skip the guessing and see the sketch below.
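If you'd rather not scatter files around, you can also hand Selenium the driver's location explicitly. A sketch with a placeholder path (the keyword argument is the Selenium 3 style; Selenium 4 wraps the path in a Service object instead):

from selenium import webdriver

# Selenium 3 style: point directly at the unzipped chromedriver.
driver = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe')

# Selenium 4 style, if the line above complains about deprecation:
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service(r'C:\path\to\chromedriver.exe'))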

Ok, turn on PyCharm again and run the previous code.

Oh, and you also need to download the selenium package, which is a little big.
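Assuming you use pip, it is one command in a terminal:

pip install selenium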


We won't cover too many operations today; this is just the start, the word count is already past eight thousand, and the fun things are saved for the next article.

Let's go over the lines of code above and get this off to a good start. I know, some of you might prefer to look it up yourselves.


Selenium

Setting up the Browser Engine

# The first step is to import the modules
from selenium import webdriver
import time

driver = webdriver.Chrome()  # Gain control of Google Chrome; an error is raised here if the driver is missing
driver.get('https://blog.csdn.net/qq_43762191')  # Command Chrome: hey, kid, open this page for me
time.sleep(2)  # The browser or the network may be slow, so allow for the delay and wait 2 seconds

driver.get('https://lion-wu.blog.csdn.net/article/details/109244401')  # Open another one
time.sleep(2)  # Same as above

driver.close()  # Okay, that's it. Close it

What can Selenium do?

Let me put it this way: above, we set Chrome as the engine and assigned it to the variable driver. driver is the instantiated browser, and you'll see it again and again later, which is understandable, because we control this instantiated browser to do things for us.

You know.

Why is Selenium so capable?

Selenium simplifies the problems we ran into before, making it as easy to crawl dynamic web pages as static ones.

Static pages are the kind we could attack with BeautifulSoup right from the start. We can crawl this type of page with BeautifulSoup alone because the source code contains all of the page's information, so the URL in the address bar is exactly the URL of that source code.

Later we moved on to more complex pages. If I remember correctly, we started by grabbing CSDN comments, which is when we first met JSON. Then came QQ Music, where the data to crawl was not in the HTML source at all but in JSON, so you could not simply use the address-bar URL; you had to find the real URL of the JSON data. Those are dynamic web pages.

No matter where the data resides, the browser is always making various requests to the server, and when these requests are complete, they together make up the rendered web source code shown in Elements for the developer tool.

Selenium comes in handy when the page interaction is complex or the URL encryption logic is complex. It can actually open a browser, wait for all the data to load into Elements, and then use the page as a static web page to crawl.
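Here is a minimal sketch of that idea, reusing the blog URL from the demo above (the fixed 3-second wait is a crude assumption; real code would wait more intelligently):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://blog.csdn.net/qq_43762191')
time.sleep(3)  # crude wait so all requests finish and Elements is fully populated

# page_source is the rendered source, the same thing you see in Elements,
# so BeautifulSoup can now treat this dynamic page like a static one.
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title.text)

driver.close()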

Having said all that, using Selenium certainly has its fly in the ointment. Because it really starts a local browser, opens it, and waits for the page to finish rendering, Selenium inevitably sacrifices speed and extra resources, but at least it is no slower than a human. So let it wait. Young people: better to stop for three minutes than to grab a second.
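As an aside of my own (not from the original article): if the visible window bothers you, Chrome can run headless, which saves a bit of the rendering cost:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # render in memory, without drawing a window
driver = webdriver.Chrome(options=options)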


That's it. I'll keep the rest in suspense.