# What is crawler request website, extract page content maximization program. You get HTML code, and you need to extract the data you want from that text

  1. Initiate a request:

Sending an HTTP Request to a target site is sending a Request, which can include additional headers and other information, and waiting for the server to respond

2. Obtain the response content

If the server is normal, a response will be returned in the first step. The content of the response is the content of the page to be obtained, and the type may be HTML, Json string, binary data (such as pictures and videos), etc.

3. Parse the content

The resulting content might be HTML. Regular expressions and web page parsing libraries can be used for parsing. If it is JSON, it can be parsed directly as a JSON object. If the data is binary, it can be saved or further processed.

4. Save data:

It can be saved as text, to a database or to a file in a specific format

# the request and response

What does #request contain?

What’s in response?

What kind of data can crawlers grab?

How to parse the data after obtaining it?

# Note: Using request.get(‘url’) gets only the contents of the first request, i.e. when we execute:

import requests
response = requests.get('http://www.weibo.com')
print(response.text)# Print out the code returned by the first request
Copy the code

This returns the code shown in the red box below:

Is the data captured different from the browser page?

Native HTML does not include any information, the information displayed on the page is through the subsequent JS call background interface to get the data, JS then load the page out. So the source code in Elements is different from the source code we got:

How to solve this JavaScript rendering problem?

  • Analyzing the Ajax request: The Ajax request returns a JSON-formatted string
  • Selenium WebDriver simulates or drives a browser web page, automating testing based on. Examples are as follows:
#pip install selenium
from selenium import webdirver
driver = webdriver.Chrom();# Automatically open Chrome
driver.get('www.weibo.com')The browser page automatically redirects to the specified url
print(driver.page_source)# Get the source code and browser page source code, this is the rendered page source code, JS rendering problem does not exist.
Copy the code

The downside of this approach is that Chrome automatically runs in the background

  • Splash: Like Selenium, both are used to simulate JS rendering, which can be searched on Github and is well documented.
  • Also use: PyV8, Ghost. Py and other libraries

### Solution to JavaScript rendering problem summary

How to save data

  • The text
  • Relational database
  • Philippine relational database
  • Binary file

Information sharing

Java learning notes, 10T materials, more than 100 Java projects to share


Welcome to pay attention to personal public number [cainiao famous enterprise dream], public number focus: Internet job hunting face, Java, Python, crawler, big data and other technology sharing ** : public number ** caibird famous enterprise dream background send “CSDN” can receive free [CSDN] and [Baidu Library] download services; Public number canaliao name enterprise dream background send “information” : you can receive 5T quality learning materials **, Java interview test points and Java face summary, as well as dozens of Java, big data projects, the information is very complete, you want to find almost all