Preface

The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial purpose. The copyright belongs to the original author; if you have any questions, please contact us.

Author: Amauri


This is an entry-level crawler tutorial; experienced readers can skip it.

We will crawl NetEase News and extract each article's title, author, source, release time, and body.

First, go to the 163 (NetEase) news site and pick any category; the one I chose here is national news. Then right-click and view the page source: the news list is nowhere in it. This indicates that the page is rendered asynchronously, that is, the data is obtained through an API.

To confirm this, press F12 to open the browser console, click the Network tab, and scroll the page. On the right you will see requests like "… special/00804KVA/cm_guonei_03.js? …". Click Response, and it turns out this is exactly the API we are looking for.

http://temp.163.com/special/00804KVA/cm_guonei_0(*).js

The URL above is the address we will request in this crawl.

Only three Python libraries are needed:

  • requests
  • json
  • BeautifulSoup

The requests library is used to make HTTP requests; essentially, it mimics a browser fetching resources.

Because the API we are collecting returns JSON, we need the json library to parse it. BeautifulSoup is used to parse HTML documents and makes it convenient to extract the content of a specific tag.
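As a quick sanity check before writing any parsing code, you can fetch one page of the API and look at the start of the response. The sketch below builds the URL from the pattern seen in the console; the payload shape may have changed since the article was written.

import requests

url = 'http://temp.163.com/special/00804KVA/cm_guonei_02.js'
resp = requests.get(url)
print(resp.status_code)
print(resp.text[:200])  # typically not bare JSON: a JavaScript callback wraps the data, hence the formatting step later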

Let’s start writing our crawler:

The first step is to import the above three packages:

import json
import requests
from bs4 import BeautifulSoup

Next we define a function that fetches the data for the specified number of pages:

def get_page(page):
    url_temp = 'http://temp.163.com/special/00804KVA/cm_guonei_0{}.js'
    return_list = []
    for i in range(page):
        url = url_temp.format(i)
        response = requests.get(url)
        if response.status_code != 200:
            continue
        content = response.text  # Get the response body
        _content = formatContent(content)  # Format the JSON string
        result = json.loads(_content)
        return_list.append(result)
    return return_list
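Note that get_page calls a helper, formatContent, which is not defined in this excerpt. Because the API responds with JSON wrapped in a JavaScript callback (something like data_callback([...])), the helper only needs to strip that wrapper before json.loads can parse it. A minimal sketch under that assumption:

def formatContent(content):
    # Keep only what sits inside the outermost parentheses of the
    # callback wrapper, e.g. data_callback([...]) -> [...]
    start = content.find('(')
    end = content.rfind(')')
    if start == -1 or end == -1:
        return content
    return content[start + 1:end]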

This gives us a list with the parsed content of each page.

Before scraping the article text, we analyze the HTML of an article page to find where the body, author, and source sit in the document.

The article source is in an a tag with id="ne_article_source". The author is in a span tag with class="ep-editor". The body is in a div tag with class="post_text".

The following code collects these three pieces of content:

def get_content(url):
    source = ''
    author = ''
    body = ''
    resp = requests.get(url)
    if resp.status_code == 200:
        body = resp.text
        # Specify a parser explicitly to avoid the default-parser warning
        bs4 = BeautifulSoup(body, 'html.parser')
        source = bs4.find('a', id='ne_article_source').get_text()
        author = bs4.find('span', class_='ep-editor').get_text()
        body = bs4.find('div', class_='post_text').get_text()
    return source, author, body
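For example, calling it on one article page (the URL below is just a placeholder) returns the three fields:

source, author, body = get_content('https://news.163.com/some_article.html')
print(source)
print(author)
print(body[:100])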

At this point, all the data we need has been captured.

So the next step, of course, is to save the data, and for convenience I save it in plain text form.
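A minimal sketch of that last step is below. It assumes that each item in the parsed JSON carries the article title in a 'title' field and the article URL in a 'docurl' field (the actual field names may differ from what the API returns), and it writes one text file per article:

def save_all(page_count):
    for page in get_page(page_count):
        for item in page:
            title = item.get('title', 'untitled')
            docurl = item.get('docurl')
            if not docurl:
                continue
            source, author, body = get_content(docurl)
            # One plain-text file per article, named after its title
            # (a real crawler should sanitize the title for use as a filename).
            with open('{}.txt'.format(title), 'w', encoding='utf-8') as f:
                f.write('Source: {}\n'.format(source))
                f.write('Author: {}\n'.format(author))
                f.write(body)

save_all(4)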

Note that the current implementation is completely synchronous and sequential, so collection is slow. Most of the time is spent waiting on network IO, so it can be upgraded to asynchronous IO to collect pages concurrently.
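One way to do that is asyncio together with the aiohttp library (an extra dependency, not used above), so the page requests run concurrently instead of one after another. A rough sketch, reusing formatContent from earlier:

import asyncio
import json

import aiohttp

async def fetch_page(session, url):
    async with session.get(url) as resp:
        if resp.status != 200:
            return None
        return await resp.text()

async def get_pages_async(page):
    url_temp = 'http://temp.163.com/special/00804KVA/cm_guonei_0{}.js'
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, url_temp.format(i)) for i in range(page)]
        texts = await asyncio.gather(*tasks)
    # Parse whichever pages came back successfully
    return [json.loads(formatContent(t)) for t in texts if t is not None]

results = asyncio.run(get_pages_async(4))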