Simple crawling of Songhuajiang News (www.shjnet.cn/ms/msxw/)
1. First, analyze the source page to see where the content we want to crawl is located.
2. Analyze the HTML to extract the content we want.
1. View the source code
We find that the data we want sits under the <h4> tag.
2. Get the page's source code with Requests:
html = requests.get(url).content
Then use BeautifulSoup to find the tag we want:
links = soup.find_all('h4', class_='blank')
This crawls the data from the news list
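To see these two calls work end to end without hitting the site, here is a minimal offline sketch; the sample HTML below is an invented stand-in for the real list markup, which may differ:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Invented stand-in for the news-list markup; the live page may differ.
sample = """
<div>
  <h4 class="blank"><a href="./t001.html">Headline one</a></h4>
  <h4 class="blank"><a href="./t002.html">Headline two</a></h4>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')  # 'lxml' also works if installed
for link in soup.find_all('h4', class_='blank'):
    # each <h4> wraps one <a>; get_text() gives the headline, href the relative link
    print(link.a.get_text() + " -> " + link.a.get('href'))
```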
3. Next, use the URLs crawled from the list to fetch the detail content; the method is the same as above.
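The detail-page step can be sketched offline the same way; the markup and file names below are invented stand-ins for the real page structure:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Invented stand-in for a detail page: a col-md-9 block wrapping a col-md-10
# block that contains the article images. The real markup may differ.
sample = """
<div class="col-md-9">
  <div class="col-md-10">
    <img src="./W_photo1.jpg">
    <img src="./W_photo2.jpg">
  </div>
</div>
"""

detailUrl = 'http://www.shjnet.cn/ms/msxw/t999.html'  # made-up detail URL
soup = BeautifulSoup(sample, 'html.parser')
for block in soup.find_all('div', class_='col-md-9'):
    inner = block.find('div', class_='col-md-10')
    if inner and inner.select('img'):
        for img in inner.find_all('img'):
            # same stitching as the article: strip './' and prepend the page's directory
            print("Image: " + detailUrl[:detailUrl.rfind('/')] + "/" + img.get('src').replace('./', ''))
```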
The full source code:
#!/usr/bin/env python
# coding:utf8
import sys
import requests
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf8")

url = 'http://www.shjnet.cn/ms/msxw/index.html'

def getNewsList(url, page=0):
    if page != 0:
        url = 'http://www.shjnet.cn/ms/msxw/index_%s.html' % page
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('h4', class_='blank')
    for link in links:
        detailUrl = "http://www.shjnet.cn/ms/msxw/" + link.a.get('href').replace('./', '')
        print "-" * 50
        print "Headline: " + link.a.get_text() + " Detail URL: " + detailUrl
        getNewsDetail(detailUrl)
    page = int(page) + 1
    print soup.select('#pagenav_%s' % page)
    if soup.select('#pagenav_%s' % page):
        print u'Start grabbing the next page'
        print 'page %s' % page
        getNewsList(url, page)

def getNewsDetail(detailUrl):
    html = requests.get(detailUrl).content
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('div', class_='col-md-9')
    for link in links:
        # print link.span.get_text()
        # print link.h2.get_text()
        # print link.find('div', class_='cas_content').get_text()
        if link.find('div', class_='col-md-10').select('img'):
            imgs = link.find('div', class_='col-md-10').find_all('img')
            for img in imgs:
                print "Image: " + detailUrl[:detailUrl.rfind('/')] + "/" + img.get('src').replace('./', '')

if __name__ == '__main__':
    getNewsList(url)
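The recursion in getNewsList stops when the next page's #pagenav_N element is missing: soup.select returns a list, and an empty list is falsy. A tiny sketch with invented nav markup shows the check:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Invented pagination markup; the real site's nav ids follow the same pattern.
nav = '<div><a id="pagenav_1">1</a><a id="pagenav_2">2</a></div>'
soup = BeautifulSoup(nav, 'html.parser')

print(bool(soup.select('#pagenav_2')))  # element exists -> non-empty list -> True
print(bool(soup.select('#pagenav_9')))  # no such id -> empty list -> False
```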
The result:
This article uses Python 2.7.
Problems encountered during the crawl
- Garbled output when printing: at first I used html = requests.get(url).text, which printed mojibake. After asking teammates in our group, I learned that .text returns decoded Unicode data (using a guessed encoding), while .content returns bytes, i.e. the raw binary data. Switching to html = requests.get(url).content and letting BeautifulSoup handle the decoding resolved the garbled characters.
- When stitching together the detail URL, the redundant "./" string had to be removed: link.a.get('href').replace('./', '')
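The .text-vs-.content difference can be reproduced without the network; the string below just stands in for a response body:

```python
# -*- coding: utf-8 -*-
# Stand-in for a response body: requests' .content gives raw bytes like these.
raw = u'松花江新闻'.encode('utf-8')

# .text decodes with a guessed encoding; if the guess is ISO-8859-1
# while the page is really UTF-8, the result is mojibake:
garbled = raw.decode('iso-8859-1')
correct = raw.decode('utf-8')
print(repr(garbled))  # mojibake
print(correct)        # the original headline text
```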
- Error when fetching the detail content: the hrefs in the list are relative, so they had to be prefixed with http://www.shjnet.cn/ms/msxw/ to form full URLs.
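Both of the last two problems (stripping "./" and building an absolute URL) can also be handled with the standard library's urljoin instead of manual string replacement; the relative href here is made up:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7, as used in this article

base = 'http://www.shjnet.cn/ms/msxw/index.html'
# './t20180101_123.html' is an invented relative href for illustration
print(urljoin(base, './t20180101_123.html'))
# -> http://www.shjnet.cn/ms/msxw/t20180101_123.html
```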
- This was my first time using BeautifulSoup; there is a good introductory tutorial on Jianshu covering how to use it.