Simple crawling of Songhuajiang News (www.shjnet.cn/ms/msxw/)
1. First, analyze the source page to see where the content we want to crawl is located.
2. Analyze the HTML to extract the content we want.
1. View the source code
We find that the data we want sits under the <h4> tag.
2. Get the page's source code with Requests:
html = requests.get(url).content
Then use BeautifulSoup to find the tag we want:
links = soup.find_all('h4', class_='blank')
This crawls the data from the news list
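To see these two calls work end to end without hitting the site, here is a minimal offline sketch; the sample HTML below is an invented stand-in for the real list markup, which may differ:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Invented stand-in for the news-list markup; the live page may differ.
sample = """
<div>
  <h4 class="blank"><a href="./t001.html">Headline one</a></h4>
  <h4 class="blank"><a href="./t002.html">Headline two</a></h4>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')  # 'lxml' also works if installed
for link in soup.find_all('h4', class_='blank'):
    # each <h4> wraps one <a>; get_text() gives the headline, href the relative link
    print(link.a.get_text() + " -> " + link.a.get('href'))
```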
3. Next, use the URLs crawled from the list to fetch the detail content; the method is the same as above.
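The detail-page step can be sketched offline the same way; the markup and file names below are invented stand-ins for the real page structure:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Invented stand-in for a detail page: a col-md-9 block wrapping a col-md-10
# block that contains the article images. The real markup may differ.
sample = """
<div class="col-md-9">
  <div class="col-md-10">
    <img src="./W_photo1.jpg">
    <img src="./W_photo2.jpg">
  </div>
</div>
"""

detailUrl = 'http://www.shjnet.cn/ms/msxw/t999.html'  # made-up detail URL
soup = BeautifulSoup(sample, 'html.parser')
for block in soup.find_all('div', class_='col-md-9'):
    inner = block.find('div', class_='col-md-10')
    if inner and inner.select('img'):
        for img in inner.find_all('img'):
            # same stitching as the article: strip './' and prepend the page's directory
            print("Image: " + detailUrl[:detailUrl.rfind('/')] + "/" + img.get('src').replace('./', ''))
```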
The full source code:
#!/usr/bin/env python
# coding:utf8
import sys
import requests
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf8")

url = 'http://www.shjnet.cn/ms/msxw/index.html'

def getNewsList(url, page=0):
    if page != 0:
        url = 'http://www.shjnet.cn/ms/msxw/index_%s.html' % page
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('h4', class_='blank')
    for link in links:
        detailUrl = "http://www.shjnet.cn/ms/msxw/" + link.a.get('href').replace('./', '')
        print "-" * 50
        print "Headline: " + link.a.get_text() + " Detail URL: " + detailUrl
        getNewsDetail(detailUrl)
    page = int(page) + 1
    print soup.select('#pagenav_%s' % page)
    if soup.select('#pagenav_%s' % page):
        print u'Start grabbing the next page'
        print 'page %s' % page
        getNewsList(url, page)

def getNewsDetail(detailUrl):
    html = requests.get(detailUrl).content
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('div', class_='col-md-9')
    for link in links:
        # print link.span.get_text()
        # print link.h2.get_text()
        # print link.find('div', class_='cas_content').get_text()
        if link.find('div', class_='col-md-10').select('img'):
            imgs = link.find('div', class_='col-md-10').find_all('img')
            for img in imgs:
                print "Image: " + detailUrl[:detailUrl.rfind('/')] + "/" + img.get('src').replace('./', '')

if __name__ == '__main__':
    getNewsList(url)
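The recursion in getNewsList stops when the next page's #pagenav_N element is missing: soup.select returns a list, and an empty list is falsy. A tiny sketch with invented nav markup shows the check:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Invented pagination markup; the real site's nav ids follow the same pattern.
nav = '<div><a id="pagenav_1">1</a><a id="pagenav_2">2</a></div>'
soup = BeautifulSoup(nav, 'html.parser')

print(bool(soup.select('#pagenav_2')))  # element exists -> non-empty list -> True
print(bool(soup.select('#pagenav_9')))  # no such id -> empty list -> False
```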
The result:
This article uses Python 2.7.
Problems encountered during the crawl
- Garbled output when printing: at first I used html = requests.get(url).text, which printed mojibake. After asking teammates in our group, I learned that .text returns decoded Unicode data (using a guessed encoding), while .content returns bytes, i.e. the raw binary data. Switching to html = requests.get(url).content and letting BeautifulSoup handle the decoding resolved the garbled characters.
- When stitching together the detail URL, the redundant "./" string had to be removed: link.a.get('href').replace('./', '')
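The .text-vs-.content difference can be reproduced without the network; the string below just stands in for a response body:

```python
# -*- coding: utf-8 -*-
# Stand-in for a response body: requests' .content gives raw bytes like these.
raw = u'松花江新闻'.encode('utf-8')

# .text decodes with a guessed encoding; if the guess is ISO-8859-1
# while the page is really UTF-8, the result is mojibake:
garbled = raw.decode('iso-8859-1')
correct = raw.decode('utf-8')
print(repr(garbled))  # mojibake
print(correct)        # the original headline text
```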
- Error when fetching the detail content: the hrefs in the list are relative, so they had to be prefixed with http://www.shjnet.cn/ms/msxw/ to form full URLs.
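Both of the last two problems (stripping "./" and building an absolute URL) can also be handled with the standard library's urljoin instead of manual string replacement; the relative href here is made up:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7, as used in this article

base = 'http://www.shjnet.cn/ms/msxw/index.html'
# './t20180101_123.html' is an invented relative href for illustration
print(urljoin(base, './t20180101_123.html'))
# -> http://www.shjnet.cn/ms/msxw/t20180101_123.html
```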
- This was my first time using BeautifulSoup; there is a good introductory tutorial on Jianshu covering how to use it.