When I was getting started with Python, I set myself the task of building a crawler for website articles, because working on a real project is the fastest way to learn to code. So today I'm starting a Python tutorial series, and I recommend that you write and practice as much as you can while learning Python.
The goals
- 1. Learn the basics of Python crawlers
- 2. Crawl the news list of a news website
- 3. Crawl pictures
- 4. Save the retrieved data to a local folder or a database
- 5. Learn how to install Python extension packages with pip in PyCharm
First, how to easily crawl a web page with Python
1. Preparation
The BeautifulSoup4 and chardet modules used in this project are third-party extension packages. If you are not comfortable installing them with pip yourself, you can install them through PyCharm, which is what I did. Here is a brief walkthrough of installing chardet and BeautifulSoup4 from PyCharm:
- Open PyCharm's Settings and follow the steps below to reach the package manager
- Search for the extension library you want, such as chardet, then click Install Package; do the same for BeautifulSoup4
- After a successful installation, the package appears in the installed list, which tells us the crawler extension libraries are in place
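If you prefer the command line over PyCharm, the same packages can also be installed with pip directly. A quick sketch, assuming the pip that belongs to your Python 3 interpreter is on your PATH:
pip install beautifulsoup4 chardet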
Second, going from shallow to deep, let's grab a web page first
Here we take grabbing the Jianshu home page as an example: www.jianshu.com/
# Simple web crawler
from urllib import request
import chardet

response = request.urlopen("http://www.jianshu.com/")
html = response.read()
# chardet guesses the encoding, e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
charset = chardet.detect(html)
html = html.decode(str(charset["encoding"]))  # decode the bytes with the detected encoding
print(html)
Because the HTML document is quite long, here is just a short excerpt so you can see what it looks like:
<!DOCTYPE html>
<!--[if IE 6]><html class="ie lt-ie8"><![endif]-->
<!--[if IE 7]><html class="ie lt-ie8"><![endif]-->
<!--[if IE 8]><html class="ie ie8"><![endif]-->
<!--[if IE 9]><html class="ie ie9"><![endif]-->
<!--[if !IE]><!--><html><!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no">
<!-- Start of Baidu Transcode -->
<meta http-equiv="Cache-Control" content="no-siteapp" />
<meta http-equiv="Cache-Control" content="no-transform" />
<meta name="applicable-device" content="pc,mobile">
<meta name="MobileOptimized" content="width"/>
<meta name="HandheldFriendly" content="true"/>
<meta name="mobile-agent" content="format=html5; url=http://localhost/">
<!-- End of Baidu Transcode -->
<meta name="description" content="Jianshu is a quality creative community where you can create whatever you want: a short essay, a photo, a poem, a painting... We believe that everyone is an artist in life, with unlimited creativity.">
<meta name="keywords" content="Jianshu, Jianshu official website, graphic editing software, Jianshu download, graphic creation, creation software, original community, fiction, prose, writing, reading.">
... (a long stretch omitted here)
That is the whole simple introduction to Python 3 crawlers. Pretty easy, isn't it? I recommend typing it out yourself a few times.
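One small refinement worth knowing (my own addition, not part of the original example): some sites reject urllib's default User-Agent, so you can send a browser-like header and let a context manager close the connection. A minimal sketch:
from urllib import request
import chardet

# A sketch: send a browser-like User-Agent (the value here is just an example),
# since some sites block urllib's default one.
req = request.Request("http://www.jianshu.com/", headers={"User-Agent": "Mozilla/5.0"})
with request.urlopen(req) as response:  # the context manager closes the connection
    raw = response.read()

encoding = chardet.detect(raw)["encoding"] or "utf-8"  # fall back if detection fails
print(raw.decode(encoding)[:500])  # print just the first 500 characters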
Python3 crawls the images from the web page and saves them to a local folder
The goals
- Crawl the pictures in a Baidu Tieba post
- Save the pictures to a local folder (they're all pictures of girls). Not much more to say, so straight to the code, which is commented in detail; read the comments carefully
import re
import urllib.request

# Fetch the page HTML
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

html = getHtml("http://tieba.baidu.com/p/3205263090")
html = html.decode('UTF-8')

# Method to get the image links
def getImg(html):
    # Use a regular expression to match the image addresses in the page
    reg = r'src="(.+?\.jpg)" pic_ext="jpeg"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist

imgList = getImg(html)
imgCount = 0
# Download all images to the local pic folder; create the pic folder before saving
for imgPath in imgList:
    f = open("../pic/" + str(imgCount) + ".jpg", 'wb')
    f.write(urllib.request.urlopen(imgPath).read())
    f.close()
    imgCount += 1
print("All fetching completed")
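If you would rather not manage file handles yourself, urllib's standard urlretrieve can download and write each file in one call, and os.makedirs can create the pic folder automatically. A sketch, reusing the imgList from above:
import os
import urllib.request

# A sketch: create the output folder if missing and let urlretrieve
# handle the download and the file writing in a single call.
os.makedirs("../pic", exist_ok=True)
for imgCount, imgPath in enumerate(imgList):
    urllib.request.urlretrieve(imgPath, "../pic/" + str(imgCount) + ".jpg")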
I couldn't wait to see what I got: 24 pictures of girls, just like that. Isn't that easy?
Python3 crawls the news list of a news site
- Here we only crawl the news titles, news URLs, and news picture links.
- The data is for display only; I will save it to a database after I finish learning that part of Python.
This one is a little more complicated, so I'll walk you through it step by step.
- 1. First we need to crawl the HTML of the page; the first part above already showed how to do that
- 2. Analyze the HTML tags we want to grab
Here is the key code, using the BeautifulSoup4 library we imported:
# Parse with html.parser
soup = BeautifulSoup(html, 'html.parser')
# Get every node with class=hot-article-img
allList = soup.select('.hot-article-img')
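As an aside (my own sketch, not the approach used below): select() accepts full CSS selectors, so the <a> nodes inside those divs could also be queried in a single step:
# A sketch: select() takes CSS selectors, so the <a> nodes inside the
# hot-article-img divs can be fetched in one query.
links = soup.select('.hot-article-img a')
for a in links:
    print(a.get('href'), a.get('title'))  # .get() returns None if the attribute is missing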
The allList obtained by the code above is the news list we want; it looks like this:
[<div class="hot-article-img">
<a href="/article/211390.html" target="_blank">
<img src="https://img.huxiucdn.com/article/cover/201708/22/173535862821.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/214982.html" target="_blank" title="TFBOYS members fly separately, the commercial value of the ceiling?">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/17/094856378420.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/213703.html" target="_blank" title="Buy a hand shop jianghu">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/17/122655034450.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/214679.html" target="_blank" title="iPhone X officially tells us that phones and cameras are starting to go their separate ways.">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/14/182151300292.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/214962.html" target="_blank" title="Credit has been exhausted, LeEco Auto may abandon its son.">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/16/210518696352.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/214867.html" target="_blank" title="Don't underestimate the Ig Nobels. Salute curiosity.">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/15/180620783020.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/214954.html" target="_blank" title="Ten years ago to change the world, can be more than the iPhone | start">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/16/162049096015.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/214908.html" target="_blank" title="Thank you Twitter for standing up for me.">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/16/010410913192.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/215001.html" target="_blank" title="Apple confirmed the elimination of royalty, but how much more content do you think is worth paying for?">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/17/154147105217.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/214969.html" target="_blank" title="Is the era of 'full pay' for Chinese music coming?">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/17/101218317953.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>, <div class="hot-article-img">
<a href="/article/214964.html" target="_blank" title="Belle delisting revelations: how the 'King of Shoes' is estranged from the new generation of consumers">
<!-- Keep videos and pictures one -->
<img src="https://img.huxiucdn.com/article/cover/201709/16/213400162818.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg">
</a></div>]
The data is captured, but it is messy and contains a lot we don't want, so next we extract the useful information by traversing the list.
- 3. Extract the valid information
# Walk through the list to get the valid information
for news in allList:
    aaa = news.select('a')
    # Only process results with length greater than 0
    if len(aaa) > 0:
        # News link (empty if an exception is thrown)
        try:
            href = url + aaa[0]['href']
        except Exception:
            href = ''
        # News image URL
        try:
            imgUrl = aaa[0].select('img')[0]['src']
        except Exception:
            imgUrl = ""
        # News title
        try:
            title = aaa[0]['title']
        except Exception:
            title = "Title is empty"
        print("Title:", title, "\nURL:", href, "\nPicture address:", imgUrl)
        print("==========================================================================================")
We add exception handling here mainly because some news items may have no title, URL, or picture; without it, a single missing field could interrupt the whole crawl.
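Since the three try/except blocks repeat the same pattern, they could also be folded into a small helper function (my own sketch, not part of the original code; safe_get is a hypothetical name):
# A sketch of a helper that swallows exceptions and returns a default.
def safe_get(getter, default=""):
    try:
        return getter()
    except Exception:
        return default

# Inside the loop above it would be used like this:
# href = safe_get(lambda: url + aaa[0]['href'])
# imgUrl = safe_get(lambda: aaa[0].select('img')[0]['src'])
# title = safe_get(lambda: aaa[0]['title'], "Title is empty")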
The valid information after filtering:
Title: Title is empty
URL: https://www.huxiu.com/article/211390.html
Picture address: https://img.huxiucdn.com/article/cover/201708/22/173535862821.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: TFBOYS members fly separately, the commercial value of the ceiling?
URL: https://www.huxiu.com/article/214982.html
Picture address: https://img.huxiucdn.com/article/cover/201709/17/094856378420.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: Buy a hand shop jianghu
URL: https://www.huxiu.com/article/213703.html
Picture address: https://img.huxiucdn.com/article/cover/201709/17/122655034450.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: iPhone X officially tells us that phones and cameras are starting to go their separate ways.
URL: https://www.huxiu.com/article/214679.html
Picture address: https://img.huxiucdn.com/article/cover/201709/14/182151300292.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: Credit has been exhausted, LeEco Auto may abandon its son.
URL: https://www.huxiu.com/article/214962.html
Picture address: https://img.huxiucdn.com/article/cover/201709/16/210518696352.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: Don't underestimate the Ig Nobels. Salute curiosity.
URL: https://www.huxiu.com/article/214867.html
Picture address: https://img.huxiucdn.com/article/cover/201709/15/180620783020.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: Ten years ago to change the world, can be more than the iPhone | start
URL: https://www.huxiu.com/article/214954.html
Picture address: https://img.huxiucdn.com/article/cover/201709/16/162049096015.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: Thank you Twitter for standing up for me.
URL: https://www.huxiu.com/article/214908.html
Picture address: https://img.huxiucdn.com/article/cover/201709/16/010410913192.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: Apple confirmed the elimination of royalty, but how much more content do you think is worth paying for?
URL: https://www.huxiu.com/article/215001.html
Picture address: https://img.huxiucdn.com/article/cover/201709/17/154147105217.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: Is the era of "full pay" for Chinese music coming?
URL: https://www.huxiu.com/article/214969.html
Picture address: https://img.huxiucdn.com/article/cover/201709/17/101218317953.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
Title: Belle delisting revelations: how the "King of Shoes" is estranged from the new generation of consumers
URL: https://www.huxiu.com/article/214964.html
Picture address: https://img.huxiucdn.com/article/cover/201709/16/213400162818.jpg?imageView2/1/w/280/h/210/|imageMogr2/strip/interlace/1/quality/85/format/jpg
==========================================================================================
That's it for grabbing the news information from a news site. The complete code is posted below.
from bs4 import BeautifulSoup
from urllib import request
import chardet

url = "https://www.huxiu.com"
response = request.urlopen(url)
html = response.read()
charset = chardet.detect(html)
html = html.decode(str(charset["encoding"]))  # decode the fetched HTML with the detected encoding
# Parse with html.parser
soup = BeautifulSoup(html, 'html.parser')
# Get every node with class=hot-article-img
allList = soup.select('.hot-article-img')
# Walk through the list to get the valid information
for news in allList:
    aaa = news.select('a')
    # Only process results with length greater than 0
    if len(aaa) > 0:
        # News link (empty if an exception is thrown)
        try:
            href = url + aaa[0]['href']
        except Exception:
            href = ''
        # News image URL
        try:
            imgUrl = aaa[0].select('img')[0]['src']
        except Exception:
            imgUrl = ""
        # News title
        try:
            title = aaa[0]['title']
        except Exception:
            title = "Title is empty"
        print("Title:", title, "\nURL:", href, "\nPicture address:", imgUrl)
        print("==========================================================================================")
Once the data is obtained, we need to save it in a database. With the data stored, we can do further analysis and processing, or even use the crawled articles to provide a news API for an app. I will cover that in a follow-up article, "Python3 Database 101: Saving data from crawling into a database".
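As a small preview of that follow-up (my own sketch using the standard-library sqlite3 module; the database file, table, and column names are illustrative, not from the original):
import sqlite3

# A sketch: store the three crawled fields in a local SQLite file.
conn = sqlite3.connect("news.db")
conn.execute("CREATE TABLE IF NOT EXISTS news (title TEXT, url TEXT, img_url TEXT)")
# Inside the crawl loop, after title/href/imgUrl are extracted:
conn.execute("INSERT INTO news VALUES (?, ?, ?)", (title, href, imgUrl))
conn.commit()
conn.close()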