This is the 10th day of my participation in the Gwen Challenge.
Therefore the sage acts without striving and teaches without words; all things arise and he does not turn them away; he creates but does not possess, acts but does not presume, succeeds but does not dwell on it. And because he does not dwell on it, it never leaves him. (Laozi, Tao Te Ching)
I often need to crawl some data from a website. Since Python has the richest ecosystem of packages, I tried Python first, and this article is the result.
With Python you can crawl not only data but also images.
A reminder: keep your crawling within the bounds of the law. After all, the orders may come from above, but we are the ones left holding the bag ~
Basic Python Configuration
Install pip
pip makes it easy to install other Python packages by name. pip comes bundled with Python 2 >= 2.7.9 and Python 3 >= 3.4. You can check whether pip is installed with the following command:
python -m pip --version
# output: pip 18.0 from C:\Users\lenovo\AppData\Local\Programs\Python\Python36\lib\site-packages\pip (python 3.6)
If not, you can install it by downloading get-pip.py and running the following command:
python get-pip.py
We can now install other packages with pip. BeautifulSoup, used below, is provided by the beautifulsoup4 package; installing bs4 pulls it in:
pip3 install bs4
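If you are not sure whether a package installed correctly, a quick check using only the standard library (the helper name `is_installed` is my own, just for illustration):

```python
import importlib.util

def is_installed(name):
    """Return True if a package can be imported under this interpreter."""
    return importlib.util.find_spec(name) is not None

# Once beautifulsoup4 is installed, this should report True
print(is_installed("bs4"))
```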
Common crawler packages
requests
requests is a handy package for fetching URL resources.
import requests
r = requests.get('https://juejin.cn')
print(r)
print(r.status_code)
print(r.text)
Output (truncated):
<Response [200]>
200
<!doctype html> <html data-n-head-ssr lang="zh" data-n-head="%7B%22lang%22:%7B%22ssr%22:%22zh%22%7D%7D"><meta data-n-head="ssr" charset="utf-8"><meta data-n-head="ssr" name="viewport" content="width=device-width, initial-scale=1, user-scalable=no, viewport-fit=cover"><meta data-n-head="ssr" name="apple-itu......
With requests we can add headers to the request, as well as cookie information for sites that require it. For detailed documentation, see: Requests: HTTP for Humans.
A few common examples:
r = requests.get('https://xxx', auth=('user', 'pass'))
r = requests.post('https://xxxx', data={'key': 'value'})
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)
print(r.text)
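The `params` argument simply URL-encodes the dictionary into the query string. This can be seen offline with requests' `PreparedRequest`, without making any network call:

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
# Build and prepare the request without sending it
req = requests.Request('GET', 'https://httpbin.org/get', params=payload)
prepared = req.prepare()
# The payload dict has been encoded into the query string
print(prepared.url)  # https://httpbin.org/get?key1=value1&key2=value2
```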
BeautifulSoup
Using Beautiful Soup makes it easy to extract data from HTML.
The official Chinese documentation: beautifulsoup.readthedocs.io/zh_CN/v4.4….
Example of simple usage:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")
soup = BeautifulSoup("<html>data</html>", "html.parser")
# The way data is viewed
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
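find_all can also filter on arbitrary attribute values, which is exactly what the image crawl below relies on. A minimal self-contained sketch, using a fragment in the style of the official docs:

```python
from bs4 import BeautifulSoup

html = '''
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'''
soup = BeautifulSoup(html, 'html.parser')  # an explicit parser avoids a warning
# Filter tags by attribute value, then read another attribute off each match
links = [a.get('href') for a in soup.find_all('a', {'class': 'sister'})]
print(links)  # ['http://example.com/elsie', 'http://example.com/lacie']
```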
Downloading files with open
open is a built-in Python function that opens a file and returns a file object. The commonly used parameters are file and mode. The full signature:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
""" Parameter description: file: required, relative or absolute file path. Mode: optional, file opening mode: buffering: set buffering Encoding: common utf8 Errors: error level newline: distinguish newline closefd: passed file parameter type opener: """
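A round-trip sketch of the two modes this article relies on: 'wb' to write image bytes, 'rb' to read them back (the path and stand-in bytes are illustrative):

```python
import os
import tempfile

data = b'\x89PNG fake image bytes'           # stand-in for requests.get(...).content
path = os.path.join(tempfile.gettempdir(), 'demo.bin')

with open(path, 'wb') as f:                   # 'wb': create/truncate, write binary
    f.write(data)

with open(path, 'rb') as f:                   # 'rb': read the binary back
    restored = f.read()

print(restored == data)  # True
os.remove(path)
```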
Crawling the image
Image link: www.easyapi.com/xxx
This link has two characteristics:
- Simple: the page contains just one picture
- The URL stays the same, but the image changes after each refresh
The page's HTML and its rendering are shown below. Let's look at the key part of the source:
<head>…</head>
<body id="404">
    <div class="mheight wp">
        <div class="con_nofound">
            <div>
                <!-- The key is how to get the img tag and its src content -->
                <p><img src="https://qiniu.easyapi.com/photo/girl106.jpg" title="Look at the beauty" width="600"></p>
            </div>
        </div>
    </div>
</body>
Refresh the page and you will find that the number in the img src path changes, but the title stays the same.
So we can use the title to find the tag and then read its src.
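Before wiring in requests, the extraction step itself can be sketched offline against the img tag pasted from the page source above (no network call is made):

```python
from bs4 import BeautifulSoup

# The fragment copied from the page source above
html = '<p><img src="https://qiniu.easyapi.com/photo/girl106.jpg" title="Look at the beauty" width="600"></p>'

soup = BeautifulSoup(html, 'html.parser')
# Locate the tag by its stable title, then read the changing src
img = soup.find('img', {'title': 'Look at the beauty'})
print(img.get('src'))  # https://qiniu.easyapi.com/photo/girl106.jpg
```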
Getting HTML content
Get HTML content using Requests:
res_url = 'https://www.easyapi.com/xxx'
headers = {'referer': 'https://www.easyapi.com/xxx',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'}
htmltxt = requests.get(res_url, headers=headers).text
Find links to images in HTML
html = BeautifulSoup(htmltxt, 'html.parser')
for link in html.find_all('img', {'title': 'Look at the beauty'}):
    # print(link.get('src'))
    srcLink = link.get('src')
Download the pictures
# 'wb' means to open a file in binary format for writing only. If the file already exists, the file is opened and edited from the beginning, that is, the original content is deleted. If the file does not exist, create a new one. Generally used for non-text files such as pictures.
with open('./pic/' + os.path.basename(srcLink), 'wb') as file:
file.write(requests.get(srcLink).content)
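The filename is taken from the last segment of the URL with os.path.basename. An offline sketch of just that step (the `save_bytes` helper and the stand-in bytes are my own, for illustration; the article assumes ./pic already exists):

```python
import os
import tempfile

def save_bytes(src_link, content, out_dir):
    """Write image bytes to out_dir, naming the file after the URL's last segment."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, os.path.basename(src_link))
    with open(path, 'wb') as f:               # binary write, as in the article
        f.write(content)
    return path

# Offline demo: stand-in bytes play the role of requests.get(srcLink).content
out = tempfile.mkdtemp()
p = save_bytes('https://qiniu.easyapi.com/photo/girl106.jpg', b'fake-bytes', out)
print(os.path.basename(p))  # girl106.jpg
```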
The complete code
Web images are random, so we loop 1000 times to get and download images. Complete code:
import requests
from bs4 import BeautifulSoup
import os
index = 0
headers = {'referer': 'https://www.easyapi.com/xxx/service',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'}

# Save images
def save_jpg(res_url):
    global index
    html = BeautifulSoup(requests.get(res_url, headers=headers).text, 'html.parser')
    for link in html.find_all('img', {'title': 'Look at the beauty'}):
        print('./pic/' + os.path.basename(link.get('src')))
        with open('./pic/' + os.path.basename(link.get('src')), 'wb') as jpg:
            jpg.write(requests.get(link.get('src')).content)
        print("Grabbing image " + str(index))
        index += 1

if __name__ == '__main__':
    url = 'https://www.easyapi.com/xxx/service'
    # No need to loop exactly 1000 times. By printing the link you can see the image
    # name is xxx/girl(number).jpg, so one optimization is to skip fetching the HTML
    # and construct the image links directly.
    for i in range(1000):
        save_jpg(url)
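Since the image names follow the pattern girl(number).jpg, repeated refreshes will serve duplicates. One small optimization in the direction the comment above suggests: skip any basename we have already saved (the set-based `should_download` helper is my addition, not from the original):

```python
import os

def should_download(src_link, seen):
    """Return True only the first time a given image name shows up."""
    name = os.path.basename(src_link)
    if name in seen:
        return False
    seen.add(name)
    return True

seen = set()
links = ['https://qiniu.easyapi.com/photo/girl106.jpg',
         'https://qiniu.easyapi.com/photo/girl3.jpg',
         'https://qiniu.easyapi.com/photo/girl106.jpg']  # repeat
results = [should_download(l, seen) for l in links]
print(results)  # [True, True, False]
```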
The result of running it:
With this skill, you can crawl more than just pictures.
If you found it helpful, please give it a thumbs up!