This is the 10th day of my participation in the Gwen Challenge.

Therefore the sage handles affairs without striving and teaches without words; the ten thousand things arise and he does not turn them away; he acts but does not presume, succeeds but does not dwell on his success. And because he does not dwell on it, it never leaves him.

There is often a need to crawl some data from a website. Since Python has the richest ecosystem of packages for this, I tried crawling with Python first, and this article is the result.

With Python, you can crawl not only data but also images.

I suggest keeping your crawling within the bounds of the law; if you cross the line, you are the one who carries the blame.

Basic Python Configuration

Install pip

pip makes it easy to install other Python packages by name. pip is already bundled with Python 2 >= 2.7.9 and Python 3 >= 3.4. You can check whether pip is installed with the following command:

python -m pip --version
# output: pip 18.0 from C:\Users\lenovo\AppData\Local\Programs\Python\Python36\lib\site-packages\pip (python 3.6)

If not, you can install it by downloading get-pip.py and running the following command:

python get-pip.py

With pip we can install other packages. BeautifulSoup, which is used below, requires the bs4 package:

pip3 install bs4

Common crawler packages

requests

Requests is a handy package for handling URL resources.

import requests

r = requests.get('https://juejin.cn')
print(r)
print(r.status_code)
print(r.text)

Output result:

<Response [200]>
200
<!doctype html> <html data-n-head-ssr lang="zh" data-n-head="%7B%22lang%22:%7B%22ssr%22:%22zh%22%7D%7D"><meta data-n-head="ssr" charset="utf-8"><meta data-n-head="ssr" name="viewport" content="width=device-width, initial-scale=1, user-scalable=no, viewport-fit=cover"><meta data-n-head="ssr" name="apple-itu......

With Requests we can add headers to the request, as well as cookie information for sites that require validation. For detailed documentation, see: Requests: HTTP for Humans.

A few common examples:

r = requests.get('https://xxx', auth=('user', 'pass'))
r = requests.post('https://xxxx', data={'key': 'value'})
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)
print(r.text)
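The headers and cookies mentioned above can also be inspected without touching the network, by preparing a request through a Session. A minimal sketch — the URL, user-agent, and cookie values here are placeholders I made up, not anything from the original site:

```python
import requests

# Placeholder headers and cookies for illustration only.
session = requests.Session()
session.headers.update({'user-agent': 'my-crawler/1.0'})
session.cookies.set('sessionid', 'abc123')

req = requests.Request('GET', 'https://example.com/page',
                       headers={'referer': 'https://example.com'})
# prepare_request merges the session's headers and cookies into the request
prepared = session.prepare_request(req)
print(prepared.headers['user-agent'])   # my-crawler/1.0
print(prepared.headers['referer'])      # https://example.com
print(prepared.headers['Cookie'])       # sessionid=abc123
```

This is handy for debugging: you can see exactly what the server will receive before you send anything.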

BeautifulSoup

Using Beautiful Soup makes it easy to extract data from HTML.

The official Chinese documentation: beautifulsoup.readthedocs.io/zh_CN/v4.4….

A simple usage example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")
soup = BeautifulSoup("<html>data</html>", "html.parser")

# The way data is viewed
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Downloading files with the open function

open is a built-in Python function that opens a file and returns a file object. The most common parameters are file and mode. The complete signature is:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
""" Parameter description: file: required, relative or absolute file path. Mode: optional, file opening mode: buffering: set buffering Encoding: common utf8 Errors: error level newline: distinguish newline closefd: passed file parameter type opener: """
Copy the code
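As a quick sketch of how the mode and encoding parameters interact, here is a throwaway example using a temporary file:

```python
import os
import tempfile

# Throwaway file to demonstrate common open() modes.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# 'w': create (or truncate) a text file; pass the encoding explicitly.
with open(path, 'w', encoding='utf-8') as f:
    f.write('hello')

# 'r': read the text back.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()
print(text)  # hello

# 'wb': binary write-only mode, used for non-text files such as images.
with open(path, 'wb') as f:
    f.write(b'\x89PNG')

# 'rb': binary read mode.
with open(path, 'rb') as f:
    data = f.read()
print(data)  # b'\x89PNG'
```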

Starting the image-crawling journey

The image link: www.easyapi.com/xxx

This link features:

  1. Simple, just one picture
  2. The link remains the same, but the image changes after refreshing

Webpage HTML code and page display are shown as follows:

Let's look at the key part of the page source:

<head>...</head>
<body id="404">
<div class="mheight wp">
    <div class="con_nofound">
        <div>
            <!-- The key is how to get the img tag and its src content -->
            <p><img src="https://qiniu.easyapi.com/photo/girl106.jpg" title="Look at the beauty." width="600"></p>
        </div>
    </div>
</div>...</body>

Refresh the page and you will find that the number in the img src path changes, but the title stays the same.

So we can use the title attribute to locate the img tag and read its src.

Getting HTML content

Get HTML content using Requests:

headers = {'referer': 'https://www.easyapi.com/xxx',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'}
htmltxt = requests.get(res_url, headers=headers).text

Find links to images in HTML

html = BeautifulSoup(htmltxt, 'html.parser')
for link in html.find_all('img', {'title': 'Look at the beauty.'}):
    # print(link.get('src'))
    srcLink = link.get('src')

Download the pictures

# 'wb' opens the file in binary write-only mode. If the file already exists, its
# contents are truncated; if not, it is created. Used for non-text files such as images.
with open('./pic/' + os.path.basename(srcLink), 'wb') as file:
     file.write(requests.get(srcLink).content)
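The file name passed to open above comes from os.path.basename, which keeps only the last component of a path, so the downloaded file is named after the image itself:

```python
import os

# os.path.basename keeps only the final path component of the URL,
# so the local file is named after the remote image.
src = 'https://qiniu.easyapi.com/photo/girl106.jpg'
name = os.path.basename(src)
print(name)  # girl106.jpg
```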

The complete code

The image on the page is random, so we loop 1,000 times to fetch and download images. Complete code:

import requests
from bs4 import BeautifulSoup
import os

index = 0
headers = {'referer': 'https://www.easyapi.com/xxx/service',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'}

# Save images
def save_jpg(res_url):
    global index
    html = BeautifulSoup(requests.get(res_url, headers=headers).text, 'html.parser')
    for link in html.find_all('img', {'title': 'Look at the beauty.'}):
        print('./pic/' + os.path.basename(link.get('src')))
        with open('./pic/' + os.path.basename(link.get('src')), 'wb') as jpg:
            jpg.write(requests.get(link.get('src')).content)
        print('Grabbing image ' + str(index))
        index += 1

if __name__ == '__main__':
    url = 'https://www.easyapi.com/xxx/service'
    # There is no need to loop exactly 1000 times. By printing the links you can
    # see the image name follows the pattern xxx/girl<number>.jpg, so one
    # optimization is to skip fetching the HTML and build the image links directly.
    for i in range(0, 1000):
        save_jpg(url)
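The comment in the code hints at an optimization: since the observed links follow the pattern .../girl&lt;number&gt;.jpg, one could build the image URLs directly instead of fetching and parsing the HTML on every iteration. A sketch, assuming the prefix and numbering really are stable (the prefix and range below are guesses based on the links seen above):

```python
def build_image_urls(prefix, start, end):
    """Build direct image URLs of the assumed form <prefix>/girl<n>.jpg."""
    return ['{}/girl{}.jpg'.format(prefix, n) for n in range(start, end)]

# Hypothetical prefix and numbering range, inferred from the observed links.
urls = build_image_urls('https://qiniu.easyapi.com/photo', 106, 109)
print(urls[0])  # https://qiniu.easyapi.com/photo/girl106.jpg
```

If the assumption holds, each image then costs a single requests.get instead of two.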

Running result:

With this skill, you can crawl more than just images.

If you found this helpful, please give it a thumbs up!