First, analyze the web page
Take the classic Douban Movie Top 250 list as an example. Each movie's information sits in an li tag inside the ol tag whose class is grid_view, so we can grab all the li tags and traverse them to extract every movie.
Look at how the URL changes from page to page:

Page 1: https://movie.douban.com/top250?start=0&filter=
Page 2: https://movie.douban.com/top250?start=25&filter=
Page 3: https://movie.douban.com/top250?start=50&filter=
Page 10: https://movie.douban.com/top250?start=225&filter=

The start parameter controls paging: start = 25 * (page - 1).
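As a minimal sketch of this paging rule, the ten list-page URLs can be generated directly from the formula:

```python
# Generate the 10 Douban Top 250 list-page URLs from the rule start = 25 * (page - 1)
for page in range(1, 11):
    url = f"https://movie.douban.com/top250?start={25 * (page - 1)}&filter="
    print(url)
```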
This article uses regular expressions, BeautifulSoup, PyQuery and XPath in turn to parse the pages, extract the data, and save the Douban Movie Top 250 information locally.
Second, regular expressions
A regular expression is a special sequence of characters that makes it easy to check whether a string matches a pattern. It is commonly used for data cleaning, and in a crawler it can also be used to match the desired data directly out of a page's source text.
re.findall
- Finds all substrings matched by the regular expression in the string and returns them as a list; if there is no match, an empty list is returned.
- Note: match and search return only one match; findall returns all matches.
- On a compiled pattern, the signature is findall(string[, pos[, endpos]]).
- string: the string to be matched. pos: optional, the position where matching starts, default 0. endpos: optional, the position where matching ends, default the length of the string. (A short sketch of pos and endpos follows this list.)
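A minimal sketch (not from the original article) of how pos and endpos restrict the search range on a compiled pattern:

```python
import re

pattern = re.compile(r"\d+")          # match runs of digits
s = "a1 b22 c333"
print(pattern.findall(s))             # ['1', '22', '333']
print(pattern.findall(s, 3))          # ['22', '333']  -- start matching at index 3
print(pattern.findall(s, 3, 6))       # ['22']         -- only consider s[3:6]
```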
The following is an example:
```python
import re

text = """
<div class="box picblock col3" style="width:186px; height:264px">
<img src2="http://pic2.sc.chinaz.com/Files/pic/pic9/202007/apic26584_s.jpg" nfsjgnalt="Shanshui landscape photography pictures">
<a target="_blank" href="http://sc.chinaz.com/tupian/200509002684.htm"
<img src2="http://pic2.sc.chinaz.com/Files/pic/pic9/202007/apic26518_s.jpg" enrberonbialt="Mountain, lake and landscape pictures">
<a target="_blank" href="http://sc.chinaz.com/tupian/200509002684.htm"
<img src2="http://pic2.sc.chinaz.com/Files/pic/pic9/202006/apic26029_s.jpg" woenigoigniefnirneialt="Landscape pictures of tourist attractions">
<a target="_blank" href="http://sc.chinaz.com/tupian/200509002684.htm"
"""

pattern = re.compile(r'\d+')   # match runs of digits
result1 = pattern.findall('me 123 rich 456 money 1000000000000')
print(result1)

# The non-greedy groups grab the image URL and the alt text;
# the .* in the middle skips over the junk characters before "alt".
img_info = re.findall('<img src2="(.*?)".*alt="(.*?)">', text)
for src, alt in img_info:
    print(src, alt)
```

The output is as follows:

```
['123', '456', '1000000000000']
http://pic2.sc.chinaz.com/Files/pic/pic9/202007/apic26584_s.jpg Shanshui landscape photography pictures
http://pic2.sc.chinaz.com/Files/pic/pic9/202007/apic26518_s.jpg Mountain, lake and landscape pictures
http://pic2.sc.chinaz.com/Files/pic/pic9/202006/apic26029_s.jpg Landscape pictures of tourist attractions
```
The code is as follows:
```python
# -*- coding: UTF-8 -*-
"""
@Author: Ye Tingyun
@CSDN: https://yetingyun.blog.csdn.net/
"""
import re
import logging

import requests
from pandas import DataFrame
from fake_useragent import UserAgent

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')


def random_ua():
    headers = {
        "Accept-Encoding": "gzip",
        "Connection": "keep-alive",
        "User-Agent": ua.random
    }
    return headers


def scrape_html(url):
    resp = requests.get(url, headers=random_ua())
    # print(resp.status_code, type(resp.status_code))
    if resp.status_code == 200:
        return resp.text
    else:
        logging.info('Web page request failed')


def get_data(page):
    url = f"https://movie.douban.com/top250?start={25 * page}&filter="
    html_text = scrape_html(url)
    # Movie title (pattern reconstructed from the page structure; verify against the live source)
    name = re.findall('<img width="100" alt="(.*?)"', html_text)
    # Director and starring actors (pattern reconstructed from the page structure)
    director_actor = re.findall('<p class="">(.*?)<br>', html_text, re.S)
    director_actor = [item.strip() for item in director_actor]
    # Release time / region / genres sit on one line separated by "/"
    # (in the raw source the separators appear as &nbsp;/&nbsp;, adjust the pattern if needed)
    info = re.findall('(.*) / (.*) / (.*)', html_text)
    time_ = [x[0].strip() for x in info]
    area = [x[1].strip() for x in info]
    genres = [x[2].strip() for x in info]
    # Rating, number of raters and the one-sentence introduction
    rating_score = re.findall('<span class="rating_num" property="v:average">(.*)</span>', html_text)
    rating_num = re.findall('(.*?)人评价</span>', html_text)
    quote = re.findall('<span class="inq">(.*)</span>', html_text)
    data = {'Movie name': name, 'Director and stars': director_actor,
            'Show time': time_, 'Region of release': area, 'Genre': genres,
            'Score': rating_score, 'Number of raters': rating_num, 'Introduction': quote}
    df = DataFrame(data)
    if page == 0:   # write the header row only once, on the first page
        df.to_csv('movie_data2.csv', mode='a+', header=True, index=False)
    else:
        df.to_csv('movie_data2.csv', mode='a+', header=False, index=False)
    logging.info(f'page {page + 1} data has been crawled')


if __name__ == '__main__':
    for i in range(10):
        get_data(i)
```
The results are as follows:
Third, BeautifulSoup
find() and find_all() are the two methods BeautifulSoup uses to match HTML tags and attributes and extract data from the parsed page:
- find() extracts only the first piece of data that meets the requirement.
- find_all() extracts all the data that meets the requirement.
- The arguments inside find() or find_all(): the tag and the attribute can be used on their own or together, depending on what we want to extract from the page. class_ is written with an underscore to distinguish it from the class keyword in Python syntax and avoid conflicts. Besides the class attribute, other attributes such as style can also be used for matching. If a single parameter is enough to locate the element precisely, use just that one; if both a tag and an attribute are needed to pin down exactly what you are looking for, use the two together. A minimal sketch follows this list.
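Here is a small sketch of find() versus find_all() on a made-up fragment (the class names are hypothetical, not taken from the Douban page):

```python
from bs4 import BeautifulSoup

html = '''
<div class="item"><span class="title">First movie</span></div>
<div class="item"><span class="title">Second movie</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('span', class_='title').text)                     # only the first match
print([s.text for s in soup.find_all('span', class_='title')])    # every match, as a list
```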
The code is as follows:
```python
# -*- coding: UTF-8 -*-
"""
@Author: Ye Tingyun
@CSDN: https://yetingyun.blog.csdn.net/
"""
import logging

import requests
from bs4 import BeautifulSoup
import openpyxl
from fake_useragent import UserAgent

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')
wb = openpyxl.Workbook()   # create a Workbook object
sheet = wb.active          # get the active worksheet
sheet.title = "movie"      # rename the worksheet
sheet.append(["Rank", "Movie title", "Director and stars", "Show time", "Region of release",
              "Type of movie", "Score", "Number of raters", "Introduction"])


def random_ua():
    headers = {
        "Accept-Encoding": "gzip",
        "Connection": "keep-alive",
        "User-Agent": ua.random
    }
    return headers


def scrape_html(url):
    resp = requests.get(url, headers=random_ua())
    # print(resp.status_code, type(resp.status_code))
    if resp.status_code == 200:
        return resp.text
    else:
        logging.info('Web page request failed')


def get_data(page):
    global rank
    url = f"https://movie.douban.com/top250?start={25 * page}&filter="
    html_text = scrape_html(url)
    soup = BeautifulSoup(html_text, 'html.parser')
    lis = soup.find_all('div', class_='item')
    for li in lis:
        name = li.find('div', class_='hd').a.span.text
        temp = li.find('div', class_='bd').p.text.strip().split('\n')
        director_actor = temp[0]
        temp1 = temp[1].rsplit('/', 2)
        time_, area, genres = [item.strip() for item in temp1]
        quote = li.find('p', class_='quote')   # some movies have no one-sentence introduction
        if quote:
            quote = quote.span.text
        else:
            quote = None
        rating_score = li.find('span', class_='rating_num').text
        rating_num = li.find('div', class_='star').find_all('span')[-1].text
        sheet.append([rank, name, director_actor, time_, area, genres, rating_score, rating_num, quote])
        logging.info([rank, name, director_actor, time_, area, genres, rating_score, rating_num, quote])
        rank += 1


if __name__ == '__main__':
    rank = 1
    for i in range(10):
        get_data(i)
    wb.save(filename='movie_info4.xlsx')
```
The results are as follows:
Fourth, PyQuery
- Every web page has a particular structure and hierarchy, and many nodes are distinguished by an id or class, so we can use that structure and those attributes to extract information.
- PyQuery is a powerful HTML parsing library that lets you work with the DOM node structure directly and quickly extract content through node attributes.
Example: when parsing HTML text, you first need to initialize it into a PyQuery object. It can be initialized in several ways, such as passing in a string directly, passing in a URL, or passing in a filename.
```python
from pyquery import PyQuery as pq

html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li><img src="http://pic.netbian.com/uploads/allimg/210107/215736-1610027856f6ef.jpg"/></li>
        <li><img src="http://pic.netbian.com//uploads/allimg/190902/152344-1567409024af8c.jpg"/></li>
    </ul>
</div>
'''
doc = pq(html)
print(doc('li'))
```
The results are as follows:
```html
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li><img src="http://pic.netbian.com/uploads/allimg/210107/215736-1610027856f6ef.jpg"/></li>
<li><img src="http://pic.netbian.com//uploads/allimg/190902/152344-1567409024af8c.jpg"/></li>
```
We first import the PyQuery class and alias it as pq, then define an HTML string and pass it as an argument to the PyQuery class, which completes the initialization. Next, we call the initialized object with a CSS selector. In this example we pass in 'li', which selects all the li nodes.
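PyQuery can also be initialized from a URL or a local file, and the selected nodes expose their text and attributes. A minimal sketch (not from the original article; the URL and file name are only placeholders):

```python
from pyquery import PyQuery as pq

# Other ways to initialize: from a URL or from a local HTML file
# doc = pq(url='https://movie.douban.com/top250')
# doc = pq(filename='top250.html')

html = '''
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
'''
doc = pq(html)
for li in doc('li').items():                 # .items() yields each match as a PyQuery object
    print(li('a').attr('href'), li.text())   # attribute value and node text
```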
The code is as follows:
```python
# -*- coding: UTF-8 -*-
"""
@Author: Ye Tingyun
@CSDN: https://yetingyun.blog.csdn.net/
"""
import logging

import requests
from pyquery import PyQuery as pq
import openpyxl
from fake_useragent import UserAgent

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')
wb = openpyxl.Workbook()   # create a Workbook object
sheet = wb.active          # get the active worksheet
sheet.title = "movie"      # rename the worksheet
sheet.append(["Rank", "Movie title", "Director and stars", "Show time", "Region of release",
              "Type of movie", "Score", "Number of raters", "Introduction"])


def random_ua():
    headers = {
        "Accept-Encoding": "gzip",
        "Connection": "keep-alive",
        "User-Agent": ua.random
    }
    return headers


def scrape_html(url):
    resp = requests.get(url, headers=random_ua())
    # print(resp.status_code, type(resp.status_code))
    if resp.status_code == 200:
        return resp.text
    else:
        logging.info('Web page request failed')


def get_data(page):
    global rank
    url = f"https://movie.douban.com/top250?start={25 * page}&filter="
    html_text = scrape_html(url)
    doc = pq(html_text)
    lis = doc('.grid_view li')
    for li in lis.items():
        name = li('.hd a span:first-child').text()
        temp = li('.bd p:first-child').text().split('\n')
        director_actor = temp[0]
        temp1 = temp[1].rsplit('/', 2)
        time_, area, genres = [item.strip() for item in temp1]
        quote = li('.quote span').text()
        rating_score = li('.star .rating_num').text()
        rating_num = li('.star span:last-child').text()
        sheet.append([rank, name, director_actor, time_, area, genres, rating_score, rating_num, quote])
        logging.info([rank, name, director_actor, time_, area, genres, rating_score, rating_num, quote])
        rank += 1


if __name__ == '__main__':
    rank = 1
    for i in range(10):
        get_data(i)
    wb.save(filename='movie_info3.xlsx')
```
The results are as follows:
Fifth, XPath
XPath is a very practical parsing method and a foundation for learning crawlers; it will come up again with Selenium and Scrapy.
First we import the etree module from lxml, initialize the text with etree.HTML(), and then print the result. One very useful feature of lxml is that it automatically corrects HTML: in the text below the last li tag is deliberately left unclosed, yet lxml (which builds on libxml2) repairs the markup. An XPath expression is then used to extract the contents of the tags, as shown below:
```python
from lxml import etree

# Note: the last li tag is deliberately not closed -- lxml corrects it automatically.
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
html = etree.HTML(text)
result = etree.tostring(html)          # the corrected HTML
result1 = html.xpath('//li/@class')    # XPath expression: the class attribute of every li
print(result1)
print(result)
```
```
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
<html><body>
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body></html>
```
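On the same corrected tree, text() and @href pull out node text and attribute values. A minimal sketch (not part of the original article), continuing from the html element parsed above:

```python
# Continue with the `html` element parsed above
texts = html.xpath('//li/a/text()')    # text of every link
hrefs = html.xpath('//li/a/@href')     # href attribute of every link
print(texts)   # ['first item', 'second item', 'third item', 'fourth item', 'fifth item']
print(hrefs)   # ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
```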
The code is as follows:
```python
# -*- coding: UTF-8 -*-
"""
@Author: Ye Tingyun
@CSDN: https://yetingyun.blog.csdn.net/
"""
import logging

import requests
from lxml import etree
import openpyxl
from fake_useragent import UserAgent

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')
wb = openpyxl.Workbook()   # create a Workbook object
sheet = wb.active          # get the active worksheet
sheet.title = "movie"      # rename the worksheet
sheet.append(["Rank", "Movie title", "Director and stars", "Show time", "Region of release",
              "Type of movie", "Score", "Number of raters", "Introduction"])


def random_ua():
    headers = {
        "Accept-Encoding": "gzip",
        "Connection": "keep-alive",
        "User-Agent": ua.random
    }
    return headers


def scrape_html(url):
    resp = requests.get(url, headers=random_ua())
    # print(resp.status_code, type(resp.status_code))
    if resp.status_code == 200:
        return resp.text
    else:
        logging.info('Web page request failed')


def get_data(page):
    global rank
    url = f"https://movie.douban.com/top250?start={25 * page}&filter="
    html = etree.HTML(scrape_html(url))
    lis = html.xpath('//ol[@class="grid_view"]/li')   # each li tag holds one movie's basic information
    for li in lis:
        name = li.xpath('.//div[@class="hd"]/a/span[1]/text()')[0]
        director_actor = li.xpath('.//div[@class="bd"]/p/text()')[0].strip()
        info = li.xpath('.//div[@class="bd"]/p/text()')[1].strip()
        # split the "time / region / genres" line on "/"
        _info = info.split("/")
        time_, area, genres = _info[0].strip(), _info[1].strip(), _info[2].strip()
        # print(time_, area, genres)
        rating_score = li.xpath('.//div[@class="star"]/span[2]/text()')[0]
        rating_num = li.xpath('.//div[@class="star"]/span[4]/text()')[0]
        quote = li.xpath('.//p[@class="quote"]/span/text()')
        # some movies have no one-sentence introduction; check first to avoid an IndexError
        if len(quote) == 0:
            quote = None
        else:
            quote = quote[0]
        sheet.append([rank, name, director_actor, time_, area, genres, rating_score, rating_num, quote])
        logging.info([rank, name, director_actor, time_, area, genres, rating_score, rating_num, quote])
        rank += 1


if __name__ == '__main__':
    rank = 1
    for i in range(10):
        get_data(i)
    wb.save(filename='movie_info1.xlsx')
```
The results are as follows:
Sixth, summary
- If we use regular expressions to crawl web data, we match against the raw page source directly. The error rate is higher, becoming fluent with regular expressions takes effort, and the documentation has to be consulted frequently.
- In practice, most data extraction relies on the HTML structure of the page, which has many nodes and all kinds of hierarchical relationships. Consider using the XPath parser, BeautifulSoup, or PyQuery's CSS selectors to extract structured data, and regular expressions to extract unstructured data.
- XPath: can find information in XML and also supports HTML lookups; navigating through elements and attributes makes lookups efficient; it is also used when learning Selenium and Scrapy.
- BeautifulSoup: an HTML/XML parsing library (it can sit on top of parsers such as lxml or html.parser) that likewise extracts data from HTML or XML files.
- PyQuery: a strict Python implementation of the jQuery style of selection; it works with the DOM node structure directly and quickly extracts content through node attributes.
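To make the comparison concrete, here is a minimal sketch (not from the original article) that extracts the same title from one HTML fragment with all four tools:

```python
import re
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
from lxml import etree

html = '<div class="hd"><span class="title">Movie A</span></div>'

print(re.findall('<span class="title">(.*?)</span>', html)[0])               # regular expression
print(BeautifulSoup(html, 'html.parser').find('span', class_='title').text)  # BeautifulSoup
print(pq(html)('.title').text())                                             # PyQuery CSS selector
print(etree.HTML(html).xpath('//span[@class="title"]/text()')[0])            # XPath
```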
Some code can be reused for simple Web pages, like this:
```python
from fake_useragent import UserAgent

ua = UserAgent(verify_ssl=False, path='fake_useragent.json')


def random_ua():
    headers = {
        "Accept-Encoding": "gzip",
        "User-Agent": ua.random
    }
    return headers
```
The request headers are disguised and can be switched randomly; wrapping this in a function makes it easy to reuse.
```python
def scrape_html(url):
    resp = requests.get(url, headers=random_ua())
    # print(resp.status_code, type(resp.status_code))
    # print(resp.text)
    if resp.status_code == 200:
        return resp.text
    else:
        logging.info('Web page request failed')
```
Request the web page: a status code of 200 means the request succeeded, and the function returns the page source text.