This is the 27th day of my participation in the November Gwen Challenge. For event details, see: The Last Gwen Challenge of 2021.

Please note that this case focuses on collecting basic franchise-sales data, with the beauty industry as the chosen vertical.

This case uses lxml together with cssselect for collection, focusing on the cssselect selector.

Target site analysis

The target of this crawl is http://www.1637.com/. The site has multiple categories, which are stored in a list in advance for later expansion. It turns out that the first-level industry filter can be set to "no limit", which returns every category; based on this, we first capture all the data locally and then filter out the franchise data related to the beauty industry.

The amount of data and pages to be captured are shown in the figure below.

We grab the data the old-fashioned way: first save the HTML pages locally, then process them in a second pass.

Technical points used

Data is extracted with lxml + cssselect. Before using cssselect, install the library with pip install cssselect.

After installation, there are two ways to use it in code. The first uses the CSSSelector class, as follows:

import requests
from lxml import etree
from lxml.cssselect import CSSSelector

res = requests.get('http://xiangmu.1637.com/p1.html')
# Similar to regular expressions: construct a CSS selector object first
sel = CSSSelector('#div_total>em', translator="html")
# Then pass in the Element object
element = sel(etree.HTML(res.text))
print(element[0].text)

If you would rather not use this class, you can call the cssselect method directly on an Element object, as in the following code:

# element is the root node returned by etree.HTML(res.text)
div_total = element.cssselect('#div_total>em')

Either way, the #div_total>em inside the parentheses is the most important thing to learn: it is a CSS selector. If you are familiar with front-end work, it is easy to pick up.
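One practical difference between the two approaches: the CSSSelector class translates the CSS expression into XPath once, when the object is constructed, so building it once and reusing it inside a loop is generally faster than calling cssselect on an element repeatedly, which redoes that translation on each call.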

To explain the CSS selector, assume the following HTML code:

<div class="totalnum" id="div_total">A total of<em>57041</em>A project</div>
Copy the code

Here class and id are attribute values of the HTML tag. In general, a class can appear many times in a page, while an id should be unique.

To select this div tag with a CSS selector, use #div_total or .totalnum. Note that selecting by id uses the # symbol, and selecting by class uses the . symbol. Tags sometimes carry other attributes too, and the CSS selector can match those as well; modify the HTML code as shown below.

<div class=" totalNum "id="div_total" custom=" ABC ">Copy the code

Write the following test code, and note the CSS selector used in the CSSSelector call: div[custom="abc"] em.

sel = CSSSelector('div[custom="abc"] em', translator="html")
element = sel(etree.HTML('
      
)) print(element[0].text) Copy the code

Note the combinator: #div_total>em selects only em elements that are direct children of the element with id="div_total". Change it to #div_total em (a space, the descendant combinator) to select em elements among all descendants of that element.
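As a minimal sketch of the difference (the nested HTML below is made up for illustration), the child combinator misses an em wrapped inside an extra span, while the descendant combinator finds it:

from lxml import etree
from lxml.cssselect import CSSSelector

# Hypothetical HTML: one em is a direct child, another is nested one level deeper
html = etree.HTML('<div id="div_total"><em>direct</em><span><em>nested</em></span></div>')

child_sel = CSSSelector('#div_total>em', translator="html")
descendant_sel = CSSSelector('#div_total em', translator="html")

print([e.text for e in child_sel(html)])       # ['direct']
print([e.text for e in descendant_sel(html)])  # ['direct', 'nested']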

With that in mind, you can easily write your own cssselect code.

Coding time

The approach in this case is to grab the HTML pages locally and then parse the local files, so the collection code is relatively simple; only the total page count needs to be fetched dynamically. Focus on the internal logic of the get_pagesize function in the code below.

import requests
from lxml.html import etree
import random
import time


class SSS:
    def __init__(self):
        self.start_url = 'http://xiangmu.1637.com/p1.html'
        self.url_format = 'http://xiangmu.1637.com/p{}.html'
        self.session = requests.Session()
        self.headers = self.get_headers()

    def get_headers(self):
        # This function comes from an earlier blog post in this series
        uas = [
            "Mozilla / 5.0 (compatible; Baiduspider / 2.0; +http://www.baidu.com/search/spider.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers

    def get_pagesize(self):

        with self.session.get(url=self.start_url, headers=self.headers, timeout=5) as res:
            if res.text:
                element = etree.HTML(res.text)
                # Select the em tag via the cssselect method
                div_total = element.cssselect('#div_total>em')
                # div_total[0].text is the text inside the em tag; convert it to an integer
                total = int(div_total[0].text)
                # Compute the page count (10 items per page)
                pagesize = int(total / 10) + 1
                # print(pagesize)
                # If the total divides evenly by 10, don't add an extra page
                if total % 10 == 0:
                    pagesize = int(total / 10)

                return pagesize
            else:
                return None

    def get_detail(self, page):
        with self.session.get(url=self.url_format.format(page), headers=self.headers, timeout=5) as res:
            if res.text:
                with open(f"./{page}.html", "w+", encoding="utf-8") as f:
                    f.write(res.text)
            else:
                # If there is no data, request again
                print(f"Page {page} request exception, requesting again.")
                self.get_detail(page)

    def run(self):
        pagesize = self.get_pagesize()
        # pagesize = 20  # uncomment to test a small batch first
        for page in range(1, pagesize + 1):
            self.get_detail(page)
            time.sleep(2)
            print(f"Page {page} grabbed!")


if __name__ == '__main__':
    s = SSS()
    s.run()
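As a side note, the page-count arithmetic in get_pagesize can be written more compactly with math.ceil; a minimal equivalent sketch:

import math

# 10 items per page; ceil covers the "exactly divisible" case automatically
def page_count(total, per_page=10):
    return math.ceil(total / per_page)

print(page_count(57041))  # 5705
print(page_count(57040))  # 5704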

Testing shows that without a longer delay between requests, the IP is easily rate-limited and no data can be fetched; this can be solved by adding a proxy. If you are only interested in the data, you can download the HTML archive directly at the download address; the decompression password is cajie.
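If you go the proxy route, requests accepts a proxies mapping; a minimal sketch, where the proxy address is a placeholder to replace with your own:

import requests

# Hypothetical proxy endpoint; substitute a real proxy of your own
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
res = requests.get("http://xiangmu.1637.com/p1.html", proxies=proxies, timeout=5)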

Secondary data extraction

Once all the static HTML has been crawled locally, extracting the page data becomes easy; after all, there is no anti-crawl problem left to solve.

The core technical point here is reading the files and extracting fixed data values through cssselect.

Using the developer tools, locate the tag node that holds the target data; the content to extract sits under class='xminfo'.

The following code shows the core of the data extraction. The format function is the main thing to study; because the data is stored as a CSV file, the remove_character function is needed to handle \n and English commas.

# Data extraction class
class Analysis:
    def __init__(self):
        pass

    # Remove special characters
    def remove_character(self, origin_str):
        if origin_str is None:
            return
        origin_str = origin_str.replace('\n', '')
        # Replace English commas so they don't break the CSV columns
        origin_str = origin_str.replace(',', '，')
        return origin_str

    def format(self, text):
        html = etree.HTML(text)
        # Get all the project-area divs
        div_xminfos = html.cssselect('div.xminfo')
        for xm in div_xminfos:
            adtexts = self.remove_character(xm.cssselect('a.adtxt')[0].text)  # Get the ad copy
            url = xm.cssselect('a.adtxt')[0].attrib.get('href')  # Get the detail page address

            brands = xm.cssselect(':nth-child(2)>:nth-child(2)')[1].text  # Get the brand name
            categorys = xm.cssselect(':nth-child(2)>:nth-child(3)>a')[0].text  # Get the first-level category, e.g. "dining"
            types = ' '
            try:
                # There may be no secondary category here
                types = xm.cssselect(':nth-child(2)>:nth-child(3)>a')[1].text  # Get the secondary category, e.g. "snacks"
            except Exception as e:
                pass
            creation = xm.cssselect(':nth-child(2)>:nth-child(6)')[0].text  # Brand founding time
            franchise = xm.cssselect(':nth-child(2)>:nth-child(9)')[0].text  # Number of franchised stores
            company = xm.cssselect(':nth-child(3)>span>a')[0].text  # Company name

            introduce = self.remove_character(xm.cssselect(':nth-child(4)>span')[0].text)  # Brand introduction
            pros = self.remove_character(xm.cssselect(':nth-child(5)>:nth-child(2)')[0].text)  # Products/business introduction
            investment = xm.cssselect(':nth-child(5)>:nth-child(4)>em')[0].text  # Investment amount
            # Concatenate the CSV row
            long_str = f"{adtexts},{categorys},{types},{brands},{creation},{franchise},{company},{introduce},{pros},{investment},{url}"
            with open("./franchise_data.csv", "a+", encoding="utf-8") as f:
                f.write(long_str + "\n")

    def run(self):
        for i in range(1, 5704):
            # Concatenate the local file path and read the saved page
            with open(f"./{i}.html", "r", encoding="utf-8") as f:
                text = f.read()
                self.format(text)


if __name__ == '__main__':
    # Collect the data: uncomment whichever part you want to run
    # s = SSS()
    # s.run()
    # Extract data
    a = Analysis()
    a.run()
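Writing rows by hand like this works, but Python's built-in csv module quotes fields automatically, which would avoid the comma-replacement workaround entirely; a minimal sketch with a made-up row:

import csv

# Hypothetical row; in practice these would be the fields collected in format()
row = ["ad text", "dining", "snacks", "brand", "2001", "120", "company"]
with open("./franchise_data.csv", "a+", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(row)  # commas inside fields get quoted for us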

:nth-child(2) is used repeatedly above to extract HTML tags. This selector matches the nth child of its parent element regardless of the element's type, so you only need to work out the exact position of the element.
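A minimal sketch of that position-not-type behavior, on made-up HTML:

from lxml import etree

# Hypothetical markup: the second child is a span, the third is an a tag
html = etree.HTML('<div><em>1st</em><span>2nd</span><a>3rd</a></div>')
print(html.cssselect('div>:nth-child(2)')[0].text)  # 2nd (a span)
print(html.cssselect('div>:nth-child(3)')[0].text)  # 3rd (an a tag)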

Collection time

Code download address: codechina.csdn.net/hihell/pyth… Could you give it a Star?

You've read all the way here. No comment, no like, no bookmark?

Today is day 200/200 of continuous writing. You can follow me, like, comment, and bookmark.