Python Crawler Primer 64-100: Countering the Anti-Crawling of a Textbook-Level Site - Autohome, Font Anti-Crawling II

About the website

The development team behind this site is clearly strong on the front end. I started writing this post on April 19, 2019, and there is no guarantee that the code here will still work by the end of the month.

There are thousands of articles on CSDN about crawling Autohome, and that is the interesting thing about crawlers: once one is finished, you never know whether it will still work next time, so there is always something new to write. Hopefully today's post will teach you one more anti-crawling technique.

The page to crawl today

Car.autohome.com.cn/config/seri… All we need to do is crawl the car configuration data.

The specific data are as follows

Display of anti-crawl measures

Source file data

Brake /<span class='hs_kw86_baikeIl'></span> Safety system

Page display data

Crawl for key information

First we want to fetch the key information in the page source, which at least confirms that the data can be crawled. Getting it is simple with the requests module:

import requests


def get_html():
    # Car configuration page for series 59
    url = "https://car.autohome.com.cn/config/series/59.html#pvareaid=3454437"
    headers = {
        "User-agent": "Your browser UA"  # replace with a real browser UA string
    }
    with requests.get(url=url, headers=headers, timeout=3) as res:
        html = res.content.decode("utf-8")

    return html

Look for key factors

Find key points in the HTML page:

var config
var levelId
var keyLink
var bag
var color
var innerColor
var option

Once you have found these variables, focus on them, because they hold the data. It can be pulled out with simple regular expressions:

import re


def get_detail(html):
    # Non-greedy match from each variable name up to the first "};"
    config = re.search("var config = (.*?)};", html, re.S)
    option = re.search("var option = (.*?)};", html, re.S)
    print(config, option)

The output


>python e:/python/demo.py
<re.Match object; span=(167291, 233943), match='var config = {"message":"<span class=\'hs_kw50_co>

>python e:/python/demo.py
<re.Match object; span=(167291, 233943), match='var config = {"message":"<span class=\'hs_kw50_co> <re.Match object; span=(233952, 442342), match='var option = {"message":"<span class=\'hs_kw16_op>

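The match objects above still hold raw text. A minimal sketch of turning them into Python dictionaries, assuming each variable is assigned a single JSON object that ends at the first `};`. The helper name `extract_js_vars`, the sample HTML, and its `returncode` field are made up for illustration and are not taken from the real page:

```python
import json
import re


def extract_js_vars(html, names=("config", "option")):
    """Return {name: parsed JSON object} for every `var name = {...};` found."""
    found = {}
    for name in names:
        # Capture only the object literal, without the "var name = " prefix
        match = re.search(r"var %s = (\{.*?\});" % re.escape(name), html, re.S)
        if match:
            found[name] = json.loads(match.group(1))
    return found


# Made-up sample in the same shape as the output shown above
sample_html = """
<script>
var config = {"message":"<span class='hs_kw50_configfH'></span>","returncode":"0"};
var option = {"message":"<span class='hs_kw16_optionZl'></span>","returncode":"0"};
</script>
"""

data = extract_js_vars(sample_html)
print(sorted(data))
print(data["config"]["message"])
```

On a real page the non-greedy match could stop too early if a string value happens to contain `};`, in which case `json.loads` would raise an error rather than return wrong data.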

Handling car parameters

Match the data with re.search, then call group(0) to get the matched text:

def get_detail(html):
    # Non-greedy match from each variable name up to the first "};"
    config = re.search("var config = (.*?)};", html, re.S)
    option = re.search("var option = (.*?)};", html, re.S)

    # Handle car parameters: join both matches into one string
    car_info = ""
    if config and option:
        car_info = car_info + config.group(0) + option.group(0)

    print(car_info)

Getting the data is not the end of the job. It is obfuscated and has to be decoded. Looking further through the page source, you will find a strange block of JS. Ignore it for now, but keep it in mind.
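Before decoding, it helps to know which placeholder classes actually occur in the data. A minimal sketch, assuming the placeholders are always empty spans with a single-quoted `hs_kw` class as in the samples above. The `car_info` fragment and the helper name `find_placeholder_classes` are made up for illustration:

```python
import re
from collections import Counter

# Made-up fragment in the same shape as the obfuscated data
car_info = (
    "Brake /<span class='hs_kw86_baikeIl'></span> Safety system, "
    "<span class='hs_kw28_configfH'></span> fuel consumption, "
    "<span class='hs_kw86_baikeIl'></span>"
)


def find_placeholder_classes(text):
    """Count each hs_kw class used by an empty span in `text`."""
    classes = re.findall(r"<span class='(hs_kw\d+_\w+)'></span>", text)
    return Counter(classes)


counts = find_placeholder_classes(car_info)
for class_name, count in counts.most_common():
    print(class_name, count)
```

Each distinct class in the result is one keyword that still has to be recovered.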

Keyword cracking

<span class="hs_kw28_configfH"></span>

hs_kw<number>_configfH is the class of a span.

In the developer tools, select the ::before pseudo-element that follows the span.

It supplies the hidden text, here "measured", through this rule:

.hs_kw28_configfH::before
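To make the idea concrete, here is a minimal sketch of reading one such rule back into a class name and its keyword. It assumes the rule text has the form `.class::before { content:"..." }`; the sample rule, its keyword, and the helper name `parse_before_rule` are illustrative stand-ins:

```python
import re

# Made-up rule text; the keyword is a stand-in for the real hidden text
rule = '.hs_kw28_configfH::before { content:"measured" }'


def parse_before_rule(rule_text):
    """Return (class_name, content) for one ::before rule, or None if it does not match."""
    match = re.search(
        r'\.(hs_kw\d+_\w+)::before\s*\{\s*content:\s*"(.*?)"', rule_text
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_before_rule(rule))
```

Returning None for anything that is not a matching rule keeps unrelated CSS from being mistaken for a keyword.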

Let’s do a global search

Format the HTML source code and search it for hs_kw to find the key function:

                function $GetClassName$($index$) {
                    return '.hs_kw' + $index$ + '_baikeCt';
                }
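The function simply builds a CSS selector from an index. A Python mirror of it, as a sketch only: the `_baikeCt` suffix is the one in the JS above, while other suffixes such as `_configfH` and `_optionZl` appear elsewhere in the page data, so the hypothetical helper `get_class_name` takes the suffix as a parameter:

```python
def get_class_name(index, suffix="_baikeCt"):
    """Python mirror of the page's $GetClassName$: '.hs_kw' + index + suffix."""
    return ".hs_kw" + str(index) + suffix


print(get_class_name(86))
print(get_class_name(7, "_optionZl"))
```

This shows why the class names look random but are not: the number is just the position of the keyword in a list that the JS decodes.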

This function comes from the JS snippet we set aside earlier. Copy all of that JS, create a new snippet in the Sources panel of the developer tools, and run it with Ctrl+Enter.

Use the parameters to find the core substitution method.

Next we do the replacement, which requires Selenium.

The core code is below. The main notes are written as comments in the code, which I hope will help you follow it.

import re

from selenium import webdriver


def write_html(js_list, car_info):
    # Minimal fake DOM for running the page's JS. This was the hardest part to
    # crack and very time-consuming; it is based on code shared by experts online.
    DOM = ("var rules = '2';"
           "var document = {};"
           "function getRules(){return rules}"
           "document.createElement = function() {"
           " return {"
           " sheet: {"
           " insertRule: function(rule, i) {"
           " if (rules.length == 0) {"
           " rules = rule;"
           " } else {"
           " rules = rules + '#' + rule;"
           "}"
           "}"
           "}"
           "}"
           "};"
           "document.querySelectorAll = function() {"
           " return {};"
           "};"
           "document.head = {};"
           "document.head.appendChild = function() {};"

           "var window = {};"
           "window.decodeURIComponent = decodeURIComponent;")

    # Append the JS snippets taken from the page
    for item in js_list:
        DOM = DOM + item
    # The opening markup was lost from the original post; this is a minimal
    # reconstruction that matches the closing tags used below.
    html_type = "<html><head><meta charset='utf-8'></head><body><script type='text/javascript'>"

    # Splice everything into a working web page
    js = html_type + DOM + " document.write(rules)</script></body></html>"
    # Mode "w" overwrites any demo.html left over from a previous run
    with open("./demo.html", "w", encoding="utf-8") as f:
        f.write(js)

    # Use Selenium to render the page and read the rules back.
    # Note: PhantomJS support and find_element_by_tag_name are gone from current
    # Selenium 4 releases, so these calls need an older Selenium version.
    driver = webdriver.PhantomJS()
    driver.get("./demo.html")
    # Read the body section
    text = driver.find_element_by_tag_name('body').text
    # Match all span tags in the vehicle parameters
    span_list = re.findall("<span(.*?)></span>", car_info)  # car_info is the string concatenated above

    # Replace the span tags with the keywords found in text
    for span in span_list:
        # Extract the class name, e.g. hs_kw7_optionZl
        info = re.search("'(.*?)'", span)
        if info:
            class_info = str(info.group(1)) + "::before { content:(.*?) }"  # builds hs_kw7_optionZl::before { content:(.*?) }
            content = re.search(class_info, text).group(1)   # matched content, e.g. "measured", "fuel consumption", "quality guarantee"

            car_info = car_info.replace(str("<span class='" + info.group(1) + "'></span>"),
                                        re.search("\"(.*?)\"", content).group(1))
    print(car_info)
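The replacement step itself does not depend on the browser. A minimal sketch of it in isolation, assuming the class-to-keyword pairs have already been read back from the ::before rules. The `keywords` mapping, the sample `car_info`, and the helper name `restore_keywords` are all made up for illustration:

```python
import re

# Hypothetical mapping of class name -> keyword; both entries are made up
keywords = {"hs_kw28_configfH": "measured", "hs_kw7_optionZl": "fuel consumption"}

car_info = (
    "<span class='hs_kw28_configfH'></span> "
    "<span class='hs_kw7_optionZl'></span> / "
    "<span class='hs_kw99_unknownXx'></span>"
)

SPAN_RE = re.compile(r"<span class='(hs_kw\d+_\w+)'></span>")


def restore_keywords(text, mapping):
    """Replace each placeholder span with its keyword; leave unknown ones untouched."""
    def swap(match):
        # Fall back to the original span when the class has no known keyword
        return mapping.get(match.group(1), match.group(0))
    return SPAN_RE.sub(swap, text)


restored = restore_keywords(car_info, keywords)
print(restored)
```

Using one `re.sub` pass with a callback avoids a separate `str.replace` call per span, and a class with no matching rule is left in place instead of raising an error on a failed search.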

The results

Saving to the database

The remaining step is data persistence. With the data in hand, the rest is relatively simple, and I trust you can handle it yourself.
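For readers who want a starting point, here is one possible sketch using the standard-library sqlite3 module. The table name `car_params`, its columns, the helper name `save_params`, and the sample values are all hypothetical; only the series id 59 comes from the URL used earlier:

```python
import sqlite3


def save_params(conn, series_id, params):
    """Store (name, value) pairs for one car series; table and columns are hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS car_params ("
        "series_id INTEGER, name TEXT, value TEXT, "
        "PRIMARY KEY (series_id, name))"
    )
    # INSERT OR REPLACE lets a repeated crawl update existing rows
    conn.executemany(
        "INSERT OR REPLACE INTO car_params (series_id, name, value) VALUES (?, ?, ?)",
        [(series_id, name, value) for name, value in params.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
save_params(conn, 59, {"measured fuel consumption": "6.5", "quality guarantee": "3 years"})
rows = conn.execute("SELECT name, value FROM car_params ORDER BY name").fetchall()
print(rows)
```

The composite primary key means re-running the crawler for the same series overwrites old values rather than duplicating them.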

Small extension: format JS

When you run into minified JS like this, paste it straight into a formatting tool:

Tool.oschina.net/codeformat/…

Once the format is complete, the code is readable

Summary

Autohome uses CSS to hide part of the real text. To solve it, first look for the class, then locate the JS that generates it and work out its encryption rules.

Follow public account: non-undergraduate programmer

After following, send "car" to get the source code.
