About the website

The development team behind this website is clearly strong on the front end. I started writing this post on April 19, 2019, and there is no guarantee that the code will still work by the end of the month.

There are thousands of articles on CSDN about crawling autohome, but that is what makes crawlers interesting: once one is finished, you never know whether it will still work the next time, so people keep writing new ones. Hopefully today's post will help you learn one anti-crawl technique.

The page to crawl today

Car.autohome.com.cn/config/seri… All we need to do is crawl the car configuration data.

The specific data are as follows

Display of anti-crawl measures

Source file data

Brake /<span class='hs_kw86_baikeIl'></span> Safety system

Page display data

Crawl for key information

We first want to get the key information from the page source, which simply means fetching the raw data. Fetching it is easy with the requests module.

import requests


def get_html():
    url = "https://car.autohome.com.cn/config/series/59.html#pvareaid=3454437"
    headers = {
        "User-Agent": "Your browser UA"
    }
    # Fetch the page and decode the raw bytes as UTF-8
    with requests.get(url=url, headers=headers, timeout=3) as res:
        html = res.content.decode("utf-8")

    return html

Look for key factors

Find key points in the HTML page:

  • var config
  • var levelId
  • var keyLink
  • var bag
  • var color
  • var innerColor
  • var option

Once you have found these variables, focus on them. Their data can be pulled out with simple regular expressions.

import re


def get_detail(html):
    # Lazily capture each JS object up to its closing "};"
    config = re.search("var config = (.*?)};", html, re.S)
    option = re.search("var option = (.*?)};", html, re.S)
    print(config, option)

The output


>python e:/python/demo.py
<re.Match object; span=(167291, 233943), match='var config = {"message":"<span class=\'hs_kw50_co>

>python e:/python/demo.py
<re.Match object; span=(167291, 233943), match='var config = {"message":"<span class=\'hs_kw50_co> <re.Match object; span=(233952, 442342), match='var option = {"message":"<span class=\'hs_kw16_op>

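The match shown above starts with `{"message":"...`, which looks JSON-shaped, so the captured object can probably be parsed instead of being handled as a raw string. Here is a minimal sketch of that idea. The `sample_html` string, its `returncode` field, and the `extract_js_var` helper are invented for illustration; the real page is much larger and its structure may differ.

```python
import json
import re


def extract_js_var(html, name):
    """Return the object assigned by `var <name> = {...};` parsed as JSON, or None."""
    # Lazy match up to the first "};" so only this one assignment is captured.
    match = re.search(r"var " + re.escape(name) + r" = (\{.*?\});", html, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


# Invented sample page, not real autohome output.
sample_html = """
<script>
var config = {"message": "<span class='hs_kw50_configAb'></span>", "returncode": "0"};
var option = {"message": "ok", "returncode": "0"};
</script>
"""

config = extract_js_var(sample_html, "config")
print(config["message"])
```

The lazy pattern stops at the first `};`, so it would cut the object short if that character pair ever appeared inside a string value. Treat it as a starting point rather than a robust parser.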

Handling car parameters

Match the data with the regular expression's search method, then call group(0) to get the matched text.

import re


def get_detail(html):
    # Lazily capture each JS object up to its closing "};"
    config = re.search("var config = (.*?)};", html, re.S)
    option = re.search("var option = (.*?)};", html, re.S)

    # Handle car parameters: concatenate both matches into one string
    car_info = ""
    if config and option:
        car_info = car_info + config.group(0) + option.group(0)

    print(car_info)

Getting the data is not the end. It is obfuscated and has to be decoded back. Looking further through the page source, you will find a strange block of JS. There is no need to deal with it yet; just keep it in mind.

Keyword cracking

<span class="hs_kw28_configfH"></span>

hs_kw<number>_configfH is the class of the span.

I selected the ::before that follows the span.

The keyword it renders is "measured":

.hs_kw28_configfH::before
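To see how many distinct placeholders a fragment contains, the class names can be collected with one regular expression. This is a small sketch under assumptions: the `fragment` string is invented in the style of the source shown earlier, and the pattern assumes every placeholder is an empty span with a single-quoted `hs_kw<number>_<suffix>` class.

```python
import re

# Invented fragment in the style of the page source shown earlier.
fragment = (
    "Brake /<span class='hs_kw86_baikeIl'></span> Safety system, "
    "<span class='hs_kw28_configfH'></span> range, "
    "<span class='hs_kw86_baikeIl'></span>"
)

# Capture the whole class name of every empty placeholder span.
pattern = re.compile(r"<span class='(hs_kw\d+_\w+)'></span>")

# Deduplicate while keeping first-seen order.
class_names = list(dict.fromkeys(pattern.findall(fragment)))
print(class_names)
```

Each name in the list is one keyword that still needs its `::before` content looked up.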

Let’s do a global search

Format the HTML source code and search it for hs_kw to find the key function

                function $GetClassName$($index$) {
                    return '.hs_kw' + $index$ + '_baikeCt';
                }

This function comes from the JS block we set aside earlier. Copy all of that JS source, create a new snippet in the Sources panel, and run it with Ctrl+Enter.

  • Use the parameters to find the core substitution method

Next comes the replacement, which is done with Selenium.

The core code is below. The main notes are in the code comments, which I hope will help you follow it.

import re

from selenium import webdriver


def write_html(js_list, car_info):
    # Stub DOM for running the JS. This was the hardest and most time-consuming
    # part to crack; it is based on code shared by experts online.
    DOM = ("var rules = '2';"
           "var document = {};"
           "function getRules(){return rules}"
           "document.createElement = function() {"
           " return {"
           " sheet: {"
           " insertRule: function(rule, i) {"
           " if (rules.length == 0) {"
           " rules = rule;"
           " } else {"
           " rules = rules + '#' + rule;"
           "}"
           "}"
           "}"
           "}"
           "};"
           "document.querySelectorAll = function() {"
           " return {};"
           "};"
           "document.head = {};"
           "document.head.appendChild = function() {};"

           "var window = {};"
           "window.decodeURIComponent = decodeURIComponent;")

    # Append the JS taken from the page to the stub
    for item in js_list:
        DOM = DOM + item
    html_type = "<html><head><meta charset='utf-8'></head><body><script type='text/javascript'>"

    # Splice everything into a working web page
    js = html_type + DOM + " document.write(rules)</script></body></html>"
    # Delete the file before running again, otherwise a file with the same name
    # cannot be created, or add your own check
    with open("./demo.html", "w", encoding="utf-8") as f:
        f.write(js)

    # Use Selenium to read the generated rules back, then do the replacement.
    # webdriver.PhantomJS and find_element_by_tag_name need an older Selenium 3.x release.
    driver = webdriver.PhantomJS()
    driver.get("./demo.html")
    # Read the body section
    text = driver.find_element_by_tag_name('body').text
    # Match all span tags in the vehicle parameters
    span_list = re.findall("<span(.*?)></span>", car_info)  # car_info is the string concatenated above

    # Replace each span tag with its keyword from text
    for span in span_list:
        # Match the class name, e.g. hs_kw7_optionZl
        info = re.search("'(.*?)'", span)
        if info:
            class_info = str(info.group(1)) + "::before { content:(.*?)}"  # becomes hs_kw7_optionZl::before { content:(.*?)}
            content = re.search(class_info, text).group(1)  # quoted keyword, e.g. "measured", "fuel consumption", "quality guarantee"

            car_info = car_info.replace(str("<span class='" + info.group(1) + "'></span>"),
                                        re.search('"(.*?)"', content).group(1))
    print(car_info)
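The substitution loop can also be tried without a browser once the rules text is available. The sketch below is not the author's code. It assumes the rules come back as one string joined with `#`, as the stub DOM above produces, and that each rule has the form `.class::before { content:"keyword" }`. The helper names `parse_rules` and `fill_spans` and the sample strings are made up for illustration.

```python
import re


def parse_rules(rules_text):
    """Map class name -> keyword from '#'-joined `.cls::before { content:"..." }` rules."""
    mapping = {}
    for rule in rules_text.split("#"):
        m = re.search(r'\.(\w+)::before\s*\{\s*content:\s*"(.*?)"\s*\}', rule)
        if m:
            mapping[m.group(1)] = m.group(2)
    return mapping


def fill_spans(car_info, mapping):
    """Replace each empty placeholder span with its keyword; unknown classes stay as-is."""
    def repl(match):
        return mapping.get(match.group(1), match.group(0))
    return re.sub(r"<span class='(\w+)'></span>", repl, car_info)


# Invented rules text; the leading '2' mirrors the stub's initial value of `rules`.
rules_text = (
    '2#.hs_kw28_configfH::before { content:"measured" }'
    '#.hs_kw7_optionZl::before { content:"fuel consumption" }'
)
car_info = "<span class='hs_kw28_configfH'></span> <span class='hs_kw7_optionZl'></span> (L/100km)"

decoded = fill_spans(car_info, parse_rules(rules_text))
print(decoded)
```

Building the mapping once and substituting in a single `re.sub` pass avoids searching the rules text again for every span. A keyword that itself contains `#` would break the split, so that case would need separate handling.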

The results

Saving to a database

The remaining step is data persistence. Once you have the data, the rest is fairly simple, and I trust you can handle it yourself.
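For readers who want a starting point, here is one possible sketch of the persistence step using SQLite from the Python standard library. The table name `car_params`, its columns, the `save_params` helper, and the sample rows are all hypothetical; the real fields depend on how you parse the decoded car_info. The series id 59 is taken from the URL crawled earlier.

```python
import sqlite3


def save_params(conn, series_id, params):
    """Insert (name, value) pairs for one car series into a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS car_params ("
        "series_id INTEGER, name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO car_params (series_id, name, value) VALUES (?, ?, ?)",
        [(series_id, name, value) for name, value in params],
    )
    conn.commit()


# In-memory database for the demo; pass a file path to keep the data.
conn = sqlite3.connect(":memory:")
# Invented rows, not real autohome data.
save_params(conn, 59, [("measured fuel consumption", "7.2"), ("warranty", "3 years")])
rows = conn.execute("SELECT name, value FROM car_params ORDER BY name").fetchall()
print(rows)
```

Parameterized queries keep odd characters in the scraped text from breaking the SQL statement.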

Small extension: format JS

When you run into JS like this, paste it straight into a formatting tool.

Tool.oschina.net/codeformat/…

Once it is formatted, the code is readable.

Summary

Autohome uses CSS to hide part of the real text. To solve it, first look for the class, then locate the JS that generates it, and finally work through its encryption rules.

Follow the public account: non-undergraduate programmer

After following, send "car" to get the source code