Tell me about the website
The development team of this website must be good at the front end. It started to write this blog on April 19, 2019. There is no guarantee that this code can survive until the end of the month.
There are thousands of articles on CSDN about autohome, but that’s the interesting thing about a crawler, because when it’s finished, you don’t know if it’s going to work the next time, so you can keep getting written. Hopefully today’s blog will help you learn an anti-crawl technique.
The web page to crawl to today
Car.autohome.com.cn/config/seri… All we have to do is crawl the car configuration
The specific data are as follows
Display of anti-crawl measures
Source file data
Brake /<span class='hs_kw86_baikeIl'></span> Safety systemCopy the code
Page display data
Crawl for key information
We want to get the key information in the source code first, even if the data is crawling back. Getting the data is very simple. Use the Request module
def get_html(a):
url = "https://car.autohome.com.cn/config/series/59.html#pvareaid=3454437"
headers = {
"User-agent": "Your browser UA"
}
with requests.get(url=url, headers=headers, timeout=3) as res:
html = res.content.decode("utf-8")
return html
Copy the code
Look for key factors
Find key points in the HTML page:
- var config
- var levelId
- var keyLink
- var bag
- var color
- var innerColor
- var option
Once you find these things, you focus on them. What are they? Data can be obtained with simple regular expressions
def get_detail(html):
config = re.search("var config = (.*?) };", html, re.S)
option = re.search("var option = (.*?) };", html, re.S)
print(config,option)
Copy the code
The output
>python e:/python/demo.py
<re.Match object; span=(167291, 233943), match='var config = {"message":"<span class=\'hs_kw50_co>
>python e:/python/demo.py
<re.Match object; span=(167291, 233943), match='var config = {"message":"<span class=\'hs_kw50_co> <re.Match object; span=(233952, 442342), match='var option = {"message":"<span class=\'hs_kw16_op>
Copy the code
Handling car parameters
Match the data using the search method of the regular expression, and then call group(0) to get the relevant data
def get_detail(html):
config = re.search("var config = (.*?) };", html, re.S)
option = re.search("var option = (.*?) };", html, re.S)
# Handle car parameters
car_info = ""
if config and option :
car_info = car_info + config.group(0) + option.group(0)
print(car_info)
Copy the code
After getting the data, there is no end, this is the data after confusion, need to parse back, continue to pay attention to the source code of the web page, found a strange JS. This section of JS first need not tube, leave some impression can ~
Keyword cracking
<span class="hs_kw28_configfH"></span>
Copy the code
Hs_kw Number _configfH is a span class
I chose ::before after span
The measured
.hs_kw28_configfH::before
Copy the code
Let’s do a global search
Format the HTML source code and search internally for HS_kw to find the key functions
function $GetClassName$($index$) {
return '.hs_kw' + $index$ + '_baikeCt';
}
Copy the code
The source of this JS is the JS snippet that we just kept, copy all the JS source code, create a new snippet in the source, and run it.
ctrl+enter
- :
Find the core substitution method by the parameters
Next, we do the replacement, which needs to be done using Selenium
The core code is as follows, the main notes, I wrote in the code inside, I hope to help you understand
def write_html(js_list,car_info):
# DOM run JS - this is the most difficult crack, very time consuming ~ reference to the Internet's big god code
DOM = ("var rules = '2';"
"var document = {};"
"function getRules(){return rules}"
"document.createElement = function() {"
" return {"
" sheet: {"
" insertRule: function(rule, i) {"
" if (rules.length == 0) {"
" rules = rule;"
" } else {"
" rules = rules + '#' + rule;"
"}"
"}"
"}"
"}"
"};
"document.querySelectorAll = function() {"
" return {};"
"};
"document.head = {};"
"document.head.appendChild = function() {};"
"var window = {};"
"window.decodeURIComponent = decodeURIComponent;")
Write JS files to the file
for item in js_list:
DOM = DOM + item
html_type = "
# Spliced into a working web page
js = html_type + DOM + " document.write(rules)</script></body></html>"
# delete the file when running again, otherwise you cannot create the file with the same name, or add your own authentication
with open("./demo.html"."w", encoding="utf-8") as f:
f.write(js)
Selenium is used to read the data and replace it
driver = webdriver.PhantomJS()
driver.get("./demo.html")
Read the body section
text = driver.find_element_by_tag_name('body').text
Match all span tags in vehicle parameters
span_list = re.findall("
"
(.*?)>, car_info) # car_info is the string I concatenated above
# replace span tags with keywords in text
for span in span_list:
match hs_kw7_optionZl
info = re.search("(. *?) '", span)
if info:
class_info = str(info.group(1)) + "::before { content:(.*?) }" # concatenate hs_kw7_optionZl::before {content:(.*?) }
content = re.search(class_info, text).group(1) # match text content, return result is "measured "" fuel consumption "" quality guarantee"
car_info = car_info.replace(str("<span class='" + info.group(1) + "'></span>"),
re.search("\" (. *?) \ "", content).group(1))
print(car_info)
Copy the code
The results
Warehouse operation
The remaining step is the data persistence, after the data, the other is relatively simple, I hope you can directly handle.
Small extension: format JS
When you encounter this JS, go directly to the formatting tool and handle it
Tool.oschina.net/codeformat/…
Once the format is complete, the code is readable
Thinking summary
Autohome uses CSS to hide part of the real font. In the process of solving the problem, we need to first look for the class. When we find the JS position, we must deal with its encryption rules.
Follow public account: non-undergraduate programmer
After paying attention, send car to get the source code