Preface
This series of posts records some small crawler project examples, using the requests library to fetch pages and XPath to parse the data. Compared with BeautifulSoup (bs4) and re, XPath is more widely used, convenient, and efficient, so here is a brief introduction to how XPath works and its basic usage.
1 How XPath works
The first step is to instantiate an etree object and load the source code of the page to be parsed into it, either from a local file (etree.parse(filePath)) or from an HTML string fetched over the Internet (etree.HTML(page_text)). Then the object's xpath() method is called with an XPath expression to locate tags and capture their content.
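As a minimal sketch of the two loading modes (the file path and the HTML string below are placeholders, not the actual 58.com page):

```python
from lxml import etree

# Load from a local file (hypothetical path "page.html"):
# tree = etree.parse("page.html", etree.HTMLParser())

# Load from an HTML string, as returned by a web request:
page_text = "<html><body><div id='main'>hello</div></body></html>"
tree = etree.HTML(page_text)

# The tree object exposes xpath() for tag location and content capture:
print(tree.xpath('//div[@id="main"]/text()'))  # ['hello']
```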
2 XPath expressions
❑ "/" : at the leftmost position, it means traversal starts from the root node; in the middle of an expression, it represents one level of hierarchy.
❑ "//" : represents any number of levels, i.e. all descendants. Roughly equivalent to the descendant (space) combinator in bs4 CSS selectors.
❑ Tag location: //div[@class="attribute value"]. Note that tag indexing starts from 1, so p[1] is the first p.
❑ Getting text: /text() gets the direct text content of a tag; //text() gets all text content under a tag.
❑ Getting an attribute: /@attributeName
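These rules can be tried out on a tiny in-memory document (the HTML snippet below is made up purely for illustration):

```python
from lxml import etree

html = """
<html><body>
  <div class="song">
    <p>first</p>
    <p>second</p>
    <a href="http://example.com">link <b>text</b></a>
  </div>
</body></html>
"""
tree = etree.HTML(html)

# "/"  : level by level from the root
tree.xpath('/html/body/div/p/text()')           # ['first', 'second']
# "//" : any number of levels in between
tree.xpath('//p/text()')                        # ['first', 'second']
# attribute filter plus 1-based index
tree.xpath('//div[@class="song"]/p[1]/text()')  # ['first']
# /text() (direct text only) vs //text() (all nested text)
tree.xpath('//a/text()')                        # ['link ']
tree.xpath('//a//text()')                       # ['link ', 'text']
# take an attribute value
tree.xpath('//a/@href')                         # ['http://example.com']
```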
3 Crawling second-hand housing listings from Beijing 58.com
The target page is bj.58.com/ershoufang/. First, import the related packages:
import requests
import pandas as pd
from lxml import etree
A User-Agent header is necessary because the site may reject requests that do not look like they come from a browser, so we disguise the script as one by sending browser identification to the server. The value can be copied from your browser's developer tools.
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36"
}
Send a GET request to obtain the source code of the whole page as the response data, then extract the target information from it.
url = "https://bj.58.com/ershoufang/"
page_text = requests.get(url=url, headers=headers).text
Create the etree instance. Here I explicitly set the page encoding to UTF-8 to avoid parsing errors on source code that is not strictly valid UTF-8.
parse = etree.HTMLParser(encoding='utf-8')
tree = etree.HTML(page_text, parser=parse)
Now locate the tags locally. The listing information includes title, room layout, area, orientation, floor, construction year, price per square meter, total price, and address. The approach is to locate the tag holding each piece of text data and then apply /text() or //text() to extract it. Inspecting the page source shows that all of the text data sits under a section tag whose class attribute is "list", and that its direct child div tags with class attribute "property" each correspond to one listing, so we first grab all of those div tags.
div_list = tree.xpath('//section[@class="list"]/div[@class="property"]')
Then inspect where each specific field lives. The title, for example, is stored in the title attribute of an h3 tag under the second direct div child of the listing's a tag.
title.append(div.xpath('./a/div[2]//h3/@title')[0])  # ./ means relative to the current div
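To see why the expression starts with ./, here is a runnable sketch against a minimal mock of the listing structure (the HTML and the title value are invented for illustration, not scraped from 58.com):

```python
from lxml import etree

html = ('<section class="list"><div class="property">'
        '<a><div></div><div><h3 title="Nice flat"></h3></div></a>'
        '</div></section>')
tree = etree.HTML(html)

# First locate each listing div, then query relative to it:
for div in tree.xpath('//section[@class="list"]/div[@class="property"]'):
    # "./" starts the search from the current div instead of the document root
    print(div.xpath('./a/div[2]//h3/@title')[0])  # Nice flat
```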
The other fields are extracted in much the same way, with some differences in cleanup: the price text, for example, comes back padded with '\n' and spaces that must be stripped with string functions. Finally, persist the data; in this case I save the listing information as a CSV file.
df = pd.DataFrame({"Title": title, "Room": fangxing, "Area": area, "Towards": chaoxiang, "Floor": louceng,
                   "Build time": time, "Per square meter/ten thousand yuan": one_price, "Total house price": total_price, "Address": address})
df.to_csv("58 city rooms.csv", encoding="utf-8", index=True, index_label="Serial number")
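As a sketch of that kind of cleanup (the raw strings below are made up to mimic what the page returns, not actual scraped data):

```python
# Hypothetical raw values, imitating the whitespace-padded text the page returns:
raw_area = '\n                    88.5㎡\n                '
raw_parts = ['2', '室', '1', '厅', ' ', '1', '卫']

# strip() removes the surrounding newlines and spaces
area = raw_area.strip()
# join the fragments, skipping the spacer strings
fangxing = ''.join(p for p in raw_parts if p != ' ')
print(area, fangxing)  # 88.5㎡ 2室1厅1卫
```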
The results are written to the CSV file. Complete code:
# Crawl 58.com second-hand housing information
import requests
import pandas as pd
from lxml import etree

if __name__ == "__main__":
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36"
    }
    url = "https://bj.58.com/ershoufang/"
    page_text = requests.get(url=url, headers=headers).text
    # data parsing
    parse = etree.HTMLParser(encoding='utf-8')
    tree = etree.HTML(page_text, parser=parse)
    div_list = tree.xpath('//section[@class="list"]/div[@class="property"]')
    # print(div_list)
    title, fangxing, area, chaoxiang, louceng, time, total_price, one_price, address = [], [], [], [], [], [], [], [], []
    for div in div_list:
        # title: stored in the h3 tag's title attribute; ./ is relative to the current div
        title.append(div.xpath('./a/div[2]//h3/@title')[0])
        # room layout: concatenate the text fragments, skipping the spacer strings
        fangxing_temp = div.xpath('./a//div[@class="property-content-info"]/p[1]//text()')
        s = ""
        for seg in fangxing_temp:
            if seg != ' ':
                s += seg
        fangxing.append(s)
        # area: strip surrounding spaces and newlines
        area_temp = div.xpath('./a//div[@class="property-content-info"]/p[2]/text()')
        area.append(area_temp[0].strip(' \n'))
        # orientation
        chaoxiang.append(div.xpath('./a//div[@class="property-content-info"]/p[3]/text()')[0])
        # floor
        louceng.append(div.xpath('./a//div[@class="property-content-info"]/p[4]/text()')[0].strip(' \n'))
        # build time
        time.append(div.xpath('./a//div[@class="property-content-info"]/p[5]/text()')[0].strip(' \n'))
        # total price
        total_price.append(div.xpath('./a//div[@class="property-price"]/p[1]//text()')[0] + '万')
        # unit price
        one_price.append(div.xpath('./a//div[@class="property-price"]/p[2]//text()')[0])
        # address
        address.append('-'.join(div.xpath('./a//section/div[2]/p[2]//text()')))
    # persistent storage
    df = pd.DataFrame({"Title": title, "Room": fangxing, "Area": area, "Towards": chaoxiang, "Floor": louceng,
                       "Build time": time, "Per square meter/ten thousand yuan": one_price, "Total house price": total_price, "Address": address})
    df.to_csv("58 city rooms.csv", encoding="utf-8", index=True, index_label="Serial number")