Preface
In this article we do a comprehensive analysis of Lianjia's new-housing data: we crawl the listing information for every city, about 26,000 records in total.
Tip: the following is the body of the article; the examples below are for reference only.
One, analyze the URL
The structure of this URL is simple: the part right after https:// is the city's initials ("bj" for Beijing), and the number after pg is the page number. So to get the listing data for every city, we need to construct all of these URLs correctly so that we can fetch them with requests.
That gives us two things to obtain: the initials of every city, and the number of pages for each city.
https://bj.fang.lianjia.com/loupan/pg1/
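To make the pattern concrete, here is a minimal sketch of how such a URL is built; "bj" and 1 are just placeholders:

city, page = "bj", 1          # placeholders: city initials and page number
url = "https://%s.fang.lianjia.com/loupan/pg%d/" % (city, page)
print(url)                    # https://bj.fang.lianjia.com/loupan/pg1/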
So the question is: where do we get each city's initials and its page count? Open the home page and take a look: clicking the circled city button brings up the list of cities, and from these entries we can read the corresponding initials.
Right-click and inspect the element: each city entry links to the URL of that city's housing data, so we grab these URLs and extract the initials from them.
Now we have the initials but still lack the page count for each city. My approach is to use Selenium to visit each city's URL and read its total page count.
In the pager, each page corresponds to an a tag. When there are two or more pages, the total page count sits in the second-to-last a tag (the last one is the "next page" button). When there is only one page there is no "next page" button, so we simply set the count to 1, as the sketch below illustrates.
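To make the rule concrete, here is a tiny self-contained sketch; the pager HTML is simplified and hypothetical, while the real XPath appears in the code of section two:

from lxml import etree

# hypothetical, simplified pager markup: three pages plus a "next page" (下一页) button
pager = etree.HTML('<div class="page-box"><a>1</a><a>2</a><a>3</a><a>下一页</a></div>')
total = pager.xpath('//div[@class="page-box"]/a[last()-1]/text()')[0]   # second-to-last a tag
print(total)   # 3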
Two, concatenate the URL
With the idea above in place, let's implement it in code.
1. Instantiate Chrome
The basic configuration
The code is as follows (example):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')         # run Chrome without a visible window
chrome_options.add_argument('--disable-gpu')
prefs = {'profile.managed_default_content_settings.images': 2}    # do not load images
chrome_options.add_experimental_option('prefs', prefs)
bro = webdriver.Chrome(options=chrome_options)    # instantiate the driver
2. Obtain the initials and the page count
First, let's think through what to do once we land on the page. After clicking the city button, the city list appears and we can read each city's initials; after clicking the third filter, the listings page opens and we can read the maximum page number. With that in mind, let's write the code.
Open the web page to locate the list of all cities
bro.get('https://bj.fang.lianjia.com/')   # open the home page
# the list of all cities
li_list = bro.find_elements_by_xpath('/html/body/div[3]/div[3]//a')
Then we locate each city element and click through in a loop: click the city button, click a city, click the filter, then close the new window and switch back.
Because the home page is refreshed each time, we have to re-fetch li_list at the end of every iteration. The code that reads the maximum page number must run at the right point: the page count can only be found after switching to the listings window.
for i in range(len(li_list)):
    bro.find_element_by_xpath('/html/body/div[1]/div/div[1]/a[2]').click()   # click the city button
    li_list[i].click()                                                       # click a city
    bro.find_element_by_xpath('//div[@data-search-position="search_result"]').click()   # click the filter
    ws = bro.window_handles               # window handles
    bro.switch_to.window(ws[1])           # switch to the new listings window
    # read the initials and the maximum page number here (see the snippets below)
    bro.close()                           # close the listings window
    bro.switch_to.window(ws[0])           # return to the home page
    li_list = bro.find_elements_by_xpath('/html/body/div[3]/div[3]//a')      # re-fetch the city list
The code for the initials (I also record the city name):
dic = {}                                  # a fresh dict for each city
dic["city"] = li_list[i].get_attribute("href").replace(".fang.lianjia.com/", "").replace("https://", "")   # city initials
dic["city_name"] = li_list[i].text        # city name
The code to read the maximum page count and save the data:
When there are multiple pages a "next page" button exists; when there is only one page it does not. We use this to determine the page count, and we save the data as we crawl.
if "下一页" in bro.page_source:           # the "next page" button is present
    # the second-to-last a tag holds the maximum page number
    dic["page"] = bro.find_element_by_xpath('//div[@class="page-box"]/a[last()-1]').text
else:
    dic["page"] = "1"
with open(".//city_page.csv", "a", encoding="utf-8") as f:
    writer = csv.DictWriter(f, dic.keys())
    writer.writerow(dic)
3. Concatenate the URLs
We now have the maximum page count and the initials, so we can start concatenating the URLs.
import re
import pandas as pd

names = ["city", "name", "page"]
df = pd.read_csv("city_page.csv", names=names)                      # read the saved initials and page counts
df.page = df.page.map(lambda x: re.findall(r"(\d+)", str(x))[0])    # keep only the digits of the page field
for a, b in zip(df.city, df.page):
    for i in range(int(b)):
        url = "https://%s.fang.lianjia.com/loupan/pg%d/" % (a, i + 1)   # pages start at pg1
        # save the URLs to a txt file
        with open("urls.txt", "a", encoding="utf-8") as f:
            f.write(url)
            f.write("\n")
At this point all of the city URLs have been concatenated, 2,669 in total. With these we can use requests plus multithreading to crawl the data quickly.
Three, obtain the housing data
Now that we have the URLs, let's parse the page data. We will extract the name, address, price, type, and so on.
Right-click and inspect the element: the housing data sits in li tags inside a ul, and each li corresponds to one listing.
So we can use XPath to locate and extract the data. Note that the code below has no try, but it should have one, because some fields do not exist for every listing.
li_list=html.xpath('//ul[@class="resblock-list-wrapper"]/li')
for li in li_list:
dic={}
dic["title"]=li.xpath('./div//div[@class="resblock-name"]/a/text()')[0]
dic["type"]=li.xpath('./div//span[@class="resblock-type"]/text()')[0]
dic["status"]=li.xpath('./div//span[@class="sale-status"]/text()')[0]
dic["location"]=''.join([x for x in li.xpath('./div//div[@class="resblock-location"]//text()') if '\n' not in x])
dic["room"]=''.join([x for x in li.xpath('./div//a[@class="resblock-room"]//text()') if '\n' not in x])
dic["area"]=li.xpath('.//div[@class="resblock-area"]/span/text()')[0]
dic["tag"]=''.join([x for x in li.xpath('./div//div[@class="resblock-tag"]//text()') if '\n' not in x])
dic["main_price"]=li.xpath('./div//div[@class="main-price"]/span[@class="number"]/text()')[0]
dic["price"]=li.xpath('./div//div[@class="resblock-price"]/div[@class="second"]/text()')[0]
dic["img_link"]=li.xpath('./a//img[@class="lj-lazy"]/@src')[0]
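The block above assumes an html object that is never defined in the snippet. A minimal sketch of building it, assuming requests and lxml (the request headers are an assumption), would be:

import requests
from lxml import etree

url = "https://bj.fang.lianjia.com/loupan/pg1/"                   # any of the URLs built earlier
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})   # the headers are an assumption
html = etree.HTML(resp.text)                                      # the html object used by the XPath code above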
location, room, and tag each come back as a list of text nodes, so we discard the whitespace-only entries and join the rest into a string.
''.join([x for x in li.xpath('./div//div[@class="resblock-tag"]//text()') if '\n' not in x])
# the comprehension above, written as an explicit loop
tag_parts = []
for x in li.xpath('./div//div[@class="resblock-tag"]//text()'):
    if "\n" not in x:
        tag_parts.append(x)
''.join(tag_parts)
Read the URLs: we saved them to a txt file earlier, so now we read them back.
urls = open("urls.txt", "r", encoding="utf-8").read()
urls = urls.split("\n")[:-1]   # split on newlines into a list, dropping the trailing empty entry
Save the data. This block should sit inside the li_list loop, so that we save as we crawl.
with open(".//fangyuan.csv", "a", encoding="utf-8") as f:
writer = csv.DictWriter(f, dic.keys())
writer.writerow(dic)
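Putting the pieces together, here is a minimal sketch of what the parser_data function used by the thread pool below might look like; the request headers and the placement of the try are assumptions, not from the original:

import csv
import requests
from lxml import etree

def parser_data(url):
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})   # the headers are an assumption
    html = etree.HTML(resp.text)
    li_list = html.xpath('//ul[@class="resblock-list-wrapper"]/li')
    for li in li_list:
        dic = {}
        try:
            dic["title"] = li.xpath('./div//div[@class="resblock-name"]/a/text()')[0]
            # the remaining fields are extracted exactly as shown earlier
        except IndexError:            # some fields are missing for some listings
            continue
        with open(".//fangyuan.csv", "a", encoding="utf-8") as f:
            writer = csv.DictWriter(f, dic.keys())
            writer.writerow(dic)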
Finally, we can turn on a thread pool to crawl the data quickly.
from multiprocessing.dummy import Pool   # thread pool; assuming multiprocessing.dummy, which matches Pool(4) below

pool = Pool(4)
pool.map(parser_data, urls)
After the crawl finishes, let's look at the result: the file is more than 5 MB.
After deduplication, the data comes to nearly 30,000 rows with 10 fields.
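A minimal sketch of the deduplication step, assuming pandas and the fangyuan.csv produced above (the column names simply mirror the dic keys):

import pandas as pd

cols = ["title", "type", "status", "location", "room", "area",
        "tag", "main_price", "price", "img_link"]
df = pd.read_csv("fangyuan.csv", names=cols)
df = df.drop_duplicates()
print(df.shape)   # roughly 30,000 rows x 10 columns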
That concludes this article. If you are interested in web crawlers, I hope it helps.