Preface

In this article we take a comprehensive look at Lianjia's new-housing listings: we crawl the listing information for every city, a total of 26,000 records.

Tip: the following is the body of this article; the examples below are for reference.

1. Analyze the URLs

The structure of the URL is simple: https:// is followed by the city's abbreviation, and the number after pg indicates the page. So to get the listing data for every city, we need to construct all of these URLs correctly so that we can request them with requests.

That means there are two things to collect: the abbreviation of every city, and the number of pages for each city.

https://bj.fang.lianjia.com/loupan/pg1/
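To make the pattern concrete, here is a minimal sketch of how a URL is built from those two pieces; the abbreviation "bj" and page 1 are just example values:

city, page = "bj", 1  # example values: Beijing, page 1
url = f"https://{city}.fang.lianjia.com/loupan/pg{page}/"
print(url)  # https://bj.fang.lianjia.com/loupan/pg1/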

This raises the question: where do we get each city's abbreviation and its page count? Open the home page and take a look: clicking the circled button brings up the city selector, and from those cities we can read off the corresponding abbreviations.

Right-click and inspect the element: each city links to that city's listing page, so we grab each link's URL and extract the abbreviation from it.
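As a quick sketch of that extraction (the example href is assumed; the same replace() approach appears in the full code later):

href = "https://sh.fang.lianjia.com/"  # an example city link pulled from the page
abbrev = href.replace("https://", "").replace(".fang.lianjia.com/", "")
print(abbrev)  # sh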



Now we have the abbreviations, but we still lack the page count for each city. My approach is to use Selenium to visit each city's URL and read its total page count.

Each page number in the pager corresponds to an <a> tag. When there are two or more pages, the last <a> is the "Next page" button, so we read the count from the next-to-last <a>.

When there is only one page there is no "Next page" button, so we simply set the count to 1.
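A rough sketch of that logic, using the Selenium driver (bro) instantiated in the next section; treat the pager XPath as an assumption about the page structure, matching the selector used later in this article:

page_links = bro.find_elements_by_xpath('//div[@class="page-box"]/a')  # all <a> tags in the pager
if len(page_links) >= 2:
    max_page = page_links[-2].text  # the next-to-last <a> holds the last page number
else:
    max_page = "1"  # a single page has no pager, so default to 1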



2. Concatenate the URLs

We now have a plan based on the above, so let's implement it in code.

1. Instantiate Chrome

The basic configuration

The code is as follows (example):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import csv  # used later when saving the results

chrome_options = Options()
chrome_options.add_argument('--headless')  # run without a visible window
chrome_options.add_argument('--disable-gpu')
prefs = {'profile.managed_default_content_settings.images': 2}
chrome_options.add_experimental_option('prefs', prefs)  # do not load images
bro = webdriver.Chrome(options=chrome_options)  # instantiate the browser object

2. Obtain the city abbreviation and page count

Let's first work out what we need to do once the page is open.

After clicking the city button, the city list appears and we can read each city's abbreviation from it. After clicking the third filter, we land on the listing page, where we can get the maximum number of pages.

With that in mind, let's write the code.

Open the web page and locate the list of all cities:

bro.get('https://bj.fang.lianjia.com/')  # open the home page
li_list = bro.find_elements_by_xpath('/html/body/div[3]/div[3]//a')  # the city list

Locate each element and loop through the clicks: click the city button, click the city, click the filter, then switch back and repeat.

Because the home page is refreshed each time, every iteration ends by re-fetching li_list.

The code that reads the maximum page count has to run at the right point: the count is only visible after switching to the listing page.

for i in range(len(li_list)):
    bro.find_element_by_xpath('/html/body/div[1]/div/div[1]/a[2]').click()  # click the city button
    li_list[i].click()  # click the city
    bro.find_element_by_xpath('//div[@data-search-position="search_result"]').click()  # click the filter
    ws = bro.window_handles  # window handles
    bro.switch_to.window(ws[1])  # switch to the new listing page
    # (the snippets below for the abbreviation, the page count and saving go inside this loop)
    bro.close()  # close the listing page
    bro.switch_to.window(ws[0])  # return to the home page
    li_list = bro.find_elements_by_xpath('/html/body/div[3]/div[3]//a')  # re-fetch the city list

The code for the abbreviation; I also record the city name:

dic = {}
dic["city"] = li_list[i].get_attribute("href").replace(".fang.lianjia.com/", "").replace("https://", "")  # city abbreviation
dic["city_name"] = li_list[i].text  # city name

The code for the maximum page count and for saving the data:

When there are multiple pages of data there is a "Next page" button; when there is only one page there is not, which is how we judge the page count. Finally, we save the record for each city as we go.

if "下一页" in bro.page_source:  # the "Next page" button text is present
    dic["page"] = bro.find_element_by_xpath('//div[@class="page-box"]/a[last()-1]').text  # maximum page count
else:
    dic["page"] = "1"
with open(".//city_page.csv", "a", encoding="utf-8") as f:
    writer = csv.DictWriter(f, dic.keys())
    writer.writerow(dic)

3. Splice the URLs together

We've got the maximum page count and the abbreviation, so now we can start concatenating the URLs.

import re
import pandas as pd

names = ["city", "name", "page"]
df = pd.read_csv("city_page.csv", names=names)  # read back the city / page-count data
df.page = df.page.map(lambda x: re.findall(r"(\d+)", str(x))[0])  # keep only the digits of the page field
for a, b in zip(df.city, df.page):
    for i in range(int(b)):
        url = "https://%s.fang.lianjia.com/loupan/pg%d/" % (a, i + 1)  # pages start at pg1
        with open("urls.txt", "a", encoding="utf-8") as f:  # append each URL to a txt file
            f.write(url)
            f.write("\n")

At this point all the city URLs have been generated, 2,669 in total. With them we can use requests plus multithreading to crawl the data quickly.

3. Get the housing data

Now that we have the URLs, let's parse the page data: we'll extract the name, address, price, type, and so on.

Right-click and inspect the element: the listing data sits in <li> tags inside a <ul>, and each <li> corresponds to one listing.

So we can use XPath to locate and extract the data. The code below does not use try/except, but it should, because some fields are missing from some listings.

li_list=html.xpath('//ul[@class="resblock-list-wrapper"]/li')
for li in li_list:
    dic={}
    dic["title"]=li.xpath('./div//div[@class="resblock-name"]/a/text()')[0]
    dic["type"]=li.xpath('./div//span[@class="resblock-type"]/text()')[0]
    dic["status"]=li.xpath('./div//span[@class="sale-status"]/text()')[0]
    dic["location"]=''.join([x for x in li.xpath('./div//div[@class="resblock-location"]//text()') if '\n' not in  x])
    dic["room"]=''.join([x for x in li.xpath('./div//a[@class="resblock-room"]//text()') if '\n' not in  x])
    dic["area"]=li.xpath('.//div[@class="resblock-area"]/span/text()')[0]
    dic["tag"]=''.join([x for x in li.xpath('./div//div[@class="resblock-tag"]//text()') if '\n' not in  x])
    dic["main_price"]=li.xpath('./div//div[@class="main-price"]/span[@class="number"]/text()')[0]
    dic["price"]=li.xpath('./div//div[@class="resblock-price"]/div[@class="second"]/text()')[0]
    dic["img_link"]=li.xpath('./a//img[@class="lj-lazy"]/@src')[0]
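The snippet above is the body of the parsing function that the thread pool calls at the end of the article (pool.map(parser_data, urls)). Here is a hedged sketch of what a complete parser_data might look like, with the try/except the text above recommends; the request headers, timeout, and error handling are my assumptions, not the author's original code:

import csv
import requests
from lxml import etree

def parser_data(url):
    """Fetch one listing page and append every listing on it to the CSV (sketch)."""
    headers = {"User-Agent": "Mozilla/5.0"}  # assumed header, adjust as needed
    resp = requests.get(url, headers=headers, timeout=10)
    html = etree.HTML(resp.text)
    li_list = html.xpath('//ul[@class="resblock-list-wrapper"]/li')
    for li in li_list:
        dic = {}
        try:
            dic["title"] = li.xpath('./div//div[@class="resblock-name"]/a/text()')[0]
            dic["area"] = li.xpath('.//div[@class="resblock-area"]/span/text()')[0]
            # ... extract the remaining fields exactly as in the block above ...
        except IndexError:
            continue  # skip a listing whose fields are missing, as noted above
        with open(".//fangyuan.csv", "a", encoding="utf-8") as f:
            writer = csv.DictWriter(f, dic.keys())
            writer.writerow(dic)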

The location, room, and tag fields come back as lists of text nodes, so we discard the whitespace entries and join the rest into a single string. The one-line comprehension is equivalent to the explicit loop below it:

''.join([x for x in li.xpath('./div//div[@class="resblock-tag"]//text()') if '\n' not in  x])

items = []  # use a name that does not shadow the built-in list
for x in li.xpath('./div//div[@class="resblock-tag"]//text()'):
    if "\n" not in x:
        items.append(x)
''.join(items)

Read the URLs back in: we saved them to a txt file earlier, so now we load them from it.

urls = open("urls.txt", "r", encoding="utf-8").read()
urls = urls.split("\n")[:-1]  # split on newlines into a list, dropping the trailing empty entry

Save the data. This block should sit inside the li_list loop so that records are saved as they are crawled.

with open(".//fangyuan.csv", "a", encoding="utf-8") as f:
    writer = csv.DictWriter(f, dic.keys())
    writer.writerow(dic)

Finally, start a thread pool to crawl the data quickly.

from multiprocessing.dummy import Pool  # thread pool (assumed import; not shown in the original)

pool = Pool(4)
pool.map(parser_data, urls)

After the crawl completes, let's look at the result: the file is more than 5 MB.

After deduplication there are nearly 30,000 records with 10 fields.
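For reference, a minimal sketch of that deduplication step with pandas (the column names are the 10 keys collected above, and the file name matches the CSV written during the crawl):

import pandas as pd

cols = ["title", "type", "status", "location", "room", "area",
        "tag", "main_price", "price", "img_link"]  # the 10 fields collected above
df = pd.read_csv("fangyuan.csv", names=cols)
df = df.drop_duplicates()  # drop exact duplicate rows
print(df.shape)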



That's the end of this article. I hope it is useful if you are interested in web crawlers.