After publishing an earlier article about crawling Qunar (which collected at least 100,000 records), a reader messaged me asking how to crawl the site himself. At first glance it looked difficult, so I only gave him a rough outline. That earlier crawl targeted the mobile version of the site; to help readers learn more, today we crawl the desktop version of Qunar instead. The idea behind any crawler is the same: first fetch the page, then parse it and extract the data you want. If you want to analyze the data further, you also need to clean it, model it, and so on. In this article we will scrape hotel information, then clean and analyze the data.


1. Preparation

The Python libraries used in this case are Selenium, the storage library PyMongo, the parsing library PyQuery, and the cleaning and plotting libraries pandas and matplotlib. You also need to install the Chrome browser and configure ChromeDriver.
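Assuming a standard pip setup, the libraries above can be installed in one step (ChromeDriver itself must still be downloaded separately to match your Chrome version):

```shell
pip install selenium pymongo pyquery pandas matplotlib
```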


2. Page analysis

First go to the “https://www.qunar.com/” website and select “Hotels”. The link changes to “http://hotel.qunar.com/”, which is the domain we are going to visit, as shown in the figure below.

Click the destination box, enter a city, and then click “Search” to bring up the hotel list. In other words, we can use Selenium to control the browser: type in the city name, click the button, and land on the hotel page, as shown in the figure below.

If the business needs are broader, we can choose other hotel types from the menu with the same approach; here we stay on the default “hotel search” tab and choose to sort hotels by rating. Pick any hotel name, right-click, and open the developer tools, as shown in the figure below.

Analysis shows that the hotel list sits inside the element with ID “jxContentPanel”, and each hotel’s detailed information lives in a node with the class “b_result_box js_list_block”, as shown in the figure below.

We can now use the PyQuery parsing library to parse the page and extract the information we need. The code below implements the whole scraping process.


3. Practice

3.1 Obtaining the list of destination cities

Here we can reuse the destination-city list captured in the earlier independent-travel crawl, as follows:
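A minimal sketch of that list is below; the city names shown are illustrative samples, not the full dataset from the earlier crawl:

```python
# A short sample of the destination-city list reused from the earlier
# independent-travel crawl. The real list is much longer; these entries
# are illustrative only.
CITY_LIST = ['北京', '上海', '广州', '深圳', '杭州', '成都', '西安', '重庆']
```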

3.2 Getting the Qunar Page Details

We already have a list of destination cities, so when visiting the Qunar URL we only need to type in each city to reach its hotel page. Since the results span multiple pages, we also need to implement a way to turn pages. The implementation is as follows:

3.3 Parsing the Hotel List

Now that we have the hotel list pages, we can use the PyQuery parsing library to extract the data we want. The implementation is as follows:


3.4 Saving the Results to the Database and a CSV File
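A minimal sketch of both storage paths is below. The field names match the dicts produced by the parsing step; the MongoDB connection settings and the database/collection names are assumptions for a default local instance.

```python
import csv

def save_to_csv(hotels, path='hotels.csv'):
    """Append hotel dicts to a CSV file, writing the header once."""
    fieldnames = ['name', 'price', 'score']
    with open(path, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if f.tell() == 0:  # the file is new, so write the header row
            writer.writeheader()
        writer.writerows(hotels)

def save_to_mongo(hotels, db_name='qunar', collection='hotels'):
    """Insert the same records into a local MongoDB instance
    (host/port and names are assumptions)."""
    from pymongo import MongoClient  # imported here so CSV saving works without MongoDB
    client = MongoClient('localhost', 27017)
    client[db_name][collection].insert_many(list(hotels))
```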


3.5 Running Code



3.6 Viewing Results


MongoDB database results:


The CSV file:


3.7 Data Cleaning

The pandas library is used for data cleaning as follows:
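A minimal cleaning sketch with pandas is below. It runs on a small hypothetical sample so it is self-contained; the column names and the “4.5分” score format follow the fields scraped above, and the exact cleaning rules (deduplication, numeric conversion, dropping rows without a price) are assumptions about what the raw data needs.

```python
import pandas as pd

# Hypothetical raw rows standing in for pd.read_csv('hotels.csv').
df = pd.DataFrame({
    'name':  ['Hotel A', 'Hotel A', 'Hotel B', 'Hotel C'],
    'price': ['328', '328', '', '459'],
    'score': ['4.5分', '4.5分', '4.2分', None],
})

df = df.drop_duplicates().reset_index(drop=True)           # remove repeated rows
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # '' -> NaN, strings -> numbers
df['score'] = (df['score'].str.rstrip('分')                # strip the '分' suffix
                          .astype(float))
df = df.dropna(subset=['price'])                           # drop rows without a price
print(df)
```

After cleaning, the numeric `price` and `score` columns are ready for aggregation or plotting with matplotlib.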

The results are as follows:

3.8 Project Code

https://github.com/NGUWQ/Python3Spider/tree/master/dataanalysis



4. Conclusion

This project crawls Qunar hotel listings. If you want to crawl other Qunar services, you can extend this code to cover the whole site; the idea is the same.


If you are interested in crawlers, data analysis, or algorithms, follow the WeChat official account TWcoding and let's play with Python together.


If this project helps you, please give it a star.


God helps those who help themselves