Preface
Recently in a Python crawler group, a lot of people were very interested in the data on Meituan, and someone offered a handsome price: 5,000 to crawl Meituan data???? At the time I was puzzled, but after I had crawled all the data, I realized that 5,000 was actually too little!
The crawling approach
There are many crawler frameworks out there; roughly, I used the following ideas to implement incremental crawling (a sketch follows the list):
- Crawl the data with Requests (or Selenium);
- Check whether the data to be crawled already exists in the database;
- Store new records in a DataFrame object;
- Insert them into the database.
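As a minimal sketch of this dedup-then-insert flow (the `merchants` table, the `url`/`name` columns, and the use of SQLite are my own assumptions for illustration; the original post does not say which database it used):

```python
import sqlite3

import pandas as pd

# Hypothetical local database; any DB with a unique key on url works the same way.
conn = sqlite3.connect("meituan.db")
conn.execute("CREATE TABLE IF NOT EXISTS merchants (url TEXT PRIMARY KEY, name TEXT)")

def already_crawled(url):
    """Skip URLs that are already in the database (this is what makes the crawl incremental)."""
    cur = conn.execute("SELECT 1 FROM merchants WHERE url = ?", (url,))
    return cur.fetchone() is not None

def save_batch(rows):
    """Collect parsed rows into a DataFrame, then insert them into the database."""
    df = pd.DataFrame(rows, columns=["url", "name"])
    df.to_sql("merchants", conn, if_exists="append", index=False)
```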
Now that we have all the merchant URLs, we are on the last step. Note that different types of merchants have different page structures; a hotel page, for example, is laid out differently from a food page, so you need to write a separate parsing function for each type (see the sketch below). Finally, don't chase speed when crawling: Meituan's anti-crawler restrictions are very strict, so even with multiple threads it is best to wait a few seconds between requests and let it run slowly.
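Here is a minimal sketch of per-type parsing plus a throttled request. The parser functions, the type keys, and the 3-second delay are placeholders I made up for illustration, not Meituan's actual structure:

```python
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # a real crawl would rotate UAs and proxies

def parse_hotel(html):
    ...  # hotel pages have their own layout

def parse_food(html):
    ...  # food pages need a different parser

# Dispatch table: one parsing function per merchant type.
PARSERS = {"hotel": parse_hotel, "food": parse_food}

def crawl(url, page_type):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    time.sleep(3)  # be gentle: aggressive crawling gets blocked quickly
    return PARSERS[page_type](resp.text)
```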
Basic Environment Configuration
- Version: Python 3.6
- System: Windows
- Modules: csv, time, requests, json
Part of the code
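The original code was shown as an image and is not reproduced here, but given the modules listed above (csv, time, requests, json), the core loop likely looked something like this sketch. The endpoint URL, page range, and field names are guesses for illustration, not Meituan's real API:

```python
import csv
import json
import time

import requests

# Hypothetical JSON endpoint, purely for illustration.
API = "https://example.com/meituan/shops?page={}"

with open("shops.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "address", "score"])
    for page in range(1, 11):
        data = json.loads(requests.get(API.format(page), timeout=10).text)
        for shop in data.get("shops", []):
            writer.writerow([shop.get("name"), shop.get("address"), shop.get("score")])
        time.sleep(3)  # throttle between pages to avoid anti-crawler checks
```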
The results can be divided into four categories:
- Cinemas: 8,195
- Hotels: 211,129
- Food: 490,928
- Life services: 432,803
That is about 1.15 million records in total.
Seeing this much data, I suddenly felt that 5K was too little!
That's it. If you need the source code, see the figure below.