Preface

Recently, in a Python crawler group, a lot of people were interested in Meituan's data, and someone even offered a handsome price: 5000 to crawl Meituan data?! At the time I was confused, but after crawling all the data I realized that 5000 was actually too little!



The Crawler Approach

There are many crawler frameworks out there; roughly, I follow the steps below to implement incremental crawling:

  • Fetch the data with Requests (or Selenium);
  • Check whether the data to be crawled already exists in the database;
  • Store the new records in a DataFrame object;
  • Insert them into the database.
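The steps above can be sketched as follows. This is a minimal illustration, not the original code: `fetch_shops()` stands in for the real Requests/Selenium fetch, and the table and column names (`shops`, `shop_id`, `name`) are assumptions; here SQLite is used so the dedup logic is self-contained.

```python
import sqlite3

def fetch_shops():
    # Placeholder for the real HTTP fetch, e.g. requests.get(url).json();
    # the records returned here are sample data for illustration only.
    return [
        {"shop_id": "1", "name": "Shop A"},
        {"shop_id": "2", "name": "Shop B"},
    ]

def save_new_shops(conn, shops):
    """Insert only the shops whose shop_id is not already in the database."""
    cur = conn.cursor()
    inserted = 0
    for shop in shops:
        cur.execute("SELECT 1 FROM shops WHERE shop_id = ?", (shop["shop_id"],))
        if cur.fetchone() is None:  # skip rows we already crawled earlier
            cur.execute(
                "INSERT INTO shops (shop_id, name) VALUES (?, ?)",
                (shop["shop_id"], shop["name"]),
            )
            inserted += 1
    conn.commit()
    return inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shops (shop_id TEXT PRIMARY KEY, name TEXT)")
first = save_new_shops(conn, fetch_shops())   # first run: everything is new
second = save_new_shops(conn, fetch_shops())  # second run: all duplicates, nothing inserted
print(first, second)
```

Run twice against the same database, the second pass inserts nothing, which is the whole point of the incremental design.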

Now that we have all the merchant URLs, only the last step remains. Note, however, that pages differ between data types, such as hotels:



So you need to write a different parsing function for each type. Finally, don't chase speed when crawling: Meituan's rate limits are very strict, so it's best to pause a few seconds between multi-threaded requests and let it run slowly.
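One way to organize the per-type parsers is a simple dispatch table, paired with a polite delay between requests. This is a sketch under my own assumptions: the parser functions, field names, and delay values are hypothetical, not taken from the original code.

```python
import random
import time

# Hypothetical type-specific parsers; each page type has its own fields.
def parse_hotel(page):
    return {"type": "hotel", "name": page["title"], "price": page["price"]}

def parse_food(page):
    return {"type": "food", "name": page["title"], "rating": page["rating"]}

# Dispatch table mapping a page type to its parser.
PARSERS = {"hotel": parse_hotel, "food": parse_food}

def parse_page(page_type, page):
    parser = PARSERS.get(page_type)
    if parser is None:
        raise ValueError(f"no parser for type {page_type!r}")
    return parser(page)

def polite_sleep(base=2.0, jitter=1.0):
    # Wait a few seconds (with jitter) between requests so the site's
    # strict rate limits are less likely to trigger.
    time.sleep(base + random.uniform(0, jitter))

row = parse_page("hotel", {"title": "Some Hotel", "price": 299})
print(row)
```

Adding a new category then only means writing one parser and registering it in `PARSERS`.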

Basic Environment Configuration

Version: Python3.6

System: Windows

Modules: csv, time, requests, json

Part of the code
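Since the code screenshot is not reproduced here, below is a hedged reconstruction of the save step using the modules listed above (`csv`, `json`); the field names and record layout are assumptions for illustration, not the author's actual code.

```python
import csv
import io
import json

def save_to_csv(records, fp):
    """Write crawled records to a CSV file-like object with a header row."""
    writer = csv.DictWriter(fp, fieldnames=["shop_id", "name", "category"])
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)

# In the real crawler the records would come from something like
# json.loads(requests.get(url).text); here we inline a sample payload.
sample = json.loads('[{"shop_id": "1", "name": "Shop A", "category": "food"}]')

buf = io.StringIO()
save_to_csv(sample, buf)
print(buf.getvalue())
```

In practice you would open a real file with `open("shops.csv", "w", newline="")` instead of the in-memory buffer.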


The results can be divided into four categories:

  • Cinemas: 8,195
  • Hotels: 211,129
  • Food: 490,928
  • Life services: 432,803

In total, roughly 1.15 million records.

Seeing so much data, I suddenly felt that 5K was too little!

That's it. If you need the source code, buddy, see the figure below.