Preface
Recently in a Python crawler group, a lot of people were very interested in the data on Meituan, and someone offered a handsome price: 5,000 to crawl Meituan data???? At the time I was puzzled, but after I had crawled all the data, I realized that 5,000 was actually too little!
The crawling approach
There are many crawler frameworks out there; roughly, I used the following ideas to implement incremental crawling (a sketch follows the list):
- Crawl the data with Requests (or Selenium);
- Check whether the data to be crawled already exists in the database;
- Store new records in a DataFrame object;
- Insert them into the database.
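As a minimal sketch of this dedup-then-insert flow (the `merchants` table, the `url`/`name` columns, and the use of SQLite are my own assumptions for illustration; the original post does not say which database it used):

```python
import sqlite3

import pandas as pd

# Hypothetical local database; any DB with a unique key on url works the same way.
conn = sqlite3.connect("meituan.db")
conn.execute("CREATE TABLE IF NOT EXISTS merchants (url TEXT PRIMARY KEY, name TEXT)")

def already_crawled(url):
    """Skip URLs that are already in the database (this is what makes the crawl incremental)."""
    cur = conn.execute("SELECT 1 FROM merchants WHERE url = ?", (url,))
    return cur.fetchone() is not None

def save_batch(rows):
    """Collect parsed rows into a DataFrame, then insert them into the database."""
    df = pd.DataFrame(rows, columns=["url", "name"])
    df.to_sql("merchants", conn, if_exists="append", index=False)
```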
Now that we have all the merchant URLs, we are on the last step. Note that different types of merchants have different page structures; a hotel page, for example, is laid out differently from a food page, so you need to write a separate parsing function for each type (see the sketch below). Finally, don't chase speed when crawling: Meituan's anti-crawler restrictions are very strict, so even with multiple threads it is best to wait a few seconds between requests and let it run slowly.
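Here is a minimal sketch of per-type parsing plus a throttled request. The parser functions, the type keys, and the 3-second delay are placeholders I made up for illustration, not Meituan's actual structure:

```python
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # a real crawl would rotate UAs and proxies

def parse_hotel(html):
    ...  # hotel pages have their own layout

def parse_food(html):
    ...  # food pages need a different parser

# Dispatch table: one parsing function per merchant type.
PARSERS = {"hotel": parse_hotel, "food": parse_food}

def crawl(url, page_type):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    time.sleep(3)  # be gentle: aggressive crawling gets blocked quickly
    return PARSERS[page_type](resp.text)
```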
Basic Environment Configuration
- Version: Python 3.6
- System: Windows
- Modules: csv, time, requests, json
Part of the code
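The original code was shown as an image and is not reproduced here, but given the modules listed above (csv, time, requests, json), the core loop likely looked something like this sketch. The endpoint URL, page range, and field names are guesses for illustration, not Meituan's real API:

```python
import csv
import json
import time

import requests

# Hypothetical JSON endpoint, purely for illustration.
API = "https://example.com/meituan/shops?page={}"

with open("shops.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "address", "score"])
    for page in range(1, 11):
        data = json.loads(requests.get(API.format(page), timeout=10).text)
        for shop in data.get("shops", []):
            writer.writerow([shop.get("name"), shop.get("address"), shop.get("score")])
        time.sleep(3)  # throttle between pages to avoid anti-crawler checks
```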
The results can be divided into four categories:
- Cinemas: 8,195
- Hotels: 211,129
- Food: 490,928
- Life services: 432,803
That is about 1.15 million records in total.
Seeing this much data, I suddenly felt that 5K was too little!
That's it. If you need the source code, see the figure below.