I work at an e-commerce (foreign trade) company, and I think I look great typing commands. Python changed me!

Personal Blog:

Blog.csdn.net/weixin_4294…

Preface

Here I use Python + Selenium to scrape housing data from Lianjia (链家, rendered literally as “linked home”). Analyzing request parameters is often a headache even when it can be done, and some pages are downright nasty: their parameters are encrypted.

The advantage of Selenium automation (I looked it up on Baidu… haha) is that it simulates manual operation of the web page. Compared with other crawlers, there are no request headers to write (I’m lazy). I have also heard that direct requests get blocked (403) more easily, but that is only hearsay!

Finally, I hope we can encourage each other and make progress together. Now I would like to share some knowledge about Python and Selenium with you…

The whole process

1. Install Selenium

Command line: type pip install selenium and press Enter

2. Download ChromeDriver, choosing the version that matches your own Chrome. Link here:

chromedriver.storage.googleapis.com/index.html

If you don’t know how to configure environment variables, put chromedriver directly in Python’s Scripts directory; then you do not have to declare a path

(More on that later)

3. Install PyQuery with pip and press Enter

pip install pyquery

4. Install PyMySQL

pip install pymysql

5. Once everything above is installed, we can start the fancy tricks.
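
The first thing the script has to do is start Chrome through ChromeDriver, which is where the path question from step 2 comes in. Here is a minimal sketch, assuming Selenium 4, where the driver location is passed through a Service object (older Selenium 3 releases took an executable_path argument on webdriver.Chrome instead). The helper names find_chromedriver and make_driver are made up for this example.

```python
import shutil


def find_chromedriver(explicit_path=None):
    """Return the chromedriver location to use, or None if none is known.

    An explicit path wins; otherwise search the directories on PATH
    (which covers Python's Scripts directory if that directory is on PATH).
    """
    if explicit_path:
        return explicit_path
    return shutil.which("chromedriver")


def make_driver(explicit_path=None):
    """Start Chrome through Selenium (assumes Selenium 4 is installed)."""
    # Imported here so the helper above works even without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    path = find_chromedriver(explicit_path)
    if path is None:
        # No known location: let Selenium try to locate the driver itself.
        return webdriver.Chrome()
    return webdriver.Chrome(service=Service(executable_path=path))
```

Calling make_driver() with no argument corresponds to the “no path declared” case; passing a path string declares it explicitly.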

Page analysis

We don’t have to declare request headers or analyze any parameters. As long as we have the URL, we can crawl whatever we want! The data to capture is as follows:

Extracting this data is the main event! Open the developer tools; in Google Chrome, just press F12

You will find that each community’s housing information sits in an li tag under a ul. Straight to the code

The first step is to get the page’s HTML, then use PyQuery to parse it and walk through the li tags; I use the items() method for this

To extract the community name, find the class of the a tag

To extract the house-type text, look under the span tag
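
A minimal sketch of the extraction steps above. The class names (list-content, title, house-type), the field names, and the sample HTML are assumptions for illustration, not the real Lianjia markup; check the actual names in the developer tools.

```python
SAMPLE_HTML = """
<html><body>
<ul class="list-content">
  <li><a class="title" href="#">Sunshine Garden</a><span class="house-type">2 bedrooms, 1 living room</span></li>
  <li><a class="title" href="#">River View</a><span class="house-type">3 bedrooms, 2 living rooms</span></li>
</ul>
</body></html>
"""


def parse_listings(html):
    """Return one dict per li under the listing ul."""
    # Imported here so the sample data above loads without pyquery installed.
    from pyquery import PyQuery as pq

    doc = pq(html)
    rows = []
    # items() yields each matched li as its own PyQuery object,
    # so the same find() calls work on every entry.
    for li in doc("ul.list-content li").items():
        rows.append({
            "community": li.find("a.title").text(),
            "house_type": li.find("span.house-type").text(),
        })
    return rows
```

In the crawler, the html argument would be the browser’s current page, which Selenium exposes as driver.page_source.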

All the other text fields are extracted the same way. One thing to watch out for when extracting:

Here we check for the ul’s class first, so that an occasional slow network load does not cause an error.
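
One way to make that check robust is to poll until the ul shows up. The sketch below hand-rolls the loop so that it only needs the driver’s find_elements method; Selenium also ships WebDriverWait and expected_conditions for the same purpose. The selector ul.list-content and the function name wait_for_list are assumptions.

```python
import time


def wait_for_list(driver, css="ul.list-content", timeout=10.0, poll=0.5):
    """Poll until at least one element matches css; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string behind Selenium's By.CSS_SELECTOR.
        if driver.find_elements("css selector", css):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Parsing would then only run when wait_for_list(driver) returns True, and the page could be reloaded or skipped otherwise.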

Simulate scrolling and clicking the next page

Let’s scroll the scroll bar rather than jumping straight to the next page. This is worthwhile because a lot of content is loaded asynchronously; Ajax loading is a good example, and comment sections use it heavily. Let’s take a look:
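
Scrolling is done by running JavaScript in the page through the driver’s execute_script method. A minimal sketch of two common variants follows; the function names are made up for this example, and 800 is the offset this post uses.

```python
def scroll_to_bottom(driver):
    """Jump straight to the bottom of the page."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")


def scroll_by(driver, pixels=800):
    """Scroll down by a fixed offset from the current position."""
    # Extra arguments to execute_script reach the script as arguments[i].
    driver.execute_script("window.scrollBy(0, arguments[0]);", pixels)
```

Scrolling in several smaller steps gives asynchronously loaded content a chance to appear before the page is parsed.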

This is another approach; which one to use is personal preference. I scroll by 800. After that comes the next page

Straight to the code implementation
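
A minimal sketch of the next-page click, assuming the pager exposes a link whose visible text is 下一页 (“next page”); both that text and the function name are assumptions, so adjust them to what the developer tools show.

```python
def click_next_page(driver, link_text="下一页"):
    """Click the next-page link; return False when there is none."""
    # "link text" is the string behind Selenium's By.LINK_TEXT.
    links = driver.find_elements("link text", link_text)
    if not links:
        return False  # probably the last page
    links[0].click()
    return True
```

A crawl loop could then scroll, parse driver.page_source, and stop once click_next_page returns False.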

You can add a few other small actions of your own!

Storing the data

I’m using MySQL to store the data here

I built the data table in advance. The table name is LIANjie_data and the database name is Lianjie; adjust these to your own setup. You could write a separate data_save method, but I’m going to be a bit crude here and just write everything from top to bottom. Okay, run the program
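
For readers who do want a separate save function, here is a minimal sketch. The table name is taken from the post as written; the three column names and the row keys are assumptions, and the function accepts any DB-API style connection whose cursors work as context managers, as PyMySQL’s do.

```python
INSERT_SQL = (
    "INSERT INTO LIANjie_data (community, house_type, price) "
    "VALUES (%s, %s, %s)"
)


def save_rows(conn, rows):
    """Insert scraped rows and commit; return how many rows were sent."""
    params = [(r["community"], r["house_type"], r["price"]) for r in rows]
    with conn.cursor() as cur:
        # %s placeholders let the driver escape the values safely.
        cur.executemany(INSERT_SQL, params)
    conn.commit()
    return len(params)
```

With PyMySQL, the connection would come from pymysql.connect(host=..., user=..., password=..., database="Lianjie", charset="utf8mb4"), filled in with your own credentials.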

Results

Doesn’t it look exactly like manual operation? In fact, I didn’t add many actions, hee hee. I’ll publish the source code later, and I’ll write up more small tips to share

We’ve come a long way together, so how about tapping “good-looking”?
