Preface: this article contains none of the project's actual source code. It is a lament for a crawler engineer's youth, which must eventually pass.

Right now I am sitting at my desk, looking at the crawler code for a certain website in front of me, lost in memories.

Origin

“You'll write the crawler for site XX, mainly to crawl personal profile data. Roughly how long do you need?” “Four or five days.” I had once written a Zhihu crawler in two or three days, so I was leaving myself plenty of time to slack off. “Fine, I’ll give you a week.” “Okay.”

This was my first project after joining the company with zero work experience: writing a crawler from scratch.

It was not adding a feature to an existing crawler system, nor was it filling holes in legacy code.

I gave myself four or five days.

And then I went down this road of no return…

The birth of a crawler (a wreck)

Using the search skills every programmer needs, I quickly found the code for a similar crawler on GitHub.

It took me only two days to write the crawler: POST requests, asynchronous Ajax loading, regex matching, JSON. I even optimized the original author’s code.

However, the author had written that code a long time ago, and the site's rules for obtaining some of the information had since changed. Because of the site's particular strategy, I learned to locate the target information in the page source in two passes: first find the value the page passes along, then use that value to match the target information.
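The two-pass idea can be sketched roughly as follows. This is only an illustration, not the project's code: the `data-token` attribute, the `<span id=...>` layout, and the `extract_target` name are all made up for the example, since the real site's markup is not shown here.

```python
import re


def extract_target(page_source):
    """Two-pass lookup: find the passed value first, then match the target with it."""
    # Pass 1: find the value the page passes along.
    # Assumption: it lives in a hypothetical data-token attribute.
    token_match = re.search(r'data-token="([^"]+)"', page_source)
    if token_match is None:
        return None
    token = token_match.group(1)

    # Pass 2: use that value to pick out the element holding the target information.
    # re.escape keeps any regex metacharacters in the token from being interpreted.
    target_pattern = r'<span id="' + re.escape(token) + r'">([^<]*)</span>'
    target_match = re.search(target_pattern, page_source)
    return target_match.group(1) if target_match else None


# Hypothetical page fragment: the decoy span is skipped because its id differs.
SAMPLE = (
    '<div data-token="a1b2"></div>'
    '<span id="zzzz">decoy</span>'
    '<span id="a1b2">alice@example.com</span>'
)
```

Calling `extract_target(SAMPLE)` picks the span whose id equals the token found in the first pass and ignores the decoy.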

Soon my crawler was happily at work.

After that, with tears streaming down my face, I spent more than a month filling the holes in my own crawler (waving goodbye).

The first obstacle

Once the crawler was running, a verification-code check appeared because a single account was sending too many requests. The site emails a verification code to the registered mailbox, and only after entering it can you log in again.

I never got past that hurdle with requests alone. I ended up using Selenium to log in to the site and submit the verification code, then passing the cookies to the function that fetches the information.

Along the way, I learned to use the imaplib module to log in to the mailbox and retrieve the verification code.
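For readers who have not used imaplib, here is a minimal sketch of that step, under assumptions of my own: the code is assumed to be six digits in a plain-text part, and `fetch_latest_code`, its parameters, and the use of the INBOX folder are illustrative rather than taken from the real project.

```python
import email
import imaplib
import re

# Assumption: the verification code is a standalone 6-digit number.
CODE_PATTERN = re.compile(r"\b(\d{6})\b")


def extract_code(raw_message):
    """Return the first 6-digit code found in a raw RFC 822 message, or None."""
    msg = email.message_from_bytes(raw_message)
    # walk() also yields the message itself when it is not multipart.
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        match = CODE_PATTERN.search(payload.decode(charset, errors="replace"))
        if match:
            return match.group(1)
    return None


def fetch_latest_code(host, user, password):
    """Log in over IMAP and scan the inbox from newest to oldest for a code."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select("INBOX")
        _status, data = conn.search(None, "ALL")
        for msg_id in reversed(data[0].split()):
            _status, parts = conn.fetch(msg_id, "(RFC822)")
            code = extract_code(parts[0][1])
            if code:
                return code
    return None
```

Keeping the parsing in `extract_code` separate from the network calls makes it possible to check the parsing against a saved email without touching a real mailbox.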

Deploying to the server

With the verification-code login problem solved, the program was running again and needed to be deployed to the server.

The Chrome browser that Selenium drives is not well suited to running on a server, so I learned to pair Selenium with the headless browser PhantomJS.

The server was new and its environment had to be configured, so I also learned the basics of setting up a server.

Functional separation

When the number of requests sent by a single account reached the daily threshold, the site mercilessly banned the account: no cooldown, a permanent ban.

My colleagues supplied me with accounts in batches, and I burned through them one after another.

So I changed my account strategy again: multiple accounts taking turns, each crawling only a few times per turn.

At this point, constantly logging accounts in through Selenium had become the bottleneck for the crawler's speed, so I started splitting the code.

One piece of code refreshes the cookies for each account and another simply reads them; the usable cookies are stored in MySQL.
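The take-turns strategy might look something like the sketch below. The function name `rotate_accounts` and the `per_turn` parameter are hypothetical, and the real crawler presumably tracked bans and daily quotas as well, which this leaves out.

```python
from itertools import cycle


def rotate_accounts(accounts, urls, per_turn=3):
    """Yield (account, url) pairs, switching to the next account every `per_turn` URLs."""
    if not accounts:
        raise ValueError("at least one account is required")
    account_cycle = cycle(accounts)  # wraps around to the first account forever
    current = next(account_cycle)
    used = 0
    for url in urls:
        if used == per_turn:
            # This account has had its turn; hand over to the next one.
            current = next(account_cycle)
            used = 0
        yield current, url
        used += 1
```

With two accounts and `per_turn=2`, the first two URLs go to the first account, the next two to the second, and the fifth wraps back to the first.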

This separation of functions has a fancy name: the producer-consumer model.
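A rough sketch of that split is shown below. To keep it runnable without a database server it swaps in the standard-library sqlite3 module for MySQL, and the table layout, function names, and one-hour freshness window are all assumptions made for the example.

```python
import json
import sqlite3
import time


def init_store(conn):
    """Create the cookie table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cookies ("
        "account TEXT PRIMARY KEY, cookie_json TEXT NOT NULL, updated_at REAL NOT NULL)"
    )
    conn.commit()


def save_cookie(conn, account, cookie):
    """Producer side: store the cookie obtained after a (slow) browser login."""
    conn.execute(
        "INSERT OR REPLACE INTO cookies (account, cookie_json, updated_at) VALUES (?, ?, ?)",
        (account, json.dumps(cookie), time.time()),
    )
    conn.commit()


def load_cookie(conn, max_age_seconds=3600.0):
    """Consumer side: return (account, cookie) for the freshest usable row, or None."""
    cutoff = time.time() - max_age_seconds
    row = conn.execute(
        "SELECT account, cookie_json FROM cookies "
        "WHERE updated_at >= ? ORDER BY updated_at DESC LIMIT 1",
        (cutoff,),
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1])
```

The point of the design is that the crawler only ever calls the fast `load_cookie`, while the slow login work happens in a separate process that calls `save_cookie`.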

Multithreading

By now the crawler could run without obstacles, and the remaining problem was its speed. When I turned on multithreading, trouble came again. Extracting the same URL twice is not a concern when a single instance is running, but once concurrency such as multithreading or multiprocessing is involved, URL deduplication becomes necessary.

Again I separated the functions: the URLs to be crawled are put into Redis, and the crawler obtains each URL with a pop operation, which removes the URL as it is retrieved, so nothing is extracted twice. Redis is also an excellent database for caching.

To work with multiple threads, I separated out everything not directly related to crawling, such as obtaining cookies and storing the data, and all communication between the parts went through Redis.
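As an illustration only, here is one way the pop-based queue could be written with the third-party redis-py client. It uses a Redis set with `SADD` and `SPOP`, which is my choice for the sketch; the source only says a pop function was used. The key name and function names are hypothetical.

```python
URL_SET = "crawler:pending_urls"  # hypothetical key name


def connect(host="localhost", port=6379):
    """Create a redis-py client; the package is imported lazily so the rest runs without it."""
    import redis  # third-party package: pip install redis

    return redis.Redis(host=host, port=port, decode_responses=True)


def enqueue_urls(client, urls):
    """Add URLs to the pending set; returns how many were not already pending."""
    urls = list(urls)
    if not urls:
        return 0
    # A set silently ignores members it already holds, so pending URLs are deduplicated.
    return client.sadd(URL_SET, *urls)


def next_url(client):
    """Atomically remove and return one pending URL, or None when the set is empty."""
    # SPOP removes the member as it returns it, so two workers never get the same URL.
    return client.spop(URL_SET)
```

Each worker thread would loop on `next_url(client)` until it returns `None`. Note that the set only deduplicates URLs that are still pending; a URL added again after it has been popped would be crawled again unless a separate "seen" set is kept.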

Finally

You might think: isn’t that just the normal flow of a project?

Why is this article titled “Crawler Engineer Withdrawal Essay”?

Because my accounts were still getting banned…

So I took over my colleague’s account registration project…

Spent N-plus in funds buying IPs and phone numbers for receiving verification codes…

Finally I was ready to happily deploy the registration code to the server and close the loop…

.

.

.

.

.

Only to find that the site had changed its registration policy…

Summary: You see, even after learning all of this, I still can’t make it as a crawler engineer. So please don’t message me privately to ask what skills a crawler engineer needs! Run while you can! (Knock on the blackboard! This is the key point!)