Preface

There is no code in the story itself, so feel free to read on.

Years later, faced with the chaotic code of an AI researcher, I would recall that distant afternoon when I first met Mr. S. At that time, Company B was a small team of six people. Macs and monitors were laid out on one table. We sat so close together that nobody needed to call anyone by name; you just turned your head and the other person knew you were talking to them. Everything looked so promising, and we all wanted to grow together with the company.

Mr. S had just come back from Canada. The boss introduced him to us, and as the data product manager he began working with me on projects.

Everyone in a startup needs a lot of skills, so Mr. S started teaching himself Python.

One day, Mr. S asked me: “Do you play Minecraft?”

“Yes, but I prefer watching other people’s worlds on Bilibili,” I replied.

“Programming now feels like playing in my own Minecraft world,” Mr. S said with a smile.

“Do you ever feel that you’ve mastered Python’s basic syntax, yet when you watch other people do amazing things with Python, you have no idea what to do with it yourself?”

“You really do understand me.”

“Then come and learn a mixed discipline.”

And so I lured Mr. S into writing crawlers with me.

Later, Mr. S left Company B.

Three months later, I left, too.

We never saw each other again.

The most important skill in programming is flexibility

Mr. S is an honest boy.

While developing a crawler, he found that the data returned by a website’s interface looked like JSON, so he parsed it with Python’s built-in json module. Parsing failed, because this JSON-looking data had no double quotes.

Was it some superset of JSON? Mr. S searched around and found that a YAML library might be able to parse it. He installed one and got another parse error.

Was the data simply a Python dictionary? Mr. S reached for the evil eval. It raised an error: the data contained null and lowercase true.

“Why don’t you try using regular expressions directly?” I said to Mr. S.

“Damn!” Mr. S slapped the table, and the boss sitting next to him was so startled that he spilled the cola from his enamel mug onto his white shirt.

Mr. S then used a regular expression and ended the fight in 10 seconds.
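
The article does not show the real response, so the following is only a minimal sketch of the regular-expression shortcut. The sample body `RAW`, its field names, and the helpers `extract_fields` and `json_rejects` are all hypothetical, made up to mimic "looks like JSON, but has no double quotes".

```python
import json
import re

# Hypothetical response body: JSON-like, but keys are unquoted and strings
# use single quotes, so a strict JSON parser rejects it.
RAW = "{name: 'Sunrise Hotel', price: 428, available: true, note: null}"


def json_rejects(raw):
    """Return True if the standard json module refuses to parse the text."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return True
    return False


def extract_fields(raw):
    """Pull out only the fields we care about instead of parsing everything."""
    name = re.search(r"name:\s*'(.*?)'", raw).group(1)
    price = int(re.search(r"price:\s*(\d+)", raw).group(1))
    return {"name": name, "price": price}


fields = extract_fields(RAW)
```

The point is not that regular expressions are a better parser. When only two or three fields matter, matching them directly can be quicker than hunting for a library that understands the whole format.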

About crawlers and the Three Gorges Dam

One day, Mr. S came to me excitedly and said, “I have finally understood the great purpose of the Three Gorges Dam!”

“Are you a crawler engineer or a hydraulic engineer?”

“You know, no matter how rough the water is upstream, it always comes out of the dam safe and steady.” Mr. S did not answer my question and just kept talking to himself.

“So you started using Kafka. Yes, I can teach you.”

Mr. S stuck out his tongue: “It’s all thanks to your excellent teaching, Master.”

Not long before, Mr. S’s crawler had just reached its goal of 10 million records per day. But he was happy for only one day, because he found that once the data was written to the database, reading it back out was a chore.

Mr. S had several data analysis systems that needed to read crawler data from the database, but finding specific records among tens of millions of new rows per day was slow. And if a program crashed on an exception, it had to start reading from the beginning again.

Mr. S asked me, “Right now every data analysis script has to read from the database, which means so much repeated work that the single-node database can barely hold up. Do I need to learn how to build a cluster?”

I told Mr. S, “Of course you’ll need to do that eventually. But for now, you can try Kafka. I’ve built a Kafka cluster, and you can use it like this…”

Later, Mr. S had all the crawlers send their data straight to Kafka, and everything downstream read from Kafka. One consumer group backed up the raw data, one generated intermediate tables, one handled monitoring and alerts, and one drew the dashboards. No matter how much data the crawlers stuffed into Kafka or how fast, each reader could consume it at its own pace.
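
To see why separate groups let every reader go at its own pace, here is a toy in-memory model in Python. It is not Kafka’s API and not Mr. S’s setup: the class `ToyLog`, the group names, and the records are hypothetical. It only imitates the core idea that the log is append-only and each group remembers its own read position.

```python
class ToyLog:
    """In-memory stand-in for a single Kafka topic partition (illustration only)."""

    def __init__(self):
        self.messages = []  # append-only log written by the crawlers
        self.offsets = {}   # group name -> index of the next unread message

    def produce(self, message):
        """Append one record; producers never wait for consumers."""
        self.messages.append(message)

    def poll(self, group, max_records=10):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.messages[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


log = ToyLog()
for i in range(5):
    log.produce({"hotel_id": i, "price": 300 + i})

backup_batch = log.poll("backup", max_records=5)        # fast reader takes everything
dashboard_batch = log.poll("dashboard", max_records=2)  # slow reader takes two records
dashboard_next = log.poll("dashboard", max_records=2)   # and resumes where it stopped
```

Because the position is stored per group, a reader that restarts after a crash can continue from its last offset instead of rereading everything, which was exactly the pain point with the database.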

Now that we’ve collected the data, we need to make it shine

Mr. S majored in financial mathematics and statistics when he studied in Canada, so he was also interested in data analysis. After his crawler had collected enough data, I showed him how to analyze it with Pandas.

Mr. S shared his analysis of hotel price changes with us. As you would expect from a young man with a background in financial mathematics and statistics, his Jupyter notebook not only reproduced the price movements of the past year perfectly, but also predicted future changes, with forty-six charts that seemed to exhaust every possible combination.

Grass, wood, bamboo and stone can defeat the enemy

Mr. S once came across a particularly simple e-commerce site. The pages were an almost pixel-level copy of Taobao, but there was no anti-crawler mechanism at all. At Mr. S’s level, it took only half an hour to inspect the elements and finish development. The crawler ran safely and smoothly for three weeks.

Then, one morning, the crawler died.

Mr. S threw everything he had ever learned at it, but could no longer crawl anything of value from the site. The site seemed to have hired a master of detecting machine behavior. Humans had no problem using a browser, but every trick for disguising a crawler was spotted with ease.

Mr. S came to me: “Master, I can’t handle this website.”

“You can handle it. Use your head.”

“I’ve used every skill I have, and I can’t see any way to break their anti-crawler mechanism.” Mr. S had lost heart.

“Well, don’t fight it with technology. Use your head.”

Mr. S grabbed his monitor and banged his head against it again and again.

I asked Mr. S: “Have you thought about this? The site copied Taobao’s look, yet it had no anti-crawler mechanism at all. What kind of person do you think its boss is? Have you heard the joke?”

Mr. S suddenly jumped to his feet: “‘I’ll give you ten thousand yuan to build me a website. What kind of website? Very simple, just like Taobao.’ You mean that joke?”

“Yes.”

Mr. S’s face suddenly lit up: “I have a solution!”

Mr. S reopened the website in his browser and found the customer service hotline. As soon as the call connected, he launched into a profanity-laced tirade: “… What the hell is going on with your website? Why could I log in before but can’t log in today? Get your boss! I’ll show him how to make a website! …”

Half an hour later, all of the site’s anti-crawler mechanisms had been removed.

At that moment, Mr. S faced west, pressed his palms together, and murmured to himself: “Sorry, brother. I had to let you take the blame.”

Did you pass notes in grade school?

“Now I understand the mentality of people who intercept notes.” That’s what Mr. S said to me the first time he used Charles.

Since then, I have rarely seen Mr. S analyze web pages, because while developing crawlers he learned to analyze WeChat mini programs and mobile apps with man-in-the-middle techniques. This way he can often get the data directly and store it as soon as it arrives, with no more annoying XPath expressions or regular expressions that read like strings of emoticons.

The other day, I was playing a web-based hacker game: you search each page for a password hidden somewhere, then type it into the answer box to reach the next level.

The game had 12 levels, and I was stuck on level 6. Mr. S came over with his computer and showed me the completion page for level 12.

“Did you replace this website’s JS file using mitmproxy?”

“As expected, I can’t hide anything from you, Master.”

“You intercept a note, edit it, fold it back up, and pass it on. Do you ever think about the feelings of the sender and the receiver?”

“I never passed notes in primary school.”

Encryption? There’s no such thing

“There are no secrets on the front end,” Mr. S said to me after successfully reverse-engineering a website’s JS file.

“That’s because the site’s JS code is naked in front of you. It isn’t obfuscated,” I said to Mr. S.

“No, I can use Node.js to run obfuscated code. I’ve built the Node.js service, just pass in the JS code, and it’ll send me the results back.” Mr. S seems to have taken it personally.

“When did you learn Node.js?”

“Wasn’t it you, Master, who said that one can never have too many skills? Since crawlers require JavaScript, I learned Node.js.” Mr. S’s unruffled expression seemed to show that he had already guessed what I was going to ask next.

“What if the target doesn’t have a website, just an app?”

“No problem. I’ve done a little Android reverse engineering as well. I can read Java.”

“I don’t think I need to teach you any more.”

I could crush you with a finger, but I don’t want to hurt you

Mr. S asked me one day, “Suppose you were in a primary school class and the classmate in front of you asked you to pass notes to the girl behind you. What would you do?”

I said: “Read and copy them, modify or delete them, or intercept and drop them.”

“Say there were three notes: ‘I heard that your grandma is ill, let’s go to visit her this weekend’, ‘My parents are not home tonight, do you want to come over to my house?’, and ‘I just got this month’s pocket money. As soon as the teacher lets us out, let’s go eat something delicious.’”

I said, “If the girl is beautiful, I will change the second note to ‘My parents are not home tonight, shall we go to Qingnan’s house to play?’”

Mr. S gave me a look of disgust: “Master, you said that you hate low-tech things most. If you alter the note, won’t others notice? Your handwriting is different!”

I asked Mr. S: “Then what brilliant idea do you have?”

Mr. S looked up at the sky outside the window: “If it were me, I would copy the handwriting of the words about being ill and visiting her from the first note, the word ‘parents’ from the second, and the words about the teacher and the money from the third. Then, rearranged, they become: ‘Mom and Dad, my teacher is ill, and I need money to visit her.’ Finally, I would take the forged note to the parents of the classmate who wrote it and ask them for money.”

“My guess is that you want to intercept someone else’s cookies with a man-in-the-middle attack and then use those cookies to sneak in for your own ulterior motives.”

Mr. S laughed and said, “Ha ha ha, I only dare to think about it. But whenever I remember that I hold such terrifying power and can still keep it under control, I know I am different from the average person on the street.”

You must have gone to fleece the livestream quizzes

At the end of last year, livestream quiz shows became quite popular. By that time, Mr. S and I had been apart for some time. I’m sure there were no fewer than six Android phones hooked up to Mr. S’s computer every night, each running a different quiz platform, automatically reading the questions on the screen and selecting the answers.

I was the one who taught Mr. S Android automated testing. The original idea was for him to combine it with crawlers and control a group of phones to capture data that was otherwise hard to get, but I am sure he used it to answer quiz questions.

Flexibility. He’s getting better at it.

Let’s just hope he doesn’t turn into a full-time freebie hunter.

Afterword.

Since then, I have never met anyone as interesting as Mr. S. So I took what I taught Mr. S and wrote a book called Python Crawler Development from Scratch, which is now available on JD.com, Dangdang and Amazon. I hope you can become as interesting and powerful as Mr. S.

  • JD.com: item.jd.com/12436581.ht…
  • Dangdang: product.m.dangdang.com/25349717.ht…
  • Amazon: www.amazon.cn/dp/B07HGBRX…

Crawling is a mixed discipline. Over a complete development process, the knowledge involved can include, but is not limited to: Python, HTML, JavaScript, regular expressions, XPath, databases, Redis, message queues, Docker, ELK, Hadoop, data analysis, ETL, man-in-the-middle attacks, automated testing, visualization…

In a large company, any one of these could keep a whole team of people busy.

Crawler development, like the word that comes up again and again in this article, is about flexibility: any technology can be used as long as it gets the data. As the saying goes, grass, wood, bamboo and stone can all be used as swords. Crawling should not be a boring, unchanging, formulaic job. It is a job full of ideas and challenges, one that makes onlookers say, “Wait, you can do it that way?”

Crawler development is definitely more than knowing how to use frameworks and libraries such as Scrapy, PySpider, or requests. Therefore, in this book, I deliberately cut back on instructions for using frameworks and focused on methods and practices for breaking through anti-crawler mechanisms, or for finding workarounds that bypass them.

After learning crawlers, you may not end up working as a crawler engineer. But along the way you will be exposed to a variety of tools, methods and service components that will help you in your later life and work, and let you know where to look for a solution when you meet a problem.

The readers’ group for this book is now open. Scan the QR code to follow the official account and reply “reader communication” to find out how to join the group.