The Python web crawler tutorial you’ve been waiting for is here. This article shows you how to find the links and descriptions you’re interested in on a web page, grab them, and store them in Excel.

The need

I often receive messages from readers in the background of my WeChat official account.

Many of them are questions, and whenever I have time I try to answer them.

But some messages, at first glance, don’t seem to make sense.

For example:

A minute later, perhaps feeling it wasn’t quite right (or remembering that I write in simplified characters), the reader repeated the question in simplified characters.

Then it dawned on me.

This reader assumed my official account had a keyword auto-reply feature that pushes the matching article. Having read my other data science tutorials, he wanted to see the one on “crawlers.”

Sorry, I hadn’t written any crawler articles at that point.

And my official account doesn’t have that kind of keyword auto-reply set up yet.

Mainly because I’m lazy.

As more of these messages came in, I could sense what readers needed. More than one reader has asked for a crawler tutorial.

As I have mentioned before, the mainstream, legitimate ways of collecting data from the web fall into three categories:

  • Downloading open datasets;
  • Reading data through APIs;
  • Scraping with a crawler.

I’ve already said a bit about the first two methods; this time, let’s talk about crawlers.

concept

Many readers are confused about what a crawler actually is, so it’s worth drawing a distinction first.

Here’s what Wikipedia says:

A web crawler (also known as a spider) is a kind of Internet bot that browses the World Wide Web automatically, typically for the purpose of building a web index.

The question is: if you’re not building a search engine, why are you so enthusiastic about web crawlers?

In fact, what many people mean when they say “web crawler” is really a different activity, known as “web scraping.”

Wikipedia explains the latter as follows:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.

See? Even copying data down manually with your browser counts as web scraping. Don’t you feel more capable already?

But that’s not the end of the definition:

While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.

In other words, what you really want is a bot (or crawler) that does the grabbing for you automatically.

And what do you do with the data?

It is usually stored in a database or spreadsheet for retrieval or further analysis.

So, here’s what you really want:

Find the links, fetch the web pages, extract the specified information, and store it.

The process can go back and forth, or even snowball.

You want to do it in an automated way.

With that in mind, stop fixating on crawlers. Crawlers were, in fact, developed to build search engines’ index databases. Using one just to grab a bit of data is like firing a cannon at a mosquito.

To truly master crawlers, you need a lot of background knowledge: HTML, CSS, JavaScript, data structures…

That’s why I’ve been hesitant to write a crawler tutorial.

However, in the past couple of days I came across a passage by the editor Wang Shuo that I found very inspiring:

I like to talk about an alternative 80/20 rule: put in 20% of the effort and understand 80% of a subject.

Our goal here is clear: grab data from web pages. The most important skill to master is how to quickly and effectively extract the information you want from a web page.

Even after mastering it, you can’t yet claim to have learned web crawling.

But with this foundation, you can get at data much more easily than before. It is especially useful in the many application scenarios that “liberal arts” readers encounter. That’s empowerment.

Furthermore, it then becomes much easier to go on and understand how crawlers really work.

This, too, is an application of the “alternative 80/20 rule.”

One of the most important strengths of the Python language is its powerful software toolkits, many of them provided by third parties. You only need to write a simple program to automatically parse web pages and capture the data you need.

This article shows you the process.

The target

To practice scraping web data, let’s set a small goal.

The goal shouldn’t be too complex, but completing it should help you understand web scraping.

Let’s just pick one of my recently published Jianshu articles as the page to scrape. It’s called “How to Get Started with Data Science with Yushu Zhilan?”.

In that article, I reorganized and re-sequenced a series of my previously published data science articles.

It contains the titles of and links to many earlier tutorials, such as the ones circled in red in the image below.

Suppose you are interested in all the tutorials mentioned in that article and want to get their links and store them in Excel, like this:

You need to extract this scattered, unstructured information (links embedded in natural-language text) and store it.

What to do?

Even if you don’t know how to program, you can read through the entire article, find the links one by one, and manually copy the titles and links into an Excel spreadsheet.

However, this manual collection method is not efficient.

We use Python.

The environment

An easier way to install Python is to install the Anaconda suite.

Please download the latest version of Anaconda at this website.

Select Python 3.6 on the left to download and install.

If you need a step-by-step guide, or want to know how to install and run Anaconda on Windows, check out this video tutorial I’ve prepared for you.

Once you have Anaconda installed, go to this site to download the tutorial package.

After downloading and unpacking, you will see the following three files in the generated directory (hereinafter referred to as the “demo directory”).

Open the terminal and use the cd command to enter the demo directory. If you don’t know how, you can also refer to the video tutorial.

We need to install some environment dependency packages.

First execute:

pip install pipenv

This installs pipenv, an excellent Python package-management tool.

After installation, perform the following steps:

pipenv install

See the two files whose names begin with “Pipfile” in the demo directory? Those are pipenv’s configuration files.

pipenv reads them and automatically installs all the dependency packages we need.

In the image above, the green progress bar shows the number of packages to be installed and the actual progress.

After installation, we execute as prompted:

pipenv shell

Make sure you have Google Chrome installed on your computer.

Then run:

jupyter notebook

Your default browser (Google Chrome) will open, showing the Jupyter Notebook interface:

You can click on the ipynb file, the first item in the list of files, to see the entire sample code for this tutorial.

You can execute the code in turn while watching the tutorial.

Instead, I suggest going back to the main screen and creating a new, blank Python 3 notebook.

Please follow the tutorial and enter the corresponding content character by character. This will help you understand the code more deeply and internalize your skills more effectively.

With all the preparation done, let’s start typing the code.

code

The package we need to fetch and parse the web page is requests_html. We don’t need its full functionality here; we only import HTMLSession.

from requests_html import HTMLSession

Next, we set up a session, where Python acts as a client and talks to the remote server.

session = HTMLSession()

As I said, the web page we are going to collect information from is the article “How to Get Started with Data Science with Yushu Zhilan?”.

We find its URL and store it in a variable named url.

url = 'https://juejin.cn/post/6844903628847710221'

The following statement uses the session’s get function to retrieve the entire page at that link.

r = session.get(url)

So what’s actually in the page?

We tell Python to treat the server’s response as HTML. I don’t want to wade through all of HTML’s messy formatting markup; I just want the text.

So we execute:

print(r.html.text)

This is the result obtained:

Good; now we know the page was retrieved correctly and its content is complete.

Okay, let’s see how we can get closer to our goal.

Let’s try to get all the links contained in the web page using a simple and crude method.

Treating the returned content as HTML, we look at its links property:

r.html.links

This is the result returned:

So many links!

Excited, right?

But have you noticed? Many of the links here seem incomplete. For example, the first result is only:

'/'

What is this? Did the link extraction go wrong?

No. These things that don’t look like links are relative links: a link’s path given relative to the domain (https://www.jianshu.com) of the page we scraped.

It’s like sending a package domestically: you fill in “XX Province, XX City…” on the form without writing the country name. Only for international shipments do you need to add the country.
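As a small aside, if you are curious how a relative link resolves into a full address, Python’s standard library can show you. This is just an illustration; the requests_html package will do it for us automatically in a moment:

from urllib.parse import urljoin

# '/' and '/nb/130182' are relative links; joined with the site's domain they become full URLs.
base = 'https://www.jianshu.com'
print(urljoin(base, '/'))            # https://www.jianshu.com/
print(urljoin(base, '/nb/130182'))   # https://www.jianshu.com/nb/130182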

But what if we want to get all the links we can access directly?

Very easy, and only requires a Python statement.

r.html.absolute_links

Here, we want “absolute” links, and we get the following result:

Does it look better this time?

We’ve done our job, haven’t we? Aren’t the links all here?

Yes, the links are all here, but is it different from our goal?

Check carefully; yes, there is.

Not only do we need to find the link, but we also need to find the description of the link. Is that included in the result?

No.

Are the links in the resulting list all that we need?

No. Just judging by the list’s length, you can tell that many of these links are not the addresses of the other data science articles referenced in the post.

This simple and crude way of listing all the links in an HTML file will not work for this task.

So what do we do?

We have to learn to tell Python exactly what we’re looking for. This is the key to web scraping.

Think about it. What if you wanted an assistant (human) to do this for you?

Would you tell him:

“Find all the clickable blue text links in the article, copy each link’s text into Excel, then right-click and copy the corresponding URL into Excel as well. Each link gets one row in Excel, with the text and the URL in one cell each.”

Although carrying this out is tedious, the assistant can understand the instruction and do it for you.

Try saying the same thing to a computer… Sorry, it doesn’t understand.

Because this is what you and your assistant are looking at.

The computer sees a web page that looks something like this.

To help you see the source code clearly, the browser colors different types of data and numbers the lines.

None of these visual aids are available to the computer; it sees only strings of characters.

What can be done?

If you look closely, you will notice that in the HTML source code, pieces of text and image links are wrapped in angle-bracketed markers before and after the content. These are called “tags”.

HTML is, after all, a markup language (HyperText Markup Language); the markup is exactly these tags.

What is the function of the tag? It can break down the entire file into layers.

(Photo credit: https://goo.gl/kWCqS6)

If you want to send someone a package, you write the address in the structure “province – city – district – street – community – house number”, and the courier can find the recipient by following it.

Similarly, when we are interested in a particular piece of content on a web page, we can follow the structure of the tags to find it.
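To make this layered structure concrete, here is a minimal sketch using the same requests_html package. The HTML fragment below is made up purely for illustration; it is not taken from the article we are scraping:

from requests_html import HTML

# A tiny, made-up fragment: a link nested inside a paragraph, inside a div, inside body.
doc = """
<body>
  <div class="article">
    <p>Some text with a <a href="https://www.example.com">link</a> inside.</p>
  </div>
</body>
"""
fragment = HTML(html=doc)

# Walking down the layers "div.article > p > a" leads straight to the link.
link = fragment.find('div.article > p > a', first=True)
print(link.text)            # link
print(link.attrs['href'])   # https://www.example.com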

Does that mean you have to know HTML and CSS before you can crawl?

No, we have tools that can help you dramatically simplify your tasks.

This tool comes with Google Chrome.

On the sample article’s page, right-click and choose “Inspect” from the menu that pops up.

A panel appears at the bottom of the screen.

Click the button in the panel’s upper-left corner (marked in red). Then move the mouse over the first text link in the article (“Yushu Zhilan”) and click it.

You’ll notice that the content of the panel below changes too: the source code for this link is highlighted in the middle of the panel.

After confirming that this is indeed the link and description text we are looking for, right-click the highlighted area and, in the pop-up menu, choose Copy -> Copy selector.

Go to a text editor, paste it, and you’ll see exactly what we’ve copied.

body > div.note > div.post > div.article > div.show-content > div > p:nth-child(4) > a

This long chain of tags tells the computer: find the body tag and enter its scope; inside it, find the div.note tag; keep descending level by level; and finally locate the a tag. That is what we are looking for.

Go back to our Jupyter Notebook and store the selector path we just obtained in a variable named sel.

sel = 'body > div.note > div.post > div.article > div.show-content > div > p:nth-child(4) > a'

Now we ask Python to find the elements matching sel in the returned page and store them in the variable results.

results = r.html.find(sel)

Let’s see what’s in results.

results

Here are the results:

[<Element 'a' href='https://www.jianshu.com/nb/130182' target='_blank'>]

results is a list containing just one item, and that item holds the URL of the first link we are looking for (“Yushu Zhilan”).

But where is the description text “Yushu Zhilan”?

Don’t worry; let’s have Python show the text associated with this result.

results[0].text

Here is the output:

'Yushu Zhilan'

Let’s extract the links as well:

results[0].absolute_links

The result is a collection.

{'https://www.jianshu.com/nb/130182'}

We don’t want a set, just the link string inside it. So we first convert it to a list, then take the first item, which is the URL.

list(results[0].absolute_links)[0]

This time, we finally got the result we wanted:

'https://www.jianshu.com/nb/130182'
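As a small aside, find() also accepts a first=True argument that returns just the first match instead of a list, so an equivalent shortcut could look like this:

result = r.html.find(sel, first=True)
list(result.absolute_links)[0]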

The experience of handling this first link gives you confidence, doesn’t it?

For the other links we can do the same: find the selector path and follow suit.

However, it would be too much trouble to manually enter these statements for every link you find.

Here’s the programming trick: run the statements one by one first, and once they all work, merge them into a simple function.

Given a selector path (sel), the function returns all the description texts and links it finds.

def get_text_link_from_sel(sel):
    # Collect a (text, link) pair for every element matched by the selector.
    # Note: this uses the page object r we fetched earlier in the notebook.
    mylist = []
    try:
        results = r.html.find(sel)
        for result in results:
            mytext = result.text                         # the link's description text
            mylink = list(result.absolute_links)[0]      # the link's absolute URL
            mylist.append((mytext, mylink))
        return mylist
    except:
        return None

Let’s test this function.

We pass in the same sel as before and see what happens.

print(get_text_link_from_sel(sel))

The following output is displayed:

[('Yushu Zhilan', 'https://www.jianshu.com/nb/130182')]

No problem, right?

Ok, let’s try the second link.

As before, use the button in the upper-left corner of the panel below and click the second link in the article.

The highlight that appears below has changed:

So again, I’m going to right-click on the highlighted part and copy out the selector.

Then we paste the obtained selector path directly into the Jupyter Notebook.

sel = 'body > div.note > div.post > div.article > div.show-content > div > p:nth-child(6) > a'

Using the function we just wrote, what is the output?

print(get_text_link_from_sel(sel))

The output is as follows:

[('How to make word clouds in Python?', 'https://juejin.cn/post/6844903629304889357')]

That’s it. There’s no problem with the function.

What’s the next step?

Are you going to go to the third link and do the same thing?

If so, you might as well go through the article and extract everything by hand; it would be less trouble.

We need to find a way to automate the process.

Compare the two selector paths we just obtained:

body > div.note > div.post > div.article > div.show-content > div > p:nth-child(4) > a

And:

body > div.note > div.post > div.article > div.show-content > div > p:nth-child(6) > a

Do you see any patterns?

Yes: every part of the path is identical except the second-to-last component, where the number inside p:nth-child() differs (4 versus 6).

This is the key to our automation.

Each of the two paths above returned only a single result, because the search was restricted to the a tag inside one specific paragraph (the nth-child part).

What if we don’t pin down the exact position of that p tag?

Let’s try keeping everything in the selector path except the position constraint on p:

sel = 'body > div.note > div.post > div.article > div.show-content > div > p > a'

Run our function again:

print(get_text_link_from_sel(sel))

Here is the output:

All right, everything we’re looking for, it’s all here.

But our work is not done.

We also need to export the collected information to Excel for saving.

Remember pandas, our trusty data frame tool? Time to put it to work again.

import pandas as pd

With just one line, we can turn the list into a data frame:

df = pd.DataFrame(get_text_link_from_sel(sel))

Let’s look at the contents of the data frame:

df

The content is fine, but the column headers aren’t meaningful, so let’s give them better names:

df.columns = ['text', 'link']

Now look at the contents of the data frame:

df

Ok, now we can export the captured content to Excel.

pandas’ built-in command exports the data frame to a CSV file, which Excel can open.

df.to_csv('output.csv', encoding='gbk', index=False)

Note that the encoding is set to gbk here; with the default utf-8 encoding, the file may appear garbled when viewed in Excel.
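As another aside, pandas can also write a native Excel file directly, which sidesteps the encoding question altogether. This assumes the openpyxl package is installed, which is not part of this tutorial’s dependency files:

df.to_excel('output.xlsx', index=False)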

Let’s take a look at the resulting CSV file.

Very fulfilling, isn’t it?

summary

This article has shown you the basics of automated web scraping with Python. Hopefully, after reading it and practicing, you have grasped the following points:

  • The relationship and difference between web scraping and web crawlers;
  • How to use pipenv to quickly set up a dedicated Python environment and automatically install the dependency packages;
  • How to use Google Chrome’s built-in Inspect feature to quickly locate the selector path of the content you care about;
  • How to parse a web page with the requests_html package and query for the elements you want;
  • How to use pandas to organize the data and export it for Excel.

Perhaps you feel that this article is too simplistic for your needs.

This article only shows how to grab information from one web page, but you have thousands of web pages to deal with.

Don’t worry.

Essentially, crawling a web page is the same process as crawling 10,000 web pages.

Besides, in our example, haven’t you already practiced grabbing links?

With links as a base, you can snowball and let Python crawlers “crawl” onto parsed links for further processing.
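To sketch the idea (only a sketch, not a robust crawler), the loop below reuses the session and the df data frame from this article: it visits each link we stored and prints that page’s title. From there you could apply the same selector-based extraction to each page in turn:

import time

for link in df['link']:
    page = session.get(link)                        # fetch the linked page
    title = page.html.find('title', first=True)     # grab its <title> element, if any
    print(link, '->', title.text if title else '(no title)')
    time.sleep(1)                                   # be polite: pause between requests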

In the future, you may have to deal with some tricky practical scenarios:

  • How do you extend the scraping to all the pages within a given scope?
  • How do you scrape pages that are dynamically rendered with JavaScript?
  • What do you do when the site you’re scraping limits how often each IP address can visit?

I hope to share with you the solutions to these problems in future tutorials.

It should be noted that although web crawlers are powerful at capturing data, there is a real learning curve to mastering them.

When you face a data-collection task, run through this checklist first:

  • Is there a data set that someone else has already compiled that you can download directly?
  • Does the site provide API access to the data you need?
  • Has anyone compiled a custom crawler for your needs that you can call directly?

Only if the answer to all of these is no should you write your own script and send a crawler out to grab the data.

To consolidate what you’ve learned, pick a different web page, use our code as a starting point, modify it, and scrape the content you are interested in.

Even better, write up your scraping process and share the link in the comments section.

Because deliberate practice is the best way to master practical skills, and teaching is the best way to learn.

Good luck!

thinking

That wraps up the main content of this article.

Here’s a question for you to ponder:

The links we parse and store are actually duplicated:

Our code isn’t wrong; it’s that the article “How to Get Started with Data Science with Yushu Zhilan?” references some articles more than once, so the duplicate links get scraped too.

But when you store them, you probably don’t want to keep the duplicates.

In this case, how can you change the code so that the links you grab and save are not duplicated?

discuss

Are you interested in Python crawlers? What data-collection tasks have you used them for? Do you know more efficient ways to collect data? Feel free to leave a comment and share your experience and thoughts, so we can discuss them together.

If you liked this article, please give it a thumbs-up. You can also follow and pin my WeChat official account “Nkwangshuyi”.

If you’re interested in data science, check out my tutorial index post, “How to Get Started in Data Science Effectively”, which collects more interesting problems and solutions.