• How to scrape websites with Python and BeautifulSoup
  • By Justin Yek
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: geniusq1981
  • Proofread by: Park-Ma, CoolSeaman

There is more information on the Internet than any one person can absorb in a lifetime. So instead of sifting through it page by page, we need a flexible way to collect, organize, and analyze that information.

We need to crawl web data.

Web crawlers can automatically extract data and present it in a form that you can easily understand. In this tutorial, we will focus on the use of crawlers in financial markets, but in reality web content crawlers can be used in many areas.

If you’re an avid investor, getting a daily closing price can be a pain, especially if the information you need is spread across multiple web pages. We will build a web crawler to automatically retrieve stock indexes from the web to simplify data crawling.

Getting started

We will use Python as our crawler language and a simple but powerful library, BeautifulSoup.

  • For Mac users, OS X comes with Python preinstalled. Open Terminal and enter python --version. Your Python version should be 2.7.x.
  • For Windows users, install Python from the official website.

Next, we need to install the BeautifulSoup library with pip, Python’s package management tool.

Enter:

easy_install pip  
pip install BeautifulSoup4

Note: If you have an error executing the command above, try adding sudo before each command.

Basic knowledge

Before we really start coding, let’s take a look at the basics of HTML and some rules for web crawlers.

HTML tags

If you already understand HTML tags, you can skip this section.


        
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>

Here is the basic syntax of an HTML web page. Each tag defines a block of content on the page:

  1. <!DOCTYPE html>: a mandatory type declaration at the beginning of an HTML document.
  2. The HTML document is contained inside the <html> tag.
  3. The <head> tag contains metadata and script declarations for the HTML document.
  4. The <body> tag contains the visible portion of the HTML document.
  5. Titles are defined by the <h1> through <h6> tags.
  6. Paragraph content is defined in the <p> tag.

Other commonly used tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table columns.

In addition, HTML tags often have id or class attributes. The id attribute gives a tag a unique identity, and its value must be unique within the document. The class attribute is used to apply the same style to all tags that share the same class. We can use these ids and classes to help us locate the data we want to scrape.
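To make this concrete, here is a small, hypothetical snippet (not from the original tutorial) showing how BeautifulSoup, which we installed above, can locate a tag by id and by class; the id and class values here are made up purely for illustration.

from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment for illustration only
html = '<div id="quote"><h1 class="name">S&P 500</h1></div>'
soup = BeautifulSoup(html, 'html.parser')

# Locate a tag by its unique id
quote_div = soup.find(id='quote')
# Locate a tag by its class attribute
name_tag = soup.find('h1', attrs={'class': 'name'})
print name_tag.text  # prints "S&P 500"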

For more information on HTML tags, ids, and classes, see the tutorials at W3Schools.

Crawl rules

  1. You should check a site’s terms of use before you scrape it. Read the statements about the legitimate use of data carefully. In general, the data you scrape should not be used for commercial purposes.
  2. Do not request data from the site too aggressively with your program (this is known as spamming), as that may break the website. Make sure your crawler behaves reasonably (i.e., acts as a person would). Requesting one web page per second is good practice (see the sketch after this list).
  3. The layout of a site can change from time to time, so be sure to revisit it and rewrite your code as needed.
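For example, here is a minimal sketch (not from the original tutorial) of the one-request-per-second guideline, using urllib2, the library this tutorial uses later to request pages; the URL list simply reuses the two Bloomberg pages that appear later in this article.

import time
import urllib2

urls = ['http://www.bloomberg.com/quote/SPX:IND',
        'http://www.bloomberg.com/quote/CCMP:IND']

for url in urls:
    page = urllib2.urlopen(url)  # fetch one page
    # ... parse and extract data from the page here ...
    time.sleep(1)  # pause for one second so we do not overload the site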

Inspecting the page

Take a page from Bloomberg Quote as an example.

As someone following the stock market, we would like to get the index name (S&P 500) and its price from this page. First, right-click on the page and choose Inspect to examine it.

Try hovering your mouse pointer over the price, and you should see a blue box surrounding it. If you click, the corresponding HTML will be selected in the browser console.

From the result, you can see that the price is wrapped in several layers of HTML tags; the innermost one is <div class="price">.

Similarly, if you hover over and click the name “S&P 500”, you will see that it is wrapped in an <h1 class="name"> tag.

We now know the exact location of the required data with the help of the class tag.

Write the code

Now that we know where our data is, we can write the web crawler. Open your text editor.

First, we need to import all the libraries we need to use.

# import libraries
import urllib2
from bs4 import BeautifulSoup

Next, declare a url link variable.

# specify the url
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'

Python’s urllib2 is then used to request the HTML page to which the declared URL points.

# query the website and return the HTML to the variable 'page'
page = urllib2.urlopen(quote_page)

Finally, parse the page into BeautifulSoup format so that we can work on it with BeautifulSoup.

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')

Now we have a variable, soup, that contains the HTML content of the page. Here we are ready to write the code to crawl the data.

Remember the unique hierarchy of our data? BeautifulSoup’s find() method helps us move through these layers and extract the content. In this case, since the class name "name" is unique on this page, we can simply query <h1 class="name">.

# Take out the <h1> of name and get its value
name_box = soup.find('h1', attrs={'class': 'name'})

We can get the data by reading the tag’s text attribute.

name = name_box.text.strip() # strip() is used to remove starting and trailing whitespace
print name

Similarly, we can get the price.

# get the index price
price_box = soup.find('div', attrs={'class': 'price'})
price = price_box.text
print price

When you run the program, you can see the current price of the S&P 500 printed out.

Export to Excel CSV

Now that we have the data, it is time to save it. The Excel CSV format is a good choice: the file can be opened in Excel, so you can view and process the data easily.

First, however, we need to import Python’s csv module and the datetime module to record the date of each entry. Add the following lines to your import section.

import csv
from datetime import datetime

At the bottom of your code, add code to save data to a CSV file.

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
 writer = csv.writer(csv_file)
 writer.writerow([name, price, datetime.now()])

If you run your program now, you should be able to export an index.csv file, which you can then open in Excel and see a row of data inside.

If you run this program every day, you can easily get the S&P 500 index without having to search the Web repeatedly.

Going further (advanced uses)

Multiple indices

Scraping just one index is not enough for you, is it? We can extract multiple indices at the same time.

First, change the quote_page variable to an array of urls.

quote_page = ['http://www.bloomberg.com/quote/SPX:IND', 'http://www.bloomberg.com/quote/CCMP:IND']

We then turn the data-extraction code into a for loop, which processes the URLs one by one and stores all of the results as tuples in a list named data.

# for loop
data = []
for pg in quote_page:
 # query the website and return the html to the variable 'page'
 page = urllib2.urlopen(pg)

 # parse the html using beautiful soup and store in variable `soup`
 soup = BeautifulSoup(page, 'html.parser')

 # Take out the <h1> of name and get its value
 name_box = soup.find('h1', attrs={'class': 'name'})
 name = name_box.text.strip() # strip() is used to remove starting and trailing whitespace

 # get the index price
 price_box = soup.find('div', attrs={'class': 'price'})
 price = price_box.text

 # save the data in tuple
 data.append((name, price))

Then, modify the saving code so that each row of data is written to the CSV file.

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
 writer = csv.writer(csv_file)
 # the for loop
 for name, price in data:
  writer.writerow([name, price, datetime.now()])

Rerun the program and you should be able to extract both indices at once.

Advanced crawler technology

BeautifulSoup is simple and great for small-scale web scraping. But if you are interested in scraping data at a larger scale, you should consider these alternatives:

  1. Scrapy, a powerful Python scraping framework (a minimal, hypothetical sketch follows this list).
  2. Try to integrate your code with a public API. Data retrieval through an API is much more efficient than scraping web pages. For example, take a look at the Facebook Graph API, which can help you get data that is not shown on Facebook pages.
  3. If your data grows large, consider using a database backend such as MySQL to store it.
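As a rough, hypothetical sketch (not part of the original tutorial), a minimal Scrapy spider for the same kind of quote page might look like the following; the spider name and CSS selectors are assumptions based on the class names used earlier.

import scrapy

class IndexSpider(scrapy.Spider):
    # Hypothetical spider; the name and start_urls are illustrative
    name = 'index_spider'
    start_urls = ['http://www.bloomberg.com/quote/SPX:IND']

    def parse(self, response):
        # CSS selectors assume the same 'name' and 'price' classes used above
        yield {
            'name': response.css('h1.name::text').get(),
            'price': response.css('div.price::text').get(),
        }

You could run a spider like this with scrapy runspider and write the scraped items to a file with the -o option.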

Take the DRY approach

DRY stands for “Don’t Repeat Yourself”: try to automate your everyday tasks, as this person did. Some other fun projects to consider might be tracking your Facebook friends’ active times (with their consent, of course), or grabbing a list of topics from a forum and trying out natural language processing (a hot topic in artificial intelligence right now)!

If you have any questions, feel free to leave a comment below.


This article was originally published on the Altitude Labs blog and was written by our software engineer Leonard Mok. Altitude Labs is a software agency that specializes in custom React mobile application development.

If you find any mistakes in the translation or other areas that need improvement, you are welcome to revise the translation and open a PR in the Nuggets Translation Project, for which you can earn corresponding reward points. The permanent link at the beginning of this article is the MarkDown link to this article on GitHub.


The Nuggets Translation Project is a community that translates high-quality Internet technical articles, sourced from English articles shared on Nuggets. The content covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence, and other fields. If you want to see more high-quality translations, please follow the Nuggets Translation Project, its official Weibo account, and its Zhihu column.