Web Scraping and Data Analysis with Python: Illustrated with the CIA World Factbook

In this article, I’ll show you how to use Python and HTML parsing to extract valuable information from web pages, and then answer some important data analysis questions.

Data collection and cleansing are almost always the most time-consuming and cumbersome steps in a data science project. Everyone loves to show off by building a cool deep neural network (or XGBoost) model or two, complete with 3D interactive plots. But these models need raw data to start with, and that data is not easy to collect and clean.

After all, life is not Kaggle, where a zip file is waiting for you to unzip and model 🙂

But why do we collect data or build models in the first place? The fundamental motivation is to answer a business, scientific, or social question. Is there a trend? Is this thing related to that? Can a measurement of this entity predict the outcome of that phenomenon? Answering such a question validates a hypothesis you hold as a scientist or practitioner in the field. You are just using data (rather than a test tube like a chemist or a magnet like a physicist) to test your hypothesis and prove or disprove it scientifically. That is the “science” part of data science, literally.

Trust me, it’s not hard to come up with a good question that requires some application of data science techniques. Each such question becomes a small project that you can open source and share with your friends on a platform like GitHub. Even if you’re not a professional data scientist, no one can stop you from writing cool code to answer a good data question. It also shows that you are sensitive to data and can tell a story with it.

Today let’s tackle the question…

Does a country’s GDP (at purchasing power parity) have anything to do with its percentage of Internet users? Is the trend similar for low/middle/high income countries?

There are many possible sources of raw data for answering this question. I found that a CIA (yes, the ‘Agency’) website, which holds basic factual information on every country in the world, is a good place to collect the data.

So we’ll use the following Python modules to build our database and visualizations:

  • pandas, NumPy, matplotlib/seaborn
  • Python urllib (for sending HTTP requests)
  • BeautifulSoup (for HTML parsing)
  • Regular expression module, re (for finding the exact matching text to search for)

Let’s discuss the program structure for solving this data science problem. The entire project code is available in my GitHub repository. Fork it or give it a star if you like it.

Read the HTML home page and pass it to BeautifulSoup

Here is the CIA World Factbook home page

Photo: CIA World Factbook home page

We retrieve this page using a simple urllib request with an SSL context that ignores certificate errors, and then pass it on to the magical BeautifulSoup, which parses the HTML for us and produces a pretty text dump. For those unfamiliar with the BeautifulSoup library, watch the video below or read this informative article on Medium.

Here is the code snippet that reads the HTML of the home page:

import ssl
import urllib.request

# Ignore SSL certificate errors (this disables certificate verification)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Read the HTML from the URL and pass it to BeautifulSoup
url = 'https://www.cia.gov/library/publications/the-world-factbook/'
print("Opening the file connection...")
uh = urllib.request.urlopen(url, context=ctx)
print("HTTP status", uh.getcode())
html = uh.read().decode()
print(f"Reading done. Total {len(html)} characters read.")

Here’s how we pass it to BeautifulSoup and use the find_all method to find all the country names and codes embedded in the HTML. Basically, the idea is to find the HTML tags named ‘option’. The text in each tag is the country name, and the characters at index 5 and 6 of the tag’s value attribute are the two-character country code.

Now you might ask, how do you know that you only need to extract the characters at index 5 and 6? The simple answer is that you must examine the soup text (the parsed HTML text) yourself and determine these indexes. There is no general way to work this out, because each HTML page and its underlying structure is unique.
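
To make the index arithmetic concrete, here is a minimal sketch on a made-up HTML fragment. The sample markup and the `geos/xx.html` value format are assumptions for illustration, not a copy of the real page, and `extract_countries` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics a country drop-down; the real page may differ.
sample_html = """
<select>
  <option value="">World</option>
  <option value="geos/as.html">Australia</option>
  <option value="geos/in.html">India</option>
</select>
"""

def extract_countries(html):
    """Return (codes, names) from <option> tags with values like 'geos/xx.html'."""
    soup = BeautifulSoup(html, "html.parser")
    codes, names = [], []
    for tag in soup.find_all("option"):
        value = tag.get("value", "")
        if len(value) < 7:          # skip entries without a usable value
            continue
        codes.append(value[5:7])    # characters at index 5 and 6
        names.append(tag.text.strip())
    return codes, names

codes, names = extract_countries(sample_html)
print(codes, names)
```

Skipping short values plays the same role as popping the first ‘World’ entry: it drops options that do not point to a country page.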

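from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
country_codes = []
country_names = []

for tag in soup.find_all('option'):
    country_codes.append(tag.get('value')[5:7])  # characters at index 5 and 6
    country_names.append(tag.text)

temp = country_codes.pop(0)  # To remove the first entry 'World'
temp = country_names.pop(0)  # To remove the first entry 'World'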

Crawling: download the text data of every country into a dictionary, one by one

This step is the actual crawling or scraping, as they call it. The key is to determine how the URL of each country’s information page is constructed. In the general case, this may be hard to figure out. In this particular case, a quick inspection reveals a very simple and regular structure, as the screenshot of the Australia page shows.

This means that there is a fixed base URL, and you append the two-character country code to it to get the URL of the page for that country. So we simply iterate over the list of country codes, use BeautifulSoup to extract all the text, and store it in a local dictionary. Here is the code snippet:

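# Base URL
urlbase = 'https://www.cia.gov/library/publications/the-world-factbook/geos/'
# Empty data dictionary
text_data = dict()

# Traverse every country
for i in range(1, len(country_names)-1):
    # Build the country page URL from the two-character code
    country_html = country_codes[i] + '.html'
    url_to_get = urlbase + country_html
    # Read the HTML from the URL and pass it to BeautifulSoup
    html = urllib.request.urlopen(url_to_get, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Keep only the text and store it under the country name
    txt = soup.get_text()
    text_data[country_names[i]] = txt
    print(f"Finished loading data for {country_names[i]}")
print("\n**Finished downloading all text data!**")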
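
A single failed request would stop the loop above. The sketch below shows one way to skip a failing page and carry on. The names `country_url`, `page_text`, `crawl`, and `fake_fetch` are hypothetical, and the `.html` suffix is an assumption about the page naming. The download function is passed in so the example runs without network access.

```python
from bs4 import BeautifulSoup

URL_BASE = "https://www.cia.gov/library/publications/the-world-factbook/geos/"

def country_url(code, base=URL_BASE):
    """Build a country page URL, assuming pages are named '<code>.html'."""
    return f"{base}{code}.html"

def page_text(html):
    """Strip the markup and return only the visible text of a page."""
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

def crawl(codes, names, fetch):
    """Fetch every country page with `fetch(url)`, skipping pages that fail."""
    text_data = {}
    for code, name in zip(codes, names):
        try:
            html = fetch(country_url(code))
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            print(f"Skipping {name}: {err}")
            continue
        text_data[name] = page_text(html)
    return text_data

def fake_fetch(url):
    """Stand-in for a real download, used here only for illustration."""
    if url.endswith("xx.html"):
        raise OSError("page not found")
    return "<html><body><h1>Australia</h1><p>GDP data</p></body></html>"

text_data = crawl(["as", "xx"], ["Australia", "Nowhere"], fake_fetch)
print(text_data)
```

In a real run, `fetch` could be something like `lambda u: urllib.request.urlopen(u, context=ctx).read()`.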

Store the data in a pickle dump if you like

In addition, I prefer to serialize the data and store it as a Python pickle object. That way, the next time I open the Jupyter notebook, I can read the data directly without repeating the crawling step.

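import pickle

# Serialize the crawled text so the crawl need not be repeated.
with open("text_data_CIA_Factobook.p", "wb") as f:
    pickle.dump(text_data, f)
# Unpickle to read the data from local storage next time.
with open("text_data_CIA_Factobook.p", "rb") as f:
    text_data = pickle.load(f)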
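
The dump-then-load pattern can be wrapped in a small caching helper. This is only a sketch: `load_or_build` and `fake_crawl` are hypothetical names, and a temporary directory stands in for the notebook’s working folder.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the pickled object at `path`, or call `build()` and pickle its result."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = build()
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data

calls = []

def fake_crawl():
    """Stand-in for the slow crawling step."""
    calls.append(1)
    return {"Australia": "some text"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text_data.p")
    first = load_or_build(path, fake_crawl)   # runs fake_crawl and writes the pickle
    second = load_or_build(path, fake_crawl)  # reads the pickle, no second crawl

print(len(calls), first == second)
```

Only unpickle files that you created yourself, because loading a pickle can execute arbitrary code.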

Use regular expressions to extract the GDP/capita data from the text dump

This is the core text analysis part of the program, where we use the regular expression module to find what we are looking for in a large text string and extract the relevant numeric data. Regular expressions are a rich resource in Python (and in almost any high-level programming language). They let you search for and match strings that follow a specific pattern within a large body of text. Here we use very simple regular expression methods: we match an exact phrase, such as “GDP - per capita (PPP):”, read a few characters after it, find the positions of specific symbols such as $ and (, and finally extract the GDP/capita value. The idea is illustrated in the figure.

Figure: Text analysis diagram.

Other regular expression tricks are used in the notebook, for example to extract the total GDP correctly whether the figure is given in billions or trillions of dollars.

import re

# 'b' to catch 'billions', 't' to catch 'trillions'
start = re.search(r'\$', string)
end = re.search('[bt]', string)
if start is not None and end is not None:
    start = start.start()
    end = end.start()
    a = string[start+1:end-1]
    a = convert_float(a)
    if string[end] == 't':  # If GDP is in trillions, multiply by 1000
        a = 1000 * a
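
An alternative to locating the symbol positions by hand is to let one pattern capture both the number and the unit. The sketch below is not the notebook’s method. The pattern, the function name `total_gdp_in_billions`, and the sample strings are assumptions for illustration.

```python
import re

# Matches amounts such as "$1.5 trillion" or "$926.1 billion"
AMOUNT = re.compile(r"\$\s*([\d.,]+)\s*(billion|trillion)")

def total_gdp_in_billions(text):
    """Return the first dollar amount in `text` in billions, or None if absent."""
    match = AMOUNT.search(text)
    if match is None:
        return None
    try:
        value = float(match.group(1).replace(",", ""))
    except ValueError:  # e.g. the captured group was only punctuation
        return None
    # Express trillions in billions so all values share one unit
    return value * 1000 if match.group(2) == "trillion" else value

print(total_gdp_in_billions("GDP (purchasing power parity): $1.5 trillion (2017 est.)"))
print(total_gdp_in_billions("GDP (purchasing power parity): $926.1 billion (2017 est.)"))
print(total_gdp_in_billions("GDP (purchasing power parity): NA"))
```

Returning None for a missing amount lets the caller decide how to report the gap.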

Here is an example code snippet. Note the multiple error-handling checks placed in the code. They are necessary because HTML pages are extremely unpredictable. Not all countries have GDP data, not all pages are worded exactly the same, not all numbers look the same, and not all strings have $ and ( in the same places. Anything can go wrong.

It’s almost impossible to plan and code for every scenario, but you should at least have code that handles the likely exceptions, so that your program doesn’t stop and can continue gracefully to the next page.
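
The snippets in this article call a helper named convert_float that is defined in the notebook but not shown here. A minimal sketch of such a helper could look like the following, assuming it strips thousands separators and signals failure with -1.0. The real implementation may differ.

```python
def convert_float(text):
    """Convert a string such as ' 49,900 ' to a float; return -1.0 on failure.

    Assumption: -1.0 is the failure sentinel expected by the calling code.
    """
    try:
        return float(text.replace(",", "").strip())
    except (ValueError, AttributeError):  # not a number, or not a string at all
        return -1.0

print(convert_float("49,900"))
print(convert_float("n/a"))
```

A sentinel value keeps the calling loop free of try/except blocks, at the cost of ruling out -1.0 as a legitimate result.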

# Initialize the dictionary that holds the data
GDP_PPP = {}
# Traverse every country
for i in range(1, len(country_names)-1):
    country = country_names[i]
    txt = text_data[country]
    pos = txt.find('GDP - per capita (PPP):')
    if pos != -1:  # -1 means the wording/phrase is not present
        pos = pos + len('GDP - per capita (PPP):')
        string = txt[pos+1:pos+11]
        start = re.search(r'\$', string)
        end = re.search(r'\(', string)
        if start is not None and end is not None:  # Guard against a failed search
            start = start.start()
            end = end.start()
            a = string[start+1:end-1]
            a = convert_float(a)
            if a != -1.0:  # convert_float is assumed to return -1.0 on failure
                print(f"GDP/capita (PPP) of {country}: {a} dollars")
                # Insert data into the dictionary
                GDP_PPP[country] = a
            else:
                print("**Could not find GDP/capita data!**")
        else:
            print("**Could not find GDP/capita data!**")
    else:
        print("**Could not find GDP/capita data!**")
print("\nFinished finding all GDP/capita data")

Don’t forget to use the pandas inner/left join method

One thing to keep in mind is that these separate analyses produce data frames with slightly different sets of countries, because not every type of data is available for every country. A pandas left join keeps every country of the left data frame and fills the gaps with NaN, while an inner join keeps only the countries common to all the available/extractable data sets.

df_combined = df_demo.join(df_GDP, how='left')
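
The difference between the two join types is easy to see on a toy example. The country rows, values, and the column name ‘Internet users (%)’ below are made up for illustration.

```python
import pandas as pd

# Toy data: Tuvalu has demographic data but no GDP entry
df_demo = pd.DataFrame(
    {"Internet users (%)": [88.2, 29.5, 46.0]},
    index=["Australia", "India", "Tuvalu"],
)
df_GDP = pd.DataFrame(
    {"GDP (PPP)": [49900.0, 7200.0]},
    index=["Australia", "India"],
)

left = df_demo.join(df_GDP, how="left")    # keeps Tuvalu, GDP becomes NaN
inner = df_demo.join(df_GDP, how="inner")  # keeps only countries present in both

print(left)
print(inner)
```

After a left join, calling dropna() on the result removes the incomplete rows and leaves the same countries that the inner join returns.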

Ah, now the cool stuff, modeling… But wait! Filter it first!

Now that you’ve done all the hard work of HTML parsing, page crawling, and text mining, you’re ready to reap the benefits, eager to run regression algorithms and cool visualization scripts! But wait: before generating those, you often need to clean your data (especially for this kind of socioeconomic problem). Basically, you want to filter out outliers, for example very small countries (such as island nations) whose values for the parameters you want to plot may be extremely skewed and which do not follow the main underlying dynamics you want to investigate. A few lines of code are enough for these filters. There may be more Pythonic ways to implement them, but I tried to keep it extremely simple and easy to follow. For example, the code below creates filters to exclude small countries with a total GDP of less than $50 billion, and sets the low and high income boundaries at $5,000 and $25,000 of GDP per capita, respectively.

# Create the filtered data frame
filter_gdp = df_combined['Total GDP (PPP)'] > 50  # total GDP in billions of dollars
filter_low_income = df_combined['GDP (PPP)'] > 5000
filter_high_income = df_combined['GDP (PPP)'] < 25000

# Combine the masks with & instead of chaining [...][...], which triggers a reindexing warning
df_filtered = df_combined[filter_gdp & filter_low_income & filter_high_income]
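
The same thresholds can also be used to label every country as low, middle, or high income, which is handy for comparing the groups later. The sketch below uses made-up rows. The column names follow the snippet above, and the ‘income group’ column is a hypothetical addition.

```python
import pandas as pd

# Toy data: total GDP in billions of dollars, GDP per capita in dollars
df_combined = pd.DataFrame(
    {
        "Total GDP (PPP)": [1248.0, 0.04, 9474.0, 500.0],
        "GDP (PPP)": [49900.0, 3800.0, 7200.0, 1500.0],
    },
    index=["A", "B", "C", "D"],
)

# Label each country by income band: up to 5,000, up to 25,000, above 25,000
df_combined["income group"] = pd.cut(
    df_combined["GDP (PPP)"],
    bins=[0, 5000, 25000, float("inf")],
    labels=["low", "middle", "high"],
)

# Drop the very small economies, then pick one band
large = df_combined[df_combined["Total GDP (PPP)"] > 50]
middle_income = large[large["income group"] == "middle"]

print(df_combined)
print(middle_income)
```

With labels in place, each income band can be selected or grouped without writing a new pair of threshold filters.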

And finally visualization

We use the seaborn regplot function to create scatter plots (percentage of Internet users versus GDP per capita) with a linear regression fit and the 95% confidence interval band. They look something like this. The result can be interpreted as follows:

There is a strong positive correlation between the percentage of Internet users in a country and its GDP per capita. Moreover, the correlation is significantly stronger for low-income/low-GDP countries than for high-GDP, developed countries. This could mean that Internet access helps low-income countries grow faster and improve the lives of their average citizens more than it does in developed countries.
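
One way to put a number on the “correlation strength” of each group is the Pearson coefficient. The sketch below uses invented values purely to show the mechanics. It does not reproduce the article’s data or results, and the column names are assumptions.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "GDP (PPP)": [1000, 2000, 3000, 4000, 30000, 40000, 50000, 60000],
    "Internet users (%)": [5, 12, 18, 27, 80, 85, 78, 90],
    "group": ["low"] * 4 + ["high"] * 4,
})

# Pearson correlation between GDP per capita and Internet use, per income group
corr_by_group = {
    name: grp["GDP (PPP)"].corr(grp["Internet users (%)"])
    for name, grp in df.groupby("group")
}

for name, r in corr_by_group.items():
    print(f"{name}: Pearson r = {r:.2f}")
```

Passing each group’s data frame to seaborn’s regplot would draw the matching scatter plot and regression line.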


This article walks through a demo Python notebook to show how to crawl web pages and download raw information by using BeautifulSoup for HTML parsing. On top of that, it shows how to use the regular expression module to search for and extract the important pieces of information a user needs.

Most importantly, it demonstrates why there can be no simple, universal rule or program structure for mining messy parsed HTML text. You must examine the text structure and put appropriate error-handling checks in place so that the program handles every situation gracefully and keeps running (rather than crashing), even if it cannot extract the data in all of those scenarios.

I hope readers will benefit from the notebook files provided and build upon them according to their own needs and imagination. See my repository for more notebooks on web data analysis.

If you have any questions or ideas to share, please contact me at tirthajyoti@gmail.com. You can also check out my GitHub repositories for Python, R, or MATLAB code and machine learning resources. If you’re as passionate about machine learning and data science as I am, feel free to add me on LinkedIn or follow me on Twitter.

