Global population data for Python data analysis

This article uses pandas to run a simple analysis of world population data. I collected population data for countries around the world from 1960 to 2019, broken down by sex and by age group, in six files:

pop_total.csv: total population of each country per year
pop_female.csv: female population of each country per year
pop_male.csv: male population of each country per year
pop_0_14.csv: population aged 0-14 in each country per year
pop_15_64.csv: population aged 15-64 in each country per year
pop_65up.csv: population aged 65 and over in each country per year

First, read the file data with pandas:

import pandas as pd

pop_total = pd.read_csv('./data/pop_total.csv', skiprows=4)
pop_total.info()

The pop_total.csv file stores the annual population of each country in the following format:

pop_total.head(2)

In the same way, we read the remaining five files into the DataFrames pop_female, pop_male, pop_0_14, pop_15_64, and pop_65up.
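As a sketch of that step, the five remaining reads can be folded into one loop. The helper name `load_population_tables` and the dictionary-of-DataFrames layout are my own assumptions, not part of the original source package; only the file names, the `./data` directory and `skiprows=4` come from the code above.

```python
from pathlib import Path

import pandas as pd

# File stems taken from the article's file list.
FILE_STEMS = ["pop_total", "pop_female", "pop_male",
              "pop_0_14", "pop_15_64", "pop_65up"]


def load_population_tables(data_dir="./data", stems=FILE_STEMS, skiprows=4):
    """Hypothetical helper: read each <stem>.csv into a DataFrame, keyed by stem.

    skiprows=4 mirrors the read_csv call used for pop_total.csv.
    """
    data_dir = Path(data_dir)
    return {
        stem: pd.read_csv(data_dir / f"{stem}.csv", skiprows=skiprows)
        for stem in stems
    }
```

With the files in place, `tables = load_population_tables()` followed by `pop_female = tables["pop_female"]` should give the same DataFrames as reading each file by hand.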

To visualize the global population distribution, we use pyecharts to map each country's population in 2019:

from pyecharts import options as opts
from pyecharts.charts import Timeline, Map

# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
pop_total_2019 = pop_total[['Country Name', '2019']].copy()
# Change Russia's English name so pyecharts can recognize it
pop_total_2019.loc[200, 'Country Name'] = 'Russia'

pop_world_map = (
    Map()
    .add("2019", pop_total_2019.values, "world", is_map_symbol_show=False)
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Global population"),
        # Countries above 100 million people get the darkest color (red)
        visualmap_opts=opts.VisualMapOpts(max_=100000000),
    )
)

pop_world_map.render_notebook()

Because we have 60 years of data, we could also animate the global population distribution over time as a GIF, similar to the global epidemic trend map from an earlier post. The code is similar to the above, so I won't paste it here; it is included in the source package.

In the graph above, we can only see the population distribution qualitatively. Here, we take a quantitative look at the world’s 10 most populous countries in 2019.

# 10 Most populous countries in 2019
pop_total_2019_ordered = pop_total_2019.sort_values(by="2019", ascending=False)
pop_total_2019_ordered.head()

After sorting, we find that the Country Name column contains not only individual countries but also regional aggregates, which is not what we want. When I made the epidemic map, I built a lookup table of Chinese and English country names, so I can reuse it here to keep only real countries.

from countries_ch_to_en import countries_dict

# Keep only rows whose name is a real country, then take the top 10
pop_top10 = pop_total_2019_ordered[
    pop_total_2019_ordered['Country Name'].isin(countries_dict.keys())
][:10]
pop_top10
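The filter above depends on the author's `countries_ch_to_en` module, which is not shown. As a self-contained illustration of the same idea, here is a minimal sketch on made-up rows; the country names, the numbers and the contents of `countries_dict` are all placeholders I introduced, and only the dictionary's keys matter.

```python
import pandas as pd

# Made-up rows in the same shape as pop_total_2019: real countries mixed
# with aggregates such as "World". The numbers are placeholders, not data.
pop = pd.DataFrame({
    "Country Name": ["World", "Country A", "Some Region", "Country B", "Country C"],
    "2019": [1000, 400, 700, 300, 50],
})

# Hypothetical stand-in for countries_dict: keys are English country names.
countries_dict = {
    "Country A": "name A",
    "Country B": "name B",
    "Country C": "name C",
}

# Sort by population, then drop every row that is not a known country.
ordered = pop.sort_values(by="2019", ascending=False)
countries_only = ordered[ordered["Country Name"].isin(list(countries_dict))]
top2 = countries_only[:2]
print(top2)
```

Filtering after sorting keeps the ranking intact, so slicing the first rows still yields the largest countries.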

With the regions filtered out, the top 10 looks right. Let's plot it with seaborn:

import seaborn as sns
sns.barplot(y=pop_top10['Country Name'], x=pop_top10['2019'])

As you can see, China has the largest population in the world, followed by India. I was surprised that Pakistan, a country of relatively small area, has over 200 million people and ranks fifth in the world.

After the absolute population rankings, let's look at how much the population of these top 10 countries grew over the last 20 years, from 2000 to 2019:

# Restrict to the top 10 countries and keep the two years we compare
pop_tmp = pop_total[
    pop_total['Country Name'].isin(pop_top10['Country Name'])
][['Country Name', '2000', '2019']].copy()
# Cumulative growth from 2000 to 2019, in percent
pop_tmp['growth(%)'] = (pop_tmp['2019'] / pop_tmp['2000'] - 1) * 100
pop_tmp.sort_values(by="growth(%)", ascending=False)
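Note that the growth column above is cumulative over the whole period, not a yearly rate. The sketch below repeats the calculation on made-up numbers and, as an addition of mine that the original analysis does not include, derives the equivalent compound yearly rate. The function names and the data are hypothetical.

```python
import pandas as pd


def cumulative_growth_pct(start, end):
    """Total growth between two points in time, in percent."""
    return (end / start - 1) * 100


def annualized_growth_pct(start, end, years):
    """Constant yearly rate that compounds to the same total growth, in percent."""
    return ((end / start) ** (1 / years) - 1) * 100


# Placeholder populations, not real data.
pop = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "2000": [100.0, 200.0],
    "2019": [150.0, 220.0],
})

pop["growth(%)"] = cumulative_growth_pct(pop["2000"], pop["2019"])
pop["growth per year(%)"] = annualized_growth_pct(
    pop["2000"], pop["2019"], years=2019 - 2000
)
print(pop.sort_values(by="growth(%)", ascending=False))
```

The yearly figure makes countries comparable across periods of different length, while the cumulative figure is what the ranking above uses.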

Although China has a large population base, its population growth over the past 20 years is relatively low. The three fastest-growing countries are Nigeria, Pakistan and India.

After looking at the total population, let’s look at the gender distribution, again in 2019

columns = ['Country Name', '2019']
# Extract the data and join total and male population by country
pop_sex_2019 = pop_total[columns].merge(pop_male[columns], on='Country Name')

# Rename the columns
pop_sex_2019.rename(columns={'2019_x': 'total', '2019_y': 'male'}, inplace=True)
# Keep only real countries
pop_sex_2019 = pop_sex_2019[pop_sex_2019['Country Name'].isin(countries_dict.keys())].copy()
# Compute the female population
pop_sex_2019['female'] = pop_sex_2019['total'] - pop_sex_2019['male']
# Difference between the female share and the male share, in percentage points
pop_sex_2019['diff'] = (pop_sex_2019['female'] - pop_sex_2019['male']) / pop_sex_2019['total'] * 100

# Top 15 countries with more men than women
sex_diff_top15 = pop_sex_2019.sort_values(by='diff')[0:15]

sns.barplot(y=sex_diff_top15['Country Name'], x=sex_diff_top15['diff'])
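The `2019_x` and `2019_y` names in the rename step come from pandas itself: when both sides of a merge share a non-key column name, `merge` appends the default suffixes `_x` and `_y`. A minimal sketch with placeholder numbers (the countries and values are made up) shows the suffixes and how to read the resulting `diff`:

```python
import pandas as pd

# Placeholder totals and male counts, not real data.
total = pd.DataFrame({"Country Name": ["Country A", "Country B"],
                      "2019": [100.0, 200.0]})
male = pd.DataFrame({"Country Name": ["Country A", "Country B"],
                     "2019": [75.0, 98.0]})

# Both frames have a "2019" column, so merge disambiguates them with suffixes.
merged = total.merge(male, on="Country Name")
print(list(merged.columns))  # ['Country Name', '2019_x', '2019_y']

merged = merged.rename(columns={"2019_x": "total", "2019_y": "male"})
merged["female"] = merged["total"] - merged["male"]
# Female share minus male share, in percentage points of the total population.
merged["diff"] = (merged["female"] - merged["male"]) / merged["total"] * 100
print(merged)
```

For Country A the result is a `diff` of -50: men are 75% of the population and women 25%, so a 50-point gap in shares means three men for every woman rather than "50% more men".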

Qatar tops the list: the male share of its population is about 50 percentage points higher than the female share. China's sex ratio is also imbalanced, with the male share about 2 percentage points higher.

Next, look at the countries where women make up a larger share than men:

sex_diff_top15 = pop_sex_2019.sort_values(by='diff', ascending=False)[0:15]
sns.barplot(y=sex_diff_top15['Country Name'], x=sex_diff_top15['diff'])

Here the differences are much smaller: the gap in the top country is only about 8 percentage points, and none of these countries has a large population. Let's look at countries with more than 100 million people where women outnumber men:

pop_sex_2019[pop_sex_2019['total'] > 100000000].sort_values(by='diff', ascending=False)[0:5]

You can see that in Japan, Mexico, Brazil and the United States, four of the most populous countries, there are more women than men.

Now that we have covered gender, let's look at the age distribution. Since I am most interested in the proportion of young people, let's first rank countries by their share of 0-14 year olds.

pop_0_14_2019 = pop_total[columns].merge(pop_0_14[columns], on='Country Name')
pop_0_14_2019.rename(columns={'2019_x': 'total', '2019_y': '0_14'}, inplace=True)
pop_0_14_2019['0_14_r(%)'] = pop_0_14_2019['0_14'] / pop_0_14_2019['total'] * 100

# Again, only look at real countries with more than 100 million people
is_country = pop_0_14_2019['Country Name'].isin(countries_dict.keys())
is_large = pop_0_14_2019['total'] > 100000000
pop_0_14_top = pop_0_14_2019[is_country & is_large] \
    .sort_values(by='0_14_r(%)', ascending=False)[:15]

sns.barplot(y=pop_0_14_top['Country Name'], x=pop_0_14_top['0_14_r(%)'])

You can see that South and Southeast Asian countries such as the Philippines, Bangladesh, Indonesia and India have a much higher share of 0-14 year olds than China, and even the United States has a higher share than China. China's share is only about 17%, which may be one reason why the world's factories have been moving to Southeast Asia in recent years.

Finally, we take a look at the trend of China’s population share in different age groups from 1960 to 2019

# Filter the columns we need
pop_0_14_ch = pop_0_14[pop_0_14['Country Name'] == 'China'].drop(
    ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 64'], axis=1)

# Turn the year columns into rows
pop_0_14_ch_unstack = pop_0_14_ch.unstack()
# Rebuild the DataFrame, indexed by year
pop_0_14_ch = pd.DataFrame(pop_0_14_ch_unstack.values,
                           index=[x[0] for x in pop_0_14_ch_unstack.index.values], columns=['0_14'])
pop_0_14_ch.head()

Do the same thing for the other two age groups:

# 15-64 years old
pop_15_64_ch = pop_15_64[pop_15_64['Country Name'] == 'China'].drop(
    ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 64'], axis=1)
pop_15_64_ch_unstack = pop_15_64_ch.unstack()

pop_15_64_ch = pd.DataFrame(pop_15_64_ch_unstack.values,
                            index=[x[0] for x in pop_15_64_ch_unstack.index.values], columns=['15_64'])

# 65 and over
pop_65up_ch = pop_65up[pop_65up['Country Name'] == 'China'].drop(
    ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 64'], axis=1)
pop_65up_ch_unstack = pop_65up_ch.unstack()

pop_65up_ch = pd.DataFrame(pop_65up_ch_unstack.values,
                           index=[x[0] for x in pop_65up_ch_unstack.index.values], columns=['65up'])

The age groups are joined by year, and then the total population and each age group's share of it are calculated:

pop_age_level = pop_0_14_ch.merge(pop_15_64_ch.merge(pop_65up_ch, left_index=True, right_index=True), left_index=True, right_index=True)
pop_age_level['total'] = pop_age_level['0_14'] + pop_age_level['15_64'] + pop_age_level['65up']
pop_age_level['0_14(%)'] = pop_age_level['0_14'] / pop_age_level['total'] * 100
pop_age_level['15_64(%)'] = pop_age_level['15_64'] / pop_age_level['total'] * 100
pop_age_level['65up(%)'] = pop_age_level['65up'] / pop_age_level['total'] * 100
pop_age_level.head()

Finally, let's draw a stacked bar chart to show it:

pop_age_level['year'] = pop_age_level.index

pop_age_level.plot.bar(x='year', y=['0_14(%)', '15_64(%)', '65up(%)'], stacked=True, figsize=(15, 8), fontsize=10, rot=60)
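The reshape-and-share steps above repeat the same code three times. As an alternative sketch, not the author's method, the year columns can be turned into a year index with a transpose, and the three age groups joined with `pd.concat`. The helper `yearly_column`, the `toy_table` builder and all numbers below are hypothetical placeholders.

```python
import pandas as pd


def yearly_column(wide, country, value_name):
    """Return one country's year columns as a year-indexed, one-column DataFrame."""
    row = wide[wide["Country Name"] == country]
    # Year columns are the ones whose name is all digits, e.g. "1960".
    year_cols = [c for c in wide.columns if c.isdigit()]
    out = row[year_cols].T       # years become the index
    out.columns = [value_name]   # the single country row becomes one column
    out.index.name = "year"
    return out


def toy_table(v1960, v2019):
    """Build a tiny wide table in the same layout as the CSV files (placeholder data)."""
    return pd.DataFrame({
        "Country Name": ["China", "Other"],
        "Country Code": ["CHN", "OTH"],
        "1960": [v1960, 1.0],
        "2019": [v2019, 1.0],
    })


pop_0_14 = toy_table(40.0, 17.0)
pop_15_64 = toy_table(56.0, 71.0)
pop_65up = toy_table(4.0, 12.0)

# Align the three age groups on the shared year index.
ages = pd.concat(
    [
        yearly_column(pop_0_14, "China", "0_14"),
        yearly_column(pop_15_64, "China", "15_64"),
        yearly_column(pop_65up, "China", "65up"),
    ],
    axis=1,
)

# Divide each row by its own total to get percentage shares per year.
shares = ages.div(ages.sum(axis=1), axis=0) * 100
print(shares)
```

Selecting year columns by name also avoids hard-coding the metadata columns to drop, at the cost of assuming every year column is named with digits only.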

In the 1960s and 1970s, people aged 0-14 made up more than 40% of China's population, which is quite high. With the implementation of family planning in the 1980s, that share began to decline and has now fallen to about 17%, which is pitifully low. The country now allows a second child, and I hope we will have more young people in the future to strengthen our international competitiveness.

That concludes my analysis; interested readers can explore further. The data and source code have been packaged: reply with the keyword "population" to the official account to get them.

Follow the official account "du Code" for practical content you won't find elsewhere.
