This article uses pandas to run a simple analysis of world population data. I collected population data for countries around the world from 1960 to 2019, broken down by sex and by age group, in six files:
pop_total.csv: total population of each country per year
pop_female.csv: female population of each country per year
pop_male.csv: male population of each country per year
pop_0_14.csv: population aged 0-14 in each country per year
pop_15_64.csv: population aged 15-64 in each country per year
pop_65up.csv: population aged 65 and over in each country per year
First read the file data using pandas
import pandas as pd
pop_total = pd.read_csv('./data/pop_total.csv', skiprows=4)
pop_total.info()
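The `skiprows=4` argument implies that each file starts with four preamble lines before the real header row. A minimal sketch of that behaviour, using a made-up in-memory file (the preamble text, country names and numbers are all illustrative, not taken from the real data):

```python
import io
import pandas as pd

# Hypothetical miniature of the file layout: four preamble lines, then the header.
raw = (
    "preamble line 1\n"
    "preamble line 2\n"
    "preamble line 3\n"
    "preamble line 4\n"
    "Country Name,Country Code,1960,2019\n"
    "Country A,AAA,100,120\n"
    "Country B,BBB,200,260\n"
)

# skiprows=4 discards the first four lines, so the fifth line becomes the header.
sample = pd.read_csv(io.StringIO(raw), skiprows=4)
print(sample.columns.tolist())
print(sample)
```

Without `skiprows`, pandas would try to treat the first preamble line as the header, and the column names would be wrong.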
The pop_total.csv file stores the annual population of each country in the following format:
pop_total.head(2)
In the same way, we read the remaining five files into the DataFrames pop_female, pop_male, pop_0_14, pop_15_64, and pop_65up.
To visualize the global population distribution, we use pyecharts to draw a world map of the 2019 population:
from pyecharts import options as opts
from pyecharts.charts import Timeline, Map
# Take a copy so the assignment below does not write into a slice of pop_total
pop_total_2019 = pop_total[['Country Name', '2019']].copy()
# Change the English name of Russia (row 200 in this file) so pyecharts can recognize it
pop_total_2019.loc[200, 'Country Name'] = 'Russia'
pop_world_map = (
    Map()
    .add("2019", pop_total_2019.values.tolist(), "world", is_map_symbol_show=False)
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Global population"),
        # Darkest color (red) for countries with over 100 million people
        visualmap_opts=opts.VisualMapOpts(max_=100000000),
    )
)
pop_world_map.render_notebook()
Because we have 60 years of data (1960 to 2019), we can also create an animated GIF of the global population distribution over time, similar to the global epidemic trend map from an earlier article. The code is similar to the above, so it is not pasted here; it can be found in the source package.
The map above only shows the population distribution qualitatively. For a quantitative view, let's look at the world's 10 most populous countries in 2019.
# 10 Most populous countries in 2019
pop_total_2019_ordered = pop_total_2019.sort_values(by="2019" , ascending=False)
pop_total_2019_ordered.head()
After sorting, we find that the Country Name column contains not only individual countries but also regional aggregates, which is not what we want. When I made the epidemic map, I built a lookup table of Chinese and English country names, so I can reuse it here to keep only real countries.
from countries_ch_to_en import countries_dict
# Keep only rows whose name is a known country, then take the first 10
pop_top10 = pop_total_2019_ordered[
    pop_total_2019_ordered['Country Name'].isin(countries_dict.keys())
][:10]
pop_top10
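The filter relies on `Series.isin` to keep only rows whose name appears in a known set of countries. A small sketch of the idea on a made-up table, where the `known_countries` set is a hypothetical stand-in for `countries_dict.keys()` and the numbers are illustrative:

```python
import pandas as pd

# Made-up rows: two aggregates ("Region X", "Income group Y") mixed in with countries.
df = pd.DataFrame({
    "Country Name": ["Region X", "Country A", "Income group Y", "Country B", "Country C"],
    "2019": [7000, 1400, 1200, 1300, 200],
})

# Hypothetical stand-in for countries_dict.keys()
known_countries = {"Country A", "Country B", "Country C"}

# Drop the aggregates first, then rank what is left.
countries_only = df[df["Country Name"].isin(known_countries)]
top2 = countries_only.sort_values(by="2019", ascending=False).head(2)
print(top2)
```

Sorting before or after the filter gives the same ranking; filtering first just keeps the intermediate table smaller.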
Now the result looks right. Let's plot it with seaborn:
import seaborn as sns
sns.barplot(y=pop_top10['Country Name'], x=pop_top10['2019'])
As you can see, China still has the largest population in the world, followed by India. To my surprise, Pakistan, which I had thought of as a small country, has a population of over 200 million, the fifth largest in the world.
After looking at the absolute population rankings, let’s take a look at the population growth rates of countries over the last 20 years, from 2000 to 2019
pop_tmp = pop_total[
    pop_total['Country Name'].isin(pop_top10['Country Name'])
][['Country Name', '2000', '2019']].copy()
# Cumulative growth from 2000 to 2019, in percent
pop_tmp['growth(%)'] = (pop_tmp['2019'] / pop_tmp['2000'] - 1) * 100
pop_tmp.sort_values(by='growth(%)', ascending=False)
It can be seen that although China has a large population base, its population growth rate over the past 20 years is relatively low. The three fastest-growing countries are Nigeria, Pakistan and India, in that order.
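The growth column is the cumulative percentage change between the two years, (P_2019 / P_2000 - 1) * 100. As a sketch with made-up numbers, the same ratio can also be converted into an average annual rate over the 19 intervening years. The `annual(%)` column is an addition for illustration and is not part of the original analysis:

```python
import pandas as pd

# Made-up populations for two hypothetical countries
df = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "2000": [100.0, 200.0],
    "2019": [150.0, 210.0],
})

# Cumulative growth over the whole period, in percent
df["growth(%)"] = (df["2019"] / df["2000"] - 1) * 100

# Average compound growth per year, in percent
years = 2019 - 2000
df["annual(%)"] = ((df["2019"] / df["2000"]) ** (1 / years) - 1) * 100

print(df.sort_values(by="growth(%)", ascending=False))
```

The annualized figure makes it easier to compare periods of different lengths, since compounding the annual rate over 19 years reproduces the cumulative growth.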
After looking at the total population, let’s look at the gender distribution, again in 2019
columns = ['Country Name', '2019']
# Extract the 2019 columns and join total and male population by country
pop_sex_2019 = pop_total[columns].merge(pop_male[columns], on='Country Name')
# Rename the columns
pop_sex_2019.rename(columns={'2019_x': 'total', '2019_y': 'male'}, inplace=True)
# Keep only countries (drop regional aggregates)
pop_sex_2019 = pop_sex_2019[pop_sex_2019['Country Name'].isin(countries_dict.keys())].copy()
# Compute the female population
pop_sex_2019['female'] = pop_sex_2019['total'] - pop_sex_2019['male']
# Difference between the female share and the male share, in percentage points
pop_sex_2019['diff'] = (pop_sex_2019['female'] - pop_sex_2019['male']) / pop_sex_2019['total'] * 100
# Top 15 countries where men most outnumber women
sex_diff_top15 = pop_sex_2019.sort_values(by='diff')[0:15]
sns.barplot(y=sex_diff_top15['Country Name'], x=sex_diff_top15['diff'])
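Qatar tops the list: the male share of its population is about 50 percentage points higher than the female share. China's sex ratio is also imbalanced, with the male share about 2 percentage points higher than the female share.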
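The `diff` value is the female share of the total population minus the male share, so it is measured in percentage points of the total rather than as "percent more men". A worked sketch with made-up numbers (the helper name `share_gap` is hypothetical):

```python
def share_gap(total, male):
    """Female share minus male share, in percentage points of the total."""
    female = total - male
    return (female - male) / total * 100


# Made-up example: 750 men and 250 women out of 1000 people
gap = share_gap(1000, 750)
men_per_woman = 750 / (1000 - 750)

print(gap)            # share gap in percentage points
print(men_per_woman)  # ratio of men to women
```

In this example a gap of -50 corresponds to three men for every woman, which is a much larger imbalance than "50% more men" would suggest.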
Then look at countries where women have a higher proportion than men
sex_diff_top15 = pop_sex_2019.sort_values(by='diff', ascending=False)[0:15]
sns.barplot(y=sex_diff_top15['Country Name'], x=sex_diff_top15['diff'])
Here the differences are much smaller: even in the top country, the female share leads by only about 8 percentage points, and none of these countries has a large population. Let's look at countries with populations of more than 100 million where women outnumber men:
pop_sex_2019[pop_sex_2019['total'] > 100000000].sort_values(by='diff', ascending=False)[0:5]
You can see that in Japan, Mexico, Brazil and the United States, four of the most populous countries, there are more women than men.
Now that we know the gender, let’s look at the age distribution. Since I focus on the proportion of young people in each country, let’s first rank the proportion of 0-14 year olds in each country.
pop_0_14_2019 = pop_total[columns].merge(pop_0_14[columns], on='Country Name')
pop_0_14_2019.rename(columns={'2019_x': 'total', '2019_y': '0 _14'}, inplace=True)
# Share of the population aged 0-14, in percent
pop_0_14_2019['0_14_r(%)'] = pop_0_14_2019['0 _14'] / pop_0_14_2019['total'] * 100
# Again, only look at countries with more than 100 million people
is_country = pop_0_14_2019['Country Name'].isin(countries_dict.keys())
is_large = pop_0_14_2019['total'] > 100000000
pop_0_14_top = pop_0_14_2019[is_country & is_large] \
    .sort_values(by='0_14_r(%)', ascending=False)[:15]
sns.barplot(y=pop_0_14_top['Country Name'], x=pop_0_14_top['0_14_r(%)'])
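When two conditions apply to the same DataFrame, combining the boolean masks with `&` in a single selection is generally more robust than chaining `df[a][b]`, because both masks then share the index of the frame being filtered. A sketch on made-up data, where `known_countries` is a hypothetical stand-in for `countries_dict.keys()` and the column names are simplified:

```python
import pandas as pd

# Made-up table, populations in millions
df = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Region X", "Country C"],
    "total": [150, 50, 900, 300],
    "young": [45, 10, 200, 60],
})
df["young(%)"] = df["young"] / df["total"] * 100

known_countries = {"Country A", "Country B", "Country C"}  # hypothetical

# Build each condition once, then combine them element-wise.
is_country = df["Country Name"].isin(known_countries)
is_large = df["total"] > 100

ranked = df[is_country & is_large].sort_values(by="young(%)", ascending=False)
print(ranked)
```

Naming the masks also makes the intent of each condition easier to read than one long chained expression.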
You can see that South and Southeast Asian countries such as the Philippines, Bangladesh, Indonesia and India have a much higher percentage of 0-14 year olds than China, and even the United States has a much higher percentage. China's share is only 17%, which may be one reason why the world's factories have been moving to Southeast Asia in recent years.
Finally, we take a look at the trend of China’s population share in different age groups from 1960 to 2019
# Keep only China's row and drop the non-year columns
pop_0_14_ch = pop_0_14[pop_0_14['Country Name'] == 'China'].drop(
    ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 64'],
    axis=1)
# Turn the year columns into rows
pop_0_14_ch_unstack = pop_0_14_ch.unstack()
# Rebuild the DataFrame, indexed by year
pop_0_14_ch = pd.DataFrame(pop_0_14_ch_unstack.values,
                           index=[x[0] for x in pop_0_14_ch_unstack.index.values],
                           columns=['0 _14'])
pop_0_14_ch.head()
Do the same thing for the other two age groups:
# 15-64 years old
pop_15_64_ch = pop_15_64[pop_15_64['Country Name'] == 'China'].drop(
    ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 64'],
    axis=1)
pop_15_64_ch_unstack = pop_15_64_ch.unstack()
pop_15_64_ch = pd.DataFrame(pop_15_64_ch_unstack.values,
                            index=[x[0] for x in pop_15_64_ch_unstack.index.values],
                            columns=['15 _64'])
# 65 and over
pop_65up_ch = pop_65up[pop_65up['Country Name'] == 'China'].drop(
    ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 64'],
    axis=1)
pop_65up_ch_unstack = pop_65up_ch.unstack()
pop_65up_ch = pd.DataFrame(pop_65up_ch_unstack.values,
                           index=[x[0] for x in pop_65up_ch_unstack.index.values],
                           columns=['65up'])
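Since the same extraction is repeated for three age groups, it could be folded into one helper. The sketch below is one possible way to do that, not the author's code: the function name `country_series` is hypothetical, it uses a transpose instead of `unstack`, and it assumes year columns are labelled with digits only. The toy table and its values are made up.

```python
import pandas as pd


def country_series(df, country, name):
    """Return one country's yearly values as a one-column DataFrame indexed by year."""
    row = df[df["Country Name"] == country]
    # Keep only columns whose labels are all digits (the years); this also
    # leaves out metadata columns and trailing ones such as 'Unnamed: 64'.
    year_cols = [c for c in row.columns if str(c).isdigit()]
    out = row[year_cols].T  # years become the index
    out.columns = [name]    # assumes exactly one matching row
    return out


# Toy table in the same layout as the source files (values are made up)
toy = pd.DataFrame({
    "Country Name": ["China", "Other"],
    "Country Code": ["CHN", "OTH"],
    "Indicator Name": ["x", "x"],
    "Indicator Code": ["X", "X"],
    "1960": [10.0, 1.0],
    "1961": [11.0, 2.0],
    "Unnamed: 64": [float("nan"), float("nan")],
})

result = country_series(toy, "China", "pop")
print(result)
```

With such a helper, each age group would need a single call with the matching DataFrame and column name.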
The age groups are joined by year, and then the total population and each age group's share of it are calculated:
pop_age_level = pop_0_14_ch.merge(pop_15_64_ch.merge(pop_65up_ch, left_index=True, right_index=True), left_index=True, right_index=True)
pop_age_level['total'] = pop_age_level['0 _14'] + pop_age_level['15 _64'] + pop_age_level['65up']
pop_age_level['0 _14 (%)'] = pop_age_level['0 _14'] / pop_age_level['total'] * 100
pop_age_level['15 _64 (%)'] = pop_age_level['15 _64'] / pop_age_level['total'] * 100
pop_age_level['65up(%)'] = pop_age_level['65up'] / pop_age_level['total'] * 100
pop_age_level.head()
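As a cross-check on the share columns, the three percentages in each row should add up to 100. A sketch on made-up data that computes all shares in one step with `DataFrame.div` and then verifies the row sums (the column names here are simplified and the numbers are illustrative):

```python
import pandas as pd

# Made-up counts for two years
ages = pd.DataFrame(
    {"young": [400.0, 200.0], "working": [550.0, 700.0], "old": [50.0, 100.0]},
    index=["1960", "2019"],
)

# Row totals, then divide every column by its row's total
total = ages.sum(axis=1)
shares = ages.div(total, axis=0) * 100
shares.columns = [f"{c}(%)" for c in ages.columns]

print(shares)
print(shares.sum(axis=1))  # each row should be (approximately) 100
```

If a row sum deviated noticeably from 100 in the real data, that would point to a missing value in one of the age-group files.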
Finally, let’s draw a stacked bar chart to show it:
pop_age_level['year'] = pop_age_level.index
pop_age_level.plot.bar(x='year', y=['0 _14 (%)', '15 _64 (%)', '65up(%)'],
                       stacked=True, figsize=(15, 8), fontsize=10, rot=60)
In the 1960s and 1970s, China’s population aged 0-14 accounted for more than 40% of the total, which was quite high. With the implementation of family planning in the 1980s, the share of the population aged 0-14 began to decline, and it has now fallen to 17%, which is pitifully low. The country now allows a second child, and I hope we will have more young people in the future to strengthen our international competitiveness.
That is the end of my analysis. Interested readers can explore further: the data and source code have been packaged and are available by replying with the keyword "population" to the public account “du Code”.