The 2018 World Cup is about to begin. Before the tournament kicks off, let’s use Python to analyze the strength of the teams and make a bold prediction about the favorites.

The source code for this article is available; see the end of the article for details.

Through data analysis, many interesting results can be found, such as:

Which teams are the dark horses playing in their first World Cup

Which of the 32 teams at the 2018 World Cup have qualified for the tournament before but have never won a match there

Of course, our main task here is to use data analysis to predict the 2018 World Cup favorites.

The analysis is based on a Kaggle data set of roughly 40,000 international matches played from 1872 to this year, including World Cup matches, World Cup qualifiers, the Asian Cup, the European Championship and international friendlies.

The environment used for this analysis:

Windows 7

Python 3.6

Jupyter Notebook

Pandas version 0.22.0

Let’s start with the data:

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Load the match results and preview the first rows
df = pd.read_csv('results.csv')
df.head()

The data set contains the following columns:

Date of the match

Home team name

Away team name

Goals scored by the home team (excluding penalty shootouts)

Goals scored by the away team (excluding penalty shootouts)

Tournament (type of competition)

City where the match was played

Country where the match was played

Whether the match was played at a neutral venue

The results are as follows:

1. Get data for all World Cup matches (excluding qualifiers)

Pandas will be used in many different ways in this article. If you are not familiar with pandas, you can review it in my previous article “Using Pandas to Do Data Science Better.”

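# Keep FIFA tournaments first, then only World Cup finals matches (no qualifiers)
df_FIFA_all = df[df['tournament'].str.contains('FIFA', regex=True)]
# .copy() so that columns can be added later without a SettingWithCopyWarning
df_FIFA = df_FIFA_all[df_FIFA_all['tournament'] == 'FIFA World Cup'].copy()
df_FIFA.head()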

The results are as follows:

Next, do some preliminary cleanup of the data:

# Parse the date column and extract the year
df_FIFA['date'] = pd.to_datetime(df_FIFA['date'])
df_FIFA['year'] = df_FIFA['date'].dt.year
# Goal difference from the home team's point of view
df_FIFA['diff_score'] = df_FIFA['home_score'] - df_FIFA['away_score']
df_FIFA['diff_score'] = pd.to_numeric(df_FIFA['diff_score'])
# Placeholder column, filled in below
df_FIFA['win_team'] = ''

Create a new column that records the winning team of each match:

# The first method to get the winners: boolean masks on the goal difference
df_FIFA.loc[df_FIFA['diff_score'] > 0, 'win_team'] = df_FIFA.loc[df_FIFA['diff_score'] > 0, 'home_team']
df_FIFA.loc[df_FIFA['diff_score'] < 0, 'win_team'] = df_FIFA.loc[df_FIFA['diff_score'] < 0, 'away_team']
df_FIFA.loc[df_FIFA['diff_score'] == 0, 'win_team'] = 'Draw'
df_FIFA.head()

# The second method to get the winners: loop over the rows
def find_win_team(df):
    winners = []
    for i, row in df.iterrows():
        if row['home_score'] > row['away_score']:
            winners.append(row['home_team'])
        elif row['home_score'] < row['away_score']:
            winners.append(row['away_team'])
        else:
            winners.append('Draw')
    return winners

df_FIFA['winner'] = find_win_team(df_FIFA)
df_FIFA.head()
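As a third option, the same column can be built in one vectorized step. The following is only a sketch: it uses `numpy.select`, which the original article does not use, and the `demo` frame and the `add_winner` helper are made up for illustration (only the column names match the data set).

```python
import numpy as np
import pandas as pd

# Made-up mini data set with the same column names as results.csv
demo = pd.DataFrame({
    'home_team': ['Brazil', 'Germany', 'Italy'],
    'away_team': ['Croatia', 'Argentina', 'Uruguay'],
    'home_score': [3, 1, 0],
    'away_score': [1, 1, 1],
})

def add_winner(df):
    """Return a copy of df with a 'win_team' column (team name or 'Draw')."""
    out = df.copy()
    conditions = [
        out['home_score'] > out['away_score'],  # home win
        out['home_score'] < out['away_score'],  # away win
    ]
    choices = [out['home_team'], out['away_team']]
    # Rows matching neither condition are draws
    out['win_team'] = np.select(conditions, choices, default='Draw')
    return out

demo = add_winner(demo)
print(demo['win_team'].tolist())
```

On a data set of this size all three approaches are fast enough; the vectorized form mainly avoids the explicit Python loop of the second method.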

The results are as follows:

2. Top 20 statistics across all World Cup matches

2.1 Top 20 teams by number of World Cup wins

# Count wins per team, sort from most to fewest, and drop the 'Draw' entry
s = df_FIFA.groupby('win_team')['win_team'].count()
s.sort_values(ascending=False, inplace=True)
s.drop(labels=['Draw'], inplace=True)

Visualize the result with the plotting functions built into pandas:

Bar chart

s.head(20).plot(kind='bar', figsize=(10,6), title='Top 20 Winners of World Cup')

Horizontal bar chart

# Sort ascending so that the team with the most wins ends up at the top of the chart
s.sort_values(ascending=True, inplace=True)
s.tail(20).plot(kind='barh', figsize=(10,6), title='Top 20 Winners of World Cup')

Pie chart

# Each team's share of all wins
s_percentage = s / s.sum()
s_percentage
# s was sorted in ascending order above, so tail(20) holds the 20 teams with the most wins
s_percentage.tail(20).plot(kind='pie', figsize=(10,10), autopct='%.1f%%',
                           startangle=173, title='Top 20 Winners of World Cup', label='')

Analysis Conclusion 1:

In terms of the number of matches won, Brazil, Germany, Italy and Argentina are the four strongest teams.

The same series also lets us look up the number of wins for individual countries:

s.get('China', default = 'NA')
s.get('Japan', default = 'NA')
s.get('Korea DPR', default = 'NA')
s.get('Korea Republic', default = 'NA')
s.get('Egypt', default = 'NA')

The results are ‘NA’, 4, 1, 5 and ‘NA’, respectively.

In other words, China has never won a match at the World Cup finals (qualifiers excluded). Egypt, a dark horse at this World Cup, has played in the tournament twice before but has not won a match either.
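Looking teams up one at a time does not scale, so the "played but never won" question can also be answered for every team at once. The sketch below is an assumption about how one might do it, not code from the original article: `winless_teams` is a hypothetical helper, and the `demo` rows are invented and do not reflect real results.

```python
import pandas as pd

# Invented rows; only the column names follow the article's df_FIFA
demo = pd.DataFrame({
    'home_team': ['Brazil', 'Egypt', 'Italy', 'Egypt'],
    'away_team': ['Egypt', 'Italy', 'Brazil', 'China'],
    'win_team':  ['Brazil', 'Italy', 'Draw', 'Draw'],
})

def winless_teams(df):
    """Teams that appear in at least one match but never in 'win_team'."""
    played = set(df['home_team']) | set(df['away_team'])
    won = set(df['win_team']) - {'Draw'}
    return sorted(played - won)

print(winless_teams(demo))
```

Applied to the real `df_FIFA`, the result could then be intersected with the list of 2018 participants to find qualified teams that are still waiting for their first World Cup win.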

So much for the number of matches won. Now let’s look at the total number of goals scored.

2.2 Total number of goals scored by each national team

# Stack home and away goals into one (team, score) table
column_update = ['team', 'score']
df_score_home = df_FIFA[['home_team', 'home_score']]
df_score_home.columns = column_update
df_score_away = df_FIFA[['away_team', 'away_score']]
df_score_away.columns = column_update
df_score = pd.concat([df_score_home, df_score_away], ignore_index=True)

# Total goals per team, sorted ascending so the top scorer is at the top of the chart
s_score = df_score.groupby('team')['score'].sum()
s_score.sort_values(ascending=True, inplace=True)
s_score.tail(20).plot(kind='barh', figsize=(10,6), title='Top 20 in Total Scores of World Cup')
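Because the same home/away stacking is needed again later for the 32 qualified teams, it may be convenient to wrap it in a function. This is a sketch under that assumption: `total_goals` is a hypothetical helper name, and the `demo` scores are invented.

```python
import pandas as pd

# Invented scores; column names follow the data set
demo = pd.DataFrame({
    'home_team': ['Brazil', 'Germany', 'Brazil'],
    'away_team': ['Germany', 'Italy', 'Italy'],
    'home_score': [1, 2, 3],
    'away_score': [7, 0, 2],
})

def total_goals(df):
    """Total goals per team, counting both home and away matches."""
    home = df[['home_team', 'home_score']].rename(
        columns={'home_team': 'team', 'home_score': 'score'})
    away = df[['away_team', 'away_score']].rename(
        columns={'away_team': 'team', 'away_score': 'score'})
    both = pd.concat([home, away], ignore_index=True)
    return both.groupby('team')['score'].sum().sort_values(ascending=False)

goals = total_goals(demo)
print(goals)
```

Using `rename` instead of assigning to `.columns` works on a fresh copy, so no warning about modifying a slice of the original frame is raised.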

Analysis Conclusion 2:

In terms of total goals scored, Germany, Brazil, Argentina and Italy are the four strongest teams.

The analysis above covers every team in the data set, which goes back to 1872. Below, we focus on the 32 teams that qualified for the 2018 World Cup.

3. Analysis of the 32 teams at the 2018 World Cup

The draw pots for the 2018 FIFA World Cup are as follows:

Pot 1: Russia, Germany, Brazil, Portugal, Argentina, Belgium, Poland, France

Pot 2: Spain, Peru, Switzerland, England, Colombia, Mexico, Uruguay, Croatia

Pot 3: Denmark, Iceland, Costa Rica, Sweden, Tunisia, Egypt, Senegal, Iran

Pot 4: Serbia, Nigeria, Australia, Japan, Morocco, Panama, South Korea, Saudi Arabia

Get all the data for the 32 teams

First, check whether any of the teams has qualified for the World Cup for the first time.

team_list = ['Russia', 'Germany', 'Brazil', 'Portugal', 'Argentina', 'Belgium', 'Poland', 'France',
             'Spain', 'Peru', 'Switzerland', 'England', 'Colombia', 'Mexico', 'Uruguay', 'Croatia',
             'Denmark', 'Iceland', 'Costa Rica', 'Sweden', 'Tunisia', 'Egypt', 'Senegal', 'Iran',
             'Serbia', 'Nigeria', 'Australia', 'Japan', 'Morocco', 'Panama', 'Korea Republic', 'Saudi Arabia']

# A team that is missing from the goals index has never played a World Cup match
for item in team_list:
    if item not in s_score.index:
        print(item)

Output:

Iceland

Panama

According to the above analysis, Iceland and Panama qualified for the World Cup for the first time.
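The same check can be wrapped in a small function that returns the newcomers instead of printing them. This is only an illustrative sketch: `first_timers` is a hypothetical helper, and the numbers in `s_demo` are placeholders, not real goal totals.

```python
import pandas as pd

# Placeholder totals standing in for the article's s_score series
s_demo = pd.Series({'Brazil': 10, 'Germany': 8}, name='score')

def first_timers(teams, history_index):
    """Teams from `teams` that do not occur in the historical index, in input order."""
    known = set(history_index)
    return [team for team in teams if team not in known]

newcomers = first_timers(['Brazil', 'Iceland', 'Germany', 'Panama'], s_demo.index)
print(newcomers)
```

Converting the index to a set first keeps each membership test cheap, although with only 32 teams the difference is negligible.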

Since Iceland and Panama are appearing at the World Cup for the first time, neither team has any World Cup match history in the data.

Next, keep only the World Cup matches in which both sides are among the 32 teams:

df_top32 = df_FIFA[(df_FIFA['home_team'].isin(team_list)) & (df_FIFA['away_team'].isin(team_list))]
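Note that the filter combines the two `isin` masks with `&`, so a match is kept only when both teams are in the list; wins against teams that did not qualify for 2018 are therefore not counted. The sketch below contrasts this with `|` on an invented three-row frame (the `demo` data and the two-team list are assumptions for illustration).

```python
import pandas as pd

# Invented fixtures; only the column names follow the data set
demo = pd.DataFrame({
    'home_team': ['Brazil', 'Brazil', 'Italy'],
    'away_team': ['Germany', 'Italy', 'Netherlands'],
})
teams = ['Brazil', 'Germany']

home_in = demo['home_team'].isin(teams)
away_in = demo['away_team'].isin(teams)

both = demo[home_in & away_in]    # both sides in the list (the article's choice)
either = demo[home_in | away_in]  # at least one side in the list

print(len(both), len(either))
```

Which variant is appropriate depends on the question: `&` compares the qualified teams only against each other, while `|` would include all of their World Cup matches.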

3.1 Data for the 32 teams since 1872

Number of matches won

# Wins per team, without draws, sorted ascending for the horizontal bar chart
s_32 = df_top32.groupby('win_team')['win_team'].count()
s_32.drop(labels=['Draw'], inplace=True)
s_32.sort_values(ascending=True, inplace=True)
s_32.plot(kind='barh', figsize=(8,12), title='Top 32 of World Cup since year 1872')

Goals scored

# Stack home and away goals into one (team, score) table
column_update = ['team', 'score']
df_score_home_32 = df_top32[['home_team', 'home_score']]
df_score_home_32.columns = column_update
df_score_away_32 = df_top32[['away_team', 'away_score']]
df_score_away_32.columns = column_update
df_score_32 = pd.concat([df_score_home_32, df_score_away_32], ignore_index=True)

# Total goals per team, sorted ascending for the horizontal bar chart
s_score_32 = df_score_32.groupby('team')['score'].sum()
s_score_32.sort_values(ascending=True, inplace=True)
s_score_32.plot(kind='barh', figsize=(8,12), title='Top 32 in Total Scores of World Cup since year 1872')

Analysis Conclusion 3:

Among the 32 teams, Germany, Brazil and Argentina are the top three since 1872 in terms of both wins and goals scored.

The period from 1872 to the present spans more than 100 years, and some countries have changed considerably over that time. The following sections therefore repeat the analysis for matches since 1978 (the last 10 tournaments) and since 2002 (the last 4 tournaments).

The program code is similar; only the visual results are shown here.
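For readers who want to reproduce those charts, the only extra step is a filter on the `year` column created earlier. The following is a minimal sketch under that assumption; `wins_since` is a hypothetical helper and the `demo` rows are invented.

```python
import pandas as pd

# Invented rows; 'year' and 'win_team' follow the columns built earlier in the article
demo = pd.DataFrame({
    'year': [1974, 1978, 1978, 2002, 2014],
    'win_team': ['Germany', 'Argentina', 'Draw', 'Brazil', 'Germany'],
})

def wins_since(df, start_year):
    """Number of wins per team in matches played in start_year or later."""
    recent = df[df['year'] >= start_year]
    wins = recent['win_team'].value_counts()
    # Draws are not wins; ignore the label if no draw occurred in the period
    wins = wins.drop(labels=['Draw'], errors='ignore')
    return wins.sort_values(ascending=True)

wins_1978 = wins_since(demo, 1978)
wins_2002 = wins_since(demo, 2002)
print(wins_1978)
print(wins_2002)
```

Calling `wins_since(df_top32, 1978).plot(kind='barh')` on the real data should then give a chart comparable to the ones described in the following sections; the goal totals can be restricted in the same way before summing.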

3.2 Data for the 32 teams since 1978

Number of matches won

Goals scored

Analysis Conclusion 4:

Since 1978, Argentina, Germany and Brazil have been the top three of the 32 teams in terms of matches won.

In terms of goals scored, the top three teams are the same, but Germany’s statistical advantage is more pronounced.

3.3 Data for the 32 teams since 2002

Number of matches won

Goals scored

Analysis Conclusion 5:

Since 2002, Germany, Argentina and Brazil have been the top three of the 32 teams in terms of both wins and goals scored, with Germany holding a clear statistical lead.

4. Comprehensive conclusions

Based on past World Cup results, the three favorites among the 32 teams at the 2018 World Cup are predicted to be Germany, Argentina and Brazil, with Germany the most likely to win.

Special note: this analysis is purely a personal learning exercise. The predictions may differ greatly from the actual outcome and should not be used for any other purpose.

This article is a fairly comprehensive hands-on project; I hope it gives you some inspiration.

Code download

To get the code for this article, long-press the QR code of the “Python Data Path” account and reply with PyDataRoad.
