Using Python 3 for big-data dimensionality reduction (factor analysis scoring) on players to find a qualified wingman for Ronaldo

This article is reprinted from Liu Yue's technology blog: v3u.cn/a_id_176

It is well known that Juventus need a Champions League trophy and that Ronaldo wants another one to add to his honours list. The Serie A champions lost to Lyon in the round of 16 of this year's UEFA Champions League, and the club sacked coach Maurizio Sarri and appointed Andrea Pirlo, but that is not enough. They need a powerful centre-forward, and with Gonzalo Higuain out of the picture, Juventus must bring one in.

The question now is, with the new season just around the corner and budgets squeezed by the pandemic, who is the right candidate? This time we use the factor_analyzer library for Python 3 to analyze the players and try to find the best striker Juventus could recruit.

First, let's draw a line and rule out impossible signings such as Robert Lewandowski of Bayern Munich, Harry Kane of Tottenham Hotspur and Karim Benzema of Real Madrid, three world-class centre-forwards whose chances of joining Juventus are almost nil because of their price tags and other factors. Being realistic, Luis Suarez of Barcelona, Edin Dzeko of Roma and Alvaro Morata of Atletico Madrid are the likely candidates. Suarez has fallen out with Barca and is almost certain to leave, Morata will not be at Atletico next season either, and Dzeko is enjoying a fine spell at Roma but clearly wants more glory.

Data analysis starts with data, so let’s take a look at the three of them from last season.

First up, Luis Suarez and Alvaro Morata in La Liga:

Here we take the two most important stats for a centre-forward, goals scored and conversion rate. Besides trailing by four goals, Morata converted just 14.5 percent of his chances, behind Suarez's 19 percent.

Apart from scoring goals, a centre-forward also needs to be able to create chances, which helps Ronaldo's attacking play:

Morata also lags behind Suarez as a provider. Now let's look at the figures for Dzeko and Higuain, both in Serie A:

Dzeko was clearly stronger than Higuain last season, both in attack and as a provider.

Now let's extract the features. Here we use goals, conversion rate, and assists. Of course, you can add other features if you like. Player nationality, scandals, personal ambitions (or expectations), injury history and severity, and so on are not considered.

Pure ability data is therefore the core, and cost factors such as transfer fees and annual salaries are not included. In theory we could also fold in a subjective reading of the transfer news, but we do not. Likewise, the competitive level of a player's teammates and club is not used as reference data, because even a highly talented player who trains for a long time with teammates or coaches below his level will deviate hugely from expectations.

Build the dataset:

import pandas as pd
import numpy as np
from pandas import DataFrame, Series

# Last season's figures; the values match the data matrix printed below
mydata = {
    'goals': [16, 12, 16, 8],
    'conversion rate': [19, 14, 13, 10],
    'assists': [8, 2, 7, 4],
}
data = DataFrame(mydata)
data.index = ['Suarez', 'Morata', 'Dzeko', 'Higuain']
print(data)

Data matrix:

         goals  conversion rate  assists
Suarez      16               19        8
Morata      12               14        2
Dzeko       16               13        7
Higuain      8               10        4

Factor analysis is a statistical method that studies the internal structure of the correlation matrix of the original data and condenses many indicators into a small number of mutually uncorrelated, unobservable random variables (factors) that retain most of the information in the original indicators. The procedure is: standardize the original data; compute the correlation matrix and its eigenvalues and eigenvectors; choose the number of common factors, either by keeping the factors whose eigenvalue is greater than or equal to 1 or by keeping enough factors for the cumulative contribution rate to exceed 80%; obtain the factor loading matrix through orthogonal or oblique rotation; and finally calculate the common factor scores and the composite score.
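The first steps of that procedure can be sketched by hand with NumPy. This is only an illustration of standardization, the correlation matrix, and the two selection rules, using the figures from the data matrix above. It is not what factor_analyzer does internally, and the variable names are assumptions made for this sketch.

```python
import numpy as np

# Player stats from the article's data matrix: goals, conversion rate (%), assists.
X = np.array([
    [16, 19, 8],   # Suarez
    [12, 14, 2],   # Morata
    [16, 13, 7],   # Dzeko
    [8, 10, 4],    # Higuain
], dtype=float)

# Step 1: standardize each column to zero mean and unit (sample) variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: correlation matrix of the variables (columns).
R = np.corrcoef(Z, rowvar=False)

# Step 3: eigenvalues of the symmetric correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(R)[::-1]

# Step 4a: keep the factors whose eigenvalue is at least 1.
n_by_eigenvalue = int(np.sum(eigenvalues >= 1.0))

# Step 4b: or keep enough factors to reach 80% cumulative contribution.
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
n_by_cumulative = int(np.searchsorted(cumulative, 0.8) + 1)

print("eigenvalues:", eigenvalues)
print("cumulative contribution:", cumulative)
print("factors by eigenvalue rule:", n_by_eigenvalue)
print("factors by 80% rule:", n_by_cumulative)
```

The eigenvalues of a correlation matrix always sum to the number of variables, which is why an eigenvalue of 1 marks a factor that explains as much as one original variable.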

The first step is to establish a factor analysis model:

from factor_analyzer import FactorAnalyzer, Rotator

# Fit an unrotated factor model (n_factors defaults to 3)
fa = FactorAnalyzer(rotation=None)
fa.fit(data)

# Initial (unrotated) factor loading matrix
print(fa.loadings_)

The factor loadings reflect how strongly each common factor is associated with each original variable. The structure of the initial loading matrix is not simple enough, so the meaning of each factor does not stand out. For this reason, the maximum variance (varimax) method is used to rotate the loadings so that each variable loads highly on one factor and only weakly on the others. After the iteration converges, the rotated factor loading matrix is obtained:

rotator = Rotator()  # the default rotation method is varimax
print("Rotated loadings:\n", rotator.fit_transform(fa.loadings_))
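To see what the maximum variance idea means, here is a minimal sketch for the two-factor case. It searches rotation angles by brute force for the one that maximizes the variance of the squared loadings. The loading matrix and the function names are hypothetical, and this is not the iterative algorithm the Rotator class uses. It only illustrates the criterion being maximized.

```python
import numpy as np

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.sum(np.var(loadings ** 2, axis=0)))

def rotate(loadings, angle):
    """Apply an orthogonal 2-D rotation to a two-factor loading matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return loadings @ np.array([[c, -s], [s, c]])

# Hypothetical unrotated loadings: 4 variables, 2 factors (not from the article).
loadings = np.array([
    [0.7, 0.5],
    [0.6, 0.6],
    [0.7, -0.5],
    [0.6, -0.6],
])

# Brute-force search over rotation angles for the largest criterion value.
angles = np.linspace(0.0, np.pi / 2, 901)
best_angle = max(angles, key=lambda a: varimax_criterion(rotate(loadings, a)))
rotated = rotate(loadings, best_angle)

print("criterion before:", varimax_criterion(loadings))
print("criterion after: ", varimax_criterion(rotated))
print("rotated loadings:\n", np.round(rotated, 3))
```

Because the rotation is orthogonal, each variable's sum of squared loadings is unchanged. Only the way that variance is split between the factors changes.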

Next we can look at the communality of each variable, which is the sum of the squared loadings of that variable across the common factors, that is, the proportion of the variable's variance that is determined by the common factors. The variance of a variable is made up of a common part and a unique part. The greater the communality, the more of the variable's variance the factors explain. In other words, the communality tells us how much of the original variable's information is retained if the variable is replaced by the common factors.

print(fa.get_communalities())
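As a small worked example of that definition, the sketch below computes communalities directly from a loading matrix. The loadings and variable names are made up for illustration and are not the output of the model fitted above.

```python
import numpy as np

# Hypothetical loading matrix: rows are variables, columns are common factors.
loadings = np.array([
    [0.9, 0.3],
    [0.8, -0.4],
    [0.2, 0.6],
])

# Communality: sum of squared loadings across factors, one value per variable.
communalities = np.sum(loadings ** 2, axis=1)

# Uniqueness: the share of a standardized variable's variance left unexplained.
uniquenesses = 1.0 - communalities

for name, h2, u2 in zip(["var1", "var2", "var3"], communalities, uniquenesses):
    print(f"{name}: communality={h2:.2f}, uniqueness={u2:.2f}")
```

Here the first variable is well represented by the two factors (0.90), while the third is mostly unique variance (0.40).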

You can also view the eigenvalues of the correlation matrix and of the common factors:

print(fa.get_eigenvalues())

Of course, our ultimate goal is a composite score for each player based on the factor model. Each factor score is weighted by that factor's variance contribution rate, and the weighted scores are summed to give each player's composite score F:

def F(factors):
    # get_factor_variance()[1] holds each factor's proportional variance
    return sum(factors * fa.get_factor_variance()[1])
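The weighting can be checked by hand on a toy case. The sketch below uses made-up factor scores and contribution rates, not values from the fitted model, and writes the weighted sum as a single matrix product.

```python
import numpy as np

# Hypothetical factor scores: one row per player, one column per factor.
factor_scores = np.array([
    [1.0, 0.5],
    [-0.5, 0.2],
    [-0.5, -0.7],
])

# Hypothetical variance contribution rate of each factor.
contribution = np.array([0.6, 0.2])

# Composite score: each factor score weighted by its factor's contribution.
composite = factor_scores @ contribution
print(composite)
```

For the first player this is 1.0 * 0.6 + 0.5 * 0.2 = 0.7. The matrix product does the same thing for every row at once.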

Then compute the score for each row of the matrix:

factor_scores = fa.transform(data)  # one row of factor scores per player

scores = []
for i in range(len(factor_scores)):
    new = F(factor_scores[i])
    scores.append(new)

print(scores)

This gives the score array:

[0.7294004536510521, -0.2958329655707666, 0.530110265958429, -0.9636777540387146]

We can then add the scores to the original matrix as a new column:

data['composite score'] = scores
print(data)

The new matrix:

         goals  conversion rate  assists  composite score
Suarez      16               19        8         0.729400
Morata      12               14        2        -0.295833
Dzeko       16               13        7         0.530110
Higuain      8               10        4        -0.963678

You can also sort by the new column to make the data easier to read:

data = data.sort_values(by='composite score', ascending=False)

The sorted matrix:

         goals  conversion rate  assists  composite score
Suarez      16               19        8         0.729400
Dzeko       16               13        7         0.530110
Morata      12               14        2        -0.295833
Higuain      8               10        4        -0.963678

We can also visualize the scores if we wish, for example as a horizontal bar chart:

import matplotlib.pyplot as plt
import matplotlib

# Font settings from the original post; SimHei is only needed for Chinese labels
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False

plt.barh(range(4), scores, height=0.7, color='steelblue', alpha=0.8)
plt.yticks(range(4), ['Suarez', 'Morata', 'Dzeko', 'Higuain'])
plt.xlim(-1, 2)  # the lower bound must be negative to show the negative scores
plt.xlabel("score")
plt.title("signings scoring")
for x, y in enumerate(scores):
    plt.text(y + 0.2, x - 0.1, '%s' % y)
plt.show()

According to the composite score, Suarez is clearly the best choice, Dzeko is the next best, and Morata is third. In any case, all three rate higher than Higuain, who is already at the club. From this point of view, even signing Morata is better than keeping Higuain.

Epilogue: it must be pointed out that data are only an outcome shaped by the players themselves. They can never be the main basis for a decision and can serve only as a reference, and relying on them too heavily often backfires. Take Rafael Benitez, once hailed as football's "data tactician": his habit of picking line-ups and formations by the numbers was once all the rage, but now? He can only coach in the Chinese Super League. At the time of publication, in the early hours of 24 September 2020, Juventus have signed Morata on loan, Suarez has gone to Atletico Madrid for £6m, and Dzeko is staying put. Juventus chose Morata, who scored low in the factor analysis model. Can Morata help Ronaldo realize his dream? Who will win the next Champions League? Let's wait and see.
