Original link: tecdat.cn/?p=6689

Original source: the Tecdat data tribe official account

 

In this article, 188 countries are clustered on 19 socioeconomic indicators using a Monte Carlo K-means clustering algorithm implemented in Python. Clustering can reduce the amount of work required to identify attractive investment opportunities by grouping similar countries together and generalizing about them.

Before clustering the countries and drawing conclusions from the results, this article covers distance metrics, measures of clustering quality, classes of clustering algorithms, and the K-means clustering algorithm in detail.

 


 

Clustering theory – measures of similarity and distance

Clustering is the process of dividing a set of heterogeneous (dissimilar) objects into subsets of homogeneous (similar) objects. At the core of cluster analysis is the assumption that, given any two objects, you can quantify the similarity or dissimilarity between them. In continuous search spaces, similarity is measured by distance.

Below I describe some similarity measures for continuous search spaces. For each one I include the formula (given two vectors p and q) and Python code. All of the Python code used to write this article is available.


import numpy as np


class Similarity:
    """Implements similarity and distance metrics for continuous search spaces."""

    def __init__(self, minimum):
        # minimum is a small floor value used to avoid returning a zero distance
        self.e = minimum
        # VectorOperations is a helper class defined elsewhere in the full source
        self.vector_operators = VectorOperations()

    def manhattan_distance(self, p_vec, q_vec):
        """
        This method implements the manhattan distance metric
        :param p_vec: vector one
        :param q_vec: vector two
        :return: the manhattan distance between vector one and two
        """
        return max(np.sum(np.fabs(p_vec - q_vec)), self.e)

    def square_euclidean_distance(self, p_vec, q_vec):
        """
        This method implements the squared euclidean distance metric
        :param p_vec: vector one
        :param q_vec: vector two
        :return: the squared euclidean distance between vector one and two
        """
        diff = p_vec - q_vec
        return max(np.sum(diff ** 2), self.e)

Clustering theory – classes of clustering algorithms

The two main categories of clustering algorithms are hierarchical clustering and partitional clustering. Hierarchical clusterings are formed either by merging smaller clusters into larger ones or by splitting larger clusters into smaller ones. Partitional clusterings are formed by dividing the input data set into mutually exclusive subsets.

The difference between hierarchical and partitional clustering mostly concerns the inputs they require. Hierarchical clustering requires only a similarity measure, whereas partitional clustering may require a number of additional inputs, most commonly the number of clusters, k. Generally speaking, hierarchical clustering algorithms are also better suited to categorical data.

Hierarchical clustering

There are two types of hierarchical clustering, namely agglomerative clustering and divisive clustering. Agglomerative clustering is a bottom-up approach that starts with each input pattern as its own cluster and merges smaller clusters into larger ones. Divisive clustering is a top-down approach that starts with one large cluster (all input patterns) and splits it into smaller and smaller clusters until each input pattern is a cluster of its own.

Partitional clustering

This article focuses on partitional clustering algorithms. The two main categories of partitional clustering algorithms are centroid-based clustering and density-based clustering. This article concentrates on centroid-based clustering, in particular the popular K-means clustering algorithm.

 

 


 

Clustering theory – K-means clustering algorithm

The K-means clustering algorithm is a centroid-based partitional clustering algorithm that uses a mean-shift heuristic. It consists of three steps (initialization, assignment, and update), which are repeated until the clustering has converged or the number of iterations has been exceeded, i.e. the computational budget has been exhausted.

Initialization

To start, a random set of centroids is initialized in the search space. These centroids must be on the same order of magnitude as the data patterns being clustered. In other words, it makes no sense to initialize random vectors with values between 0 and 1 if the values in the data patterns lie between 0 and 100.


Note: make sure the data is normalized across each attribute, rather than across each pattern.
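As a rough sketch (not the article's original code; the function name is illustrative), a random initialization that respects the magnitude of the data can draw each centroid uniformly between the per-attribute minima and maxima:

def initialize_centroids(patterns, num_clusters, seed=None):
    # draw each centroid uniformly between the per-attribute minima and maxima,
    # so the centroids start on the same order of magnitude as the data
    rng = np.random.default_rng(seed)
    lower, upper = patterns.min(axis=0), patterns.max(axis=0)
    return rng.uniform(lower, upper, size=(num_clusters, patterns.shape[1]))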

Assignment

Once the centroids have been randomly initialized in the search space, we iterate over each pattern in the data set and assign it to the nearest centroid. This step can be done in parallel, which is worthwhile when there are a large number of patterns in the data set.

Update

Once patterns have been assigned to their centroids, the mean-shift heuristic is applied. This heuristic replaces each value in each centroid with the mean of that value over the patterns assigned to that centroid, which moves the centroid towards the high-dimensional mean of its patterns. The problem with the mean-shift heuristic is that it is sensitive to outliers. To overcome this, the K-medoids clustering algorithm can be used instead, or the data can be standardized to dampen the effect of outliers.
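A minimal sketch of this update step (again with illustrative names, not the article's original code):

def update_centroids(patterns, assignments, centroids):
    # replace each centroid with the mean of the patterns currently assigned to it;
    # centroids with no assigned patterns are left where they are
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = patterns[assignments == k]
        if len(members) > 0:
            new_centroids[k] = members.mean(axis=0)
    return new_centroids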


Iteration

These three steps are repeated over a number of iterations until the clustering converges on a solution. (The original article illustrates this with an animated GIF.)

PYTHON code – an addition to the Clustering class

The following Python method is an addition to the Clustering class that allows it to perform the K-means clustering algorithm. This involves updating the centroids using the mean-shift heuristic.
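That method is not reproduced in this excerpt; below is a rough standalone sketch of the full loop, using the initialize_centroids and update_centroids helpers sketched above rather than the article's Clustering class API:

def k_means(patterns, num_clusters, max_iterations=100, seed=None):
    # initialization: random centroids within the bounds of the data
    centroids = initialize_centroids(patterns, num_clusters, seed)
    assignments = np.zeros(len(patterns), dtype=int)
    for _ in range(max_iterations):
        # assignment: index of the nearest centroid for every pattern (Euclidean distance)
        distances = np.linalg.norm(patterns[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # update: move each centroid to the mean of its assigned patterns
        new_centroids = update_centroids(patterns, assignments, centroids)
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, assignments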

  


Clustering theory – the measurement of clustering quality

Given a similarity measure and a clustering of the data, you still need an objective function to measure the quality of that clustering. Most clustering quality metrics attempt to optimize the clustering in terms of inter-cluster and intra-cluster distances. Simply put, these metrics try to ensure that patterns in the same cluster are close together and that patterns in different clusters are far apart.

Quantization error

Quantization error measures the rounding error introduced by quantization, which maps a set of input values to a finite, smaller set. This is basically what we do by clustering patterns into k clusters.

Note: The image also assumes that we use the Manhattan distance.

In the definition of the quantization error above, we calculate the sum of the squared distances between each pattern and its assigned centroid.
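Written out (a standard formulation, not reproduced from the article's missing figure), the quantization error over k clusters with a generic distance d is:

QE = \sum_{j=1}^{k} \sum_{z \in C_j} d(z, c_j)^2

where C_j is the set of patterns assigned to centroid c_j.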

Davies-Bouldin index

The Davies-Bouldin criterion is based on a ratio of intra-cluster and inter-cluster distances for a particular clustering.

Note: The image assumes that we use the Manhattan distance.

In the diagram above of the Davies-Bouldin index, we have three clusters of three patterns.
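For reference, the standard definition of the index (stated here from the general literature, not from the missing figure) is:

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}

where \sigma_i is the average distance between the patterns in cluster i and its centroid c_i, and d(c_i, c_j) is the distance between the two centroids; lower values indicate better clusterings.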

Silhouette index

The silhouette index is one of the most popular ways to measure the quality of a particular clustering. It measures how similar each pattern is to the patterns in its own cluster, compared with the patterns in other clusters.
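In symbols (the standard formulation, which is what the code below computes for a single pattern): for pattern i, let a(i) be the average distance to the other patterns in its own cluster and b(i) the smallest average distance to the patterns of any other cluster; then

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

and s(i) lies between -1 and 1, with higher values being better.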

 

def silhouette_index(self, index):
    # store the total distance to each cluster
    silhouette_totals = []
    # store the number of patterns in each cluster
    silhouette_counts = []
    # initialize the variables
    for i in range(self.solution.num_clusters):
        silhouette_totals.append(0.0)
        silhouette_counts.append(0.0)
    s = Similarity(self.e)
    for i in range(len(self.solution.patterns)):
        # for every pattern other than the one we are calculating now
        if i != index:
            # get the distance between pattern[index] and that pattern
            distance = s.fractional_distance(self.solution.patterns[i], self.solution.patterns[index])
            # add that distance to the silhouette totals for the correct cluster
            silhouette_totals[self.solution.solution[i]] += distance
            # update the number of patterns in that cluster
            silhouette_counts[self.solution.solution[i]] += 1
    # setup variable to find the cluster (not equal to pattern[index]'s cluster) with the smallest average distance
    smallest_silhouette = silhouette_totals[0] / max(1.0, silhouette_counts[0])
    for i in range(len(silhouette_totals)):
        # calculate the average distance of each pattern in that cluster from pattern[index]
        silhouette = silhouette_totals[i] / max(1.0, silhouette_counts[i])
        # if the average distance is lower and it isn't pattern[index]'s cluster, update the value
        if silhouette < smallest_silhouette and i != self.solution.solution[index]:
            smallest_silhouette = silhouette
    # calculate the internal cluster distances for pattern[index]
    index_cluster = self.solution.solution[index]
    index_silhouette = silhouette_totals[index_cluster] / max(1.0, silhouette_counts[index_cluster])
    # return the ratio between the smallest average distance from pattern[index] to another cluster's patterns and
    # the average distance to the patterns belonging to the same cluster as pattern[index]
    return (smallest_silhouette - index_silhouette) / max(smallest_silhouette, index_silhouette)

A high silhouette value means that a pattern matches well with its own cluster and poorly with neighbouring clusters. One should aim to maximize the silhouette value of each pattern in the data set.

Note: The image also assumes that we use the Manhattan distance.

 

After using these metrics over the past few months, I have come to the conclusion that none of them is perfect:

  1. Quantization error – the computational complexity of this metric is minimal, but it is biased toward large numbers of clusters, because the clusters become smaller (more compact) as you add more centroids; in the extreme case, every pattern gets its own centroid, and the quantization error is minimized. That said, the results obtained using this metric were the most reliable.
  2. Davies-Bouldin – as you increase the value of k, the distance between centroids naturally decreases on average. Because this term is in the denominator, for larger values of k you end up dividing by a smaller number. The result is a metric that is biased toward solutions with a smaller number of clusters.
  3. Silhouette index – the computational complexity of this metric is prohibitive. Suppose you compute the distance from each pattern to every other pattern in order to work out which cluster is closest; doing this for every pattern requires |Z| * |Z - 1| distance calculations, where |Z| is the number of patterns. In this case, that is 35,156 calculations. With memoization, the number of calculations per pass can be halved.

These biases show up in the following correlation analysis of the different metrics: despite the fact that they are supposed to measure the same thing, they are almost entirely negatively correlated.

 

X     QE      DB      SI
QE    1.0     0.965   0.894
DB    0.965   1.0     0.949
SI    0.894   0.949   1.0
 

 

PYTHON code – Clustering

Before you can evaluate the fitness of a given clustering, you need to produce the actual clustering. The Clustering class contains methods for assigning patterns to their nearest centroids.
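Those methods are not shown in this excerpt; a minimal sketch of the assignment step, reusing the Similarity class from earlier (the function name and the default minimum value are illustrative), might look like this:

def assign_patterns(patterns, centroids, minimum=1e-10):
    # assign every pattern to the index of its nearest centroid,
    # using the Manhattan distance from the Similarity class above
    s = Similarity(minimum)
    assignments = []
    for pattern in patterns:
        distances = [s.manhattan_distance(pattern, centroid) for centroid in centroids]
        assignments.append(int(np.argmin(distances)))
    return np.array(assignments)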

 

PYTHON code – Objective functions

The ClusteringQuality class measures the quality of a given clustering of the input patterns.
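As an illustration only (this is not the article's original class), a quantization-error objective over a clustering solution could be sketched as follows; the .patterns and .solution attributes follow the silhouette_index code above, while .centroids is an assumption of this sketch:

class ClusteringQuality:
    """Rough sketch: measures the quality of a clustering solution."""

    def __init__(self, solution, minimum):
        # solution is assumed to expose .patterns, .centroids and .solution
        # (the per-pattern cluster assignments)
        self.solution = solution
        self.e = minimum

    def quantization_error(self):
        # sum of squared distances between each pattern and its assigned centroid
        s = Similarity(self.e)
        total = 0.0
        for i, pattern in enumerate(self.solution.patterns):
            centroid = self.solution.centroids[self.solution.solution[i]]
            total += s.square_euclidean_distance(pattern, centroid)
        return total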

  

Clustering theory – Monte Carlo method in clustering

The two biggest problems with the K-means clustering algorithm are:

  1. It is sensitive to the random initialization of the centroids
  2. The number of centroids to initialize, k

For these reasons, the K-means clustering algorithm is often restarted many times. Because the initialization is (usually) random, we are essentially sampling random high-dimensional starting positions for the centroids, which is also known as Monte Carlo simulation. To compare solutions across independent simulations, we need a measure of clustering quality, such as those discussed earlier.
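In code, this amounts to repeating the whole procedure from fresh random starting positions and keeping the best solution according to the chosen quality measure. A rough sketch, assuming the k_means function sketched earlier and using quantization error as the quality measure:

def monte_carlo_k_means(patterns, num_clusters, simulations=1000):
    # repeat k-means from fresh random starting positions and keep the best
    # solution according to the quantization error (lower is better)
    best_quality, best_solution = float("inf"), None
    for sim in range(simulations):
        centroids, assignments = k_means(patterns, num_clusters, seed=sim)
        quality = float(np.sum((patterns - centroids[assignments]) ** 2))
        if quality < best_quality:
            best_quality, best_solution = quality, (centroids, assignments)
    return best_solution, best_quality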

Deterministic initialization

I say the initialization is usually random because deterministic initialization techniques also exist for the K-means clustering algorithm.

Random initialization

When the initialization is random, either pseudo-random or quasi-random sequences can be used. The difference is that the next number in a pseudo-random sequence is independent of the numbers before it, whereas the next number in a quasi-random sequence depends on the previous ones. Such correlated numbers cover a larger surface of the search space.

Comparing pseudo-random sequences (left) and quasi-random sequences (right) in two-dimensional space

Pick the right K

In addition to testing different initializations, we can also test different values of k within the Monte Carlo framework. There is currently no optimal way to dynamically determine the correct number of clusters, although techniques for choosing k are an active area of research. I prefer simply trying different values of k empirically and comparing the results, although this is time-consuming, especially on large data sets.
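In practice this just means wrapping the Monte Carlo routine in one more loop. A sketch, assuming the monte_carlo_k_means function above and that patterns holds the standardized data matrix:

results = {}
for k in (6, 7, 8):
    # run the full Monte Carlo simulation for each candidate number of clusters
    solution, quality = monte_carlo_k_means(patterns, num_clusters=k, simulations=1000)
    results[k] = (solution, quality)

# inspect the best quality found for each k before settling on one
for k, (_, quality) in sorted(results.items()):
    print(k, quality)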

 

Clustering results – visualization and centroid analysis

The final results presented below represent the best clustering found across simulations for k in {6, 7, 8}, with 1,000 independent simulations for each value of k. Euclidean distance and quantization error were the distance and quality measures used in the Monte Carlo K-means clustering. The data set used to produce the results is a standardized point-in-time data set from 2014, containing 19 socioeconomic indicators identified as being positively correlated with real GDP growth.

 

Cluster subdivision and centroid analysis

Each tab below breaks a cluster down into the countries that belong to it and compares the cluster's centroid to the overall centroid for each of the 19 socioeconomic indicators we clustered on.

 

 

Countries in this group in 2014

 


 




 


Clustering results – Conclusions and further research

Quantitative work is not just risk management, derivative pricing, or algorithmic trading; it is about challenging the way things are and, often, using statistical and computational methods to find a better way.

In 2004, the US was an outlier and occupied a cluster of its own. That cluster is characterized by a low PPP exchange rate, high imports, high exports, high household spending, high industrial production, and relatively high government revenue, particularly in health. The big difference at this point in time is that the amount of investment taking place in China is now much larger, as (obviously) is its population aged 15 to 64. China has also surpassed the United States in industrial production. These are shown in the side-by-side comparison below.

 

 

 

Colloquial groupings such as "Eastern and Western European countries" do show up in the results and are, for lack of a better word, correct. However, colloquial groupings such as the BRICS (Brazil, Russia, India, China and South Africa) are clearly driven more by political economy than by actual economics. Here are my thoughts on some of these common colloquialisms:

  1. Eastern and Western Europe – there appears to be a clear distinction between the countries in group I and those in groups V and II. The past decade has seen changes in Spain, Ireland, the Czech Republic and other nearby countries. This may be the result of the sovereign debt crisis.
  2. East and West – this is an oversimplification. Most Asian countries occupy different clusters, while traditional Western countries such as the US and UK do not actually occupy the same cluster.
  3. The BRICS countries – Brazil, Russia, India, China and South Africa belong to different clusters. While they may have reached trade agreements, this does not mean that these countries have the same social, demographic and economic composition or the same potential for future real GDP growth.
  4. Africa's growth story – while capital markets have performed well over the past decade, this does not seem to reflect significant changes in the continent's social, demographic and economic composition. Interestingly, India and Pakistan are no longer clustered with Central and Southern African countries.
  5. North Africa and Southern Africa – North African countries (Morocco, Algeria, Egypt, Libya, etc.) are clearly distinct from the rest of Africa. Surprisingly, South Africa is now grouped with these countries.
  6. Emerging versus developed – that seems an oversimplification. There seem to be some stages of development that will be discussed in the next section.

There are more colloquialisms, and I apologize for not commenting on all of them, but these six are simply the ones I come across most often. If you find other interesting relationships, please comment. Since we do not know the relative importance of each socioeconomic indicator, it is impossible to quantify how much better one cluster is than another. In some cases we cannot even say whether a large or small value is good or bad; for example, is large government spending still effective if the government is inefficient? Nevertheless, I tried to construct a rough metric for ranking each cluster:

Ranking = exports + household spending + imports + improved health + improved water + population + population growth (ages 15 to 64) + total investment + urban population percentage + mobile phone subscriptions + government revenue + government spending + health spending + industrial production + Internet users − PPP exchange rate − unemployment rate − age dependency ratio
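As an illustration of how such a ranking could be computed from centroid values (the column names below are hypothetical and would need to match the actual data set):

POSITIVE = ["exports", "household_spending", "imports", "improved_health",
            "improved_water", "population", "population_growth_15_64",
            "total_investment", "urban_percentage", "mobile_subscriptions",
            "government_revenue", "government_spending", "health_spending",
            "industrial_production", "internet_users"]
NEGATIVE = ["ppp_exchange_rate", "unemployment_rate", "age_dependency_ratio"]

def rank_cluster(centroid):
    # centroid is a dict-like mapping of indicator name to its (standardized) centroid value
    return sum(centroid[c] for c in POSITIVE) - sum(centroid[c] for c in NEGATIVE)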

Based on this metric, the relative ranking of each cluster is shown below,

 

Cluster   Ranking value   Rank   Count
6         10.238          1      2
8         5.191           2      22
1         5.146           3      20
5         3.827           4      20
2         3.825           5      45
4         3.111           6      32
3         3.078           7      4
7         1.799           8      43
 

 

The ranking isn’t perfect, but it reaffirms our view that the world is an unequal place.

What does that mean for investors? I think this means that a distinction should be made between countries at different stages of development. That’s because while most less-developed countries represent investments with the greatest potential for returns, they are also riskier and may take longer to pay off. Ideally, these factors should be weighed against each other and compared with an investor’s appetite for risk and reward.

 

Thank you very much for reading this article, please leave a comment below if you have any questions!