Preface

K-means is one of the simplest clustering algorithms, yet it is widely used, and I have run into it often in my recent work. K-means is generally applied in the early stage of data analysis: after choosing an appropriate K and partitioning the data, the characteristics of the data in each cluster can then be studied separately.

This post records what I have learned about the k-means algorithm, including the algorithm's principle, its convergence, evaluating clustering quality, and choosing K, finishing with a worked code example, as a reference for later.

Algorithm principle

K-means proceeds as follows:

1. Randomly select K center points.

2. Iterate over all data points and assign each one to its nearest center point.

3. Compute the mean of each cluster and use it as the cluster's new center point.

4. Repeat steps 2-3 until the K center points no longer change (convergence), or until enough iterations have been performed.

Time complexity: O(I*n*k*m)

Space complexity: O(n*m)

where m is the number of fields (dimensions) of each element, n is the number of data points, and I is the number of iterations. In general, I, K, and m can be treated as constants, so the time and space complexity both simplify to O(n), i.e., linear.
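The four steps above fit in a few lines of code. Below is a minimal NumPy sketch for illustration (the function name and parameters are my own, not from the post):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means. X: (n, m) array. Returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its cluster
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The I * n * k * m time complexity is visible here: each of the I iterations computes an n-by-k distance matrix over m-dimensional points.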

Algorithm convergence

Viewed through the procedure above, minimizing SSE with k-means is in fact a strict coordinate descent process. Let the objective function SSE be:

SSE(c_1, c_2, …, c_k) = Σ_{i=1}^{k} Σ_{x ∈ C_i} (x − c_i)²

where Euclidean distance is used and c_i is the center of cluster C_i. Optimizing one variable at a time: fix all the other centers, take the partial derivative of SSE with respect to c_i, and set it to 0, which gives

c_i = (1/m) Σ_{x ∈ C_i} x

where m is the number of elements in c_i's cluster.

That is, the mean of the current cluster is the optimal solution (the minimum) along the current coordinate direction, which is exactly what each k-means iteration computes. This guarantees that SSE decreases (or stays the same) with each iteration, so SSE eventually converges.

Because SSE is a non-convex function, k-means cannot guarantee finding a global optimum, only a local one. However, k-means can be run several times with different initializations, and the run with the smallest SSE can be selected as the final clustering result.
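Both properties, monotonically non-increasing SSE within a run and restarts to escape local optima, can be demonstrated with a small NumPy sketch (names and parameters are my own, for illustration only):

```python
import numpy as np

def kmeans_sse_trace(X, k, rng, max_iter=50):
    """One k-means run; records SSE after every assignment step."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    trace = []
    for _ in range(max_iter):
        # Assign points to nearest centers and log the resulting SSE.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        trace.append(float((d.min(axis=1) ** 2).sum()))
        labels = d.argmin(axis=1)
        # Coordinate-descent step: each center moves to its cluster mean.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return trace
```

Running this several times with different random starts and keeping the run with the smallest final SSE implements the restart strategy; within any single run the logged SSE values never increase.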

0-1 normalization

Due to dimensional (unit) differences between variables, raw values are not directly comparable. For example, for game players, online duration is measured in seconds and is usually in the thousands, while active days are measured in days and are usually in the ones or tens. If these two variables are used together to describe how active a player is, the contribution of active days would clearly be drowned out. Therefore, the data should be mapped uniformly into the range 0~1 and converted into dimensionless pure values, so that indicators with different units or magnitudes can be compared and weighted. The specific calculation is as follows:

x' = (x − min_A) / (max_A − min_A)

where x is a value of attribute (field) A, and min_A and max_A are the minimum and maximum values of A.
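Applied per column (one column per field), this min-max scaling can be sketched as (Python/NumPy, for illustration only):

```python
import numpy as np

def minmax_normalize(X):
    """Scale every column of X into [0, 1]: x' = (x - min) / (max - min)."""
    mins = X.min(axis=0)   # per-field minimum (min_A)
    maxs = X.max(axis=0)   # per-field maximum (max_A)
    return (X - mins) / (maxs - mins)
```

For instance, a column of online durations in seconds and a column of active days both end up in [0, 1] and become directly comparable.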

Silhouette coefficient

The silhouette coefficient combines the cohesion and separation of a clustering and is used to assess its quality. Its value lies between -1 and 1; the larger the value, the better the clustering. It is computed as follows:

  1. For the i-th element x_i, calculate the average distance between x_i and all other elements in the same cluster, denoted a_i, which quantifies the cohesion within the cluster.
  2. Select a cluster B that does not contain x_i, calculate the average distance between x_i and all points in B, traverse all other clusters in this way, and take the smallest such average distance, denoted b_i, which quantifies the separation between clusters.
  3. For x_i, the silhouette coefficient is s_i = (b_i − a_i) / max(a_i, b_i).
  4. Calculate the silhouette coefficients of all elements; their average is the overall silhouette coefficient of the current clustering.

From the formula above it is easy to see that if s_i is less than 0 (i.e., b_i < a_i), the average distance from x_i to the elements of its own cluster is greater than its average distance to the nearest other cluster, indicating poor clustering. Conversely, if a_i tends to 0 or b_i is large enough, s_i tends to 1, indicating good clustering.
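The four steps can be written out directly. Below is a naive O(n²) Python sketch that inlines the formula (libraries such as scikit-learn offer an equivalent silhouette_score):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient (naive O(n^2) version). X: (n, m)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        # a_i: average distance to the other members of x_i's own cluster
        a_i = D[i, own].mean()
        # b_i: smallest average distance to the points of any other cluster
        b_i = min(D[i, labels == c].mean()
                  for c in set(labels.tolist()) if c != labels[i])
        scores.append((b_i - a_i) / max(a_i, b_i))
    return float(np.mean(scores))
```

Two tight, well-separated blobs score close to 1, while deliberately mixed labels score much lower, matching the interpretation above.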

K value selection

In practical applications, k-means is generally used for data preprocessing or to assist classification and labeling, so K is usually not set very high. By enumeration, K can be varied from 2 up to a fixed value such as 10; for each K, k-means is run several times (to avoid local optima) and the average silhouette coefficient for that K is computed. Finally, the K with the largest silhouette coefficient is selected as the final number of clusters.
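That enumeration procedure can be sketched end to end as follows (a self-contained NumPy illustration; the helper names and the choice of keeping each K's best restart are my own):

```python
import numpy as np

def _kmeans_labels(X, k, rng, max_iter=100):
    """One k-means run; returns the final cluster labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                                axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def _silhouette(X, labels):
    """Mean silhouette coefficient; singleton clusters are skipped."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        if not own.any():
            continue
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def choose_k(X, k_max=10, restarts=5, seed=0):
    """Enumerate k = 2..k_max; run k-means `restarts` times per k to dodge
    local optima; return the k whose best run has the highest silhouette."""
    rng = np.random.default_rng(seed)
    best_k, best_s = 2, -1.0
    for k in range(2, k_max + 1):
        s = max(_silhouette(X, _kmeans_labels(X, k, rng))
                for _ in range(restarts))
        if s > best_s:
            best_k, best_s = k, s
    return best_k
```

On data with a few well-separated groups, the silhouette peaks at the true group count, which is exactly why it works as a K-selection criterion.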

 
%%%%%%%%%%%Add cell breathing

%The UAV coverage radius is R_m = 100m

%The area is lengthOfArea x lengthOfArea (400m * 400m here)

clear;clc;

lengthOfArea =400;

R_m = 100;

n_users = 50;

users_dist = rand(n_users,2) * lengthOfArea; % The coordinate of every user

min_users_inaUAV = 5;

max_users_in_aUAV = 10;

power_Trans = 40;

R_cov_all = 0;

iteration = 10;

rate = zeros(iteration,1);

SINR_threshold = -5;% The threshold is in dB

Rate2_low = [];

powers_collect = [];

%plot(users_dist(:,1),users_dist(:,2),'r*');

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%===================Kmeans===========

stop=0;

k_index_old = 0;

powers_all_old = 0;

%%%%%%% Kmeans trying

for k_cluster = 3:40

change = 1;

while change == 1

height_all = zeros(k_cluster,1);

[k_index,k_center] = kmeans(users_dist,k_cluster);

if isequal(k_index, k_index_old) % stop when the assignment no longer changes

change = 0;

end

k_index_old = k_index;





%===================================

%%%%%%%%%%%%%%%%%%%

%----------In order to avoid too much overlapped area, I modify Kmeans by

%-----------------selecting the centroid from the middle of farthest users in

%-----------------one cluster

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%=========find the farthest distance in one cluster

max_k_all = [];

k_means_center = [];

rate = [];

users_data=[];

for k_k = 1:k_cluster

max_all = [];

users_in_acluster = zeros(size(users_dist(k_index==k_k,:),1),2);

users_in_acluster = users_dist(k_index==k_k,:);

for user_index = 1:size(users_in_acluster,1)

[max_dista, max_index]= max((users_in_acluster(user_index,1) - users_in_acluster(:,1)).^2+(users_in_acluster(user_index,2) - users_in_acluster(:,2)).^2);

max_all = [max_all;[sqrt(max_dista), user_index, max_index]];

end

[max_k,max_index]= max(max_all(:,1));

max_index_in_acluster = max_all(max_index,2:3);

coor_center = sum(users_in_acluster(max_index_in_acluster,:),1)/2;

radiusOfCenter = norm(users_in_acluster(max_index_in_acluster(1),:)-users_in_acluster(max_index_in_acluster(2),:))/2;

k_means_center = [k_means_center; coor_center radiusOfCenter];%store the locations and radius of each cluster

end

%---------------------------------The locations of clusters have been fixed----------------------

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%------------------The overlapped area---------------------------------

all_overlap_area = 0;

all_area = 0;

for k = 1:size(k_means_center,1)-1

for k_left = k+1:size(k_means_center,1)

radius_two_cluster = sqrt(sum((k_means_center(k,1:2)-k_means_center(k_left,1:2)).^2));

if radius_two_cluster < (k_means_center(k,3)+k_means_center(k_left,3))

r_1 = k_means_center(k,3); r_2 = k_means_center(k_left,3); D = radius_two_cluster;

theta_1 = acos((r_1^2+D^2-r_2^2)/(2*r_1*D));

theta_2 = acos((r_2^2+D^2-r_1^2)/(2*r_2*D));

overlap_area = theta_1*r_1^2 + theta_2*r_2^2-r_1*D*sin(theta_1);

all_overlap_area = all_overlap_area+overlap_area;

end

end

end

%---------------------------------------------------------------------

%r_distance = zeros(length(users_dist),length(k_means_center));

users_sum = 0;

%%%%%%%%%%%%%%%%%%%%%%

%--------Cell breathing to constrain the number of users.

users_overlap = 0;

for k = 1:k_cluster

r_distance = 0;

users_in_cell = users_dist(k_index==k,:);

for k_users = 1: size(users_in_cell,1)

r_distance = norm(users_in_cell(k_users,:)-k_means_center(k,1:2));

if r_distance>k_means_center(k,3)

k_means_center(k,3) = r_distance;

end


end

end

%--------------cell breathing

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%% Count the users lying in all overlapped regions

for k = 1:k_cluster

users_in_cell = users_dist(k_index==k,:);

for k_users = 1: size(users_in_cell,1)

for k_cluster_else = 1:k_cluster

if k_cluster_else~=k && norm(users_in_cell(k_users,:)-k_means_center(k_cluster_else,1:2))<k_means_center(k_cluster_else,3)

users_overlap = users_overlap+1;

end

end

end

end


% if length(users_in_cluster)>= max_users_in_aUAV

% [~,dis_label] = sort(r_distance(:,k));

% k_means_center(k,3) = r_distance(dis_label(max_users_in_aUAV),k);

% end

% users_sum = users_sum + sum(r_distance(:,k)<=k_means_center(k,3));


%--------------This function has been closed.

%%%%%%%%%%%%%%%%%%%%%%



%--------------For Power Allocation

index_users = users_data(:,1);

for cluster_for_p = 1:k_cluster

users_PL_cluster_p = users_data(index_users==cluster_for_p,2:end);

SINR_cluster = [];

for user_in_c_p = 1:size(users_PL_cluster_p,1)

SINR_interf = 0;

SINR_signal = powers_all(cluster_for_p) - users_PL_cluster_p(user_in_c_p,1);

count = 1;

for user_intf = 1:k_cluster

if user_intf~=cluster_for_p

count = count+1;

SINR_interf = SINR_interf + 10.^((powers_all(cluster_for_p)-users_PL_cluster_p(user_in_c_p,count))/10);

end

end

SINR_user = SINR_signal- 10*log10(SINR_interf);

SINR_cluster = [SINR_cluster; SINR_user 10*log10(SINR_interf)];


end

[min_SINR, min_user] = min(SINR_cluster(:,1));

powers_all(cluster_for_p) = SINR_threshold + SINR_cluster(min_user,2) + users_PL_cluster_p(min_user,1);

end

powers_collect = [powers_collect; powers_all'];

SINR_iter = powerAllocation(k_cluster, users_dist,k_index,powers_all,k_means_center,height_all);

end

%--------------End

%%%%%%%%%%%%%%%%

break

end

end

end %While end


if stop == 1

break

end

end

%rate(iter) = rate(iter)/k_cluster;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%---------------Calculate the new SINR

Rate2_low = powerAllocation(k_cluster, users_dist,k_index,powers_collect(size(powers_collect,1)-1,:),k_means_center,height_all);

figure; hold on; plot(rate,'r*');plot(Rate2_low,'ko');

figure; hold on;

for power_iter = 1:size(powers_collect)

plot(powers_collect(power_iter,:),'.','MarkerSize',18)

end

For the complete code or simulation consulting, contact QQ 1575304183.