Preface
K-means is one of the simplest clustering algorithms, yet it is widely used; recently I have often encountered it in my work. K-means is generally applied in the early stage of data analysis: after selecting an appropriate K and partitioning the data, the characteristics of the data in each cluster are studied cluster by cluster.
This article records what I learned about the K-means algorithm, covering the algorithm principle, its convergence, evaluating clustering quality, and choosing K, and finishes with a MATLAB example, as a reminder.
Algorithm principle
The K-means procedure is as follows:
1. Randomly select K center points.
2. Iterate over all data points and assign each one to its nearest center point.
3. Compute the mean of each cluster and use it as the new center point.
4. Repeat steps 2-3 until the K center points no longer change (convergence), or until enough iterations have been performed.
Time complexity: O(I*n*k*m)
Space complexity: O(n*m)
where m is the number of fields of each element, n is the number of data points, k is the number of clusters, and I is the number of iterations. In general, I, k, and m can be treated as constants, so both the time and space complexity simplify to O(n), i.e., linear.
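To make the four steps concrete, here is a minimal K-means sketch in MATLAB. It is only an illustration of the procedure above, not the author's code: the names kmeans_sketch, X, K, and maxIter are made up, pdist2 comes from the Statistics and Machine Learning Toolbox, and in practice the built-in kmeans function should be preferred.

```
function [idx, centers] = kmeans_sketch(X, K, maxIter)
% X: n-by-m data matrix, K: number of clusters, maxIter: iteration cap
n = size(X,1);
centers = X(randperm(n,K),:);                  % step 1: K random points as initial centers
for iter = 1:maxIter
    [~, idx] = min(pdist2(X,centers).^2,[],2); % step 2: assign each point to its nearest center
    newCenters = centers;
    for k = 1:K                                % step 3: recompute each center as its cluster mean
        members = X(idx==k,:);
        if ~isempty(members)
            newCenters(k,:) = mean(members,1);
        end
    end
    if isequal(newCenters, centers)            % step 4: stop when the centers no longer change
        break
    end
    centers = newCenters;
end
end
```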
Algorithm convergence
Viewed through its objective function, K-means is actually a strict coordinate descent process on the SSE. Let the objective function SSE be:

$$SSE(c_1, c_2, \ldots, c_k) = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2$$

where $C_i$ denotes the cluster whose center is $c_i$.
Euclidean distance is used as the distance measure between points. Optimizing one center at a time while holding the others fixed, take the partial derivative with respect to $c_i$ and set it to zero:

$$\frac{\partial SSE}{\partial c_i} = \sum_{x \in C_i} 2(c_i - x) = 0 \;\Rightarrow\; c_i = \frac{1}{m} \sum_{x \in C_i} x$$

where $m$ is the number of elements in $c_i$'s cluster $C_i$.
That is, the mean of the current cluster is the optimal solution (minimum) along the current coordinate direction, which is exactly what each K-means iteration computes. This guarantees that the SSE decreases with every iteration, so the SSE eventually converges.
Because the SSE is a non-convex function, this does not guarantee finding the global optimum, only a local optimum. However, K-means can be run several times and the result with the smallest SSE selected as the final clustering.
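MATLAB's built-in kmeans (Statistics and Machine Learning Toolbox) supports exactly this restart strategy through its 'Replicates' option; a minimal sketch, where the data X and the choice of 5 replicates are illustrative:

```
X = rand(200,2);                                 % illustrative data
[idx, C, sumd] = kmeans(X, 3, 'Replicates', 5);  % runs 5 restarts, keeps the lowest-SSE result
totalSSE = sum(sumd);                            % sumd: within-cluster sums of squared distances
```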
0-1 normalization
Because of dimensional (unit) differences between variables, raw values are not directly comparable. For example, for game users, online duration is measured in seconds and is usually in the thousands, while active days are measured in days and are usually in the ones or tens. If these two variables are used together to represent user activity, the contribution of active days would clearly be drowned out. Therefore, the data should be mapped uniformly into the range 0~1 and converted into dimensionless pure values, so that indicators of different units or magnitudes can be compared and weighted. The specific calculation is:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the attribute that $x$ belongs to.
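A minimal sketch of this min-max scaling in MATLAB, applied column by column; the example matrix X is illustrative:

```
X = [rand(50,1)*5000, randi(30,50,1)];  % illustrative: online seconds vs. active days
Xmin = min(X,[],1);                     % per-column minimum
Xmax = max(X,[],1);                     % per-column maximum
Xnorm = (X - Xmin) ./ (Xmax - Xmin);    % every column now lies in [0,1]
```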
Silhouette coefficient
The silhouette coefficient combines the cohesion and separation of a clustering and is used to assess clustering quality. Its value lies between -1 and 1; the larger the value, the better the clustering. It is calculated as follows:
- For the i-th element x_i, calculate the average distance between x_i and all other elements in the same cluster, denoted a_i, to quantify cohesion within the cluster.
- Select a cluster B that does not contain x_i, calculate the average distance between x_i and all points in B, traverse all other clusters in this way, and take the smallest such average distance, denoted b_i, to quantify separation between clusters.
- For x_i, the silhouette coefficient is s_i = (b_i - a_i) / max(a_i, b_i).
- Calculate the silhouette coefficient of every element; the average is the overall silhouette coefficient of the current clustering.
From the above formula it is easy to see that if s_i is less than 0, then b_i < a_i: the average distance from x_i to the elements of its own cluster is greater than its average distance to the nearest other cluster, indicating poor clustering. If a_i tends to 0, or b_i is large enough, s_i tends to 1, indicating good clustering.
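MATLAB's Statistics and Machine Learning Toolbox already implements this computation as the silhouette function; a minimal sketch, with illustrative data:

```
X = rand(100,2);                        % illustrative data
idx = kmeans(X, 3, 'Replicates', 5);
s = silhouette(X, idx);                 % one coefficient per element, in [-1,1]
overall = mean(s);                      % overall silhouette of this clustering
```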
K value selection
In practical applications, K-means is generally used for data preprocessing or to assist classification and labeling, so K is usually not set very high. By enumeration, K can be varied from 2 up to a fixed value such as 10; for each K, K-means is run several times (to avoid local optima) and the average silhouette coefficient of that K is computed. Finally, the K with the largest average silhouette coefficient is selected as the final number of clusters, as sketched below.
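Putting the pieces together, a minimal sketch of this K selection in MATLAB; the data X and the search range 2 to 10 are illustrative:

```
X = rand(100,2);                              % illustrative data
Ks = 2:10;
avgSil = zeros(size(Ks));
for i = 1:numel(Ks)
    idx = kmeans(X, Ks(i), 'Replicates', 5);  % replicates guard against local optima
    avgSil(i) = mean(silhouette(X, idx));
end
[~, best] = max(avgSil);
bestK = Ks(best);                             % K with the largest average silhouette
```

Below is the complete example: a MATLAB script that clusters UAV-served users with K-means and applies cell breathing.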
```
%%%%%%%%%%% Add cell breathing
%The UAV coverage radius is R_m = 100 m
%The deployment area is lengthOfArea x lengthOfArea (400 m x 400 m below)
clear;clc;
lengthOfArea = 400;
R_m = 100;
n_users = 50;
users_dist = rand(n_users,2) * lengthOfArea; % the coordinates of every user
min_users_inaUAV = 5;
max_users_in_aUAV = 10;
power_Trans = 40;
R_cov_all = 0;
iteration = 10;
rate = zeros(iteration,1);
SINR_threshold = -5; % the SINR threshold in dB
Rate2_low = [];
powers_collect = [];
%plot(users_dist(:,1),users_dist(:,2),'r*');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%===================Kmeans===========
stop=0;
k_index_old = 0;
powers_all_old = 0;
%%%%%%% Kmeans trying
for k_cluster = 3:40
change = 1;
while change == 1
height_all = zeros(k_cluster,1);
[k_index,k_center] = kmeans(users_dist,k_cluster);
if isequal(k_index, k_index_old) % element-wise == would error when the sizes differ
change = 0;
end
k_index_old = k_index;
%===================================
%%%%%%%%%%%%%%%%%%%
%----------In order to avoid too much overlapped area, I modify Kmeans by
%----------placing the centroid at the midpoint of the two farthest users in
%----------one cluster
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%=========find the farthest distance in one cluster
max_k_all = [];
k_means_center = [];
rate = [];
users_data=[];
for k_k = 1:k_cluster
max_all = [];
users_in_acluster = users_dist(k_index==k_k,:); % all users assigned to cluster k_k
for user_index = 1:size(users_in_acluster,1)
[max_dista, max_index]= max((users_in_acluster(user_index,1) - users_in_acluster(:,1)).^2+(users_in_acluster(user_index,2) - users_in_acluster(:,2)).^2);
max_all = [max_all;[sqrt(max_dista), user_index, max_index]];
end
[max_k,max_index]= max(max_all(:,1));
max_index_in_acluster = max_all(max_index,2:3);
coor_center = sum(users_in_acluster(max_index_in_acluster,:),1)/2;
radiusOfCenter = norm(users_in_acluster(max_index_in_acluster(1),:)-users_in_acluster(max_index_in_acluster(2),:))/2;
k_means_center = [k_means_center; coor_center radiusOfCenter];%store the locations and radius of each cluster
end
%---------------------------------The locations of clusters have been fixed----------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------The overlapped area---------------------------------
all_overlap_area = 0;
all_area = 0;
for k = 1:size(k_means_center,1)-1
for k_left = k+1:size(k_means_center,1)
radius_two_cluster = sqrt(sum((k_means_center(k,1:2)-k_means_center(k_left,1:2)).^2));
if radius_two_cluster < (k_means_center(k,3)+k_means_center(k_left,3))
r_1 = k_means_center(k,3); r_2 = k_means_center(k_left,3); D = radius_two_cluster;
theta_1 = acos((r_1^2+D^2-r_2^2)/(2*r_1*D));
theta_2 = acos((r_2^2+D^2-r_1^2)/(2*r_2*D));
overlap_area = theta_1*r_1^2 + theta_2*r_2^2-r_1*D*sin(theta_1); % lens-shaped area of two intersecting circles
all_overlap_area = all_overlap_area+overlap_area;
end
end
end
%---------------------------------------------------------------------
%r_distance = zeros(length(users_dist),length(k_means_center));
users_sum = 0;
%%%%%%%%%%%%%%%%%%%%%%
%--------Cell breathing to constrain the number of users.
users_overlap = 0;
for k = 1:k_cluster
r_distance = 0;
users_in_cell = users_dist(k_index==k,:);
for k_users = 1: size(users_in_cell,1)
r_distance = norm(users_in_cell(k_users,:)-k_means_center(k,1:2));
if r_distance>k_means_center(k,3)
k_means_center(k,3) = r_distance;
end
end
end
%--------------cell breathing
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Count the number of users in all overlapped areas
for k = 1:k_cluster
users_in_cell = users_dist(k_index==k,:);
for k_users = 1: size(users_in_cell,1)
for k_cluster_else = 1:k_cluster
if k_cluster_else~=k && norm(users_in_cell(k_users,:)-k_means_center(k_cluster_else,1:2))<k_means_center(k_cluster_else,3)
users_overlap = users_overlap+1;
end
end
end
end
% if length(users_in_cluster)>= max_users_in_aUAV
% [~,dis_label] = sort(r_distance(:,k));
% k_means_center(k,3) = r_distance(dis_label(max_users_in_aUAV),k);
% end
% users_sum = users_sum + sum(r_distance(:,k)<=k_means_center(k,3));
%--------------This function has been disabled.
%%%%%%%%%%%%%%%%%%%%%%
%--------------For Power Allocation
% powers_all (per-UAV transmit power, dB) and users_data (per user: serving
% cluster index, path loss to the serving UAV, then path losses to the other
% UAVs in ascending cluster order) are consumed below but never built in the
% original snippet; the initialization and free-space-style path loss used
% here are assumptions, not the author's model.
powers_all = power_Trans * ones(k_cluster,1);
users_data = zeros(n_users, 1+k_cluster);
for u = 1:n_users
serving = k_index(u);
dists = max(sqrt(sum((k_means_center(:,1:2)-users_dist(u,:)).^2,2)), 1); % distance to every UAV center, floored at 1 m
PL_row = 20*log10(dists)'; % assumed free-space-style path loss in dB
users_data(u,:) = [serving, PL_row(serving), PL_row([1:serving-1, serving+1:end])];
end
index_users = users_data(:,1);
for cluster_for_p = 1:k_cluster
users_PL_cluster_p = users_data(index_users==cluster_for_p,2:end);
SINR_cluster = [];
for user_in_c_p = 1:size(users_PL_cluster_p,1)
SINR_interf = 0;
SINR_signal = powers_all(cluster_for_p) - users_PL_cluster_p(user_in_c_p,1);
count = 1;
for user_intf = 1:k_cluster
if user_intf~=cluster_for_p
count = count+1;
SINR_interf = SINR_interf + 10.^((powers_all(user_intf)-users_PL_cluster_p(user_in_c_p,count))/10); % interference uses the interfering UAV's power
end
end
SINR_user = SINR_signal- 10*log10(SINR_interf);
SINR_cluster = [SINR_cluster; SINR_user 10*log10(SINR_interf)];
end
[min_SINR, min_user] = min(SINR_cluster(:,1));
% set this UAV's power so that its worst user just meets SINR_threshold
powers_all(cluster_for_p) = SINR_threshold + SINR_cluster(min_user,2) + users_PL_cluster_p(min_user,1);
end
row = nan(1,40); row(1:k_cluster) = powers_all'; % pad to the maximum k_cluster so rows align across iterations
powers_collect = [powers_collect; row];
SINR_iter = powerAllocation(k_cluster, users_dist,k_index,powers_all,k_means_center,height_all); % powerAllocation is an external helper (not shown)
%--------------End
%%%%%%%%%%%%%%%%
end %While end
if stop == 1
break
end
end
%rate(iter) = rate(iter)/k_cluster;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%---------------Calculate the new SINR
Rate2_low = powerAllocation(k_cluster, users_dist,k_index,powers_collect(end-1,1:k_cluster),k_means_center,height_all);
figure; hold on; plot(rate,'r*');plot(Rate2_low,'ko');
figure; hold on;
for power_iter = 1:size(powers_collect,1)
plot(powers_collect(power_iter,:),'.','MarkerSize',18)
end
```