Weakly supervised algorithms work with weaker information about the data points than supervised algorithms. Instead of labeled points, they take as input similarity judgments on tuples of data points, for example pairs of similar and dissimilar points.
General API
The input data
In the following paragraphs, for the sake of generality, we talk about tuples, which, depending on the metric learning algorithm, can be pairs, triplets, quadruplets, etc.
Basic form
The input of every weakly supervised algorithm is tuples of sample points, together with a label for each tuple when needed. These tuples can also be called constraints. A label is a piece of information about the tuple of points (e.g., “these two points are similar”).
The tuples are the first argument of every method (like the argument X in classic scikit-learn estimators), and the second argument is the label of the tuples, whose semantics depend on the algorithm used. For instance, for a pair learner, y is a label indicating whether each pair of samples is similar or dissimilar.
A weakly supervised metric learner can then be fit on these tuples, as in the sketch below.
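Below is a minimal sketch of this basic form, assuming the metric-learn package and using MMC (introduced later in this section) as the example pair learner; the point values and labels are made up for illustration:

```python
import numpy as np
from metric_learn import MMC  # an example pair learner

# Four pairs of 2-feature points: shape (n_tuples=4, tuple_size=2, n_features=2)
pairs = np.array([[[1.2, 3.2], [2.3, 5.5]],
                  [[4.5, 2.3], [2.1, 2.3]],
                  [[3.3, 1.1], [3.0, 1.4]],
                  [[6.1, 0.2], [0.5, 7.7]]])
# One label per tuple: +1 for a similar pair, -1 for a dissimilar pair
y = np.array([1, -1, 1, -1])

mmc = MMC(random_state=42)
mmc.fit(pairs, y)
```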
Just as X is split into a training set and a test set in the classic setting, here the tuples are split into a training set and a test set:
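For instance, continuing the sketch above, scikit-learn's train_test_split can be applied directly to the tuple array and its labels:

```python
from sklearn.model_selection import train_test_split

# Split the pairs and their labels exactly as X and y would be split
pairs_train, pairs_test, y_train, y_test = train_test_split(
    pairs, y, test_size=0.5, random_state=42)
```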
3D array of tuples
The most intuitive way to represent tuples is to provide the algorithm with a 3D array of tuples of shape (n_tuples, tuple_size, n_features), where n_tuples is the number of tuples, tuple_size is the number of elements in a tuple, and n_features is the number of features of each sample point.
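For example, a hypothetical array of two pairs of 2-feature points looks like this:

```python
import numpy as np

# shape = (n_tuples=2, tuple_size=2, n_features=2)
tuples = np.array([[[-0.12, -1.21],    # first point of the first pair
                    [-0.20, +0.05]],   # second point of the first pair
                   [[-2.16, +0.11],    # first point of the second pair
                    [+1.58, +0.16]]])  # second point of the second pair
print(tuples.shape)  # (2, 2, 2)
```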
Note: this approach is not recommended for a large number of tuples, since it is redundant and uses a lot of memory.
2D array of indicators + preprocessor
A more efficient approach is to represent the tuples by the indices of the sample points, which gives a 2D array since the feature dimension is dropped:
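For example, the same two pairs can be encoded as indices into the dataset (a sketch; names and values are illustrative):

```python
import numpy as np

# The original dataset: 4 points with 2 features each
X = np.array([[-0.12, -1.21],
              [-0.20, +0.05],
              [-2.16, +0.11],
              [+1.58, +0.16]])

# Each row is a pair of indices into X: shape (n_tuples=2, tuple_size=2)
pairs_indices = np.array([[0, 1],
                          [2, 3]])
```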
For metric learning algorithms to accept this kind of input, the estimator must be given the sample points X of the original dataset (as a preprocessor), so that it knows which points the indices refer to.
Fit, transform, and so on
The goal of weakly supervised metric learning algorithms is to transform the sample points into a new space in which the tuple-level constraints are satisfied.
The fit call looks the same as shown above; with a preprocessor, tuples of indices are passed instead:
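A sketch of such a fit, reusing X and pairs_indices from the previous snippet and MMC as the example learner:

```python
import numpy as np
from metric_learn import MMC

y_pairs = np.array([1, -1])  # labels of the two index pairs defined above

# Passing X as the preprocessor lets the estimator resolve the indices
mmc = MMC(preprocessor=X, random_state=42)
mmc.fit(pairs_indices, y_pairs)
```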
Now that the estimator is fitted, it can be used on new data.
First, transform can be used to map data into the learned space. The following example maps two sample points into the new embedding space:
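For instance, continuing with the mmc fitted in the basic-form sketch above:

```python
import numpy as np

# Two new 2-feature points, mapped into the learned embedding space
X_new = np.array([[9.4, 7.2],
                  [2.1, 4.4]])
X_embedded = mmc.transform(X_new)
```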
The metric learner has learned a distance between sample points, which can be obtained in two ways (see the combined example after this list):
- using score_pairs, which returns the distances between pairs of sample points;
- or using get_metric, which builds a metric function taking two points and returning their distance.
The get_mahalanobis_matrix method can also be used to retrieve the learned Mahalanobis matrix.
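A sketch of these three accessors on the mmc fitted in the basic-form sketch (the pair values are made up):

```python
import numpy as np

# 1) Distances for a batch of pairs (3D array in, one distance per pair out)
new_pairs = np.array([[[3.5, 3.6], [5.6, 2.4]],
                      [[1.2, 4.2], [2.1, 6.4]]])
distances = mmc.score_pairs(new_pairs)

# 2) A metric function taking two points and returning their distance
metric_fn = mmc.get_metric()
d = metric_fn([3.5, 3.6], [5.6, 2.4])

# 3) The learned Mahalanobis matrix itself
M = mmc.get_mahalanobis_matrix()
```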
Prediction and scoring
After fitting, weakly supervised learners can also predict the label of a given tuple (for pairs) or its ordering (for quadruplets); they also provide scoring methods.
Scikit-learn compatibility
Weakly supervised estimators are compatible with scikit-learn routines such as sklearn.model_selection.cross_val_score and sklearn.model_selection.GridSearchCV.
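For example, cross-validating a pair learner on (pairs, y) works like cross-validating a classifier on (X, y); a sketch reusing the arrays from the basic-form example (real uses would need many more pairs):

```python
from sklearn.model_selection import cross_val_score
from metric_learn import MMC

# Each fold fits on a subset of the pairs and scores on the held-out pairs
scores = cross_val_score(MMC(random_state=42), pairs, y, cv=2)
```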
Sample pair learning
Some metric learning algorithms learn from pairs of samples. In this case, n_samples pairs and their corresponding labels should be provided to the algorithm, where a label of 1 or -1 indicates whether a given pair is similar or dissimilar.
Fitting
Here is an example of fitting a pair learner:
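A minimal sketch with one similar and one dissimilar pair, using MMC (the values are made up):

```python
import numpy as np
from metric_learn import MMC

pairs = np.array([[[1.2, 3.2], [2.3, 5.5]],   # a pair labeled "similar"
                  [[4.5, 2.3], [2.1, 2.3]]])  # a pair labeled "dissimilar"
y_pairs = np.array([1, -1])

mmc = MMC(random_state=42)
mmc.fit(pairs, y_pairs)
```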
The metric learned here is such that, in the transformed space, the first two points (the similar pair) are closer together and the last two points (the dissimilar pair) are farther apart.
Prediction
Once the learner has been fitted on pairs, it can predict the labels of new pairs:
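Continuing the example above, prediction on new pairs might look like this (a sketch; the actual labels depend on the learned metric and threshold):

```python
import numpy as np

new_pairs = np.array([[[3.5, 3.6], [5.6, 2.4]],
                      [[1.2, 4.2], [2.1, 6.4]]])
predictions = mmc.predict(new_pairs)  # array of +1 / -1 labels
```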
Prediction threshold
To predict whether a new pair of samples is similar or dissimilar, a threshold on the learned distance must be set: in the learned space, pairs at a distance smaller than this threshold are predicted to be similar, while pairs at a distance greater than the threshold are predicted to be dissimilar. There are several ways to set this threshold:
- Calibrate at fit time: use calibrate_threshold to set the threshold on the training set, or specify the calibration parameters directly at fit time through the threshold_params argument of fit. Note that calibrating on the training set may lead to some overfitting; to avoid this, calibration can also be done after fitting, on a separate set.
- Calibrate on a validation set: call calibrate_threshold on a validation set to tune the threshold for a chosen classical classification score (accuracy, F1 score, ...).
- Set the threshold manually: use set_threshold to set the threshold to a particular value (see the sketch after this list).
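A sketch of the validation-set and manual options; pairs_valid and y_valid stand for a held-out set of labeled pairs:

```python
# Calibrate the threshold on a held-out validation set
mmc.calibrate_threshold(pairs_valid, y_valid)

# ...or set the threshold manually to a chosen value
mmc.set_threshold(0.5)
```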
Scoring
Pair metric learners also provide a decision_function for a set of pairs: the score corresponds to the opposite of the distance in the new space (a high score means the points are similar, a low score means they are dissimilar).
This allows the use of generic binary classification scoring functions, such as sklearn.metrics.accuracy_score, for example inside cross-validation routines:
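For example (a sketch, assuming pairs_test and y_test come from a train/test split as shown earlier):

```python
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, mmc.predict(pairs_test))
```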
Pair learners also have a default scoring method, which essentially returns sklearn.metrics.roc_auc_score.
Algorithms
ITML
Information Theoretic Metric Learning (ITML)
ITML minimizes the (differential) relative entropy, also known as the Kullback-Leibler divergence, between two multivariate Gaussians subject to constraints on the associated Mahalanobis distance, which can be formulated as a Bregman optimization problem by minimizing the LogDet divergence subject to linear constraints. The algorithm can handle a wide variety of constraints and can optionally incorporate a prior on the distance function. Unlike some other methods, ITML does not rely on eigenvalue computations or semidefinite programming.
Given a Mahalanobis distance parameterized by $M$, its corresponding multivariate Gaussian distribution can be written as

$$p(x; M) = \frac{1}{Z}\exp\left(-\tfrac{1}{2} d_M(x, \mu)\right) = \frac{1}{Z}\exp\left(-\tfrac{1}{2}(x-\mu)^{\top} M (x-\mu)\right),$$

where $Z$ is the normalization constant and the inverse matrix $M^{-1}$ is the covariance of the Gaussian.
Given a set $S$ of similar pairs and a set $D$ of dissimilar pairs, the distance metric learning problem is to minimize the LogDet divergence, which is equivalent to minimizing $\mathrm{KL}\left(p(x; M_0)\,\|\,p(x; M)\right)$:

$$\min_{M}\; D_{\ell d}(M, M_0) \quad \text{subject to} \quad d_M(x_i, x_j) \le u \;\; \forall (x_i, x_j) \in S, \qquad d_M(x_i, x_j) \ge l \;\; \forall (x_i, x_j) \in D,$$

where $u$ and $l$ are respectively the upper and lower bounds on the distances of similar and dissimilar pairs, $M_0$ is the prior distance metric, and $D_{\ell d}(\cdot, \cdot)$ is the log determinant (LogDet) divergence.
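A minimal usage sketch of ITML on toy pairs (values made up, metric-learn assumed):

```python
import numpy as np
from metric_learn import ITML

pairs = np.array([[[1.2, 7.5], [1.3, 1.5]],
                  [[6.4, 2.6], [6.2, 9.7]],
                  [[1.3, 4.5], [3.2, 4.6]],
                  [[6.2, 5.5], [5.4, 5.4]]])
y = np.array([1, 1, -1, -1])  # +1: similar pair, -1: dissimilar pair

itml = ITML()
itml.fit(pairs, y)
```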
SDML
Sparse High-Dimensional Metric Learning (SDML)
SDML is an efficient sparse metric learning method for high-dimensional spaces that uses double regularization: an $L_1$-penalization on the off-diagonal elements of the Mahalanobis matrix $M$, and a log-determinant divergence between $M$ and $M_0$ (with $M_0$ set to either the identity $I$ or $\Omega^{-1}$, where $\Omega$ is the covariance matrix).
The resulting optimization problem over the positive semidefinite matrix $M$ is convex; its objective combines the log-determinant divergence term with the off-diagonal $L_1$ penalty, where $X$ denotes the training dataset and $\|\cdot\|_{1,\mathrm{off}}$ the off-diagonal $L_1$ norm.
RCA
Relative Components Analysis (RCA)
RCA learns a full-rank Mahalanobis distance metric based on a weighted sum of within-chunklet covariance matrices. It applies a global linear transformation that assigns large weights to relevant dimensions and low weights to irrelevant ones. These relevant dimensions are estimated using chunklets, i.e. subsets of points known to belong to the same class.
For a training set of $n$ points partitioned into $k$ chunklets, the algorithm computes

$$C = \frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ji} - \hat{m}_j\right)\left(x_{ji} - \hat{m}_j\right)^{\top},$$

where chunklet $j$ consists of the points $\{x_{ji}\}_{i=1}^{n_j}$ with mean $\hat{m}_j$. The inverse $C^{-1}$ is used as the Mahalanobis matrix.
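Unlike the pair learners above, RCA is fitted on the points themselves plus their chunklet assignments; a minimal sketch (values made up; in metric-learn a chunk label of -1 would mark points that belong to no chunklet):

```python
import numpy as np
from metric_learn import RCA

X = np.array([[-0.05,  3.00], [ 0.05, -3.00],
              [ 0.10, -3.55], [-0.10,  3.55],
              [-0.95, -0.05], [ 0.95,  0.05],
              [ 0.40,  0.05], [-0.40, -0.05]])
# Chunklet index of each point: points 0-1 form chunklet 0, points 2-3 chunklet 1, ...
chunks = np.array([0, 0, 1, 1, 2, 2, 3, 3])

rca = RCA()
rca.fit(X, chunks)
```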
MMC
Metric Learning with Application for Clustering with Side Information (MMC)
MMC minimizes the sum of squared distances between similar points, while constraining the sum of distances between dissimilar points to be greater than one. This leads to a convex problem that can be solved efficiently, without local minima. However, the algorithm involves eigenvalue computations, which is its main speed bottleneck. Because it was originally designed for clustering applications, one of MMC's implicit assumptions is that each class forms a compact set, i.e. follows a unimodal distribution, which restricts the possible use cases of the method; it nevertheless remains one of the earliest and most frequently cited metric learning techniques.
The algorithm aims to minimize the sum of squared distances between all similar points, while constraining the sum of distances between dissimilar points:

$$\min_{M}\;\sum_{(x_i, x_j)\in S} d_M^2(x_i, x_j) \quad \text{subject to} \quad \sum_{(x_i, x_j)\in D} d_M(x_i, x_j) \ge 1.$$
Triplet learning
Fitting
Here is an example of fitting a triplet learner:
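A minimal sketch using SCML (described below); the values are made up and each triplet is ordered so that the first point should be closer to the second than to the third:

```python
import numpy as np
from metric_learn import SCML

triplets = np.array([[[1.2, 3.2], [2.3, 5.5], [2.1, 0.6]],
                     [[4.5, 2.3], [2.1, 2.3], [7.3, 3.4]]])

scml = SCML(random_state=42)
scml.fit(triplets)  # no y: the ordering inside each triplet is the supervision
```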
Prediction
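Once fitted, the triplet learner can predict whether new triplets are in the right order; a sketch continuing the example above (+1 means the first point is predicted closer to the second point than to the third):

```python
import numpy as np

triplets_test = np.array([[[5.6, 5.3], [2.2, 2.1], [1.2, 3.4]],
                          [[6.0, 4.2], [4.3, 1.2], [0.1, 7.8]]])
predictions = scml.predict(triplets_test)  # array of +1 / -1
```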
Scoring
The triplet metric learner also provides a decision_function for a set of triplets, which corresponds to the distance between the first and last points minus the distance between the first two points of the triplet (the higher the value, the more similar the first point is to the second point compared to the last). The score can be interpreted as a measure of how likely the triplet is to be predicted +1:
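For instance, continuing with triplets_test from above (a sketch; the sign of each score determines the predicted order):

```python
scores = scml.decision_function(triplets_test)
# scores[i] = d(first, last) - d(first, second) for the i-th triplet,
# so a positive value means the triplet is predicted to be in the right order
```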
In the example above, if the score for the first triplet of triplets_test is negative, the first point is predicted to be less similar to the second point than to the last one (it is farther from the second point in the transformed space).
Unlike pair learners, triplet learners are not given a y when fitting: it is assumed that the ordering of points within each triplet makes all training triplets positive. Therefore, it is not possible to use scikit-learn scoring functions (such as f1_score) with triplet learners.
However, triplet learners do have a default scoring function, which basically returns the accuracy on a given test set, i.e. the percentage of triplets whose order is predicted correctly.
Algorithms
SCML
Sparse Compositional Metric Learning (SCML)
SCML learns a squared Mahalanobis distance from triplet constraints by optimizing sparse positive weights assigned to a set of $K$ rank-one PSD bases. This can be formulated as an optimization problem with only $K$ parameters, which can be solved with an efficient stochastic composite scheme.
The Mahalanobis matrix $M$ is built from a basis set $B = \{b_i\}_{i=1,\ldots,K}$ weighted by a $K$-dimensional vector $w = \{w_i\}_{i=1,\ldots,K}$:

$$M = \sum_{i=1}^{K} w_i\, b_i b_i^{\top} = B\,\mathrm{diag}(w)\,B^{\top}, \qquad w_i \ge 0.$$
The optimization problem over $w$ can be formulated with a classic margin-based hinge loss function over the set $C$ of triplet constraints. An $L_1$ regularization is added to induce a sparse combination of the bases. It is formalized as follows: