Multimodal Learning with Incomplete Modalities by Knowledge Distillation
Paper link
Introduction
A modality is defined as a collection of heterogeneous features collected from different domains or extracted by different feature extractors. Due to the rapid growth of heterogeneous data, multimodal learning has gained considerable attention in recent years.
The feature sets of different modalities describe the same subject and provide shared as well as complementary information about it. Multimodal learning improves model performance by integrating the predictive information of the different modalities. Because modalities come from different domains or feature extractors, their representations can vary greatly from one another.
This paper presents a new multimodal learning framework, based on knowledge distillation, that integrates the complementary information of multiple modalities and can use all samples, including those with incomplete modalities. The main steps of the method are as follows:
- A model is first trained separately for each modality, using all available data for that modality.
- The trained models are then used as teachers to train a student model, a multimodal learning model that aggregates the complementary information of the multiple modalities.
Both the soft labels produced by the teacher models and the ground-truth one-hot labels are used to train the student model. A characteristic of this approach is that samples with incomplete modalities are neither discarded nor imputed; instead, they are used to train the teacher models, ensuring that each teacher is an expert for its modality.
Method
Introduction to Knowledge Distillation
Knowledge distillation is used to transfer the teacher's "dark knowledge" to the student. To enable this transfer, the teacher is first trained on a dataset. The trained teacher model is denoted $Te(\phi)$, where $\phi$ are the parameters of the teacher model. The student model is then trained on the training dataset, with the aim of imitating the teacher's outputs.
Given a dataset $D=\{\{X_1,y_1\},\{X_2,y_2\},\dots,\{X_N,y_N\}\}$ used to train the student, the teacher first labels these data with logits. If there are $C$ classes in total, the set of annotations is obtained as $z_i = Te(X_i; \phi)$, where $z_i \in \mathbb{R}^{C \times 1}$ are the logits assigned by the teacher model to sample $X_i$. The student model is then trained on both the ground-truth one-hot labels $\{y_1, y_2, \dots, y_N\}$ and the logits $\{z_1, z_2, \dots, z_N\}$.
Suppose the student model is a deep neural network $f(\theta)$ with parameters $\theta$, whose input is $X_i$ and whose output is a $C \times 1$ logit vector. A softmax function is then applied to the logit vector to output the probability of $X_i$ belonging to each of the $C$ classes.
Note: logits are the raw outputs of a neural network layer; they are usually normalized by a softmax layer into a probability distribution for multi-class classification.
The loss function for training the student network combines a classification term and a distillation term.
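A standard form of this objective, consistent with the definitions that follow (the single weighting coefficient $\lambda$ is an assumption; the paper's exact weighting may differ), is:

$$L(\theta) = \sum_{i=1}^{N} \Big[\, l_c(X_i, y_i; \theta) + \lambda\, l_d(X_i, z_i; \theta) \,\Big]$$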
Here $l_c$ denotes the classification loss, $l_c(X_i, y_i; \theta) = H\big(\sigma(f(X_i; \theta)),\, y_i\big)$, where $H$ is the cross-entropy loss and $\sigma(x): \mathbb{R}^C \rightarrow \mathbb{R}^C$ is the softmax function.
$l_d(X_i, z_i; \theta)$ is the distillation loss. Common choices for the distillation loss include cross-entropy and KL divergence; this paper uses the KL divergence as the distillation loss.
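A common form of this KL-based distillation term, using the temperature-scaled softmax defined next (the direction of the KL divergence is an assumption based on standard distillation practice, and a $T^2$ scaling factor is sometimes added), is:

$$l_d(X_i, z_i; \theta) = \mathrm{KL}\Big(\sigma(z_i; T)\ \Big\|\ \sigma\big(f(X_i; \theta); T\big)\Big)$$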
Here $\sigma(x; T)$ is the softmax function with temperature coefficient $T$ (the temperature parameter commonly used in deep learning), $\sigma(x; T)_j = \exp(x_j / T) \big/ \sum_{k=1}^{C} \exp(x_k / T)$.
The temperature $T$ rescales and smooths the output probabilities: the larger $T$ is, the smoother the probability distribution becomes. $\sigma(z_i; T)$ is called the soft label, obtained by passing sample $X_i$ through the teacher model. A soft label contains more information than a one-hot label.
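As an illustration, here is a minimal PyTorch sketch of this distillation objective (the temperature value, the weight `lam`, and the $T^2$ scaling are assumptions based on common knowledge-distillation practice, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, lam=0.5):
    """Classification loss + KL-divergence distillation loss with temperature T."""
    # Hard-label classification loss l_c.
    ce = F.cross_entropy(student_logits, targets)
    # Soft labels from the teacher and temperature-scaled student log-probabilities.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # KL(teacher || student); T**2 keeps gradient magnitudes comparable (common practice).
    kld = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
    return ce + lam * kld

# Toy usage with random logits for a batch of 8 samples and C = 5 classes.
student_logits = torch.randn(8, 5, requires_grad=True)
teacher_logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
print(kd_loss(student_logits, teacher_logits, targets))
```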
Multimodal learning with missing modalities
The two-modality case is presented first as an example and then extended to several modalities.
Given two labeled modalities $\{X^1 \in \mathbb{R}^{n_1 \times d_1}, X^2 \in \mathbb{R}^{n_2 \times d_2}\}$:
- The samples with both modalities complete are $\{X^{1c} \in \mathbb{R}^{n_c \times d_1}, X^{2c} \in \mathbb{R}^{n_c \times d_2}, y^c \in \mathbb{R}^{n_c}\}$.
- The samples with only the first modality are $\{X^{1u} \in \mathbb{R}^{n_{1u} \times d_1}, y^{1u} \in \mathbb{R}^{n_{1u}}\}$.
- The samples with only the second modality are $\{X^{2u} \in \mathbb{R}^{n_{2u} \times d_2}, y^{2u} \in \mathbb{R}^{n_{2u}}\}$.
Accordingly, $n_1 = n_c + n_{1u}$ and $n_2 = n_c + n_{2u}$.
As shown in Figure 1(a), the samples in the blue dotted box have both modalities, while the samples in the yellow dotted box have only one modality.
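For concreteness, a minimal sketch of this partition (the array names and the NaN encoding of a missing modality are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 10, 4, 3
X1 = rng.normal(size=(n, d1))          # modality 1 features
X2 = rng.normal(size=(n, d2))          # modality 2 features
y = rng.integers(0, 2, size=n)

# Mark some samples as missing one modality (encoded here as all-NaN rows).
X2[:3] = np.nan                        # first 3 samples lack modality 2
X1[3:5] = np.nan                       # next 2 samples lack modality 1

has1 = ~np.isnan(X1).any(axis=1)
has2 = ~np.isnan(X2).any(axis=1)

X1c, X2c, yc = X1[has1 & has2], X2[has1 & has2], y[has1 & has2]   # complete samples
X1u, y1u = X1[has1 & ~has2], y[has1 & ~has2]                      # only modality 1
X2u, y2u = X2[~has1 & has2], y[~has1 & has2]                      # only modality 2

# n_1 = n_c + n_1u and n_2 = n_c + n_2u
print(len(X1c), len(X1u), len(X2u))
```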
To utilize all the samples, all available data, including the samples with missing modalities, are first used to train two single-modality models, which serve as the teacher models in the framework designed in this paper. Let the two teacher networks be $g_1(\phi_1)$ and $g_2(\phi_2)$, with parameters $\phi_1$ and $\phi_2$. $g_1(\phi_1)$ takes the samples in $[X^{1c}, X^{1u}]$ as input and outputs logits, and similarly for $g_2(\phi_2)$. Each teacher is trained by minimizing a classification loss over all samples that contain its modality.
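As a sketch consistent with the classification loss defined above (the exact normalization over samples is an assumption), teacher $j \in \{1, 2\}$ minimizes:

$$\min_{\phi_j} \sum_{X_i \in [X^{jc},\, X^{ju}]} H\big(\sigma(g_j(X_i; \phi_j)),\, y_i\big)$$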
The two teachers are then used to label the samples in $\{X^{1c}, X^{2c}\}$. The logits of the $i$-th sample are $z_i^1 = g_1(X_i^{1c}; \phi_1)$ and $z_i^2 = g_2(X_i^{2c}; \phi_2)$, i.e., the logits that each teacher assigns to that sample.
To aggregate the complementary information of the different modalities, a student model is trained as a multimodal DNN (M-DNN). The M-DNN for two modalities consists of two branches, each taking one modality as input and followed by several nonlinear fully connected layers. The outputs of the branches are concatenated to form a joint representation, which is then fed to a linear layer that outputs the logits $z$.
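A minimal PyTorch sketch of such a two-branch student network (the layer sizes and depths are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MDNN(nn.Module):
    """Two-branch multimodal DNN: one branch per modality, concatenated into a joint representation."""
    def __init__(self, d1, d2, hidden=64, num_classes=5):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Linear(d1, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.branch2 = nn.Sequential(nn.Linear(d2, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, num_classes)   # joint representation -> logits

    def forward(self, x1, x2):
        joint = torch.cat([self.branch1(x1), self.branch2(x2)], dim=1)
        return self.head(joint)

# Toy usage: a batch of 8 complete samples with d1 = 4 and d2 = 3 features.
model = MDNN(d1=4, d2=3)
logits = model(torch.randn(8, 4), torch.randn(8, 3))
print(logits.shape)   # torch.Size([8, 5])
```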
The student network is defined as $f(\theta)$, where $\theta$ are its parameters. Its loss function combines the classification loss with one distillation loss per teacher: $l_d(X_i, z_i^1; \theta)$ and $l_d(X_i, z_i^2; \theta)$ are the distillation losses with respect to the two teachers' logits, and $\lambda_1, \lambda_2$ are two tunable parameters that control how much the student model learns from each teacher model; the larger a parameter's value, the more knowledge the student takes from that teacher.
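Putting the terms together, a sketch of the student objective over the $n_c$ complete samples (the exact averaging and the names $\lambda_1, \lambda_2$ are assumptions consistent with the description above) is:

$$\min_{\theta} \sum_{i=1}^{n_c} \Big[\, l_c(X_i, y_i; \theta) + \lambda_1\, l_d(X_i, z_i^1; \theta) + \lambda_2\, l_d(X_i, z_i^2; \theta) \,\Big]$$

where $X_i$ here denotes the pair $(X_i^{1c}, X_i^{2c})$ fed to the two branches of the student.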
Figure 2 gives an overview of the whole framework, and the pseudo-code of the proposed method is shown in Algorithm 1 of the paper.
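The two-stage procedure can also be summarized in the following self-contained sketch on synthetic data (the architectures, hyperparameters, helper names, and missing-data encoding are illustrative assumptions, not the paper's exact algorithm):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic bimodal data; boolean masks mark which modality each sample has.
n, d1, d2, C = 200, 4, 3, 2
X1, X2 = torch.randn(n, d1), torch.randn(n, d2)
y = torch.randint(0, C, (n,))
has1 = torch.ones(n, dtype=torch.bool); has1[120:160] = False  # these samples miss modality 1
has2 = torch.ones(n, dtype=torch.bool); has2[160:] = False     # these samples miss modality 2
complete = has1 & has2

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_out))

def train(model, inputs, targets, loss_fn, epochs=200, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(*inputs), targets).backward()
        opt.step()
    return model

# Stage 1: one teacher per modality, trained on ALL samples that have that modality.
teacher1 = train(mlp(d1, C), (X1[has1],), y[has1], F.cross_entropy)
teacher2 = train(mlp(d2, C), (X2[has2],), y[has2], F.cross_entropy)

# Stage 2: multimodal student trained on the complete samples only,
# using ground-truth labels plus soft labels from both teachers.
class Student(nn.Module):
    def __init__(self):
        super().__init__()
        self.b1, self.b2 = mlp(d1, 16), mlp(d2, 16)
        self.head = nn.Linear(32, C)
    def forward(self, x1, x2):
        return self.head(torch.cat([self.b1(x1), self.b2(x2)], dim=1))

with torch.no_grad():
    z1 = teacher1(X1[complete])  # teacher 1 logits on complete samples
    z2 = teacher2(X2[complete])  # teacher 2 logits on complete samples

def distill(logits, z, T=4.0):
    return F.kl_div(F.log_softmax(logits / T, dim=1),
                    F.softmax(z / T, dim=1), reduction="batchmean") * T ** 2

def student_loss(logits, targets, lam1=0.5, lam2=0.5):
    return (F.cross_entropy(logits, targets)
            + lam1 * distill(logits, z1) + lam2 * distill(logits, z2))

student = train(Student(), (X1[complete], X2[complete]), y[complete], student_loss)
print(student(X1[complete][:2], X2[complete][:2]))
```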
Extending the two-modality method to multiple modalities
Given $m$ labeled modalities $X^1 \in \mathbb{R}^{n_1 \times d_1}, X^2 \in \mathbb{R}^{n_2 \times d_2}, \dots, X^m \in \mathbb{R}^{n_m \times d_m}$, the dataset can be divided into the following parts:
- Samples with all modalities complete: $X^{ic} \in \mathbb{R}^{n_c \times d_i},\ i = \{1, 2, \dots, m\}$.
- Samples with a single modality: $X^{iu} \in \mathbb{R}^{n_{u_i} \times d_i},\ i = \{1, 2, \dots, m\}$.
- Samples with two modalities: $X^{k u_{\{i,j\}}} \in \mathbb{R}^{n_{u_{\{i,j\}}} \times d_k},\ i, j = \{1, 2, \dots, m\},\ k = \{i, j\}$, where $X^{k u_{\{i,j\}}}$ denotes the $k$-th modality of the subset of samples that contain only the $i$-th and $j$-th modalities.
- Continuing in this way, samples with all modalities except the $i$-th: $X^{k u_{\{M \setminus i\}}} \in \mathbb{R}^{n_{u_{\{M \setminus i\}}} \times d_k},\ i = \{1, 2, \dots, m\}$.
Here $\{M\}$ denotes the index set of all $m$ modalities, $\{M\} = \{1, 2, \dots, m\}$, and $\{M \setminus i\}$ is this set without the index $i$; $k$ is an index in $\{M \setminus i\}$. $X^{k u_{\{M \setminus i\}}}$ is the $k$-th modality in the subset of samples that contain exactly the modalities in $\{M \setminus i\}$.
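As a concrete illustration of this notation (an example, not taken from the paper): for $m = 3$ and $\{M\} = \{1, 2, 3\}$, the data splits into the complete samples with modalities $\{1, 2, 3\}$, the single-modality subsets $\{1\}, \{2\}, \{3\}$, and the two-modality subsets $\{1,2\}, \{1,3\}, \{2,3\}$ (which here coincide with the $\{M \setminus i\}$ subsets). For the subset with modalities $\{1, 3\}$, the notation $X^{1 u_{\{1,3\}}}$ and $X^{3 u_{\{1,3\}}}$ refers to its modality-1 and modality-3 features, respectively.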
The teacher models are trained in a hierarchical way:
- A first-level teacher model $Te_i,\ i = \{1, 2, \dots, m\}$, is trained for each single modality.
- These teacher models are then used to teach the two-modality teacher models $Te_{ij},\ i, j = \{1, 2, \dots, m\}$.
- All $Te_{ij}$ are then used to teach the three-modality teacher models, and so on; in this hierarchical manner all teachers are eventually obtained.
A teacher trained with $h$ modalities is defined as an $h$-level teacher. $\{C_h\}$ denotes the set of all combinations of $h$ indexes sampled from the set $\{M\}$; the size of $\{C_h\}$ is $\binom{m}{h}$. The $h$-level teacher models are trained on the modalities indexed by the elements of $\{C_h\}$. The $t$-th teacher among the $h$-level teachers is defined as $Te_{C_{ht}}(\phi_{ht})$, where $\phi_{ht}$ are its network parameters and $C_{ht}$ is the $t$-th element of the set $\{C_h\}$. $Te_{C_{ht}}(\phi_{ht})$ is trained by minimizing a loss that combines the classification loss with distillation losses from the lower-level teachers.
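One plausible form of this objective, consistent with the student loss above (which lower-level teachers enter the inner sum and the $1/|\{C_{h-1}\}|$ weighting are assumptions inferred from the notation explained next), is:

$$\min_{\phi_{ht}} \sum_{i=1}^{N_{C_{ht}}} \Big[\, l_c(X_i, y_i; \phi_{ht}) + \frac{\lambda}{|\{C_{h-1}\}|} \sum_{s} l_d\big(X_i, z_i^{C_{(h-1)s}}; \phi_{ht}\big) \Big]$$

where $z_i^{C_{(h-1)s}}$ are the logits produced by the $s$-th $(h-1)$-level teacher for sample $X_i$.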
Here $|\{C_{h-1}\}|$ is the size of the set $\{C_{h-1}\}$, and $N_{C_{ht}}$ is the number of samples whose modalities are indexed by $C_{ht}$. Once all the teachers are obtained, they can be used to train the final student model.
An implicit problem is that when the number of modalities is large, a large number of teacher models may be required: for $m$ modalities, the maximum number of teacher models is $2^m - 1$, so training all of them carries a high computational cost. To address this problem, the paper proposes to improve the scalability of the framework by pruning the teacher models, selecting only the high-performing teachers to train the second-level teachers and proceeding in the same way for the higher levels.