This article walks through the formula derivation of Attention, mainly in the form of diagrams
Below is a diagram of the Encoder. $s_0$ is equal to $h_m$ in value, but has a different name
First we compute a "correlation" between $s_0$ and every $h_i\ (i=1,\dots,m)$. For example, the correlation between $s_0$ and $h_1$ is $\alpha_1=\text{align}(h_1, s_0)$
After computing the $m$ correlations $\alpha_i$, we take a weighted average of the $h_i$ with these values as weights, i.e. $c_0=\sum_{i=1}^{m}\alpha_i h_i$
To give an intuitive sense of what this does: for an $\alpha_k$ with a large value, a large portion of $c_0$ will ultimately come from $h_k$. In fact $c_0$ takes the $h$ of every time step into account; it simply pays more attention to some moments and less to others, and that is the attention mechanism
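To make these two steps concrete, here is a minimal NumPy sketch of one attention step. The `align` argument is left as a placeholder (a plain dot product in the toy usage) since its actual design is discussed further down; all names and shapes are illustrative assumptions, not the article's code.

```python
import numpy as np

def attention_step(s, H, align):
    """Compute attention weights and the context vector for one decoder state.

    s: decoder state, shape (d,)
    H: encoder hidden states h_1..h_m stacked as rows, shape (m, d)
    align: function scoring the correlation between one h_i and s
    """
    scores = np.array([align(h, s) for h in H])   # m raw correlations
    alpha = np.exp(scores - scores.max())         # softmax -> proper weights
    alpha = alpha / alpha.sum()
    c = alpha @ H                                 # weighted average of the h_i
    return alpha, c

# toy usage with a placeholder dot-product align
m, d = 5, 8
H = np.random.randn(m, d)
s0 = H[-1]                                        # s_0 equals h_m in value
alpha0, c0 = attention_step(s0, H, align=lambda h, s: h @ s)
```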
Taking $s_0, c_0, x'_1$ as the Decoder input at time $t=0$, we compute $s_1$, and then compute the correlations $\alpha_i$ between $s_1$ and all $h_i\ (i=1,\dots,m)$
Similarly, weighting the $h_i$ by the newly computed $\alpha_i$ and averaging yields the new context vector $c_1$
Repeat the above steps until the Decoder finishes
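Putting the repeated steps together, the decode loop might look like the sketch below. It reuses `attention_step` from the previous sketch, and `decoder_cell` is a hypothetical stand-in for whatever RNN cell the Decoder actually uses.

```python
def decode(H, s0, inputs, decoder_cell, align):
    """Run the attention-augmented Decoder over a sequence of input tokens.

    H: encoder states, shape (m, d)
    s0: initial decoder state (equal to h_m in value)
    inputs: decoder input vectors x'_1, x'_2, ...
    decoder_cell: function (x, c, s) -> next state
    """
    s = s0
    states = []
    for x in inputs:
        _, c = attention_step(s, H, align)   # fresh weights and context each step
        s = decoder_cell(x, c, s)            # next state from (x'_{t+1}, c_t, s_t)
        states.append(s)
    return states
```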
This is essentially all of Seq2Seq (with Attention), but a few details remain. For example, how should the align() function be designed? How is $c_i$ applied to the Decoder? These are explained below
How is the align() function designed?
There are two common approaches. The design used in the original paper by Bahdanau et al. is shown in the figure below
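The figure is not reproduced here, but the additive score from Bahdanau et al. (2015) is commonly written as follows, where $\mathbf{v}$ and $\mathbf{W}$ are learned parameters and the raw scores are normalized with a Softmax:

$$
\tilde{\alpha}_i=\mathbf{v}^\top\tanh\!\left(\mathbf{W}\begin{bmatrix}h_i\\ s_0\end{bmatrix}\right),\qquad
[\alpha_1,\dots,\alpha_m]=\text{Softmax}\big([\tilde{\alpha}_1,\dots,\tilde{\alpha}_m]\big)
$$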
The prevailing approach, which is also used in the Transformer architecture, is shown below
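For comparison, this second approach maps $h_i$ and $s_0$ through two learned matrices and scores them with an inner product, which is essentially the dot-product attention used in the Transformer:

$$
k_i=\mathbf{W}_K\,h_i,\qquad q_0=\mathbf{W}_Q\,s_0,\qquad
\tilde{\alpha}_i=k_i^\top q_0,\qquad
[\alpha_1,\dots,\alpha_m]=\text{Softmax}\big([\tilde{\alpha}_1,\dots,\tilde{\alpha}_m]\big)
$$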
How is $c_i$ applied to the Decoder?
Without further ado, see the picture below
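Since the picture is not shown here, one common wiring (consistent with the $s_t, c_t, x'_{t+1}$ input triple described above) is to concatenate the context vector with the token embedding before the RNN update. The cell below is a toy tanh-RNN sketch under that assumption, not the article's exact architecture.

```python
import numpy as np

def simple_decoder_cell(x, c, s, W, U, b):
    """One toy RNN update that consumes the concatenation [x'; c].

    x: decoder input token vector x'_{t+1}
    c: context vector c_t from attention
    s: previous decoder state s_t
    W, U, b: learned parameters (random here, for illustration only)
    """
    xc = np.concatenate([x, c])            # context enters via the input
    return np.tanh(W @ xc + U @ s + b)     # next decoder state s_{t+1}

# illustrative shapes: d-dim states/contexts, e-dim token embeddings
d, e = 8, 6
W = np.random.randn(d, e + d)
U = np.random.randn(d, d)
b = np.zeros(d)
s1 = simple_decoder_cell(np.random.randn(e), np.random.randn(d), np.random.randn(d), W, U, b)
```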