Article source | Turbine Cloud community
Original paper | Two are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders
Original author | Mathor
code
Abstract
For joint entity and relation extraction, many researchers reduce the joint task to a table-filling problem and focus on learning a single encoder that captures the information required by both tasks in the same space (one table is used to extract both entities and relations). This paper proposes a novel table-sequence encoder, in which two different encoders (a table encoder and a sequence encoder) are designed to help each other during representation learning, and shows that two encoders are more advantageous than one. The table structure is retained, and the attention weights of BERT are introduced to help learn the representations of the table elements.
1 Introduction
In several joint extraction methods, the NER and RE problems are transformed into filling a 2D table built from the sentence, where each entry captures the interaction between two individual words: NER becomes a sequence tagging problem whose labels sit on the diagonal entries, while RE corresponds to the labels of the other entries. This approach integrates NER and RE into a single table, enabling potentially useful interactions between the two tasks.
The authors argue that using one table for both problems may suffer from mixed features (a feature extracted for one task may coincide or conflict with a feature needed for the other, making learning chaotic). Moreover, such models do not make full use of the table structure, because the table is still converted into a sequence that is then filled with sequence-tagging methods; as a result, key structural information of the 2D table may be lost during the conversion (e.g., the cells in the lower-left corner of Figure 1 share the same tags).
To address these problems, this paper proposes a new method that overcomes the above limitations: two different structures, a sequence and a table, are used to represent NER and RE separately.
Not only do these two separate representations capture task-specific information, but the authors also design a mechanism for the two subtasks to interact, in order to exploit the inherent connections between the NER and RE tasks.
2 Model
2.1 PROBLEM FORMULATION
The gold entity tags $y^{NER}$ follow the BIO scheme; RE is formulated as a table-filling task.
Formally, given an input sentence $x = [x_i]_{1 \le i \le N}$, we maintain a label table $y^{RE} = [y^{RE}_{i,j}]_{1 \le i,j \le N}$.
Suppose the mention $x_{i_b}, \ldots, x_{i_e}$ and the mention $x_{j_b}, \ldots, x_{j_e}$ have relation $r$; then $y^{RE}_{i,j} = r$ for all $i \in [i_b, i_e] \wedge j \in [j_b, j_e]$, while all other entries are $\bot$, meaning that no relation exists.
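As a concrete illustration of this formulation, here is a minimal Python sketch (not from the paper's code release) that builds the BIO tag sequence and the $N \times N$ relation table for a toy sentence; the function name, the entity spans, and the `NONE` placeholder used for $\bot$ are purely illustrative.

```python
# Minimal sketch: build the gold BIO tags and the N x N relation label table.
def build_labels(n_words, entities, relations, no_rel="NONE"):
    """entities: list of (start, end, type); relations: list of (head_id, tail_id, label)."""
    ner = ["O"] * n_words
    for b, e, etype in entities:                # BIO-tag each entity span
        ner[b] = f"B-{etype}"
        for i in range(b + 1, e + 1):
            ner[i] = f"I-{etype}"
    table = [[no_rel] * n_words for _ in range(n_words)]
    for head, tail, label in relations:         # fill the block of cells covered
        hb, he, _ = entities[head]              # by the two entity spans with r
        tb, te, _ = entities[tail]
        for i in range(hb, he + 1):
            for j in range(tb, te + 1):
                table[i][j] = label
    return ner, table

# "David Perkins lives in California": PER span (0, 1), LOC span (4, 4), relation PHYS
ner_tags, rel_table = build_labels(5, [(0, 1, "PER"), (4, 4, "LOC")], [(0, 1, "PHYS")])
print(ner_tags)         # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
print(rel_table[0][4])  # 'PHYS'
```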
Figure 3 shows the details of the two encoders at each layer and how they interact: in each layer, the table encoder constructs the table representation from the sequence representation, which the sequence encoder then uses to contextualize the sequence representation.
2.2 TEXT EMBEDDER
For a sentence of N words $x = [x_i]_{1 \le i \le N}$, the word embeddings are $x^w \in \mathbb{R}^{N \times d_1}$, the character embeddings computed by an LSTM are $x^c \in \mathbb{R}^{N \times d_2}$, and the contextual word embeddings (from BERT) are $x^{\ell} \in \mathbb{R}^{N \times d_3}$.
Concatenating these and projecting them with a linear layer gives the initial sequence representation $S_0 \in \mathbb{R}^{N \times H}$: $S_0 = \mathrm{Linear}([x^w; x^c; x^{\ell}])$
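A minimal PyTorch sketch of this embedder, assuming the GloVe vectors, character-LSTM outputs, and BERT vectors have already been computed; the module name, the dimensions, and the unbatched shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TextEmbedder(nn.Module):
    """Concatenate word, character, and contextual embeddings and project to S_0."""
    def __init__(self, d1, d2, d3, hidden):
        super().__init__()
        self.proj = nn.Linear(d1 + d2 + d3, hidden)

    def forward(self, x_w, x_c, x_l):
        # x_w: (N, d1), x_c: (N, d2), x_l: (N, d3)  ->  S_0: (N, H)
        return self.proj(torch.cat([x_w, x_c, x_l], dim=-1))

embedder = TextEmbedder(d1=100, d2=30, d3=768, hidden=200)
S0 = embedder(torch.randn(6, 100), torch.randn(6, 30), torch.randn(6, 768))
print(S0.shape)  # torch.Size([6, 200])
```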
2.3 TABLE ENCODER
The table encoder, shown on the left side of Figure 3, is a neural network for learning the table representation: an N-by-N table of vectors, where the vector in the i-th row and j-th column corresponds to the i-th and j-th words of the input sentence. Each cell first concatenates the two corresponding sequence-representation vectors, and a fully connected layer then halves the hidden size, yielding a non-contextualized table.
Formally, for the l-th layer (the layer at which the table representation and the sequence representation interact), we have $X_l \in \mathbb{R}^{N \times N \times H}$: $X_{l,i,j} = \mathrm{ReLU}(\mathrm{Linear}([S_{l-1,i}; S_{l-1,j}]))$
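The cell-wise construction of this non-contextualized table can be sketched with broadcasting as below; shapes and names are assumed for illustration, and the hidden size is halved so that the later concatenation of two directional RNN passes restores it.

```python
import torch
import torch.nn as nn

class TableFiller(nn.Module):
    """Build X_l[i, j] = ReLU(Linear([S[i]; S[j]])) for all word pairs (i, j)."""
    def __init__(self, hidden):
        super().__init__()
        self.proj = nn.Linear(2 * hidden, hidden // 2)  # halve the hidden size

    def forward(self, S):                        # S: (N, H)
        N, H = S.shape
        rows = S.unsqueeze(1).expand(N, N, H)    # cell (i, j) sees S[i] ...
        cols = S.unsqueeze(0).expand(N, N, H)    # ... and S[j]
        return torch.relu(self.proj(torch.cat([rows, cols], dim=-1)))

X_l = TableFiller(hidden=200)(torch.randn(6, 200))
print(X_l.shape)  # torch.Size([6, 6, 100])
```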
Next, a multi-dimensional recurrent neural network with GRU cells is used to contextualize $X_l$, iteratively computing the hidden state of each cell to form the contextualized table representation $T_l$, where $T_{l,i,j} = \mathrm{GRU}(X_{l,i,j},\, T_{l-1,i,j},\, T_{l,i-1,j},\, T_{l,i,j-1})$.
A multi-dimensional GRU exploits context along the layer, row, and column dimensions; that is, each cell considers not only the cells in the adjacent row and column but also the cell at the same position in the layer above. For a sentence of length N, computing the table cell by cell requires $O(N \times N)$ sequential steps; however, cells on the same anti-diagonal (positions $(i, j)$ with the same $i + j$) can be computed simultaneously, so with parallelization the number of sequential steps is reduced to $O(N)$.
As Figure 4 intuitively shows, letting the network access its surroundings in all directions can improve performance, which would require four RNNs accessing the 2D table from four directions. However, the authors found empirically that considering only directions (a) and (c) performs about the same as using all four, so to reduce computation they use only these two directions, and the final table representation is the concatenation of the hidden states of the two RNNs: $T_{l,i,j} = [T^{(a)}_{l,i,j}; T^{(c)}_{l,i,j}]$
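Below is a simplified sketch of one directional pass of such a multi-dimensional GRU: each cell reads its input plus the hidden states of its left neighbour, its upper neighbour, and the same cell in the previous layer. Merging the neighbour states with a linear layer and feeding a standard `GRUCell` is an approximation of the idea (the paper's MD-GRU has dedicated gates per dimension); a second pass from the opposite corner would be concatenated to obtain the two-direction table representation.

```python
import torch
import torch.nn as nn

class SimpleMDGRU(nn.Module):
    """One directional (top-left to bottom-right) pass over the N x N table."""
    def __init__(self, hidden):
        super().__init__()
        self.merge = nn.Linear(3 * hidden, hidden)  # fuse left / up / previous-layer states
        self.cell = nn.GRUCell(hidden, hidden)

    def forward(self, X, T_prev_layer):             # both (N, N, H)
        N, _, H = X.shape
        zero = X.new_zeros(H)
        T = [[None] * N for _ in range(N)]
        for i in range(N):                           # cells on the same anti-diagonal
            for j in range(N):                       # (i + j constant) could run in parallel
                left = T[i][j - 1] if j > 0 else zero
                up = T[i - 1][j] if i > 0 else zero
                ctx = self.merge(torch.cat([left, up, T_prev_layer[i, j]], dim=-1))
                T[i][j] = self.cell(X[i, j].unsqueeze(0), ctx.unsqueeze(0)).squeeze(0)
        return torch.stack([torch.stack(row) for row in T])  # (N, N, H)

T1 = SimpleMDGRU(hidden=100)(torch.randn(6, 6, 100), torch.zeros(6, 6, 100))
print(T1.shape)  # torch.Size([6, 6, 100])
```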
2.4 SEQUENCE ENCODER
The sequence encoder learns a sequence of vectors, where the i-th vector corresponds to the i-th word of the input sentence. Its structure is similar to the Transformer, as shown in the right part of Figure 3, except that table-guided attention is used in place of scaled dot-product attention.
The computation of generalized attention is as follows: for each query, the output is a weighted sum of the values, where the weight assigned to each value is determined by the relevance of the query to every key:
$f(Q_i, K_j) = U \cdot g(Q_i, K_j), \qquad \tilde{Q}_i = \sum_j \mathrm{softmax}_j\big(f(Q_i, K_j)\big)\, V_j$
where $U$ is a learnable vector, $g$ is a function that maps each query-key pair to a vector, and the output of $f$ in Figure 5 is the attention weight for that query-key pair.
In this paper's table-guided attention, the input is the sequence representation of the previous layer, $S_{l-1}$. Table-guided attention is a form of self-attention, and its scoring function $f$ is $f(S_{l-1,i}, S_{l-1,j}) = U \cdot T_{l,i,j}$.
Advantages of table-guided attention:
(1) There is no need to compute the function $g$, because $T_l$ is already produced by the table encoder.
(2) $T_l$ is contextualized along the row, column, and layer dimensions, corresponding to query, key, and value respectively; such contextual information lets the network better capture harder word-word dependencies.
(3) It allows the table encoder to participate in the learning process of the sequence encoder, forming a bidirectional interaction between the two encoders. The rest of the sequence encoder is similar to the Transformer: for the l-th layer, a feed-forward neural network (FFNN) follows self-attention, with residual connections and layer normalization, to obtain the output sequence representation: $\hat{S}_l = \mathrm{LayerNorm}(S_{l-1} + \mathrm{SelfAttn}(S_{l-1})), \quad S_l = \mathrm{LayerNorm}(\hat{S}_l + \mathrm{FFNN}(\hat{S}_l))$
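A single-head PyTorch sketch of such a layer, using the table-guided score $f(i, j) = U \cdot T_{l,i,j}$ and the Transformer-style residual/FFNN wiring described above; multi-head projections, dropout, and the exact dimensions are omitted or assumed.

```python
import torch
import torch.nn as nn

class TableGuidedLayer(nn.Module):
    """One sequence-encoder layer whose attention scores come from the table T_l."""
    def __init__(self, hidden, table_dim, ffn_dim):
        super().__init__()
        self.U = nn.Linear(table_dim, 1, bias=False)   # learnable scoring vector U
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, hidden))

    def forward(self, S, T):                           # S: (N, H), T: (N, N, D)
        scores = self.U(T).squeeze(-1)                 # f(i, j) = U . T[i, j]
        attn = torch.softmax(scores, dim=-1)           # weights over keys j
        S = self.norm1(S + attn @ S)                   # weighted sum of values + residual
        return self.norm2(S + self.ffn(S))             # position-wise FFNN + residual

layer = TableGuidedLayer(hidden=200, table_dim=200, ffn_dim=400)
S_next = layer(torch.randn(6, 200), torch.randn(6, 6, 200))
print(S_next.shape)  # torch.Size([6, 200])
```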
2.5 PRE-TRAINED ATTENTION WEIGHTS
The dashed lines in Figures 2 and 3 carry information from the attention weights of the pre-trained model BERT. The attention weights of all heads in all layers are stacked into $T^{\ell} \in \mathbb{R}^{N \times N \times (L^{\ell} \times A^{\ell})}$, where $L^{\ell}$ is the number of Transformer layers and $A^{\ell}$ is the number of heads per layer. $T^{\ell}$ is used to form the input of the MD-RNN in the table encoder: the formula for $X_{l,i,j}$ in Section 2.3 is replaced by $X_{l,i,j} = \mathrm{ReLU}(\mathrm{Linear}([S_{l-1,i}; S_{l-1,j}; T^{\ell}_{i,j}]))$
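A sketch of how such stacked attention weights could be collected with the Hugging Face `transformers` API; the alignment of word pieces to words and batching are glossed over, and the final tensor layout is an assumption.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("David Perkins lives in California", return_tensors="pt")
with torch.no_grad():
    out = bert(**inputs, output_attentions=True)

# out.attentions: tuple of L tensors, each of shape (batch, heads, N, N)
attn = torch.stack(out.attentions, dim=1)            # (batch, L, A, N, N)
b, L, A, N, _ = attn.shape
T_bert = attn.permute(0, 3, 4, 1, 2).reshape(b, N, N, L * A)
print(T_bert.shape)                                  # (1, N, N, 144) for bert-base
```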
2.6 TRAINING AND EVALUATION
$S_L$ and $T_L$ are used to predict the probability distributions of the entity and relation labels: $P_\theta(Y^{NER}_i) = \mathrm{softmax}(\mathrm{Linear}(S_{L,i})), \quad P_\theta(Y^{RE}_{i,j}) = \mathrm{softmax}(\mathrm{Linear}(T_{L,i,j}))$
$Y^{NER}$ and $Y^{RE}$ are the random variables of the predicted labels, $P_\theta$ is the probability estimation function, and $\theta$ denotes the model parameters.
Cross entropy is used to compute the losses: $\mathcal{L}_{NER} = -\sum_{i} \log P_\theta(Y^{NER}_i = y^{NER}_i), \quad \mathcal{L}_{RE} = -\sum_{i \ne j} \log P_\theta(Y^{RE}_{i,j} = y^{RE}_{i,j})$
$y^{NER}$ and $y^{RE}$ are the gold tags, and the training objective is $\mathcal{L}_{NER} + \mathcal{L}_{RE}$.
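A compact sketch of the output heads and the training objective: linear-plus-softmax heads on the final sequence and table representations, trained with cross entropy against the gold tags (label-set sizes, shapes, and module names are placeholders).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Heads(nn.Module):
    """Predict NER tags from S_L and relation labels from T_L."""
    def __init__(self, hidden, table_dim, n_ner, n_rel):
        super().__init__()
        self.ner_head = nn.Linear(hidden, n_ner)
        self.rel_head = nn.Linear(table_dim, n_rel)

    def forward(self, S_L, T_L):                       # S_L: (N, H), T_L: (N, N, D)
        return self.ner_head(S_L), self.rel_head(T_L)  # logits (softmax is inside the loss)

heads = Heads(hidden=200, table_dim=200, n_ner=9, n_rel=7)
ner_logits, rel_logits = heads(torch.randn(6, 200), torch.randn(6, 6, 200))

gold_ner = torch.randint(0, 9, (6,))                   # y^NER: one tag per word
gold_rel = torch.randint(0, 7, (6, 6))                 # y^RE: one label per table cell
loss = F.cross_entropy(ner_logits, gold_ner) + \
       F.cross_entropy(rel_logits.reshape(-1, 7), gold_rel.reshape(-1))
loss.backward()
```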
During evaluation, the prediction of relations depends on the predicted entities, so entities are predicted first, and the relation probability table $P_\theta(Y^{RE})$ is then searched for valid relations between the predicted entities. Specifically, the entity tag of each word is predicted by selecting the class with the highest probability: $\hat{y}^{NER}_i = \arg\max_{y} P_\theta(Y^{NER}_i = y)$
The whole tag sequence can then be converted into entities with their boundaries and types. Given the spans of two predicted entities, $(i_b, i_e)$ and $(j_b, j_e)$, the relation type is determined from the corresponding table entries $P_\theta(Y^{RE}_{i,j})$ with $i \in [i_b, i_e]$ and $j \in [j_b, j_e]$, where $\bot$ means no relation.
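A sketch of this decoding step: the BIO tag sequence is first converted into spans, and the predicted relation table is then consulted for each ordered pair of spans. Reading the single cell indexed by the two span end-words is a simplification of the span-level lookup; the names and the `NONE` placeholder are illustrative.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, type) entity spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # trailing "O" flushes the last span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i - 1, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def extract_relations(tags, rel_table, no_rel="NONE"):
    """rel_table[i][j]: predicted relation label for the word pair (i, j)."""
    spans = bio_to_spans(tags)
    triples = []
    for ib, ie, _ in spans:
        for jb, je, _ in spans:
            if (ib, ie) == (jb, je):
                continue
            label = rel_table[ie][je]            # look up the cell of the two end words
            if label != no_rel:
                triples.append(((ib, ie), (jb, je), label))
    return triples

tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
table = [["NONE"] * 5 for _ in range(5)]
table[1][4] = "PHYS"
print(extract_relations(tags, table))            # [((0, 1), (4, 4), 'PHYS')]
```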
3 Experiments
3.1 MODEL SETUP
Hyperparameters are tuned on the ACE05 development set, and the same settings are used for the other datasets. GloVe is used to initialize the word embeddings, BERT uses its default parameters, and three encoding layers with independent parameters are stacked (Figure 2; each layer contains GRU units). The table encoder uses two independent MD-RNNs, and the sequence encoder uses 8-head attention. The results are as follows:
4 Takeaways
- The paper is very well written, with strong expressive power
- Parts of its Transformer-style framework can be reused elsewhere
- As usual, it still requires a lot of GPU memory
- The interaction strategy between the table representation and the sequence representation could be adapted to other decoding methods to implement the interaction