Abstract: This article will explain the core of M-SQL: a multi-task representation learning method for converting natural language into SQL statements.
This article is shared from the Huawei Cloud community article "[Cloud resident co-creation] M-SQL, a super multi-task representation learning method, you deserve it", by Qiming.
Overview of the data set
Definition
Text-to-SQL, as the name implies, is the task of generating a SQL statement for a user's question over a given database (or table). Its mathematical description is as follows:
Let X represent the user’s natural language question, D represent the database (or table) associated with the question, and Y represent its corresponding SQL statement.
The SQL statement generation task can be expressed as: for each independent set of (X,D,Y), map (X,D) to the corresponding Y.
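A minimal formalization of this definition (notation mine, following the X, D, Y above):

```latex
% training data: independent (question, database, SQL) triples
\mathcal{T} = \{ (X_i, D_i, Y_i) \}_{i=1}^{N}
% goal: learn a mapping f from question and database to SQL
f : (X, D) \mapsto Y
```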
Take a familiar scenario. Suppose we have a table of student information; we can ask in natural language, "Who are the students older than 18?", and the model needs to return the corresponding SQL statement:
SELECT name FROM student WHERE age > 18
Scenario classification
Text-to-SQL datasets can be classified in several ways. The first is by question:
One is context-free (no correlation between questions): Spider
The other is context-relevant (some correlation between consecutive questions): SParC
"Context-free" means that the questions are independent of one another, whereas "context-relevant" means that there is some referential relationship or connection between consecutive questions.
Again, let’s take a simple example:
Question 1: What is the ID of a doctor with an appointment?
SELECT physician FROM appointment
Question 2: What are their names?
SELECT T2.name FROM appointment AS T1 JOIN physician AS T2 ON T1.Physician = T2.EmployeeID
Here, "their" in Question 2 refers back to the doctor IDs returned by Question 1, which is what makes the two questions context-relevant.
The second way is to classify **by domain**: single-domain or multi-domain.
If all the questions are about, say, aviation, it is a single-domain data set. In a multi-domain data set there may be many domains in the training set and many in the test set, but the training domains and the test domains do not overlap, which requires the model to have a certain generalization ability.
The third way is to classify by database:
Single-table database: WikiSQL, where each question targets a single table (or a single table within the database)
Multi-table database: Spider, which targets a database with many tables and generates SQL statements that may involve joins between multiple tables.
The fourth way is to classify by annotation type:
Final result only: WikiTableQuestions
SQL statements: WikiSQL, Spider
Some datasets do not give the associated SQL statement, only the result. In the earlier example "Who are the students older than 18?", the annotation may be the SQL statement, or it may be just the final answer: the names are listed, but no SQL statement is given. The latter setting involves "weakly supervised learning", which will not be covered in detail here.
TableQA data set
This paper uses the TableQA data set, which is also single-table data: each question targets exactly one table. TableQA and WikiSQL have many similarities, but there are also some differences, as shown in the following figure:
Summary of the paper
After the introduction of the data set, we will give a brief introduction to the model presented in the paper.
First, consider this question: what are the options for generating SQL from natural language? The simplest idea is a sequence-to-sequence model: the input is a natural language sentence, and the output is the corresponding SQL statement, generated token by token.
For example, the encoder input is "Who are the students older than 18?" and the decoder output is "SELECT name FROM student WHERE age > 18".
This approach is simple, but it comes with a problem: SQL is a structured query language, and its structure makes the task different from ordinary language generation.
In a typical language generation task, changing one or two words may not change the semantics much. But in an SQL statement, a single wrong token can make the statement impossible to execute. So we need to take advantage of the syntactic information inside SQL statements, that is, their structural information, and generate according to it. This is what the paper proposes: generating according to an SQL sketch.
M-SQL
Because the TableQA dataset targets a single table only, the FROM clause can be omitted. An SQL statement can then be roughly divided into two parts: the SELECT clause and the WHERE clause.
The SELECT clause has two parts: the name of the selected column and the aggregation operation. For example, if we want the maximum or minimum value of a column, or the sum of a column, we need an aggregation operation.
Looking at the sketch in detail, its slots are:
$WOP: the WHERE connector (AND / OR / Null)
$COLUMN: a column name in the database
$AGG: the operation applied to a selected column (Null, AVG, MAX, MIN, COUNT, SUM)
$OP: a condition operator in the WHERE clause (> / < / == / !=)
$VALUE: a condition value in the WHERE clause
According to statistics of the TableQA data set, at most 2 columns appear in a SELECT clause and at most 3 conditions appear in a WHERE clause, so the sketch is:
SELECT ($AGG $COLUMN)*
WHERE $WOP ($COLUMN $OP $VALUE)*
The M-SQL model
The model can be roughly divided into three pieces from the bottom up.
Encoder: encodes the input. A BERT-wwm-ext model is used: "wwm" means whole-word masking is used during pretraining, and "ext" means the pretraining corpus and the number of training steps were extended.
The input consists of the question and the column names. As shown in the figure above, T1 to TL are the question tokens, followed by the name of each column in the table; for example, the table may have columns for name, student ID, or age. Also, unlike standard BERT input, [CLS] is replaced with [XLS].
Column representation: the representation of each column is enhanced. A column name may consist of multiple tokens; for example, a column called "name" may be two characters, each with its own embedding. How do we combine those token embeddings into a single embedding that represents the column? We can use the preceding [XLS] representation to enhance the column representation as follows:
Attention is computed between the [XLS] representation and all token representations of the column name; the attention-weighted sum of the token representations is added to the original [XLS] embedding, forming the enhanced representation of the column.
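A minimal PyTorch sketch of this enhancement step (the function name, dimensions, and the use of dot-product attention are my assumptions; the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def enhance_column(xls_emb, col_token_embs):
    """Enhance a column representation via attention over its name tokens.

    xls_emb:        (hidden,)        embedding of the [XLS] token before the column
    col_token_embs: (n_tok, hidden)  embeddings of the column-name tokens
    """
    scores = col_token_embs @ xls_emb      # attention scores, (n_tok,)
    weights = F.softmax(scores, dim=0)     # attention distribution over name tokens
    context = weights @ col_token_embs     # attended sum of token embeddings, (hidden,)
    return xls_emb + context               # enhanced column representation

# toy usage with a 3-token column name and hidden size 8
print(enhance_column(torch.randn(8), torch.randn(3, 8)).shape)  # torch.Size([8])
```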
After the above steps, we have a representation for each token in the question and for each column in the table.
Sub-models: 8 sub-models, trained jointly via multi-task learning.
As mentioned earlier, we can split the SQL statement into different parts and generate each part separately, resulting in eight sub-tasks:
- Number of SELECT columns
- Number of WHERE conditions and the connector
- SELECT columns
- SELECT column operations
- WHERE columns
- WHERE operator for each column
- Value extraction
- Value matching
Next, we introduce each of the eight sub-tasks.
S-num: the number of columns in the SELECT clause. [1, 2] (2 classes)
The first sub-task predicts the number of columns in SELECT. For the TableQA data set this can only be one or two, so we can treat it as a binary classification problem: a linear transformation of the [XLS] embedding followed by a sigmoid gives the probability.
W-num-op: the number of connectives and conditions in Where. [null-1, AND-1, OR-1, AND-2, OR-2, AND-3, OR-3]
The second task predicts the connector and the number of conditions in WHERE. The "connector" refers to "AND" or "OR"; the number of conditions is how many conditions (">", "<", "=", and so on) appear. These are combined into seven categories, with the connector before the "-" and the number of conditions after it. The two could of course be predicted as separate tasks, but doing them separately performs noticeably worse than doing them together.
So there are 7 classes in total, and this becomes a 7-way classification problem: a linear transformation of the [XLS] representation followed by a softmax gives the probability distribution over the 7 categories.
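A hedged sketch of these two heads in PyTorch (layer names and the 768-dimensional hidden size are illustrative assumptions):

```python
import torch
import torch.nn as nn

hidden = 768  # assumed BERT hidden size

# S-num: binary choice between 1 and 2 SELECT columns
s_num_head = nn.Linear(hidden, 1)
# W-num-op: 7-way choice over [null-1, AND-1, OR-1, AND-2, OR-2, AND-3, OR-3]
w_num_op_head = nn.Linear(hidden, 7)

xls = torch.randn(1, hidden)  # representation of the leading [XLS] token

p_two_cols = torch.sigmoid(s_num_head(xls))           # P(SELECT has 2 columns)
p_num_op = torch.softmax(w_num_op_head(xls), dim=-1)  # distribution over 7 classes
print(p_two_cols.item(), p_num_op.argmax(dim=-1).item())
```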
The third and fourth sub-tasks predict the columns that appear in the SELECT and WHERE clauses. Since we have already predicted how many columns appear in SELECT and how many conditions appear in WHERE, we can now predict the probability of each column separately.
Task 3: S-col: Select columns
Columns appearing in SELECT: using the previously enhanced representation of each column, a linear transformation followed by a softmax gives the probability of each column appearing.
Task 4: W-col: Columns appearing in the Where condition
For columns in the WHERE condition, the same is done with a different linear transformation to obtain the probability of each column appearing.
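A minimal sketch of the two column-selection heads (normalizing the scores across columns is my assumption, consistent with the softmax described above):

```python
import torch
import torch.nn as nn

hidden, n_cols = 768, 5
cols = torch.randn(n_cols, hidden)  # enhanced column representations

s_col_head = nn.Linear(hidden, 1)   # scores each column for SELECT
w_col_head = nn.Linear(hidden, 1)   # a different head scores columns for WHERE

p_s_col = torch.softmax(s_col_head(cols).squeeze(-1), dim=0)
p_w_col = torch.softmax(w_col_head(cols).squeeze(-1), dim=0)

# keep as many top-scoring columns as S-num / W-num-op predicted
print(p_s_col.topk(2).indices.tolist(), p_w_col.topk(3).indices.tolist())
```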
Task 5: S-col-AGG: operator of columns appearing in Select
[Null, AVG, MAX, MIN, COUNT, SUM]
The fifth task predicts the operators that appear in SELECT, i.e. the aggregation operations: for example, taking the maximum, minimum, average, or sum of all the data in a column.
In TableQA there are 5 operators plus NULL, 6 in total, so this is a 6-way classification problem. Similarly, we apply a linear transformation to the enhanced representation of each column, and a softmax gives the probability distribution over the categories.
Task 6: W-col-op: Conditional operator for columns appearing in the Where condition
[> / < / == / !=]
The same is done for the operators in the WHERE condition: greater than, less than, equal to, or not equal to a value, 4 classes in all, so this is a 4-way classification problem. The method is the same as for the SELECT operators: the enhanced column representation goes through a linear map, a softmax gives the probability distribution over the 4 categories, and the largest one is selected.
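A sketch of these per-column operator heads (a 6-way head for $AGG and a 4-way head for $OP; names and sizes are assumptions):

```python
import torch
import torch.nn as nn

hidden, n_cols = 768, 5
cols = torch.randn(n_cols, hidden)  # enhanced column representations

AGG = ['NULL', 'AVG', 'MAX', 'MIN', 'COUNT', 'SUM']
OP = ['>', '<', '==', '!=']

s_col_agg_head = nn.Linear(hidden, len(AGG))  # 6-way choice per column
w_col_op_head = nn.Linear(hidden, len(OP))    # 4-way choice per column

agg_idx = torch.softmax(s_col_agg_head(cols), dim=-1).argmax(-1).tolist()
op_idx = torch.softmax(w_col_op_head(cols), dim=-1).argmax(-1).tolist()
print([AGG[i] for i in agg_idx], [OP[i] for i in op_idx])
```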
The last two sub-tasks serve condition-value prediction. Take our earlier example, "Who are the students older than 18?". The result should contain the condition age > 18, so how do we get the 18? The authors split the problem into two sub-tasks:
Task 7: Extract phrases that might be values from the question
The tokens in the question are tagged with 0/1 (1 for value, 0 for non-value), and each consecutive run of tokens tagged 1 is treated as one phrase.
The first step is to extract phrases from the question that might be values. For the question "Who are the students older than 18?", this sub-task should extract "18". We tag the tokens of the question with 0 and 1: "18" is tagged 1 and all other tokens are tagged 0. A linear transformation is applied to the representation of each token in the question, and a sigmoid predicts whether its tag is 1 or 0.
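A minimal sketch of this 0/1 tagging head (names and shapes assumed):

```python
import torch
import torch.nn as nn

hidden, seq_len = 768, 16
tok_reprs = torch.randn(seq_len, hidden)  # question token representations

tag_head = nn.Linear(hidden, 1)
p_value = torch.sigmoid(tag_head(tok_reprs)).squeeze(-1)  # P(token belongs to a value)
tags = (p_value > 0.5).long()  # 1 = value token, 0 = non-value token
print(tags.tolist())
```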
Task 8: Match the extracted phrase with the columns that appear in Where
Based on task 7, we need to match the extracted phrases with the columns that appear in where.
In the previous step we tagged "18" with a 1, producing the phrase "18". It is a value that might appear in one of the conditions, but which column it belongs to is what we determine in this step.
We match the extracted phrase against the columns that appear in WHERE. If the phrase consists of multiple tokens, the text representations of all its tokens are averaged. As shown below, the formula computes a similarity between the phrase and a WHERE column, followed by a sigmoid; the u in it is a learnable parameter. The sigmoid decides whether the phrase matches the column: if it does, the phrase is taken as that column's value. For example, "18" matches "age", so we can write age > 18.
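One plausible reading of the matching step, sketched in PyTorch; the exact form of the similarity in the paper may differ, and u here is the learnable parameter mentioned above:

```python
import torch
import torch.nn as nn

hidden = 768
phrase_toks = torch.randn(2, hidden)   # token representations of the phrase, e.g. "18"
col_repr = torch.randn(hidden)         # enhanced representation of a WHERE column

phrase_repr = phrase_toks.mean(dim=0)  # average the phrase's token representations

# u weights the elementwise phrase-column interaction (a diagonal bilinear form);
# this similarity is an assumption, not the paper's exact formula
u = nn.Parameter(torch.randn(hidden))
score = torch.sigmoid((u * phrase_repr * col_repr).sum())
if score.item() > 0.5:
    print('phrase matches this column, use it as the column value')
```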
Execution-Guided Decoding
We have briefly introduced the eight sub-tasks above. With these 8 sub-tasks we can assemble an SQL statement and ensure that it is syntactically valid. However, the generated SQL statement may still fail to execute.
There may also be some constraints inside the SQL statement:
- Aggregation operators such as SUM, MIN, MAX, and AVG can only be applied to columns of numeric type
- If a column of string type appears in the WHERE clause, its operator cannot be '<' or '>'
- The columns appearing in the SELECT and WHERE clauses are distinct from each other (observed by analyzing the data set)
Under these constraints we can adopt execution-guided decoding: during decoding, SQL statements that cannot be executed, such as empty SQL statements or statements that simply fail to run, are discarded in favor of the executable SQL statement with the highest probability.
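A sketch of the idea with sqlite3 (the function and database path are hypothetical; the paper does not publish its decoder code):

```python
import sqlite3

def execution_guided_pick(candidates, db_path):
    """Return the highest-probability candidate SQL that executes successfully.

    candidates: list of (sql_string, probability) pairs from the sub-models.
    """
    conn = sqlite3.connect(db_path)
    best = None
    for sql, prob in sorted(candidates, key=lambda c: -c[1]):
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # un-executable SQL: discard
        if not rows:
            continue  # empty result: discard as well
        best = sql
        break
    conn.close()
    return best
```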
The experimental results
Here are the results.
First, let us briefly introduce the evaluation metrics: LX, X, and MX.
LX is the logical-form accuracy. If the generated SQL statement is exactly the same as the gold SQL statement, the prediction is counted as correct; if there is any difference, such as a wrong ">" or a wrong column, the example is counted as an error.
X is the execution accuracy. If two SQL statements, perhaps with different logical forms, produce the same execution result, the prediction is counted as correct.
MX is the average of LX and X. Two settings are reported: a single model and an ensemble model (marked "Ens" below), where the ensemble integrates the results of multiple training runs to obtain a better final result. We can see from the figure that M-SQL beats the previous models.
This is because the previous models were built for WikiSQL. TableQA is a little different from, and a little harder than, WikiSQL, so the previous models did not perform very well on it.
Performance of subtasks
The following figure compares the performance of 8 different sub-models:
We can see that every sub-model performs well, and the ensemble achieves an even better effect.
Ablation experiments
In the last part of the experiment we did a series of ablation experiments.
From the experimental results we can see that the version using BERT-wwm-ext is better than BERT-base, and [XLS] is better than [CLS]. The lower part of the figure covers the value-extraction and value-matching methods used, which are introduced in more detail below.
Detail processing in reproduction
Next, we’ll cover some of the details of the reproduction process.
The first part is data preprocessing. The data in this set is not very regular, and cases like the following appear (the raw phrase and its normalized form are indicated in parentheses):
- Number: Which city's first-hand housing transactions exceeded 150 thousand in a week? ("150 thousand" → "150000")
- Year: Do you know the land transaction area in year 10? ("10" → "2010")
- Unit: Which cities had more than 50 thousand units of newly listed inventory in the last week? ("50 thousand" → "50000")
- Date: Which company was established on December 28th, 2018? ("December 28th, 2018" → "2018-12-28")
- Synonym: Can you help me calculate the total number of views for Mango? ("Mango" → "Mango TV")
The first few cases can be converted directly according to rules; the synonym case can be handled by looking up the related term in the database. A rough sketch of such rules follows.
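An illustration in Python (the regular expressions are my own illustrative approximations, not the paper's actual preprocessing):

```python
import re

def normalize(question: str) -> str:
    # "万" (ten thousand) units to plain digits, e.g. "15万" -> "150000"
    question = re.sub(r'(\d+)万', lambda m: str(int(m.group(1)) * 10000), question)
    # two-digit years to four digits, e.g. "10年" -> "2010年"
    question = re.sub(r'(?<!\d)(\d{2})年', lambda m: '20' + m.group(1) + '年', question)
    return question

print(normalize('上周一手房成交超过15万的是哪个城市?'))  # "15万" becomes "150000"
```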
Value extraction
For value extraction we tried several methods: BERT + CRF, BERT + BiLSTM + CRF, and BERT + half-pointer. In the end the 0/1 tagging scheme was used, because it worked best:
- BERT + CRF, val_acc: 0.8785
- BERT + BiLSTM + CRF, val_acc: 0.8801
- BERT + half-pointer, val_acc: 0.8891
- BERT + 0/1 tagging, val_acc: 0.8922
How does the 0/1 approach work? Let's take the question "Where is Qingxiu Nancheng Department Store Co., Ltd.?" (青秀南城百货有限公司在哪里?) as an example.
Query: 青秀南城百货有限公司在哪里?
Bert_tokenizer: ['[XLS]', '青', '秀', '南', '城', '百', '货', '有', '限', '公', '司', '在', '哪', '里', '?', '[SEP]']
Value: 青秀南城百货有限公司 (Qingxiu Nancheng Department Store Co., Ltd.)
Tag: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
First, the question is tokenized into a sequence of tokens. Since the value "青秀南城百货有限公司" appears in the SQL statement, its tokens are tagged 1; every token that does not appear in the SQL statement is tagged 0.
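Converting the tags back into candidate phrases is then a matter of grouping consecutive 1s; a small sketch:

```python
def spans_from_tags(tokens, tags):
    """Group each contiguous run of 1-tags into one candidate value phrase."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == 1:
            current.append(token)
        elif current:
            spans.append(''.join(current))
            current = []
    if current:
        spans.append(''.join(current))
    return spans

tokens = ['[XLS]', '青', '秀', '南', '城', '百', '货', '有', '限', '公', '司', '在', '哪', '里', '?', '[SEP]']
tags = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(spans_from_tags(tokens, tags))  # ['青秀南城百货有限公司']
```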
Detail processing
Value retrieval
The extracted value may not be standard, or may not exactly match what is stored in the database. Take "人人" (Renren) in the example below, where the question uses a shorter form of the name stored in the database:
Query1: What is the weekly rise or fall of Renren (人人)?
Extracted value: 人人
In this case we need to search over all the values in the corresponding column and select the one closest to the extracted value as the final value. A number of methods are available for this retrieval, such as ROUGE-L matching, and several machine learning methods: logistic regression, SVR, and naive Bayes. Comparing the results, we found logistic regression to be the best, with 97% accuracy. A sketch of the idea follows.
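A hedged sketch using difflib as a stand-in for the ROUGE-L style matching (the paper's best ranker, logistic regression, is not reproduced here; the candidate values are toy data):

```python
from difflib import SequenceMatcher

def closest_value(extracted: str, column_values: list) -> str:
    """Pick the database value most similar to the extracted phrase."""
    return max(column_values,
               key=lambda v: SequenceMatcher(None, extracted, v).ratio())

# toy data: the database stores a longer name while the question says '人人'
print(closest_value('人人', ['人人网', '腾讯', '阿里巴巴']))  # -> 人人网
```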
Table-content enhanced column information:
The final trick uses the contents of the table to enhance the representation of the columns.
As shown in the figure above, for a column such as "地区" (region) we randomly select one of its cell values, e.g. "广西" (Guangxi); the column is then represented by the string "地区 广西", which is fed to the encoder to obtain the column representation. Enhancing the columns in this way brings a 0.4-point improvement.
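A tiny sketch of how the encoder input for a column might be built with a sampled cell value (the separator and the sampling strategy are assumptions):

```python
import random

def column_input(col_name: str, cell_values: list) -> str:
    """Append a randomly sampled cell value to the column name."""
    return f"[XLS] {col_name} {random.choice(cell_values)}"

print(column_input('地区', ['广西', '广东', '云南']))  # e.g. "[XLS] 地区 广西"
```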
Problems and suggestions in reproduction
1. The data set is not standardized; it is suggested to extract and select some standardized data for training and prediction.
2. Do not reproduce from scratch; instead, build on existing models and refer to existing code.
M-SQL: a multi-task representation learning method for converting natural language into SQL statements
To view the video of this paper interpretation and algorithm link, please click:
Marketplace.huaweicloud.com/markets/aih…
Click to follow and be the first to learn about Huawei Cloud's latest technologies!