Introduction:
The previous post briefly introduced the statistics and characteristics of the WikiSQL dataset and summarized its baseline, the Seq2SQL model. This post introduces two better-known models that followed Seq2SQL: SQLNet and TypeSQL.
SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning
SQLNet overview
SQLNet is one of the better-known baselines that followed the release of the WikiSQL dataset. Because the SQL in WikiSQL is relatively simple (the figure below shows an example from WikiSQL), SQLNet decomposes the task of predicting a full SQL statement into predicting the six parts that make up the statement.
As shown in the following figure, SQLNet divides the SQL statements in WikiSQL into the following parts: the aggregation function after SELECT, the selected column, the column in each WHERE condition, the OP operator, the VALUE, and so on.
SQLNet model
SQLNet uses the idea of slot filling: it introduces a sketch (the template shown above) that specifies every slot that needs to be predicted.
SQLNet has six prediction modules, one per slot type:
- SELECT_COL (which column is selected in the SELECT clause)
- AGG (which aggregation function is applied in the SELECT clause)
- #COND (the number of conditions in the WHERE clause)
- COND_COL (which column a WHERE condition applies to)
- OP (which operator a WHERE condition uses, such as >, <, or =)
- COND_VAL (the value a WHERE condition compares against)
By modeling this way, the Text2SQL task is transformed into a task that fills each slot in the template separately.
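To make the slot-filling view concrete, here is a minimal sketch of how six slot predictions could be assembled into a query. The names `SlotPrediction` and `fill_sketch` are hypothetical helpers for illustration, not part of the SQLNet code, and the default table name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SlotPrediction:
    """Hypothetical container for the slot values of one query."""
    select_col: str                      # SELECT_COL
    agg: str = ""                        # AGG ("" means no aggregation)
    # Each condition is (COND_COL, OP, COND_VAL); len(conds) plays the role of #COND.
    conds: List[Tuple[str, str, str]] = field(default_factory=list)


def fill_sketch(pred: SlotPrediction, table: str = "table") -> str:
    """Fill the SELECT ... WHERE ... AND ... sketch with predicted slot values."""
    select_expr = f"{pred.agg}({pred.select_col})" if pred.agg else pred.select_col
    sql = f"SELECT {select_expr} FROM {table}"
    if pred.conds:
        where = " AND ".join(f"{col} {op} {val}" for col, op, val in pred.conds)
        sql += f" WHERE {where}"
    return sql


example = SlotPrediction(
    select_col="player",
    agg="COUNT",
    conds=[("country", "=", "'Canada'"), ("year", ">", "2000")],
)
print(fill_sketch(example))
```

Each field of `SlotPrediction` would come from a separate prediction module, so the model never has to generate the SQL keywords themselves.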
The sketch also makes the dependencies between slots explicit. For example, the OP and VALUE of a WHERE condition depend heavily on the column predicted for that condition. SQLNet exploits this dependency graph: each slot is predicted using only the information (modules) that the slot actually depends on.
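The dependency idea can be sketched as a small graph over the slots. Only the edges from COND_COL to OP and COND_VAL come from the text above; the other edges in `SLOT_DEPENDENCIES` are assumptions made for illustration, and `prediction_order` and `inputs_for` are hypothetical helpers.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# slot -> set of slots whose predictions it depends on.
# The AGG and COND_COL edges are assumptions for this sketch.
SLOT_DEPENDENCIES = {
    "SELECT_COL": set(),
    "AGG": {"SELECT_COL"},
    "#COND": set(),
    "COND_COL": {"#COND"},
    "OP": {"COND_COL"},
    "COND_VAL": {"COND_COL"},
}


def prediction_order(deps):
    """Return an order in which every slot comes after the slots it depends on."""
    return list(TopologicalSorter(deps).static_order())


def inputs_for(slot, predicted, deps=SLOT_DEPENDENCIES):
    """Keep only the earlier predictions that the given slot depends on."""
    return {dep: predicted[dep] for dep in deps[slot]}


order = prediction_order(SLOT_DEPENDENCIES)
print(order)
print(inputs_for("OP", {"SELECT_COL": "player", "#COND": 1, "COND_COL": "year"}))
```

The point of `inputs_for` is that the OP module never sees the SELECT_COL prediction: unrelated slots do not influence each other.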
Techniques
SQLNet relies on two important techniques: sequence-to-set (seq2set) and column attention. In brief:
- Seq2set: a WHERE clause can contain several conditions, and their order does not affect the result. Instead of generating the conditions as a sequence, the problem is recast as "predict which columns should appear in the WHERE clause".
- Column attention: when predicting for a specific column, different parts of the question matter to different degrees. An attention mechanism captures this correlation: for each column, attention is computed over the question tokens, conditioned on that column.
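The two techniques can be sketched together in a few lines of plain Python. This is a simplified illustration, not the SQLNet implementation: it assumes toy embedding vectors, replaces the learned projection matrices with a plain dot product, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def column_attention(col_emb: Vector, question_states: List[Vector]) -> Vector:
    """Summarize the question for one column: attention-weighted sum of token states."""
    weights = softmax([dot(col_emb, h) for h in question_states])
    dim = len(question_states[0])
    return [sum(w * h[d] for w, h in zip(weights, question_states)) for d in range(dim)]


def where_column_probs(col_embs: List[Vector], question_states: List[Vector]) -> List[float]:
    """Seq2set: score every column independently instead of decoding a sequence."""
    return [sigmoid(dot(col, column_attention(col, question_states))) for col in col_embs]


def pick_where_columns(probs: List[float], num_conds: int) -> List[int]:
    """Return the indices of the num_conds most probable columns, as an unordered set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:num_conds])


question_states = [[1.0, 0.0], [0.0, 1.0]]        # two question tokens
col_embs = [[2.0, 0.0], [0.0, -2.0], [0.0, 3.0]]  # three table columns
probs = where_column_probs(col_embs, question_states)
print(pick_where_columns(probs, num_conds=2))
```

Because each column gets its own probability, the output is a set of columns rather than an ordered sequence, so the model is never penalized for producing the conditions in a different order than the reference query.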
Results
With these techniques, SQLNet achieved a substantial improvement over Seq2SQL on WikiSQL and was SOTA at the time.
TypeSQL: Knowledge-based Type-Aware Neural Text-to-SQL Generation
TypeSQL overview
The previous two models completely ignored the data types of individual columns, yet data types carry important information. For example, when predicting a WHERE clause, only numeric columns, not string columns, support size comparisons such as > and <. With this in mind, TypeSQL makes full use of type information for each word in the question (for example, whether a word is a column name, an integer value, and so on) and achieves a new SOTA.
Acquisition and utilization of type information
TypeSQL obtains type information by splitting the question into n-grams and looking them up in Freebase, in the table's column names, and in the table content.
As in SQLNet, slot filling is used to fill each slot of the sketch, and the model is considerably simplified: the number of BiLSTMs drops from 12 to 6. To better model the rare entities and numbers that occur in the text, TypeSQL explicitly assigns a type to each word.
The type recognition process is as follows. The question is split into n-grams (n from 2 to 6), which are searched against the database tables and columns; an n-gram that matches a column name is tagged as a column. Numbers and dates in the question are assigned one of four types: INTEGER, FLOAT, DATE, and YEAR. For named entities, Freebase is searched to identify five types: PERSON, PLACE, COUNTRY, ORGANIZATION, and SPORT. These five types cover most of the entities that appear. When the database content is accessible, matching entities are further tagged with the specific column name (not just a type).
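The recognition steps above can be sketched roughly as follows. This is a hypothetical simplification rather than the TypeSQL preprocessing code: it assumes schema matches are tagged `COLUMN`, uses a plain dictionary `entity_types` as a stand-in for the Freebase lookup, detects numbers with simple regular expressions (the year range is an arbitrary assumption), and omits DATE detection.

```python
import re
from typing import Dict, List


def tag_question_types(
    tokens: List[str],
    column_names: List[str],
    entity_types: Dict[str, str],  # stand-in for a Freebase lookup (assumption)
    min_n: int = 2,
    max_n: int = 6,
) -> List[str]:
    """Assign one type label per question token; unmatched tokens get "NONE"."""
    types = ["NONE"] * len(tokens)
    columns = {name.lower() for name in column_names}

    # Match longer n-grams first so that the longest match wins.
    for n in range(min(max_n, len(tokens)), min_n - 1, -1):
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            if any(types[i] != "NONE" for i in span):
                continue  # part of this span is already labeled
            phrase = " ".join(tokens[start:start + n]).lower()
            label = "COLUMN" if phrase in columns else entity_types.get(phrase)
            if label:
                for i in span:
                    types[i] = label

    # Label the remaining single tokens that look like numbers.
    for i, tok in enumerate(tokens):
        if types[i] != "NONE":
            continue
        if re.fullmatch(r"(1[0-9]|20)\d{2}", tok):
            types[i] = "YEAR"
        elif re.fullmatch(r"-?\d+", tok):
            types[i] = "INTEGER"
        elif re.fullmatch(r"-?\d+\.\d+", tok):
            types[i] = "FLOAT"
    return types


tokens = "how many points did michael jordan score for home team in 1996".split()
tags = tag_question_types(
    tokens,
    column_names=["Home Team", "Points"],
    entity_types={"michael jordan": "PERSON"},
)
print(list(zip(tokens, tags)))
```

With the default `min_n=2`, the single word "points" is not matched against the column names; passing `min_n=1` would tag it as `COLUMN` as well. The resulting labels are what the model would embed alongside the word embeddings.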
The proposed framework
SQLNet uses a separate model for each of the six components of the template; TypeSQL improves on this. Related components, such as SELECT_COL, COND_COL, and #COND (the number of conditions), depend on one another, and these dependencies are modeled better when the components are merged into a single model. TypeSQL therefore uses three separate models to predict the slot values:
- MODEL_COL: SELECT_COL, #COND, COND_COL
- MODEL_AGG: AGG
- MODEL_OPVAL: OP, COND_VAL
Results
The results show that, by using type information effectively, TypeSQL improves model performance and achieves a new SOTA.
Conclusion
SQLNet and TypeSQL are two important early baselines on WikiSQL. SQLNet exploits the simple structure of the SQL in WikiSQL to turn the Text2SQL task into a set of slot-filling subtasks, which had a great influence on subsequent research. TypeSQL addresses the neglect of type information in earlier work and uses it to achieve a new SOTA. Both suggest a lesson for follow-up work: study the characteristics of the problem carefully, and look for what earlier work overlooked or fell short on as a source of innovation.
This post covered some of the early work on WikiSQL, from before pre-trained models like BERT were widely used. The next post will introduce several approaches that use pre-trained models.