The series of articles on big data are presented in three sections: technical ability, business foundation and analytical thinking. You will gain:
1. Improved my confidence, handled the interview freely, and got the internship position or offer successfully;
2. Master the basic knowledge of big data and communicate with other colleagues barrier-free;
3. Have some practical project ability, and get directly involved in big data work;
Big data engineer series columns: interview questions, development experience, tuning strategies
Knowledge system of Big data Engineers:
What does a Big data engineer do? In the era of big data, there is a key position that has to be mentioned, that is big data engineer. You may also be curious about what big data engineers do on a daily basis.
1. Data collection Find out the data describing users or helpful to business development, define the relevant data format, and hand over to the business development department to collect the corresponding data. 2.ETL project carries out various cleaning, processing, transformation and other operations on the collected data to complete format transformation, which is convenient for subsequent analysis and ensures data quality, so as to obtain reliable results. 3. Build a data warehouse to effectively manage data, build a unified data warehouse, connect data with data, and collide with greater value. 4. Data modeling Based on existing data, the complex relationship between data is sorted out, and an appropriate data model is established to facilitate the analysis of valuable conclusions. 5. Statistical analysis Statistical analysis of data in various dimensions, establish an indicator system, systematically describe the current state of business development, find problems in the business, and find new optimization points and growth points. 6. User portrait Based on the user’s various aspects of data, to establish a comprehensive understanding of the user, to build a portrait of each specific user, in order to complete fine operation for each individual.
Essential skills for big data engineers
So, the question is, if you want to become a big data engineer, competent for the above work, what kind of conditions need to have? What kind of knowledge do you have?
Since this is the basics, the main content of this article is to explain the basics of SQL for data analysis. This paper will explain it from four aspects:
Part 1, some basic concepts of database system. We’ll cover some important database-related concepts to help you understand how SQL is written and, more importantly, to communicate SQL with others.
Part 2, basic syntax for SQL queries. We’ll show you how to use the SELECT clause and do some simple statistical analysis of a single table.
Part 3, database functions, predicates, and CASE expressions. We’ll look at some common functions, predicates for true and false, and CASE expressions for multi-condition tests.
Part 4, associated query and subquery. We will show you how to join multiple tables, obtain more information by crossing tables, and use subqueries to continue analysis on the results of a query.
Due to different database manufacturers, engine implementation is different, SQL syntax, keywords, functions and so on are slightly different, so this paper only take MySQL, which is most widely used in Internet companies, as an example to explain, the SQL and examples involved in this paper are running in MySQL. Now let’s get into the formal knowledge.
1. Some basic concepts of database system
The core goal of this section is to enable you to understand some of the core basic concepts that are often mentioned and used in your daily work so that you can communicate and collaborate with other team members smoothly and happily. Since the focus of this article is ON SQL, we will only cover concepts related to relational databases.
1.1 Database and database management system
When we talk about a database, we have two meanings: one is the warehouse where the data is stored, and the other is the database that holds a set of tables at the abstract level. Of these, the latter is the more common. Database management system, usually refers to the management of a database system, through it can achieve the definition of data, insert, update, delete, query and other operations, it provides a window for data analysts and data interaction.
Table 1.2
A table in a relational database is usually a two-dimensional table composed of rows and columns used to store data. Tables are the direct carriers of data storage, and our data is usually stored in tables. Databases are basically organized by tables. Therefore, tables are also the most direct objects for us to query and retrieve data.
For tables, there are several features:
A. A table is composed of multiple correlated columns and can store N rows of data. Each row of data is called a record
B. The column names in the table must be unique
C. A column in a table can store only one type of data
1.3 Data Types
Different database management systems support slightly different data types. This article takes MySQL as an example to introduce several of the most common data types, as shown below:
Numeric types
The date type
String type
These are just a few of the most commonly used types, so if you come across other types that are not commonly used, you can do your own research online.
1.4 the primary key
A primary key is a combination of one or more columns that identifies a unique record in a table. Therefore, one of its natural properties is that it is not repeatable and cannot be NULL. We usually use an incremented integer value as the primary key, which is maintained by the database management system to ensure uniqueness and ease of use. The primary key of a table is also commonly used as an object referenced by other tables, which is a foreign key, as discussed later.
1.5 the foreign key
A foreign key is usually used to establish an association between two tables. The foreign key of one table is usually the primary key of the other table with which it is associated. In this way, the two tables can be connected to form an intermediate table through the relationship between the foreign key and primary key of the two tables, and the information of the two tables can be fused to generate greater value in associated query.
1.6 the index
If you want to quickly find the part of a book that you’re interested in, you go to the table of contents, and the table of contents helps you quickly locate the page that you want to read. For a table in a database, an index is the table’s directory. The main purpose of its existence is to speed up the query. Of course, there are other uses for indexes as well, and they are designed in a very clever way, which we will cover in more detail in our next advanced SQL article.
1.7 table relationships
In a relational database, there are usually three types of relationships between tables: 1 to 1, 1 to many, and many-to-many. For the sake of description, we assume that we have two tables, table A and table B.
1 to 1 means that after tables A and B are associated by A certain field, one record in table A corresponds to at most one record in table B, and one record in table B corresponds to at most one record in table A. One-to-many: after tables A and B are associated by A field, one record in table A may correspond to multiple records in table B, and one record in table B may correspond to at most one record in table A. Many-to-many means that after tables A and B are associated by A field, one record in table A may correspond to multiple records in table B, and one record in table B may correspond to multiple records in table A. One-to-one and one-to-many relationships can be expressed by referring to the primary key of the corresponding table with a foreign key. The many-to-many relationship is usually expressed by using an intermediate table, which records the correspondence between the primary keys of two tables.
1.8 the view
If an SQL result is frequently used in a daily query, we usually consider using a view to store it, and then the next time we use the view, we will execute the SQL statement corresponding to the view, which is very simple and convenient. So, a view is a virtual table. It is important to note, however, that views store SQL statements, not the results of SQL execution, which are dynamically generated with each execution and may change with each read.
1.9 the collection
A collection means “the sum of (various) things” in mathematics and a collection of records in databases. Specifically, the execution results of tables, views, and queries are collections of records. Is a set, you can carry out set operations, such as union, intersection, difference set, etc. In addition, the result of the query is also a collection, so you can treat the result of the query as a table and continue to analyze it based on that table. This is the theoretical basis of subqueries. In Part 4, we’ll cover subqueries in more detail.
2 Basic syntax of SQL queries
SQL, short for Structural Query Language, is the Language for developers to talk to database management systems. SQL A statement that describes an operation using a combination of keywords, table names, column names, and operators. SQL is an international standard, so its universality is self-evident.
2.1 the keyword
SQL has many keywords, and each keyword has a different meaning and usage. This article only lists the basic keywords and their meanings commonly used in data analysis work related to query analysis, some of which are explained in detail in the following paragraphs, as shown in the table below:
2.2 Writing Rules
SQL writing rules are very simple and flexible, but if you are not careful, it is easy to make mistakes. Common rules in work are as follows:
A. Keywords, table names, and column names are case-insensitive;
B. Write in full English half corners (keywords, Spaces, symbols);
C. SQL statements end with semicolons (;).
D. Use half-corner Spaces or line breaks to separate words and operators in SQL statements.
E. The function name and parentheses are integrated. There must be no Spaces between them.
F. Write numeric constants directly, such as 20;
G. Date and string constants need to be enclosed in single quotation marks, for example, ‘2002-10-01 12:23:21’, ‘Lily’;
H. Comments can be written in three ways: single-line comments (#, –) and multi-line comments (/* */). Single-line comments are recommended with “–“.
2.3 Simple Query
The simplest query statement is “SELECT * FROM A”, where A stands for table name A
SQL means to query all data for all columns from table A. *” is shorthand for all columns in table A. Let’s start with this simple SQL and work our way up to a complex SQL by adding keywords.
The simple queries we are going to cover are all queries for a single table. Queries against single tables, while relatively simple, are the basis for complex queries. Student = student; student = student; student = student;
The meanings of the columns from left to right are student number, name, age, English score, math score and total score.
2.4 the filter
If we want to filter the data in a table to keep only the data that meets our requirements, we use the WHERE keyword. The WHERE keyword is followed by one or more expressions concatenated by the logical operator, each with a final result of TRUE or FALSE, reserving only lines WHERE the expression results in TRUE.
For example, we want to obtain the name and student number of the student whose English score is not qualified, then the corresponding SQL is
SELECT sno, name
FROM student
WHERE eng_score < 60
Copy the code
The running results are as follows:
2.5 the operator
Operators, as their name implies, are symbols used to perform operations. There are three common types of operators, comparison operators, arithmetic operators, and logical operators.
Not equal to the judgment, the vast majority of database management system vendors are also supported! The =” operator “has the same meaning as “<>”.
Arithmetic operators can be written next to field names, such as eng_score-math_score. Therefore, field and table names cannot be named with a hyphen (“-“), otherwise it will be mistaken for subtraction.
When multiple logical operators exist, parentheses are used to delimit the order of execution to avoid ambiguity. Expressions organized in parentheses are more readable. It is recommended that you do not memorize the priority of the logical operator, which is easy to remember and mix, and the SQL written is less readable. It is best to use parentheses to clarify the relationship between multiple logical conditions, which is clear and easy to understand, readable and not prone to error.
With these operators in mind, we can combine them to write complex expressions behind WHERE to satisfy our filtering needs.
2.6 Group Aggregation
Group aggregation means that we can group the data in a table according to one or more columns, and then aggregate the values of other columns, such as counting, summing and averaging. The keyword used is GROUP BY, and we can also use HAVING to filter the results after grouping.
For example, from the Student table, find the average of the number of ages, the total score in English, and the score in math, and filter them out. The corresponding SQL is
SELECT age, COUNT(sno) AS student_num,
SUM(eng_score) AS sum_eng_score,
AVG(math_score) AS avg_math_score
FROM student
GROUP BY age
HAVING avg_math_score >= 60
Copy the code
The result after running is as follows:
It is important to note that the fields or formulas that appear after group by must appear after the corresponding SELECT, and the select cannot be followed by other fields except these fields or formulas. Only aggregate functions can be used.
2.7 to heavy
The DISTINCT keyword is used to de-duplicate one or more columns, returning results that exclude duplicate rows. DISTINCT When multiple columns are deleted, duplicate rows are deleted only when all columns are the same. DISTINCT does not filter out NULL values, but only one NULL value is retained in the result after deduplication.
For example, from the student table, find several ages of students. The corresponding SQL is
SELECT DISTINCT age FROM student
Copy the code
The result after running is as follows:
2.8 the sorting
In everyday life scenarios, we are often interested in various rankings, focusing on the top of the list. In a database, the ORDER BY clause is used to find the ranking. ORDER BY is usually used in conjunction with ASC and DESC. You can ORDER BY one or more columns in ascending or descending ORDER, and then use LIMIT to retrieve the first N rows that satisfy the condition.
For example, from the student table, find the names, ages, and grades of the top 3 students in math. The SQL is as follows:
SELECT name, age, math_score
FROM student
ORDER BY math_score DESC
LIMIT 3
Copy the code
The results after running are as follows:
2.9 Adding constant columns
To add a constant column, we add a fixed constant value as a column to our result data. This is usually done when all rows in the result set have the same value on an attribute, and you can augment that column by adding a constant column. We use the following example to demonstrate the syntax.
For example, select student from student where English score > 80 and assign them to class A. The SQL is as follows:
SELECT sno, name, 'A' AS class FROM student WHERE eng_score > 80
Copy the code
The results after running are as follows:
AS you can see from the example, it is very convenient to add constant columns directly by “constant AS new column name”.
Database functions, predicates, and CASE expressions
SQL has a powerful analytical expression, one of the important reasons, is that it has rich functions, through the combination of these functions can realize the complex processing of data, and finally get the data we want. On the other hand, SQL also has a wealth of predicates to judge data and match it to our needs. A CASE expression is a multi-condition judgment expression that can return different values based on different conditions, similar to IF ELSE in programming languages.
3.1 Aggregate Function
Aggregation function, also known as analysis function, is a group of values through aggregation analysis to obtain a value, hence the name aggregation function. The five most frequently used aggregation functions are shown in the following table
A common feature of aggregate functions is that NULL values are ignored in the calculation, because aggregation of NULL is meaningless. The COUNT, SUM, and AVG functions can also be used with DISTINCT, which means that a target column is de-weighted first and then the result is aggregated. SUM and AVG can only be applied to one column, and the data type of the column is numeric. MIN and MAX can also be applied to only one column, but string and date types are supported as well as numeric ones. COUNT can be applied to one or more columns, and there is no restriction on the type of column.
3.2 Arithmetic Function
Arithmetic functions used to perform various mathematical operations on numeric types. SQL provides a series of arithmetic functions in addition to the four operators (+-*/), as shown in the following table:
Here is a list of some commonly used functions that can be combined to perform complex calculations. If the above table does not meet your analysis needs, you can Google or check the official documentation to find a matching arithmetic function.
3.3 Date function
In daily analysis work, it is often necessary to add and subtract dates, format dates and other processes, which cannot be separated from powerful date processing functions. The commonly used date functions are as follows:
Date function Most functions used to get the current date and time are empty parameter functions. Although the parameter is empty, the parentheses after the function name cannot be omitted. Although database vendors also provide some of the same attribute values as the function name, it can be called without parentheses, but I suggest that it is better to use the function with parentheses, which is more recognizable and more readable.
3.4 String Functions
String is an important carrier of information, which contains a large amount of important information, so the processing of string is very important, and the corresponding string processing functions are also very rich, we list some of the most commonly used functions below:
3.5 Conversion Function
When the type of some data is not the type we want, we can use the type conversion function to convert it to the type we want. There are two commonly used type conversion functions, CAST and CONVERT, which do the same thing with slightly different syntax. The CAST function is used AS CAST(field AS data type) and CONVERT(field, data type).
However, not all types are interchangeable, and some conversions result in a loss of accuracy, so use them with caution.
3.6 Other Functions
(1) There are also some functions that are used for specific purposes. This paper also lists several commonly used functions in data analysis work.
(2) MD5 function, its function is to generate equal length information summary. In data analysis work, it is often used for desensitization of sensitive information, because it is difficult to infer the content before encryption through THE MD5 value, so it is very secure. It is used in MD5(STR), which returns the checksum string calculated by MD5 algorithm on STR.
IFNULL(expr1, expr2) : If expr1 is not NULL, expr1 is returned; otherwise, expr2 is returned. Usually used to fill a NULL value for a field, also called missing value fill.
(4) IF(expr1, expr2, expr3) : returns the value of expr2 IF expr1 is not equal to 0 or not NULL, otherwise returns the value of expr3. Equivalent to an IF ELSE condition statement in a programming language
3.7 the predicate
In simple terms, a predicate is a keyword used to determine the truth or falsehood of a statement about the relationship between two objects. The return value is only true or false. That might be a little abstract. So let’s do a couple of predicates so you can see.
For example, the comparison operators we talked about earlier belong to the category of predicates. Some other predicates are shown in the following table:
3.8 CASE Expressions
CASE expressions in SQL statements correspond to conditional branches in programming languages, and play the role of multi-condition judgment to return multiple values. Its grammatical form is:
CASE
WHEN < expression 1> THEN < expression 1>
WHEN < expression 2> THEN < expression 2>
WHEN < expression 3> THEN < expression 3>
.
ELSE < expression > END
(1) The execution process is to judge whether the value returned by the evaluation expression after WHEN is true or false according to the written order. If the returned value is false, continue searching downward; If the value returned is true, THEN is executed, and the value returned is returned. The CASE expression exits. If all WHEN clauses are not satisfied, the expression following the ELSE is executed, returns the value, and the CASE expression exits.
(2) Understand the execution process, so when writing CASE expressions, we must pay attention to the order problem. If the value of < 1> is false and the value of < 2> is true, THEN the value of < 1> is false and < 2> is true. If the third THEN is executed, the actual effect is that both < evaluate expression 1> and < evaluate expression 2> are false, while < evaluate expression 3> is true, and so on.
3.9 NULL Value Determination
A NULL value must be judged using the predicate IS, because a NULL comparison with any other value (including a NULL value) results in NULL, which corresponds to false. This is easy to understand; you can think of NULL as unknown. If you compare the unknown with any value, you’re still going to get an unknown, and if you compare the unknown with the unknown, you’re going to get an unknown.
4 Associated query and subquery
Armed with the knowledge base of the previous three parts, we are ready to start learning about complex queries in SQL. There are two complex queries to be discussed in this article, one is associated query, one is subquery. First of all, let’s look at their theoretical basis, set operations.
4.1 Set Operation
(1) In Part 1, we mentioned that in the database world, a collection is the sum of a set of records, which can refer to a representation, a view, or the result of a query execution. Therefore, both tables and queries can participate in collection operations if the result of their execution is a collection. In other words, the result of query execution can be regarded as an intermediate or temporary table, which continues to participate in the operation. This is the theoretical basis of subquery.
(2) Set operations mainly include four kinds, union, intersection, difference and Cartesian product.
(3) Union, is to find the set after the combination of two sets. In MySQL, use the keyword UNION or UNION ALL. The difference between the two is that UNION will remove the redundant duplicate values in the set after merging, and only keep one copy. And UNION ALL, it doesn’t eliminate duplicates. Therefore, the UNION operation may result in a decrease in the number of records after it finishes running.
(4) Intersection, is to find the set of elements commonly owned by two sets. There are no specific keywords provided in MySQL, but rather they are implemented through internal associations, which are covered in the next summary.
(5) Difference set is to find the set of elements that exist in one set but do not exist in another set. Error set calculation is directional. Similarly, MySQL does not provide keywords for error set calculation, but requires left/right correlation and filtering out unsuccessful records.
(6) The Cartesian product is the combination of records in two sets, which is equivalent to the multiplication of sets. It is the mathematical theory foundation of associative query. You can simply say that the process of a relational query is to take the Cartesian product and filter out the records that meet the criteria through the on condition. Of course, the actual execution process will not be so simple, but it is optimized on the basis of this process to reduce the amount of calculation.
In the operation of union, intersection and difference set of sets, we need to pay attention to the following:
(1) The number of columns in the records of the two sets involved in the operation must be the same.
(2) The column types of the corresponding positions of the two sets involved in the operation must be consistent.
(3) If the ORDER BY clause is used, it must be written at the end.
4.2 Table Association Type
INNER JOIN LEFT OUTER JOIN RIGHT OUTER JOIN FULL OUTER JOIN INNER JOIN LEFT OUTER JOIN RIGHT OUTER JOIN FULL OUTER JOIN
(2) The syntax of association is relatively simple. Take INNER JOIN as an example, write it as: A INNER JOIN B ON expr. Where A and B represent the names of two tables, which can also be subqueries. The expr following ON represents an association condition, usually an expression consisting of the associated fields of tables A and B.
(3) INNER JOIN, which usually can omit INNER without writing, means that after the multiplication of the left and right sets, only records that meet the association conditions behind ON are retained. Therefore, an inner join can be used to calculate the intersection of two sets by writing the fields of the set elements in the association condition following ON.
LEFT OUTER JOIN LEFT OUTER JOIN LEFT OUTER JOIN LEFT OUTER JOIN LEFT OUTER JOIN LEFT OUTER JOIN LEFT OUTER JOIN LEFT OUTER JOIN Therefore, the left outer join can be used to calculate the difference set of the set. All we need to do is filter out the records that have been successfully associated, leaving the records that have not been successfully associated in the left table, which is the difference set we want.
RIGHT OUTER JOIN (OUTER JOIN) RIGHT OUTER JOIN (OUTER JOIN)
FULL OUTER JOIN FULL OUTER JOIN FULL OUTER JOIN FULL OUTER JOIN FULL OUTER JOIN FULL OUTER JOIN
4.3 Association of Multiple tables
The essence of a multi-table association is still a pair association. For example, associating table B with table C in table A is actually equivalent to associating table B with table A, with the result being an intermediate table and then associating with table C. So, the execution process is still pairwise.
4.4 Precautions for Table Association
Table association is a relatively complex query method. When writing, we need to construct the appearance of the set after association in our mind and select the connection method to be used accordingly. The following is a summary of practical work experience prone to error points, I hope you pay attention to.
A. Using UNION may reduce the number of records, and may lead to deviation in calculation when using aggregate functions
B. When using 1-pair many-to-many or many-to-many tables for association, the number of records may increase, which may also lead to calculation bias
C. Left outer join and right outer join have connection direction problem, table position is influential to the result, especially when multi-table association, must pay attention to the order of writing, as far as possible to do the first inner join and then left/right outer join.
D. Avoid cross-connections
4.5 the subquery
(1) subquery, is a query SQL statement nested by parentheses, usually a complete SELECT statement.
(2) Sub-query in different positions, play different roles. It usually appears after SELECT, after FROM/JOIN, and after WHERE/HAVING.
(3) When a subquery appears after a SELECT, it usually adds a column to the result. Note here, however, that the subquery used after the SELECT can only return a single column, and ensure that the subquery returns only a single row if the condition is met. Attempts to retrieve multiple columns or return multiple row results raise an error.
(4) The subquery appears after FROM/JOIN, which is the most commonly used way for us to take the results of the subquery as an intermediate table and continue to do analysis based on this table. When a subquery appears after WHERE/HAVING, the result returned by the subquery is filtered. According to the number of results returned by subquery, there are three cases, namely 1 row, 1 column, N row, 1 column, and N row, N column.
(5) When the return result is 1 row and 1 column, it actually returns a specific value. This subquery is also called standard quantum query. The result of scalar quantum query can be calculated directly by comparison operator.
(6) When the return result is N rows and 1 column, it actually returns a set of values of the same type. Therefore, you can use the IN predicate, as well as the keywords ANY, SOME, ALL, and so on.
(7) If the result is N rows and N columns, a temporary table is returned. The value comparison cannot be performed. Instead, the EXISTS predicate is used to determine whether the set returned is null.
Sorting is not easy, [comments, likes, favorites] is the biggest support for me!!
Recently in view of the Internet company interview asked knowledge points, summed up the Java programmer interview involves most of the interview questions and answers to share with you, I hope to help you review before the interview and find a good job, but also save you on the Internet to search for information time to learn. Full Version of the Java InterviewJAVA backend test integration
Content covers: Java, MyBatis, ZooKeeper, Dubbo, Elasticsearch, Memcached, Redis, MySQL, Spring, SpringBoot, SpringCloud, RabbitMQ, Kafka, Linux and other technology stacks.
Original link: blog.csdn.net/weixin_3903…