Sun Xien, also known as Yucong, is a technical expert on the OceanBase tools team. He has long been engaged in infrastructure and middleware research and development, and now works on the OceanBase tools team, where he is responsible for developing basic components and tools and is committed to building a complete ecological tool system for OceanBase. He has rich hands-on R&D experience in Java, distributed systems, infrastructure, middleware, "de-O" (Oracle replacement), and other fields.
Introduction
In enterprise applications, not all businesses use online migration/synchronization tools to move data. Some use an export tool to export data from the database and import it into other databases for analysis; others export data and transfer the files to other systems for processing. Operations such as regular logical backup and restore and routine offline migration also require exporting the schema or data of a database to files. Only by accurately understanding the specifications of the different data formats and using the import and export tools proficiently can we make better use of this data in daily work.
This article introduces common data formats, best practices, typical cases, and precautions, based on OceanBase's open-source import and export tools.
Main content
OBLOADER is a client-side import tool developed in Java that imports schemas and data into the OceanBase database. The tool provides very flexible command-line options that allow you to import schemas and data into OceanBase in a variety of complex scenarios. OBLOADER is mainly used together with OBDUMPER, but it also supports importing SQL or CSV files exported by tools such as Navicat, mydumper, and SQL Developer. OBLOADER takes full advantage of OceanBase's distributed architecture and is optimized for import performance.
File size limits and recommendations
It is strongly recommended that you do not directly use files that are too large, such as data files of 100 GB or more. If the export tool you use (such as mysqldump or SQL Developer) cannot split the export into multiple files, you can use a third-party tool to split large files into shards. The following describes how to shard data files on different operating systems.
Linux or macOS
You can use the built-in split command to split a large CSV file into several smaller files. The command synopsis and a sample invocation are shown below:

```shell
split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]

split -l 100000 pagecounts-20210723.csv pages
```
The above example shards the data file pagecounts-20210723.csv by line count. Assuming the file is 8 GB and contains 10 million lines, the command splits it into 100 sub-files (10,000,000 / 100,000 = 100) whose names start with the prefix pages. Each sub-file contains 100,000 lines and is about 80 MB in size.
Note: The split command splits the file by line count, so a record may be split in the middle if a field value itself contains line breaks.
Windows
Windows does not ship with a built-in shard tool, but you can find a number of third-party sharding tools and scripts on the web for splitting large files.
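As a portable alternative, a line-count splitter can be sketched in a few lines of Python. This is only a rough stand-in for the Unix split -l command, not part of the OceanBase toolset; the function name, chunk size, and output prefix are all hypothetical:

```python
def split_by_lines(path, lines_per_chunk, prefix):
    """Split a text file into chunks of at most lines_per_chunk lines each."""
    chunk, out, count = 0, None, 0
    with open(path, "r", encoding="utf-8") as src:
        for line in src:
            if out is None:
                # Open the next output chunk, e.g. part000.csv, part001.csv, ...
                out = open(f"{prefix}{chunk:03d}.csv", "w", encoding="utf-8")
            out.write(line)
            count += 1
            if count == lines_per_chunk:
                out.close()
                out, count, chunk = None, 0, chunk + 1
    if out is not None:
        out.close()
```

Like split -l, this splits purely by line count, so the same caveat applies: a record containing embedded newlines may be cut in the middle.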
Note:
1. When using OBDUMPER to export data, the program limits the file size to 1 GB by default. This avoids writing all the data in a table into a single file and thus creating a huge file on disk. If the volume of exported data exceeds 1 GB, the program creates a new file and continues writing.
2. When using OBLOADER to import data, files are logically split by default to improve import performance. Note that this is only a logical split; no temporary files are generated. Even for large files (100 GB or more), the program can split the file in a very short time so that each sub-file can be parsed in parallel and then imported into the database concurrently. Therefore, you do not need to start multiple OBLOADER processes for concurrent imports.
Accurately identify data formats
OBDUMPER supports three common data formats used in scenarios such as offline migration and logical backup: CSV, Insert, and Delimited Text. File format and file name extension are different concepts. A file format usually refers to how the content in the file is organized. For example, a file named 123.csv usually stores data in CSV format; .csv is used as the extension only to make the content easy to identify. You could just as well store CSV data in a file named 123.txt. This does not affect how a program parses the data file, but it may make it hard to tell the content from the file name. The fastest way to accurately identify a data format is to preview part of the content. In this section, we explain how to recognize these data formats and avoid frequent errors in daily operations.
CSV format
CSV is the most common data format in the industry, and it is also the recommended format. In addition to the basic format specification (see RFC 4180), there are two pitfalls to be aware of.
1. The data contains special characters. The basic symbols of the CSV format are the quote character, the column separator, and the row separator. If any of these symbols, or NULL values, appear in the data, the escape-related options must be set in the export command; otherwise, the generated CSV file cannot be parsed correctly.
2. Differences between the CSV format and the Excel format. CSV is a text format and can be opened for preview with any editor, while the Excel format is binary and can only be opened with Microsoft Excel. Because Excel can parse CSV data, many users are used to previewing CSV files in Excel. However, this requires familiarity with the flexibility of the CSV format and proficiency with Excel. For example, when Excel parses a CSV file, it assumes by default that the quote character is the double quotation mark, the column separator is the comma, and the escape character is the double quotation mark. If you open a CSV file that does not use Excel's default symbols, the preview will show the wrong content. In addition, the Windows version of Excel supports configuring these CSV symbols, but the macOS version does not.
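To see why these symbols must be escaped, here is a minimal sketch using Python's standard csv module (the sample field content is hypothetical). With RFC 4180-style quoting, a field containing commas, double quotes, and newlines round-trips intact:

```python
import csv
import io

# Hypothetical rows: the second field of the data row contains a comma,
# a double quote, and a newline -- all of which must be quoted/escaped.
rows = [
    ["id", "comment"],
    [1, 'He said "hi", then\nleft'],
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)  # RFC 4180-style quoting
writer.writerows(rows)
text = buf.getvalue()

# Reading the file back recovers the original field content intact.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[1][1] == 'He said "hi", then\nleft')
```

In the generated text, the problematic field is wrapped in double quotes and the embedded double quotes are doubled, which is exactly what a correct CSV exporter must do.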
The following is an example CSV format:
CSV header (optional): the first line of the CSV file, which stores the column names of the table. You can choose not to write a header line. CSV records: the remaining lines of the CSV file, which store the table data. Quote characters, column separators, row separators, and NULLs that appear in the data need to be escaped.
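For reference, a minimal illustrative CSV file with a header line and two records might look like this (the table and its column names are made up):

```
id,name,note
1,Alice,"contains a comma, so the field is quoted"
2,Bob,"embedded ""double quotes"" are doubled"
```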
Insert format
The Insert format consists of the INSERT SQL statements we are all familiar with. To ensure the SQL syntax is correct, you must set the proper escape symbols. Insert SQL statements are prone to two kinds of syntax errors:
1. The object name or a column name is an SQL keyword. This causes the SQL statement to fail. The solution is to add escape symbols around object and column names when generating the Insert SQL statements: Oracle syntax uses double quotation marks (") and MySQL syntax uses backquotes (`).
2. Single quotation marks exist in the business data. This causes Insert SQL parsing errors. The solution is to escape data containing special characters when generating the Insert SQL statements. When exporting data with OBDUMPER, the data is escaped by default, so you are guaranteed to get syntactically correct Insert SQL statements.
The following is a sample Insert format:
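As an illustration (the table and column names are hypothetical), an exported Insert statement in Oracle mode might look like the following; note the double-quoted object name, which happens to be an SQL keyword, and the doubled single quote inside the string literal:

```sql
INSERT INTO "ORDER" ("ID", "NOTE") VALUES (1, 'customer''s first order');
```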
Delimited Text format
The Delimited Text format is the simplest text format and the one most easily confused with CSV. A CSV column separator can contain only one character and defaults to a comma, while a Delimited Text column separator can be a single character or multiple characters. The basic symbols of the Delimited Text format are the column separator and the row separator; there is no quote character. Characters that conflict with these basic symbols must not appear in the data, or the file cannot be parsed properly. For example, the Delimited Text format requires each record to occupy exactly one line in the file; if a field value contains a newline, that record will be parsed as two records.
The following is a sample Delimited Text format:
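For illustration, a Delimited Text file using a pipe (|) as the column separator might look like this (the values are made up); note there is no quote character, so neither the pipe nor a newline may appear inside a field:

```
1|Alice|active
2|Bob|inactive
```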
Processing of semi-structured data
The semi-structured data we typically encounter is of the JSON or XML data types. These two types are themselves composite data formats: a JSON or XML node can, for example, store data in any format, such as CSV text or Insert SQL statements. So when you prepare data, be sure to handle separators and escaping properly. Currently, files containing this type of data cannot be split correctly. Therefore, if you use OBLOADER to import files containing JSON or XML data, adjust the file splitting threshold so that the program skips automatic file splitting.
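The point about separators and escaping can be sketched with Python's standard csv and json modules (the document content is hypothetical): the JSON column contains commas and double quotes, so it only survives the round trip because the CSV writer quotes and escapes it:

```python
import csv
import io
import json

# Hypothetical row whose third column holds a JSON document. The JSON text
# contains commas and double quotes, which conflict with the CSV symbols.
doc = {"name": "Alice", "tags": ["a", "b"]}
row = [1, "profile", json.dumps(doc)]

buf = io.StringIO()
csv.writer(buf).writerow(row)  # quotes the JSON column automatically

# Round trip: parse the CSV line, then decode the JSON column.
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
restored = json.loads(parsed[2])
print(restored == doc)
```

Without that quoting, the commas inside the JSON document would be misread as column separators and the record would fall apart.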
Processing of binary large objects
Binary large object types such as RAW, BINARY, and LOB (Large Object) are not uncommon in enterprise-level business applications. Some database import and export tools do not support these data types at all. It is highly recommended that you encode binary data as hexadecimal strings for file storage and parsing, as tools such as OBDUMPER and mysqldump do by default. In extreme scenarios, LOBs can store hundreds of megabytes or even gigabytes of data, which poses significant performance and storage challenges for import and export tools. In daily operations, consider handling tables that contain binary large object columns separately.
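A minimal sketch of the hex-encoding idea, using Python's standard binascii module (the payload bytes are made up). The exact encoding a given tool uses may differ in detail, so treat this only as an illustration of why hex strings are safe to put in text files:

```python
import binascii

# Hypothetical binary payload, e.g. the content of a BLOB column.
blob = bytes([0x00, 0xFF, 0x10, 0x42])

# Encode to a hex string for safe storage in a text-based export file...
hex_text = binascii.hexlify(blob).decode("ascii")

# ...and decode it back when importing.
restored = binascii.unhexlify(hex_text)
print(hex_text, restored == blob)
```

The hex string contains only the characters 0-9 and a-f, so it can never collide with CSV separators, quote characters, or row separators.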
Date-time type processing
In Oracle or OceanBase's Oracle mode, date-time data types are complex and error-prone, with typical problems involving precision, formatting, and time zones.
Before exporting data from the source database, set the expected date-time format; before importing data into the target database, set the expected date-time format as well. If the date-time formats of the source and target databases are inconsistent, the date-time data cannot be imported correctly. In daily operations, this kind of problem is very common. In addition, date-time types vary from database to database. For example, the DATE type represents different date and time information in MySQL, DB2, Oracle, and OceanBase. For these differences, you can only consult the database documentation to resolve compatibility issues; otherwise, there may be precision loss or write failures.
You can format the date and time data using the following parameters:
```
NLS_DATE_FORMAT        YYYY-MM-DD
NLS_TIMESTAMP_FORMAT   YYYY-MM-DD HH24:MI:SSXFF AM
NLS_TIMESTAMP_TZ_FORMAT  YYYY-MM-DD HH24:MI:SSXFF AM TZR
```
Final words
When exporting data from the database, as long as you carefully follow the instructions above and check and set the relevant options, any export tool will produce the data files you need, and you can then easily use those files for subsequent imports. Don't try to cut corners!