Preface

Many tools have emerged to handle the growing volume of data on the Internet, the most popular being the Hadoop ecosystem. In addition to familiar Hadoop components such as HDFS, MapReduce, HBase, and Hive, a general-purpose big data platform often adds Kafka or another message queue, Redis or another cache, and Flink or another real-time stream processing engine, while MongoDB, Cassandra, or other NoSQL databases are used for storage. Such a typical big data platform can handle most Internet-industry scenarios, such as user profiling and public opinion analysis.

Naturally, with the arrival of the Internet of Things, the Internet of Vehicles, and the Industrial Internet, people turned to the same general-purpose big data platform to process their data. The IoT and Internet of Vehicles platforms popular in the market are, almost without exception, built on this architecture. The approach has been proven to work. But how effective is it? It has many deficiencies, mainly in the following aspects:

**1) Low development efficiency:** Because no single piece of software does the job, at least four modules must be integrated, and many of them expose neither standard POSIX nor SQL interfaces; each has its own development tools, languages, and configuration, which carries a real learning cost. Because data flows from one module to another, data consistency is easily compromised. Moreover, these modules are mostly open source software that always contains bugs of one kind or another; even with the support of forums and communities, getting stuck on a technical problem can cost engineers a great deal of time. Generally speaking, assembling these modules smoothly requires building a strong team, which consumes significant human resources.

**2) Low operating efficiency:** The existing open source software is designed mainly for the unstructured data of the Internet, but the data collected by the Internet of Things is time-series and structured. Processing structured data with unstructured-data techniques consumes far more resources, for both storage and computation. For example, a smart electricity meter collects current and voltage, which are stored in HBase or another KV-type database. The row key is usually the meter's ID plus other static label values, and the key of each collected value consists of the row key, column family, column qualifier, timestamp, key type, and so on, followed by the value itself. This layout carries heavy per-value overhead and wastes storage space. Computation is no better: to calculate the average voltage over a period of time, the voltage must first be parsed out of the KV cells into an array before the arithmetic can begin, and the overhead of parsing the KV structure significantly reduces computing efficiency. The biggest benefit of KV storage is that it is schemaless: there is no need to define data structures before writing, so you can record whatever you want, which is an attractive design for Internet applications that change almost daily. For the Internet of Things, the Internet of Vehicles, and similar applications, however, this benefit matters little, because the schema of the data generated by IoT devices rarely changes; when it does change, the frequency is very low, since the corresponding configuration or firmware must be updated as well.

**3) High operation and maintenance costs:** Every module, whether Kafka, HBase, HDFS, or Redis, has its own management console and must be administered independently. In a traditional information system, a DBA only needed to learn how to manage MySQL or Oracle; now a DBA must learn to manage, configure, and tune many modules, a far larger workload. Moreover, with so many modules, locating a problem becomes more complicated. For example, if a user finds that a piece of collected data is missing, was it lost in Kafka, HBase, Spark, or the application? Correlating the logs of each module to find the cause often takes a long time. And the more modules there are, the less stable the system as a whole becomes.

**4) Slow time to market, low profit:** Because development efficiency is low and operation and maintenance costs are high, products take longer to reach the market, and enterprises lose business opportunities. Moreover, these open source projects keep evolving, and tracking the latest versions takes dedicated manpower. Except for leading Internet companies, the human resource cost that small and medium-sized companies spend on a big data platform is generally much higher than the cost of a professional company's product or service.

**5) Too heavy for small-data private deployments:** In Internet of Things and Internet of Vehicles scenarios, private deployment is still the norm because the security of production and operation data is at stake. The data volume handled by each private deployment varies widely, from a few hundred connected devices to tens of millions. For scenarios with small data volumes, a general-purpose big data solution is too bloated, and the investment is out of proportion to the return. Some platform providers therefore maintain two schemes: a general big data platform for big-data scenarios, and MySQL or another DB to handle everything in small-data scenarios. But this drives development and maintenance costs even higher.
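The storage and parsing overhead described in point 2 can be made concrete with a small, purely illustrative Python sketch. It mimics HBase-style cells with plain strings; the key layout, meter ID, and sample values are invented for illustration, and this is not actual HBase code:

```python
# Illustrative sketch: contrast per-cell KV encoding, where every value
# carries its full key, with a plain structured (fixed-schema) layout.

def kv_cells(meter_id, readings):
    """Encode each reading as an HBase-style cell: the full key
    (row key + column family + qualifier + timestamp) repeats per value."""
    cells = []
    for ts, voltage in readings:
        key = f"{meter_id}:data:voltage:{ts}"  # row key, CF, qualifier, timestamp
        cells.append((key, str(voltage)))
    return cells

def avg_voltage_kv(cells):
    """To compute an average, every cell's value must first be parsed out."""
    values = [float(v) for _, v in cells]
    return sum(values) / len(values)

def avg_voltage_structured(readings):
    """With a fixed schema, the voltages are already a typed column."""
    voltages = [v for _, v in readings]
    return sum(voltages) / len(voltages)

readings = [(1000, 220.1), (1001, 219.8), (1002, 220.4)]
cells = kv_cells("meter-42", readings)

# The repeated key text alone dwarfs the actual payload:
key_bytes = sum(len(k) for k, _ in cells)
val_bytes = sum(len(v) for _, v in cells)
```

Both paths yield the same average, but the KV path pays for per-cell keys on disk and for string parsing at query time, which is exactly the overhead a fixed-schema time-series layout avoids.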

Given these problems with general-purpose big data platforms, is there a better way? To answer that, we need to analyze Internet of Things scenarios in detail. A closer look reveals that the data generated by machines, devices, and sensors is time-series data, and much of it carries location information. This data has 12 distinct characteristics:

**1)** Data is time-series; every record carries a timestamp;

**2)** Data is structured;

**3)** Data is rarely updated or deleted;

**4)** Each data point has a single, unique source;

**5)** Compared with Internet applications, writes far outnumber reads;

**6)** Users focus on trends over a period of time rather than the value at a particular point in time;

**7)** Data has a retention period;

**8)** Data queries and analyses are always based on time periods and geographic regions;

**9)** In addition to storage and query, various statistical and real-time computing operations are often required;

**10)** Traffic is steady and predictable;

**11)** Special calculations such as interpolation are often required;

**12)** The volume of data is huge: a single deployment may collect more than 10 billion data points in a day.
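As one concrete instance of characteristic 11, the sketch below (illustrative Python; the sample timestamps and values are made up) shows linear interpolation of a sensor reading at a query timestamp that falls between two actual samples:

```python
# Linear interpolation over a sorted time series: estimate the value at a
# timestamp that was never directly sampled.

def interpolate(samples, ts):
    """samples: list of (timestamp, value) pairs sorted by timestamp.
    Returns the linearly interpolated value at ts."""
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= ts <= t1:
            if t1 == t0:
                return v0
            # Weight the two neighboring samples by how close ts is to each.
            return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)
    raise ValueError("timestamp outside sampled range")

samples = [(0, 10.0), (10, 20.0), (20, 15.0)]
```

A general-purpose SQL engine offers no such operation out of the box, which is why time-series workloads keep reimplementing it in application code.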

If we take full advantage of these characteristics, we can develop a big data platform specially optimized for IoT scenarios. Such a platform would have the following characteristics:

**1)** It makes full use of the characteristics of IoT data, applying technical optimizations throughout to greatly improve data insertion and query performance and reduce hardware or cloud service costs;

**2)** It must scale horizontally: as data volume grows, capacity is expanded simply by adding servers;

**3)** It must have a single management backend and be easy to maintain, striving for zero-management operation;

**4)** It must be open, offering the industry-standard SQL interface and providing Python, R, or other development interfaces to ease the integration of machine learning, artificial intelligence algorithms, and other applications.
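To illustrate what point 4 buys, the sketch below uses SQLite purely as a stand-in (it is not TDengine or its API; the table name, columns, and readings are invented): with a standard SQL interface, a typical time-series task such as per-minute average voltage is one familiar `GROUP BY`, with no module-specific API to learn.

```python
import sqlite3

# In-memory stand-in database with a simple meter-readings table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meter (ts INTEGER, device_id TEXT, voltage REAL)")
rows = [(0, "m1", 220.0), (30, "m1", 221.0), (60, "m1", 219.0), (90, "m1", 220.0)]
conn.executemany("INSERT INTO meter VALUES (?, ?, ?)", rows)

# Bucket timestamps (seconds) into minutes and average each bucket.
avg_per_minute = conn.execute(
    "SELECT ts / 60 AS minute, AVG(voltage) FROM meter "
    "GROUP BY minute ORDER BY minute"
).fetchall()
```

Because the interface is plain SQL, the same query works from Python, R, or any tool with a database driver, which is exactly the integration point 4 asks for.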

TDengine, developed by TAOS Data, is a full-stack big data processing engine built to take full advantage of these 12 characteristics of IoT data. It possesses the characteristics listed above and is expected to remedy the shortcomings of general-purpose big data platforms in processing IoT data. By design, TDengine should greatly simplify the architecture of IoT big data platforms, shorten development cycles, and reduce operating costs.