The following article comes from JD retail Technology, by Zheng Sicheng
Link: mp.weixin.qq.com/s/OFkRnbWEa…
What is the JDNoSQL platform
The JDNoSQL platform is a distributed column-oriented KeyValue millisecond storage service that stores structured data and non-organizational data, supports random read/write and update, flexible dynamic column mechanism, supports horizontal expansion in architecture, and provides high concurrency, low latency, high availability, and strong consistency database services, which can meet various business scenarios. Perfect platform support, support self-service table building, view monitoring, online DDL and so on.
1.1 Ecological position of JDNoSQL
As you can see from the figure above, JDNoSQL is a distributed, column-oriented storage system built on HDFS. You can use JDNoSQL when you need real-time read and write and random access to very large data sets. Some relational databases on the market today are built without considering the characteristics of super-size and distribution. Many vendors extend their databases beyond the boundaries of a single node by copying and partitioning, but these features are often added as an afterthought and are complex to install and maintain. At the same time, it can affect RDBMS specific functions, such as joins, complex queries, triggers, views, and foreign key constraints, which are expensive or impossible to implement on a large RDBMS. JDNoSQL approaches scalability issues from another perspective. It scales by adding nodes from bottom to top in a linear fashion. JDNoSQL is not a relational database, nor does it support SQL. You can currently support SQL through JDPhoenix, but it has its own strengths that RDBMSS cannot handle. JDNoSQL has the following features:
-
Large: A table can have hundreds of millions of rows and millions of columns.
-
Column-oriented: List (cluster) oriented storage and permission control, column (cluster) independent retrieval.
-
Sparse: NULL columns take up no storage space, so tables can be designed to be very sparse.
-
No schema: Each row has a sortable primary key and as many columns as possible. Columns can be added dynamically as needed, and different rows in the same table can have distinct columns.
-
Multiple versions of data: There can be multiple versions of data in each cell. By default, the version number is automatically assigned. The version number is the timestamp when the cell was inserted.
-
Single data type: Data in JDNoSQL is all strings with no type.
Application scenarios
The use of NoSQL in JINGdong mainly involves the following scenarios:
-
Sequential business (monitoring, IOT)
-
Message orders (orders/policies, chat logs)
-
CUBE analysis (real-time wide tables, reports, search recommendations)
-
Monitoring (UMP/MDC/CAP/JDH)
-
Feeds stream service (comment information, Q&A information, waterfall stream, moments of friends)
-
AI Storage (User characteristics, NLP corpus, model Storage)
-
Spatio-temporal data (tracks, meteorological networks)
-
Financial services (association analysis, credit analysis, Risk control/IOUS/Payment/asset management)
2.1 Advertising real-time calculation system based on NoSQL
2.1.1 Major features of online advertising:
Compared with traditional advertising, network advertising presents some of its own characteristics, understanding these characteristics, is the essence of network advertising marketing strategy. The characteristics of network advertising are as follows:
-
Wide spread: the spread of network advertising is wide, not limited by time and space, advertising information can be transmitted around the world through the Internet uninterrupted. China has a huge number of netizens, but also in the rapid development, these netizens have high consumption power, is the audience of online advertising, can browse ads on the Internet anywhere in the world, this communication effect is any kind of traditional media can not achieve.
-
Non-forced dissemination of information: The nature of online advertising is on-demand advertising, which has the nature of newspaper classified advertising, but does not require the audience to browse thoroughly. It can be freely queried and actively presented and displayed according to the needs of potential customers, so as to save the attention resources of the whole society and improve the pertinence and effectiveness of advertising.
-
Audience data can be accurate statistics: the traditional media advertising, it is difficult to accurately know how many people are exposed to advertising information, Internet advertising, can through the traffic statistics system authority and justice, accurate statistics the number of each AD browsing before these users refer to time and geographical distribution, thus is advantageous to the correct assessment of advertising effects, further optimize advertising strategy.
-
Flexible timeliness: Internet advertising can update advertising content as needed.
-
Intense exchange and senses: the carrier of basic Internet advertisements are multimedia, hypertext, etc., need to audience interested in the products, only need to click to learn more about more information, more detailed, more vivid, even can let consumer experience personally products, services and brands, through virtual reality technology, can let customers immersive.
2.1.2 Data types of online advertising:
There are many kinds of collected data related to online advertising, among which there are four most critical types: display, click, behavior, and third-party data monitoring.
- AD presentation data
Advertising display data refers to the display data obtained from advertising space, which is generally sent to the server for statistical analysis of advertising display volume (ADPV). General data contains information such as date, user ID, AD ID, and IP. Here is a data format for the AD presentation, with JSON field extensions:
The 2015-01-13 19:11:55 {00 d81d1d - 00 a291 e2300 0-87 dbce0da90} {" adia ":"31769"."asid": "2"."aspid":"0"."ptime": "14"."ag":"Four,5.20, 26.1908"."ecode": "15"."type":"2"."dp1": "1"."adpid":"0"."dsp": "0"."source": "s"} tianjin, TianjinCopy the code
- AD click data
Advertising click data refers to the user click data obtained by each advertising space. Generally, this data also needs to be sent to the server for statistical analysis of advertising click (Adclick). Click data typically contains information such as date, user ID, AD ID, and IP. Here’s a data format for AD clicks, not much different from AD display:
2015-01-13 00:11:06{D33333C3-000C84-2345FB-DB768EC56} {"wid":"13"."aid": "103297"."vid":"1446779"."adid": "29260"."asid":"1"."aspid": "1"."mid":"16507"."mg": "155"."area":"13"."dsp": "3"} changsha city, Hunan ProvinceCopy the code
- Advertising behavior data
Advertising behavior data refers to the data obtained by the user download, installation or transaction of advertising space. Generally, this data also needs to be sent to the server side for the analysis of other advertising behavior (AdAction). Common behavior data contains information such as date, user ID, AD ID, and IP. Here is an AD action data format that is not much different from the AD display data, except that the JSON extended fields differ in some general information:
2015-01-13 09:59:39{00567D26AD-565D01-C2238-F99000C0A0} {"adid":"234555"."asid": "562"."aspid":"12"."type": "1"} ningde City, Fujian ProvinceCopy the code
- Third-party Monitoring Data
In order to facilitate advertisers to understand the target consumption of online media browsing habits, into the probability of customers, and obtain fair, objective, authoritative statistical information, it is very necessary to use a third-party advertising monitoring company to participate in the monitoring of advertising. Third-party monitors also produce monitoring data, including dates, AD ids, and user ids. The following is an example of third-party monitoring data:
2014-12-31 108A451BD3787_22E6_D020_786DF2695B {000AD54073-19DDC2-971F26-36F4119425}
Copy the code
2.1.3. Challenge of advertising data
The value of data decreases over time, so events have to be processed as soon as they appear, preferably as soon as they appear, one event at a time, rather than cached and processed in batches. In the data flow model, The input data to be processed (all or part of it) is not stored on randomly accessible disks or memory, but arrives as one or more “continuous streams.” Data is different from the traditional storage relationship model, which mainly has the following characteristics:
-
Data elements in the stream arrive online and require real-time processing
-
The system has no control over the order in which newly arrived data elements will be processed, whether within a single data flow or across multiple data flows; That is, the replayed data stream may not have the same element order as the last data stream.
-
The potential size of data flows may be endless.
-
Once an element in the data stream has been processed, it is either discarded or archived for storage.
2.1.4 Main System Functions
The system only serves the advertising industry at present, and requires that advertising display data and advertising click data can be reflected in the inventory system in real time. The inventory system can calculate the advertising strategy according to the existing investment volume. At the same time, it can provide statistics of the daily display amount of certain advertisements in a month, and can be divided into three dimensions: province, city and user. On the premise of meeting the preceding functions, the system performance delay is required to be within 30 seconds, and access requests whose peak value is 500W are supported.
2.1.5. System Architecture
According to the previous demand analysis, design objectives and main function requirements, the whole advertising real-time computing system is divided into six layers: log receiving layer, producer layer, consumption queue layer, consumer layer, business logic layer and storage layer. Jingdong JDQ real-time data pipeline is used for message queue to provide high-throughput distributed message queue based on Kafka for streaming computing scenarios. Jingdong JRC streaming computing is used for business logic layer to provide flink-based streaming computing engine for streaming computing. High concurrency, low latency and high availability are used for storage. NoSQL distributed storage with high QPS throughput and random read and write. The architecture diagram is as follows:
- Log receiving layer
This layer is the data source and produces local log files through the log receiver tool. The commonly used receiving tools include Scribe, Nginx, syslog-ng, and Apache Http Server. After receiving these data flows, they are stored in local disk files.
- Producers layer
This layer is the data transfer layer, which is used to generate log files from the local to the Kafka cluster, monitor the specified file or directory in real time, extract incremental data and send it to the Kafka cluster.
- Message queue layer
This layer is a Kafka cluster, which is responsible for load balancing and message buffering of input data, and has high throughput and good horizontal scalability. The message queue layer chose Kafka because Kafka focuses on throughput and buffering.
- The consumer layer
The application consumes the messages in the Kafka queue and inputs the number of messages to the business logic layer, which is the connecting sub-layer. Since the business logic layer uses the Flink framework, all consumer layers need to connect the Kafka and Flink clusters.
- Business logic layer
This layer is an important sub-layer to realize requirements. Using the Flink framework, it is very convenient to deploy business requirements of different rules and realize fast calculation.
- Storage layer
The distributed storage NoSQL used by the target storage meets the service requirements of high throughput, low latency, real-time update, search for certain scenarios, and horizontal expansion.
2.1.6. Table design
To meet the real-time query and periodic statistics requirements of the final results, the result data is stored in NoSQL, and the structure of the table needs to be defined first. Because the data includes two unrelated types of data, AD display and AD click, and the business direction is different, two tables need to be created to store the statistical structure of the two types of data.
- Advertising real-time display statistics
The structural design of advertising real-time display statistical table is as follows:
Among them, the design of row construction is very critical. This table contains three types of row construction, which are distinguished by province name, city name and UID respectively, for more efficient statistics of the data of these three dimensions. The family of columns and the number of columns is 1. Here is an example of a row of data from the AD real-time statistics table, where the value field is represented in hexadecimal bytecode and is a long integer.
29260_{2EEBEE83-EEE4-EAE6-1F0D-A27AB14549FC}_20150117 column=pv:cnt,timestamp=1390261754783,
value= \x00\x00\x00\x00\x00\x00\x00\x02
Copy the code
- AD real-time click statistics
The structure of advertising real-time click statistics table is as follows:
Compared with advertising real-time display statistics, real-time click statistics is obviously simpler, there is only one type of line construction: ADID_ plus date, a very conventional design scheme; Both the column family and the column data are 1. The following is an example of a row of data from the AD live click statistics table, where the value field is represented in hexadecimal bytecode and is a long integer.
36713_20150117 column=clk:cnt, timestamp=1390374472961, value=\x00\x00\x00\x00\x00\x00\x00\x06
Copy the code
2.1.7. Use NoSQL statistics
According to the description and implementation of the above table structure design, this structure supports the following multiple real-time query requirements:
- The current amount of advertising in a province.
- The current amount of advertising in a particular city.
- The current amount of an AD being placed on a user client
- The current number of clicks on an AD
- Historical advertising trends in a province over a cumulative period of time (e.g., a month)
- The historical placement trend of an AD in a city over a cumulative period of time (e.g., a month)
- The historical placement trend of an AD on a user client over a cumulative period of time (e.g., a month)
- The trend in clicks on an AD over a period of time (e.g., a month)
These requirements mentioned above can be easily implemented by encapsulating NoSQL clients and meeting real-time requirements. Front-end data visualization can be quickly realized with the help of open source JavaScript frameworks, such as echarts, HighCharts, D3.js, etc
conclusion
According to Gartner, the global non-relational database (NoSQL) is expected to maintain a high growth rate of around 30% from 2020 to 2022, much higher than the overall database market. With the rise and development of NoSQL and big data technology, low-cost one-stop data processing platform based on NoSQL and NoSQL ecology is booming. Currently, NoSQLAPI, PhoenixSQL, OpenTSDB, full text search Solr/ES, Space-Time GeoMesa, Graph HGraph, and Analysis Spark on HBase are supported. With the rapid development of NoSQL, NoSQL user groups are becoming larger and larger. In the future, NoSQL and NoSQL ecosystem will better meet various business scenarios.
Welcome to pay attention to the public number: the growth of the old boy road, selected dry goods regularly every week!