1: Project background

In daily work, business colleagues need to compare data across various dimensions at any time and make decisions quickly (page funnel analysis, real-time traffic, exposure/click/conversion). We therefore need to build a set of real-time components to lower the barrier to using data, improve decision-making efficiency, and drive business growth.

2: Real-time OLAP

On-Line Analytical Processing (OLAP) is a software technology that enables analysts to view information quickly, consistently, and interactively from many perspectives in order to gain a deep understanding of the data. It is characterized by FASMI: Fast Analysis of Shared Multidimensional Information. F stands for Fast, meaning the system can respond to most analysis requests within a few seconds; A stands for Analysis, meaning users can define new specialized calculations as part of the analysis without programming and report them in the way they want; S stands for Shared, meaning the system supports concurrent access by many users; M stands for Multidimensional, meaning the system provides multidimensional views and analysis of the data; I stands for Information, referring to the ability to obtain information in a timely manner and manage large volumes of information. OLAP is commonly classified as follows:

| OLAP type | Description | Typical components |
| --- | --- | --- |
| MOLAP | The data used for analysis is physically stored as multidimensional arrays, forming a CUBE structure. Dimension attribute values are mapped to array subscripts or subscript ranges, and the measure values are stored in the array cells. The advantage is fast queries; the disadvantage is that the data volume is hard to control and may lead to dimension explosion. | Druid, Kylin, Doris |
| ROLAP | The data used for analysis is stored in a relational model. The advantages are a small storage footprint and flexible querying; the obvious disadvantage is that every query requires aggregating the data on the fly. | ClickHouse, Presto, Impala, Spark SQL, Flink SQL, GreenPlum, Elasticsearch |
| HOLAP | Hybrid OLAP, a fusion of MOLAP and ROLAP: MOLAP is used when querying aggregated data and ROLAP when querying detail data, optimizing query performance for a given usage scenario. | Relatively little material is available |

3: Why ClickHouse

| Component | Introduction | Advantages | Drawbacks |
| --- | --- | --- | --- |
| Kylin | A full precomputation engine that enumerates all dimension combinations, builds cubes, and aggregates ahead of time. An HBase-based OLAP engine fed by Hive or Kafka; generally used for day- and hour-level OLAP. | Handles large data volumes (relying on HBase); easy to use; high performance; standard SQL and fast queries. | The latest version has dropped HBase (worth following in the community). Less flexible: no ad hoc queries, no secondary indexes, mediocre filter performance, and no support for joins or data updates. The processing model is complex and cube precomputation must be defined; storage can explode beyond about 20 dimensions. Detail data cannot be queried, maintenance is complicated, and real-time performance is poor: in many cases only data from a day or several hours ago can be queried. |
| Impala | MPP engine. Integrated with Kudu it generally does minute-level OLAP; integrated with Hive it generally does day- and hour-level OLAP. | Memory-based computation with no need to write intermediate results to disk, saving a lot of I/O overhead; can access the Hive metastore and analyze Hive data directly. | The fully in-memory implementation requires a large number of machines to support it, which is both its strength and its weakness. |
| Presto | MPP engine. Integrated with Hudi it generally does minute-level OLAP; integrated with Hive it generally does day- and hour-level OLAP. | A SQL compute engine that separates the compute layer from the storage layer; it stores no data itself and accesses various data sources (storage) through the Connector SPI, so it supports a rich set of data sources. | Same as above. |
| SparkSQL/FlinkSQL | Compute engines; real-time capability depends on the data source. | Handle large data volumes (not storage engines); highly flexible and easy to use; standard SQL, multi-table joins, and window functions; simple processing with no preprocessing and no redundant data. | Poor performance: with complex queries and large data volumes, responses can take minutes. They are compute engines without local storage, so joins require shuffles with high overhead and poor performance. Data must be loaded from external sources, so freshness cannot be guaranteed and multi-table joins are hard to answer within seconds. |
| ClickHouse | A storage-oriented database that keeps the original detail data; MergeTree stores data locally. | High performance and high column compression ratio; second-level responses via indexes; strong real-time capability with Kafka import support; simple processing with no preprocessing, and detail data is retained. | Flexibility at very large data scale is limited; arbitrary ad hoc queries are not well supported; joins are awkward to use; the SQL syntax is non-standard; window functions are not supported; maintenance costs are high. |

ClickHouse does not depend on other components, supports distributed storage, responds quickly, uses relatively little memory, and offers rich query support (funnel and retention functions; see clickhouse.tech/docs/… for details). In the end we chose ClickHouse to support real-time report analysis.

4: What should real-time analytics do

1: Global attribute construction

ClickHouse's support for multi-table joins is poor, so it is recommended to keep all tracking events in one large, wide table. If the app's event tracking system is designed reasonably and there are no attributes with the same name but different types, all you need to do is maintain a global attribute table that maps ClickHouse data types to the Hive tracking-data types. Details are as follows:

| Hive data type | ClickHouse data type |
| --- | --- |
| Double | Nullable(Float64) |
| String | Nullable(String) |
| Boolean | Nullable(Int8) |
| Bigint | Nullable(Int64) |
| Bigint | Nullable(UInt64) |

Our BI system enforces strong consistency on all tracking metadata; attributes that historically shared a name across different types have had their historical data reprocessed. Note that after a data type is changed, you need to run an INSERT OVERWRITE on the full Hive table; otherwise data type inconsistencies may occur when historical data is loaded into ClickHouse. The relevant table-creation statement is as follows:

-- distributed table: routes reads and writes to the local tables on each shard
CREATE TABLE dbname.distributed_table_name on cluster clustername
(
    distinct_id String,
    event String,
    staticdate Date,
    project String,
    event_date_time Nullable(DateTime)
) ENGINE = Distributed('clustername', 'default', localtablename, rand());
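The Distributed table above only routes reads and writes; each shard also needs a local MergeTree table that actually stores the data. Below is a minimal sketch of such a local table; the local table name, partition key, and sorting key are assumptions and should follow your own tracking model.

CREATE TABLE default.localtablename on cluster clustername
(
    distinct_id String,
    event String,
    staticdate Date,
    project String,
    event_date_time Nullable(DateTime)
) ENGINE = MergeTree()
PARTITION BY staticdate                    -- day-level partitions make it cheap to delete and rewrite one day's data
ORDER BY (event, distinct_id, staticdate); -- sorting key (Nullable columns such as event_date_time cannot be used here by default)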
2: Which tracking data is written to ClickHouse

The historical event tracking in our app is quite chaotic and has gone through N generations of the product; understanding of the tracking points was never unified, and the simplest move was always to add new ones. We can only standardize the system for newly added tracking points and gradually replace the old ones. Therefore, not all tracking data is written to ClickHouse. When the business selects an event, the system checks whether ClickHouse has already synchronized the relevant data; if no historical data has been synchronized, it is backfilled into ClickHouse via Spark Launcher, and real-time data is written to ClickHouse from the next day on. (Commonly analyzed tracking events were already initialized when the system was set up.) If you need to launch Spark jobs from a service in practice, Livy is recommended: the whole interaction can be done through its REST API, which I will not expand on here.
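The check mentioned above can be as simple as counting whether the selected event already has rows in ClickHouse. A minimal sketch, reusing the assumed table from the previous section (the project and event values are placeholders):

SELECT count() AS cnt
FROM dbname.distributed_table_name
WHERE project = 'your_project'
  AND event = 'selected_event';
-- cnt = 0 means no history has been synchronized yet, so a Spark backfill job is submitted first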

Note that writing data into ClickHouse requires a unique key; we use user ID + event + event_time as the unique key, and our app does not allow multiple simultaneous logins.

 val frame: DataFrame = spark.sql(readSQL)
      .na.fill("") // give primary-key fields a default value when they are null
    frame.write.format("jdbc")
      .mode(SaveMode.Append)
      .option("driver", clickHouseProperties.getProperty("jdbc.driver"))
      .option("url", clickHouseProperties.getProperty("jdbc.url"))
      .option("dbtable", projectMappingMap(eventInfo.getProject).getCkSchema)
      .option("user", clickHouseProperties.getProperty("jdbc.username"))
      .option("password", clickHouseProperties.getProperty("jdbc.password"))
      .option("batchsize", "100000")
      .option("isolationLevel", "NONE") // ClickHouse does not support transactions
      .option("numPartitions", partition)
      .save()
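For reference, the readSQL passed to spark.sql above is just a Hive query over the tracking table with the columns from the global attribute table. A hypothetical sketch (the Hive table name and the date variable are assumptions):

SELECT distinct_id, event, staticdate, project, event_date_time
FROM dw.app_event_tracking               -- assumed Hive source table
WHERE event = 'selected_event'
  AND staticdate < '${sync_start_date}'  -- only the history that has not yet been synchronized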
3: Writing real-time data with Flink

ClickHouse does not support transactions for writes, so if the real-time task fails, the Hive data and the ClickHouse data can become inconsistent. Our task scheduler is DolphinScheduler, and we added a recovery mechanism for when a task dies: before the scheduled restart, the current day's data is deleted from ClickHouse, the Kafka consumer offsets are reset to the corresponding time, and all of the day's data is rewritten.

# The stream job starts by deleting the current day's data (${1} is the dynamic start date)
clickhouse-client --port 9002 -u xxx --password xxx -h xxx.com --query "alter table tablename on cluster clustername delete where staticdate >= '${1}'"
# Then the Kafka consumer offsets are reset (${2} is the dynamic time point to rewind to)
kafka-consumer-groups.sh --bootstrap-server xxx:9092 --group ClickHouseSyncStream --reset-offsets --topic event --execute --to-datetime ${2}T15:59:00.000

Flink writes to ClickHouse through a window plus a trigger that fires the sink in batches (note that we ran into ClickHouse write failures in production, so it is recommended to set a larger socket_timeout, e.g. socket_timeout=600000).

 eventStream.process(new EventFilterProcess()) // filter
        .connect(broadcast)
        .process(new EventParseProcess(startSyncTime, eventInfoStateDescriptor))
        .keyBy(new EventKeyBy(100))
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .trigger(new CustomCountTriggerWithEventTime<>(3000)) // custom count/event-time trigger

4: How the business system queries the data

The business system must be designed so that business colleagues can operate it themselves and filter the relevant data. For day-to-day real-time analysis, the relevant scenarios can be set up through BI, and the resulting funnels and custom reports shared with the business. ClickHouse queries respond with a certain delay, so it is recommended that the front end polls for results and the service side handles requests asynchronously; Spring Boot's @Async can be configured to complete the asynchronous processing. The system only needs to use ClickHouse's windowFunnel to complete funnel calculations, and custom reports can display grouped data through ClickHouse's groupArray (sketches of both follow the async configuration below).

@EnableAsync
public class ClickhouseServiceApplication {}
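For illustration, a three-step funnel with windowFunnel and a grouped report with groupArray might look like the sketches below; the event names, the one-hour window, and the step order are assumptions, and the table is the one assumed earlier.

-- funnel: how many users reached each step within a 3600-second window today
SELECT level, count() AS users
FROM
(
    SELECT
        distinct_id,
        windowFunnel(3600)(assumeNotNull(event_date_time),
                           event = 'view_page',
                           event = 'click_buy',
                           event = 'pay_success') AS level
    FROM dbname.distributed_table_name
    WHERE staticdate = today()
    GROUP BY distinct_id
)
GROUP BY level
ORDER BY level;

-- custom report: collect up to 10 sample events per project
SELECT project, groupArray(10)(event) AS sample_events
FROM dbname.distributed_table_name
WHERE staticdate = today()
GROUP BY project;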

We connect to ClickHouse through JDBC and register a dedicated JdbcTemplate:

@Configuration
public class ClickhouseDataSourceConfig implements BeanFactoryPostProcessor, EnvironmentAware {
    private ConfigurableEnvironment environment;
    @Override
    public void postProcessBeanFactory(ConfigurableListableBeanFactory configurableListableBeanFactory) throws BeansException {
        ClickHouseProperties clickHouseProperties=resolverSetting();
        createDataSourceBean(configurableListableBeanFactory, clickHouseProperties);
    }

    public void createDataSourceBean(ConfigurableListableBeanFactory beanFactory, ClickHouseProperties sqlProperties) {

        DataSource baseDataSource = clickHouseDruidDatasource(sqlProperties);
        // register the ClickHouse JdbcTemplate as a singleton bean with an alias
        register(beanFactory, new JdbcTemplate(baseDataSource), "clickhouseJdbcTemplate", "clickhouse");
    }

    private void register(ConfigurableListableBeanFactory beanFactory, Object bean, String name,
                          String alias) {
        beanFactory.registerSingleton(name, bean);
        if (!beanFactory.containsSingleton(alias)) {
            beanFactory.registerAlias(name, alias);
        }
    }

    @Override
    public void setEnvironment(Environment environment) {
        this.environment = (ConfigurableEnvironment) environment;
    }

    private ClickHouseProperties resolverSetting() {

        Iterable<ConfigurationPropertySource> sources = ConfigurationPropertySources.get(environment);
        Binder binder = new Binder(sources);
        BindResult<ClickHouseProperties> bindResult = binder.bind("clickhouse", ClickHouseProperties.class);
        ClickHouseProperties clickHouseProperties= bindResult.get();
        return  clickHouseProperties;
    }

    public DruidDataSource clickHouseDruidDatasource(ClickHouseProperties clickHouseProperties) {
        DruidDataSource datasource = new DruidDataSource();
        datasource.setUrl(clickHouseProperties.getUrl());
        datasource.setUsername(clickHouseProperties.getUsername());
        datasource.setPassword(clickHouseProperties.getPassword());
        datasource.setDriverClassName(clickHouseProperties.getDriverClassName());
        // pool configuration
        datasource.setInitialSize(clickHouseProperties.getInitialSize());
        datasource.setMinIdle(clickHouseProperties.getMinIdle());
        datasource.setMaxActive(clickHouseProperties.getMaxActive());

        datasource.setMaxWait(clickHouseProperties.getMaxWait());
        datasource.setTimeBetweenEvictionRunsMillis(clickHouseProperties.getTimeBetweenEvictionRunsMillis());
        datasource.setMinEvictableIdleTimeMillis(clickHouseProperties.getMinEvictableIdleTimeMillis());
        datasource.setValidationQuery(clickHouseProperties.getValidationQuery());
        return datasource;
    }
}

5: Summary

ClickHouse has very high write performance and relatively reasonable memory and CPU usage. If your company's data volume is at the millions level and machine resources are not abundant, ClickHouse is recommended; it can support the related business well. Of course, Impala + Kudu is also widely used by many large Internet companies. In my opinion, whatever suits you best is the best choice.