Apache Druid is a real-time analytics database designed for fast queries and analysis over large data sets ("OLAP" queries).

Druid is most commonly used as the database backing use cases where real-time ingestion, fast query performance, and high uptime are important. For example, Druid frequently serves as the data source powering the GUIs of analytical applications, or as a backend for highly concurrent APIs that need fast aggregations. Druid works best with event-oriented data.

Common application areas where Druid is used include:

  • Clickstream analytics (web and mobile analytics)
  • Network telemetry analytics (network performance monitoring)
  • Server metrics storage
  • Supply chain analytics (manufacturing metrics)
  • Application performance metrics
  • Digital marketing/advertising analytics
  • Business intelligence/OLAP

Druid’s core architecture combines ideas from data warehouses, time-series databases, and log search systems.

If you are not familiar with these kinds of systems, we recommend doing a bit of reading on their definitions and characteristics before continuing.

Some of Druid’s key features include:

  1. Columnar storage format. Druid uses column-oriented storage, which means a query only needs to load the specific columns it references. This greatly improves performance for queries that touch only a handful of columns. In addition, each column is stored in a layout optimized for its data type, which supports fast scans and aggregations (see the query sketch after this list).
  2. Scalable distributed system. Druid is typically deployed in clusters of tens to hundreds of servers, and can offer ingest rates of millions of records per second, retention of trillions of records, and query latencies ranging from hundreds of milliseconds to a few seconds.
  3. Massively parallel processing. Druid can process each query in parallel across the entire cluster.
  4. Real-time or batch ingestion. Druid can ingest data either in real time (ingested data is immediately available for querying) or in batches (see the batch ingestion sketch after this list).
  5. Self-healing, self-balancing, easy to operate. To scale the cluster, operators simply add or remove services; the cluster rebalances itself automatically in the background without any downtime. If a Druid server fails, the system automatically routes around the damaged node and keeps running uninterrupted. Druid is designed to run 24/7 with no need for planned downtime for any reason, including configuration changes and software updates.
  6. Cloud-native, fault-tolerant architecture that won’t lose data. Once Druid has ingested your data, a copy is stored safely in deep storage (typically cloud storage, HDFS, or a shared file system). Your data can be recovered from deep storage even if individual Druid servers fail. For more limited failures that affect only a few Druid servers, replication ensures that queries can still be served while the system recovers.
  7. Indexes for quick filtering. Druid uses Roaring or CONCISE compressed bitmap indexes to power fast filtering and searching across multiple columns.
  8. Time-based partitioning. Druid first partitions data by time, and can additionally partition by other fields. This means time-based queries only access the partitions that match the query’s time range, which greatly improves performance for time-based workloads (the filtered query sketch after this list shows both a dimension filter and a time filter).
  9. Approximate algorithms. Druid includes algorithms for approximate count-distinct, approximate ranking, and approximate histogram and quantile computation. These algorithms have bounded memory usage and are often substantially faster than exact computation. For situations where accuracy matters more than speed, Druid also offers exact count-distinct and exact ranking (see the approximate count sketch after this list).
  10. Automatic summarization at ingest time. Druid optionally supports data summarization (roll-up) during ingestion. This summarization partially pre-aggregates your data, and can lead to significant cost savings and performance boosts (see the roll-up configuration sketch after this list).
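
To make the columnar storage point (item 1) concrete, here is a minimal sketch of issuing a Druid SQL query over HTTP that reads only two columns. It assumes a Druid router listening on localhost:8888 (the quickstart default) and a hypothetical datasource named `wikipedia` with `channel` and `added` columns; none of these names come from the article.

```python
import requests

# Assumption: a Druid router is reachable at localhost:8888; the "wikipedia"
# datasource and its columns are hypothetical examples for illustration.
DRUID_SQL_ENDPOINT = "http://localhost:8888/druid/v2/sql"

# Because Druid stores data column by column, this query only needs to read
# the "channel" and "added" columns, not every column in the datasource.
query = """
SELECT channel, SUM(added) AS total_added
FROM wikipedia
GROUP BY channel
ORDER BY total_added DESC
LIMIT 5
"""

response = requests.post(DRUID_SQL_ENDPOINT, json={"query": query})
response.raise_for_status()
for row in response.json():
    print(row)
```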
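
For the batch side of item 4, the sketch below submits a native batch (`index_parallel`) task to Druid's task-submission API. The endpoint is the standard task API (proxied by the router), but the file path, datasource name, and column names are placeholders chosen for illustration, not values from the article.

```python
import requests

# Assumption: the Druid router at localhost:8888 proxies the Overlord task API.
TASK_ENDPOINT = "http://localhost:8888/druid/indexer/v1/task"

# A minimal native batch ingestion spec. The file path, datasource name, and
# column names below are placeholders.
ingestion_task = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "local",
                "baseDir": "/tmp/druid-data",
                "filter": "events-*.json",
            },
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page"]},
            "granularitySpec": {"segmentGranularity": "day"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

response = requests.post(TASK_ENDPOINT, json=ingestion_task)
print(response.json())  # returns the task id on success
```

Once the task completes, the ingested data is queryable through the same SQL endpoint used above; real-time ingestion instead streams data in continuously (for example from Kafka) and makes rows queryable as they arrive.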
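
Items 7 and 8 come together in practice: a WHERE clause on the `__time` column lets Druid prune partitions (segments) outside the query's time range, while the filter on a string dimension is the kind of predicate that the compressed bitmap indexes accelerate. A minimal sketch against the same hypothetical `wikipedia` datasource:

```python
import requests

DRUID_SQL_ENDPOINT = "http://localhost:8888/druid/v2/sql"  # assumed router address

# The __time filter restricts the query to segments inside that interval
# (time-partition pruning); the channel filter is the kind of predicate the
# bitmap indexes speed up. Datasource and column names are hypothetical.
query = """
SELECT page, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= TIMESTAMP '2016-06-27 00:00:00'
  AND __time <  TIMESTAMP '2016-06-28 00:00:00'
  AND channel = '#en.wikipedia'
GROUP BY page
ORDER BY edits DESC
LIMIT 10
"""

for row in requests.post(DRUID_SQL_ENDPOINT, json={"query": query}).json():
    print(row)
```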
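
For item 9, Druid SQL exposes approximate distinct counting through the APPROX_COUNT_DISTINCT function. A minimal sketch, again with hypothetical names:

```python
import requests

DRUID_SQL_ENDPOINT = "http://localhost:8888/druid/v2/sql"  # assumed router address

# APPROX_COUNT_DISTINCT trades a small amount of accuracy for bounded memory
# use and speed. Column and datasource names are hypothetical placeholders.
query = """
SELECT
  FLOOR(__time TO HOUR) AS event_hour,
  APPROX_COUNT_DISTINCT(user_id) AS approx_unique_users
FROM wikipedia
GROUP BY FLOOR(__time TO HOUR)
ORDER BY event_hour
"""

print(requests.post(DRUID_SQL_ENDPOINT, json={"query": query}).json())
```

When exact results are required, COUNT(DISTINCT ...) can be used with approximate counting disabled through the query context, at the cost of more memory and time.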
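
For item 10, roll-up is configured in the dataSchema of an ingestion spec: the granularitySpec enables roll-up and sets the query granularity, and a metricsSpec lists the pre-aggregations to compute. The fragment below would slot into a spec like the batch ingestion sketch above; the column names remain placeholders.

```python
# Roll-up configuration fragment for the "dataSchema" section of an ingestion
# spec (see the batch ingestion sketch above). Column names are placeholders.
rollup_data_schema = {
    "dataSource": "events_rolled_up",
    "timestampSpec": {"column": "timestamp", "format": "iso"},
    # Only these dimensions are kept; rows that share the same dimension values
    # within the same hour are collapsed into a single pre-aggregated row.
    "dimensionsSpec": {"dimensions": ["page", "country"]},
    "metricsSpec": [
        {"type": "count", "name": "row_count"},
        {"type": "longSum", "name": "total_bytes", "fieldName": "bytes"},
    ],
    "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "hour",  # timestamps truncated to the hour before roll-up
        "rollup": True,
    },
}
```

With this configuration, Druid stores one row per unique hour/page/country combination instead of one row per raw event, which is where the storage and query-time savings come from.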

www.ossez.com/t/druid/136…