An overview of the

Presto architecture

Presto is a distributed query engine that does not store data itself, but can access multiple data sources and support cascading queries across them.

Presto’s architecture is divided into:

Coodinator: Parses SQL statements, generates execution plans, and distributes execution tasks to Worker nodes.

Discovery Server: After a Worker node is started, it registers with the Discovery Server service, and the Coordinator obtains a working Worker node from the Discovery Server.

Worker: Performs actual query tasks and accesses the underlying storage system.

Storage: Presto data can be stored in HDFS/OBS. You are advised to store hot data in HDFS and cold data in OBS.

Memory tuning

Principles of Memory Management

Presto has three memory pools: GENERAL_POOL, RESERVED_POOL, and SYSTEM_POOL.

GENERAL_POOL: Physical operators for normal queries. The GENERAL_POOL value is total memory (Xmx value) – reserved (max-memory-per-node) – system (0.4 * Xmx).

SYSTEM_POOL: reserved memory for reading and writing buffers, worker initialization, and task execution. The size is specified by resources. Reserved-system-memory in config.properties. The default is JVM Max Memory * 0.4.

RESERVED_POOL: The Query is used only when both of the following conditions are met. The Query that occupies the most memory is obtained from all the queries and executed in the RESERVED_POOL. Note that the RESERVED_POOL can only be used for one Query. The size is specified by the query.max-memory-per-node in config.properties. The default value is JVM Max memory * 0.1.

1. A node in the GENERAL_POOL is blocked. That is, the node has no memory

2. The RESERVED_POOL is not used

  • Query. max-memory: Represents the maximum sum of memory available for a single query distributed across all related nodes.
  • Query. max-memory-per-node: indicates the maximum amount of user memory available for a single query on a single node.
  • Query. max-total-memory-per-node: indicates the maximum amount of memory and system memory available on a node for a single query. System memory is the memory used by readers, writers and network buffers during execution.
  • Memory.heap-headroom-per-node: This memory is mainly allocated by third-party libraries and cannot be tracked by statistics. The default value is -xmx * 0.3

Note:

1, query.max-memory-per-node is smaller than query.max-total-memory-per-node.

The sum of query.max-total-memory-per-node and memory.heap-headroom-per-node must be less than -xmx configured in JVM Max memory (jVm. config).

Presto memory configuration

Memory tuning parameters

Operating scenarios

Presto is a completely memory-based calculation that often appears in OOM and requires memory adjustment.

Modify the parameters

Common OOM error

Query exceeded per-node total memory limit of xx

Add query.max-total-memory-per-node appropriately.

Query exceeded distributed user memory limit of xx

Increase query.max-memory appropriately.

Could not communicate with the remote task. The node may have crashed or be under too much load

The node crashed due to insufficient memory. You can view /var/log/message.

parallelism

Operating scenarios

Adjust the number of threads to increase the concurrency of tasks to improve efficiency.

Modify the parameters

Metadata caching

Operating scenarios

Presto supports Hive Connector. Metadata is stored in Hive MetaStore. Adjust metadata cache parameters to improve metadata access efficiency.

Modify the parameters

Hash optimization

Operating scenarios

Optimization for Hash scenarios.

Modify the parameters

Optimized OBS parameters

Operating scenarios

Presto supports on OBS. During OBS reading and writing, you can adjust OBS client parameters to submit read and write efficiency.

Modify the parameters


Click to follow, the first time to learn about Huawei cloud fresh technology ~