Spark interview questions
- Spark Interview Question (1)
- Spark Interview Question (2)
- Spark Interview Question (3)
- Spark Interview Question (4)
- Spark Interview Question (5) — Data skew tuning
- Spark Interview Question (6) — Tuning Spark resources
- Spark Interview Question (7) — Tuning Spark application development
- Spark Interview Question (8) — Optimizing the Spark shuffle configuration
1. How many deployment modes does Spark have?
1) Local mode: Spark does not need a Hadoop cluster; the application runs locally in multiple threads, which is convenient for debugging. Local mode has three variants: local (start only one worker thread), local[k] (start k worker threads), and local[*] (start as many worker threads as there are CPU cores).
2) Standalone mode: distributed deployment with Spark's own complete set of services; resource management and task monitoring are handled by Spark itself. This mode is the basis for the other modes.
3) Spark on YARN mode: distributed deployment in which resource management and task monitoring are handled by YARN. Currently only the coarse-grained resource allocation mode is supported. It offers cluster and client deploy modes: cluster mode is suitable for production, with the driver running on a cluster node and benefiting from fault tolerance; client mode is suitable for debugging, with the driver running on the client.
4) Spark on Mesos mode: this mode is officially recommended (one reason, of course, being their shared lineage). Because Spark was developed with Mesos support in mind, Spark runs more flexibly and naturally on Mesos than on YARN. Applications can use one of two scheduling modes:
Coarse-grained mode: the running environment of each application consists of one driver and several Executors. Each Executor occupies several resources and can run multiple tasks (corresponding to the number of "slots"). Before any task of the application runs, all resources of the running environment must be requested, and they are held for the entire run even if unused; they are reclaimed only after the program finishes.
Fine-grained mode: since the coarse-grained mode wastes resources, Spark on Mesos also provides a fine-grained scheduling mode. This mode resembles today's cloud computing: the idea is to allocate resources on demand.
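As a quick illustration of how the deployment mode is selected in practice, here is a minimal Scala sketch (class and application names are placeholders): the master URL decides whether the job runs in local, Standalone, YARN, or Mesos mode, and in production it is normally supplied through spark-submit rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the master URL selects the deployment mode.
object DeployModeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deploy-mode-demo")
      // local[*]: run locally with one worker thread per CPU core; handy for debugging.
      // Replace with "local[4]", "spark://master:7077", "yarn", or "mesos://host:5050"
      // to target the other deployment modes described above.
      .master("local[*]")
      .getOrCreate()

    println(spark.sparkContext.parallelize(1 to 10).sum())
    spark.stop()
  }
}
```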
2. Why is Spark faster than MapReduce?
1) Memory-based computation, which reduces inefficient disk interaction; 2) an efficient, DAG-based scheduling algorithm; 3) a fault-tolerance mechanism based on lineage, whose essence is again the DAG plus lineage information.
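To make the lineage point concrete, the sketch below (paths and names are illustrative) prints an RDD's lineage; this chain of parent RDDs is what Spark replays to recompute lost partitions instead of replicating data.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: inspect the lineage (chain of transformations) behind an RDD.
val spark = SparkSession.builder().appName("lineage-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words  = sc.textFile("input.txt").flatMap(_.split("\\s+"))  // "input.txt" is a placeholder path
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// toDebugString shows the chain of parent RDDs; if a partition is lost,
// Spark replays only this chain rather than re-reading or replicating everything.
println(counts.toDebugString)
```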
3. What are the similarities and differences between Hadoop and Spark shuffle?
1) From a high-level perspective, the two are not very different. Both partition the output of the mapper (a ShuffleMapTask in Spark); different partitions are sent to different reducers (in Spark, the reducer can be a ShuffleMapTask or a ResultTask of the next stage). The reducer uses memory as a buffer while it shuffles and aggregates the data, and runs reduce() once the data has been aggregated (in Spark, a series of subsequent operations may follow).
2) From a low-level perspective, the two are quite different. Hadoop MapReduce is sort-based: records entering combine() and reduce() must first be sorted. The advantage is that it can process very large data sets, because the input can be obtained by spilling and merging (the mapper sorts each spill first, and the reducer merge-sorts the sorted spills during shuffle). Spark currently uses hash-based aggregation by default: it generally uses a HashMap to aggregate shuffled data without sorting in advance. If the user needs sorted data, they call something like sortByKey() themselves. A Spark 1.1 user can set spark.shuffle.manager=sort to enable sort-based shuffle; in Spark 1.2, sort became the default shuffle implementation.
3) There are also differences from an implementation perspective. Hadoop MapReduce divides processing into distinct phases: map(), spill, merge, shuffle, sort, and reduce(). Each phase has its own role, and they can be implemented one by one following a procedural programming style. In Spark there are only stages and a series of transformations, so the spill, merge, and aggregate operations must be embedded in the transformations. If we call partitioning and persisting data on the map side "shuffle write", and reading and aggregating data "shuffle read", then in Spark the question becomes how to add shuffle write and shuffle read logic to the logical or physical execution graph of the job, and how to implement both efficiently. The shuffle write task itself is simple: partition data and persist it. Persistence reduces memory pressure on one hand and provides fault tolerance on the other.
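As a small illustration of the hash-versus-sort distinction on the Spark side, the sketch below uses the legacy 1.x configuration key mentioned above and shows how a user asks for sorted output explicitly; values and data are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.shuffle.manager is the legacy 1.x switch discussed above
// (hash vs. sort); from Spark 1.2 onward "sort" is already the default, and in
// recent versions the hash manager has been removed entirely.
val conf = new SparkConf()
  .setAppName("shuffle-demo")
  .setMaster("local[*]")
  .set("spark.shuffle.manager", "sort")

val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))

// Aggregation alone does not guarantee ordering ...
val summed = pairs.reduceByKey(_ + _)
// ... so a user who needs sorted data calls sortByKey() explicitly.
println(summed.sortByKey().collect().mkString(", "))
```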
4. How does Spark work?
1) The driver creates a SparkContext.
2) The SparkContext applies to the resource manager (Standalone, Mesos, or YARN) for Executor resources, and the resource manager starts StandaloneExecutorBackend (the Executor).
3) The Executor applies to the SparkContext for tasks.
4) The SparkContext distributes the application code to the Executors.
5) The SparkContext builds a DAG graph; the DAGScheduler parses the DAG into stages, each stage containing multiple tasks that form a TaskSet, which is sent to the TaskScheduler; the TaskScheduler sends tasks to the Executors for execution.
6) Tasks run on the Executors, and all resources are released after execution.
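To tie these steps to code, here is a minimal driver sketch (names and paths are placeholders): the transformations only build the DAG, and it is the action at the end that makes the DAGScheduler split stages, package TaskSets, and have the TaskScheduler ship tasks to the Executors.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver sketch of the flow above.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // Steps 1-2: the driver creates the SparkContext, which registers with the
    // resource manager and obtains Executors.
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    // Transformations only build the DAG; nothing runs yet.
    val counts = sc.textFile("input.txt")        // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    // Steps 5-6: the action triggers the DAGScheduler to split stages and the
    // TaskScheduler to send tasks to the Executors.
    counts.saveAsTextFile("output")              // placeholder output path

    sc.stop()   // release all resources
  }
}
```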
5. How to optimize Spark?
Spark tuning is complex, but it can be divided into three levels: 1) platform-level tuning: avoid unnecessary distribution of JAR packages, improve data locality, and choose efficient storage formats such as Parquet; 2) application-level tuning: optimize filter operators to avoid too many tiny tasks, reduce the resource overhead per record, handle data skew, reuse and cache RDDs, run jobs in parallel, and so on; 3) JVM-level tuning: set appropriate resources, configure the JVM reasonably, enable an efficient serialization method such as Kryo, increase off-heap memory, and so on (see the sketch below).
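A minimal sketch of the JVM-level points, assuming an illustrative record type and placeholder sizes that should be derived from your own workload:

```scala
import org.apache.spark.SparkConf

// MyRecord is a hypothetical case class used only to show class registration.
case class MyRecord(id: Long, value: String)

// Illustrative JVM-level tuning: Kryo serialization plus off-heap memory.
val conf = new SparkConf()
  .setAppName("tuning-demo")
  // Kryo is faster and more compact than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optionally register classes so Kryo does not have to write full class names.
  .registerKryoClasses(Array(classOf[MyRecord]))
  // Off-heap memory reduces GC pressure for large cached or shuffled data.
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")    // placeholder size
```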
6. In which step is data locality determined?
Which machine a specific task runs on is determined when the DAGScheduler divides the job into stages and assigns tasks according to the preferred locations of their data.
7. What is the elasticity of RDD?
1) Automatic switching between memory and disk storage (see the sketch below); 2) efficient fault tolerance based on lineage; 3) if a task fails, it is automatically retried a configurable number of times; 4) if a stage fails, it is automatically retried a certain number of times, and only the failed shards are recomputed; 5) checkpoint and persist: results can be persisted or cached after computation; 6) flexible data scheduling: DAG and task scheduling are decoupled from resource management; 7) highly elastic data sharding (the number of partitions can be changed as needed).
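A minimal sketch of point 1, with placeholder data: with the MEMORY_AND_DISK storage level, partitions that do not fit in memory automatically spill to disk instead of failing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch of RDD "elasticity" between memory and disk: partitions that do not
// fit in memory under this storage level are spilled to disk automatically.
val spark = SparkSession.builder().appName("persist-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000000).map(i => (i % 100, i.toLong))
rdd.persist(StorageLevel.MEMORY_AND_DISK)

println(rdd.count())                       // first action materialises and caches the RDD
println(rdd.reduceByKey(_ + _).count())    // later actions reuse the cached data
```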
8. What are the drawbacks of RDD?
1) Spark does not support fine-grained writes and updates (such as those needed by a web crawler). Spark writes data in a coarse-grained way, i.e. in batches, to improve efficiency; reads, however, are fine-grained and can be done record by record. 2) Incremental iterative computation is not supported; Flink supports it.
9. Spark shuffle process?
This can be developed from the following three points: 1) how the shuffle process is divided; 2) how the intermediate results of the shuffle are stored; 3) how the shuffled data is pulled. You can refer to this post: www.cnblogs.com/jxhd1/p/652…
10. What kinds of data locality does Spark have?
Spark describes data locality with three main levels: 1) PROCESS_LOCAL: the data is cached in the same executor process on the local node; 2) NODE_LOCAL: the data is read from the local node; 3) ANY: the data is read from a non-local node. Try to read data at the PROCESS_LOCAL or NODE_LOCAL level whenever possible. PROCESS_LOCAL is also related to caching: if an RDD is used frequently, it should be cached in memory. Note that because caching is lazy, an action must be triggered before the RDD is actually cached in memory.
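As an illustrative sketch (the wait values are placeholders), spark.locality.wait controls how long the scheduler waits for a slot at a better locality level before falling back to a worse one:

```scala
import org.apache.spark.SparkConf

// Sketch: tune how long the scheduler waits for a better locality level
// (PROCESS_LOCAL -> NODE_LOCAL -> ... -> ANY) before launching the task
// further from its data. The 3s/6s values are illustrative only.
val conf = new SparkConf()
  .setAppName("locality-demo")
  .set("spark.locality.wait", "3s")          // global fallback wait
  .set("spark.locality.wait.process", "6s")  // wait longer for PROCESS_LOCAL (cached data)
```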
11. Why does Spark persist data, and in what scenarios is the persist operation performed?
Why persist? Spark keeps data in memory by default, and much of what Spark does happens in memory, so it is well suited to fast iteration: even a computation of, say, 1000 steps reads input only at the first step and produces no temporary data on disk in between. If an RDD fails or a partition is lost, it can be recomputed from its lineage, but if the parent RDDs were not persisted or cached, the whole chain has to be redone. Typical scenarios for persist: 1) a step whose computation is very time-consuming; 2) a long computation chain that would take many steps to recover, where caching pays off; 3) the RDD on which checkpoint is called should be persisted first: call cache or rdd.persist to keep the result, and then checkpoint it, so the RDD chain does not need to be recomputed; persist is always performed before checkpoint (see the sketch below); 4) before a shuffle, the framework persists the data to disk by default; this is done automatically.
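A minimal sketch of scenario 3, with placeholder paths and data: cache the RDD before checkpointing it, so the checkpoint job reads the cached data instead of recomputing the whole lineage.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: persist first, then checkpoint, so the checkpoint job reuses the cache.
val spark = SparkSession.builder().appName("checkpoint-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder checkpoint directory

val expensive = sc.parallelize(1 to 1000000).map(i => (i % 100, i * i.toLong))

expensive.cache()        // persist first ...
expensive.checkpoint()   // ... then mark for checkpointing
expensive.count()        // the action triggers both caching and the checkpoint job
```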
12. Share your experience with join optimization.
Joins are commonly divided into two types: map-side join and reduce-side join. When a large table is joined with a small table, a map-side join can improve efficiency significantly. Joining multiple data sets is very common in data processing, but in a distributed computing system it often becomes troublesome, because the join operation provided by the framework generally sends all records to all reduce partitions according to their keys, i.e. a shuffle. This process consumes a large amount of network and disk I/O and is very inefficient; it is commonly called a reduce-side join. If one of the tables is small enough, we can implement the join on the map side ourselves, skip shuffling the large data set, greatly shorten the running time, and, depending on the data, improve performance by several to tens of times. Note: this topic comes up in interviews with very high probability, so be sure to research it thoroughly; the sketch below is only a starting point.
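A minimal sketch of a map-side join using a broadcast variable, with illustrative table contents: the small table is shipped to every executor once, and the large RDD is joined without a shuffle.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: broadcast the small table and join on the map side, avoiding a shuffle.
val spark = SparkSession.builder().appName("mapside-join-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val smallTable = Map(1 -> "electronics", 2 -> "books", 3 -> "clothing")
val bcSmall = sc.broadcast(smallTable)

val largeRdd = sc.parallelize(Seq((1, 19.99), (2, 7.50), (3, 45.00), (1, 5.25)))

// Map-side join: look each key up in the broadcast map; no shuffle is triggered.
val joined = largeRdd.flatMap { case (categoryId, price) =>
  bcSmall.value.get(categoryId).map(name => (name, price))
}

joined.collect().foreach(println)
```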
13. Describe the process of Yarn executing a task.
1) The client submits the Application to the ResourceManager. The ResourceManager accepts the Application and, based on the cluster's resource status, selects a node on which to start the application's task scheduler, the driver (ApplicationMaster). 2) The ResourceManager finds that node and commands its NodeManager to start a new JVM process running the driver (ApplicationMaster) part of the application. When the driver (ApplicationMaster) starts, it registers with the ResourceManager, indicating that it is responsible for running the current program. 3) The driver (ApplicationMaster) downloads the related JAR packages and other resources and, based on that information, applies to the ResourceManager for resources. 4) The ResourceManager receives the request from the driver (ApplicationMaster), satisfies it as far as possible, and sends the metadata of the allocated resources to the driver (ApplicationMaster). 5) After receiving the resource metadata, the driver (ApplicationMaster) sends instructions to the NodeManagers on the specific machines to start specific Containers. 6) A NodeManager receives the command and starts the Container; after the Container starts, it must register with the driver (ApplicationMaster). 7) After receiving the Containers' registrations, the driver (ApplicationMaster) schedules and computes tasks until the tasks are complete. Note: if the ResourceManager cannot satisfy the driver's (ApplicationMaster's) resource request at first and later finds idle resources, it proactively sends the metadata of the available resources to the driver (ApplicationMaster) to provide additional resources for the running application.
14. What are the advantages of Spark on Yarn mode?
1) Cluster resources are shared with other computing frameworks. (If Spark and MapReduce run at the same time without Yarn allocating resources, MapReduce may get too little memory and run inefficiently.) Resources are allocated on demand, which improves cluster resource utilization. 2) Compared with the Standalone mode that Spark provides, Yarn allocates resources at a finer granularity. 3) Application deployment is simplified: after applications of frameworks such as Spark and Storm are submitted by clients, Yarn manages and schedules the resources, using the Container as the unit of resource isolation for memory and CPU. 4) Yarn manages the multiple services running in a Yarn cluster through queues, and resource usage can be adjusted according to the load of different application types, achieving elastic resource management (see the sketch below).
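A rough sketch of running on YARN with per-application resources and a scheduler queue; the queue name and sizes are placeholders for your cluster, and in practice these settings are more often passed as spark-submit flags (e.g. --master yarn --deploy-mode cluster --queue analytics) than hard-coded.

```scala
import org.apache.spark.SparkConf

// Illustrative YARN configuration: a scheduler queue plus per-executor resources.
val conf = new SparkConf()
  .setAppName("yarn-demo")
  .setMaster("yarn")                        // usually supplied via spark-submit --master yarn
  .set("spark.yarn.queue", "analytics")     // YARN scheduler queue (placeholder name)
  .set("spark.executor.memory", "4g")       // placeholder sizes
  .set("spark.executor.cores", "2")
  .set("spark.executor.instances", "10")
```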
15. Tell me about your understanding of Container.
1) The Container is the basic unit of resource allocation and scheduling. It encapsulates resources such as memory, CPU, disk, and network bandwidth; currently Yarn encapsulates only memory and CPU. 2) Containers are applied for by the ApplicationMaster from the ResourceManager; the resource scheduler in the ResourceManager assigns Containers to the ApplicationMaster asynchronously. The ApplicationMaster then launches each Container on the NodeManager where the resources are located; when launching a Container, it must provide the command for executing the task inside the Container.
16. What are the benefits of Spark using the Parquet file storage format? (being fostered fostered fostered fostered)
1) If HDFS is the preferred standard for distributed file systems in the era of big data, then Parquet is the real-time preferred standard for file storage formats in the era of big data. 2) Faster: In most cases, the speed of operating common CSV files and Parquet files using Spark SQL is about 10 times faster than that of operating common files such as CSV files. When common file systems cannot run successfully on Spark, parquet can run successfully in most cases. 3) Parquet’s compression technology is very stable and excellent. In Spark SQL, the compression technology may not complete the work properly (such as resulting in lost task, lost Executor), but it can be completed properly if parquet is used. 4) Greatly reduces disk I/ O. Typically, the storage space can be reduced by 75%, which greatly reduces the input content of Spark SQL data processing. In particular, Spark1.6x has a push-down filter that can greatly reduce disk I/ O and memory usage in some cases. 5) Spark 1.6x Parquet mode greatly improves scanning throughput and data search speed. Compared with Spark1.5g, Spark1.6 is about twice as fast. In Spark1.6x, the CPU of Parquet operation is also greatly optimized. Effectively reduce CPU consumption. 6) Using Parquet can greatly optimize spark scheduling and execution. We tested spark using Parquet to effectively reduce the execution cost of stages and optimize the execution path.
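A small sketch, with placeholder paths and column names, of converting CSV data to Parquet and reading it back with Spark SQL; column pruning and filter push-down mean only the needed columns and row groups are read.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: write a CSV data set as Parquet, then read it back with pruning/push-down.
val spark = SparkSession.builder().appName("parquet-demo").master("local[*]").getOrCreate()

val csv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("events.csv")                                      // placeholder input path

csv.write.mode("overwrite").parquet("events.parquet")     // columnar, compressed output

val result = spark.read.parquet("events.parquet")
  .select("userId", "eventType")                          // placeholder column names
  .filter("eventType = 'click'")

result.show()
```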
17. Describe the relationship between Partition and block.
1) A block in HDFS is the smallest unit of distributed storage. Blocks are of equal size and can be configured with redundancy. Some disk space may be wasted, but because block sizes are uniform, the corresponding content can be located and read quickly. 2) A partition in Spark is the smallest unit of the resilient distributed dataset (RDD), which is composed of partitions distributed across the nodes. A partition is the smallest unit of data in the computing space during a Spark computation; for the same data (RDD), the size and number of partitions vary depending on the operators used in the application and on the number of data blocks initially read. 3) A block lives in storage space while a partition lives in computing space; block size is fixed while partition size is not. They are views of data from two different perspectives.
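A minimal sketch relating blocks to partitions, with a placeholder HDFS path: textFile creates roughly one partition per input split/block by default, and repartition() changes the partition count independently of the underlying block layout.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: partitions follow input splits by default; repartition is a compute-side choice.
val spark = SparkSession.builder().appName("partition-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val logs = sc.textFile("hdfs:///data/logs/")      // placeholder HDFS path
println(s"partitions from input splits: ${logs.getNumPartitions}")

val repartitioned = logs.repartition(200)         // independent of the HDFS block size
println(s"partitions after repartition: ${repartitioned.getNumPartitions}")
```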
18. What is the execution process of a Spark application?
1) Build the running environment of the Spark application (start a SparkContext). The SparkContext registers with the resource manager (Standalone, Mesos, or YARN) and applies for Executor resources. 2) The resource manager allocates Executor resources and starts StandaloneExecutorBackend; the Executors report their status to the resource manager via heartbeats. 3) The SparkContext builds a DAG graph, decomposes it into stages, and sends the TaskSets to the TaskScheduler. The Executors request tasks from the SparkContext, the TaskScheduler issues tasks to the Executors, and at the same time the SparkContext ships the application code to the Executors. 4) Tasks run on the Executors and release all resources when they finish.
19. Is a hash shuffle, which does not sort, necessarily faster than a sort shuffle, which does?
Not necessarily. When the data size is small, hash shuffle is faster than sort shuffle; when the data size is large, sort shuffle is much faster than hash shuffle, because hash shuffle produces a large number of small, unevenly sized files, which can even lead to data skew and consumes a lot of memory. Earlier Spark versions used hash shuffle, which suits small and medium scale data; sort shuffle was added later in the 1.x line (and became the default in 1.2), making Spark more capable of large-scale processing.
20. Defects of sort-based shuffle?
1) If the number of mapper tasks is too large, a large number of small files is still produced; the reducer side then has to deserialize a large number of records at the same time during shuffle data transfer, which consumes a lot of memory, puts a huge burden on the GC, and can make the system slow or even crash. 2) If ordering within a partition is required, the data has to be sorted twice: once on the mapper side and once on the reducer side.
21. What does the spark.storage.memoryFraction parameter do, and how do you tune it in production?
1) It sets the fraction of Executor memory used for persisted RDD data. The default value is 0.6, meaning that by default 60% of an Executor's memory can be used to store persisted RDD data. Depending on the persistence strategy you choose, if that memory is insufficient, data may not be persisted, or may be written to disk instead. 2) If the application performs many persistence operations, you can increase spark.storage.memoryFraction so that more persisted data stays in memory and reads are faster. If the application performs many shuffle operations, which involve a lot of reads and writes in the JVM, the parameter should be lowered to leave more memory for the JVM and avoid excessive GC. If the web UI shows that GC time is very long, consider setting spark.storage.memoryFraction a little lower, as in the sketch below.
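An illustrative sketch of lowering the storage fraction for a shuffle-heavy job with long GC pauses; the 0.4 values are placeholders, and note that these are the legacy (pre-1.6) keys, which in Spark 1.6+ only take effect when spark.memory.useLegacyMode is enabled.

```scala
import org.apache.spark.SparkConf

// Sketch: shift memory from RDD caching to shuffle for a shuffle-heavy workload.
val conf = new SparkConf()
  .setAppName("memory-fraction-demo")
  .set("spark.storage.memoryFraction", "0.4")   // lower the cache share (default 0.6)
  .set("spark.shuffle.memoryFraction", "0.4")   // give more room to shuffle buffers (default 0.2)
```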
22. What is your understanding of unified memory management?
Memory usage in Spark is divided into two parts: execution and storage. Execution memory is used for shuffles, joins, sorts, and aggregations, while storage memory is used for caching and for propagating internal data across the nodes. Prior to 1.6, an Executor's memory consisted of the following parts: 1) ExecutionMemory: the buffer needed to handle shuffles, joins, sorts, and aggregations in order to avoid frequent I/O, configured through spark.shuffle.memoryFraction (0.2 by default). 2) StorageMemory: used for the block cache (rdd.cache, rdd.persist, etc.), broadcasts, and storage of task results, configured through spark.storage.memoryFraction (0.6 by default). 3) OtherMemory: reserved for the system, because the program itself needs memory to run (0.2 by default). Disadvantages of the traditional memory management: 1) shuffle only gets 0.2 x 0.8 of the memory; with so little memory, data is easily spilled to disk, and frequent disk I/O is a heavy burden. This also places high demands on how operators are written. Shuffle memory was allocated through ShuffleMemoryManager, TaskMemoryManager, and ExecutorMemoryManager, and in the worst case one task could take all of the execution memory, leaving the other tasks with nothing to do but wait. 2) By default a task's thread may occupy the entire memory region, fragmenting the data.
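A minimal sketch of the knobs introduced with unified memory management in Spark 1.6, where execution and storage share one region and can borrow unused space from each other; the values shown are the defaults in recent Spark versions and are included only to illustrate what each key controls.

```scala
import org.apache.spark.SparkConf

// Sketch: the unified memory model's two main knobs.
val conf = new SparkConf()
  .setAppName("unified-memory-demo")
  // Fraction of (heap - 300MB reserved) shared by execution and storage.
  .set("spark.memory.fraction", "0.6")
  // Portion of that shared region protected for storage; execution can still
  // borrow from it while it is unused, and storage can borrow unused execution memory.
  .set("spark.memory.storageFraction", "0.5")
```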