5 Troubleshooting 5: Solving errors caused by NULL returns from operator functions
- Some operator functions must return a value, but in some cases there is no meaningful value to return. Returning NULL directly causes an error, such as a scala.Math(NULL) exception.
- If you encounter a situation where no value should be returned, you can resolve it as follows (see the sketch after this list):
- Return a special placeholder value instead of NULL, for example -1.
- After obtaining the resulting RDD from the operator, apply the filter operator to it to filter out the records whose value is -1.
- After filtering, call the coalesce operator to reduce the number of partitions, since filtering may leave many partitions sparsely populated.
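A minimal Scala sketch of this pattern follows; the input data, the -1 placeholder, and the partition count passed to coalesce are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NullReturnWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("NullReturnWorkaround").setMaster("local[*]"))

    val raw = sc.parallelize(Seq("3", "7", "oops", "12"), numSlices = 4)

    // Instead of returning null for unparseable records, return a -1 placeholder.
    val parsed = raw.map { s =>
      try s.toInt
      catch { case _: NumberFormatException => -1 }
    }

    // Filter out the -1 placeholder records...
    val valid = parsed.filter(_ != -1)

    // ...then coalesce, since filtering can leave partitions sparsely populated.
    val compacted = valid.coalesce(2)

    compacted.collect().foreach(println)
    sc.stop()
  }
}
```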
6 Troubleshooting 6: JVM memory overflow in yarn-cluster mode prevents job execution
[Figure: working principle of yarn-cluster mode]
- If a Spark job contains SparkSQL, it may run in yarn-client mode but fail with an OOM error in yarn-cluster mode.
- In yarn-client mode, the Driver runs on the local machine, and the PermGen size of the JVM that Spark uses is configured by the local spark-class file: the permanent generation is 128MB, which is sufficient. In yarn-cluster mode, the Driver runs on a node in the YARN cluster with unconfigured default settings, where the permanent generation is only 82MB.
- Internally, SparkSQL performs very complex SQL semantic parsing, syntax-tree conversion, and so on. If the SQL statement itself is complex, this is likely to cause performance loss and heavy memory consumption, especially in PermGen.
- Therefore, if PermGen usage is greater than 82MB but less than 128MB, the job can run in yarn-client mode but not in yarn-cluster mode.
- To increase the PermGen capacity, set the corresponding parameter in the spark-submit script, as shown below.
- Passing `--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"` sets the Driver's permanent generation to an initial size of 128MB and a maximum of 256MB, which avoids the problem described above.
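For reference, a complete spark-submit invocation carrying this option might look like the following sketch; the spark-submit path, main class, JAR path, and deploy settings are hypothetical:

```bash
/usr/local/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkJob \
  --conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M" \
  /path/to/my-spark-job.jar
```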
7 Troubleshooting 7: Solving JVM stack memory overflow caused by SparkSQL
- When a SparkSQL statement contains hundreds or thousands of OR keywords, a JVM stack memory overflow can occur on the Driver side.
- JVM stack memory overflow is usually the result of too many levels of method calls, producing very deep recursion that exceeds the depth limit of the JVM stack. (We guess that when SparkSQL parses a statement with many OR clauses, for example while converting it to a syntax tree or generating the execution plan, the handling of OR is recursive, so a large number of OR clauses triggers a large amount of recursion.)
- In this case, you are advised to split the single SQL statement into multiple statements, ensuring that each contains fewer than 100 clauses. Based on experiments in a production environment, limiting the OR keywords in one SQL statement to 100 usually does not result in JVM stack memory overflow. A sketch of this batching approach follows.
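A minimal Scala sketch of the splitting idea, assuming a hypothetical `events` table with an `id` column and an existing SparkSession:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SplitOrQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SplitOrQuery").getOrCreate()

    // Hypothetical: 1000 ids that would otherwise become 1000 OR clauses.
    val ids = 1 to 1000

    // Run one query per batch of at most 100 OR clauses...
    val parts: Seq[DataFrame] = ids.grouped(100).map { batch =>
      val predicate = batch.map(id => s"id = $id").mkString(" OR ")
      spark.sql(s"SELECT * FROM events WHERE $predicate")
    }.toSeq

    // ...then union the partial results into the final answer.
    val result = parts.reduce(_ union _)
    result.show()
  }
}
```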
8 Troubleshooting 8: Using persistence and checkpointing
- Spark persistence works well in most cases, but cached data can sometimes be lost. If it is lost, it has to be recomputed and re-cached before it can be used again. To guard against this, you can checkpoint the RDD, which persists the data to a fault-tolerant file system such as HDFS.
- After a cached RDD has been checkpointed, if the cache is lost the system first checks whether checkpoint data exists; if it does, the checkpoint data is used instead of recomputing. In other words, the checkpoint acts as a safeguard for the cache: if the cache fails, the checkpoint data is used.
- The advantage of checkpointing is improved Spark job reliability: if the cache fails, the data does not need to be recomputed. The disadvantage is that writing the data to a file system such as HDFS carries a performance cost. A sketch of combining cache and checkpoint follows.
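A minimal Scala sketch of caching plus checkpointing; the checkpoint directory and input path are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheAndCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheAndCheckpoint"))

    // Checkpoint data goes to a fault-tolerant file system; this path is hypothetical.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical input
    val lengths = lines.map(_.length)

    lengths.cache()      // keep the RDD in memory for fast reuse
    lengths.checkpoint() // also write it to the checkpoint dir as a safeguard

    // The first action materializes the cache and triggers the checkpoint job;
    // caching before checkpointing avoids recomputing the RDD a second time.
    println(lengths.count())
    sc.stop()
  }
}
```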