“This is the 14th day of my participation in the November Gengwen Challenge. Check out the event details: The Last Gengwen Challenge of 2021.”

Hello everyone, I am Huaijin Shake Yu, a big-data newbie with two little "gold-swallowing beasts" at home, Jia and Jia, and an all-around dad who can both code and teach.

If you like my articles, please [follow ⭐] + [like 👍] + [comment 📃]; your "triple combo" is my motivation. I look forward to growing together with you!


1. Background

When I first got into big data, the very first project I took over greeted me with a ZIP package of more than 200 MB. Every build took ages, and every submission to Azkaban meant staring at a crawling progress bar. It was painful, and packaging for anyone else meant yet more waiting on that same progress bar.

2. The Optimization Begins

When the project is ready for refactoring, this problem must be solved.

2.1 Splitting the Project

Restructure the project: divide it by function, and pull the shared pieces out into a common sub-project, as shown below.

```
project
 |- xxx-api
 |- xxx-common
 |- xxx-context
 |- xxx-business
```
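With a layout like this, the build only needs to package the module you actually ship. As a minimal sketch, assuming a Maven multi-module build (Maven itself is my assumption; the post does not name the build tool, and the module names are the placeholders from the tree above):

```bash
# Package only the business module plus the sibling modules it depends on:
# -pl picks the module, -am ("also make") builds its in-project dependencies.
mvn -pl xxx-business -am clean package -DskipTests
```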

2.2 Extracting Common Script Parameters

In the original code, each task's parameters were inconsistent: the parameters varied, their order varied, and the location of the configuration file varied, which made troubleshooting a real pain. After the optimization, every task's parameters follow one unified format, and all common settings can be supplied externally.

```
Spark input parameters: main class, flow, database, run date, configuration file, queue, version number, end date
```
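To make that concrete, here is a minimal sketch of what such a unified entry script could look like. The wrapper itself, the positional order, and the artifact name business-${version}.jar are all illustrative assumptions, not the project's actual script:

```bash
#!/usr/bin/env bash
# Hypothetical unified wrapper: every task is launched with the same
# parameters in the same order, so any job can be debugged the same way.
main_class=$1   # fully qualified main class
flow=$2         # flow / workflow name
database=$3     # target database, always supplied externally
run_date=$4     # run date
conf_file=$5    # configuration file path
queue=$6        # YARN queue
version=$7      # version number, for standardized management
end_date=$8     # end date

# business-${version}.jar is a hypothetical artifact name.
/bin/spark-submit \
  --class "${main_class}" \
  --queue "${queue}" \
  "business-${version}.jar" \
  "${flow}" "${database}" "${run_date}" "${conf_file}" "${version}" "${end_date}"
```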

In particular, the database must be supplied as an external parameter; this lays the groundwork for running multiple businesses and multiple versions side by side later on.

The version number mainly exists so that subsequent optimizations can be managed in a standardized way.

2.3 Moving Static Files out of the JAR Package to HDFS

The project was huge mainly because all the dependency JARs were referenced locally and had to be uploaded with every task submission, which was a waste of time. So every JAR that does not belong to one of our own sub-projects was moved out of the package and onto HDFS. This shrinks the task artifact itself dramatically, and tagging the path with a version number lets multiple versions run at the same time without conflicts.

```
/bin/spark-submit \
  --jars hdfs:///xx/java/lib/${version}/*.jar \
  ...
```
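For the other half of this workflow, the dependency JARs have to be published to that versioned directory once per release. A minimal sketch, assuming the dependencies are collected locally under target/lib (that local directory is my assumption; the HDFS path matches the snippet above):

```bash
# Publish this release's dependency JARs to a version-tagged HDFS directory,
# so spark-submit can reference them remotely instead of re-uploading each run.
version=1.0.0
hdfs dfs -mkdir -p "/xx/java/lib/${version}"
hdfs dfs -put -f target/lib/*.jar "/xx/java/lib/${version}/"
```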

After this optimization, the main task package shrank from 200 MB straight down to 10 MB, and uploading it to Azkaban now takes the blink of an eye.


3. Conclusion

If you like my articles, please [follow ⭐] + [like 👍] + [comment 📃]; your "triple combo" is my motivation. I look forward to growing together with you!

You can also follow my public account, "Huaijin Shake Yu Jia and Jia", to get resource downloads.