<! – I reduced the Spark script running time from 15 hours to 12 minutes by painstakingly tuning for 10 hours! –>
I had a puzzle on Monday and wrote an article on how to extract a specific line from Spark’s Dataframe that suggested several possible solutions.
I didn’t expect to be confronted with this problem so soon, so I used an example that a child could understand to describe what I was doing.
Simple and vivid examples
He says there are several classes in a primary school. Now he wants to rank the children according to their height by class and record them.
The problem was that there was only one height ruler in the school, and the process of measuring, sorting and registering height had to be done in a single classroom, due to both subjective and objective factors such as naughty children. The class that is not measured by the turn, is in the playground activity.
And the most frustrating part for teachers: getting children into the classroom. Height testing, recording, and sorting can take no more than a few minutes, but getting the children into the classroom is a tremendous effort on the part of the teachers, and a very time-consuming one.
The good news is that it takes about as long to get one class into the classroom as it does to get a hundred classes into the classroom at the same time. Therefore, the teacher usually calls all the students directly into the classroom.
But I was faced with a tricky situation. On my playground, there are 2200 classes, each class has 160,000 people. My classroom is also very large, but it certainly can not hold 2200 × 160,000 ≈ 300 million people.
So I thought, I a class a class test, this is the most intuitive, the best management.
“Come, Class One, into the classroom!” . Took ten minutes to get them all in… It took dozens of seconds to measure, arrange, record… “Good! One out! Class Two, come in!” .
So back and forth, and so on to the 2200 class of time, has passed a month…
The janitor spoke: Why don’t you just call them all in? Anyway, there are conditions in front: “organize a class into the classroom, and organize a hundred classes at the same time into the classroom, it takes about the same time.”
That makes sense. That’s what I was doing this morning: making the classroom bigger.
I hired people from the land bureau, I hired engineers, I hired construction crews, I tried everything, and every time I tried to fix it (the kind that could hold 500 million people), the classroom collapsed for whatever reason.
Alas! I have calculated that it can be built in theory.
I couldn’t be reconciled, so I kept trying, and I kept trying, and I kept trying, and hours went by.
This is another judge’s speech: stop fixing the classroom, you divide the children into groups, several classes at a time to come into the classroom!
Fair enough, but it took me some time to change part of the original management logic. In addition, a lot of time was spent debugging.
I initially set 100 classes as a batch into the classroom:
- It turned out that I had to do “call the kids into the classroom” 2200 times (once per class)
- Now I only do “call the children into class” 22 times. See if that’s 100 times faster
Contrast to explain
The above is actually a simplified version of what I did, including:
- The “classroom” is the “memory” of the computer, and you have to bring data into memory in order to sort it or something
- “Entering the classroom” is the “IO operation” of the computer. The memory of the computer is very expensive. Generally, computers are 8G or 16G, while the hard disk is relatively cheap, with 256G or 512G, or even several terabytes.
- “IO operations” are more time consuming than calculations
Here is an excerpt from my work journal (desensitized version) :
First, I called each class into a separate classroom, which was time-consuming.
Starting at about 9:30 am on July 19 and ending at 0:23 am on July 20, there were 2200 columns, each column had 160,000 data, all had to be sorted, and IO was involved, and it took 15 hours. The time used is IO time and processing time for each column:
$$\ alpha \ times time_ {\ text {IO}} * column + \ beta \ \ log_2 times line line {} $$
In comparison with IO, the computation time (such as sorting) is negligible, so the time can be denoted as
$$\alpha \times time_{\text{IO}} * $$
So I wondered if I could “call all the classes in at once,” after all:
- My machine has 8 gigabytes of memory
- Data use 4G at most
I set out to “expand the classroom” and tried it a lot, battling configuration files. Conf, spark-shell, spark-env.cmd, jvm-xmx4G, etc., all morning, to no effect.
I think my attempt has produced an effect, because the original error is not reported, and the process of collect can also be completed (all children can enter the classroom, which was impossible before). However, once the operation is involved (after collect, it will be stuck for a long time, and the due Array cannot be returned), the JVM heap will explode. In addition, after some other adjustments, this garbage collection problem no longer explodes heap, GC overhead limit exceeded.
I have reason to suspect that performance is limited by the hardware.
So I thought, “Divide the children into groups and call several classes into the classroom at a time.”
There are a lot of bugs, but the last one I chose was to call 100 classes at once, which took about 12 minutes.
“Tuning” ends.
On the whole, there was almost no difficulty in thinking, and it took so much time mainly because:
- Not reconciled to one of their ideas is not feasible, just try it
- inexperienced
Ah! If the code that takes 15 hours is not written by me a month ago, but written by someone else, then I change it to 12 minutes, and it looks like I’m pretty good 🤣 I’m just kidding, I hope everyone writes good code, so we can save time to rest
Well, back to bed. Tomorrow I’ll have to go on with other work for ‘the children’; And there’s another job at another school. (It’s a little hard to be assigned two jobs at the same time.)
I am Xiao Pai, a postgraduate student in Tianjin University, WeChat Piperlhj, if you are also engaged in Spark related work, be sure to add me WeChat, I really need a master to let me harass 😅
Don’t forget to watch