Non-professional translation; corrections are welcome!

The original link: gee.cs.oswego.edu/dl/html/Str…

Translation aided by DeepL.

When to use parallel streams?

[Draft, September 1, 2014. Minimal formatting for now; final location to be determined.]

The java.util.stream framework supports data-driven operations on collections and other sources. Most stream methods apply the same operation to each data element. When multiple CPU cores are available, “data-driven” can become “data-parallel” by using a collection's parallelStream() method. But when should you do so?

Consider using s.parallelStream().operation(F) instead of s.stream().operation(F) when the operation is independent, and it is either computationally expensive or applied to many elements of an efficiently splittable data structure, or both. In more detail:

  • F, the per-element function (usually a lambda), is independent: the computation for each element neither depends on nor affects that of any other element. See the stream package summary for further guidance on using stateless, non-interfering functions.
  • S, the source collection, is efficiently splittable. Besides Collections, some other stream sources are also easy to parallelize, such as java.util.SplittableRandom (for which you can use the stream.parallel() method). But most IO-based sources are designed primarily for sequential use.
  • The total time to execute the sequential version exceeds a minimum threshold. Today, across most platforms, that threshold is roughly (within a factor of ten) 100 microseconds. You don't need to measure this exactly, though. In practice you can estimate it by multiplying N (the number of elements) by Q (the cost of executing F per element), in turn guesstimating Q as the number of operations or lines of code, and then checking that N * Q is at least 10,000. (If in doubt, add another zero or two.) So when F is a tiny function such as x -> x + 1, parallel execution is worthwhile only with N >= 10,000 elements. Conversely, when F is a massive computation, such as finding the best next move in a chess game, Q is so high that N hardly matters, so long as the collection is completely splittable.
  • For IO or database operations, estimate the time cost directly; counting lines of code is not a useful guide there.
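The N * Q heuristic above can be sketched as a size-based switch between stream() and parallelStream(). The class name, the MIN_PAR threshold, and the sumSquares example are illustrative, not from the original article:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ParallelChoice {
    // Illustrative threshold from the N * Q >= 10,000 rule of thumb,
    // with Q ~ 1 for a tiny per-element function.
    static final int MIN_PAR = 10_000;

    // For a cheap function like squaring, parallelism only pays off
    // when the source is large; otherwise stay sequential.
    static long sumSquares(List<Integer> s) {
        Stream<Integer> stream = (s.size() >= MIN_PAR) ? s.parallelStream() : s.stream();
        return stream.mapToLong(x -> (long) x * x).sum();
    }

    public static void main(String[] args) {
        List<Integer> small = IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());
        List<Integer> large = IntStream.rangeClosed(1, 1_000_000).boxed().collect(Collectors.toList());
        System.out.println(sumSquares(small)); // 338350 (computed sequentially)
        System.out.println(sumSquares(large)); // computed in parallel
    }
}
```

A component that knows its typical N and Q can bake this decision in once, rather than leaving it to every caller.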

The stream framework does not (and cannot) enforce these conditions. If the computation is not independent, running it in parallel makes no sense and can even lead to serious errors. The other criteria stem from three engineering issues and trade-offs:

  • Startup. As processors have gained cores over the years, most have also gained power-control mechanisms that can make core startup slow, sometimes with extra overhead from the JVM, operating system, and hypervisor. This startup overhead is comparable to the time needed for enough cores to begin processing subtasks. Once started, though, parallel computing may be more energy-efficient than sequential computing (depending on processor and system details; see, for example, this article by Fedorova et al.).
  • Granularity. It is rarely worthwhile to subdivide an already-small computation. The framework typically splits a problem so that every available core in the system handles a subproblem. If each core has essentially nothing to do once started, the (mostly sequential) effort of setting up the parallel computation is wasted. Given that core counts currently range from about 2 to 256, this threshold also avoids the effects of over-splitting tasks.
  • Splittability. The most efficiently splittable collections include ArrayLists and {Concurrent}HashMaps, as well as plain arrays (that is, those of the form T[], which can be split using static java.util.Arrays methods). The least efficient are LinkedLists, BlockingQueues, and most IO-based sources. Others fall somewhere in between. (Data structures tend to split efficiently if they internally support random access, efficient search, or both.) If splitting the data takes longer than processing it, the effort is wasted. So if a computation's Q factor is high enough, you might get a parallel speedup even with a LinkedList, but this is not very common. Also, some sources cannot be split all the way down to single elements, which may limit how finely tasks can be divided.
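The splittability difference is visible at the Spliterator level. For example, an ArrayList hands off half of its index range in constant time, without touching any elements (a small sketch, not from the original article):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1000; i++) list.add(i);

        // ArrayList splits by index arithmetic: trySplit() hands off the
        // first half in O(1). This is what makes it efficiently splittable
        // for parallel streams. A LinkedList, by contrast, can only split
        // by walking nodes, which is itself sequential work.
        Spliterator<Integer> rest = list.spliterator();
        Spliterator<Integer> firstHalf = rest.trySplit();
        System.out.println(firstHalf.estimateSize()); // 500
        System.out.println(rest.estimateSize());      // 500
    }
}
```

The fork/join machinery underneath parallel streams repeats this halving until subtasks are small enough to run on individual workers.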

Gathering detailed measurements of these effects can be difficult (though it can be done with some care using tools such as JMH). But the overall effect is easy to see, and you can run your own experiments to get a feel for it. For example, on a 32-core test machine, running small functions like max() or sum() over an ArrayList, the break-even point is very close to a size of 10K. Larger sizes can see speedups of up to about 20x. Run times for sizes below 10K are not much smaller than those at 10K, so parallel runs there tend to be slower than sequential execution. The worst slowdowns occur with fewer than 100 elements: this activates a bunch of threads that end up with nothing to do, because the computation finishes before they even start. On the other hand, when each element is expensive to evaluate, the benefits of parallelism appear right away, provided you use an efficiently and fully splittable collection such as ArrayList.
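A rough sketch of such an experiment follows. This is deliberately naive (no warm-up, single iteration), so its absolute numbers should not be read as benchmarks; use JMH for real measurements:

```java
import java.util.ArrayList;
import java.util.List;

public class BreakEven {
    // Naive timing helper: a real measurement needs warm-up runs,
    // many iterations, and statistical treatment (as JMH provides).
    static long timeNanos(Runnable task) {
        long t0 = System.nanoTime();
        task.run();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        for (int n : new int[] { 100, 10_000, 1_000_000 }) {
            List<Long> data = new ArrayList<>();
            for (long i = 0; i < n; i++) data.add(i);
            long seq = timeNanos(() -> data.stream().mapToLong(Long::longValue).sum());
            long par = timeNanos(() -> data.parallelStream().mapToLong(Long::longValue).sum());
            System.out.printf("n=%-9d sequential=%-10dns parallel=%dns%n", n, seq, par);
        }
    }
}
```

On typical multicore hardware the parallel version tends to lose at n=100, roughly break even near 10K, and win at a million, mirroring the figures above.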

Put another way, using parallel() when there isn't enough computation may cost you about 100 microseconds, while using it when warranted should save at least that much (and possibly hours for very large problems). The exact costs and benefits vary over time and across platforms, and also from case to case. For example, running a small parallel computation inside a sequential loop can significantly hurt overall program performance, so microbenchmarks that do this may not predict actual usage.

Some questions and answers:

  1. Why can’t the JVM figure out for itself whether to use parallel mode? It could try, but it would often get the answer wrong. The quest for fully unguided automatic multicore parallelization has not met with consistent success over the past three decades, so the stream framework takes the safer approach of asking users only for a yes/no decision. These decisions rely on engineering trade-offs that are unlikely ever to disappear entirely, similar to those made all the time in sequential programming. For example, when looking up the largest element in a collection holding a single element, you might incur a hundred times the overhead (slowdown) of just using that element directly (not wrapped in a collection). Sometimes the JVM can optimize such overhead away for you, but this rarely happens in sequential cases and never in parallel ones. On the other hand, we do expect tools to evolve that help users make better decisions.

  2. What if I know too little about the parameters (F, N, Q, S) to make a good decision? This, too, is similar to a common sequential-programming problem. For example, a call to the Collection method s.contains(x) is usually fast if s is a HashSet, slow if it is a LinkedList, and somewhere in between otherwise. In general, the best way to cope is for the author of a component that uses a collection not to export the collection directly, but to export operations based on it, so that users are insulated from these decisions. The same applies to parallel operations. For example, a component with an internal collection of prices could define a method that applies a size threshold, unless the per-element computation is expensive. Such as:

  public long getMaxPrice() { return priceStream().max(); }

  private Stream priceStream() {
    return (prices.size() < MIN_PAR) ? prices.stream() : prices.parallelStream();
  }

You can extend this idea in a variety of ways to address various considerations about when and how to use parallelism.
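For instance, a reusable variant that also accounts for an expensive per-element function (high Q). The helper name and signature are hypothetical, not from the original article:

```java
import java.util.Collection;
import java.util.stream.Stream;

public class StreamChoice {
    // Hypothetical helper: pick sequential or parallel based on a size
    // threshold (the N side of N * Q) and a caller-supplied hint about
    // whether the per-element function is expensive (the Q side).
    static <T> Stream<T> maybeParallel(Collection<T> c, boolean expensivePerElement, int minPar) {
        return (expensivePerElement || c.size() >= minPar)
               ? c.parallelStream()
               : c.stream();
    }
}
```

Callers can then write maybeParallel(prices, false, MIN_PAR).map(...) and keep the decision logic in one place instead of repeating it at every call site.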

  3. What if my function might do IO or use synchronization? At one extreme are functions that fail the independence criterion: those performing inherently sequential IO, accessing locked synchronized resources, or where a failure in one concurrent subtask performing IO has side effects on others. It makes little sense to parallelize these. At the other extreme are computations that perform occasional, brief IO or synchronization and rarely block (such as most forms of logging, and most concurrent collections such as ConcurrentHashMap). These are harmless. The cases in between require the most judgment. If each subtask may block for an extended period waiting on IO or access, CPU resources may sit idle with no way for the program or the JVM to use them, which makes everyone unhappy. Parallel streams are often a poor choice here, but there are good alternatives, such as async-IO and CompletableFuture-based designs.
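A minimal sketch of that alternative, assuming a hypothetical blocking call slowFetch and an illustrative pool size; the point is that blocking work goes to its own executor rather than the common fork/join pool that parallelStream() uses:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class AsyncAlternative {
    // Stand-in for a blocking IO call (hypothetical).
    static String slowFetch(String url) {
        try { Thread.sleep(10); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return "response:" + url;
    }

    // Submit each blocking task to a dedicated executor sized for blocking
    // work; join preserves the input order of the futures list.
    static List<String> fetchAll(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        try {
            List<CompletableFuture<String>> futures = urls.stream()
                .map(u -> CompletableFuture.supplyAsync(() -> slowFetch(u), pool))
                .collect(Collectors.toList());
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(List.of("a", "b"))); // [response:a, response:b]
    }
}
```

While the tasks block in the dedicated pool, the fork/join workers stay free for CPU-bound parallel streams elsewhere in the program.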

  4. What if my source is IO-based? Currently, JDK IO-based stream sources (for example BufferedReader.lines()) are geared mainly for sequential use, processing elements one by one as they arrive. Opportunities exist to support efficient bulk processing of buffered IO, but currently these require custom-developed stream sources, Spliterators, and/or Collectors. Some common forms may be supported in future JDK releases.
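Until then, one workable pattern is to drain the IO source sequentially into a splittable collection, then parallelize from there. A small sketch (the class and method names are illustrative):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

public class IoSource {
    // Consume the IO source sequentially into an ArrayList (an efficiently
    // splittable collection), then do the CPU-bound work in parallel.
    static long countLongLines(BufferedReader reader) {
        List<String> lines = reader.lines().collect(Collectors.toList()); // sequential IO
        return lines.parallelStream().filter(s -> s.length() > 3).count(); // parallel work
    }

    public static void main(String[] args) {
        BufferedReader r = new BufferedReader(new StringReader("a\nabcd\nabcdef\nxy"));
        System.out.println(countLongLines(r)); // 2
    }
}
```

This only pays off when the per-line work is substantial enough to satisfy the N * Q threshold; for trivial filters the sequential pass over the buffered lines is already fast.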

  5. What if my program runs on a busy computer and all the cores are in use? Machines generally have a fixed number of cores, and performing parallel operations cannot magically create more. However, as long as the criteria for choosing parallel execution are clearly met, there is usually no cause for concern. Your parallel tasks will compete with others for CPU time, so you will see less speedup, but in most cases this is still more effective than the alternatives. The underlying mechanisms are designed so that if no other cores are available, you will see only a small slowdown compared to sequential execution, unless the system is so overloaded that it spends all its time context-switching instead of doing real work, or is tuned on the assumption that all processing is sequential. If you are on such a system, the administrator may already have disabled the use of multiple threads/cores as part of the JVM configuration. And if you administer such a system, you might consider doing so.

  6. Are all operations parallelized in parallel mode? Yes, at least to some extent, although the stream framework is subject to constraints from the source and from the chosen methods in deciding how. In general, fewer constraints allow more potential parallelism. On the other hand, there is no guarantee that the framework will extract and apply every possible parallelization opportunity. In some cases, if you have the time and expertise, you may be able to handcraft a substantially better parallel solution.

  7. How much of a parallel speedup will I get? If you follow these guidelines, usually enough to be worthwhile. Predictability is not a strong point of modern hardware and systems, so no universal answer is possible. Cache locality, garbage-collection rates, JIT compilation, memory contention, data layout, OS scheduling policies, and the presence of hypervisors are all factors that can have a significant impact. They also play a role in sequential performance, but tend to be amplified in a parallel setting: an issue causing a 10% difference sequentially may cause a 10x difference in parallel.

    The stream framework includes facilities that can help you improve your chances of a speedup. For example, using specializations for primitives, such as IntStream, tends to help more in parallel than in sequential execution, because it not only reduces overhead (and footprint) but also improves cache locality. And using a ConcurrentHashMap rather than a HashMap as the target of a parallel “collect” operation reduces internal overhead. More tips and guidance will emerge as people gain experience with the framework.
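Both of these facilities can be illustrated briefly (the data and names are illustrative):

```java
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelFriendly {
    public static void main(String[] args) {
        // IntStream avoids boxing each element, which cuts overhead and
        // footprint and improves cache locality -- both matter more in
        // parallel runs than sequential ones.
        long evens = IntStream.range(0, 1_000_000).parallel()
                              .filter(i -> i % 2 == 0)
                              .count();
        System.out.println(evens); // 500000

        // groupingByConcurrent targets a ConcurrentHashMap, letting parallel
        // threads insert directly instead of building and merging one map
        // per thread. Note: element order within each group is unspecified.
        List<String> words = List.of("ant", "bee", "cat", "ape", "bat");
        ConcurrentMap<Character, List<String>> byFirst =
            words.parallelStream()
                 .collect(Collectors.groupingByConcurrent(w -> w.charAt(0)));
        System.out.println(byFirst.get('a')); // contains "ant" and "ape"
    }
}
```

The trade-off of groupingByConcurrent is the loss of encounter order within groups; use plain groupingBy when order matters.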

  8. This all sounds scary! Shouldn’t we establish a policy of disabling parallelism via JVM properties? We don’t want to tell you what to do. Introducing new ways for programmers to get things wrong can be scary, and errors in coding, design, and judgment will surely occur. But some people have been predicting for decades that enabling application-level parallelism would lead to major disasters, and those disasters haven’t happened yet.


Written by Doug Lea, with thanks for help from Brian Goetz, Paul Sandoz, Aleksey Shipilev, Heinz Kabutz, Joe Bowbeer, and others.