Research and testing of System.currentTimeMillis() performance problems
1. Background
I finished my work early today and had nothing better to do, so I posted the Zhang Jigang and Li Fan bikini photos in the group chat. The admin said I was posting NSFW pictures, muted me, and revoked my QQ-group slacking-off privileges. Fine; I'll use the time to write up something I tested a while ago.
System.currentTimeMillis() is said to have performance problems under high concurrency, but it was never clear to me what the problem actually is. It is the most basic and most commonly used way to obtain a low-precision timestamp in the Java class library, and articles online both benchmark it and recommend caching the timestamp instead, so I tested both.
Conclusion first: System.currentTimeMillis() really can have serious performance problems under certain conditions, but caching the timestamp is even more problematic.
2. Analysis
Testing shows that this method performs excellently on Windows, and on Linux there is also no problem when a higher-rated clock source such as TSC or KVM_CLOCK is in use. However, when a clock source such as HPET or ACPI_PM is used, the per-call overhead becomes dramatically higher under high concurrency, unacceptably so (more than a 100x difference). System.currentTimeMillis() is a native method, which means its implementation is written in another language and depends on the operating system and the installed JDK. That explains why its performance varies under certain conditions: the underlying implementation simply differs from system to system.
The "dark magic" implementation on Windows performs extremely well; caching the timestamp gives almost no advantage over the native call there. The trade-off is that Windows also has a coarser resolution of about 0.5 ms (which is perfectly fine, since System.currentTimeMillis() is only meant to be a low-precision time source anyway). The Linux implementation is more complex and has to go through the system API, which costs more; worse, if HPET is used, it is a time source shared by the whole machine, and the whole computer can only make fewer than about 2 million HPET API calls per second in total, so once the other overhead is added the gap becomes very obvious.
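For reference, on Linux you can check which clock source the kernel is currently using through sysfs. Below is a minimal sketch, assuming the usual sysfs layout (/sys/devices/system/clocksource/clocksource0/); the path and the set of registered sources vary by kernel, distribution and virtualization platform:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ClockSourceInfo {
    // Typical sysfs location of the kernel clock source on Linux (assumed; may be absent on some systems).
    private static final Path BASE = Paths.get("/sys/devices/system/clocksource/clocksource0");

    public static void main(String[] args) throws Exception {
        Path current = BASE.resolve("current_clocksource");
        Path available = BASE.resolve("available_clocksource");
        if (Files.exists(current) && Files.exists(available)) {
            System.out.println("current clock source:    " + new String(Files.readAllBytes(current)).trim());
            System.out.println("available clock sources: " + new String(Files.readAllBytes(available)).trim());
        } else {
            System.out.println("clocksource sysfs entries not found (not Linux, or a different layout)");
        }
    }
}

Switching to another registered source, as done for the ACPI_PM tests later, is typically a root-only operation done by writing the source name into current_clocksource.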
3. Test process
Measure the cost by timing 50 million calls to System.currentTimeMillis().
public long testSystemMilliSpeed() {
    long sum = 0;
    long t1 = System.currentTimeMillis();
    for (int i = 0; i < 50000000; i++) {
        sum += System.currentTimeMillis();
    }
    long t2 = System.currentTimeMillis();
    System.out.println("[System.currentTimeMillis()] Sum = " + sum + "; time spent = " + (t2 - t1) +
            " ms; or " + (t2 - t1) * 1.0e6 / 50000000 + " ns / iter");
    return sum;
}
The test machine is an Alibaba Cloud compute-optimized C6 instance with 8 cores and 16 GB of RAM, running AdoptOpenJDK 1.8.
The operating systems are Windows Server 2019 and CentOS 8.2.
Typical results: Windows
Single thread:
[System.currentTimeMillis()] Sum = 7016145501759210272; time spent = 181 ms; or 3.62 ns/iter
[System.currentTimeMillis()] Sum = 7016145521514297836; time spent = 173 ms; or 3.46 ns/iter
[System.currentTimeMillis()] Sum = 7016145533408438385; time spent = 189 ms; or 3.78 ns/iter
8 threads:
[System.currentTimeMillis()] Sum = 7014845600130793472; time spent = 312 ms; or 6.24 ns/iter
[System.currentTimeMillis()] Sum = 7014845600415181616; time spent = 316 ms; or 6.32 ns/iter
[System.currentTimeMillis()] Sum = 7014845600415181616; time spent = 314 ms; or 6.28 ns/iter
16 threads
[System.currentTimeMillis()] Sum = 7014831913525842714; time spent = 401 ms; or 8.02 ns/iter
[System.currentTimeMillis()] Sum = 7014831913438052239; time spent = 418 ms; or 8.36 ns/iter
[System.currentTimeMillis()] Sum = 7014831914892399458; time spent = 381 ms; or 7.62 ns/iter
summary
On Windows, System.currentTimeMillis() has a very low call cost of about 3.62 ns per call, roughly 270 million calls per second. With four threads the per-call cost barely degrades, and total throughput is close to four times the single-thread figure, around 920 million calls per second. Adding more threads beyond that does push the per-call overhead up, but the degradation is mild, so the cost of this method can be safely ignored. The price of the Windows implementation is a resolution of only about 0.5 ms, which is coarser than Linux; but since this is not a high-precision timing method to begin with, that trade-off is easy to accept and on Windows can even be counted as a plus.
Typical results: Linux (TSC)
Single thread
[System.currentTimeMillis()] Sum = 7014846681218501008; time spent = 1308 ms; or 26.16 ns/iter
[System.currentTimeMillis()] Sum = 7014846746587342078; time spent = 1307 ms; or 26.14 ns/iter
[System.currentTimeMillis()] Sum = 7014846811970028463; time spent = 1308 ms; or 26.16 ns/iter
8 threads
[System.currentTimeMillis()] Sum = 7014815771421781278; time spent = 1563 ms; or 31.26 ns/iter
[System.currentTimeMillis()] Sum = 7014815771870699367; time spent = 1561 ms; or 31.22 ns/iter
[System.currentTimeMillis()] Sum = 7014815774818805007; time spent = 1588 ms; or 31.76 ns/iter
16 threads
[System.currentTimeMillis()] Sum = 7020243696839914036; time spent = 3147 ms; or 62.94 ns/iter
[System.currentTimeMillis()] Sum = 7020243692320997645; time spent = 3164 ms; or 63.28 ns/iter
[System.currentTimeMillis()] Sum = 7020243700477529928; time spent = 3069 ms; or 61.38 ns/iter
summary
It is clear from the results that even with TSC, Linux is significantly slower than Windows (and, let's face it, Java is deployed on Linux servers most of the time...). A single thread takes about 26 ns per call, roughly 38 million calls per second. While the thread count stays below the number of CPU cores, total QPS scales with the number of threads; once all 8 cores are busy, QPS peaks and stops improving, and the per-call overhead then grows roughly linearly with the thread count, stabilizing at around 120+ ns per call at 64 threads, about four times the cost at full core count. Overhead like that is entirely acceptable; the performance is still very good and you can use this method with complete peace of mind.
Typical results: Linux (ACPI_PM)
Single thread
[System.currentTimeMillis()] Sum = 7020093683684623170; time spent = 168303 ms; or 3366.06 ns/iter
[System.currentTimeMillis()] Sum = 7020102078358700323; time spent = 167509 ms; or 3350.18 ns/iter
[System.currentTimeMillis()] Sum = 7020110475654659673; time spent = 168367 ms; or 3367.34 ns/iter
8 threads
[System.currentTimeMillis()] Sum = 7020137245340372492; time spent = 348907 ms; or 6978.14 ns/iter
[System.currentTimeMillis()] Sum = 7020137258588772567; time spent = 349399 ms; or 6987.98 ns/iter
[System.currentTimeMillis()] Sum = 7020137296734841559; time spent = 351311 ms; or 7026.22 ns/iter
16 threads
[System.currentTimeMillis()] Sum = 7020205171447447339; time spent = 751769 ms; or 15035.38 ns/iter
[System.currentTimeMillis()] Sum = 7020205162872284276; time spent = 751808 ms; or 15036.16 ns/iter
[System.currentTimeMillis()] Sum = 7020205146376911145; time spent = 751996 ms; or 15039.92 ns/iter
Results summary
Note: on Linux, HPET and ACPI_PM sit in the lower "available" tier, TSC is rated an "excellent" clock source, and KVM_CLOCK is rated an "ideal" time source.
Since no HPET time source is registered on Alibaba Cloud ECS, I used ACPI_PM, which sits at the same rating tier, and it produced the expected kind of result. After switching to the ACPI_PM time source the numbers deteriorated so badly that I could not even finish the JMH run (by my estimate it would not have completed within a day), so I simply ran the loop test a few times by hand to collect some typical results.
The results still show overhead increasing roughly linearly with the number of threads, but the absolute numbers are dismal: with ACPI_PM the per-call cost is roughly a thousand times that of Windows, only about 300,000 calls per second, which puts a single call at the microsecond level. Overhead of that size can no longer be ignored. If a project happens to call currentTimeMillis() heavily, for example to measure the elapsed time of every request, it is likely to have a direct impact on response time.
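To make the "elapsed time per request" scenario concrete, here is a hypothetical sketch (the class and method names are made up for illustration) of the pattern that issues two clock reads per request; on an ACPI_PM-backed system those two reads alone already cost several microseconds:

public class RequestTimer {
    // Two System.currentTimeMillis() calls bracket every request; on a slow
    // clock source each call can cost microseconds, which adds up per request.
    public long timeRequest(Runnable handleRequest) {
        long start = System.currentTimeMillis();   // first clock read
        handleRequest.run();                       // actual request handling
        return System.currentTimeMillis() - start; // second clock read
    }
}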
4. JMH test
Note: Since ACPI_PM is too slow to run the JMH test, only the Linux (TSC) and Windows test results are available
The test environment is a latest-generation Alibaba Cloud C6 compute-optimized ECS instance with 8 cores and 16 GB of RAM. The JDK is AdoptOpenJDK 1.8u282, and the operating systems are Windows Server 2019 Datacenter edition and CentOS 8.2.
The example code above is called in a loop of 50 million iterations, tested with 1, 4, 8, 16, 32 and 64 threads respectively, and the results are recorded in a table; a sketch of the benchmark follows.
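As a rough illustration, a JMH benchmark for this could look like the sketch below (the fork, warmup and measurement settings are illustrative assumptions, not the exact configuration behind the tables):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
public class CurrentTimeMillisBenchmark {

    @Benchmark
    public long currentTimeMillis() {
        // Returning the value keeps JMH from optimizing the call away.
        return System.currentTimeMillis();
    }

    public static void main(String[] args) throws Exception {
        // Re-run the benchmark at each concurrency level of interest.
        for (int threads : new int[]{1, 4, 8, 16, 32, 64}) {
            Options opt = new OptionsBuilder()
                    .include(CurrentTimeMillisBenchmark.class.getSimpleName())
                    .threads(threads)
                    .build();
            new Runner(opt).run();
        }
    }
}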
When the number of concurrent threads is below the number of CPU cores on the server, the per-call overhead does not increase noticeably. Once concurrency reaches or exceeds the core count, the per-call time grows steadily with concurrency while actual QPS stops rising, because CPU load and thread contention have reached saturation; this is the expected result.
As the chart shows, using ACPI_PM as the clock source is far too slow. With 16 threads running, performance drops to an average of roughly 15 microseconds per call, so slow that I could not wait for the test to complete.
Note: Test code and results are in the appendix at the bottom
5. Cache timestamp
As mentioned above, the reason I ran these tests at all is that some articles and toolkits use a cached timestamp instead of the native currentTimeMillis() to avoid the potential instability across operating systems. So the cache has to go through the same tests.
Here is the cached-timestamp implementation:
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SystemClock {

    private final AtomicLong now;
    private static ScheduledExecutorService scheduler;

    static {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                if (scheduler != null) {
                    scheduler.shutdownNow();
                }
            }
        });
    }

    private SystemClock() {
        this.now = new AtomicLong(System.currentTimeMillis());
        scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "SystemClockScheduled");
            thread.setDaemon(true);
            return thread;
        });
        // Refresh the cached timestamp every millisecond.
        scheduler.scheduleAtFixedRate(() -> now.set(System.currentTimeMillis()), 1, 1, TimeUnit.MILLISECONDS);
    }

    public static long now() {
        return getInstance().now.get();
    }

    private static SystemClock getInstance() {
        return SystemClockEnum.SYSTEM_CLOCK.getInstance();
    }

    // Enum-based singleton holder.
    private enum SystemClockEnum {
        SYSTEM_CLOCK;

        private final SystemClock systemClock;

        SystemClockEnum() {
            systemClock = new SystemClock();
        }

        public SystemClock getInstance() {
            return systemClock;
        }
    }
}
The throughput test is the same as the System.currentTimeMillis() test above: a loop of 50 million calls, with the multithreaded runs measured via JMH.
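For clarity, the loop used for the cached clock is the same shape as testSystemMilliSpeed() above, just swapping in SystemClock.now(); the sketch below is what produces the "[BufferClock]" lines in the results (the method name is assumed):

public long testBufferClockSpeed() {
    long sum = 0;
    long t1 = System.currentTimeMillis();
    for (int i = 0; i < 50000000; i++) {
        sum += SystemClock.now();   // cached timestamp instead of the native call
    }
    long t2 = System.currentTimeMillis();
    System.out.println("[BufferClock] Sum = " + sum + "; time spent = " + (t2 - t1) +
            " ms; or " + (t2 - t1) * 1.0e6 / 50000000 + " ns / iter");
    return sum;
}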
Typical results
Single thread
## Windows
[BufferClock] Sum = 7014869507792291096; time spent = 16 ms; or 0.32 ns/iter
[BufferClock] Sum = 7014869509083652400; time spent = 32 ms; or 0.64 ns/iter
[BufferClock] Sum = 7014869510255266345; time spent = 31 ms; or 0.62 ns/iter

## Linux (TSC)
[BufferClock] Sum = 7014836520628701408; time spent = 25 ms; or 0.5 ns/iter
[BufferClock] Sum = 7014836521867072124; time spent = 25 ms; or 0.5 ns/iter
[BufferClock] Sum = 7014836523105825292; time spent = 25 ms; or 0.5 ns/iter

## Linux (ACPI_PM)
[BufferClock] Sum = 7014970447687570464; time spent = 25 ms; or 0.5 ns/iter
[BufferClock] Sum = 7014970448949515660; time spent = 25 ms; or 0.5 ns/iter
[BufferClock] Sum = 7014970450209997473; time spent = 25 ms; or 0.5 ns/iter
8 threads
## Windows
[BufferClock] Sum = 7014837658100506766; time spent = 53 ms; or 1.06 ns/iter
[BufferClock] Sum = 7014837658094441788; time spent = 53 ms; or 1.06 ns/iter
[BufferClock] Sum = 7014837658032598985; time spent = 56 ms; or 1.12 ns/iter

## Linux (TSC)
[BufferClock] Sum = 7020249783930639245; time spent = 51 ms; or 1.02 ns/iter
[BufferClock] Sum = 7020249784793206406; time spent = 51 ms; or 1.02 ns/iter
[BufferClock] Sum = 7020249784998169933; time spent = 51 ms; or 1.02 ns/iter

## Linux (ACPI_PM)
[BufferClock] Sum = 7020287669118903216; time spent = 71 ms; or 1.42 ns/iter
[BufferClock] Sum = 7020287669246604147; time spent = 57 ms; or 1.14 ns/iter
[BufferClock] Sum = 7020287669311506737; time spent = 64 ms; or 1.28 ns/iter
16 threads
## Windows
[BufferClock] Sum = 7014887357711793536; time spent = 52 ms; or 1.04 ns/iter
[BufferClock] Sum = 7014887357711793536; time spent = 53 ms; or 1.06 ns/iter
[BufferClock] Sum = 7014887357711793536; time spent = 54 ms; or 1.08 ns/iter

## Linux (TSC)
[BufferClock] Sum = 7020240799851649690; time spent = 101 ms; or 2.02 ns/iter
[BufferClock] Sum = 7020240798802739189; time spent = 114 ms; or 2.28 ns/iter
[BufferClock] Sum = 7020240798701704537; time spent = 118 ms; or 2.36 ns/iter

## Linux (ACPI_PM)
[BufferClock] Sum = 7020340267051437045; time spent = 192 ms; or 3.84 ns/iter
[BufferClock] Sum = 7020340268861047443; time spent = 144 ms; or 2.88 ns/iter
[BufferClock] Sum = 7020340269307897194; time spent = 183 ms; or 3.66 ns/iter
JMH test results
The currentTimeMillis() results are plotted together with the cached-timestamp results. The chart shows that once the thread count exceeds the core count, the Linux curve climbs with a steeper slope than the Windows one, and the green and blue cached-timestamp lines overlap so closely that only one line is visible.
Partial process screenshot
Results summary
As the typical results show, the cached timestamp is stable and extremely fast on both Linux and Windows; its overhead is essentially negligible. So caching the timestamp does effectively avoid the slowness caused by a poor time source.
6. The hard truth
So the cached timestamp performs so well that we can simply use it instead of the JDK method? Unfortunately not. It turns out that caching the timestamp is neither accurate nor controllable, and it should not be used casually anywhere (unless you really do have extreme concurrency requirements and genuinely do not care about precision, which is almost never the case).
Stability test
Besides being fast, a method for getting the time must above all be accurate; if the result is wrong, being fast means nothing. (My mental arithmetic is lightning fast too: 2352 + 135212 = 97652. Bam. Now ask me whether I'm fast.)
The test method
Record the return values of the call over a period of time, then calculate how much the interval between ticks varies, i.e. how scattered the ticks are.
Note: the JVM needs to be started with at least 4 GB of heap (for example -Xmx4g), because a huge 500-million-element long array is used to record the call results.
import java.util.ArrayList;
import java.util.List;

public class TicksTest {

    public static volatile boolean flag = true;

    public static void main(String[] args) {
        TicksTest.step();
    }

    public static void step() {
        // 500 million samples
        long[] values = new long[500000000];
        // Watchdog thread that ends the sampling window after roughly 500 ms
        new Thread(() -> {
            try {
                Thread.sleep(500L);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            TicksTest.flag = false;
        }).start();
        int length = 0;
        long star = System.currentTimeMillis();
        for (int i = 0; flag && i < values.length; i++) {
            values[i] = System.currentTimeMillis();
            // values[i] = SystemClock.now();
            length = i;
        }
        long end = System.currentTimeMillis();
        // Aggregate the results
        long min = 0, max = 0, total = 0, minCount = 0, maxCount = 0, totalCount = 0, last = 0, skip = 0;
        List<Long> ticksList = new ArrayList<>(600);
        List<Long> ticksCountList = new ArrayList<>(600);
        for (int i = 1; i < length; i++) {
            if (values[i] != values[i - 1]) {
                long step = (values[i] - values[i - 1]);
                long stepCount = i - last;
                last = i;
                System.out.printf("step %09d %d %d%n", i, step, stepCount);
                // Skip the first tick
                if (skip++ > 0) {
                    total += step;
                    totalCount += stepCount;
                    ticksList.add(step);
                    ticksCountList.add(stepCount);
                    if (max < step)
                        max = step;
                    if (min == 0 || min > step)
                        min = step;
                    if (maxCount < stepCount)
                        maxCount = stepCount;
                    if (minCount == 0 || minCount > stepCount)
                        minCount = stepCount;
                }
            }
        }
        System.out.printf("time: %d, count: %d, " +
                "ticks: (avg: %d, min: %d, max: %d) " +
                "ticksCount: (avg: %d, min: %d, max: %d)", end - star, ticksList.size(), total / ticksList.size(), min, max,
                totalCount / ticksCountList.size(), minCount, maxCount);
    }
}
Typical results
These results were obtained on my own development machine, a relatively modest 4-core, 16 GB box.
currentTimeMillis()
time: 515, count: 512, ticks: (avg: 1, min: 1, max: 1) ticksCount: (avg: 193040, min: 11356, max: 300382)
time: 509, count: 506, ticks: (avg: 1, min: 1, max: 1) ticksCount: (avg: 207048, min: 13123, max: 332162)
time: 513, count: 511, ticks: (avg: 1, min: 1, max: 1) ticksCount: (avg: 208990, min: 3133, max: 330624)
Cache timestamp
time: 506, count: 39, ticks: (avg: 12, min: 1, max: 17) ticksCount: (avg: 7673476, min: 62472, max: 11029922)
time: 509, count: 36, ticks: (avg: 13, min: 1, max: 21) ticksCount: (avg: 8133013, min: 22302, max: 13444986)
time: 503, count: 36, ticks: (avg: 13, min: 1, max: 20) ticksCount: (avg: 8624468, min: 48840, max: 13364882)
As the results show, the JDK method ticks steadily at 1 ms every time, and consecutive runs are consistent. The cached timestamp, maintained by an asynchronous thread that is supposed to fire every millisecond, is very unstable: only a few ticks are the normal 1 ms, while most of the others are in the double digits of milliseconds. In the extreme case, two calls made 20 ms apart can return exactly the same value, which is enough to make your face fall.
To be fair, the result of currentTimeMillis() also depends on the underlying operating system, and as the method's Javadoc notes, the tick granularity can be on the order of tens of milliseconds. In practice, though, CentOS, Ubuntu and Deepin all behaved reliably.
7. Summary
currentTimeMillis() has a very low per-call cost on Windows: per-thread QPS peaks when up to about half of the CPU threads are in use, and total throughput keeps rising with more threads, though not linearly. On Linux with the TSC time source, QPS grows linearly until all CPU cores are busy; after that it plateaus and no longer increases with additional threads, which is still an excellent result. With the ACPI_PM time source, a single thread already saturates the clock: adding threads does not raise QPS, the extra threads simply block waiting inside currentTimeMillis(), and CPU usage stays low, presumably because the underlying time source is effectively serialized and all the upper-level threads have to queue for it. As for the cached timestamp, its resolution is too coarse, and under heavy load the asynchronous update thread may not get scheduled at all, in which case the cached value simply stops being updated.
In the vast majority of cases the JDK implementation is more reliable than people give it credit for. Trust the JDK and don't try to be clever; if you must be clever, make sure it is thoroughly tested and proven!
8. Appendix
Test code and results github.com/yyyyyzc/cur…
References: pzemtsov.github.io/2017/07/23/…