TL;DR
http://www.brendangregg.com/activebenchmarking.html
Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.
Problem N1. The first measurement after a pause is different.
It looks like you are facing on-stack replacement (http://openjdk.java.net/groups/hotspot/docs/HotSpotGlossary.html#onStackReplacement). When OSR occurs, the VM is paused, and the stack frame of the target function is replaced by an equivalent frame.
The root cause is a mistake in the micro-benchmark: it is not warmed up properly. Just insert the following line into your benchmark before the while loop in order to fix it:
System.out.println("WARMUP = " + busyPause(5000000000L));
How to check this: just run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I modified your code so that it now prints the interval to system output before every call:
interval = 1
interval = 1
interval = 5000000000
689 145 4 JvmPauseLatency::busyPause (19 bytes) made not entrant
689 146 3 JvmPauseLatency::busyPause (19 bytes)
Installing method (3) JvmPauseLatency.busyPause(J)J
698 147 % 4 JvmPauseLatency::busyPause @ 6 (19 bytes)
Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
702 148 4 JvmPauseLatency::busyPause (19 bytes)
705 146 3 JvmPauseLatency::busyPause (19 bytes) made not entrant
Installing method (4) JvmPauseLatency.busyPause(J)J
interval = 5000000000
interval = 5000000000
interval = 5000000000
interval = 5000000000
Usually OSR happens at tier 4, so in order to disable it you can use the following options:
- disable tiered compilation: -XX:-TieredCompilation
- disable tiered compilation at level 4: -XX:+TieredCompilation -XX:TieredStopAtLevel=3
- disable OSR only: -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement
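As a quick way to see OSR with your own eyes, a minimal hot-loop program like the one below (the class name is made up for this sketch) triggers an OSR compilation, because the loop lives inside main() and main() is only called once. Run it with -XX:+PrintCompilation and look for lines marked with '%'; with -XX:-UseOnStackReplacement those lines disappear.

```java
// Run with: java -XX:+PrintCompilation OsrDemo
// The '%' entries in the compilation log are OSR compilations.
public class OsrDemo {
    public static void main(String[] args) {
        long sum = 0;
        // A single long-running loop inside main(): the enclosing method is
        // invoked only once, so this loop can only become compiled via OSR.
        for (int i = 0; i < 100_000_000; i++) {
            sum += i % 7;
        }
        System.out.println("sum = " + sum);
    }
}
```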
Problem N2. How to measure.
Let's start with the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In short:
- the JIT can only compile methods; your test consists of one loop, so only OSR is available for it
- you are trying to measure something small, probably smaller than a nanoTime() call itself (see "What is the cost of a volatile write?" at https://shipilev.net/blog/2014/nanotrusting-nanotime/#_building_performance_models)
- the micro-architectural level matters: caches and CPU pipeline stalls, e.g. a TLB miss or a branch misprediction, can take more time than the test itself
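To get a feel for the scale of the second point, here is a rough, non-JMH sketch (so only indicative, the class name is made up) that estimates the cost of a System.nanoTime() call itself; numbers vary by OS and clock source, but tens of nanoseconds per call is typical:

```java
public class NanoTimeCost {
    public static void main(String[] args) {
        final int N = 10_000_000;
        long sink = 0;
        // warm-up pass so the loop is compiled before the measured pass
        for (int i = 0; i < N; i++) sink += System.nanoTime();
        long start = System.nanoTime();
        for (int i = 0; i < N; i++) sink += System.nanoTime();
        long perCall = (System.nanoTime() - start) / N;
        // sink is printed so the JIT cannot remove the loops entirely
        System.out.println("~" + perCall + " ns per nanoTime() call (sink parity " + (sink & 1) + ")");
    }
}
```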
So in order to avoid all these pitfalls you can use a JMH-based benchmark like this:
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class LatencyTest {

    public static final long LONG_PAUSE = 5000L;
    public static final long SHORT_PAUSE = 1L;

    public Random rand;

    @Setup
    public void initI() {
        rand = new Random(0xDEAD_BEEF);
    }

    private long busyPause(long pauseInNanos) {
        Blackhole.consumeCPU(pauseInNanos);
        return pauseInNanos;
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseShort() {
        return busyPause(SHORT_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseLong() {
        return busyPause(LONG_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyFunc() {
        return doCalculation(1);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureShort() {
        long x = busyPause(SHORT_PAUSE);
        return doCalculation(x);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureLong() {
        long x = busyPause(LONG_PAUSE);
        return doCalculation(x);
    }

    private long doCalculation(long x) {
        long calculation = 0;
        calculation += x / (rand.nextInt(5) + 1);
        calculation -= calculation / (rand.nextInt(5) + 1);
        calculation -= x / (rand.nextInt(6) + 1);
        calculation += calculation / (rand.nextInt(6) + 1);
        return calculation;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LatencyTest.class.getName())
                .verbosity(VerboseMode.NORMAL)
                .build();
        new Runner(options).run();
    }
}
Please note that I changed the busy-loop implementation to Blackhole#consumeCPU() in order to avoid OS-related effects. So my results are:
Benchmark Mode Cnt Score Error Units
LatencyTest.latencyBusyPauseLong avgt 4 15992.216 ± 106.538 ns/op
LatencyTest.latencyBusyPauseShort avgt 4 6.450 ± 0.163 ns/op
LatencyTest.latencyFunc avgt 4 97.321 ± 0.984 ns/op
LatencyTest.measureLong avgt 4 16103.228 ± 102.338 ns/op
LatencyTest.measureShort avgt 4 100.454 ± 0.041 ns/op
Please note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort ≈ measureShort (97.3 + 6.5 ≈ 103.8 ns/op vs the measured 100.5 ns/op).
Problem N3. Big gaps.
What is wrong with your test? It does not warm up the JVM properly, i.e. it uses one parameter to warm up and another one to test. Why does this matter? The JVM uses profile-guided optimizations: for example, it counts how often a branch is taken and generates the "best" (branch-free) code for the observed profile. So when we try to warm up the JVM in our benchmark with the parameter 1, the JVM generates "optimal code" in which the branch inside the while loop is never taken. Here is an event from the JIT compilation log (-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):
<branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/>
After the parameter changes, the JIT deoptimizes the code via an uncommon trap, and the code that follows is no longer optimal. I created a benchmark based on the original one with some minor changes:
- busyPause is replaced by JMH's consumeCPU in order to have a pure-Java benchmark with no interaction with the system (in fact, nano time uses the user-space function vdso clock_gettime, and we cannot profile this code)
- all calculations are removed
import java.util.Arrays;

public class JvmPauseLatency {

    private static final int WARMUP = 2000;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in nanos
    private static volatile long consumedCPU = System.nanoTime();

    // org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run(long warmPause) {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = warmPause;
        while (count < results.length) {
            consumeCPU(interval);
            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;
            results[count++] = latency;
            if (count == WARMUP) {
                interval = PAUSE;
            }
        }
        System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        int totalCount = 0;
        while (totalCount < 100) {
            new JvmPauseLatency().run(0);
            totalCount++;
        }
    }
}
and the results are:
Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]
To fix this benchmark, just replace new JvmPauseLatency().run(0) with new JvmPauseLatency().run(PAUSE), and here are the results:
Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]
And if you want to change the "pause" dynamically, you have to warm up the JVM dynamically, i.e.:
while (count < results.length) {
    consumeCPU(interval);
    long latency = System.nanoTime();
    latency = System.nanoTime() - latency;
    results[count++] = latency;
    if (count >= WARMUP) {
        interval = PAUSE;
    } else {
        interval = rnd.nextBoolean() ? PAUSE : 0; // rnd is a java.util.Random
    }
}
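Put together, a self-contained version of this randomized warm-up (a sketch; the class name is made up, and rnd is assumed to be a plain java.util.Random) looks like this. During the warm-up phase the pause length alternates randomly, so the JIT profile covers both branches before measurement starts:

```java
import java.util.Arrays;
import java.util.Random;

public class RandomWarmupLatency {

    private static final int WARMUP = 2000;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in consumeCPU tokens
    private static volatile long consumedCPU = System.nanoTime();

    private final Random rnd = new Random(0xDEAD_BEEF);

    // org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run() {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = rnd.nextBoolean() ? PAUSE : 0;
        while (count < results.length) {
            consumeCPU(interval);
            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;
            results[count++] = latency;
            // randomize the pause during warm-up, then fix it for measurement
            interval = (count >= WARMUP) ? PAUSE : (rnd.nextBoolean() ? PAUSE : 0);
        }
        System.out.println("Results:" + Arrays.toString(
                Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        new RandomWarmupLatency().run();
    }
}
```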
Problem N4. What about the interpreter (-Xint)?
With a switch-based interpreter we have a lot of problems, and the main one is indirect branch instructions. I made 3 experiments:
- random warm-up
- constant warm-up with pause 0
- the whole test uses pause 0, warm-up included
Each experiment was started with the command sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency
and the results are:
Performance counter stats for 'java -Xint JvmPauseLatency':
272,822,274,275 cycles
723,420,125,590 instructions # 2.65 insn per cycle
26,994,494 cache-references
8,575,746 cache-misses # 31.769 % of all cache refs
2,060,138,555 bus-cycles
2,930,155 branch-misses
86.808481183 seconds time elapsed
Performance counter stats for 'java -Xint JvmPauseLatency':
2,812,949,238 cycles
7,267,497,946 instructions # 2.58 insn per cycle
6,936,666 cache-references
1,107,318 cache-misses # 15.963 % of all cache refs
21,410,797 bus-cycles
791,441 branch-misses
0.907758181 seconds time elapsed
Performance counter stats for 'java -Xint JvmPauseLatency':
126,157,793 cycles
158,845,300 instructions # 1.26 insn per cycle
6,650,471 cache-references
909,593 cache-misses # 13.677 % of all cache refs
1,635,548 bus-cycles
775,564 branch-misses
0.073511817 seconds time elapsed
So in the case of branch misses, latency grows non-linearly, because of the interpreter's huge memory footprint.