JFR定位线上问题实例 - JFR导致的雪崩问题定位与解决原创
最近发现一个应用,使用默认的 JFR 配置,发生了类似于雪崩的现象,这里记录下,引以为戒。
Key takeaways
- JFR 的线程堆栈 dump 采集默认周期是everyChunk,也就是每次新建一个Chunk就会采集一次。默认每个Chunk大小为 12M,在线程比较多的时候,堆栈 dump 可能大于 12M,导致 JFR 一直切换新的 Chunk,然后又触发evenryChunk导致无限循环而雪崩。
- 对于 JFR 默认事件采集配置(位于JDK目录/lib/jfr/default.jfc),每个采集周期和Chunk相关的,都要谨慎处理,最好周期通过固定时间写死,例如每一分钟等等,不要使用Chunk作为指标,防止上面这个问题。
- 只有 Running 的线程才会进行 CPU 上下文切换,创建很多线程,但是同一时间有任务的线程很少(Running 线程比较少),那么线程切换也不会很频繁。等待任务的线程基本不占用 CPU 资源。
背景介绍
线上某个线程数非常多的应用,线程是空闲状态,有任务提交过来,但是过了 10s,任务才开始执行,并且这一段时间有很多类似的情况,例如:
2020-12-01 13:07:34.515 INFO [微服务名称,298ee1f359f67ff7,298ee1f359f67ff7] [19] [线程全局唯一名称:调度线程] 任务提交到线程池
.....
2020-12-01 13:07:43.623 INFO [微服务名称,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [线程全局唯一名称:24684780-0] 开始处理
2020-12-01 13:07:44.159 INFO [微服务名称,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [线程全局唯一名称:24684780-0] 处理完成
查看24684780-0这个线程池的统计当时的队列是空的,所有线程都是等待任务的状态。排除了有其他任务占满核心线程池导致等待的情况。
线程一直空闲,但是却不能接受任务处理,表示这个线程可能一直处于需要进入安全点SafePoint的场景。查看GC,当时的 GC 是正常的,证明并不是 GC 引起的需要进入安全点:
查看当时的 safepoint 日志(我们线上的进程开启了 safepoint 日志,参数:-Xlog:safepoint=debug:file=safepoint.log:utctime,level,tags:filecount=50,filesize=100M)
,发现在这期间内,有很多很多的进入 safepoint 的处理(我们只看了进入安全点(Entering safepoint region)这一行):
[2020-12-01T13:07:32.290+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.335+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:32.635+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:32.655+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:32.853+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:32.923+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.968+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:33.269+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:33.286+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:33.480+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:33.524+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:33.806+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:33.822+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:33.949+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:34.019+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:34.064+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:34.346+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:34.361+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:34.510+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:34.555+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:34.839+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:34.861+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:35.023+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:35.097+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:35.140+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:35.423+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:35.439+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:35.614+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:35.654+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:35.943+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:35.959+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:36.089+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:36.172+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:36.216+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:36.498+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:36.515+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:36.706+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:36.749+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:37.036+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:37.052+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:37.184+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:37.258+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:37.301+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:37.624+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:37.648+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:37.906+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:37.955+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:38.248+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:38.265+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:38.431+0000][info ][safepoint] Entering safepoint region: G1CollectForAllocation
[2020-12-01T13:07:38.551+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:38.620+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:38.686+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:38.978+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:39.001+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:39.214+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:39.269+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:39.552+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:39.568+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:39.733+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:39.813+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:39.860+0000][info ][safepoint] Entering safepoint region: PrintThreads
看上去很像 JFR 的定时采集,通过 top -Hp查看线程cpu占用:
top - 01:31:46 up 10:01, 1 user, load average: 2.16, 2.55, 2.62
Threads: 25390 total, 2 running, 25388 sleeping, 0 stopped, 0 zombie
%Cpu(s): 17.0 us, 4.7 sy, 0.0 ni, 74.0 id, 4.1 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 32120452 total, 287936 free, 20438864 used, 11393652 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 11485760 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
63120 root 20 0 26.8g 9.4g 22016 R 73.2 30.8 178:50.29 VM Thread
63130 root 20 0 26.8g 9.4g 22016 S 15.7 30.8 45:19.81 JFR Periodic Ta
63114 root 20 0 26.8g 9.4g 22016 S 6.6 30.8 17:31.55 GC Thread#0
63185 root 20 0 26.8g 9.4g 22016 S 6.6 30.8 44:59.01 Log4j2-TF-10-As
63346 root 20 0 26.8g 9.4g 22016 S 6.3 30.8 17:31.89 GC Thread#2
63345 root 20 0 26.8g 9.4g 22016 S 6.0 30.8 17:31.70 GC Thread#1
63347 root 20 0 26.8g 9.4g 22016 S 5.7 30.8 17:31.86 GC Thread#3
63128 root 20 0 26.8g 9.4g 22016 S 5.4 30.8 13:35.61 JFR Recorder Th
63134 root 20 0 26.8g 9.4g 22016 S 4.2 30.8 13:21.16 java
63569 root 20 0 26.8g 9.4g 22016 S 1.8 30.8 19:20.94 lettuce-epollEv
64742 root 20 0 26.8g 9.4g 22016 S 1.2 30.8 1:03.26 XNIO-2 task-6
2777 root 20 0 26.8g 9.4g 22016 S 1.2 30.8 0:46.15 XNIO-2 task-24
。。。。。。
发现使用 CPU 最多的线程是 VM 线程,其次是 JFR 线程。怀疑是因为 JFR 采集的原因造成的所有线程一直进入安全点的问题。我们的 JFR 采集事件的配置是默认的配置,也就是 JDK 目录下的 lib 目录下的 jfr 目录下的 default.jfc 查看其中的线程 Dump 配置:
<event name="jdk.ThreadDump">
<setting name="enabled" control="thread-dump-enabled">true</setting>
<setting name="period" control="thread-dump-interval">everyChunk</setting>
</event>
发现默认是everyChunk,也就是每次进入新的Chunk会打印一次。那么什么是Chunk?啥时候进入新的Chunk?
在 JFR 中,所有的 Event (包括通过JFR API产生的 Event 还有 JVM 产生的 EVENT),会先存储到每个线程自己的 Thread Buffer 中;在这个 Buffer 满了之后,会将 Buffer 的内容刷入 Global Buffer 中;Global Buffer 是一个环形 Buffer,保存着所有线程发送来的 Thread Buffer 中的内容。当这个环形 Buffer 存储到达上限之后,根据配置,会选择丢弃或者刷入文件,这里默认是刷入文件。这个文件,就可以理解为Chunk。
Chunk
的大小是有限制的,默认是12m.
但是,这个进程的线程很多,导致线程 dump 超过了 12m。导致 JFR 一直切换新的 Chunk,然后又触发evenryChunk
导致无限循环进入安全点。
解决方案
发现问题之后,首先想到的是,减少线程数量。由于业务需要,这个业务的线程主要用来做业务隔离以及有序执行,同一时间内的 Running 线程数量并不会很多。只有 Running 的线程才会进行 CPU 上下文切换,创建很多线程,但是同一时间有任务的线程很少(Running 线程比较少),那么线程切换也不会很频繁。等待任务的线程基本不占用 CPU 资源。所以暂时先不考虑减少线程数量。
最后的解决方案是,对于 JFR 默认事件采集配置,每个采集周期和Chunk相关的,都要谨慎处理,最好周期通过固定时间写死,例如每一分钟等等,不要使用Chunk作为指标,防止出现某个采集雪崩导致不断切换chunk导致无限采集这个问题。