
Translation | JEP 331: Low-Overhead Heap Profiling (reposted)



Provide a low-overhead way of sampling Java heap allocations, accessible via JVMTI.


Provide a way to get information about Java object heap allocations from the JVM that:

  • Is low-overhead enough to be enabled by default continuously,
  • Is accessible via a well-defined, programmatic interface,
  • Can sample all allocations (i.e., is not limited to allocations that are in one particular heap region or that were allocated in one particular way),
  • Can be defined in an implementation-independent way (i.e., without relying on any particular GC algorithm or VM implementation), and
  • Can give information about both live and dead Java objects.




There is a deep need for users to understand the contents of their heaps. Poor heap management can lead to problems such as heap exhaustion and GC thrashing. As a result, a number of tools have been developed to allow users to introspect into their heaps, such as the Java Flight Recorder, jmap, YourKit, and VisualVM tools.
One piece of information that is lacking from most of the existing tooling is the call site for particular allocations. Heap dumps and heap histograms do not contain this information. This information can be critical to debugging memory issues, because it tells developers the exact location in their code where particular (and particularly bad) allocations occurred.
There are currently two ways of getting this information out of HotSpot:

  • First, you can instrument all of the allocations in your application using a bytecode rewriter such as the Allocation Instrumenter. You can then have the instrumentation take a stack trace (when you want one).
  • Second, you can use Java Flight Recorder, which takes a stack trace on TLAB refills and when allocating directly into the old generation. The downsides of this are that a) it is tied to a particular allocation implementation (TLABs), and misses allocations that don’t meet that pattern; b) it doesn’t allow the user to customize the sampling interval; and c) it only logs allocations, so you cannot distinguish between live and dead objects.



This proposal mitigates these problems by providing an extensible JVMTI interface that allows the user to define the sampling interval and returns a set of live stack traces.


New JVMTI event and method

The user-facing API for the heap sampling feature proposed here consists of an extension to JVMTI that allows for heap profiling. It relies on the JVMTI event notification system, which provides a callback such as:

SampledObjectAlloc(jvmtiEnv *jvmti_env,
                   JNIEnv* jni_env,
                   jthread thread,
                   jobject object,
                   jclass object_klass,
                   jlong size)


  • thread is the thread allocating the jobject,
  • object is the reference to the sampled jobject,
  • object_klass is the class for the jobject, and
  • size is the size of the allocation.



The new API also includes a single new JVMTI method:

jvmtiError  SetHeapSamplingInterval(jvmtiEnv* env, jint sampling_interval)

where sampling_interval is the average number of bytes allocated between samples. The specification of the method is:

  • If non-zero, the sampling interval is updated, and callbacks are sent to the user at the new average sampling interval of sampling_interval bytes.
    • For example, if the user wants a sample every megabyte, sampling_interval would be 1024 * 1024.
  • If zero is passed to the method, the sampler samples every allocation once the new interval is taken into account, which might take a certain number of allocations.



Note that the sampling interval is not precise. Each time a sample occurs, the number of bytes before the next sample is chosen pseudo-randomly with the given average interval. This avoids sampling bias; for example, if the same allocations happen every 512KB, a fixed 512KB sampling interval would always sample the same allocations. Therefore, although individual gaps between samples will not equal the selected interval, their average tends towards it after a large number of samples.
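
The bias argument above can be illustrated with a small simulation that needs no JVM. This is only a sketch under stated assumptions: the text says the gap is pseudo-random with the given average but does not name a distribution, so the exponential distribution used here is an assumption, and IntervalPicker and sampled_sites are hypothetical names introduced for this example. Any overshoot past a sample point is simply discarded, which a real sampler may handle differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <set>
#include <vector>

// Hypothetical helper: draws the byte distance to the next sample point.
// Assumption: an exponential distribution whose mean is the requested interval.
class IntervalPicker {
 public:
  IntervalPicker(double mean_bytes, std::uint64_t seed)
      : rng_(seed), dist_(1.0 / mean_bytes) {
    assert(mean_bytes > 0.0);
  }
  // Always at least one byte, so two samples never collapse into one point.
  std::uint64_t next() { return static_cast<std::uint64_t>(dist_(rng_)) + 1; }

 private:
  std::mt19937_64 rng_;
  std::exponential_distribution<double> dist_;
};

// Replays a repeating pattern of allocation sizes and returns the indices
// (standing in for call sites) of the allocations that were sampled.
std::set<std::size_t> sampled_sites(const std::vector<std::uint64_t>& pattern,
                                    int rounds, bool randomized,
                                    std::uint64_t interval) {
  assert(interval > 0);
  IntervalPicker picker(static_cast<double>(interval), 42);
  std::set<std::size_t> sites;
  std::uint64_t bytes_until_sample = randomized ? picker.next() : interval;
  for (int r = 0; r < rounds; ++r) {
    for (std::size_t i = 0; i < pattern.size(); ++i) {
      const std::uint64_t size = pattern[i];
      if (size >= bytes_until_sample) {
        // This allocation crosses the sample point: record it and re-arm.
        sites.insert(i);
        bytes_until_sample = randomized ? picker.next() : interval;
      } else {
        bytes_until_sample -= size;
      }
    }
  }
  return sites;
}
```

With four 128KB allocations repeating every 512KB, the fixed interval keeps hitting the same allocation, while the randomized gaps spread the samples over all four.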

Use-case example

To enable this, a user would use the usual event notification call:

(*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

The event will be sent when the allocation is initialized and set up correctly, so slightly after the actual code performs the allocation. By default, the average sampling interval is 512KB.
The minimum required to enable the sampling event system is to call SetEventNotificationMode with JVMTI_ENABLE and the event type JVMTI_EVENT_SAMPLED_OBJECT_ALLOC. To modify the sampling interval, the user calls the SetHeapSamplingInterval method.
To disable the system, the call

(*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

disables the event notifications and disables the sampler automatically.
Enabling the event again via SetEventNotificationMode re-enables the sampler with whatever sampling interval is currently set (either 512KB by default or the last value passed by a user via SetHeapSamplingInterval).

New capability

To protect the new feature and make it optional for VM implementations, a new capability named can_generate_sampled_object_alloc_events is introduced into the jvmtiCapabilities.

Global / thread level sampling

Using the notification system provides a direct means to send events only for specific threads. This is done via SetEventNotificationMode and providing a third parameter with the threads to be modified.

A full example

The following section provides code snippets to illustrate the sampler’s API. First, the capability and the event notification are enabled:

jvmtiEventCallbacks callbacks;
memset(&callbacks, 0, sizeof(callbacks));
callbacks.SampledObjectAlloc = &SampledObjectAlloc;

jvmtiCapabilities caps;
memset(&caps, 0, sizeof(caps));
caps.can_generate_sampled_object_alloc_events = 1;
if (JVMTI_ERROR_NONE != (*jvmti)->AddCapabilities(jvmti, &caps)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(jvmtiEventCallbacks))) {
  return JNI_ERR;
}

// Set the sampler to 1MB.
if (JVMTI_ERROR_NONE != (*jvmti)->SetHeapSamplingInterval(jvmti, 1024 * 1024)) {
  return JNI_ERR;
}

To disable the sampler (disables events and the sampler):

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

To re-enable the sampler with the 1024 * 1024 byte sampling interval, a simple call to enable the event is all that is required:

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

User storage of sampled allocations

When an event is generated, the callback can capture a stack trace using the JVMTI GetStackTrace method. The jobject reference obtained by the callback can also be wrapped into a JNI weak reference to help determine when the object has been garbage collected. This approach allows the user to gather data on which objects were sampled, as well as which are still considered live, which can be a good means to understand the job’s behavior.
For example, something like this could be done:

extern "C" JNIEXPORT void JNICALL SampledObjectAlloc(jvmtiEnv *env,
                                                     JNIEnv* jni,
                                                     jthread thread,
                                                     jobject object,
                                                     jclass klass,
                                                     jlong size) {
  jvmtiFrameInfo frames[32];
  jint frame_count;
  jvmtiError err;

  // A NULL thread argument means the current (allocating) thread.
  err = global_jvmti->GetStackTrace(NULL, 0, 32, frames, &frame_count);
  if (err == JVMTI_ERROR_NONE && frame_count >= 1) {
    // A weak reference does not keep the sampled object alive.
    jweak ref = jni->NewWeakGlobalRef(object);
    internal_storage.add(jni, ref, size, thread, frames, frame_count);
  }
}

where internal_storage is a data structure that can handle the sampled objects, decide whether any garbage-collected samples need to be cleaned up, etc. The internals of that implementation are usage-specific, and out of scope of this JEP.
The sampling interval can be used as a means to mitigate profiling overhead. With a sampling interval of 512KB, the overhead should be low enough that a user could reasonably leave the system on by default.
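
Since the internals of internal_storage are left open, the following is one possible shape for it, written so that it compiles without a JVM. It is a sketch under stated assumptions, not the JEP's implementation: SampleStorage and Sample are hypothetical names, std::weak_ptr<void> stands in for a JNI jweak, and a vector of integers stands in for copied jvmtiFrameInfo entries. In a real agent the liveness check would instead go through JNI (IsSameObject against NULL reports whether a weak global reference has been cleared), and pruned entries would release their reference with DeleteWeakGlobalRef.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// One recorded sample. The weak_ptr is only a stand-in for a JNI jweak, and
// the frame vector is a stand-in for copied jvmtiFrameInfo entries.
struct Sample {
  std::weak_ptr<void> ref;
  std::int64_t size;
  std::vector<std::uint64_t> frames;
};

// Hypothetical internal_storage. It is guarded by a mutex because the
// callback can run on any allocating thread.
class SampleStorage {
 public:
  void add(std::weak_ptr<void> ref, std::int64_t size,
           std::vector<std::uint64_t> frames) {
    assert(size >= 0);
    std::lock_guard<std::mutex> lock(mu_);
    samples_.push_back(Sample{std::move(ref), size, std::move(frames)});
  }

  // Sum of the sizes of sampled objects that are still alive.
  std::int64_t live_bytes() {
    std::lock_guard<std::mutex> lock(mu_);
    std::int64_t total = 0;
    for (const Sample& s : samples_) {
      if (!s.ref.expired()) total += s.size;
    }
    return total;
  }

  // Drops samples whose object is gone and returns how many were dropped.
  std::size_t prune() {
    std::lock_guard<std::mutex> lock(mu_);
    const std::size_t before = samples_.size();
    samples_.erase(
        std::remove_if(samples_.begin(), samples_.end(),
                       [](const Sample& s) { return s.ref.expired(); }),
        samples_.end());
    return before - samples_.size();
  }

  std::size_t size() {
    std::lock_guard<std::mutex> lock(mu_);
    return samples_.size();
  }

 private:
  std::mutex mu_;
  std::vector<Sample> samples_;
};
```

Keeping dead entries until prune() is called is a deliberate choice in this sketch: it lets a tool report on both live and garbage-collected samples before discarding the latter.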

Implementation details

The current prototype and implementation prove the feasibility of the approach. The implementation contains five parts:

  1. Architecture dependent changes due to a change of a field name in the ThreadLocalAllocationBuffer (TLAB) structure. These changes are minimal as they are just name changes.
  2. The TLAB structure is augmented with a new allocation_end pointer, to complement the existing end pointer. If the sampling is disabled, the two pointers are always equal and the code performs as before. If the sampling is enabled, end is modified to be where the next sample point is requested. Then, any fast path will “think” the TLAB is full at that point and go down the slow path, which is explained in (3).
  3. The gc/shared/collectedHeap code is changed due to its usage as an entry point to the allocation slow path. When a TLAB is considered full (because allocation has passed the end pointer), the code enters collectedHeap and tries to allocate a new TLAB. At this point, the TLAB is set back to its original size and an allocation is attempted. If the allocation succeeds, the code samples the allocation, and then returns. If it does not, it means allocation has reached the end of the TLAB, and a new TLAB is needed. The code path continues its normal allocation of a new TLAB and determines if that allocation requires a sample. If the allocation is considered too big for the TLAB, the system samples the allocation as well, thus covering in TLAB and out of TLAB allocations for sampling.
  4. When a sample is requested, there is a collector object set on the stack in a place safe for sending the information to the native agent. The collector keeps track of sampled allocations and, at destruction of its own frame, sends a callback to the agent. This mechanism ensures the object is initialized correctly.
  5. If a JVMTI agent has registered a callback for the SampledObjectAlloc event, the event will be triggered and it will obtain sampled allocations. An example implementation can be found in the libHeapMonitorTest.c file, which is used for JTreg testing.
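
Parts (2) and (3) above can be pictured with a toy bump allocator. This is a simplified model under stated assumptions, not HotSpot code: Tlab here is a hypothetical struct, offsets replace real pointers, the sample interval is fixed rather than pseudo-random, and allocations that land in a freshly refilled buffer or bypass the buffer entirely are not considered for sampling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model of a thread-local allocation buffer with a sampling end pointer.
struct Tlab {
  std::size_t capacity;
  std::size_t interval;        // bytes between samples; 0 disables sampling
  std::size_t top = 0;         // next free offset
  std::size_t end = 0;         // limit checked by the fast path
  std::size_t allocation_end = 0;  // real end of the buffer
  std::size_t samples = 0;
  std::size_t refills = 0;

  Tlab(std::size_t cap, std::size_t sample_interval)
      : capacity(cap), interval(sample_interval) {
    reset();
  }

  // Start a fresh buffer.
  void reset() {
    top = 0;
    allocation_end = capacity;
    arm();
  }

  // With sampling off, end equals allocation_end; with sampling on, end is
  // pulled in to the next sample point.
  void arm() {
    end = (interval == 0) ? allocation_end
                          : std::min(top + interval, allocation_end);
  }

  void allocate(std::size_t size) {
    assert(size <= capacity);
    if (top + size <= end) {  // fast path: plain pointer bump
      top += size;
      return;
    }
    // Slow path: restore the real limit and retry.
    end = allocation_end;
    if (top + size <= end) {
      top += size;
      if (interval != 0) ++samples;  // the buffer only looked full: sample
      arm();
      return;
    }
    // The buffer really is full: take a new one and allocate there.
    ++refills;
    reset();
    top += size;
  }
};
```

For a 1024-byte buffer with a 256-byte interval, sixteen 64-byte allocations take the slow path three times without ever refilling; with sampling disabled the same sequence never leaves the fast path until the buffer is actually full.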




There are multiple alternatives to the system presented in this JEP. The introduction already presented two. Flight Recorder is an interesting alternative, but the implementation proposed here has several advantages over it. First, JFR does not allow the sampling interval to be set and does not provide a callback. Next, JFR’s use of a buffer system can lead to lost allocations when the buffer is exhausted. Finally, the JFR event system does not provide a means to track objects that have been garbage collected, which means it cannot be used to provide information about both live and garbage-collected objects.
Another alternative is bytecode instrumentation using ASM. Its overhead makes it prohibitive and not a workable solution.
This JEP adds a new feature into JVMTI, which is an important API/framework for various development and monitoring tools. With it, a JVMTI agent can use a low overhead heap profiling API along with the rest of the JVMTI functionality, which provides great flexibility to the tools. For instance, it is up to the agent to decide if a stack trace needs to be collected at each event point.


There are 16 tests in the JTreg framework for this feature. They cover turning the feature on and off with multiple threads, multiple threads allocating at the same time, whether the data is sampled at the right interval, and whether the gathered stacks reflect the correct program information.

Risks and Assumptions

There are no performance penalties or risks with the feature disabled. A user who does not enable the system will not perceive a performance difference.
However, there is a potential performance/memory penalty with the feature enabled. In the initial prototype implementation, the overhead was minimal (<2%). This used a more heavyweight mechanism that modified JIT’d code. In the final version presented here, the system piggy-backs on the TLAB code, and should not experience that regression.
Current evaluation of the Dacapo benchmark puts the overhead at:

  • 0% when the feature is disabled
  • 1% when the feature is enabled at the default 512KB interval, but no callback action is performed (i.e., the SampledObjectAlloc callback is empty but registered with the JVM).
  • 3% overhead with a sampling callback that does a naive implementation to store the data (using the one in the tests)



