
Translation | JEP 331: Low-Overhead Heap Profiling (reposted)


Summary(概要)

Provide a low-overhead way of sampling Java heap allocations, accessible via JVMTI.
提供一种低开销的Java堆分配采样方式,可通过JVMTI访问。

Goals(目标)

Provide a way to get information about Java object heap allocations from the JVM that:

  • Is low-overhead enough to be enabled by default continuously,
  • Is accessible via a well-defined, programmatic interface,
  • Can sample all allocations (i.e., is not limited to allocations that are in one particular heap region or that were allocated in one particular way),
  • Can be defined in an implementation-independent way (i.e., without relying on any particular GC algorithm or VM implementation), and
  • Can give information about both live and dead Java objects.

提供一种从JVM获取Java对象堆分配信息的方法:

  • 开销足够低,可以在默认情况下连续启用,
  • 可以通过定义良好的编程接口访问,
  • 可以对所有的分配进行抽样(即,不局限于一个特定堆区域中的分配或以一种特定方式分配的分配),
  • 可以以一种与实现无关的方式定义(即,不依赖于任何特定的GC算法或VM实现),以及
  • 可以提供有关活的和死的Java对象的信息。

Motivation(动机)

There is a deep need for users to understand the contents of their heaps. Poor heap management can lead to problems such as heap exhaustion and GC thrashing. As a result, a number of tools have been developed to allow users to introspect into their heaps, such as the Java Flight Recorder, jmap, YourKit, and VisualVM tools.
用户非常需要理解堆的内容。糟糕的堆管理可能会导致堆耗尽和GC抖动等问题。因此,人们开发了许多工具来允许用户自省他们的堆,例如Java Flight Recorder、jmap、YourKit和VisualVM工具。
One piece of information that is lacking from most of the existing tooling is the call site for particular allocations. Heap dumps and heap histograms do not contain this information. This information can be critical to debugging memory issues, because it tells developers exactly where in their code particular (and particularly bad) allocations occurred.
大多数现有工具缺少的一个信息是特定分配的调用站点。堆转储和堆直方图不包含此信息。此信息对于调试内存问题非常重要,因为它告诉开发人员代码中发生特定(特别糟糕的)分配的确切位置。
There are currently two ways of getting this information out of HotSpot:

  • First, you can instrument all of the allocations in your application using a bytecode rewriter such as the Allocation Instrumenter. You can then have the instrumentation take a stack trace (when you want one).
  • Second, you can use Java Flight Recorder, which takes a stack trace on TLAB refills and when allocating directly into the old generation. The downsides of this are that a) it is tied to a particular allocation implementation (TLABs), and misses allocations that don’t meet that pattern; b) it doesn’t allow the user to customize the sampling interval; and c) it only logs allocations, so you cannot distinguish between live and dead objects.

目前有两种方法从HotSpot获取这些信息:

  • 首先,您可以使用字节码重写器(例如Allocation Instrumenter)来检测应用程序中的所有分配。然后,您可以让插装进行堆栈跟踪(当您需要时)。
  • 其次,您可以使用Java Flight Recorder,它在TLAB重新填充和直接分配到老一代时进行堆栈跟踪。这样做的缺点是:a)它绑定到特定的分配实现(TLABs),并且错过了不符合该模式的分配;b)它不允许用户自定义采样间隔;c)它只记录分配,所以你无法区分活对象和死对象。

This proposal mitigates these problems by providing an extensible JVMTI interface that allows the user to define the sampling interval and returns a set of live stack traces.
该建议通过提供可扩展的JVMTI接口来缓解这些问题,该接口允许用户定义采样间隔并返回一组活动堆栈跟踪。

Description(描述)

New JVMTI event and method(新的JVMTI事件和方法)

The user-facing API for the heap sampling feature proposed here consists of an extension to JVMTI that allows for heap profiling. It relies on the JVMTI event notification system, which delivers each sample through a callback such as:
这里提出的面向用户的堆采样特性API由JVMTI的扩展组成,该扩展允许进行堆分析。以下系统依赖于提供回调的事件通知系统,例如:

void JNICALL
SampledObjectAlloc(jvmtiEnv *jvmti_env,
                   JNIEnv* jni_env,
                   jthread thread,
                   jobject object,
                   jclass object_klass,
                   jlong size)

where:

  • thread is the thread allocating the jobject,
  • object is the reference to the sampled jobject,
  • object_klass is the class for the jobject, and
  • size is the size of the allocation.

说明:

  • thread是分配对象的线程
  • object是对采样对象的引用
  • object_klass是jobject的类
  • size是分配的大小

The new API also includes a single new JVMTI method:
新的API还包括一个新的JVMTI方法:

jvmtiError  SetHeapSamplingInterval(jvmtiEnv* env, jint sampling_interval)

where sampling_interval is the average number of bytes allocated between two samples. The specification of the method is:

  • If non-zero, the sampling interval is updated, and the user then receives callbacks at the new average sampling interval of sampling_interval bytes.
    • For example, if the user wants a sample every megabyte, sampling_interval would be 1024 * 1024.
  • If zero is passed to the method, the sampler samples every allocation once the new interval is taken into account, which might take a certain number of allocations.

其中sampling_interval是两次采样之间分配的平均字节数。该方法的规格为:

  • 如果不为零,采样间隔将被更新,并将用sampling_interval字节的新平均采样间隔发送回调给用户
    • 例如,如果用户希望每兆字节采样一次,则sampling_interval将是1024 * 1024。
  • 如果将0传递给方法,采样器在考虑到新的间隔后对每个分配进行采样,这可能需要一定数量的分配

Note that the sampling interval is not precise. Each time a sample occurs, the number of bytes before the next sample is chosen pseudo-randomly, with the given interval as the average. This is to avoid sampling bias; for example, if the same allocations happen every 512KB, a fixed 512KB sampling interval would always sample the same allocations. Therefore, although the distance between two samples will not always equal the selected interval, its average tends towards that interval after a large number of samples.
注意,采样间隔是不精确的。每次出现一个样本时,在下一个样本被选择之前的字节数将是给定平均间隔的伪随机。这是为了避免抽样偏差;例如,如果相同的分配每512KB发生一次,512KB采样间隔将始终对相同的分配进行采样。因此,虽然采样间隔并不总是选择的间隔,但在大量的样本之后,它会趋向于它。
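The effect of a randomized interval can be illustrated with a small simulation. The sketch below is not HotSpot's implementation: it assumes an exponential distribution (via std::exponential_distribution) as the source of pseudo-random distances, and the names SampleIntervalPicker and AverageInterval are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper, not HotSpot code: draws the number of bytes to
// allocate before the next sample from an exponential distribution whose
// mean is the requested sampling interval (an assumption of this sketch).
class SampleIntervalPicker {
 public:
  SampleIntervalPicker(std::int64_t mean_bytes, std::uint32_t seed)
      : rng_(seed), dist_(RateFor(mean_bytes)) {}

  // Returns at least 1 byte so that every draw makes progress.
  std::int64_t NextBytes() {
    std::int64_t bytes = static_cast<std::int64_t>(dist_(rng_));
    return bytes < 1 ? 1 : bytes;
  }

 private:
  // The rate of an exponential distribution is the inverse of its mean.
  static double RateFor(std::int64_t mean_bytes) {
    assert(mean_bytes > 0);
    return 1.0 / static_cast<double>(mean_bytes);
  }

  std::mt19937 rng_;
  std::exponential_distribution<double> dist_;
};

// Averages `draws` distances; with many draws the result approaches the mean.
double AverageInterval(std::int64_t mean_bytes, int draws, std::uint32_t seed) {
  SampleIntervalPicker picker(mean_bytes, seed);
  double total = 0.0;
  for (int i = 0; i < draws; ++i) {
    total += static_cast<double>(picker.NextBytes());
  }
  return total / draws;
}
```

For instance, AverageInterval(512 * 1024, 200000, 42) should land within a few percent of 524288 bytes, even though individual distances vary widely. That variation is what prevents a periodic allocation pattern from being sampled at the same point every time.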

Use-case example(用例示例)

To enable this, a user makes the usual event notification call:
要启用此功能,用户将使用通常的事件通知调用来操作:

(*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

The event will be sent when the allocation is initialized and set up correctly, so slightly after the actual code performs the allocation. By default, the average sampling interval is 512KB.
该事件将在分配初始化并正确设置时发送,因此略晚于实际代码执行分配之后。缺省情况下,平均采样间隔为512KB。
The minimum required to enable the sampling event system is to call SetEventNotificationMode with JVMTI_ENABLE and the event type JVMTI_EVENT_SAMPLED_OBJECT_ALLOC. To modify the sampling interval, the user calls the SetHeapSamplingInterval method.
启用采样事件系统的最低要求是使用JVMTI_ENABLE和事件类型JVMTI_EVENT_SAMPLED_OBJECT_ALLOC调用SetEventNotificationMode。要修改采样间隔,用户调用SetHeapSamplingInterval方法。
To disable the system,
禁用方式,

(*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

disables the event notifications and disables the sampler automatically.
禁用事件通知并自动禁用采样器。
Enabling the event again via SetEventNotificationMode re-enables the sampler with whatever sampling interval is currently set (either the 512KB default or the last value passed by a user via SetHeapSamplingInterval).
通过SetEventNotificationMode再次调用采样器将使用当前设置的采样间隔重新启用采样器(默认为512KB或用户通过SetHeapSamplingInterval传递的最后一个值)。

New capability(新功能)

To protect the new feature and make it optional for VM implementations, a new capability named can_generate_sampled_object_alloc_events is introduced into the jvmtiCapabilities.
为了保护新特性并使其成为VM实现的可选特性,在jvmtiCapabilities中引入了名为can_generate_sampled_object_alloc_events的新功能。

Global / thread level sampling(全局/线程级采样)

Using the notification system provides a direct means to send events only for specific threads. This is done by passing the thread of interest as the event_thread argument of SetEventNotificationMode; passing NULL, as in the examples here, enables the event for all threads.
使用通知系统提供了一种仅为特定线程发送事件的直接方法。这是通过SetEventNotificationMode完成的,并提供第三个参数,其中包含要修改的线程。

A full example(完整的例子)

The following section provides code snippets to illustrate the sampler’s API. First, the capability and the event notification are enabled:
下面的部分提供代码片段来演示采样器的API。首先,启用功能和事件通知:

jvmtiEventCallbacks callbacks;
memset(&callbacks, 0, sizeof(callbacks));
callbacks.SampledObjectAlloc = &SampledObjectAlloc;

jvmtiCapabilities caps;
memset(&caps, 0, sizeof(caps));
caps.can_generate_sampled_object_alloc_events = 1;
if (JVMTI_ERROR_NONE != (*jvmti)->AddCapabilities(jvmti, &caps)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(jvmtiEventCallbacks))) {
  return JNI_ERR;
}

// Set the sampler to 1MB.
if (JVMTI_ERROR_NONE !=  (*jvmti)->SetHeapSamplingInterval(jvmti, 1024 * 1024)) {
  return JNI_ERR;
}

To disable the sampler (disables events and the sampler):
禁用采样器(禁用事件和采样器):

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

To re-enable the sampler with the 1024 * 1024 byte sampling interval, a simple call to enable the event is required:
要重新启用1024 * 1024字节采样间隔的采样器,需要一个简单的调用来启用事件:

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

User storage of sampled allocations(抽样分配的用户存储)

When an event is generated, the callback can capture a stack trace using the JVMTI GetStackTrace method. The jobject reference obtained by the callback can also be wrapped into a JNI weak reference to help determine when the object has been garbage collected. This approach allows the user to gather data on which objects were sampled, as well as which are still considered live, which can be a good means to understand the application’s behavior.
当事件生成时,回调可以使用JVMTI GetStackTrace方法捕获堆栈跟踪。回调获得的jobject引用也可以包装成JNI弱引用,以帮助确定对象何时已被垃圾收集。这种方法允许用户收集关于采样对象的数据,以及仍然被认为是活动的对象的数据,这是了解作业行为的好方法。
For example, something like this could be done:
例如,可以这样做:

extern "C" JNIEXPORT void JNICALL SampledObjectAlloc(jvmtiEnv *env,
                                                     JNIEnv* jni,
                                                     jthread thread,
                                                     jobject object,
                                                     jclass klass,
                                                     jlong size) {
  jvmtiFrameInfo frames[32];
  jint frame_count;
  jvmtiError err;

  err = global_jvmti->GetStackTrace(NULL, 0, 32, frames, &frame_count);
  if (err == JVMTI_ERROR_NONE && frame_count >= 1) {
    jweak ref = jni->NewWeakGlobalRef(object);
    internal_storage.add(jni, ref, size, thread, frames, frame_count);
  }
}

where internal_storage is a data structure that holds the sampled objects and decides, for example, whether samples of garbage-collected objects need to be cleaned up. The internals of that implementation are usage-specific and out of scope for this JEP.
其中internal_storage是一个可以处理采样对象的数据结构,它还需要考虑是否要清理已被垃圾收集的样本,等等。该实现的内部细节因使用场景而异,超出了这个JEP的范围。
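The JEP leaves internal_storage unspecified. As one possible shape, the sketch below keeps one record per sample and can report or drop the records whose object is gone. It is an illustration only: SampleStorage and Sample are hypothetical names, and std::weak_ptr<void> stands in for the jweak that a real agent would obtain from NewWeakGlobalRef and test with IsSameObject(ref, NULL).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one sampled allocation. The weak_ptr models a JNI
// weak global reference: it does not keep the object alive.
struct Sample {
  std::weak_ptr<void> ref;
  std::int64_t size;
  std::vector<std::string> frames;  // simplified stack trace
};

class SampleStorage {
 public:
  void Add(std::weak_ptr<void> ref, std::int64_t size,
           std::vector<std::string> frames) {
    samples_.push_back(Sample{std::move(ref), size, std::move(frames)});
  }

  // Counts samples whose object has not been reclaimed yet.
  std::size_t LiveCount() const {
    std::size_t live = 0;
    for (const Sample& s : samples_) {
      if (!s.ref.expired()) ++live;
    }
    return live;
  }

  // Drops samples whose object is gone; returns how many were removed.
  std::size_t PruneCollected() {
    const std::size_t before = samples_.size();
    std::vector<Sample> kept;
    for (Sample& s : samples_) {
      if (!s.ref.expired()) kept.push_back(std::move(s));
    }
    samples_ = std::move(kept);
    return before - samples_.size();
  }

  std::size_t Size() const { return samples_.size(); }

 private:
  std::vector<Sample> samples_;
};
```

Because the callback can run on any allocating thread, a real agent would also need to guard such a structure with a lock and release each weak reference it drops; both are omitted here for brevity.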
The sampling interval can be used as a means to mitigate profiling overhead. With a sampling interval of 512KB, the overhead should be low enough that a user could reasonably leave the system on by default.
采样间隔可以用作减少分析开销的一种手段。使用512KB的采样间隔,开销应该足够低,用户可以合理地在默认情况下打开系统。

Implementation details(实现细节)

The current prototype and implementation prove the feasibility of the approach. The implementation contains five parts:

  1. Architecture-dependent changes due to a change of a field name in the ThreadLocalAllocationBuffer (TLAB) structure. These changes are minimal as they are just name changes.
  2. The TLAB structure is augmented with a new allocation_end pointer, to complement the existing end pointer. If the sampling is disabled, the two pointers are always equal and the code performs as before. If the sampling is enabled, end is modified to be where the next sample point is requested. Then, any fast path will “think” the TLAB is full at that point and go down the slow path, which is explained in (3).
  3. The gc/shared/collectedHeap code is changed due to its usage as an entry point to the allocation slow path. When a TLAB is considered full (because allocation has passed the end pointer), the code enters collectedHeap and tries to allocate a new TLAB. At this point, the TLAB is set back to its original size and an allocation is attempted. If the allocation succeeds, the code samples the allocation, and then returns. If it does not, it means allocation has reached the end of the TLAB, and a new TLAB is needed. The code path continues its normal allocation of a new TLAB and determines if that allocation requires a sample. If the allocation is considered too big for the TLAB, the system samples the allocation as well, thus covering both in-TLAB and out-of-TLAB allocations.
  4. When a sample is requested, there is a collector object set on the stack in a place safe for sending the information to the native agent. The collector keeps track of sampled allocations and, at destruction of its own frame, sends a callback to the agent. This mechanism ensures the object is initialized correctly.
  5. If a JVMTI agent has registered a callback for the SampledObjectAlloc event, the event will be triggered and it will obtain sampled allocations. An example implementation can be found in the libHeapMonitorTest.c file, which is used for JTreg testing.

目前的原型和实现证明了该方法的可行性。它包括五个部分:

  1. 由于ThreadLocalAllocationBuffer (TLAB)结构中字段名称的更改,导致架构相关的更改。这些更改是最小的,因为它们只是名称更改。
  2. TLAB结构增加了一个新的allocation_end指针,以补充现有的结束指针。如果禁用采样,则两个指针始终相等,代码将像以前一样执行。如果启用了采样,end将被修改为请求下一个采样点的位置。然后,任何快速路径都会“认为”TLAB在此时已经满了,然后沿着慢路径走,这在(3)中解释过。
  3. gc/shared/collectedHeap代码被更改,因为它被用作分配慢路径的入口点。当TLAB被认为已满(因为分配已传递结束指针)时,代码进入collectedHeap并尝试分配一个新的TLAB。此时,TLAB将恢复到其原始大小,并尝试进行分配。如果分配成功,代码对分配进行采样,然后返回。如果没有,则意味着TLAB的分配已经结束,需要一个新的TLAB。代码路径继续其对新TLAB的正常分配,并确定该分配是否需要示例。如果分配被认为对TLAB来说太大,系统也会对分配进行抽样,从而覆盖TLAB分配内和TLAB分配外进行抽样。
  4. 当请求一个示例时,堆栈上有一个收集器对象设置在一个安全的位置,用于将信息发送到本机代理。收集器跟踪采样分配,并在销毁自己的帧时向代理发送回调。该机制确保对象被正确初始化。
  5. 如果JVMTI代理为SampledObjectAlloc事件注册了回调,则该事件将被触发,并且它将获得抽样分配。在libHeapMonitorTest.c文件中可以找到一个示例实现,该文件用于JTreg测试。
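Parts 2 and 3 can be pictured with a toy model. The sketch below is not HotSpot code: ToyTlab and its functions are hypothetical, offsets replace raw pointers, and a fixed interval replaces the pseudo-random one so that the behavior is deterministic. It only shows how moving end ahead of allocation_end forces the fast path to fail at the sample point, and how the slow path tells a sample point apart from a buffer that is really full.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a TLAB (an assumption of this sketch, not the VM structure).
struct ToyTlab {
  std::size_t top = 0;             // next free byte
  std::size_t end = 0;             // limit seen by the fast path
  std::size_t allocation_end = 0;  // real end of the buffer
  std::size_t samples = 0;         // number of sampled allocations
};

// Starts a fresh buffer and arms the first sample point.
void Reset(ToyTlab& tlab, std::size_t capacity, std::size_t bytes_until_sample) {
  tlab.top = 0;
  tlab.allocation_end = capacity;
  tlab.end = bytes_until_sample < capacity ? bytes_until_sample : capacity;
}

// Fast path: plain bump allocation checked against `end` only.
bool FastAllocate(ToyTlab& tlab, std::size_t size) {
  if (size > tlab.end - tlab.top) return false;
  tlab.top += size;
  return true;
}

// Slow path: restore the real end and retry. Success means the fast path
// failed only because of the sample point, so the allocation is sampled and
// the next sample point is armed. Failure means the buffer is really full.
bool SlowAllocate(ToyTlab& tlab, std::size_t size, std::size_t interval) {
  tlab.end = tlab.allocation_end;
  if (!FastAllocate(tlab, size)) return false;
  ++tlab.samples;
  const std::size_t next = tlab.top + interval;
  tlab.end = next < tlab.allocation_end ? next : tlab.allocation_end;
  return true;
}

bool Allocate(ToyTlab& tlab, std::size_t size, std::size_t interval) {
  return FastAllocate(tlab, size) || SlowAllocate(tlab, size, interval);
}
```

With a 1024-byte buffer, a 256-byte interval, and 64-byte objects, every fifth allocation takes the slow path and is sampled, while the others never leave the fast path. When sampling is disabled, end simply equals allocation_end and the slow path is reached only when the buffer is exhausted.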

Alternatives(选择)

There are multiple alternatives to the system presented in this JEP. The Motivation section already presented two. The first, Flight Recorder, is an interesting alternative, but the implementation proposed here has several advantages over it. First, JFR does not allow the sampling size to be set or provide a callback. Next, JFR’s use of a buffer system can lead to lost allocations when the buffer is exhausted. Finally, the JFR event system does not provide a means to track objects that have been garbage collected, which means it is not possible to use it to provide information about live and garbage collected objects.
对于这个JEP中提出的系统,有多种替代方案。介绍中已经介绍了两个:Flight Recorder提供了一个有趣的替代方案。这个实现提供了几个优点。首先,JFR不允许设置抽样大小或提供回调。其次,当缓冲区耗尽时,JFR使用缓冲区系统可能导致分配丢失。最后,JFR事件系统没有提供跟踪已被垃圾收集的对象的方法,这意味着不可能使用它来提供有关活动对象和垃圾收集对象的信息。
Another alternative is bytecode instrumentation using ASM. Its overhead is prohibitive, so it is not a workable solution.
另一种替代方法是使用ASM的字节码插装。它的开销让人望而却步,不是一个可行的解决方案。
This JEP adds a new feature into JVMTI, which is an important API/framework for various development and monitoring tools. With it, a JVMTI agent can use a low overhead heap profiling API along with the rest of the JVMTI functionality, which provides great flexibility to the tools. For instance, it is up to the agent to decide if a stack trace needs to be collected at each event point.
这个JEP向JVMTI添加了一个新特性,JVMTI是用于各种开发和监视工具的重要API/框架。有了它,JVMTI代理可以使用低开销的堆分析API以及其他JVMTI功能,这为工具提供了极大的灵活性。例如,由代理决定是否需要在每个事件点收集堆栈跟踪。

Testing(测试)

The JTreg framework contains 16 tests for this feature. They cover turning the sampler on and off with multiple threads, multiple threads allocating at the same time, whether the data is sampled at the right interval, and whether the gathered stacks reflect the correct program information.
JTreg框架中针对该特性有16个测试:使用多个线程打开/关闭,同时分配多个线程,测试数据是否以正确的间隔采样,以及收集的堆栈是否反映正确的程序信息。

Risks and Assumptions(风险和假设)

There are no performance penalties or risks with the feature disabled. A user who does not enable the system will not perceive a performance difference.
禁用该特性不会造成性能损失或风险。没有启用系统的用户不会感知到性能差异。
However, there is a potential performance/memory penalty with the feature enabled. In the initial prototype implementation, the overhead was minimal (<2%). This used a more heavyweight mechanism that modified JIT’d code. In the final version presented here, the system piggy-backs on the TLAB code, and should not experience that regression.
但是,启用该特性会有潜在的性能/内存损失。在最初的原型实现中,开销是最小的(<2%)。这使用了一个更重量级的机制来修改JIT代码。在这里给出的最终版本中,系统依赖于TLAB代码,并且不应该经历这种回归。
Current evaluation of the Dacapo benchmark puts the overhead at:

  • 0% when the feature is disabled
  • 1% when the feature is enabled at the default 512KB interval, but no callback action is performed (i.e., the SampledObjectAlloc callback is empty but registered to the JVM).
  • 3% overhead with a sampling callback that does a naive implementation to store the data (using the one in the tests)

目前对Dacapo基准测试的评估显示开销为:

  • 禁用时为0%
  • 1%,当以默认的512KB间隔启用该特性,但不执行回调动作(即SampledObjectAlloc回调为空,但已注册到JVM)。
  • 3%开销,使用抽样回调,执行简单的实现来存储数据(使用测试中的实现)
大禹的足迹

在阿里搬了几年砖的大龄码农,头条号:大禹的足迹
