A Study of Some Restrictions Docker Places on the JVM | HeapDump Performance Community

Let's start with a well-worn restriction: when we run commands such as jmap against a Java application inside Docker, we often get this error:

Can't attach to the process: ptrace(PTRACE_ATTACH, ..).

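This happens because tools such as jstack and jmap are implemented in one of two ways:

The Attach mechanism (VirtualMachine.attach()), which talks to the target JVM's Attach Listener thread over a socket. For details, see 笨神's article 《JVM源码分析之Attach机制实现完全解读》.
The Serviceability Agent (SA), which is also a form of attach; on Linux it relies on the ptrace system call.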

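Starting with version 1.10, Docker's default seccomp profile disables ptrace, so operations that go through the SA, such as jmap -heap, fail. The Docker documentation gives these workarounds:

Explicitly add the capability with --cap-add=SYS_PTRACE: docker run --cap-add=SYS_PTRACE ...
Disable seccomp, or add ptrace to the allowlist: docker run --security-opt seccomp=unconfined ...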

Besides this restriction, a while ago I was browsing the JDK Bug System and stumbled on this bug: JDK-8140793

getAvailableProcessors may incorrectly report the number of cpus in Docker container

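The bug describes roughly this: when Java runs inside a Docker container, the number of CPUs it reports may be wrong.

As everyone knows, Docker is built on cgroups and namespaces. cgroups is a Linux kernel feature that limits and isolates the resource usage of processes (CPU, memory, disk I/O, network, and so on), so my guess was that the JVM did not read the limits Docker applies through cgroups at run time.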

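Reading further, I saw that the bug's status was RESOLVED, so I kept digging and found this article on the official blog:

"Java SE support for Docker CPU and memory limits". The article links three issues: JDK-8140793, which reports the incorrect CPU count in Docker; JDK-8170888, the enhancement for Docker memory limits; and JDK-8146115, the enhancement for container detection and resource configuration usage.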

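The article says that in JDK 8u121 and earlier (Java SE 8u121 and earlier), the CPU count and memory size the JVM reads are not subject to cgroups limits. What problems does that cause? As far as I know, when we do not set certain flags explicitly, the JVM often derives defaults from the values it reads. For example, if -XX:ParallelGCThreads and -XX:CICompilerCount are not given, the JVM computes them from the CPU count it sees. Here is the code that computes the number of Parallel GC threads, in runtime\vm_version.cpp (everything below is based on OpenJDK 1.8 b120):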

if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
  assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
  // For very large machines, there are diminishing returns
  // for large numbers of worker threads. Instead of
  // hogging the whole system, use a fraction of the workers for every
  // processor after the first 8. For example, on a 72 cpu machine
  // and a chosen fraction of 5/8
  // use 8 + (72 - 8) * (5/8) == 48 worker threads.
  unsigned int ncpus = (unsigned int) os::active_processor_count();
  return (ncpus <= switch_pt) ?
         ncpus :
         (switch_pt + ((ncpus - switch_pt) * num) / den);
} else {
  return ParallelGCThreads;
}
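
To make the formula concrete, here is a minimal standalone sketch of the same arithmetic. The function name is hypothetical, and the defaults 5/8 and 8 are simply the values from the comment's own example, not necessarily what every collector passes in.

```cpp
#include <cassert>

// Hypothetical helper that mirrors the formula in the excerpt above.
// num/den is the fraction applied to every CPU past switch_pt.
// The defaults (5/8, switch point 8) come from the comment's example.
unsigned int default_parallel_worker_threads(unsigned int ncpus,
                                             unsigned int num = 5,
                                             unsigned int den = 8,
                                             unsigned int switch_pt = 8) {
  assert(den != 0);
  return (ncpus <= switch_pt)
             ? ncpus
             : switch_pt + ((ncpus - switch_pt) * num) / den;
}
```

With these values, a JVM that sees all 72 host CPUs would pick 48 GC threads, even if its container is only allowed 2 CPUs, in which case 2 threads would be the sensible default.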

Stepping into os::active_processor_count(), which obtains the CPU count (Linux implementation in os_linux.cpp):

int os::active_processor_count() {
  // Linux doesn't yet have a (official) notion of processor sets,
  // so just return the number of online processors.
  int online_cpus = ::sysconf(_SC_NPROCESSORS_ONLN);
  assert(online_cpus > 0 && online_cpus <= processor_count(), "sanity check");
  return online_cpus;
}

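So the count really does come from ::sysconf(_SC_NPROCESSORS_ONLN), which reports the physical machine's online CPUs. The GC thread count is therefore computed from the wrong number, and the JIT compiler thread count has the same problem.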
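
As a rough way to see the difference on a Linux box, the sketch below compares the host-wide online count with the CPUs the current process is actually allowed to run on. The two function names are hypothetical. Note that neither call knows about a cgroup CPU quota, so this only exposes limits set through CPU affinity, for example a cpuset.

```cpp
#include <cassert>
#include <sched.h>
#include <unistd.h>

// Number of processors the kernel reports as online (host-wide view).
long online_processors() {
  return ::sysconf(_SC_NPROCESSORS_ONLN);
}

// Number of processors this process may be scheduled on,
// or -1 if the affinity mask cannot be read.
int affinity_processors() {
  cpu_set_t set;
  CPU_ZERO(&set);
  if (sched_getaffinity(0, sizeof(set), &set) != 0) {
    return -1;
  }
  return CPU_COUNT(&set);
}
```

Printing both values inside a container pinned to a subset of CPUs should show the second number drop while the first stays at the host value.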

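Memory is affected in the same way as CPU. When we do not explicitly set flags such as -Xmx (MaxHeapSize) and -Xms (InitialHeapSize), the JVM derives defaults from the machine memory size it reads, for example: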

void Arguments::set_heap_size() {
  if (!FLAG_IS_DEFAULT(DefaultMaxRAMFraction)) {
    // Deprecated flag
    FLAG_SET_CMDLINE(uintx, MaxRAMFraction, DefaultMaxRAMFraction);
  }
  const julong phys_mem =
    FLAG_IS_DEFAULT(MaxRAM) ? MIN2(os::physical_memory(), (julong)MaxRAM)
                            : (julong)MaxRAM;
  // If the maximum heap size has not been set with -Xmx,
  // then set it as fraction of the size of physical memory,
  // respecting the maximum and minimum sizes of the heap.
  if (FLAG_IS_DEFAULT(MaxHeapSize)) {
    julong reasonable_max = phys_mem / MaxRAMFraction;
    if (phys_mem <= MaxHeapSize * MinRAMFraction) {
      // Small physical memory, so use a minimum fraction of it for the heap
      reasonable_max = phys_mem / MinRAMFraction;
    }
    ...
  }
}

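Here os::physical_memory() also returns the physical memory of the host, and running with that value in Docker can cause a range of failures, such as the process being killed by the OOM killer (reference).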
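
The visible part of that calculation can be sketched as plain arithmetic. This is only an illustration: the function name is hypothetical, the three tunables are passed in as parameters standing for MaxHeapSize, MaxRAMFraction and MinRAMFraction, and the further clamping done in the elided part of the real method is not reproduced.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the visible part of Arguments::set_heap_size(): the default
// maximum heap is a fraction of the memory the JVM believes it has.
// default_max_heap, max_ram_fraction and min_ram_fraction stand in for
// MaxHeapSize, MaxRAMFraction and MinRAMFraction; their values are
// supplied by the caller and are assumptions of this example.
std::uint64_t reasonable_max_heap(std::uint64_t phys_mem,
                                  std::uint64_t default_max_heap,
                                  std::uint64_t max_ram_fraction,
                                  std::uint64_t min_ram_fraction) {
  assert(max_ram_fraction != 0 && min_ram_fraction != 0);
  std::uint64_t reasonable_max = phys_mem / max_ram_fraction;
  if (phys_mem <= default_max_heap * min_ram_fraction) {
    // Small physical memory: use the larger share of it for the heap.
    reasonable_max = phys_mem / min_ram_fraction;
  }
  return reasonable_max;
}
```

For instance, with 64 GiB of host memory and a fraction of 4, the sketch yields 16 GiB. If the container is capped at 1 GiB, a heap that is allowed to grow that far is exactly the kind of setup that ends with the OOM killer.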

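So with older JDK 8 builds, we can run into some rather strange problems if we do not set these flags explicitly. In JDK-8146115 I found that this enhanced Docker support was implemented in JDK 10, where container support is controlled by -XX:+UseContainerSupport (it is on by default, so -XX:-UseContainerSupport turns it off). The enhancement has also been backported to newer JDK 8 releases (versions after JDK 8u131; the UseContainerSupport backport itself shipped in 8u191).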

I downloaded a newer OpenJDK 8 and, reading the source, found that Oracle had indeed handled this.

The original os::active_processor_count() has become:

// Determine the active processor count from one of
// three different sources:
//
// 1. User option -XX:ActiveProcessorCount
// 2. kernel os calls (sched_getaffinity or sysconf(_SC_NPROCESSORS_ONLN)
// 3. extracted from cgroup cpu subsystem (shares and quotas)
//
// Option 1, if specified, will always override.
// If the cgroup subsystem is active and configured, we
// will return the min of the cgroup and option 2 results.
// This is required since tools, such as numactl, that
// alter cpu affinity do not update cgroup subsystem
// cpuset configuration files.
int os::active_processor_count() {
  // User has overridden the number of active processors
  if (ActiveProcessorCount > 0) {
    if (PrintActiveCpus) {
      tty->print_cr("active_processor_count: "
                    "active processor count set by user : %d",
                    ActiveProcessorCount);
    }
    return ActiveProcessorCount;
  }
  int active_cpus;
  if (OSContainer::is_containerized()) {
    active_cpus = OSContainer::active_processor_count();
    if (PrintActiveCpus) {
      tty->print_cr("active_processor_count: determined by OSContainer: %d",
                    active_cpus);
    }
  } else {
    active_cpus = os::Linux::active_processor_count();
  }
  return active_cpus;
}

The logic is clear: if -XX:ActiveProcessorCount is set, that value is used; otherwise OSContainer::is_containerized() is called to check whether the JVM is containerized:

inline bool OSContainer::is_containerized() {
  assert(_is_initialized, "OSContainer not initialized");
  return _is_containerized;
}

_is_containerized is set when Threads::create_vm calls OSContainer::init(), which checks whether the VM is running in a container (the full method is too long to show here):

/* init
 *
 * Initialize the container support and determine if
 * we are running under cgroup control.
 */
void OSContainer::init() {
  int mountid;
  int parentid;
  int major;
  int minor;
  FILE *mntinfo = NULL;
  FILE *cgroup = NULL;
  char buf[MAXPATHLEN+1];
  char tmproot[MAXPATHLEN+1];
  char tmpmount[MAXPATHLEN+1];
  char tmpbase[MAXPATHLEN+1];
  char *p;
  jlong mem_limit;
  assert(!_is_initialized, "Initializing OSContainer more than once");
  _is_initialized = true;
  _is_containerized = false;
  _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size();
  if (PrintContainerInfo) {
    tty->print_cr("OSContainer::init: Initializing Container Support");
  }
  if (!UseContainerSupport) {
    if (PrintContainerInfo) {
      tty->print_cr("Container Support not enabled");
    }
    return;
  }
  ...........
  _is_containerized = true;
}

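The method runs a series of checks, such as whether the UseContainerSupport flag is enabled and whether /proc/self/mountinfo and /proc/self/cgroup are readable. If it concludes that the JVM is running in a container, OSContainer::active_processor_count() is called to obtain the CPU count the container is limited to: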

/* active_processor_count
 *
 * Calculate an appropriate number of active processors for the
 * VM to use based on these three inputs.
 *
 * cpu affinity
 * cgroup cpu quota & cpu period
 * cgroup cpu shares
 *
 * Algorithm:
 *
 * Determine the number of available CPUs from sched_getaffinity
 *
 * If user specified a quota (quota != -1), calculate the number of
 * required CPUs by dividing quota by period.
 *
 * If shares are in effect (shares != -1), calculate the number
 * of CPUs required for the shares by dividing the share value
 * by PER_CPU_SHARES.
 *
 * All results of division are rounded up to the next whole number.
 *
 * If neither shares or quotas have been specified, return the
 * number of active processors in the system.
 *
 * If both shares and quotas have been specified, the results are
 * based on the flag PreferContainerQuotaForCPUCount. If true,
 * return the quota value. If false return the smallest value
 * between shares or quotas.
 *
 * If shares and/or quotas have been specified, the resulting number
 * returned will never exceed the number of active processors.
 *
 * return:
 * number of CPUs
 */
int OSContainer::active_processor_count() {
  int quota_count = 0, share_count = 0;
  int cpu_count, limit_count;
  int result;
  cpu_count = limit_count = os::Linux::active_processor_count();
  int quota = cpu_quota();
  int period = cpu_period();
  int share = cpu_shares();
  ...........
}

The comment shows that the count is now computed from the cgroup cpu quota & cpu period and the cgroup cpu shares, which Docker lets you set with options such as --cpu-period and --cpu-quota.
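
Since the body of the method is elided above, here is a minimal sketch of the algorithm as the comment describes it, not the actual HotSpot code. All names are hypothetical, and the value 1024 for the shares-per-CPU constant (PER_CPU_SHARES in the comment) is an assumption of this example.

```cpp
#include <algorithm>
#include <cassert>

// Assumed value for the constant the comment calls PER_CPU_SHARES.
const int kPerCpuShares = 1024;

// Integer division rounded up to the next whole number.
int ceil_div(int a, int b) { return (a + b - 1) / b; }

// host_cpus: CPUs available according to the affinity mask.
// quota, period, shares: cgroup values, with -1 meaning "not set".
// prefer_quota: stands in for the PreferContainerQuotaForCPUCount flag.
int container_cpu_count(int host_cpus, int quota, int period, int shares,
                        bool prefer_quota) {
  int quota_count = 0;
  int share_count = 0;
  if (quota > -1 && period > 0) {
    quota_count = ceil_div(quota, period);
  }
  if (shares > -1) {
    share_count = ceil_div(shares, kPerCpuShares);
  }
  int limit = host_cpus;  // neither quota nor shares: use the system count
  if (quota_count != 0 && share_count != 0) {
    limit = prefer_quota ? quota_count : std::min(quota_count, share_count);
  } else if (quota_count != 0) {
    limit = quota_count;
  } else if (share_count != 0) {
    limit = share_count;
  }
  // The result never exceeds the number of active processors.
  return std::min(host_cpus, limit);
}
```

For example, a container started with --cpu-period=100000 --cpu-quota=150000 on a 72-CPU host gives a quota of 1.5 CPUs, which rounds up to 2.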

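Memory is handled similarly. If -Xmx is not given, the JVM can be started with -XX:+UnlockExperimentalVMOptions and -XX:+UseCGroupMemoryLimitForHeap so that it uses the Linux cgroup configuration to determine the maximum Java heap size.

The Arguments::set_heap_size() method: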

void Arguments::set_heap_size() {
  if (!FLAG_IS_DEFAULT(DefaultMaxRAMFraction)) {
    // Deprecated flag
    FLAG_SET_CMDLINE(uintx, MaxRAMFraction, DefaultMaxRAMFraction);
  }
  julong phys_mem =
    FLAG_IS_DEFAULT(MaxRAM) ? MIN2(os::physical_memory(), (julong)MaxRAM)
                            : (julong)MaxRAM;
  // Experimental support for CGroup memory limits
  if (UseCGroupMemoryLimitForHeap) {
    // This is a rough indicator that a CGroup limit may be in force
    // for this process
    const char* lim_file = "/sys/fs/cgroup/memory/memory.limit_in_bytes";
    FILE *fp = fopen(lim_file, "r");
    if (fp != NULL) {
      julong cgroup_max = 0;
      int ret = fscanf(fp, JULONG_FORMAT, &cgroup_max);
      if (ret == 1 && cgroup_max > 0) {
        // If unlimited, cgroup_max will be a very large, but unspecified
        // value, so use initial phys_mem as a limit
        if (PrintGCDetails && Verbose) {
          // Cannot use gclog_or_tty yet.
          tty->print_cr("Setting phys_mem to the min of cgroup limit ("
                        JULONG_FORMAT "MB) and initial phys_mem ("
                        JULONG_FORMAT "MB)", cgroup_max/M, phys_mem/M);
        }
        phys_mem = MIN2(cgroup_max, phys_mem);
      } else {
        warning("Unable to read/parse cgroup memory limit from %s: %s",
                lim_file, errno != 0 ? strerror(errno) : "unknown error");
      }
      fclose(fp);
    } else {
      warning("Unable to open cgroup memory limit file %s (%s)", lim_file, strerror(errno));
    }
  }
  ....................
}
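
The core of that branch is "read one number from the limit file and take the minimum with the physical memory". A minimal standalone sketch of that idea follows. The function name is hypothetical, and the file path is a parameter only so the sketch can be pointed at a test file instead of the real cgroup file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Returns min(cgroup limit, phys_mem). Falls back to phys_mem when the
// limit file cannot be opened or does not start with a positive number.
// lim_file is a parameter of this sketch; the excerpt above hard-codes
// /sys/fs/cgroup/memory/memory.limit_in_bytes.
std::uint64_t effective_phys_mem(const std::string& lim_file,
                                 std::uint64_t phys_mem) {
  std::ifstream in(lim_file);
  std::uint64_t cgroup_max = 0;
  if (in && (in >> cgroup_max) && cgroup_max > 0) {
    // An "unlimited" cgroup reports a very large value, so the minimum
    // leaves phys_mem unchanged in that case.
    return std::min(cgroup_max, phys_mem);
  }
  return phys_mem;
}
```

With a 1 GiB limit in the file and 64 GiB of host memory, the function returns 1 GiB, so the fraction-based default heap is derived from the container's limit rather than the host's memory.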