
Byte Alignment and Java's Pointer Compression (Part 2): Pointer Compression


In the previous article, 《字节对齐与Java的指针压缩(上)-字节对齐的渊源》, we covered some of the history behind byte alignment and why Java settled on 8-byte alignment. We also observed that, on a 64-bit JDK, turning on pointer compression changes the memory layout of a ClassX object as follows:
Without compressed oops:

+0:  [ _mark (8 bytes)  ]
+8:  [ _klass (8 bytes) ]
+16: [ field : long ClassX.l (8 bytes) ]
...
+40: [ field : java.lang.Object ClassX.o1 (8 bytes) ]
+48: [ field : java.lang.Object ClassX.o2 (8 bytes) ]

With compressed oops enabled:

+0:  [      _mark (8 bytes)             ]
+8:  [      _narrow_klass (4 bytes)     ]
+12: [ padding or first field (4 bytes) ]  (here: int ClassX.i, 4 bytes)
+16: [ field : long ClassX.l (8 bytes)  ]
...
+32: [ field : java.lang.Object ClassX.o1 (4 bytes) ]
+36: [ field : java.lang.Object ClassX.o2 (4 bytes) ]

Here we can clearly see that with compressed oops enabled, the klass pointer shrinks from an 8-byte _klass to a 4-byte _narrow_klass, and our ordinary object pointers (oops) likewise shrink from 8 bytes to 4 bytes.
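
If you want to reproduce these layouts yourself, the sketch below uses JOL, the same tool used in Part 1. It is a minimal sketch under assumptions: the ClassX here (fields l, i, o1, o2) is my reconstruction of the class from Part 1, and it assumes org.openjdk.jol:jol-core is on the classpath.

import org.openjdk.jol.info.ClassLayout;

public class LayoutDemo {
    // Reconstruction of the ClassX from Part 1 (the exact field set is an assumption).
    static class ClassX {
        long l;
        int i;
        Object o1;
        Object o2;
    }

    public static void main(String[] args) {
        // Run once with -XX:-UseCompressedOops and once with -XX:+UseCompressedOops
        // (the default on 64-bit HotSpot for heaps below ~32 GB) and compare the output.
        System.out.println(ClassLayout.parseInstance(new ClassX()).toPrintable());
    }
}

With compression off you should see the 8-byte klass word and 8-byte reference fields; with it on, the 4-byte klass word, the int packed into the gap, and 4-byte references.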

First, let's look at the official introduction to pointer compression, the 《CompressedOops》 page.

The page starts by explaining what an oop is:

An “oop”, or “ordinary object pointer” in HotSpot parlance is a managed pointer to an object. It is normally the same size as a native machine pointer. A managed pointer is carefully tracked by the Java application and GC subsystem, so that storage for unused objects can be reclaimed. This process can also involve relocating (copying) objects which are in use, so that storage can be compacted.

Put simply, an oop is an "ordinary object pointer": a managed pointer to an object, normally the same size as a native machine pointer. In our earlier article 《从new Class()入手浅看JVM的oop-klass模型》 we saw that a Java object instance, instanceOop, is one kind of oop (there are others, such as typeArrayOop and objArrayOop for array types).

An oop represents an object's instance data. It looks like a pointer, but the instance data actually lives in the stretch of memory that starts at the address the pointer points to. A klass, on the other hand, holds metadata and method information; it describes either a Java class or a C++ type used internally by the JVM.

Why do we say the instance data hides in the memory behind the address the pointer points to? Let's dig into the code. Oops are created by InstanceKlass (since InstanceKlass knows everything about the object, such as its field count and size, it can create the corresponding instanceOop/arrayOop), specifically in InstanceKlass::allocate_instance (based on OpenJDK 8u):

instanceOop InstanceKlass::allocate_instance(TRAPS) {
  bool has_finalizer_flag = has_finalizer(); // Query before possible GC
  int size = size_helper();  // Query before forming handle.

  KlassHandle h_k(THREAD, this);

  instanceOop i;

  i = (instanceOop)CollectedHeap::obj_allocate(h_k, size, CHECK_NULL);
  if (has_finalizer_flag && !RegisterFinalizersAtInit) {
    i = register_finalizer(i, CHECK_NULL);
  }
  return i;
}

The step that actually allocates the object on the heap is i = (instanceOop)CollectedHeap::obj_allocate(h_k, size, CHECK_NULL);

oop CollectedHeap::obj_allocate(KlassHandle klass, int size, TRAPS) {
  debug_only(check_for_valid_allocation_state());
  assert(!Universe::heap()->is_gc_active(), "Allocation during gc not allowed");
  assert(size >= 0, "int won't convert to size_t");
  HeapWord* obj = common_mem_allocate_init(klass, size, CHECK_NULL);
  post_allocation_setup_obj(klass, obj, size);
  NOT_PRODUCT(Universe::heap()->check_for_bad_heap_word_value(obj, size));
  return (oop)obj;
}

We can clearly see that the JVM takes the object's size and passes it to common_mem_allocate_init:

HeapWord* CollectedHeap::common_mem_allocate_init(KlassHandle klass, size_t size, TRAPS) {
  HeapWord* obj = common_mem_allocate_noinit(klass, size, CHECK_NULL);
  init_obj(obj, size);
  return obj;
}

First, common_mem_allocate_noinit requests a block of memory and returns a HeapWord* (the starting address of that block, essentially a char* pointer). init_obj then initializes the words of that block beyond the header, while post_allocation_setup_obj (seen in obj_allocate above) fills in the header data, which is where calls such as ((oop)obj)->set_klass_gap(0); happen. The block's address is then cast to an oop (a pointer) and returned, and InstanceKlass::allocate_instance finally casts it to an instanceOop.

Tracing the source code, we find that a Java object instance, instanceOop, is essentially still just an oop, and the same is true of the many other object-pointer types. (The name "oop" itself is a legacy of Smalltalk and Self; for the history, see Self (a prototype-based relative of Smalltalk) and Strongtalk (a Smalltalk implementation).)

Why compress?

Now that we know what an oop is, why compress it at all? Here is the official explanation of the motivation:

On an LP64 system, a machine word, and hence an oop, requires 64 bits, while on an ILP32 system, oops are only 32 bits. But on an ILP32 system there is a maximum heap size of somewhat less than 4Gb, which is not enough for many applications. On an LP64 system, though, the heap for any given run may have to be around 1.5 times as large as for the corresponding ILP32 system (assuming the run fits both modes). This is due to the expanded size of managed pointers. Memory is pretty cheap, but these days bandwidth and cache is in short supply, so significantly increasing the size of the heap just to get over the 4Gb limit is painful.

In other words, on a 32-bit (ILP32) machine a machine word, and hence an oop, is only 32 bits, while on a 64-bit (LP64) system it takes 64 bits. Because of the larger pointers, the heap for a given run on a 64-bit system may need to be about 1.5 times as large as on the corresponding 32-bit system (I don't know how that figure was derived; my guess is that 1.5 is the 64-bit object header size with compression, 8 bytes + 4 bytes = 12 bytes, divided by the 32-bit object header size, 4 bytes + 4 bytes = 8 bytes). Each object's oop and klass pointer doubles in size, and as objects pile up that really does eat a lot of memory.

More memory also means longer GC cycles and lower performance. Here is Oracle's own description, from the official 《Frequently Asked Questions》:

What are the performance characteristics of 64-bit versus 32-bit VMs?

Generally, the benefits of being able to address larger amounts of memory come with a small performance loss in 64-bit VMs versus running the same application on a 32-bit VM. This is due to the fact that every native pointer in the system takes up 8 bytes instead of 4. The loading of this extra data has an impact on memory usage which translates to slightly slower execution depending on how many pointers get loaded during the execution of your Java program. The good news is that with AMD64 and EM64T platforms running in 64-bit mode, the Java VM gets some additional registers which it can use to generate more efficient native instruction sequences. These extra registers increase performance to the point where there is often no performance loss at all when comparing 32 to 64-bit execution speed.

The performance difference comparing an application running on a 64-bit platform versus a 32-bit platform on SPARC is on the order of 10-20% degradation when you move to a 64-bit VM. On AMD64 and EM64T platforms this difference ranges from 0-15% depending on the amount of pointer accessing your application performs.

In short: loading this extra pointer data affects memory usage, which translates into slightly slower execution depending on how many pointers your Java program loads as it runs. On SPARC, moving to a 64-bit VM costs roughly a 10-20% degradation compared with running the same application on a 32-bit platform; on AMD64 and EM64T platforms the difference ranges from 0-15%, depending on how much pointer accessing the application performs.

What gets compressed

First, here are the types the official 《CompressedOops》 page says get compressed:

Which oops are compressed?

In an ILP32-mode JVM, or if the UseCompressedOops flag is turned off in LP64 mode, all oops are the native machine word size.

If UseCompressedOops is true, the following oops in the heap will be compressed:

  • the klass field of every object
  • every oop instance field
  • every element of an oop array (objArray)

We can see that pointer compression only takes effect on a 64-bit JVM; on a 32-bit JVM, or on a 64-bit JVM with compression turned off, oops are the native machine word size. What actually gets compressed is the klass field of every object, every oop instance field, and the oop in every element of an oop array.
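
To see the last two items in that list directly, the following sketch (again assuming a recent jol-core on the classpath) prints the running VM's view of reference size and object alignment, plus the layout of a small Object[], whose elements are the compressed oops of an objArray.

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.vm.VM;

public class CompressedFieldsDemo {
    public static void main(String[] args) {
        // Reference size, object alignment and header sizes as reported by the running VM.
        System.out.println(VM.current().details());
        // Layout of an object array: with compressed oops each element is 4 bytes wide.
        System.out.println(ClassLayout.parseInstance(new Object[4]).toPrintable());
    }
}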

You have presumably heard of the -XX:+UseCompressedOops flag. To understand it, let's look at arguments.cpp.

The passage below is quoted from an article by R大 (RednaxelaFX):

Starting with JDK 5, the HotSpot VM automatically sets a number of VM parameters based on the runtime environment (ergonomics). The most familiar examples are probably the automatic choice between client and server mode and of the initial and maximum heap sizes. In fact, ergonomics sets a great many internal parameters, including the GC algorithm, the number of parallel GC threads, the GC work-chunk size, object promotion thresholds, and so on.

Most of the ergonomics logic lives in hotspot/src/share/vm/runtime/arguments.cpp; the places that use FLAG_SET_ERGO() are worth watching for.

In arguments.cpp we find the relevant method, Arguments::set_use_compressed_oops():

void Arguments::set_use_compressed_oops() {
#ifndef ZERO
#ifdef _LP64
  // MaxHeapSize is not set up properly at this point, but
  // the only value that can override MaxHeapSize if we are
  // to use UseCompressedOops is InitialHeapSize.
  size_t max_heap_size = MAX2(MaxHeapSize, InitialHeapSize);

  if (max_heap_size <= max_heap_for_compressed_oops()) {
#if !defined(COMPILER1) || defined(TIERED)
    if (FLAG_IS_DEFAULT(UseCompressedOops)) {
      FLAG_SET_ERGO(bool, UseCompressedOops, true);
    }
#endif
#ifdef _WIN64
    if (UseLargePages && UseCompressedOops) {
      // Cannot allocate guard pages for implicit checks in indexed addressing
      // mode, when large pages are specified on windows.
      // This flag could be switched ON if narrow oop base address is set to 0,
      // see code in Universe::initialize_heap().
      Universe::set_narrow_oop_use_implicit_null_checks(false);
    }
#endif //  _WIN64
  } else {
    if (UseCompressedOops && !FLAG_IS_DEFAULT(UseCompressedOops)) {
      warning("Max heap size too large for Compressed Oops");
      FLAG_SET_DEFAULT(UseCompressedOops, false);
      FLAG_SET_DEFAULT(UseCompressedClassPointers, false);
    }
  }
#endif // _LP64
#endif // ZERO
}

So UseCompressedOops is not enabled by default on non-64-bit builds, nor when it has been explicitly switched off on the command line. It is only enabled by default on a 64-bit build (#ifdef _LP64), when the VM is not a C1-only client VM (#if !defined(COMPILER1) || defined(TIERED)), and when max_heap_size <= max_heap_for_compressed_oops(). max_heap_for_compressed_oops is computed as follows:

size_t Arguments::max_heap_for_compressed_oops() {
  // Avoid sign flip.
  assert(OopEncodingHeapMax > (uint64_t)os::vm_page_size(), "Unusual page size");
  // We need to fit both the NULL page and the heap into the memory budget, while
  // keeping alignment constraints of the heap. To guarantee the latter, as the
  // NULL page is located before the heap, we pad the NULL page to the conservative
  // maximum alignment that the GC may ever impose upon the heap.
  size_t displacement_due_to_null_page = align_size_up_(os::vm_page_size(),
                                                        _conservative_max_heap_alignment);

  LP64_ONLY(return OopEncodingHeapMax - displacement_due_to_null_page);
  NOT_LP64(ShouldNotReachHere(); return 0);
}

OopEncodingHeapMax is initialized in set_object_alignment():

void set_object_alignment() {
  // Object alignment.
  assert(is_power_of_2(ObjectAlignmentInBytes), "ObjectAlignmentInBytes must be power of 2");
  MinObjAlignmentInBytes     = ObjectAlignmentInBytes;
  assert(MinObjAlignmentInBytes >= HeapWordsPerLong * HeapWordSize, "ObjectAlignmentInBytes value is too small");
  MinObjAlignment            = MinObjAlignmentInBytes / HeapWordSize;
  assert(MinObjAlignmentInBytes == MinObjAlignment * HeapWordSize, "ObjectAlignmentInBytes value is incorrect");
  MinObjAlignmentInBytesMask = MinObjAlignmentInBytes - 1;

  LogMinObjAlignmentInBytes  = exact_log2(ObjectAlignmentInBytes);
  LogMinObjAlignment         = LogMinObjAlignmentInBytes - LogHeapWordSize;

  // Oop encoding heap max
  OopEncodingHeapMax = (uint64_t(max_juint) + 1) << LogMinObjAlignmentInBytes;

#if INCLUDE_ALL_GCS
  // Set CMS global values
  CompactibleFreeListSpace::set_cms_values();
#endif // INCLUDE_ALL_GCS
}

Here (uint64_t(max_juint) + 1), also known as NarrowOopHeapMax, is 2^32, i.e. 0x100000000. ObjectAlignmentInBytes defaults to 8 on 64-bit HotSpot, so with the default 8-byte alignment LogMinObjAlignmentInBytes is 3, which gives OopEncodingHeapMax = 0x800000000, i.e. 32 GB. displacement_due_to_null_page depends on the garbage collector in use, but is generally small.
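
As a quick sanity check of that arithmetic (not HotSpot code, just the same shift written out in Java):

public class OopEncodingMath {
    public static void main(String[] args) {
        int logMinObjAlignmentInBytes = 3;                     // default 8-byte alignment
        long narrowOopHeapMax = 1L << 32;                      // (max_juint + 1) = 0x100000000
        long oopEncodingHeapMax = narrowOopHeapMax << logMinObjAlignmentInBytes;
        System.out.printf("NarrowOopHeapMax   = 0x%x (%d GB)%n",
                narrowOopHeapMax, narrowOopHeapMax >> 30);     // 0x100000000 (4 GB)
        System.out.printf("OopEncodingHeapMax = 0x%x (%d GB)%n",
                oopEncodingHeapMax, oopEncodingHeapMax >> 30); // 0x800000000 (32 GB)
    }
}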

So the condition boils down to max_heap_size <= a value just shy of 32 GB: UseCompressedOops is enabled by default on a 64-bit build (#ifdef _LP64), when the VM is not a C1-only client VM, and when max_heap_size does not exceed roughly 32 GB.

What does enabling UseCompressedOops actually do? On a 64-bit machine with the flag on, a 32-bit unsigned integer value (narrowOop) is used in place of the 64-bit oop pointer:

typedef juint narrowOop; // Offset instead of address for an oop within a java object

This is exactly why JOL shows [ field : java.lang.Object ClassX.o1 ] going from 8 bytes to 4 bytes.


Good, our oops are now compressed. If we keep reading, we find another method in arguments.cpp, Arguments::set_use_compressed_klass_ptrs() (right below Arguments::set_use_compressed_oops()):

// NOTE: set_use_compressed_klass_ptrs() must be called after calling
// set_use_compressed_oops().
void Arguments::set_use_compressed_klass_ptrs() {
#ifndef ZERO
#ifdef _LP64
  // UseCompressedOops must be on for UseCompressedClassPointers to be on.
  if (!UseCompressedOops) {
    if (UseCompressedClassPointers) {
      warning("UseCompressedClassPointers requires UseCompressedOops");
    }
    FLAG_SET_DEFAULT(UseCompressedClassPointers, false);
  } else {
    // Turn on UseCompressedClassPointers too
    if (FLAG_IS_DEFAULT(UseCompressedClassPointers)) {
      FLAG_SET_ERGO(bool, UseCompressedClassPointers, true);
    }
    // Check the CompressedClassSpaceSize to make sure we use compressed klass ptrs.
    if (UseCompressedClassPointers) {
      if (CompressedClassSpaceSize > KlassEncodingMetaspaceMax) {
        warning("CompressedClassSpaceSize is too large for UseCompressedClassPointers");
        FLAG_SET_DEFAULT(UseCompressedClassPointers, false);
      }
    }
  }
#endif // _LP64
#endif // !ZERO
}

What does this method set? A startup flag, -XX:+UseCompressedClassPointers: if we enable -XX:+UseCompressedOops, the JVM enables -XX:+UseCompressedClassPointers for us by default. And what is it for? Recall the list of things to be compressed above: besides oops, there is also "the klass field of every object".

Think back to our earlier article 《从new Class()入手浅谈Java的oop-klass模型》, which introduced _metadata:

class oopDesc {
  // friend class
  friend class VMStructs;
 private:
  volatile markOop  _mark;
  union _metadata {
    Klass*      _klass;
    narrowKlass _compressed_klass;
  } _metadata;

  // Fast access to barrier set.  Must be initialized.
  // used for the GC barrier set
  static BarrierSet* _bs;
  ........
}

Exactly: once we enable -XX:+UseCompressedClassPointers, the pointer inside _metadata is compressed from a 64-bit Klass* to a 32-bit unsigned integer value, narrowKlass:

// If compressed klass pointers then use narrowKlass.
typedef juint  narrowKlass;

And this is exactly why, in JOL, we see the change from

+0: [ _mark (8 bytes) ]
+8: [ _klass (8 bytes) ]
+16: [ field : long ClassX.l (8 bytes) ]

->

+0: [ _mark (8 bytes) ]
+8: [ _narrow_klass (4 bytes) ]
+12: [ padding or first field (4 bytes) ] (here: int ClassX.i, 4 bytes)
+16: [ field : long ClassX.l (8 bytes) ]

(And for a typeArrayOop or objArrayOop, the 4 bytes saved this way can even hold the array length.)
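
If you want to confirm what ergonomics decided on your own machine, here is a small sketch using the standard HotSpotDiagnosticMXBean from the JDK's com.sun.management package; it prints both flags and where their values came from (the class name and structure are mine, only the MXBean API is standard):

import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

public class CoopsFlagsCheck {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String name : new String[] {"UseCompressedOops", "UseCompressedClassPointers"}) {
            VMOption option = bean.getVMOption(name);
            // The origin is ERGONOMIC when the JVM turned the flag on by itself.
            System.out.println(name + " = " + option.getValue()
                    + " (origin: " + option.getOrigin() + ")");
        }
    }
}

Running it with, say, -Xmx4g and then with -Xmx40g should show both flags flipping from ergonomically enabled to disabled once the requested heap crosses the ~32 GB threshold.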

Choosing the compression scheme

A 32-bit oop is more compact, uses less memory, and is faster, but it has one inescapable limitation: 32 bits can only reference 4 GB of memory (2^32). Here is the official description from the 《Frequently Asked Questions》:

Why can’t I get a larger heap with the 32-bit JVM?

The maximum theoretical heap limit for the 32-bit JVM is 4G. Due to various additional constraints such as available swap, kernel address space usage, memory fragmentation, and VM overhead, in practice the limit can be much lower. On most modern 32-bit Windows systems the maximum heap size will range from 1.4G to 1.6G. On 32-bit Solaris kernels the address space is limited to 2G. On 64-bit operating systems running the 32-bit VM, the max heap size can be higher, approaching 4G on many Solaris systems.

As of Java SE 6, the Windows /3GB boot.ini feature is not supported.

If your application requires a very large heap you should use a 64-bit VM on a version of the operating system that supports 64-bit applications. See Java SE Supported System Configurations for details.

The theoretical maximum heap for a 32-bit JVM is 4 GB, but due to various additional constraints, such as available swap, kernel address space usage, memory fragmentation, and the VM's own overhead, the practical limit can be much lower. According to the experiments in 《DOES 32-BIT OR 64-BIT JVM MATTER ANYMORE?》, the usable heap on various 32-bit systems is:

OS         Max heap
Linux      2 – 3 GB
AIX        3.25 GB
Windows    1.5 GB
Solaris    2 – 4 GB
Mac OS X   3.8 GB

So whenever we need a heap larger than the 4 GB theoretical maximum, we have to move to a 64-bit JVM, and the JVM in turn chooses to compress oops where it can to claw back the extra memory overhead.

So how is the compression scheme chosen? In the previous article, 《字节对齐与Java的指针压缩(上)-字节对齐的渊源》, we saw that by default the whole JVM uses 8-byte alignment, and the pointer compression scheme has to match that alignment.

First, a compressed oop is always 32 bits, and 32 bits on their own only cover 2^32 = 4 GB. So the JVM adds a shift: when the 32-bit value is used, it is shifted left by 3 bits, so the resulting reference ends in three zero bits and becomes a 35-bit address the hardware actually works with; when an address is stored back as a compressed oop, it is shifted right by 3 bits, dropping those trailing zeros. This way a 32-bit oop lets us address 2^35 = 32 GB of memory.

Why a shift of exactly 3 bits? Because our overall alignment is 8 bytes (objects sit on 8-byte boundaries), every address recovered from a compressed oop must end in three zero bits, which is precisely what lets it reach every object at an address divisible by 8; an object at an address not divisible by 8 could not be reached by the JVM this way at all.
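
Here is a toy model of that zero-based encode/decode (my own illustration, not HotSpot code): an 8-byte-aligned 64-bit address below 32 GB round-trips through a 32-bit value.

public class NarrowOopModel {
    static final int SHIFT = 3; // log2(ObjectAlignmentInBytes) with the default 8-byte alignment

    // Encode: drop the three trailing zero bits of an 8-byte-aligned address.
    static int encode(long address) {
        return (int) (address >>> SHIFT);
    }

    // Decode: restore the trailing zeros to rebuild the full 64-bit address.
    static long decode(int narrowOop) {
        return Integer.toUnsignedLong(narrowOop) << SHIFT;
    }

    public static void main(String[] args) {
        long address = 0x7FFFFFFF8L; // 8-byte aligned, just below the 32 GB limit
        int narrow = encode(address);
        System.out.printf("address=0x%x narrow=0x%x decoded=0x%x%n",
                address, narrow, decode(narrow));
    }
}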

Leafing through the source, we find:

minObjAlignmentInBytes = getObjectAlignmentInBytes();
if (minObjAlignmentInBytes == 8) {
  logMinObjAlignmentInBytes = 3;
} else if (minObjAlignmentInBytes == 16) {
  logMinObjAlignmentInBytes = 4;
} else {
  throw new RuntimeException("Object alignment " + minObjAlignmentInBytes + " not yet supported");
}

minObjAlignmentInBytes is the minimum object alignment (ObjectAlignmentInBytes defaults to 8 bytes; you can switch to 16-byte alignment with -XX:ObjectAlignmentInBytes=16). When minObjAlignmentInBytes is 8, the shift applied to an oop is 3 bits; when it is 16, the shift is 4 bits. In other words, with 16-byte alignment an oop can reach oop << 4, i.e. 2^36 = 64 GB of memory (but the savings from keeping compressed pointers in the heap are then cancelled out by the memory wasted padding objects to the larger alignment).
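
A quick check of that claim (my own snippet, not JVM code): the maximum heap a 32-bit narrow oop can cover scales with the object alignment, since the shift is log2 of the alignment.

public class AlignmentVsHeapMax {
    static long maxEncodableHeap(int objectAlignmentInBytes) {
        int shift = Integer.numberOfTrailingZeros(objectAlignmentInBytes); // 8 -> 3, 16 -> 4
        return (1L << 32) << shift;
    }

    public static void main(String[] args) {
        System.out.println((maxEncodableHeap(8)  >> 30) + " GB"); // 32 GB
        System.out.println((maxEncodableHeap(16) >> 30) + " GB"); // 64 GB
    }
}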

HotSpot currently uses only three compressed-oop modes:

  1. When the highest address of the virtual address range reserved for the GC heap lies below 4 GB, it uses 32-bit oops: base address 0 and shift 0;
  2. When the highest address of the GC heap goes beyond 4 GB but stays below 32 GB, it uses zero-based compressed oops: base address 0 and a shift of LogMinObjAlignmentInBytes (3 by default);
  3. When the highest address of the GC heap goes beyond 32 GB but the heap itself is still smaller than 32 GB, it uses compressed oops with base: a non-zero base address and a shift of LogMinObjAlignmentInBytes (3 by default).

If none of these three cases can be satisfied, compressed oops cannot be used at all; the sketch after the enum below restates this selection.

// For UseCompressedOops
  // Narrow Oop encoding mode:
  // 0 - Use 32-bits oops without encoding when
  //     NarrowOopHeapBaseMin + heap_size < 4Gb
  // 1 - Use zero based compressed oops with encoding when
  //     NarrowOopHeapBaseMin + heap_size < 32Gb
  // 2 - Use compressed oops with heap base + encoding.
  enum NARROW_OOP_MODE {
    UnscaledNarrowOop  = 0,
    ZeroBasedNarrowOop = 1,
    HeapBasedNarrowOop = 2
  };
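
To tie the list and the enum together, here is a toy restatement of the selection (my own sketch; HotSpot's real logic also has to account for the null page and for where the OS actually lets it reserve the heap):

public class NarrowOopModeChooser {
    static final long FOUR_GB       = 1L << 32;
    static final long THIRTY_TWO_GB = 1L << 35;

    static String modeFor(long heapBase, long heapSize) {
        long heapTop = heapBase + heapSize;
        if (heapSize > THIRTY_TWO_GB) return "compressed oops unusable";
        if (heapTop <= FOUR_GB)       return "UnscaledNarrowOop  (base = 0, shift = 0)";
        if (heapTop <= THIRTY_TWO_GB) return "ZeroBasedNarrowOop (base = 0, shift = 3)";
        return "HeapBasedNarrowOop (base = heap start, shift = 3)";
    }

    public static void main(String[] args) {
        System.out.println(modeFor(0L,        3L  << 30)); // small heap placed low in memory
        System.out.println(modeFor(0L,        20L << 30)); // larger heap, top still below 32 GB
        System.out.println(modeFor(40L << 30, 20L << 30)); // heap placed above 32 GB
    }
}

On a real JDK 8 VM, the diagnostic flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode should print the base and shift that were actually chosen (on JDK 9+ the equivalent is -Xlog:gc+heap+coops).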

Reference: 《深入解析Java虚拟机HotSpot》
