In this paper, we quantify this expectation by building and evaluating a baseline volatile-by-default (VBD) JVM for ARM called VBDA-HotSpot, using the same technique previously used for x86. Through this baseline we report, to the best of our knowledge, the first comprehensive study of the cost of providing language-level SC for a production compiler on ARM. VBDA-HotSpot indeed incurs a considerable performance penalty on ARM: average overheads on the DaCapo benchmarks of 57% and 73% on two ARM servers, respectively.
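To make the VBD semantics concrete, the sketch below shows the classic racy-publication idiom (a standard textbook example, not drawn from the paper). Under the ordinary Java Memory Model this program contains a data race; under a volatile-by-default JVM, every field access is compiled as if the field were declared `volatile`, so the program is sequentially consistent as written.

```java
// Classic racy publication. Under the standard Java Memory Model this is a
// data race: consume() may spin forever or observe a stale `data`. Under a
// volatile-by-default JVM, both fields are compiled with the same fences as
// if they were declared volatile, yielding sequentially consistent behavior.
class Publisher {
    int data;            // plain field, but fenced like a volatile under VBD
    boolean ready;       // likewise

    void publish() {     // run by thread 1
        data = 42;
        ready = true;    // under SC, any reader that observes ready == true
    }                    // must also observe data == 42

    int consume() {      // run by thread 2
        while (!ready) { /* spin */ }
        return data;     // 42 under VBD/SC; unspecified under the plain JMM
    }
}
```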
Motivated by these experimental results, we then present a novel speculative technique for optimizing language-level SC. While several prior works have shown how to optimize SC in the context of an offline, whole-program compiler, to our knowledge this is the first optimization approach that is compatible with modern implementation technology, including dynamic class loading and just-in-time (JIT) compilation. The basic idea is to have the JIT compiler initially treat each object as thread-local, so that accesses to its fields can be compiled without fences. If an object is ever accessed by a second thread, any speculatively compiled code for the object is removed, and future JIT-compiled code for the object includes the fences necessary to ensure SC. We demonstrate that this technique is effective: it reduces the overhead of enforcing VBD by one-third on average, and additional experiments validate the thread-locality hypothesis that underlies the approach.
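The following is a minimal, user-level sketch of the speculation's state machine, not the paper's implementation: the class `SpeculativeBox`, the `owner`/`shared` fields, and the per-access ownership check are hypothetical illustrations. In the real system the transition happens inside HotSpot, which deoptimizes the fence-free compiled code at a safepoint rather than testing a flag on every access.

```java
import java.lang.invoke.VarHandle;

// Hypothetical model of the speculation: accesses are fence-free while an
// object remains thread-local; once a second thread touches it, the object
// is marked shared and subsequent accesses emit volatile-style fences.
final class SpeculativeBox {
    private final Thread owner = Thread.currentThread(); // creating thread
    private volatile boolean shared = false;             // seen by a 2nd thread?
    private int value;                                   // the guarded field

    // In the JVM this transition would invalidate (deoptimize) the
    // fence-free compiled code for the object; here it just flips a flag.
    private void noteAccess() {
        if (!shared && Thread.currentThread() != owner) {
            shared = true;
        }
    }

    int read() {
        noteAccess();
        int v = value;                 // fence-free on the thread-local fast path
        if (shared) {
            VarHandle.acquireFence();  // volatile-load ordering once shared
        }
        return v;
    }

    void write(int v) {
        noteAccess();
        if (shared) {
            VarHandle.releaseFence();  // volatile-store ordering once shared
        }
        value = v;
        if (shared) {
            VarHandle.fullFence();     // StoreLoad barrier, as for a volatile store
        }
    }
}
```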