The termination phase of G1 is too time-consuming. - Codes Helper

ladies and gentlemen, I have encountered a problem when using G1. The termination phase of G1 takes more than 90% of the total young gc, which is generally about 100ms, which leads to the failure of the whole program.
GC log is as follows:

2018-05-16T00:00:49.665+0800: 10761.953: [GC pause (G1 Evacuation Pause) (young) 10761.953: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 7554, predicted base time: 4.22 ms, remaining time: 25.78 ms, target pause time: 30.00 ms]
 10761.953: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1680 regions, survivors: 40 regions, predicted young region time: 25.59 ms]
 10761.953: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1680 regions, survivors: 40 regions, old: 0 regions, predicted pause time: 29.81 ms, target pause time: 30.00 ms]
, 0.2301618 secs]
   [Parallel Time: 225.6 ms, GC Workers: 13]
      [GC Worker Start (ms): Min: 10761953.5, Avg: 10761953.6, Max: 10761953.7, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 0.3, Avg: 0.4, Max: 0.5, Diff: 0.2, Sum: 5.4]
      [Update RS (ms): Min: 0.9, Avg: 1.2, Max: 1.6, Diff: 0.7, Sum: 15.3]
         [Processed Buffers: Min: 2, Avg: 2.7, Max: 5, Diff: 3, Sum: 35]
      [Scan RS (ms): Min: 0.1, Avg: 0.4, Max: 0.6, Diff: 0.5, Sum: 4.9]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 15.5, Avg: 23.2, Max: 34.8, Diff: 19.3, Sum: 301.0]
      [Termination (ms): Min: 188.5, Avg: 200.2, Max: 207.8, Diff: 19.3, Sum: 2602.1]
         [Termination Attempts: Min: 3064, Avg: 3240.0, Max: 3397, Diff: 333, Sum: 42120]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.8]
      [GC Worker Total (ms): Min: 225.2, Avg: 225.4, Max: 225.5, Diff: 0.3, Sum: 2929.6]
      [GC Worker End (ms): Min: 10762178.9, Avg: 10762179.0, Max: 10762179.0, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.9 ms]
   [Other: 3.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.2 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.1 ms]
      [Humongous Register: 0.3 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 2.4 ms]
   [Eden: 1680.0M(1680.0M)->0.0B(1742.0M) Survivors: 40.0M->40.0M Heap: 2882.1M(8192.0M)->1206.7M(8192.0M)]
Heap after GC invocations=1702 (full 0):
 garbage-first heap   total 8388608K, used 1235671K [0x00000005fa400000, 0x00000005fa510000, 0x00000007fa400000)
  region size 1024K, 40 young (40960K), 40 survivors (40960K)
 Metaspace       used 20923K, capacity 21108K, committed 21376K, reserved 1069056K
  class space    used 2342K, capacity 2414K, committed 2432K, reserved 1048576K
}
 [Times: user=2.95 sys=0.00, real=0.23 secs]

the main time consuming is that the total GC of termination, actually takes 240ms, but the avg of termination phase is 210ms

[Termination (ms): Min: 201.7, Avg: 212.0, Max: 219.3, Diff: 17.6, Sum: 2755.5]
[Times: user=3.11 sys=0.00, real=0.24 secs]

my JVM parameters are as follows:

-XX:+AlwaysPreTouch -XX:CompressedClassSpaceSize=96468992 
-XX:ErrorFile=/dev/shm/hs_error%p.log 
-XX:G1HeapRegionSize=1048576 
-XX:G1ReservePercent=25 -XX:GCLogFileSize=31457280 
-XX:InitialHeapSize=8589934592 
-XX:InitiatingHeapOccupancyPercent=30 
-XX:MaxDirectMemorySize=34359738368 
-XX:MaxGCPauseMillis=30 
-XX:MaxHeapSize=8589934592 
-XX:MaxMetaspaceSize=104857600 
-XX:NumberOfGCLogFiles=5 
-XX:-OmitStackTraceInFastThrow 
-XX:ParallelGCThreads=13 
-XX:+PrintAdaptiveSizePolicy 
-XX:+PrintGC -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCDateStamps -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC 
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=8 
-XX:+UnlockExperimentalVMOptions -XX:-UseBiasedLocking 
-XX:+UseCompressedClassPointers -XX:+UseCompressedOops 
-XX:+UseG1GC -XX:+UseGCLogFileRotation

the JVM version used is jdk1.8.0_171 (it should have nothing to do with the JDK version, I tried the latest jdk10,termination phase is still the most time-consuming), and the machine hardware configuration is a physical machine with 32core 96GB memory.
about my program:
my program is currently a test demo, is very simple:
the whole program consists of three threads:

produceThread is responsible for creating 128byte fixed-size objects ProducerRequest, with a maximum of 256bytes (including some monitoring information, running time-consuming dotting), and then putting it into a blocking queue (LinkedBlockingQueue)
WriteThread, is responsible for taking the ProducerRequest out of the blocking queue and writing it to a directByteBuffer. If there is not enough space left in the directByteBuffer to store a ProducerRequest, then throw the directByteBuffer into another blocking queue dirtyQueue (LinkedBlockingQueue)
FlushThread, is responsible for taking the directByteBuffer, from the dirtyQueue and writing the data in the directByteBuffer to disk. Every time (randomAccessFile), writes a directByteBuffer, the directByteBuffer is put into another blocking queue (cleanQueue) for use by the WriteThread

private final int PAGE_SIZE = 1024 * 4 ;//4K 
private int pageNum = 1024 * 1024 * 4;    //4K * 1024 * 1024 * 4= 16G

private final LinkedBlockingQueue<ByteBuffer> cleanQueue = new LinkedBlockingQueue<>() ;
private final LinkedBlockingQueue<ByteBuffer> dirtyQueue = new LinkedBlockingQueue<>() ;

the points I have checked so far:

looks at a problem on Stack Overflow: https://stackoverflow.com/que.. The above reply is that there are too many GC worker threads, which leads to too much CPU competition. Then I set the XX:ParallelGCThreads to 13 (13 for online 16Core machines), but the actual effect is not obvious
then commented on the same question that it was caused by Object Copy being too slow. Although I disagreed with him (because my program is very simple, for the creation of giant objects does not exist, basically are small objects), I also checked the GC log and did not find the GC log caused by humongous allocations
looking at the previous log, it is found that the GC, caused by the small metadata area has been configured with a XX:MaxMetaspaceSize of 100m, and the latter GC has basically disappeared, but the GC triggered by termination still exists
by forcing the size of the younger generation to be 1G, the delay caused by termination can be effectively reduced. I set the younger generation to 5G, and then changed to 1G. The time spent on recording is reduced from 500ms to 100ms, and the corresponding overall delay is also proportionally reduced
modified the program once: did not use a cache pool of direct memory (still created a cache pool), merged WriteThread and FlushThread, and wrote ProducerRequest directly to disk through randomAccessFile every time ProducerRequest entered. In this way, there will not be a lot of time-consuming termination, the overall gc is below 10ms, and most of them are within 1ms. Through this test, I can basically be sure that it has something to do with the use of out-of-heap memory, but now I can"t think of a direct relationship to explain this guess

the complete log: of the relevant GC https://pan.baidu.com/s/16iVf.
also hopes that the relevant will give me a lot of advice. Thank you ~