
29 May 2019

Tuning G1 GC for Cassandra

Tuning G1 GC for Cassandra is too complicated, but it can make a big difference in cluster health.

Symptoms:

  • High p99 read/write latencies (because of long GC pauses)
  • High CPU causing lower read throughput (because of low GC throughput)
  • Dropped mutations (because of full GC collections on write-heavy clusters)
Here are some options that made a difference for me:
  • JVM: options for getting GC details out for inspection
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -Xloggc:/var/log/cassandra/gc.log
  • JVM: options for having enough buffer for collections
    # Pre-allocate full heap
    # Pre-size new size for high-throughput young collections
    -Xms24G
    -Xmx24G
    -Xmn8G
  • JVM: options for avoiding longer pauses (push work off the pause where possible)
    # Have the JVM do less remembered set work during STW, instead
    # preferring concurrent refinement
    -XX:G1RSetUpdatingPauseTimePercent=5
    # Process soft/weak/phantom references with multiple threads to
    # shorten the reference-processing phase of the pause
    -XX:+ParallelRefProcEnabled
  • JVM: options for better young collection throughput (avoid copying short-lived objects)
    # Save CPU time by avoiding copying objects repeatedly
    # Improve collection throughput by making heap regions larger
    -XX:MaxTenuringThreshold=1
    -XX:G1HeapRegionSize=32m
  • JVM: option cocktail to reduce risk of long mixed collections
    # Avoid to-space exhaustion by starting sooner, capping new size, and being more aggressive during mixed collections
    -XX:InitiatingHeapOccupancyPercent=40
    -XX:+UnlockExperimentalVMOptions
    -XX:G1MaxNewSizePercent=50
    -XX:G1MixedGCLiveThresholdPercent=50
    -XX:G1MixedGCCountTarget=32
    -XX:G1OldCSetRegionThresholdPercent=5
    # Reduce pause time target to make mixed collections shorter
    -XX:MaxGCPauseMillis=300
  • JVM: option to reserve extra buffer for allocation emergencies
    # Reserve extra heap space to reduce risk of to-space overflows
    -XX:G1ReservePercent=20
  • JVM: options for top collection throughput
    # Max out the parallel effort during STW pauses and concurrent marking
    # Set both to the number of cores
    -XX:ParallelGCThreads=16
    -XX:ConcGCThreads=16
  • Cassandra: option to avoid excess spikes of garbage from compaction
    # Reduce the rate of garbage generation and the CPU used for compaction
    compaction_throughput_mb_per_sec: 2
  • Cassandra: option to aggressively flush to disk on write-heavy clusters
    # Reduce amount of memtable heap load to reduce object copying
    memtable_heap_space_in_mb: 1024  # instead of default 1/4 of heap
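With GC logging enabled as above, the pause durations needed for the tuning process below can be pulled straight out of gc.log. A minimal sketch, assuming the JDK 8 single-line G1 summary format produced by -XX:+PrintGCDetails with date stamps (the exact format varies by JVM version, so adjust the pattern for yours):

```python
import re

# Assumption: JDK 8 G1 summary lines, e.g.
#   2019-05-29T10:00:00.123+0000: 12.345: [GC pause (G1 Evacuation Pause) (young), 0.0521301 secs]
PAUSE_RE = re.compile(
    r"\[GC pause \([^)]*\) \((young|mixed)\)"
    r"(?: \(to-space exhausted\))?, ([0-9.]+) secs\]"
)

def pause_times(lines):
    """Yield (kind, pause_ms) for each young/mixed collection summary line."""
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            yield m.group(1), float(m.group(2)) * 1000.0

# Usage: feed it the whole log file, e.g.
#   with open("/var/log/cassandra/gc.log") as f:
#       for kind, ms in pause_times(f):
#           print(kind, ms)
```

The `(to-space exhausted)` alternative in the pattern matters: those lines flag exactly the evacuation failures the settings above are trying to prevent, so you want them counted, not silently skipped.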
The net effect of the above combined settings is as follows:
  • for a read-heavy cluster on i3.4xlarge:
    • young collection p90 pause times around 50ms
    • mixed collection p90 pause times around 90ms
    • no Full GCs, no dropped mutations
  • for write-heavy clusters on r5.2xlarge:
    • young collection p90 pause times around 175ms
    • mixed collection p90 pause times around 175ms
    • no Full GCs, no dropped mutations
Tuning process:
  1. Turn on GC logging
  2. Gather pause times for young collections, mixed collections, and any full collections
    • get logs for at least 2-3 cycles of young => mixed/full transitions
  3. Decide which of the above you want to optimize for, pick a single set of settings
    • Apply the settings to one node on one rack
    • Decide whether it had the desired effect
    • Tweak and repeat on single node until you get to a stable point
  4. Apply settings to all nodes on one rack
    • Wait for a peak traffic period or apply stress
    • Compare results from non-tuned racks with the tuned rack
    • Tweak and repeat on single rack until settings are rock solid
  5. Apply settings to full cluster
    • Wait for a peak traffic period or apply stress
    • Make sure settings are rock solid for full cluster
  6. Return to step 2 and repeat until you have nothing left to tune
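Step 2's "gather pause times" reduces to computing a percentile over the young and mixed samples separately. A minimal sketch using nearest-rank p90, with hypothetical sample data in milliseconds:

```python
def p90(pauses_ms):
    """Nearest-rank-style 90th percentile of a list of pause times (ms)."""
    if not pauses_ms:
        raise ValueError("no pauses recorded")
    ordered = sorted(pauses_ms)
    # index of the sample at or below which ~90% of the values fall
    return ordered[int(0.9 * (len(ordered) - 1))]

# Hypothetical young-collection pauses gathered from gc.log
young = [41, 45, 47, 48, 50, 52, 53, 55, 60, 90]
print(p90(young))  # -> 60
```

Comparing this number per node before and after a settings change (step 3) is what tells you whether a tweak actually moved the needle, rather than eyeballing raw log lines.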