Symptoms:
- High p99 read/write latencies (because of long GC pauses)
- High CPU causing lower read throughput (because of low GC throughput)
- Dropped mutations (because of full collections on write-heavy clusters)
Here are some options that made a difference for me:
- JVM: options for getting GC details out for inspection
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/var/log/cassandra/gc.log
- JVM: options for having enough buffer for collections
# Pre-allocate full heap
# Pre-size new size for high-throughput young collections
-Xms24G
-Xmx24G
-Xmn8G
- JVM: options for avoiding longer pauses (do reference scanning concurrently with app)
# Have the JVM do less remembered set work during STW, instead
# preferring concurrent GC.
-XX:G1RSetUpdatingPauseTimePercent=5
# Scan references in parallel to avoid long RSet scan times
-XX:+ParallelRefProcEnabled
- JVM: options for better young collection throughput (avoid copying short-lived objects)
# Save CPU time by avoiding copying objects repeatedly
-XX:MaxTenuringThreshold=1
# Improve collection throughput by making heap regions larger
-XX:G1HeapRegionSize=32m
- JVM: option cocktail to reduce risk of long mixed collections
# Avoid to-space exhaustion by starting sooner, capping new size, and being more aggressive during mixed collections
-XX:InitiatingHeapOccupancyPercent=40
-XX:+UnlockExperimentalVMOptions
-XX:G1MaxNewSizePercent=50
-XX:G1MixedGCLiveThresholdPercent=50
-XX:G1MixedGCCountTarget=32
-XX:G1OldCSetRegionThresholdPercent=5
# Reduce pause time target to make mixed collections shorter
-XX:MaxGCPauseMillis=300
- JVM: option to get extra buffer for use in allocation emergency
# Reserve extra heap space to reduce risk of to-space overflows
-XX:G1ReservePercent=20
- JVM: options for top collection throughput during pauses
# Max out the parallel effort during pause
# Set to number of cores
-XX:ParallelGCThreads=16
-XX:ConcGCThreads=16
- Cassandra: option to avoid excess spikes of garbage from compaction
# Reduce load of garbage generation & CPU used for compaction
compaction_throughput_mb_per_sec: 2
- Cassandra: option to aggressively flush to disk on write-heavy clusters
# Reduce amount of memtable heap load to reduce object copying
memtable_heap_space_in_mb: 1024 # instead of default 1/4 heap
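To get a feel for what the percentages above mean on a 24G heap, here is a rough back-of-the-envelope sketch in Python. The function name `g1_budget` and its field names are mine, purely for illustration. It also assumes the percentage flags are taken relative to the whole heap (or to the total region count, for `G1OldCSetRegionThresholdPercent`), which is my reading of the G1 documentation, so check it against your JVM version.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def g1_budget(heap_bytes, region_bytes, reserve_pct, ihop_pct,
              max_new_pct, old_cset_pct):
    """Translate G1 percentage flags into absolute sizes (hypothetical helper).

    Assumes every percentage is relative to the total heap (or to the total
    number of regions for old_cset_pct).
    """
    regions = heap_bytes // region_bytes
    return {
        # -Xmx / -XX:G1HeapRegionSize
        "regions": regions,
        # -XX:G1ReservePercent: heap kept free as an evacuation buffer
        "reserve_gib": heap_bytes * reserve_pct / 100 / GIB,
        # -XX:InitiatingHeapOccupancyPercent: occupancy that starts marking
        "ihop_gib": heap_bytes * ihop_pct / 100 / GIB,
        # -XX:G1MaxNewSizePercent: upper bound on the young generation
        "max_new_gib": heap_bytes * max_new_pct / 100 / GIB,
        # -XX:G1OldCSetRegionThresholdPercent: old regions per mixed pause
        "old_regions_per_mixed_gc": regions * old_cset_pct // 100,
    }


if __name__ == "__main__":
    # Values taken from the settings listed above.
    budget = g1_budget(24 * GIB, 32 * MIB, reserve_pct=20, ihop_pct=40,
                       max_new_pct=50, old_cset_pct=5)
    for name, value in budget.items():
        print(f"{name}: {value}")
```

Under those assumptions the settings work out to 768 regions, a 4.8 GiB reserve, marking kicking in at roughly 9.6 GiB of occupancy, a 12 GiB young cap, and at most 38 old regions per mixed collection.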
The net effect of the above combined settings is as follows:
- for a read-heavy cluster on i3.4xlarge:
  - young collection p90 pause times around 50ms
  - mixed collection p90 pause times around 90ms
  - no Full GCs, no dropped mutations
- for write-heavy clusters on r5.2xlarge:
  - young collection p90 pause times around 175ms
  - mixed collection p90 pause times around 175ms
  - no Full GCs, no dropped mutations
Tuning process:
1. Turn on GC logging
2. Gather pause times for young collections, mixed collections, and any full collections
   - get logs for at least 2-3 cycles of young => mixed/full transitions
3. Decide which of the above you want to optimize for, pick a single set of settings
4. Apply the settings to one node on one rack
5. Decide whether it had the desired effect
6. Tweak and repeat on single node until you get to a stable point
7. Apply settings to all nodes on one rack
8. Wait for a peak traffic period or apply stress
9. Compare results from non-tuned racks with the tuned rack
10. Tweak and repeat on single rack until settings are rock solid
11. Apply settings to full cluster
12. Wait for a peak traffic period or apply stress
13. Make sure settings are rock solid for full cluster
14. Start again on step 2 until you have nothing left to tune
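For the "gather pause times" step, a small script can pull p90 pauses per collection type out of gc.log. The sketch below is one way to do it, not the tooling behind the numbers above. It assumes JDK 8 style G1 lines such as `[GC pause (G1 Evacuation Pause) (young), 0.0548920 secs]` and `[Full GC (Allocation Failure) ..., 1.2345 secs]`, so the regexes will need adjusting for other JVM versions or logging flags. The function names and the sample lines are made up for illustration.

```python
import math
import re
from collections import defaultdict

# Assumed JDK 8 G1 log shapes (with -XX:+PrintGCDetails); adjust as needed.
PAUSE_RE = re.compile(r"\[GC pause .*?\((young|mixed)\).*?, (\d+\.\d+) secs\]")
FULL_RE = re.compile(r"\[Full GC .*?, (\d+\.\d+) secs\]")


def collect_pauses(lines):
    """Group pause durations in milliseconds by kind: young, mixed, full."""
    pauses = defaultdict(list)
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pauses[match.group(1)].append(float(match.group(2)) * 1000.0)
            continue
        match = FULL_RE.search(line)
        if match:
            pauses["full"].append(float(match.group(1)) * 1000.0)
    return dict(pauses)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(lines, pct=90):
    """Return {kind: pct-th percentile pause in ms} for each kind seen."""
    return {kind: percentile(vals, pct)
            for kind, vals in collect_pauses(lines).items()}


# Made-up sample lines, for illustration only.
SAMPLE = [
    "2019-03-05T15:34:26.945+0000: 20.397: [GC pause (G1 Evacuation Pause) (young), 0.0500000 secs]",
    "2019-03-05T15:34:29.102+0000: 22.554: [GC pause (G1 Evacuation Pause) (young) (initial-mark), 0.0400000 secs]",
    "2019-03-05T15:34:33.871+0000: 27.323: [GC pause (G1 Evacuation Pause) (mixed), 0.0900000 secs]",
]

if __name__ == "__main__":
    # To use a real log: summarize(open("/var/log/cassandra/gc.log"))
    for kind, value in sorted(summarize(SAMPLE).items()):
        print(f"{kind}: p90 = {value:.1f} ms")
```

Running the same script against logs from a tuned and a non-tuned rack gives a like-for-like comparison, and any `full` entry in the output is a sign the settings are not stable yet.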