29 May 2019

Tuning G1 GC for Cassandra

Tuning G1 GC for Cassandra is too complicated, but it can make a big difference in cluster health.

Symptoms:

  • High p99 read/write latencies (because of long GC pauses)
  • High CPU causing lower read throughput (because of low GC throughput)
  • Dropped mutations (because of full GC collections on write-heavy clusters)
Here are some options that made a difference for me:
  • JVM: options for getting GC details out for inspection
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -Xloggc:/var/log/cassandra/gc.log
  • JVM: options for having enough buffer for collections
    # Pre-allocate the full heap
    # Pre-size the new generation for high-throughput young collections
    -Xms24G
    -Xmx24G
    -Xmn8G
  • JVM: options for shortening pauses (move remembered set updates off the pause; process references in parallel)
    # Have the JVM do less remembered set work during STW, instead
    # preferring concurrent GC.
    -XX:G1RSetUpdatingPauseTimePercent=5
    # Process soft/weak/phantom references with multiple threads to
    # shorten the reference-processing phase of each pause
    -XX:+ParallelRefProcEnabled
  • JVM: options for better young collection throughput (avoid copying short-lived objects)
    # Save CPU time by avoiding copying objects repeatedly
    # Improve collection throughput by making heap regions larger
    -XX:MaxTenuringThreshold=1
    -XX:G1HeapRegionSize=32m
  • JVM: option cocktail to reduce risk of long mixed collections
    # Avoid to-space exhaustion by starting marking sooner, capping
    # new size, and collecting old regions more aggressively during
    # mixed collections
    -XX:InitiatingHeapOccupancyPercent=40
    -XX:+UnlockExperimentalVMOptions
    -XX:G1MaxNewSizePercent=50
    -XX:G1MixedGCLiveThresholdPercent=50
    -XX:G1MixedGCCountTarget=32
    -XX:G1OldCSetRegionThresholdPercent=5
    # Reduce pause time target to make mixed collections shorter
    -XX:MaxGCPauseMillis=300
  • JVM: option to keep an extra buffer for allocation emergencies
    # Reserve extra heap space to reduce risk of to-space overflows
    -XX:G1ReservePercent=20
  • JVM: options for top collection throughput
    # Max out parallel effort during pauses (ParallelGCThreads) and
    # during concurrent marking (ConcGCThreads)
    # Set to number of cores
    -XX:ParallelGCThreads=16
    -XX:ConcGCThreads=16
  • Cassandra: option to avoid excess spikes of garbage from compaction
    # Reduce the rate of garbage generation and CPU used by compaction
    compaction_throughput_mb_per_sec: 2
  • Cassandra: option to aggressively flush to disk on write-heavy clusters
    # Reduce amount of memtable heap load to reduce object copying
    memtable_heap_space_in_mb: 1024  # instead of default 1/4 of heap
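Before rolling any of this out, it can help to stage the JVM flags as a single jvm-options-style fragment and sanity-check it for conflicting duplicates (a common hazard when merging with a node's existing settings). A minimal sketch — the file name is illustrative, and only a subset of the flags is shown; on a real node the content belongs in Cassandra's jvm.options, whose exact path depends on your version and packaging:

```shell
# Stage the flags as a fragment; merge into the node's real jvm.options
cat > gc-tuning.options <<'EOF'
-Xms24G
-Xmx24G
-Xmn8G
-XX:+ParallelRefProcEnabled
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxTenuringThreshold=1
-XX:G1HeapRegionSize=32m
-XX:MaxGCPauseMillis=300
-XX:G1ReservePercent=20
EOF

# Strip values so "-XX:Foo=1" and "-XX:Foo=2" compare equal, then list
# any flag that appears more than once (output should be empty)
sort gc-tuning.options | sed 's/=.*//' | uniq -d
```

Running the same check against the merged file catches the case where a tuned flag silently loses to a stale copy of itself elsewhere in jvm.options.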
The net effect of the above combined settings is as follows:
  • for a read-heavy cluster on i3.4xlarge:
    • young collection p90 pause times around 50ms
    • mixed collection p90 pause times around 90ms
    • no Full GCs, no dropped mutations
  • for write-heavy clusters on r5.2xlarge:
    • young collection p90 pause times around 175ms
    • mixed collection p90 pause times around 175ms
    • no Full GCs, no dropped mutations
Tuning process:
  1. Turn on GC logging
  2. Gather pause times for young collections, mixed collections, and any full collections
    • get logs for at least 2-3 cycles of young => mixed/full transitions
  3. Decide which of the above you want to optimize for, pick a single set of settings
    • Apply the settings to one node on one rack
    • Decide whether it had the desired effect
    • Tweak and repeat on single node until you get to a stable point
  4. Apply settings to all nodes on one rack
    • Wait for a peak traffic period or apply stress
    • Compare results from non-tuned racks with the tuned rack
    • Tweak and repeat on single rack until settings are rock solid
  5. Apply settings to full cluster
    • Wait for a peak traffic period or apply stress
    • Make sure settings are rock solid for full cluster
  6. Go back to step 2 until you have nothing left to tune
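Step 2 above can be sketched as a small pipeline over the gc.log written by the logging flags. This assumes the JDK 8 PrintGCDetails format, where each pause is summarized on a line ending in e.g. "(young), 0.0423111 secs]"; a fabricated sample log stands in for a real one here:

```shell
# Sketch: extract young/mixed pause times from a JDK 8 G1 gc.log and
# report a rough p90 per collection type.
# The log below is fabricated sample data; point LOG at your real gc.log.
LOG=gc.sample.log
cat > "$LOG" <<'EOF'
2019-05-29T10:00:01.000+0000: 12.345: [GC pause (G1 Evacuation Pause) (young), 0.0423111 secs]
2019-05-29T10:00:05.000+0000: 16.345: [GC pause (G1 Evacuation Pause) (young), 0.0511224 secs]
2019-05-29T10:00:09.000+0000: 20.345: [GC pause (G1 Evacuation Pause) (mixed), 0.0912345 secs]
2019-05-29T10:00:13.000+0000: 24.345: [GC pause (G1 Evacuation Pause) (young), 0.0488990 secs]
EOF

for kind in young mixed; do
  grep "($kind)" "$LOG" \
    | grep -o '[0-9.]* secs' \
    | awk '{print $1}' \
    | sort -n \
    | awk -v kind="$kind" '{ t[NR] = $1 }
        END {
          if (NR > 0) {
            i = int(NR * 0.9); if (i < 1) i = 1
            printf "%s p90: %.0f ms\n", kind, t[i] * 1000
          }
        }'
done
```

With only a handful of samples the index math is crude, but over the 2-3 full young => mixed cycles the process calls for, it gives a stable enough number to compare a tuned node against the rest of the rack.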
