String deduplication is great, but not a silver bullet

Lots of identical Strings are a notorious cause of high memory usage.
Java can deduplicate strings – that sounds awesome, and is awesome, but comes with limitations.

Ok, I want it now, what do I do?
Just add:

-XX:+UseStringDeduplication
to your command line, and magic should happen.

What kind of magic? The GC takes on an additional duty: scanning Strings, identifying duplicates, and deduplicating the underlying character array (a char[] up to Java 8, a byte[] since Java 9).
So, caveat number 1: the String objects themselves are not deduplicated, only their backing arrays end up shared.
That task also means additional work, which takes time and CPU cycles.
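
To make caveat number 1 concrete, here is a minimal sketch (the class and method names are made up for illustration). It builds two Strings with equal contents but separate backing arrays. Deduplication may later point both at one array, but the two String objects stay distinct, so an identity comparison still fails.

```java
// Sketch: deduplication can share the backing array, never the String objects.
class DedupIdentityDemo {

    // Builds two distinct String objects, each with its own backing array.
    // new String(char[]) copies the array, so nothing is shared up front.
    static String[] twoCopies(String text) {
        return new String[] {
            new String(text.toCharArray()),
            new String(text.toCharArray())
        };
    }

    public static void main(String[] args) {
        String[] copies = twoCopies("duplicate");
        System.out.println(copies[0].equals(copies[1])); // true: same contents
        System.out.println(copies[0] == copies[1]);      // false: two objects, with or without deduplication
    }
}
```

Each of those String objects still costs its own object header and fields; only the array payload can be saved.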

Let’s look at the flags


❯ java -XX:+PrintFlagsFinal | grep -i dedup
     uint StringDeduplicationAgeThreshold          = 3                                         {product} {default}
   size_t StringDeduplicationCleanupDeadMinimum    = 100                                  {experimental} {default}
      int StringDeduplicationCleanupDeadPercent    = 5                                    {experimental} {default}
   double StringDeduplicationGrowTableLoad         = 14.000000                            {experimental} {default}
 uint64_t StringDeduplicationHashSeed              = 0                                      {diagnostic} {default}
   size_t StringDeduplicationInitialTableSize      = 500                                  {experimental} {default}
     bool StringDeduplicationResizeALot            = false                                  {diagnostic} {default}
   double StringDeduplicationShrinkTableLoad       = 1.000000                             {experimental} {default}
   double StringDeduplicationTargetTableLoad       = 7.000000                             {experimental} {default}
     bool UseStringDeduplication                   = false                                     {product} {default}
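
If you want to confirm from inside a running application that the flag actually took effect, one option is to ask the JVM itself. This is a sketch that assumes a HotSpot-based JVM with the jdk.management module available; the class name DedupFlagCheck is made up for illustration, and only the two product flags from the listing above are queried.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: read String deduplication flags at runtime (HotSpot-specific API).
class DedupFlagCheck {

    // Returns the current value of a VM option as a String.
    // Throws IllegalArgumentException if the JVM does not know the option.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("UseStringDeduplication = "
                + flagValue("UseStringDeduplication"));
        System.out.println("StringDeduplicationAgeThreshold = "
                + flagValue("StringDeduplicationAgeThreshold"));
    }
}
```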

Which Strings?
They need to be “old enough”. The flag that controls this is

-XX:StringDeduplicationAgeThreshold
and defaults to 3.
So a String is considered for deduplication only after it has survived 3 GC cycles. You could change that, but checking every very young String is at odds with the generational hypothesis: most of them die young, so why waste deduplication effort on them?
(For the same reason, this would probably not help when receiving a large JSON document with lots of duplicate Strings.)

Ok, so… not a silver bullet.
What if there are many incoming Strings, many of them known to be duplicates, but this feature is not the right fit?

Deduplicate by pooling. There is a catch here: that pool could be a source of a leak itself.
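
A minimal sketch of such a pool could look like the following. The StringPool class, its dedup method, and the maxEntries cap are all hypothetical names chosen for this example, not an existing API. The cap is one simple way to keep the pool from becoming the leak: once it is full, new Strings are passed through unpooled.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: a bounded pool that returns one canonical instance per distinct String.
class StringPool {
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();
    private final int maxEntries;

    StringPool(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Returns the pooled instance equal to s, adding s if there is still room.
    String dedup(String s) {
        if (s == null) {
            return null;
        }
        String existing = pool.get(s);
        if (existing != null) {
            return existing; // duplicate: caller can drop its own copy
        }
        if (pool.size() >= maxEntries) {
            return s; // pool is full: do not grow, hand back the input
        }
        // Another thread may have added an equal String in the meantime.
        existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }
}
```

Usage would be along the lines of `name = pool.dedup(name)` at the point where the incoming data is parsed. Note that the cap is only approximate under concurrent use, and that entries are never evicted, so the pool only makes sense when the set of distinct values is small and stable.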

Should you add this to your application?
The answer, as always, is “it depends”. Try it and measure.