Large numbers of identical Strings are a notorious source of memory consumption.
Java can deduplicate strings – that sounds awesome, and is awesome, but comes with limitations.
OK, I want it now, what do I do?
Just add:
-XX:+UseStringDeduplication
What kind of magic is this? The GC takes on an additional duty: scanning Strings, identifying duplicates, and making them share a single underlying character array (a char[] up to Java 8, a byte[] since Java 9).
So, caveat number 1: the String objects themselves will not be deduplicated, only their backing arrays.
That task means additional work, which takes time and CPU cycles.
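To make the first caveat concrete, here is a minimal sketch (the class and method names are made up for illustration). It builds two equal Strings at runtime and shows that they are still two separate objects. Enabling deduplication would not change what this prints, because the feature only affects the backing arrays, which plain Java code cannot observe.

```java
class DedupIdentityDemo {
    // Builds a String at runtime so it is a fresh heap object, not an interned literal.
    static String fresh(String prefix, int n) {
        return new StringBuilder(prefix).append(n).toString();
    }

    // True only when both references point to the very same String object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String a = fresh("order-", 42);
        String b = fresh("order-", 42);
        System.out.println(a.equals(b));      // true: same contents
        System.out.println(sameObject(a, b)); // false: two distinct String objects
    }
}
```

So the saving per duplicate is the array, while every String still pays for its own object header and fields.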
Let’s look at the flags
❯ java -XX:+PrintFlagsFinal | grep -i dedup
uint StringDeduplicationAgeThreshold = 3 {product} {default}
size_t StringDeduplicationCleanupDeadMinimum = 100 {experimental} {default}
int StringDeduplicationCleanupDeadPercent = 5 {experimental} {default}
double StringDeduplicationGrowTableLoad = 14.000000 {experimental} {default}
uint64_t StringDeduplicationHashSeed = 0 {diagnostic} {default}
size_t StringDeduplicationInitialTableSize = 500 {experimental} {default}
bool StringDeduplicationResizeALot = false {diagnostic} {default}
double StringDeduplicationShrinkTableLoad = 1.000000 {experimental} {default}
double StringDeduplicationTargetTableLoad = 7.000000 {experimental} {default}
bool UseStringDeduplication = false {product} {default}
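The same values can also be read from inside a running application, which is handy for checking what a deployed service actually uses. The sketch below assumes a HotSpot-based JVM, since it relies on the JDK-specific com.sun.management.HotSpotDiagnosticMXBean; the class name DedupFlagCheck is just for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

class DedupFlagCheck {
    // Returns the current value of a HotSpot flag as text, e.g. "false" or "3".
    // Assumption: the JVM is HotSpot-based and knows the given flag name.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("UseStringDeduplication = "
                + flagValue("UseStringDeduplication"));
        System.out.println("StringDeduplicationAgeThreshold = "
                + flagValue("StringDeduplicationAgeThreshold"));
    }
}
```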
Which Strings?
They need to be “old enough”. The flag that controls this is:
-XX:StringDeduplicationAgeThreshold
With the default of 3, a String is considered for deduplication only after it has survived 3 GC cycles. You could lower that, but checking every very young String is at odds with the generational hypothesis: most of them will die soon anyway, so why spend deduplication effort on them?
(For the same reason, this feature would probably not help when receiving a large JSON with lots of duplicate Strings, if those Strings are gone before they are old enough.)
Ok, so… not a silver bullet.
What if there are many incoming Strings, many of them known duplicates, and this feature is not the right fit?
Deduplicate by pooling: keep one canonical instance per distinct value and reuse it. There is a catch here: the pool itself could become a source of a memory leak.
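A pool could look roughly like the sketch below. It is one possible shape, not a recommendation: BoundedStringPool and its size cap are assumptions made for this example, and the cap is there precisely to keep the pool from turning into the leak mentioned above.

```java
import java.util.concurrent.ConcurrentHashMap;

class BoundedStringPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();
    private final int maxEntries;

    BoundedStringPool(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Returns the pooled instance for an equal String if there is one.
    // Once the pool is full, unknown values are handed back as-is and not stored,
    // so the pool stops growing. Under concurrency the cap is approximate:
    // a few threads may pass the size check at the same time.
    String canonical(String s) {
        String existing = pool.get(s);
        if (existing != null) {
            return existing;
        }
        if (pool.size() >= maxEntries) {
            return s;
        }
        String raced = pool.putIfAbsent(s, s);
        return raced != null ? raced : s;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        BoundedStringPool pool = new BoundedStringPool(2);
        String a = pool.canonical(new String("EUR"));
        String b = pool.canonical(new String("EUR"));
        System.out.println(a == b); // true: the second copy can be garbage collected
    }
}
```

Unlike the GC feature, this shares the whole String object and works immediately, but the entries stay alive for as long as the pool does, even when nothing else uses them.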
Should you add this to your application?
The answer, as always, is “it depends”. Try and test.