When HashMap starts killing production: the engineering story of ConcurrentHashMap

Imagine a typical production service.

  • 32 CPUs
  • hundreds of threads
  • configuration / session / rate limits cache
  • tens of thousands of operations per second

And somewhere inside — a regular Map.

At first, everything looks harmless.

Map<String, Session> sessions = new HashMap<>();

Then a couple of threads appear.

Then dozens.

Then production load begins.

And suddenly:

  • latency spikes
  • CPU goes to 100%
  • strange freezes appear
  • sometimes — infinite loops in HashMap

This is not a hypothesis. These are real production incidents that occurred in large Java systems before the advent of proper concurrent structures.

It is from such stories that ConcurrentHashMap was born.

But to understand why it is designed this way, one must go through the entire evolution:

HashMap
   ↓
synchronized Map
   ↓
Segmented ConcurrentHashMap (JDK7)
   ↓
CAS + bin locking (JDK8)
   ↓
tree bins + lock striping + resize migration

And this is one of the most beautiful engineering evolutions in the JVM.

Tree bins are a mechanism in HashMap and ConcurrentHashMap (since Java 8) whereby a bucket with many collisions is automatically converted from a linked list into a balanced red-black tree. This prevents performance degradation when many elements share the same hash: instead of linear O(n) complexity for search, insertion, and deletion, those operations run in O(log n). The conversion happens when the number of elements in a bucket exceeds a threshold (TREEIFY_THRESHOLD, 8) and the table holds at least 64 buckets; otherwise the table is resized instead. This serves as protection against both a bad hashCode and deliberate hash-collision attacks.
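This protection is easy to reproduce. A minimal sketch (the BadKey and TreeBinDemo names are illustrative, not from the JDK): a key whose hashCode always returns the same value forces every entry into one bucket, yet lookups stay fast because the bin becomes a tree.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type whose hashCode always collides,
// so every entry lands in the same bucket.
final class BadKey implements Comparable<BadKey> {
    final int id;
    BadKey(int id) { this.id = id; }
    @Override public int hashCode() { return 42; }   // worst-case hash
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
    // Comparable lets the tree bin order keys instead of
    // falling back to identity-hash tie-breaking
    @Override public int compareTo(BadKey o) { return Integer.compare(id, o.id); }
}

public class TreeBinDemo {
    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) map.put(new BadKey(i), i);
        // Without tree bins every get() would scan a 10_000-node list;
        // with them, the lookup is O(log n)
        System.out.println(map.get(new BadKey(9_999)));  // prints 9999
    }
}
```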

First naive idea: just synchronized

The most obvious protection for HashMap:

Map<K, V> map = Collections.synchronizedMap(new HashMap<>());

Or:

public synchronized V get(K key) { ... }
public synchronized V put(K key, V value) { ... }

This works.

But only until real concurrency appears.

What happens under load

32 threads
200k ops/sec

Each call:

lock(map)
   → read/write
unlock(map)

We get a global lock.

Diagram

Thread1 ─┐
Thread2 ─┤
Thread3 ─┤
Thread4 ─┤ → LOCK(map) → operation → unlock
Thread5 ─┤
Thread6 ─┘

All threads line up in a queue.

parameter     effect
throughput    falls
latency       increases
CPU           idles
contention    huge

Even get() blocks all other get().

This is a fundamental problem.
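This single mutex is easy to observe. A small illustrative sketch: the wrapper returned by Collections.synchronizedMap locks on the map object itself, so holding that monitor blocks even a pure reader.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class GlobalLockDemo {
    public static void main(String[] args) throws Exception {
        Map<String, String> map = Collections.synchronizedMap(new HashMap<>());
        map.put("k", "v");

        // A reader thread that only calls get()
        Thread reader = new Thread(() -> System.out.println(map.get("k")));

        // The wrapper synchronizes every call on the returned map object,
        // so holding that monitor stalls the reader too
        synchronized (map) {
            reader.start();
            Thread.sleep(200);                     // give the reader time to hit the lock
            System.out.println(reader.getState()); // prints BLOCKED
        }
        reader.join();                             // now it completes and prints v
    }
}
```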


Second attempt: lock striping

The Java engineers decided:

"What if we divide the map into several independent parts?"

Thus, ConcurrentHashMap JDK7 was born.

Segments

ConcurrentHashMap
 ├─ Segment 0
 ├─ Segment 1
 ├─ Segment 2
 └─ Segment 3

Each segment is a separate mini HashMap with its own lock.

segment = hash(key) % segments

Now:

Thread1 → Segment1 → lock
Thread2 → Segment2 → lock
Thread3 → Segment3 → lock

And they do not interfere with each other.

Operation code

Segment<K,V> s = segments[(hash >>> segmentShift) & segmentMask];
s.lock();

try {
    s.put(key, value);
} finally {
    s.unlock();
}

Performance improvement

threads    synchronizedMap    segmented CHM
1          1x                 1x
8          0.7x               5x
32         0.3x               18x

But new problems arose.


Problem #1 — Fixed Parallelism

The number of segments is fixed.

16 segments

But the server:

64 threads

As a result again:

Thread1 → Segment3
Thread2 → Segment3
Thread3 → Segment3

Contention returns.


Problem #2 — resize

Resize — the most difficult part of any hash map.

capacity = capacity * 2

In the segmented model:

resize → lock entire segment

This is expensive.


Problem #3 — memory overhead

Each Segment contains:

ReentrantLock
table
metadata

On large maps, this turns into memory waste.

The JVM engineers decided:

We need a structure without segments.

Thus, ConcurrentHashMap JDK8 was born.


Architecture of ConcurrentHashMap (JDK8)

Node<K,V>[] table

Each bucket is managed separately.

table
 ├─ bin0
 ├─ bin1
 ├─ bin2
 └─ bin3

Main idea

operation       mechanism
empty bucket    CAS
collision       synchronized
resize          cooperative

End-to-End put operation

map.put("user42", session);

Step 1 — hash

static final int spread(int h) {
    return (h ^ (h >>> 16)) & HASH_BITS;
}

This protects against bad hashCode.

Step 2 — index

index = (n - 1) & hash

Bit mask is faster than modulo.
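Both steps can be sketched in plain Java. The spread method below mirrors the JDK source; the rest is illustrative.

```java
public class HashDemo {
    static final int HASH_BITS = 0x7fffffff;

    // Same mixing as ConcurrentHashMap.spread(): fold the high bits
    // into the low bits and drop the sign bit
    static int spread(int h) {
        return (h ^ (h >>> 16)) & HASH_BITS;
    }

    public static void main(String[] args) {
        int n = 16;  // table length is always a power of two
        int h = spread("user42".hashCode());

        // For power-of-two n, (n - 1) & h equals h % n,
        // but without a division instruction
        System.out.println(((n - 1) & h) == (h % n));  // prints true
        System.out.println((n - 1) & h);               // bucket index in [0, 15]
    }
}
```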

Step 3 — CAS insertion

if (casTabAt(tab, i, null, new Node<>(hash, key, value)))
    break;

CAS = Compare And Swap.

if (table[i] == null)
    table[i] = node

But atomically at CPU level.

CAS is performed through Unsafe.compareAndSwapObject (a VarHandle in newer JDKs) and maps to a native compare-and-exchange instruction, such as cmpxchg on x86.

Why this is important

CAS avoids lock.

Thread1 → CAS success
Thread2 → CAS fail → retry

No kernel switch.
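The same success/fail/retry pattern can be sketched with AtomicReferenceArray, a public stand-in for the Unsafe/VarHandle access ConcurrentHashMap uses internally:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

public class CasDemo {
    public static void main(String[] args) {
        // Stand-in for the bucket table (real CHM works on a Node[] directly)
        AtomicReferenceArray<String> table = new AtomicReferenceArray<>(16);

        // Slot 3 is null, so the first swap succeeds: no lock taken
        boolean first = table.compareAndSet(3, null, "node-A");

        // Slot 3 is no longer null, so this CAS fails; in CHM the thread
        // would retry and then fall back to synchronized on the head node
        boolean second = table.compareAndSet(3, null, "node-B");

        System.out.println(first);        // prints true
        System.out.println(second);       // prints false
        System.out.println(table.get(3)); // prints node-A
    }
}
```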


When CAS Does Not Help

If the bucket is not empty:

bin0 → node → node → node

Local lock is used:

synchronized (f) {
    // modify chain
}

Lock is only for one bucket.


When a bucket turns into a tree

If:

bucket size > 8 (and table capacity >= 64; otherwise the table is resized instead)

List turns into a Red-Black Tree.

O(log n)

Diagram

before

bucket
  ↓
  N → N → N → N → N


after

        N
      /   \
     N     N
    / \   / \

Resize — the most difficult part

ConcurrentHashMap performs resize without global pause.

Cooperative migration

old table
 ├─ bin0
 ├─ bin1
 ├─ bin2
 └─ bin3
Thread1 → bin0
Thread2 → bin1
Thread3 → bin2
Thread4 → bin3

Threads help move buckets.

ForwardingNode

This is a signal:

"this bucket has already moved"

Java Memory Model

volatile Node<K,V>[] table;

This ensures visibility between threads.

put → happens-before → get
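A small illustrative sketch of this guarantee: a reader thread that spins on get() is certain to see the fully constructed value once put() completes, because the volatile access inside the map establishes happens-before.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VisibilityDemo {
    public static void main(String[] args) throws Exception {
        Map<String, int[]> map = new ConcurrentHashMap<>();

        // The reader spins until the entry appears; the volatile read
        // inside get() then guarantees the array contents are visible
        Thread reader = new Thread(() -> {
            int[] cfg;
            while ((cfg = map.get("config")) == null) {
                Thread.onSpinWait();
            }
            System.out.println(cfg[2]);  // prints 3
        });
        reader.start();

        // Everything done before put() is visible to the get() that sees it
        map.put("config", new int[]{1, 2, 3});
        reader.join();
    }
}
```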

CPU Cache Effects

CAS invokes:

invalidate cache line

If many threads hit one bucket:

cache ping-pong

Performance drops.


Impact on GC

Node
 key
 value
 next

Millions of Node → load on GC scanning.


Real production case

Map<IP, Counter>

load:

150k req/sec

Replacement:

synchronized HashMap
↓
ConcurrentHashMap

metric        before     after
latency p99   120ms      8ms
CPU           90%        40%
throughput    30k rps    150k rps
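A sketch of how such a counter map is usually written (the RateCounter class and IP literals are illustrative, not from the incident): computeIfAbsent is atomic per bucket, and LongAdder keeps hot keys from becoming CAS retry storms by striping the counter internally.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class RateCounter {
    // Hypothetical per-IP request counter, similar to the case above
    private final ConcurrentHashMap<String, LongAdder> counts =
            new ConcurrentHashMap<>();

    void record(String ip) {
        // computeIfAbsent locks at most one bin; increment() is lock-free
        counts.computeIfAbsent(ip, k -> new LongAdder()).increment();
    }

    long count(String ip) {
        LongAdder a = counts.get(ip);
        return a == null ? 0 : a.sum();
    }

    public static void main(String[] args) {
        RateCounter c = new RateCounter();
        for (int i = 0; i < 1000; i++) c.record("10.0.0.1");
        System.out.println(c.count("10.0.0.1"));  // prints 1000
    }
}
```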

Main trade-offs

solution            plus                      minus
CAS                 no lock                   retry loops
bucket lock         simple                    contention
tree bins           hash attack protection    memory
cooperative resize  no stop-the-world         complexity

The most important engineering idea

ConcurrentHashMap is a combination of three strategies:

optimistic (CAS)
pessimistic (locks)
structural (tree bins)

Each is used only when needed.


Engineering Intuition

If:

high concurrency
frequent reads
rare writes

ConcurrentHashMap is ideal.

But if:

writes dominate

CAS can turn into a retry storm.


Main conclusion

ConcurrentHashMap works quickly not because it is a completely lock-free structure. In fact, it actively uses locks inside. Its performance is achieved through another, much more important engineering idea — minimizing contention between threads.

Instead of one global lock that turns all operations into a sequential queue, ConcurrentHashMap breaks the work of the structure into many small independent sections. Each thread tries to work with the most isolated part of the data structure, which dramatically reduces the likelihood that two threads will block each other.

For this, several different synchronization strategies are combined inside the map:

  • CAS (Compare-And-Swap) — used for fast optimistic operations, for example, for inserting an element into an empty bucket without using a lock. If the operation succeeds, the thread does not enter the critical section at all.
  • Fine-grained locks — when collisions do occur, the lock is applied not to the entire map, but only to a specific bucket. This dramatically reduces the scope of the lock and allows other threads to continue working with the remaining parts of the structure.
  • Tree bins — if too many elements accumulate in one bucket, the linked list automatically transforms into a balanced red-black tree. This protects the structure from performance degradation when there are many collisions and prevents operations from turning into a linear search.
  • Cooperative resize — when increasing the table, the work of moving elements is distributed among threads. Several threads help move buckets into the new table, instead of stopping the entire data structure during the resize.

Thanks to the combination of these mechanisms, ConcurrentHashMap can operate efficiently at a very high degree of parallelism: dozens and hundreds of threads can perform read and write operations practically independently of each other.

This makes ConcurrentHashMap one of the best examples of evolutionary engineering in the JVM — a structure that gradually evolved from simply locking the entire map to a complex system of optimistic operations, local locks, and adaptive data structures.
