Stream vs For in Java: how to write the fastest code possible

In Java, performance is often determined not by the "beauty of the code," but by how it interacts with memory, the JIT compiler, and the CPU cache. Let's analyze why a plain for loop is often faster than Stream, and how to write truly fast code.


1. Basic truth: Stream vs For is not an equal battle

When comparing Stream and for, it is important to understand: Stream is an abstraction over iteration.

  • for — direct access to memory
  • Stream — pipeline + lambdas + additional calls

Each layer of abstraction adds overhead.


2. Why for is faster

The optimal loop looks like this:


long sum = 0;
for (int i = 0; i < data.length; i++) {
    int x = data[i];
    if (x > 100) {
        sum += x;
    }
}

Reasons for high speed:

  • No objects
  • No lambda calls
  • No pipeline
  • Linear memory access (cache-friendly)
  • JIT easily optimizes and vectorizes

3. Why Stream is slower

Stream looks simple:


Arrays.stream(data)
    .filter(x -> x > 100)
    .sum();

But under the hood, this is what happens:

  • creation of IntPipeline
  • lambda calls
  • iterator model
  • chain of operations

Even with JIT optimizations, there is still overhead.
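A loose mental model of that extra layer can be sketched in plain Java: the filtering condition becomes a separate object, and every element goes through an interface call instead of an inlined comparison. The class and method names below are illustrative, not the real Stream internals:

```java
import java.util.function.IntPredicate;

// Simplified model of the indirection a Stream pipeline adds.
public class PipelineModel {

    // The predicate is a separate object; each element costs a call to
    // test() instead of an inlined `x > 100` comparison. The JIT can
    // often inline this, but it is one more thing it has to prove safe.
    static long filteredSum(int[] data, IntPredicate filter) {
        long sum = 0;
        for (int x : data) {
            if (filter.test(x)) {
                sum += x;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {50, 150, 200};
        System.out.println(filteredSum(data, x -> x > 100)); // 350
    }
}
```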


4. Boxing — the main killer of performance

The most common mistake:


List<Integer>

Problems:

  • every int → Integer (boxing)
  • load on GC
  • poor cache locality

Correct:


int[] data
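To make the difference concrete, here is a small sketch (class and method names are made up for illustration) comparing the boxed and primitive versions of the same sum:

```java
import java.util.List;

// Illustrative demo contrasting boxed and primitive sums.
public class BoxingDemo {

    // Every element of a List<Integer> is an Integer object on the heap;
    // summing forces an unboxing per element and pressures the GC.
    static long boxedSum(List<Integer> data) {
        long sum = 0;
        for (Integer x : data) { // unboxing happens here
            if (x > 100) {
                sum += x;
            }
        }
        return sum;
    }

    // int[] stores values contiguously: no objects, no unboxing,
    // and linear access that the CPU prefetcher handles well.
    static long primitiveSum(int[] data) {
        long sum = 0;
        for (int x : data) {
            if (x > 100) {
                sum += x;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(boxedSum(List.of(50, 150, 200)));       // 350
        System.out.println(primitiveSum(new int[]{50, 150, 200})); // 350
    }
}
```

Both methods compute the same result; the primitive version simply does it without any heap allocation per element.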

5. When Stream is NOT worse

Stream can be almost as fast if:

  • IntStream is used
  • a simple chain of operations
  • no collect()
  • no boxing

long sum = Arrays.stream(data)
    .filter(x -> x > 100)
    .asLongStream() // widen before summing: IntStream.sum() returns int and can overflow
    .sum();

6. When Stream loses badly

  • List<Integer> (boxing)
  • long, complex pipeline chains
  • collect() into a collection
  • small arrays (overhead > useful work)
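The collect() and boxing points can be shown side by side. In this sketch (names are illustrative), the first pipeline boxes every surviving element and builds an intermediate list, while the second stays in primitives the whole way:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative demo of where the extra Stream cost comes from.
public class CollectCost {

    // Boxed pipeline: boxed() wraps each surviving int in an Integer,
    // and collect() allocates and fills an intermediate List.
    static List<Integer> keepLarge(int[] data) {
        return Arrays.stream(data)
                .filter(x -> x > 100)
                .boxed()
                .collect(Collectors.toList());
    }

    // Primitive pipeline: values stay as int/long the whole way,
    // no boxing and no intermediate collection.
    static long sumLarge(int[] data) {
        return Arrays.stream(data)
                .filter(x -> x > 100)
                .mapToLong(x -> x) // widen before summing to avoid int overflow
                .sum();
    }

    public static void main(String[] args) {
        int[] data = {50, 150, 200, 300};
        System.out.println(keepLarge(data)); // [150, 200, 300]
        System.out.println(sumLarge(data));  // 650
    }
}
```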

7. JIT and why the results "float"

The Java Virtual Machine dynamically:

  • compiles hot methods
  • inlines calls
  • optimizes loops
  • changes behavior at runtime

Therefore:

Stream can be faster than the loop in one run and slower in another.

8. CPU cache and call order

The order of execution affects:

  • cache warm-up
  • branch prediction
  • memory prefetch

This explains why the results may swap places.


9. Practical Rule for Selection

Use for if:

  • performance-critical hot path
  • working with arrays
  • minimal latency is important

Use Stream if:

  • readability is important
  • micro-performance is not critical
  • simple data processing

10. Final Conclusion

The main principle of Java performance:

Closer to memory — faster code.
In Java, performance is determined not by style (for vs stream), but by:
- memory
- allocations
- boxing/unboxing
- cache locality
- JIT inline decisions

Stream is convenience. For is control and speed.

⚔️ Stream vs For in Java — maximum detailed comparative table

| Criterion | For loop | Stream | Comment (what really happens inside the JVM) |
|---|---|---|---|
| 🏎️ Execution speed | Very high | Average/high (case-dependent) | For is closer to machine code. Stream adds pipeline overhead, even after JIT optimizations. |
| 🧠 Abstraction overhead | Minimal | High | Stream = Iterator + Spliterator + pipeline + lambda chain. Each layer is potential overhead. |
| 📦 Boxing/unboxing | No (with int[]/long[]) | Often present (with List<Integer>) | Boxing creates Integer objects → GC pressure + cache misses. |
| 💾 Cache locality | Excellent | Average/poor | A loop walks the array linearly → CPU prefetch is effective. A pipeline can break locality. |
| ⚙️ JIT optimization | Highly optimizable | Optimizable, but harder | A loop is easily inlined and vectorized (SIMD). Stream requires analysis of the whole call chain. |
| 🔥 Inlining | Almost always | Partial | The Stream pipeline can prevent complete inlining of the chain. |
| 🧩 Lambda overhead | None | Yes | Lambdas can be inlined, but not always; sometimes invokedynamic dispatch remains. |
| 🚀 SIMD/vectorization | Often possible | Rarely | The JVM finds it easier to vectorize a simple loop than a Stream pipeline. |
| 🧾 Readability | Average | High | Stream reads better for complex data-processing logic. |
| ⚡ Small datasets | Very fast | Slower (overhead outweighs the work) | Stream overhead does not pay off on small data. |
| 📊 Large datasets | Very fast | Almost comparable | When the real work dominates the overhead, the gap shrinks. |
| 🧪 GC pressure | Low | Medium/high | Stream may create intermediate objects → more GC work. |
| 🔁 Short-circuit (break/continue) | Full control | Limited | Stream models complex break/continue scenarios poorly. |
| 🧱 Pipeline complexity | Linear code | Chain of operations | Stream builds an operation chain → scheduling overhead. |
| ⚠️ Predictability | Very high | Average | Loop behavior is stable. Stream depends on JIT decisions. |
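The short-circuit row deserves a concrete illustration. A loop can break out at any point; the closest Stream equivalents are built-in short-circuiting operations such as findFirst(). The class and method names below are illustrative:

```java
import java.util.Arrays;

public class ShortCircuit {

    // Loop version: break/early return gives full control over when to stop.
    static int firstOver(int[] data, int threshold) {
        for (int x : data) {
            if (x > threshold) {
                return x; // stop scanning immediately
            }
        }
        return -1;
    }

    // Stream version: findFirst() is a short-circuiting terminal operation,
    // but more complex break/continue patterns have no direct Stream analogue.
    static int firstOverStream(int[] data, int threshold) {
        return Arrays.stream(data)
                .filter(x -> x > threshold)
                .findFirst()
                .orElse(-1);
    }

    public static void main(String[] args) {
        int[] data = {10, 40, 120, 300};
        System.out.println(firstOver(data, 100));       // 120
        System.out.println(firstOverStream(data, 100)); // 120
    }
}
```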

⚡ Conclusion:

- For = control + maximum performance
- Stream = expressiveness + convenience

🔥 In Java, speed is determined not by style, but by:
- memory
- allocations
- boxing/unboxing
- cache locality
- JIT inline decisions

⚡ Stream vs Loop - Code Example with Comments


import java.util.Arrays;

public class Test {

    static int[] data;

    public static void main(String[] args) {

        int size = 20_000_000;
        data = new int[size];

        // =========================
        // INIT DATA (once)
        // =========================
        for (int i = 0; i < size; i++) {
            data[i] = i;
        }

        // =========================
        // WARMUP (very important for JVM)
        // =========================
        // During warmup the JVM has not yet fully optimized the code.
        // The JIT (Just-In-Time) compiler:
        // - detects hot methods
        // - compiles their bytecode to native code
        // - performs inlining optimizations
        for (int i = 0; i < 5; i++) {
            streamSum(); // warming up Stream pipeline
            loopSum();   // warming up the usual loop
        }

        // =========================
        // TEST ORDER 1: STREAM -> LOOP
        // =========================
        System.out.println("=== ORDER 1: STREAM -> LOOP ===");

        long t1 = System.nanoTime();

        // Stream pipeline:
        // Arrays.stream -> IntPipeline -> lambda filter -> sum
        // intermediate objects are created (even if optimized by JIT)
        long r1 = streamSum();

        long t2 = System.nanoTime();

        // Loop:
        // direct access to the array
        // without objects, without pipeline
        long t3 = System.nanoTime();

        long r2 = loopSum();

        long t4 = System.nanoTime();

        System.out.println("Stream result = " + r1);
        System.out.println("Stream time   = " + (t2 - t1) / 1_000_000 + " ms");

        System.out.println("Loop result   = " + r2);
        System.out.println("Loop time     = " + (t4 - t3) / 1_000_000 + " ms");


        // =========================
        // TEST ORDER 2: LOOP -> STREAM
        // =========================
        System.out.println("\n=== ORDER 2: LOOP -> STREAM ===");

        long t5 = System.nanoTime();

        // now Loop runs FIRST
        // CPU cache + branch predictor can already be "warmed up"
        long r3 = loopSum();

        long t6 = System.nanoTime();

        long t7 = System.nanoTime();

        // Stream is now second
        // it may win or lose due to cache state
        long r4 = streamSum();

        long t8 = System.nanoTime();

        System.out.println("Loop result   = " + r3);
        System.out.println("Loop time     = " + (t6 - t5) / 1_000_000 + " ms");

        System.out.println("Stream result = " + r4);
        System.out.println("Stream time   = " + (t8 - t7) / 1_000_000 + " ms");


        // =========================
        // IMPORTANT MOMENT
        // =========================
        // Even if the code is the same:
        // - CPU cache changes
        // - JIT may already "inline" the loop
        // - branch predictor gets trained
        // - GC may have occurred between measurements
    }

    // =========================
    // STREAM VERSION
    // =========================
    static long streamSum() {
        return Arrays.stream(data)
                // lambda -> can be:
                // - inlined
                // - or dispatched via invokedynamic
                .filter(x -> x > 100)
                // widen to long before summing: IntStream.sum() returns int
                // and overflows on 20 million elements
                .asLongStream()
                .sum();
    }

    // =========================
    // LOOP VERSION
    // =========================
    static long loopSum() {

        long sum = 0;

        // direct for-each:
        // - no objects
        // - no pipeline
        // - minimal overhead
        for (int x : data) {
            if (x > 100) {
                sum += x;
            }
        }

        return sum;
    }
}

// Sample output
//=== ORDER 1: STREAM -> LOOP ===
//Stream result = 199999989994950
//Stream time   = 14 ms
//Loop result   = 199999989994950
//Loop time     = 9 ms

//=== ORDER 2: LOOP -> STREAM ===
//Loop result   = 199999989994950
//Loop time     = 9 ms
//Stream result = 199999989994950
//Stream time   = 15 ms
// ---------------- Summary
//✔ JVM has already warmed up
//✔ order is not important
//✔ results are stable
//✔ Loop is faster than Stream in this case

