Talk · Interactive edition

Multi-Threading

What really happens when code runs “at the same time”? We start at a single CPU executing one function and climb — through registers, context switches, fork, the Android Looper, all the way to the state machine Kotlin hides behind suspend. Every layer is something you can step through.

Tehran, Iran · 1h 35m · Persian · Recorded
The full talk

Watch the original — 1 hour, 35 minutes

Start here if you want the complete journey in Persian — every diagram, every code dive, and the back-and-forth between both speakers. Then scroll on for the interactive companion below.

Bottom-up

Concurrency is taught backwards. We learn the high-level APIs first — threads, executors, coroutines — and the machinery underneath stays a black box. This talk does the opposite: it builds the whole stack from the bottom, so the abstractions at the top finally make sense.

By the end, a coroutine won't be magic. It'll be a compiler-generated state machine, scheduling itself onto a thread, which is itself a set of registers the kernel swaps onto a CPU. Here's the climb.

01

The CPU

Registers, the program counter, and executing one function.

02

Processes

The PCB, context switching, fork and clone.

03

Threads

Shared vs isolated memory, and Thread-Local Storage.

04

Android

MessageQueue, Looper, Handler — and syncing to VSYNC.

05

Coroutines

Virtual threads, suspend, and the CPS state machine.

01

The CPU executes one thing

A core runs a single stream of instructions, shuffling values between a handful of registers.

Underneath everything is a CPU with an arithmetic logic unit, a control unit, and a small set of registers: tiny, ultra-fast slots of memory. A few are special. Using ARM's names, the PC (program counter) points at the next instruction, the SP (stack pointer) tracks the top of the stack, and the LR (link register) remembers where to return.

That's all “running a program” is: read instruction, update registers, advance the PC. Step through sum(10, 20) and watch the registers change one instruction at a time.
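As a rough sketch of that read, update, advance cycle, here is a toy register machine in Kotlin. The instruction set (Mov, Add, Ret), the ToyCpu class and the four-register layout are invented for illustration and match no real ISA; a real core decodes binary instructions, not objects.

```kotlin
// Hypothetical toy instruction set, for illustration only.
sealed interface Instr
data class Mov(val dst: Int, val imm: Int) : Instr            // r[dst] = imm
data class Add(val dst: Int, val a: Int, val b: Int) : Instr  // r[dst] = r[a] + r[b]
object Ret : Instr                                            // stop; the result is in r0

class ToyCpu {
    val regs = IntArray(4)   // r0..r3
    var pc = 0               // index of the next instruction

    fun run(program: List<Instr>): Int {
        while (true) {
            val instr = program[pc]   // read the instruction the PC points at
            pc++                      // advance the PC
            when (instr) {            // update the registers
                is Mov -> regs[instr.dst] = instr.imm
                is Add -> regs[instr.dst] = regs[instr.a] + regs[instr.b]
                is Ret -> return regs[0]
            }
        }
    }
}

// sum(10, 20): load both arguments, add them into r0, return
val sumProgram = listOf(Mov(0, 10), Mov(1, 20), Add(0, 0, 1), Ret)
```

Calling `ToyCpu().run(sumProgram)` walks the four instructions in order and returns 30 from r0.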

02

Processes, and the art of pretending

One core, many programs. The OS fakes simultaneity by switching fast.

A running program is a process. The kernel tracks each one in a Process Control Block (PCB): its ID, its state, its memory, and — crucially — a saved copy of all the registers. With a single core, only one process runs at any instant. The illusion of “at the same time” comes from switching between them many times a second.

Context switching

To switch, the kernel saves the CPU's registers into the current process's PCB, then restores the next process's registers from its PCB. One register set gets parked and another is swapped in.

It isn't hand-waving — here's the real ARM64 routine the Linux kernel runs to do it. Save the callee-saved registers of the previous task, load the next task's, and ret straight into it:

// linux/arch/arm64/kernel/entry.S — switching AArch64 tasks
SYM_FUNC_START(cpu_switch_to)
    mov   x10, #THREAD_CPU_CONTEXT
    add   x8, x0, x10            // x0 = previous task
    mov   x9, sp
    stp   x19, x20, [x8], #16    // save callee-saved registers
    stp   x21, x22, [x8], #16
    // …x23–x28…
    stp   x29, x9,  [x8], #16
    str   lr, [x8]

    add   x8, x1, x10            // x1 = next task
    ldp   x19, x20, [x8], #16    // restore callee-saved registers
    // …x21–x28, then x29 and x9 (the next task's saved sp)…
    ldr   lr, [x8]
    mov   sp, x9
    ret                          // returns into the *next* task
SYM_FUNC_END(cpu_switch_to)
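The same park-and-restore idea can be modelled in a few lines of Kotlin. This is only a sketch under simplifying assumptions: Registers, Pcb, Core and step() are hypothetical names, the register set is cut down to three fields, and nothing here reflects how the kernel actually lays out a task.

```kotlin
// Hypothetical model of a context switch; all names are illustrative.
data class Registers(val pc: Int = 0, val sp: Int = 0, val x0: Int = 0)

class Pcb(val pid: Int, var saved: Registers)

class Core {
    var regs = Registers()      // the one physical register set
    var current: Pcb? = null    // whose registers are loaded right now

    fun switchTo(next: Pcb) {
        current?.saved = regs   // park the outgoing process's registers in its PCB
        regs = next.saved       // load the incoming process's saved registers
        current = next
    }
}

// Simulate running one instruction: bump the program counter.
fun Core.step() {
    regs = regs.copy(pc = regs.pc + 4)
}
```

If process 1 starts at pc = 100, runs one step and is switched out, its PCB holds pc = 104, and switching back to it later resumes from exactly that value.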

Creating a process: fork()

fork() clones the current process whole: same code, same memory (copy-on-write), its own PCB. The twist is that it returns twice. The child sees 0; the parent sees the child's PID. If the call fails, the parent sees -1 and no child exists.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    printf("First\n");
    pid_t res = fork();

    if (res == 0) {
        printf("Child\n");                              // res == 0  → we are the child
    } else if (res > 0) {
        printf("Parent: child PID is %d\n", (int) res); // res > 0   → child's PID
    } else {
        perror("Parent: fork failed");                  // res == -1 → no child was created
    }
    return 0;
}

And the truth: clone()

On Linux, fork() is really a thin wrapper over clone() — a system call that lets you choose precisely what to share with the new execution context:

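// raw system call, argument order as on x86-64
long clone(
    unsigned long flags,   // what to SHARE with the parent
    void *stack,           // the new execution context's stack
    int *parent_tid,       // used with CLONE_PARENT_SETTID
    int *child_tid,        // used with CLONE_CHILD_SETTID / CLONE_CHILD_CLEARTID
    unsigned long tls      // Thread-Local Storage descriptor
);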

Share nothing and you get a process. Share the memory space, file descriptors and signal handlers, and join the parent's thread group so that getpid() returns the same PID, and you get a thread (each thread still has its own thread ID). A thread isn't a special primitive; it's just clone() with the right flags. This is how glibc spawns a pthread:

// glibc/nptl/pthread_create.c — how a thread is really born
const int clone_flags =
      CLONE_VM        // ← share the same memory space
    | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
    | CLONE_SIGHAND
    | CLONE_THREAD    // ← same PID: it's a thread, not a process
    | CLONE_SETTLS    // ← give it its own Thread-Local Storage
    | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;

__clone_internal(&args, &start_thread, pd);

03

Threads share everything but the stack

Many register-sets, one memory space — power and peril in one.

Threads live inside one process. They share its memory and resources, but each carries its own registers and stack so it can be scheduled independently. That shared memory is exactly why threads are fast to create and communicate — and exactly why they're dangerous. Toggle between the two models:

Memory model

object : Thread() {
    override fun run() {
        var sum = 0L                       // Long: the total overflows an Int
        for (i in 0..Int.MAX_VALUE / 2) {
            if (isInterrupted) return      // stop early once another thread calls interrupt()
            sum += i
        }
        println("DONE=$sum")
    }
}.start()
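To make the "peril" half concrete, here is a small sketch in which several threads increment two shared counters. The function name countWithThreads and its parameters are made up for this example. The plain variable can lose updates because `unsafe++` is a separate read, add and write, while the AtomicInteger performs each increment as one indivisible step.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// Hypothetical helper: returns (plain counter, atomic counter) after all threads finish.
fun countWithThreads(threads: Int, perThread: Int): Pair<Int, Int> {
    var unsafe = 0                 // plain shared variable: concurrent increments can be lost
    val safe = AtomicInteger(0)    // atomic: every increment is counted

    val workers = List(threads) {
        thread {                   // kotlin.concurrent.thread starts the thread immediately
            repeat(perThread) {
                unsafe++
                safe.incrementAndGet()
            }
        }
    }
    workers.forEach { it.join() }  // wait for every worker before reading the totals
    return unsafe to safe.get()
}
```

With `countWithThreads(4, 50_000)` the atomic total is always 200000. The plain total may come out lower, and by how much varies from run to run.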

Thread-Local Storage

If everything is shared, how can each thread keep its own value? Thread-Local Storage. The same ThreadLocal object used as a key returns a different value per thread — because the value actually lives in a map on the thread itself. Run it:

Thread-Local explorer

Same key, different value per thread.

// simplified sketch of java.lang.ThreadLocal; the real per-thread map is not public
class ThreadLocal<T> {
    fun set(value: T) {
        val map = Thread.currentThread().threadLocals
        map[this] = value          // 'this' ThreadLocal is the key
    }
    fun get(): T? {
        val map = Thread.currentThread().threadLocals
        return map[this]           // each thread, its own map
    }
}
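The sketch above is a model of the internals. With the real java.lang.ThreadLocal, the per-thread behaviour can be observed directly. The function name threadLocalDemo is invented for this example.

```kotlin
import kotlin.concurrent.thread

// Returns (what the worker thread read, what the calling thread reads afterwards).
fun threadLocalDemo(): Pair<String, String> {
    val name = ThreadLocal.withInitial { "unset" }
    name.set("main")               // stored in the calling thread's own map

    var workerSaw = ""
    thread {
        workerSaw = name.get()     // the worker's map has no entry yet, so it gets the initial value
        name.set("worker")         // only touches the worker's map
    }.join()                       // join() makes workerSaw visible to the caller

    return workerSaw to name.get()
}
```

The worker reads "unset" even though the caller had already stored "main", and the caller still reads "main" after the worker stored "worker".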
04

Android is one big loop

A thread, a queue, and a loop that never ends — that's the UI.

You rarely create raw threads on Android. Instead, work is posted as messages onto a MessageQueue, ordered by when they're due. A Looper spins forever, pulling the next ready message and handing it to a Handler to run. Post a few — immediate and delayed — and watch the loop drain them in time order:

To turn any thread into a message-processing thread, you give it a Looper and loop it. That's the whole pattern:

import android.os.Handler
import android.os.Looper

class MyLooperThread {
    // written by the looper thread, read by callers: must be @Volatile
    @Volatile private var handler: Handler? = null

    init {
        Thread {
            Looper.prepare()                       // 1: give this thread a Looper
            handler = Handler(Looper.myLooper()!!) // 2: a Handler to post into it
            Looper.loop()                          // 3: block & process forever
        }.start()
    }

    // both return null (the task is dropped) until the looper thread has created the handler
    fun post(task: () -> Unit) = handler?.post(task)
    fun postDelayed(delay: Long, task: () -> Unit) =
        handler?.postDelayed(task, delay)

    fun stop() {
        handler?.post { Looper.myLooper()?.quit() }
        handler = null
    }
}
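The queue-and-loop idea itself does not need Android. Below is a minimal model in plain Kotlin, assuming a virtual clock in place of real time. ToyLooper and Message are hypothetical classes and not the android.os ones; unlike the real Looper, loop() here returns when the queue is empty instead of blocking.

```kotlin
import java.util.PriorityQueue

// Hypothetical message: due time, insertion order (tie-breaker), and the work to run.
class Message(val whenMs: Long, val seq: Long, val task: () -> Unit)

class ToyLooper {
    private var now = 0L        // virtual clock, in milliseconds
    private var nextSeq = 0L
    private val queue = PriorityQueue<Message>(
        compareBy<Message>({ it.whenMs }, { it.seq })   // earliest due first, then FIFO
    )

    fun post(task: () -> Unit) = postDelayed(0L, task)

    fun postDelayed(delayMs: Long, task: () -> Unit) {
        queue.add(Message(now + delayMs, nextSeq++, task))
    }

    // Drain the queue in due-time order.
    fun loop() {
        while (true) {
            val msg = queue.poll() ?: return   // empty queue: stop (the real Looper would wait)
            now = maxOf(now, msg.whenMs)       // "sleep" until the message is due
            msg.task()                         // dispatch it
        }
    }
}
```

Posting a 200 ms task, then an immediate one, then a 100 ms one, and calling loop() runs them in the order immediate, 100 ms, 200 ms.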

The Main Thread is just a Looper thread

There's nothing magic about the UI thread. It was prepared as a Looper thread before your app's code ran, and every touch, draw and callback is a message on its queue. Which is why this works from anywhere:

val mainHandler = Handler(Looper.getMainLooper())

mainHandler.post {
    println("Hello from the Main Thread!")
}

Synchronising with VSYNC

Animations can't just update whenever: they'd tear or waste frames. The Choreographer posts frame callbacks aligned to the display's VSYNC pulse (about 16.7 ms apart at 60 Hz), and ValueAnimator computes its value once per pulse. One heartbeat drives every animation in sync:

Choreographer · VSYNC

With sync off, updates land between pulses — work the display can't show yet.

This page covers the gist. For the full story of how ValueAnimator and Choreographer work together, see the companion article linked at the end.
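The frame-interval arithmetic is easy to check by hand. The helpers below are a sketch with invented names, and they assume the pulses fall on exact multiples of the period starting at time zero, which a real display pipeline does not guarantee.

```kotlin
// Time between two VSYNC pulses, in nanoseconds (integer division truncates).
fun frameIntervalNanos(refreshHz: Int): Long = 1_000_000_000L / refreshHz

// The first pulse strictly after nowNanos, assuming pulses at 0, period, 2 * period, ...
fun nextVsyncNanos(nowNanos: Long, refreshHz: Int): Long {
    val period = frameIntervalNanos(refreshHz)
    return (nowNanos / period + 1) * period
}
```

At 60 Hz the interval comes out as 16,666,666 ns, so an update computed at 20 ms cannot reach the screen before the pulse at roughly 33.3 ms.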

05

Past the thread: tasks & coroutines

OS threads are heavy. The modern answer is to multiplex many tasks onto few threads.

OS threads are expensive — each needs a stack and a kernel slot, so you can have thousands, not millions. The modern fix decouples the task from the thread: run a huge number of lightweight tasks on a small pool of real threads, parking a task whenever it would block. Watch many virtual threads share a few carriers:

Virtual vs platform threads

A few platform threads (carriers) run a crowd of virtual threads. When one waits, another is mounted in its place.

// JDK 21+ — millions of these are cheap
val virtualThread = Thread.ofVirtual().name("vt").start(runnable)

// the classic, expensive 1:1 OS thread
val platformThread = Thread.ofPlatform().name("pt").start(runnable)

Project Loom brought this to the JVM as Virtual Threads in JDK 21. Kotlin had been doing the same for years with coroutines — the engine behind it is the suspend function:

// Coroutines do the same — thanks to suspend functions
lifecycleScope.launch(Dispatchers.IO) { myService() }
lifecycleScope.launch(Dispatchers.IO) { myService2() }

Four ways to write async

The same “call two services, combine the results” task, across the models that led us to coroutines. Notice how the last one reads like ordinary sequential code:

fun myService(callback: (String) -> Unit)

myService { res1 ->
    myService2 { res2 ->
        println(res1 + res2)
    }
}

Nesting begets nesting — the pyramid of doom.

fun myService(): CompletableFuture<String>

val cf = myService()
    .thenCombine(myService2()) { a, b -> a + b }
println(cf.join())

Composable, but verbose and easy to leak.

fun myService(): Mono<String>

myService()
    .flatMap { res1 -> myService2().map { res2 -> res1 + res2 } }
    .subscribe { res -> println(res) }

Powerful streams — and a steep learning curve.

suspend fun myService(): String

lifecycleScope.launch {
    println(myService() + myService2())
}

Reads like blocking code, runs asynchronously.

The secret: a state machine

A suspend function looks sequential, but the compiler rewrites it. Every suspension point becomes a state; the function can return early at a state, release its thread, and later resume exactly where it left off. Take the simplest possible example:

suspend fun helloWorld(): String {
    delay(1000)
    return "Hello World!"
}

The compiler splits it at delay() into two states, wrapped in a hidden Continuation that remembers a label and the last result. Step through what actually runs:

This is Continuation-Passing Style: instead of blocking, each step is handed a continuation to call when its result is ready. The generated code is roughly:

// what the compiler actually generates (simplified)
fun helloWorld(completion: Continuation<*>): Any? {
    val cont = completion as? HelloWorldContinuation
        ?: HelloWorldContinuation(completion)

    when (cont.label) {
        0 -> {
            cont.label = 1
            val r = delay(1000, cont)            // pass the continuation
            if (r == COROUTINE_SUSPENDED)        // not ready? give up the thread
                return COROUTINE_SUSPENDED
        }
        1 -> { /* resumed here after the delay */ }
    }
    return "Hello World!"
}
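The simplified listing above leans on compiler-internal pieces, so it cannot be run as written. The following hand-rolled model can. Everything in it is a stand-in invented for this sketch: SUSPENDED plays the role of COROUTINE_SUSPENDED, pendingTimers plays the role of a timer or event loop, and fakeDelay never waits at all.

```kotlin
// Hand-rolled model; none of these names are the real kotlin.coroutines API.
val SUSPENDED = Any()                              // stand-in for COROUTINE_SUSPENDED
val pendingTimers = mutableListOf<() -> Unit>()    // stand-in for a timer / event loop

// A "delay" that never blocks: it stores how to continue and returns at once.
fun fakeDelay(resume: () -> Unit): Any {
    pendingTimers.add(resume)
    return SUSPENDED
}

class HelloWorldMachine {
    var label = 0                 // which state to run next
    var result: String? = null    // filled in once the function really completes

    fun step(): Any? {
        when (label) {
            0 -> {
                label = 1                               // remember where to resume
                val r = fakeDelay { step() }            // hand over the continuation
                if (r === SUSPENDED) return SUSPENDED   // give the caller its thread back
            }
            1 -> { /* resumed here after the delay */ }
        }
        result = "Hello World!"
        return result
    }
}
```

The first call to step() returns SUSPENDED with no result yet. When the stored callback in pendingTimers is invoked later, step() runs again in state 1 and produces "Hello World!".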

So what's the difference?

This is the punchline. Thread.sleep() blocks its thread — it sits there, occupied, doing nothing. delay() suspends — it returns COROUTINE_SUSPENDED, freeing the thread to run other work, and resumes later via its continuation. Same one second, very different cost:

suspend fun test1() {
    println("Start")
    delay(1000)        // suspends — the thread is FREE to do other work
    println("End")
}

fun test2() {
    println("Start")
    Thread.sleep(1000) // blocks — the whole thread sits idle, wasted
    println("End")
}

delay() vs Thread.sleep()

Each lane gets the same 5 tasks and one thread. Watch which one keeps working while it "waits".

A two-handed talk

Presented together

This was a joint session — built and delivered together, trading off between the low-level systems story and the high-level Kotlin one.

AmirHossein Aghajari
Ali Nasrabadi

Take it further

Further reading