Talk · Interactive edition
Multi-Threading
What really happens when code runs “at the same time”? We start at a
single CPU executing one function and climb — through registers,
context switches, fork, the Android Looper, all the way
to the state machine Kotlin hides behind suspend. Every layer
is something you can step through.
The full talk
Watch the original — 1 hour, 35 minutes
Start here if you want the complete journey in Persian — every diagram, every code dive, and the back-and-forth between both speakers. Then scroll on for the interactive companion below.
Bottom-up
Concurrency is taught backwards. We learn the high-level APIs first — threads, executors, coroutines — and the machinery underneath stays a black box. This talk does the opposite: it builds the whole stack from the bottom, so the abstractions at the top finally make sense.
By the end, a coroutine won't be magic. It'll be a compiler-generated state machine, scheduling itself onto a thread, which is itself a set of registers the kernel swaps onto a CPU. Here's the climb.
The CPU
Registers, the program counter, and executing one function.
Processes
The PCB, context switching, fork and clone.
Threads
Shared vs isolated memory, and Thread-Local Storage.
Android
MessageQueue, Looper, Handler — and syncing to VSYNC.
Coroutines
Virtual threads, suspend, and the CPS state machine.
The CPU executes one thing
A core runs a single stream of instructions, shuffling values between a handful of registers.
Underneath everything is a CPU with an arithmetic unit, a control
unit, and a small set of registers — tiny, ultra-fast
slots of memory. A few are special: the PC (program counter)
points at the next instruction, the SP tracks the stack, and
the LR remembers where to return.
That's all “running a program” is: read instruction, update registers,
advance the PC. Step through sum(10, 20) and
watch the registers change one instruction at a time.
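The fetch, update, advance cycle can be sketched as a toy interpreter. This is a minimal illustration, not a real instruction set: the `Mov`, `Add` and `Ret` instructions, the four-register `Cpu` class and the `execute` function are all hypothetical names invented for this example.

```kotlin
// Hypothetical three-instruction machine; not a real ISA.
sealed interface Instr
data class Mov(val dst: Int, val value: Int) : Instr          // regs[dst] = value
data class Add(val dst: Int, val a: Int, val b: Int) : Instr  // regs[dst] = regs[a] + regs[b]
object Ret : Instr                                            // stop; the result is in regs[0]

class Cpu {
    val regs = IntArray(4)  // a handful of general-purpose registers
    var pc = 0              // program counter: index of the next instruction
}

fun execute(program: List<Instr>, trace: Boolean = false): Int {
    val cpu = Cpu()
    var running = true
    while (running) {
        val instr = program[cpu.pc]   // read the instruction the PC points at
        cpu.pc++                      // advance the PC
        when (instr) {                // update registers
            is Mov -> cpu.regs[instr.dst] = instr.value
            is Add -> cpu.regs[instr.dst] = cpu.regs[instr.a] + cpu.regs[instr.b]
            Ret -> running = false
        }
        if (trace) println("pc=${cpu.pc} regs=${cpu.regs.toList()}")
    }
    return cpu.regs[0]
}

// sum(10, 20): load both arguments, add them, return.
val sumProgram: List<Instr> = listOf(Mov(0, 10), Mov(1, 20), Add(0, 0, 1), Ret)

fun main() {
    println(execute(sumProgram, trace = true))
}
```

With `trace = true` each line of output is one step: the PC moves forward by one and at most one register changes.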
Processes, and the art of pretending
One core, many programs. The OS fakes simultaneity by switching fast.
A running program is a process. The kernel tracks each one in a Process Control Block (PCB): its ID, its state, its memory, and — crucially — a saved copy of all the registers. With a single core, only one process runs at any instant. The illusion of “at the same time” comes from switching between them many times a second.
Context switching
To switch, the kernel saves the CPU's registers into the current PCB, then restores the next process's registers from its PCB. Press play and watch a register set get parked and another swapped in:
Context switch
Process 1 is running on the core.
This exact dance is the Linux cpu_switch_to routine below.
It isn't hand-waving — here's the real ARM64 routine the Linux kernel
runs to do it. Save the callee-saved registers of the previous task,
load the next task's, and ret straight into it:
// linux/arch/arm64/kernel/entry.S — switching AArch64 tasks
SYM_FUNC_START(cpu_switch_to)
mov x10, #THREAD_CPU_CONTEXT
add x8, x0, x10 // x0 = previous task
mov x9, sp
stp x19, x20, [x8], #16 // save callee-saved registers
stp x21, x22, [x8], #16
// …x23–x28…
stp x29, x9, [x8], #16
str lr, [x8]
add x8, x1, x10 // x1 = next task
ldp x19, x20, [x8], #16 // restore callee-saved registers
// …
ldr lr, [x8]
mov sp, x9
ret // returns into the *next* task
SYM_FUNC_END(cpu_switch_to)
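The same save-then-restore idea can be modelled in a few lines of Kotlin. This is only a sketch under simplifying assumptions: `Registers`, `Pcb`, `Core` and `roundRobin` are hypothetical names, the "register set" is reduced to a program counter and one accumulator, and every process gets exactly one instruction per turn.

```kotlin
// Hypothetical model: one live register set on the core, one saved copy per PCB.
data class Registers(var pc: Int = 0, var acc: Int = 0)

class Pcb(val pid: Int, val saved: Registers = Registers())

class Core {
    val live = Registers()

    fun save(into: Pcb) {
        into.saved.pc = live.pc
        into.saved.acc = live.acc
    }

    fun restore(from: Pcb) {
        live.pc = from.saved.pc
        live.acc = from.saved.acc
    }

    // The context switch: park the previous task's registers, load the next task's.
    fun switch(prev: Pcb, next: Pcb) {
        save(prev)
        restore(next)
    }
}

// Runs `steps` one-instruction time slices over two processes and
// returns each process's accumulator as seen in its PCB.
fun roundRobin(steps: Int): List<Int> {
    val core = Core()
    val procs = listOf(Pcb(1), Pcb(2))
    var current = 0
    repeat(steps) {
        // "Execute" one instruction: the running process adds its pid to acc.
        core.live.acc += procs[current].pid
        core.live.pc++
        val next = (current + 1) % procs.size
        core.switch(procs[current], procs[next])
        current = next
    }
    core.save(procs[current])  // flush the live registers before inspecting the PCBs
    return procs.map { it.saved.acc }
}

fun main() {
    println(roundRobin(6))
}
```

Neither process ever sees the other's accumulator, even though both used the same "physical" register: that isolation comes entirely from the save and restore.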
Creating a process: fork()
fork() clones the current process whole — same code, same
memory (copy-on-write), its own PCB. The twist: it returns
twice. The child sees 0; the parent sees the
child's PID. Press the button to fork:
fork()
One process, about to fork.
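#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("First\n");
    fflush(stdout);  // flush before forking, or a buffered "First" can be printed by both processes
    pid_t res = fork();
    if (res == 0) {
        printf("Child\n");                               // res == 0 → we are the child
    } else if (res > 0) {
        printf("Parent: child PID is %d\n", (int)res);   // res > 0 → child's PID
    } else {
        printf("Parent: fork failed\n");                 // res < 0 → no child was created
    }
    return 0;
}
And the truth: clone()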
On Linux, fork() is really a thin wrapper over
clone() — a system call that lets you choose precisely
what to share with the new execution context:
// raw system call, x86-64 argument order (some architectures swap the last two)
long clone(
    unsigned long flags,   // what to SHARE with the parent
    void *stack,           // the new execution context's stack
    int *parent_tid,
    int *child_tid,
    unsigned long tls      // Thread-Local Storage descriptor
);
Share nothing and you get a process. Share the memory space, file
descriptors and signal handlers, and keep the same PID, and you get
a thread. A thread isn't a special primitive; it's just
clone() with the right flags. This is literally how glibc
spawns a pthread:
// glibc/nptl/pthread_create.c — how a thread is really born
const int clone_flags =
CLONE_VM // ← share the same memory space
| CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
| CLONE_SIGHAND
| CLONE_THREAD // ← same PID: it's a thread, not a process
| CLONE_SETTLS // ← give it its own Thread-Local Storage
| CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
__clone_internal(&args, &start_thread, pd);
Threads share everything but the stack
Many register-sets, one memory space — power and peril in one.
Threads live inside one process. They share its memory and resources, but each carries its own registers and stack so it can be scheduled independently. That shared memory is exactly why threads are fast to create and communicate — and exactly why they're dangerous. Toggle between the two models:
Memory model
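The "dangerous" half of shared memory can be shown with a counter that several threads increment. This is a minimal sketch: `countUnsafe` and `countAtomic` are hypothetical helper names, and the thread and iteration counts are arbitrary. The unsynchronised version may lose updates because `counter++` is a read, an add and a write that can interleave across threads; the atomic version cannot.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// Every thread increments the same plain variable: updates can be lost.
fun countUnsafe(threads: Int, perThread: Int): Int {
    var counter = 0
    val workers = (1..threads).map {
        thread { repeat(perThread) { counter++ } }
    }
    workers.forEach { it.join() }
    return counter
}

// Same work, but each increment is a single atomic read-modify-write.
fun countAtomic(threads: Int, perThread: Int): Int {
    val counter = AtomicInteger()
    val workers = (1..threads).map {
        thread { repeat(perThread) { counter.incrementAndGet() } }
    }
    workers.forEach { it.join() }
    return counter.get()
}

fun main() {
    println("unsafe: ${countUnsafe(4, 100_000)} (at most 400000, often less)")
    println("atomic: ${countAtomic(4, 100_000)}")
}
```

The unsafe result varies from run to run, which is exactly what makes this class of bug hard to reproduce.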
object : Thread() {
    override fun run() {
        var sum = 0L                      // Long: the total overflows an Int
        for (i in 0..Int.MAX_VALUE / 2) {
            if (isInterrupted) return     // stop early once interrupt() is called on this thread
            sum += i
        }
        println("DONE=$sum")
    }
}.start()
Thread-Local Storage
If everything is shared, how can each thread keep its own
value? Thread-Local Storage. The same ThreadLocal object used
as a key returns a different value per thread — because the value actually
lives in a map on the thread itself. Run it:
Thread-Local explorer
Same key, different value per thread.
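A small runnable check of that claim, using the real `java.lang.ThreadLocal`. The helper name `perThreadValues` and the thread names are made up for this sketch; each worker records what it saw before and after setting its own value through the one shared `ThreadLocal` object.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import kotlin.concurrent.thread

fun perThreadValues(): Map<String, String> {
    val local = ThreadLocal<String>()            // one key, shared by every thread
    val seen = ConcurrentHashMap<String, String>()

    local.set("main-value")                      // visible only to the calling thread

    val workers = listOf("A", "B").map { name ->
        thread(name = name) {
            val before = local.get()             // null: this thread has set nothing yet
            local.set("value-of-$name")
            seen[name] = "$before -> ${local.get()}"
        }
    }
    workers.forEach { it.join() }

    seen["main"] = local.get()                   // untouched by the workers
    return seen
}

fun main() {
    perThreadValues().toSortedMap().forEach { (who, value) -> println("$who: $value") }
}
```

Neither worker sees `main-value`, and the calling thread's value survives both workers' writes.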
// simplified sketch of the JDK's ThreadLocal (the real threadLocals field is package-private)
class ThreadLocal<T> {
    fun set(value: T) {
        val map = Thread.currentThread().threadLocals
        map[this] = value      // 'this' ThreadLocal is the key
    }
    fun get(): T? {
        val map = Thread.currentThread().threadLocals
        return map[this]       // each thread, its own map; null if never set on this thread
    }
}
Android is one big loop
A thread, a queue, and a loop that never ends — that's the UI.
You rarely create raw threads on Android. Instead, work is posted as
messages onto a MessageQueue, ordered by
when they're due. A Looper spins forever, pulling the next
ready message and handing it to a Handler to run. Post a few
— immediate and delayed — and watch the loop drain them in time order:
Looper · MessageQueue · Handler
The Looper is idle, waiting for messages.
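The queue-plus-loop idea can be sketched without Android at all. Everything here is a hypothetical stand-in rather than the real `android.os` classes: `Msg`, `MiniLooper` and `drainOrder` are invented names, the clock is simulated, and the loop returns when the queue is empty, whereas the real Looper blocks and waits for more messages.

```kotlin
import java.util.PriorityQueue

// Hypothetical message: when it is due, its posting order, and a label.
class Msg(val whenMs: Long, val seq: Int, val name: String)

class MiniLooper {
    // Ordered by due time; ties keep posting order.
    private val queue = PriorityQueue<Msg>(compareBy<Msg>({ it.whenMs }, { it.seq }))
    private var now = 0L   // simulated clock, in milliseconds
    private var seq = 0
    val log = mutableListOf<String>()

    fun postDelayed(name: String, delayMs: Long = 0) {
        queue.add(Msg(now + delayMs, seq++, name))
    }

    fun loop() {
        while (true) {
            val msg = queue.poll() ?: return   // a real Looper would block here instead
            now = maxOf(now, msg.whenMs)       // "sleep" until the message is due
            log += "${now}ms: ${msg.name}"     // stands in for the Handler running it
        }
    }
}

fun drainOrder(): List<String> {
    val looper = MiniLooper()
    looper.postDelayed("C", 300)
    looper.postDelayed("A")
    looper.postDelayed("B", 100)
    looper.postDelayed("A2")
    looper.loop()
    return looper.log
}

fun main() {
    drainOrder().forEach(::println)
}
```

Messages come out in due-time order regardless of the order they were posted in, and immediate messages keep their relative order.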
To turn any thread into a message-processing thread, you give it a Looper and loop it. That's the whole pattern:
import android.os.Handler
import android.os.Looper

class MyLooperThread {
    @Volatile private var handler: Handler? = null   // written on the looper thread, read by callers
    init {
        Thread {
            Looper.prepare()                          // 1: give this thread a Looper
            handler = Handler(Looper.myLooper()!!)    // 2: a Handler to post into it
            Looper.loop()                             // 3: block & process forever
        }.start()
    }
    // Note: anything posted before step 2 has run is silently dropped (handler is still null).
    fun post(task: () -> Unit) = handler?.post(task)
    fun postDelayed(delay: Long, task: () -> Unit) =
        handler?.postDelayed(task, delay)
    fun stop() {
        handler?.post { Looper.myLooper()?.quit() }   // quit from inside the looper thread
        handler = null
    }
}
The Main Thread is just a Looper thread
There's nothing magic about the UI thread. It was prepared as a Looper thread before your app's code ran, and every touch, draw and callback is a message on its queue. Which is why this works from anywhere:
val mainHandler = Handler(Looper.getMainLooper())
mainHandler.post {
println("Hello from the Main Thread!")
}
Synchronising with VSYNC
Animations can't just update whenever — they'd tear or waste frames.
The Choreographer posts frame callbacks aligned to the display's
VSYNC pulse (~16.7 ms at 60 Hz), and
ValueAnimator computes its value once per pulse. One heartbeat
drives every animation in sync:
Choreographer · VSYNC
With sync off, updates land between pulses — work the display can't show yet.
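The arithmetic behind "once per pulse" is small enough to write down. This is a sketch, not Choreographer's implementation: `frameIntervalNanos`, `nextVsync` and `animatedValue` are hypothetical names, a fixed 60 Hz display is assumed, pulses are assumed to fall on exact multiples of the frame interval, and the interpolation is purely linear.

```kotlin
// Assumed refresh rate; real devices report their own.
val refreshHz = 60
val frameIntervalNanos = 1_000_000_000L / refreshHz   // integer division: 16_666_666 ns

// The first pulse strictly after `nowNanos`, assuming pulses at multiples of the interval.
fun nextVsync(nowNanos: Long): Long =
    (nowNanos / frameIntervalNanos + 1) * frameIntervalNanos

// Linear interpolation evaluated at a frame timestamp, clamped to the animation's range.
fun animatedValue(
    frameTimeNanos: Long,
    startNanos: Long,
    durationNanos: Long,
    from: Float,
    to: Float,
): Float {
    val fraction = ((frameTimeNanos - startNanos).toDouble() / durationNanos).coerceIn(0.0, 1.0)
    return (from + (to - from) * fraction).toFloat()
}

fun main() {
    val durationNanos = 100_000_000L   // a 100 ms animation from 0 to 100
    var t = 0L
    while (t < durationNanos) {
        t = nextVsync(t)               // only ever sample on a pulse
        println("t=${t / 1_000_000.0} ms value=${animatedValue(t, 0L, durationNanos, 0f, 100f)}")
    }
}
```

Because every animation samples at the same timestamps, they all advance in lockstep with the frames the display can actually show.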
This page covers the gist. For the full story of how
ValueAnimator and Choreographer work together,
see the companion article linked at the end.
Past the thread: tasks & coroutines
OS threads are heavy. The modern answer is to multiplex many tasks onto few threads.
OS threads are expensive — each needs a stack and a kernel slot, so you can have thousands, not millions. The modern fix decouples the task from the thread: run a huge number of lightweight tasks on a small pool of real threads, parking a task whenever it would block. Watch many virtual threads share a few carriers:
Virtual vs platform threads
A few platform threads (carriers) run a crowd of virtual threads. When one waits, another is mounted in its place.
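The "many tasks, few threads" half of the idea can be seen with a plain executor. This is only a partial illustration: a fixed thread pool cannot park a task in the middle the way virtual threads and coroutines can, so it shows the decoupling of task from thread but not the suspension. `runOnPool` is a hypothetical helper name.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

// Runs `tasks` small jobs on `poolSize` threads.
// Returns (jobs completed, distinct threads that actually ran them).
fun runOnPool(tasks: Int, poolSize: Int): Pair<Int, Int> {
    val pool = Executors.newFixedThreadPool(poolSize)
    val threadsUsed = ConcurrentHashMap.newKeySet<String>()
    val done = AtomicInteger()

    repeat(tasks) {
        pool.execute {
            threadsUsed.add(Thread.currentThread().name)
            done.incrementAndGet()
        }
    }

    pool.shutdown()                               // accept no new tasks, finish the queued ones
    pool.awaitTermination(10, TimeUnit.SECONDS)
    return done.get() to threadsUsed.size
}

fun main() {
    val (completed, threads) = runOnPool(tasks = 1000, poolSize = 4)
    println("$completed tasks ran on $threads threads")
}
```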
// JDK 21+ — millions of these are cheap
val virtualThread = Thread.ofVirtual().name("vt").start(runnable)
// the classic, expensive 1:1 OS thread
val platformThread = Thread.ofPlatform().name("pt").start(runnable)
Project Loom brought this to the JVM as Virtual Threads in JDK 21.
Kotlin had been doing the same for years with coroutines — the engine
behind it is the suspend function:
// Coroutines do the same — thanks to suspend functions
lifecycleScope.launch(Dispatchers.IO) { myService() }
lifecycleScope.launch(Dispatchers.IO) { myService2() }
Four ways to write async
The same “call two services, combine the results” task, across the models that led us to coroutines. Notice how the last one reads like ordinary sequential code:
fun myService(callback: (String) -> Unit)
myService { res1 ->
myService2 { res2 ->
println(res1 + res2)
}
}
Nesting begets nesting: the pyramid of doom.
fun myService(): CompletableFuture<String>
val cf = myService()
.thenCombine(myService2()) { a, b -> a + b }
println(cf.join())
Composable, but verbose and easy to leak.
fun myService(): Mono<String>
myService()
    .flatMap { a -> myService2().map { b -> a + b } }   // keep the first result so both can be combined
    .subscribe { res -> println(res) }
Powerful streams, and a steep learning curve.
suspend fun myService(): String
lifecycleScope.launch {
println(myService() + myService2())
}
Reads like blocking code, runs asynchronously.
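The first and last styles are closer than they look: a callback API can be wrapped as a suspend function with the standard library's `suspendCoroutine`. The sketch below uses only `kotlin.coroutines` (no `kotlinx.coroutines`, no `lifecycleScope`). The callback services `myServiceCb` and `myService2Cb` and the `combined` helper are hypothetical stand-ins, and they call back immediately rather than doing real asynchronous work.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

// Hypothetical callback-style services; real ones would call back later, from another thread.
fun myServiceCb(callback: (String) -> Unit) = callback("Hello, ")
fun myService2Cb(callback: (String) -> Unit) = callback("World!")

// The bridge: hand the callback API a lambda that resumes the continuation.
suspend fun myService(): String =
    suspendCoroutine { cont -> myServiceCb { result -> cont.resume(result) } }

suspend fun myService2(): String =
    suspendCoroutine { cont -> myService2Cb { result -> cont.resume(result) } }

// Starts the coroutine directly and captures its result in the completion continuation.
// This returns the finished value only because the stand-in services call back synchronously.
fun combined(): String {
    var out = ""
    suspend { myService() + myService2() }.startCoroutine(
        Continuation(EmptyCoroutineContext) { result -> out = result.getOrThrow() }
    )
    return out
}

fun main() {
    println(combined())
}
```

The body of the suspend lambda is the same one-liner as the coroutine version above; the nesting has moved into the two small bridge functions.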
The secret: a state machine
A suspend function looks sequential, but the compiler rewrites
it. Every suspension point becomes a state; the function can return early
at a state, release its thread, and later resume exactly where it left
off. Take the simplest possible example:
suspend fun helloWorld(): String {
delay(1000)
return "Hello World!"
}
The compiler splits it at delay() into two states, wrapped
in a hidden Continuation that remembers a
label and the last result. Step through what actually runs:
This is Continuation-Passing Style: instead of blocking, each step is handed a continuation to call when its result is ready. The generated code is roughly:
// what the compiler actually generates (simplified)
fun helloWorld(completion: Continuation<*>): Any? {
val cont = completion as? HelloWorldContinuation
?: HelloWorldContinuation(completion)
when (cont.label) {
0 -> {
cont.label = 1
val r = delay(1000, cont) // pass the continuation
if (r == COROUTINE_SUSPENDED) // not ready? give up the thread
return COROUTINE_SUSPENDED
}
1 -> { /* resumed here after the delay */ }
}
return "Hello World!"
}
So what's the difference?
This is the punchline. Thread.sleep() blocks its thread —
it sits there, occupied, doing nothing. delay() suspends —
it returns COROUTINE_SUSPENDED, freeing the thread to run
other work, and resumes later via its continuation. Same one second,
very different cost:
suspend fun test1() {
println("Start")
delay(1000) // suspends — the thread is FREE to do other work
println("End")
}
fun test2() {
println("Start")
Thread.sleep(1000) // blocks — the whole thread sits idle, wasted
println("End")
}
delay() vs Thread.sleep()
Each lane gets the same 5 tasks and one thread. Watch which one keeps working while it "waits".
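The two lanes can be approximated in code with one thread and a simulated clock. This is a sketch under loud assumptions: `VirtualLoop` is a hypothetical scheduler written for this example, its `delay` and `sleep` are stand-ins for the real `delay()` and `Thread.sleep()`, and time is a counter rather than the wall clock. It uses only the standard library's `kotlin.coroutines`.

```kotlin
import java.util.PriorityQueue
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

// Hypothetical single-threaded scheduler with a simulated clock (milliseconds).
class VirtualLoop {
    private class Timer(val at: Long, val seq: Int, val wake: () -> Unit)

    private val timers = PriorityQueue<Timer>(compareBy<Timer>({ it.at }, { it.seq }))
    private var seq = 0
    var now = 0L
        private set

    // Suspends: registers a wake-up and returns the thread to the loop.
    suspend fun delay(ms: Long): Unit = suspendCoroutine { cont ->
        timers.add(Timer(now + ms, seq++) { cont.resume(Unit) })
    }

    // Blocks: the only thread is occupied while the clock advances.
    fun sleep(ms: Long) {
        now += ms
    }

    // Runs the task on the calling thread until its first suspension point.
    fun launch(task: suspend () -> Unit) {
        task.startCoroutine(Continuation(EmptyCoroutineContext) { it.getOrThrow() })
    }

    // Drains pending wake-ups in due-time order.
    fun run() {
        while (true) {
            val timer = timers.poll() ?: return
            now = timer.at
            timer.wake()
        }
    }
}

fun finishTimeSuspending(tasks: Int): Long {
    val loop = VirtualLoop()
    repeat(tasks) { loop.launch { loop.delay(1000) } }
    loop.run()
    return loop.now
}

fun finishTimeBlocking(tasks: Int): Long {
    val loop = VirtualLoop()
    repeat(tasks) { loop.launch { loop.sleep(1000) } }
    loop.run()
    return loop.now
}

fun main() {
    println("suspending: ${finishTimeSuspending(5)} ms")
    println("blocking:   ${finishTimeBlocking(5)} ms")
}
```

With five tasks the suspending lane finishes at 1000 ms of simulated time, because all five waits overlap, while the blocking lane finishes at 5000 ms, because each wait holds the only thread.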
A two-handed talk
Presented together
This was a joint session — built and delivered together, trading off between the low-level systems story and the high-level Kotlin one.
Thank you.