Part 2 of 2: this post builds on the GMP model from Part 1.
The GMP model explains the structure of Go’s scheduler. This post explains how it stays efficient under real-world load, through three mechanisms that run silently behind every Go program:
- Work stealing – balances uneven workloads across processors
- The Netpoller – handles network I/O without blocking OS threads
- Async preemption – stops CPU-heavy goroutines from starving others
Once you understand these three, the scheduler stops feeling magical and starts feeling inevitable.
1. Work Stealing: No Idle Processors
Every P maintains its own Local Run Queue (LRQ), a fixed-size ring buffer of up to 256 goroutines. When a goroutine becomes runnable, Go places it into the LRQ of the current P rather than a shared global queue. This matters for performance: local queues require no global lock, produce less cache contention, and exhibit better CPU locality.
The problem local queues create is uneven load. One P can drain its queue while another is backed up with hundreds of runnable goroutines. Go’s answer is work stealing: an idle P hunts for work rather than parking.
When a P needs its next goroutine, it searches in this order:
- Its own LRQ
- The Global Run Queue (GRQ) – goroutines not yet assigned to any P
- The Netpoller – goroutines whose network I/O just completed
- Another P’s LRQ – stealing half of the victim’s queued goroutines
The half-steal detail is deliberate. Taking a batch rather than a single goroutine amortises the cost of the steal, and leaving the other half keeps the victim P busy. In the runtime’s runqgrab, the batch is taken from the head of the victim’s ring buffer, the same end the owner consumes from, while newly runnable goroutines are appended at the tail.
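A minimal sketch of the half-steal, assuming a single-threaded toy model: goroutines are plain int IDs, and runQueue and stealHalf are hypothetical names rather than runtime identifiers. The sketch takes the batch from the head of the victim's ring. The real runtime coordinates the thief and the owner with atomic operations, which are omitted here.

```go
package main

import "fmt"

const queueSize = 256 // matches the LRQ capacity described above

// runQueue is a toy, single-threaded stand-in for a P's local run queue.
// Goroutines are represented by plain int IDs.
type runQueue struct {
	head, tail uint32 // tail-head is the number of queued items
	buf        [queueSize]int
}

func (q *runQueue) len() int { return int(q.tail - q.head) }

// push appends at the tail and reports false when the ring is full.
func (q *runQueue) push(g int) bool {
	if q.tail-q.head == queueSize {
		return false
	}
	q.buf[q.tail%queueSize] = g
	q.tail++
	return true
}

// pop removes from the head; ok is false when the queue is empty.
func (q *runQueue) pop() (g int, ok bool) {
	if q.tail == q.head {
		return 0, false
	}
	g = q.buf[q.head%queueSize]
	q.head++
	return g, true
}

// stealHalf moves half of victim's items (rounded up) into thief and
// returns how many were moved. It never moves more than thief has room for.
func stealHalf(thief, victim *runQueue) int {
	n := victim.len()
	n -= n / 2
	if space := queueSize - thief.len(); n > space {
		n = space
	}
	for i := 0; i < n; i++ {
		g, _ := victim.pop() // cannot fail: n <= victim.len()
		thief.push(g)        // cannot fail: n <= free space in thief
	}
	return n
}

func main() {
	busy, idle := &runQueue{}, &runQueue{}
	for g := 0; g < 10; g++ {
		busy.push(g)
	}
	moved := stealHalf(idle, busy)
	fmt.Printf("stolen=%d busy=%d idle=%d\n", moved, busy.len(), idle.len())
	// prints "stolen=5 busy=5 idle=5"
}
```

Moving a batch in one go is the point of the design: the idle P pays the cost of finding a victim once and walks away with several goroutines' worth of work.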
The 61-tick rule: preventing GRQ starvation
If a P consulted the GRQ only when its own LRQ ran dry, goroutines sitting in the GRQ could wait indefinitely behind a P whose local queue never empties. To prevent this, runtime/proc.go has a hard-coded rule: every 61st scheduling tick, a P checks the GRQ before its own LRQ.
The number 61 is prime. The usual explanation is that a prime interval keeps GRQ checks from falling into step with other periodic patterns in the workload, so Ps are less likely to hammer the shared queue in lockstep. Small detail, large consequence.
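The rule is easier to see in a toy model. The sketch below is an assumption-laden simplification: toyP, next, and ticksUntilGlobalRuns are hypothetical names, goroutines are int IDs, and the local queue is refilled on every tick to imitate goroutines that keep each other runnable. It only shows why a periodic GRQ check bounds the wait of a goroutine stuck in the global queue.

```go
package main

import "fmt"

// globalCheckInterval mirrors the 61-tick rule described above.
const globalCheckInterval = 61

// toyP is a simplified stand-in for a P: a tick counter plus a local queue.
type toyP struct {
	schedTick uint32
	local     []int // local run queue, head at index 0
}

// next picks the next goroutine ID for p and reports which queue it came from.
// Every globalCheckInterval-th tick it looks at the global queue first.
func (p *toyP) next(global *[]int) (g int, source string) {
	p.schedTick++
	if p.schedTick%globalCheckInterval == 0 && len(*global) > 0 {
		g, *global = (*global)[0], (*global)[1:]
		return g, "global"
	}
	if len(p.local) > 0 {
		g, p.local = p.local[0], p.local[1:]
		return g, "local"
	}
	if len(*global) > 0 {
		g, *global = (*global)[0], (*global)[1:]
		return g, "global"
	}
	return -1, "idle"
}

// ticksUntilGlobalRuns simulates a P whose local queue never empties and
// returns the scheduling tick on which the lone global goroutine finally runs.
func ticksUntilGlobalRuns() int {
	p := &toyP{local: []int{1, 2, 3}}
	global := []int{99}
	for tick := 1; ; tick++ {
		g, source := p.next(&global)
		if source == "global" {
			return tick
		}
		p.local = append(p.local, g) // the goroutine becomes runnable again at once
	}
}

func main() {
	fmt.Println("global goroutine ran on tick", ticksUntilGlobalRuns())
	// prints "global goroutine ran on tick 61"
}
```

Without the modulo check, the loop in ticksUntilGlobalRuns would never terminate, which is exactly the starvation the rule exists to prevent.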
2. The Netpoller: Network I/O Without Blocking Threads
In Part 1 we covered blocking syscalls: file I/O blocks the M, a P-handoff is required, another M picks up the P and keeps running. The OS thread is genuinely blocked for the duration.
Network I/O is handled completely differently, and understanding why is one of the more valuable things you can take from this series.
Modern operating systems expose I/O event notification APIs – epoll on Linux, kqueue on macOS, IOCP on Windows. Instead of blocking a thread until data arrives, you register interest in a file descriptor and the OS notifies you when it’s readable. (IOCP reports completed operations rather than readiness, but it serves the same purpose.) Go’s Netpoller wraps these OS primitives behind a single internal interface.
Think of it like waiting for food delivery. One option: stand frozen at the door until the doorbell rings. The other: go about your day, and let the doorbell tell you when it’s time. The first option is what a thread-per-connection design does with network I/O – the OS thread literally stops and waits. Go uses the second. The OS is the doorbell.
Here’s what actually happens when your goroutine calls conn.Read() and no data has arrived yet:
Phase 1: Register and park. The socket is non-blocking and registered with the Netpoller, so the OS will notify the runtime when data is ready. The goroutine is then parked: removed from the run queue entirely and placed in a waiting list. It’s not scheduled anywhere. It’s suspended.
Phase 2: M keeps going. The OS thread that was running your goroutine is now free. It immediately picks up the next G from its local run queue. From the OS’s perspective the thread never stopped; it just switched tasks. The thread is not waiting. Only the goroutine is.
Phase 3: OS signals readiness. Data arrives. The OS notifies the Netpoller that the socket is readable. The Netpoller marks the goroutine runnable and places it back on a run queue.
Phase 4: Any P resumes it. Whichever processor is free next picks up that goroutine and runs it, not necessarily on the same M that started it. conn.Read() returns data to your code as if it had been a simple blocking call all along.
The distinction that follows from this explains a huge amount of Go’s network performance:
| Operation | What blocks | What Go does |
|---|---|---|
| File I/O | The M (OS thread) | P-handoff. Another M takes the P |
| Network I/O | The G (goroutine) | Park G only. M keeps running |
File I/O lands in a different row because the kernel doesn’t offer readiness notification for disk reads the way it does for sockets. The OS thread genuinely blocks inside the kernel, which is why the P-handoff is necessary: the blocked M has to detach its P so another M can keep running other goroutines. With network I/O, the M never blocks, so no handoff is needed.
That’s why Go servers can sustain massive concurrency on relatively few OS threads. A Go HTTP server handling 50,000 simultaneous connections doesn’t need 50,000 OS threads waiting on sockets. It needs enough threads to keep the CPUs busy doing actual compute work. The goroutines waiting on network I/O cost almost nothing while parked.
The boundary to remember: file I/O blocks the M, network I/O parks the G. Everything downstream of that distinction (thread count, memory use, throughput under load) follows from it.
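One way to check the "parks the G, not the M" claim on your own machine is to park a few hundred readers and count threads. The sketch below assumes loopback TCP is available and uses a hypothetical helper, parkReaders. It reads the thread count from the runtime's built-in threadcreate profile, which counts threads created so far, not threads currently alive, so treat the number as an upper bound. The exact figure varies by machine and Go version.

```go
package main

import (
	"fmt"
	"net"
	"runtime/pprof"
	"sync"
	"time"
)

// parkReaders opens n loopback TCP connections, starts one goroutine per
// accepted connection that blocks in Read, and returns how many OS threads
// the runtime has created once the readers have had time to park.
func parkReaders(n int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	var wg sync.WaitGroup
	var conns []net.Conn
	defer func() {
		for _, c := range conns {
			c.Close() // unblocks the parked Read calls
		}
		wg.Wait()
	}()

	for i := 0; i < n; i++ {
		client, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, err
		}
		conns = append(conns, client)
		server, err := ln.Accept()
		if err != nil {
			return 0, err
		}
		conns = append(conns, server)

		wg.Add(1)
		go func(c net.Conn) {
			defer wg.Done()
			buf := make([]byte, 1)
			_, _ = c.Read(buf) // no data is ever sent, so this goroutine parks
		}(server)
	}

	time.Sleep(100 * time.Millisecond) // give every reader time to park
	return pprof.Lookup("threadcreate").Count(), nil
}

func main() {
	const n = 200
	threads, err := parkReaders(n)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Printf("%d goroutines parked in Read, %d OS threads created\n", n, threads)
}
```

If each blocked Read pinned an OS thread, the thread count would track the number of connections. On a typical run it stays far below it.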
3. Async Preemption: Stopping a Goroutine That Won’t Yield
Before Go 1.14, the scheduler was cooperative. Goroutines yielded at natural pause points: function calls, channel operations, syscalls. For the vast majority of programs this worked fine, because function calls are ubiquitous.
The failure case was a tight compute loop:
```go
func hotLoop() {
	for {
		// heavy computation, no function calls, no yield points
	}
}
```
This goroutine would hold its P indefinitely. Every other goroutine scheduled on that P would starve. The scheduler had no mechanism to interrupt running code; it could only wait for a voluntary yield that was never coming.
sysmon and SIGURG
Go 1.14 introduced asynchronous preemption, driven by a background thread called sysmon (system monitor). It runs outside the normal scheduler loop on its own M with no P, and because it is not a goroutine it never shows up in goroutine counts. One of its jobs is to watch for goroutines that refuse to yield.
sysmon wakes at short intervals (at most ~10ms apart) and scans the Ps. If a goroutine has held its P for more than 10ms without yielding, sysmon sends a SIGURG signal to the M running it. SIGURG was chosen deliberately: it was historically used for out-of-band TCP data and almost no real application uses it, so intercepting it doesn’t interfere with user code.
On receiving SIGURG, the M:
- Pauses the goroutine at its current instruction, provided that instruction is a point where it can be stopped safely
- Saves the full register state
- Moves the goroutine to the tail of the GRQ
- Picks up the next G from its LRQ
A compute-heavy goroutine still runs for substantial stretches; up to 10ms uninterrupted is real throughput. But it no longer has the power to freeze an entire processor. The scheduler forcibly rebalances execution.
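A small experiment makes the effect visible. The sketch below assumes Go 1.14 or later and uses a hypothetical helper, runBesideHotLoop. It pins the program to a single P, starts a goroutine that spins on an atomic load (normally compiled to a plain load instruction, so the loop has no cooperative yield point), and measures how long a 20ms sleep on the calling goroutine really takes. On a runtime without async preemption the function would be expected to hang, because the sleeper could never get the P back.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// runBesideHotLoop restricts the scheduler to one P, runs a spinning
// goroutine next to a sleeping one, and returns how long the 20ms sleep
// took to come back.
func runBesideHotLoop() time.Duration {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev) // restore the previous setting

	var stop int32
	done := make(chan struct{})
	go func() {
		defer close(done)
		for atomic.LoadInt32(&stop) == 0 {
			// tight loop: no function calls, no channel operations
		}
	}()

	start := time.Now()
	time.Sleep(20 * time.Millisecond) // hands the only P to the spinner
	elapsed := time.Since(start)      // reached only if the spinner was preempted

	atomic.StoreInt32(&stop, 1) // let the spinner exit
	<-done
	return elapsed
}

func main() {
	fmt.Println("sleep returned after", runBesideHotLoop())
}
```

The measured time is usually a little over 20ms: the sleeper has to wait for the timer to expire and for the next forced preemption of the spinner. Running the same program with GODEBUG=asyncpreemptoff=1 is a way to see the pre-1.14 behaviour.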
How It All Fits Together
| Mechanism | Problem it solves |
|---|---|
| P-handoff (Part 1) | Blocking syscalls freeze the M |
| Work stealing | Uneven load leaves some Ps idle |
| Netpoller | Network I/O would block OS threads |
| Async preemption | CPU-bound goroutines starve others |
None of this requires anything from you as the programmer. The scheduler is watching load distribution, parking goroutines on I/O, and preempting runaway compute, constantly, transparently, on every goroutine you launch.
Seeing It in Action: go tool trace
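The best way to make this concrete is go tool trace. Add tracing to your program:
```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatalf("create trace file: %v", err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatalf("start trace: %v", err)
	}
	defer trace.Stop() // runs before f.Close, so the trace is flushed first

	// your code here
}
```
Then run:
```
go run main.go
go tool trace trace.out
```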
Once you open the trace viewer, you’ll start recognising patterns:
- Goroutines disappearing and reappearing → Netpoller park/wakeup
- Goroutines jumping between P lanes → work stealing
- Periodic short interruptions during CPU loops → async preemption
- Threads detaching from Ps and reattaching → P-handoff
At that point the scheduler stops being abstract. You can see every decision it made, and every decision will make sense.
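If you don't have a program handy, a small CPU-bound workload is enough to produce something worth looking at. The sketch below is one possible setup, not a recommended pattern: captureTrace and cpuWorkers are hypothetical helpers, and the worker count and loop size are arbitrary. It buffers the trace in memory and writes it to trace.out at the end.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/trace"
	"sync"
)

// captureTrace runs work with the execution tracer enabled and returns the
// raw trace bytes.
func captureTrace(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	work()
	trace.Stop() // waits until all trace data has been written to buf
	return buf.Bytes(), nil
}

// cpuWorkers starts n goroutines that each run a short compute loop,
// then waits for all of them to finish.
func cpuWorkers(n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sum := 0
			for j := 0; j < 50000000; j++ {
				sum += j
			}
			_ = sum
		}()
	}
	wg.Wait()
}

func main() {
	data, err := captureTrace(func() { cpuWorkers(8) })
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	if err := os.WriteFile("trace.out", data, 0o644); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Printf("wrote %d bytes to trace.out\n", len(data))
}
```

Open the result with go tool trace trace.out. With more workers than Ps, the lanes should show goroutines being spread across processors. Lengthening the loops makes forced preemptions easier to spot.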
Write straightforward Go. The runtime earns its keep.