Part 1 of 2: Go Concurrency internals, from the scheduler up.
If you’re coming from Java or C++, you’ve likely been taught that Threads are the fundamental unit of concurrency. But in Go, we have Goroutines.
To understand why Go can comfortably juggle millions of goroutines while traditional apps sweat at 5,000 threads, we have to look at the GMP Model, the secret “Kitchen” architecture of the Go runtime.
The Trinity: G, M, and P
The Go scheduler manages three distinct entities to keep your CPU cores humming:
- G (Goroutine): The “Order.” Starts with a tiny 2KB stack that grows on demand, unlike the fixed 1MB+ stacks that OS threads typically get.
- M (Machine / OS Thread): The “Chef.” A real OS thread. The muscle that actually runs the code.
- P (Processor): The “Stove.” A logical resource representing the right to execute Go code, not a physical core.
The Intuition: The Busy Kitchen
- Gs are the Order Slips. They don’t do anything until a chef picks them up.
- Ms are the Chefs. Expensive to hire, and they take up real space.
- Ps are the Stoves. No stove, no cooking. Regardless of how many chefs or orders you have.
Rule: A Chef (M) cannot cook an Order (G) unless they have a Stove (P) to work at.
This constraint is the entire scheduler in one sentence. You can have thousands of order slips and a full crew of chefs, but if there are only 8 stoves, at most 8 dishes are being cooked at any instant. The rest of the kitchen is organized waiting, not wasted capacity.
Why this is technically superior
1. The 2KB “Order Slip” (Stack Management)
In traditional languages, every thread (M) gets a fixed, massive stack (often 1MB to 8MB). This is like giving every chef a giant industrial oven even if they’re only making toast. It wastes memory fast.
Go is smarter. A Goroutine (G) starts with a tiny 2KB stack. If the “order” gets more complex (like a deep recursive function), the runtime automatically allocates a larger stack and copies the old data over. This “demand-based” sizing is why you can have millions of Gs without running out of RAM.
2. Cheaper Context Switching
When the OS switches between two Threads (Ms), it’s a heavy operation. The kernel has to save everything, jump into “manager mode,” and find a new thread. This costs roughly 1,000 nanoseconds.
Because the Go scheduler lives in your app (user-space), it switches between Goroutines without asking the OS for help:
| Switch type | Approximate cost |
|---|---|
| OS thread context switch | ~1,000 ns |
| Go goroutine context switch | ~200 ns |
From the OS’s perspective, the Chef never stopped working; they just picked up a new order slip.
These are approximate figures from commonly cited benchmarks, not official spec guarantees. Actual costs vary by hardware and workload, but goroutine switches are consistently several times cheaper than OS thread switches.
3. The Role of GOMAXPROCS
The variable GOMAXPROCS defines how many Stoves (Ps) are available. By default, Go gives you one P for every virtual core on your machine.
If you have 8 cores, you have 8 Ps. This means only 8 goroutines can be executing in parallel at any given moment, even if you have 100 Chefs (Ms) waiting in the wings.
You can inspect and override this at runtime:
```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Number of virtual cores available to the process
	fmt.Printf("CPU cores: %d\n", runtime.NumCPU())

	// Number of active Ps matches CPU cores by default.
	// Pass 0 to query without changing the value.
	fmt.Printf("Active Ps: %d\n", runtime.GOMAXPROCS(0))
}
```
Changing GOMAXPROCS at runtime is valid but rarely needed. Go’s default is correct for most workloads. The main use case is benchmarking concurrency behaviour at different parallelism levels.
Here’s how the three layers relate. Note the local run queues sitting under each P, and the work-steal path between them:
Each P maintains its own local run queue. When a P’s queue empties, it steals goroutines from another P’s queue. This is work stealing, covered in depth in Part 2.
The P-Handoff: Staying Responsive Under Blocking Calls
Here’s a subtlety worth understanding: what happens when an M makes a blocking syscall (like reading from disk)?
The Go runtime detects the block and detaches the P from the blocked M, then attaches that P to another M (either an idle one or a newly created one) so work continues. The original M keeps the syscall and the goroutine that triggered it. Once the syscall returns, the original M tries to reacquire a P. If one is free, the goroutine simply resumes. If not, the goroutine is placed back on a run queue and the M parks itself until it is needed again.
This is called a P-handoff, and it’s what keeps your program responsive even when the OS is being slow.
The important nuance: Go doesn’t “spawn a new chef to replace you.” It hands your stove to someone else while you’re stuck on a long task. When the syscall returns, you get back an available stove, or park and wait for one.
There are also cases where an M runs without any P at all: an M sitting in a blocking syscall after its P has been handed off, and runtime-internal threads such as the background monitor (sysmon).
Key Insight: Parallelism vs. Concurrency
This distinction is worth cementing before moving on:
| Concept | What it controls | Limited by |
|---|---|---|
| Parallelism | How many Gs execute simultaneously | GOMAXPROCS (number of Ps) |
| Concurrency | How many Gs are in-progress | Effectively unlimited |
You can have 100,000 concurrent goroutines on a machine with GOMAXPROCS=8. At any given nanosecond, at most 8 of them are running. The rest are waiting, either in a run queue or blocked on I/O, a channel, or a lock.
Summary
The GMP model is Go’s answer to a fundamental problem: OS-level thread management is too coarse-grained and too expensive for fine-grained concurrency.
By introducing P as an explicit scheduling resource, Go ensures:
- Expensive OS threads stay saturated with real work
- Goroutines start tiny at 2KB and grow only as needed
- Context switches stay in user space at ~200ns instead of the OS-level ~1,000ns
- Blocking syscalls don’t stall the scheduler: the P-handoff keeps other Gs moving
The mental model to carry forward: P is the gating resource, G is cheap, M is the workhorse.