Go Concurrency [1/2]: The Anatomy of GMP

Part 1 of 2: Go Concurrency internals, from the scheduler up.

If you’re coming from Java or C++, you’ve likely been taught that Threads are the fundamental unit of concurrency. But in Go, we have Goroutines.

To understand why Go can comfortably juggle millions of goroutines while traditional thread-per-task apps start to sweat at around 5,000 threads, we have to look at the GMP model, the “kitchen” architecture of the Go runtime.

The Trinity: G, M, and P

The Go scheduler manages three distinct entities to keep your CPU cores humming:

G (Goroutine): The “Order.”
Starts with a tiny 2KB stack that grows on demand, unlike the 1MB+ stacks typically reserved for OS threads.

M (Machine / OS Thread): The “Chef.”
A real OS thread. The muscle that actually runs the code.

P (Processor): The “Stove.”
A logical resource representing the right to execute Go code, not a physical core.

The Intuition: The Busy Kitchen

Gs are the Order Slips. They don’t do anything until a chef picks them up.
Ms are the Chefs. Expensive to hire, and they take up real space.
Ps are the Stoves. No stove, no cooking. Regardless of how many chefs or orders you have.

Rule: A Chef (M) cannot cook an Order (G) unless they have a Stove (P) to work at.

This constraint is the entire scheduler in one sentence. You can have thousands of order slips and a full crew of chefs, but if there are only 8 stoves, at most 8 dishes are being cooked at any instant. The rest of the kitchen is organized waiting, not wasted capacity.

Why this is technically superior

1. The 2KB “Order Slip” (Stack Management)

In most traditional runtimes, every OS thread gets a large stack reservation up front (often 1MB to 8MB, depending on the platform). This is like giving every chef a giant industrial oven even if they’re only making toast. With thousands of threads, that memory adds up fast.

Go is smarter. A Goroutine (G) starts with a tiny 2KB stack. If the “order” gets more complex (like a deep recursive function), the runtime automatically allocates a larger stack and copies the old data over. This “demand-based” sizing is why you can have millions of Gs without running out of RAM.

2. Cheaper Context Switching

When the OS switches between two threads (Ms), it’s a heavy operation. The kernel has to save the outgoing thread’s state, switch into kernel mode, and pick the next thread to run. This costs roughly 1,000 nanoseconds.

Because the Go scheduler lives in your app (user-space), it switches between Goroutines without asking the OS for help:

Switch type	Approximate cost
OS thread context switch	~1,000 ns
Go goroutine context switch	~200 ns

From the OS’s perspective, the Chef never stopped working; they just picked up a new order slip.

These are approximate figures from well-known benchmarks, not official spec guarantees. Actual costs vary by hardware and workload, but the gap (several times cheaper for goroutines) is real and consistent.

3. The Role of GOMAXPROCS

The GOMAXPROCS setting (an environment variable, also adjustable through runtime.GOMAXPROCS) defines how many Stoves (Ps) are available. By default, Go gives you one P for every virtual core available to your process.

If you have 8 cores, you have 8 Ps. This means only 8 goroutines can be executing in parallel at any given moment, even if you have 100 Chefs (Ms) waiting in the wings.

You can inspect the current value at runtime:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Number of virtual cores available to the process
    fmt.Printf("CPU cores: %d\n", runtime.NumCPU())

    // Number of active Ps matches CPU cores by default
    // Pass 0 to query without changing the value
    fmt.Printf("Active Ps: %d\n", runtime.GOMAXPROCS(0))
}

Changing GOMAXPROCS at runtime is valid but rarely needed. Go’s default is correct for most workloads. The main use case is benchmarking concurrency behaviour at different parallelism levels.

Here’s how the three layers relate. Note the local run queues sitting under each P, and the work-steal path between them:

[Figure: Go GMP scheduler model]

Each P maintains its own local run queue. When a P’s queue empties, it can steal goroutines from another P’s queue. This is work stealing, covered in depth in Part 2.

The P-Handoff: Staying Responsive Under Blocking Calls

Here’s a subtlety worth understanding: what happens when an M makes a blocking syscall (like reading from disk)?

The Go runtime detects the block and detaches the P from the blocked M, then attaches that P to another M (either an idle one or a newly created one) so work continues. The original M keeps the syscall and the goroutine that triggered it. Once the syscall returns, the original M tries to pick up a free P and carry on. If none is available, the goroutine is placed back on a run queue and the M parks itself.

This is called a P-handoff, and it’s what keeps your program responsive even when the OS is being slow.

The important nuance: Go doesn’t “spawn a new chef to replace you.” It hands your stove to someone else while you’re stuck on a long task. When the syscall returns, you get back an available stove, or park and wait for one.

There’s also a case where Ms run without any P at all: certain runtime-internal threads, such as the background monitor thread (sysmon), never hold one.

Key Insight: Parallelism vs. Concurrency

This distinction is worth cementing before moving on:

Concept	What it controls	Limited by
Parallelism	How many Gs execute simultaneously	`GOMAXPROCS` (number of Ps)
Concurrency	How many Gs are in progress	Available memory (effectively unlimited)

You can have 100,000 concurrent goroutines on a machine with GOMAXPROCS=8. At any given nanosecond, at most 8 of them are running. The rest are waiting, either in a run queue or blocked on something like I/O or a channel.

Summary

The GMP model is Go’s answer to a fundamental problem: OS-level thread management is too coarse-grained and too expensive for fine-grained concurrency.

By introducing P as an explicit scheduling resource, Go ensures:

Expensive OS threads stay saturated with real work
Goroutines start tiny at 2KB and grow only as needed
Context switches stay in user space at ~200ns instead of the OS-level ~1,000ns
Blocking syscalls don’t stall the scheduler: the P-handoff keeps other Gs moving

The mental model to carry forward: P is the gating resource, G is cheap, M is the workhorse.
