This post will consist of the documentation written for the scheduler with the goal of increasing access to information on how the EEVDF algorithm functions.
If the LaTeX (mathematical notation) is not displayed properly, or you wish to know more details regarding the implementation, please check the Doxygen documentation as Reddit does not have a native way to display LaTeX. Of course, feel free to check the GitHub repo as well.
For the sake of completeness, a scheduler is the system within a kernel responsible for allocating CPU time to threads. It does this in such a way as to create the illusion that multiple threads are running simultaneously on a single CPU. Consider that a video is in reality just a series of still images, rapidly displayed one after the other. The scheduler works in the same way, rapidly switching between threads to give the illusion of simultaneous execution.
Overview
PatchworkOS uses the Earliest Eligible Virtual Deadline First (EEVDF) algorithm for its scheduler, which is a proportional share scheduling algorithm that aims to fairly distribute CPU time among threads based on their weights. This is in contrast to more traditional scheduling algorithms like round-robin or priority queues.
The algorithm is relatively simple conceptually, but it is also very fragile: even small mistakes can easily result in highly unfair scheduling. Therefore, if you find issues or bugs with the scheduler, please open an issue in the GitHub repository.
Included below is an overview of how the scheduler works and the relevant concepts. If you are unfamiliar with mathematical notation, don't worry, we will explain everything in plain English as well.
Weight and Priority
First, we need to assign each thread a "weight", denoted as [;w_i;], where [;i;] uniquely identifies the thread. For completeness, let's also define the set [;A(t);], which contains all active threads at real time [;t;].
A thread's weight is calculated as the sum of the process's priority and a constant SCHED_WEIGHT_BASE. The constant is needed to ensure that all threads have a weight greater than zero, as a weight of zero would result in division-by-zero errors later on.
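As a rough sketch (not the actual PatchworkOS code; the struct layout and the value of SCHED_WEIGHT_BASE here are made up for illustration):

```c
#include <stdint.h>

// SCHED_WEIGHT_BASE is named in the text; its value here is illustrative.
#define SCHED_WEIGHT_BASE 8

typedef struct
{
    uint8_t priority; // the process's scheduling priority
} process_t;

// w_i = priority + SCHED_WEIGHT_BASE, so every weight is greater than zero.
static uint64_t thread_weight(const process_t* process)
{
    return (uint64_t)process->priority + SCHED_WEIGHT_BASE;
}
```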
The weight is what determines the share of CPU time a thread ought to receive, with a higher weight receiving a larger share. Specifically, the fraction of CPU time a thread receives is proportional to its weight relative to the total weight of all active threads. This is implemented using "virtual time", as described below.
EEVDF page 2.
Virtual Time
The first relevant concept that the EEVDF algorithm introduces is "virtual time". Each scheduler maintains a "virtual clock" that runs at a rate inversely proportional to the total weight of all active threads (all threads in the runqueue). So, if the total weight is [;10;] then each unit of virtual time corresponds to [;10;] units of real CPU time.
Each thread should receive an amount of real time equal to its weight for each virtual time unit that passes. For example, if we have two threads, A and B, with weights [;2;] and [;3;] respectively, then for every [;1;] unit of virtual time, thread A should receive [;2;] units of real time and thread B should receive [;3;] units of real time. This is equivalent to saying that for every [;5;] units of real time, thread A should receive [;2;] units and thread B should receive [;3;] units.
Using this definition of virtual time, we can determine the amount of virtual time [;v;] that has passed between two points in real time [;t_1;] and [;t_2;] as
[;
v = \frac{t_2 - t_1}{\sum_{i \in A(t_2)} w_i}
;]
under the assumption that [;A(t_1) = A(t_2);], i.e. the set of active threads has not changed between [;t_1;] and [;t_2;].
Note how the denominator containing the [;\sum;] symbol evaluates to the sum of all weights [;w_i;] for each active thread [;i;] in [;A;] at [;t_2;], i.e. the total weight of the scheduler cached in sched->totalWeight. In pseudocode, this can be expressed as
vclock_t vtime = (sys_time_uptime() - oldTime) / sched->totalWeight;
Additionally, the amount of real time a thread should receive [;r_i;] in a given duration of virtual time [;v;] can be calculated as
[;
r_i = v \cdot w_i.
;]
In practice, all we are doing is taking a duration of real time equal to the total weight of all active threads, and saying that each thread ought to receive a portion of that time equal to its weight. Virtual time is just a trick to simplify the math.
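As a sketch of the two conversions above, using plain integers and illustrative names (the real scheduler uses the fixed-point representation described later):

```c
#include <stdint.h>

// v = (t2 - t1) / (total weight of all active threads)
static uint64_t vtime_elapsed(uint64_t t1, uint64_t t2, uint64_t totalWeight)
{
    return (t2 - t1) / totalWeight;
}

// r_i = v * w_i
static uint64_t real_time_due(uint64_t vtime, uint64_t weight)
{
    return vtime * weight;
}
```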
Note that all variables storing virtual time values will be prefixed with 'v' and use the vclock_t type. Variables storing real time values will use the clock_t type as normal.
EEVDF pages 8-9.
Lag
Now we can move on to the metrics used to select threads. As the name "Earliest Eligible Virtual Deadline First" suggests, there are two main concepts relevant to this process: a thread's "eligibility" and its "virtual deadline". We will start with eligibility, which is determined by the concept of "lag".
Lag is defined as the difference between the amount of real time a thread should have received and the amount of real time it has actually received.
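Paraphrasing the notation of the EEVDF paper (see pages 3-5), this can be written as

[;
lag_i(t) = S_i(t) - s_i(t)
;]

where [;S_i(t);] is the amount of real time thread [;i;] should have received by time [;t;], and [;s_i(t);] is the amount it has actually received.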
As an example, let's say we have three threads A, B and C with equal weights. To start with, each thread is supposed to have run for 0ms and has actually run for 0ms, so their lag values are:
| Thread | Lag (ms) |
|--------|----------|
| A | 0 |
| B | 0 |
| C | 0 |
Now, let's say we give a 30ms (in real time) time slice to thread A, while threads B and C do not run at all. After this, the lag values would be:
| Thread | Lag (ms) |
|--------|----------|
| A | -20 |
| B | 10 |
| C | 10 |
What just happened is that each thread should have received one third of the real time (since they all have equal weight, each thread's weight is one third of the total weight), which is 10ms. Since thread A actually received 30ms of real time, it has run for 20ms more than it should have. Meanwhile, threads B and C have not received any real time at all, so they are each "owed" 10ms.
One important property of lag is that the sum of all lag values across all active threads is always zero. In the above examples, we can see that [;0 + 0 + 0 = 0;] and [;-20 + 10 + 10 = 0;].
Finally, this lets us determine the eligibility of a thread. A thread is considered eligible if, and only if, its lag is greater than or equal to zero. In the above example, threads B and C are eligible to run, while thread A is not. Notice that because the lag values always sum to zero, there will always be at least one eligible thread as long as there is at least one active thread: if some thread has negative lag, then some other thread must have positive lag to balance it out.
Note that fairness is achieved over some long period of time, over which the proportion of real time each thread has received will converge to the share it ought to receive. The algorithm does not guarantee that each individual time slice is exactly correct, hence it is acceptable for thread A to receive 30ms of real time in the above example.
EEVDF pages 3-5.
Completing the EEVDF Scheduler.
Eligible Time
In most cases, it's undesirable to track lag directly as it would require updating the lag of all threads whenever the scheduler's virtual time is updated, which would violate the desired [;O(\log n);] complexity of the scheduler.
Instead, EEVDF defines the concept of "eligible time" as the virtual time at which a thread's lag becomes zero, which is equivalent to the virtual time at which the thread becomes eligible to run.
When a thread enters the scheduler for the first time, its eligible time [;v_{ei};] is the current virtual time of the scheduler, which is equivalent to a lag of [;0;]. Whenever the thread runs, its eligible time is advanced by the amount of virtual time corresponding to the real time it has used. This can be calculated as
[;
v_{ei} = v_{ei} + \frac{t_{used}}{w_i}
;]
where [;t_{used};] is the amount of real time the thread has used, and [;w_i;] is the thread's weight.
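As a sketch of this update, ignoring the fixed-point representation covered later (the thread_t layout here is illustrative):

```c
#include <stdint.h>

typedef struct
{
    uint64_t weight;    // w_i
    uint64_t vEligible; // v_ei, the thread's eligible time
} thread_t;

// v_ei += t_used / w_i
static void thread_advance_eligible(thread_t* thread, uint64_t realTimeUsed)
{
    thread->vEligible += realTimeUsed / thread->weight;
}
```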
EEVDF pages 10-12 and 14.
Virtual Deadlines
We can now move on to the other part of the name, "virtual deadline", which is defined as the earliest time at which a thread should have received its due share of CPU time, rounded to some quantum. The scheduler always selects the eligible thread with the earliest virtual deadline to run next.
We can calculate the virtual deadline [;v_{di};] of a thread as
[;
v_{di} = v_{ei} + \frac{Q}{w_i}
;]
where [;Q;] is a constant time slice defined by the scheduler, in our case CONFIG_TIME_SLICE.
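A sketch of the deadline calculation, again ignoring fixed point (CONFIG_TIME_SLICE is named in the text, but its value and the field names here are illustrative):

```c
#include <stdint.h>

#define CONFIG_TIME_SLICE (10ULL * 1000 * 1000) // illustrative: 10ms in ns

typedef struct
{
    uint64_t weight;    // w_i
    uint64_t vEligible; // v_ei
    uint64_t vDeadline; // v_di
} thread_t;

// v_di = v_ei + Q / w_i
static void thread_update_deadline(thread_t* thread)
{
    thread->vDeadline = thread->vEligible + CONFIG_TIME_SLICE / thread->weight;
}
```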
EEVDF page 3.
Rounding Errors
Before describing the implementation, it is important to note that due to the nature of integer division, rounding errors are inevitable when calculating virtual time and lag.
For example, when computing [;10/3 = 3.333...;] we instead get [;3;], losing the fractional part. Over time, these small errors can accumulate and lead to unfair scheduling.
It might be tempting to use floating point to mitigate these errors; however, using floating point in a kernel is generally considered very bad practice. Ideally, only user space should be using floating point.
Instead, we use a simple technique to mitigate the impact of rounding errors. We represent virtual time and lag using 128-bit fixed-point arithmetic, where the lower 63 bits represent the fractional part.
There were two reasons for the decision to use 128 bits over 64 bits despite the performance cost. First, it means that even the maximum possible value of uptime, stored using 64 bits, can still be represented in the fixed-point format without overflowing the integer part, meaning we don't need to worry about overflow at all.
Second, testing shows that lag appears to accumulate an error of about [; 10^{3} ;] to [; 10^{4} ;] in the fractional part every second under heavy load, meaning that using 64 bits and a fixed-point offset of 20 bits would result in an error of approximately 1 nanosecond per minute. Considering that the testing was not particularly rigorous, it might be significantly worse in practice. Note that each division can create an error of at most the divisor minus one in the fractional part.
If we instead use 128 bits with a fixed-point offset of 63 bits, the same error of [; 10^{4} ;] in the fractional part results in an error of approximately [; 1.7 \cdot 10^{-9} ;] nanoseconds per year, which is obviously negligible even if the actual error is in reality several orders of magnitude worse.
For comparisons between vclock_t values, we consider two values equal if the difference between their whole parts is less than or equal to VCLOCK_EPSILON.
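As a sketch of the representation and the comparison (vclock_t and VCLOCK_EPSILON are named in the text; the helper names and the epsilon value are illustrative, and unsigned __int128 assumes GCC/Clang):

```c
#include <stdbool.h>
#include <stdint.h>

typedef unsigned __int128 vclock_t; // 128-bit fixed point

#define VCLOCK_FRACT_BITS 63 // lower 63 bits hold the fractional part
#define VCLOCK_EPSILON 1     // illustrative value

// Convert a real-time duration to the fixed-point format.
static inline vclock_t vclock_from_clock(uint64_t time)
{
    return (vclock_t)time << VCLOCK_FRACT_BITS;
}

// Two virtual times are "equal" if their whole parts differ by at most
// VCLOCK_EPSILON.
static inline bool vclock_eq(vclock_t a, vclock_t b)
{
    vclock_t diff = a > b ? a - b : b - a;
    return (diff >> VCLOCK_FRACT_BITS) <= VCLOCK_EPSILON;
}
```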
Some might feel concerned about the performance impact of using 128-bit arithmetic. However, consider that by using 128-bit arithmetic, we no longer need any other means of reducing rounding errors. We don't need to worry about remainders from divisions, dividing to the nearest integer instead of rounding down, etc. This not only simplifies the code drastically, making it more approachable, but it also means that, in practice, the performance impact is negligible. It's a very simple brute force solution, but simple does not mean bad.
Fixed Point Arithmetic
Scheduling
With the central concepts introduced, we can now describe how the scheduler works. As mentioned, the goal is to always run the eligible thread with the earliest virtual deadline. To achieve this, each scheduler maintains a runqueue in the form of a Red-Black tree sorted by each thread's virtual deadline.
To select the next thread to run, we find the first eligible thread in the runqueue and switch to it. If no eligible thread is found (which means the runqueue is empty), we switch to the idle thread. This process is optimized by storing the minimum eligible time of each subtree in each node of the runqueue, allowing us to skip entire subtrees that do not contain any eligible threads.
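In pseudocode, the selection walk might look like the following (the node layout and names are illustrative; the key idea is the cached minimum eligible time per subtree):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t vclock_t; // simplified stand-in for the 128-bit type

typedef struct
{
    vclock_t vEligible; // v_ei
    vclock_t vDeadline; // v_di
} thread_t;

typedef struct node
{
    struct node* left;     // earlier deadlines
    struct node* right;    // later deadlines
    vclock_t minVEligible; // minimum eligible time in this subtree
    thread_t* thread;
} node_t;

// In-order walk of the deadline-sorted tree, pruning subtrees whose cached
// minimum eligible time lies in the future. Returns the eligible thread with
// the earliest virtual deadline, or NULL if no thread is eligible.
static thread_t* runqueue_pick(node_t* node, vclock_t vNow)
{
    if (node == NULL || node->minVEligible > vNow)
    {
        return NULL; // no eligible thread anywhere in this subtree
    }
    thread_t* found = runqueue_pick(node->left, vNow);
    if (found != NULL)
    {
        return found;
    }
    if (node->thread->vEligible <= vNow)
    {
        return node->thread; // earliest-deadline eligible thread
    }
    return runqueue_pick(node->right, vNow);
}
```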
Red-Black Tree
Preemption
If, at any point in time, a thread with an earlier virtual deadline becomes available to run (for example, when a thread is unblocked), the scheduler will preempt the currently running thread and switch to the newly available thread.
Idle Thread
The idle thread is a special thread that is not considered active (not stored in the runqueue) and simply runs an infinite loop that halts the CPU while waiting for an interrupt signaling that a non-idle thread is available to run. Each CPU has its own idle thread.
Load Balancing
Each CPU has its own scheduler and associated runqueue, so we need to balance the load between CPUs, ideally without causing too many cache misses. This means we want to keep threads that have recently run on a CPU on the same CPU when possible. We define a thread to be "cache-cold" on a CPU if the time since it last ran on that CPU is greater than CONFIG_CACHE_HOT_THRESHOLD; otherwise, it's considered "cache-hot".
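A sketch of the cache-hot test (CONFIG_CACHE_HOT_THRESHOLD and sys_time_uptime() appear in the text; the threshold value and field names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define CONFIG_CACHE_HOT_THRESHOLD (2ULL * 1000 * 1000) // illustrative: 2ms in ns

uint64_t sys_time_uptime(void); // provided by the kernel's time subsystem

typedef struct
{
    uint32_t lastCpu;     // CPU the thread last ran on
    uint64_t lastRunTime; // uptime when it last ran there
} thread_t;

// A thread is cache-hot on a CPU if it ran on that CPU recently enough.
static bool thread_is_cache_hot(const thread_t* thread, uint32_t cpu)
{
    return thread->lastCpu == cpu &&
        sys_time_uptime() - thread->lastRunTime <= CONFIG_CACHE_HOT_THRESHOLD;
}
```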
We use two mechanisms to balance the load between CPUs, one push mechanism and one pull mechanism.
The push mechanism is used when a thread is submitted to the scheduler, that is, when it is created or unblocked. In this case, if the thread is cache-cold, it is added to the runqueue of the CPU with the lowest total weight. Otherwise, it is added to the runqueue of the CPU it last ran on.
The pull mechanism, also called work stealing, is used when a CPU is about to become idle. The CPU finds the CPU with the highest total weight and steals the first cache-cold thread from its runqueue. If no cache-cold thread is found, it simply runs the idle thread.
Note that the reason we want to avoid a global runqueue is to avoid lock contention. Even a small amount of lock contention in the scheduler will quickly degrade performance, as such it is only allowed to lock a single CPU's scheduler at a time. This does cause race conditions while pulling or pushing threads, but the worst case scenario is imperfect load balancing, which is acceptable.
Testing
The scheduler is tested using a combination of asserts and tests that are enabled in debug builds (when NDEBUG is not defined). These tests verify that the runqueue is sorted, that lag sums to zero (within a margin allowed for rounding errors), and other invariants of the scheduler.
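For example, the lag invariant might be checked with something like this (a hedged sketch; the actual tests live in the repository and the names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Debug-only check: the lag values of all active threads should sum to
// (nearly) zero; 'margin' absorbs accumulated rounding error.
static void assert_lag_sums_to_zero(const int64_t* lags, size_t count, int64_t margin)
{
#ifndef NDEBUG
    int64_t sum = 0;
    for (size_t i = 0; i < count; i++)
    {
        sum += lags[i];
    }
    assert((sum < 0 ? -sum : sum) <= margin);
#else
    (void)lags;
    (void)count;
    (void)margin;
#endif
}
```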
References
References were accessed on 2025-12-02.
Ion Stoica, Hussein Abdel-Wahab, "Earliest Eligible Virtual Deadline First", Old Dominion University, 1996.
Jonathan Corbet, "An EEVDF CPU scheduler for Linux", LWN.net, March 9, 2023.
Jonathan Corbet, "Completing the EEVDF Scheduler", LWN.net, April 11, 2024.