Inside Windows NT SMP Scheduling
Copyright © 1997 Mark Russinovich
Last updated March 18, 1997

Flaw in NT SMP Support Could Hinder Scalability

Note: The information presented here is the result of my own study. I have no access to NT 4.0 source code.

Introduction

Windows NT was designed from the start to support symmetric multiprocessing (SMP) execution. As multiprocessors have become more commonplace, however, NT has gotten a reputation for not scaling well as the number of processors increases: as more CPUs are added, the incremental gain in performance rapidly declines. One way operating systems like NT try to make themselves SMP-friendly is to use affinity heuristics in their scheduling algorithms. Given a choice of threads to run on a CPU, they will try to run one that has recently been scheduled on that CPU, in the hope that part of the thread's data is still present in the CPU's on-chip cache.

This article describes NT's use of affinity in its SMP scheduling algorithms. In addition, I'll present the algorithm NT uses to pick which CPU a thread should run on. I'll show how the algorithm is flawed in one significant way, involving improper use of affinity, how the flaw can lead to unnecessary thread migration across processors, and how it could be a factor in preventing NT from scaling well on SMPs.

NT Scheduling Basics

The NT scheduler is a very simple one. A thread can have one of 32 priorities, 0-31, with higher numbers corresponding to higher priorities. Priorities from 0 to 15 are called dynamic priorities, because the system can temporarily change the priorities of threads in this range when certain events take place. The 16 to 31 range is called the "real-time" range; real-time threads execute at fixed, unchanging priorities. The goal of the NT scheduler on an n-processor system is to always be executing the n highest-priority threads that are ready for execution. Processors on an SMP system are numbered from 0 to n-1. The numbering is used only for identification and is not priority-related.

Scheduling Decisions

There are two situations in which NT's scheduler is invoked. The first is when a thread's turn on a CPU ends in one of the following ways:

* it blocks (waits) for some event to take place
* it terminates
* it yields its turn
* its turn (quantum) ends

The scheduler must then decide which of the threads that are ready to run it should schedule next on the CPU. Factors that can affect its choice include which threads have hard-affinity for the CPU and which threads have soft-affinity for it. Hard-affinity describes a thread's ability, as specified by applications, to run on certain CPUs and not on others; by default a thread can execute on any processor. Soft-affinity is the name given to the scheduler's attempt to keep a thread executing on the same processor, in the hope that some of the data stored in that processor's cache the previous time the thread ran will still be present.

The scheduler must also handle the transition of a thread from a waiting state to a ready-to-run state. For example, a thread might block on a timer so that it waits until an interval has expired. When the timer triggers, the thread is ready to continue. The system could simply wait until one of the executing threads ends its turn in one of the ways listed above. Instead, NT determines whether the thread that has become ready to execute should preempt (replace) a lower-priority executing thread.
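To make the algorithm walkthroughs that follow concrete, the code sketches in this article build on the minimal C declarations below. The names are my own inventions for illustration; NT's real KTHREAD and dispatcher structures are undocumented and certainly differ.

    /* Hypothetical per-thread scheduling state and dispatcher queues.
     * These are illustrative names, not NT's actual data structures. */
    typedef struct Thread {
        int            priority;       /* 0-31; 16-31 is the real-time range   */
        unsigned int   hard_affinity;  /* bit n set => thread may run on CPU n */
        int            last_cpu;       /* CPU the thread last executed on      */
        unsigned long  last_run_tick;  /* system tick when it last ran         */
        struct Thread *next;           /* link in its priority's ready list    */
    } Thread;

    Thread       *ready_list[32];   /* one FIFO list per priority level        */
    unsigned int  ready_summary;    /* bit p set => ready_list[p] is non-empty */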
Picking the Next Thread

NT maintains a list for each priority level. These lists are where threads are placed that are ready to execute but are not scheduled on a CPU. A 32-bit mask is also maintained so that NT can quickly determine which priority levels have entries in their lists: an "on" bit in the mask means that the list for the priority corresponding to the bit's position is not empty. The function in the scheduler that is called when a new thread must be picked to execute is KiFindReadyThread. The algorithm it implements proceeds as follows:

1. Determine the highest priority level with ready threads.
2. Consider the next thread in that level's list. If there are no more entries on the list, move to the next lower priority level's list and consider the first thread on it.
3. If the thread's hard affinity prevents it from running on this CPU, go to step 2.
4. Mark this thread as the primary candidate.
5. If the thread last executed on this CPU, if it has been more than 20ms since the thread last ran, or if the thread has a priority above 24 (high real-time), schedule this thread.
6. Otherwise, scan the list for another thread that last executed on this CPU or that hasn't run in more than 20ms. If one is found, schedule it instead; if not, schedule the primary candidate.

The algorithm is illustrated with the following example of a 2-processor SMP system with CPUs numbered 0 and 1. CPU 0's system timer interrupts and the quantum-tracking code determines that the currently executing thread's quantum has expired. The thread is executing at priority 8 and the dispatcher queues look like Figure 1, below.

[Figure 1. KiFindReadyThread finding the next thread to schedule]

The scheduler starts with the first thread on the priority 8 queue, which is thread 0. The thread's affinity mask, a bitmask of the CPUs the thread can run on, indicates by having bits 0 and 1 "on" that the thread can run on either processor. Since the thread's priority is not above 24, it has not been more than 20ms since the thread last ran, and the thread did not run on CPU 0 the last time it ran, the scheduler marks it as the primary candidate but moves on to examine the other threads in the list. When it looks at thread 1, it sees that one of the conditions for immediate selection holds: the thread last ran on the CPU on which the scheduler is making its decision, so soft-affinity dictates that it will be the chosen thread. The scheduler therefore discards its original choice and makes thread 1 the next thread to execute. This algorithm honors hard-affinity while attempting to schedule one of the highest-priority threads that last ran on the given CPU (soft-affinity) or that have not run in more than 20ms.
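Here is how the KiFindReadyThread logic described above might look in C, building on the hypothetical declarations sketched earlier. This is a reconstruction from observed behavior, not NT source; CACHE_STALE_TICKS stands in for the 20ms rule, and bookkeeping such as dequeuing the chosen thread and updating ready_summary is omitted.

    #define CACHE_STALE_TICKS 20   /* stand-in for the 20ms rule      */
    #define HIGH_RT_PRIORITY  24   /* "high real-time" threshold      */

    extern unsigned long tick_count;  /* current system time, assumed */

    Thread *find_ready_thread(int cpu)
    {
        Thread *candidate = 0;

        /* Step 1: walk priority levels from highest to lowest,
         * skipping levels whose ready list is empty. */
        for (int p = 31; p >= 0; p--) {
            if (!(ready_summary & (1u << p)))
                continue;

            /* Step 2: consider each thread on this level's list. */
            for (Thread *t = ready_list[p]; t != 0; t = t->next) {
                /* Step 3: honor hard affinity. */
                if (!(t->hard_affinity & (1u << cpu)))
                    continue;

                /* Steps 5-6: take a thread that last ran here, has a
                 * stale cache, or is high real-time. */
                if (t->last_cpu == cpu ||
                    tick_count - t->last_run_tick > CACHE_STALE_TICKS ||
                    t->priority > HIGH_RT_PRIORITY)
                    return t;

                /* Step 4: remember the first eligible thread. */
                if (candidate == 0)
                    candidate = t;
            }

            /* Step 6 fallback: no soft-affine or cache-stale thread at
             * this level, so schedule the primary candidate. */
            if (candidate != 0)
                return candidate;
        }
        return 0;   /* nothing ready that may run on this CPU */
    }

Note how the primary candidate is abandoned whenever a soft-affine or cache-stale thread appears later in the same list, exactly as in the Figure 1 example.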
Should this Thread Execute?

When a thread becomes ready to execute, NT must decide whether it should preempt one of the threads currently running on a CPU, because NT's goal is to run the highest-priority threads in the system. The subroutine that makes this decision is KiReadyThread. Its algorithm is listed below; in the description, "ready thread" refers to the thread that has become ready to execute.

1. If there are one or more idle processors, schedule the ready thread on the current CPU if it is idle, or else on the idle processor with the highest number.
2. Otherwise, if the thread executing on the processor that the ready thread last ran on has a lower priority than the ready thread, schedule the ready thread on that processor. This is the flaw that can cause unnecessary thread migrations.
3. If no processor was found in the above steps, place the thread at the front of the ready list associated with its priority level if it was preempted, and at the end if it was not. The thread will be scheduled by KiFindReadyThread when a subsequent thread's turn ends. This is not good for real-time applications, and it too can cause unnecessary thread migrations.

When a thread is scheduled to take over a processor, it is placed in standby mode on that processor. If that is not the processor on which the scheduler is executing, NT must inform the target processor that it should swap threads. It does so by sending an inter-processor interrupt (IPI). When a CPU receives an IPI, it checks whether a thread has been placed in standby mode and, if so, swaps it onto the CPU and runs the KiReadyThread algorithm on the preempted thread.

This algorithm is demonstrated in Figure 2. The system has 4 processors with threads of the specified priorities executing on each one. A thread becomes ready to execute because it was waiting on an event that one of the active threads has signaled.

[Figure 2. The scheduler finding a processor for thread 5]

The scheduler first looks at the CPU thread 5 last ran on, CPU 1, to see if the thread executing there has a lower priority. Thread 4 has a priority higher than thread 5's, so the scheduler places thread 5 on the priority 10 ready list. Soon after, the quantum of thread 1 expires on CPU 2. KiFindReadyThread is invoked to find the highest-priority ready thread to execute, and it finds thread 5. The scheduler replaces thread 1 with thread 5 and calls KiReadyThread on thread 1, which results in its being placed in the ready list for priority 9.
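A corresponding sketch of the KiReadyThread decision, again a reconstruction rather than NT's code. idle_summary, current_thread[], standby[], this_cpu(), send_ipi(), and ready_list_insert() are assumed helpers of my own naming, and an idle CPU is assumed to run an idle thread whose priority is below any real thread's.

    #define NCPUS 32                       /* illustrative maximum           */

    extern unsigned int idle_summary;      /* bit n set => CPU n is idle     */
    extern Thread *current_thread[NCPUS];  /* thread running on each CPU     */
    extern Thread *standby[NCPUS];         /* thread staged to take over CPU */
    extern int  this_cpu(void);
    extern void send_ipi(int cpu);
    extern void ready_list_insert(Thread *t, int at_front);

    static int highest_set_bit(unsigned int m)
    {
        int b = -1;
        while (m) { b++; m >>= 1; }
        return b;
    }

    void ready_thread(Thread *t, int was_preempted)
    {
        /* Step 1: prefer an idle processor the thread may run on; the
         * current CPU if it is idle, otherwise the idle CPU with the
         * highest number. */
        unsigned int idle = idle_summary & t->hard_affinity;
        if (idle != 0) {
            int cpu = (idle & (1u << this_cpu())) ? this_cpu()
                                                  : highest_set_bit(idle);
            standby[cpu] = t;
            if (cpu != this_cpu())
                send_ipi(cpu);          /* tell that CPU to swap threads */
            return;
        }

        /* Step 2 (the flaw): preempt on the CPU the thread last ran on
         * if a lower-priority thread runs there, no matter how long ago
         * "last ran" was, and ignoring still-lower-priority threads
         * that may be running on other CPUs. */
        int last = t->last_cpu;
        if ((t->hard_affinity & (1u << last)) &&
            current_thread[last]->priority < t->priority) {
            standby[last] = t;
            if (last != this_cpu())
                send_ipi(last);
            return;
        }

        /* Step 3: give up and queue the thread; preempted threads go to
         * the front of their priority's ready list, others to the back. */
        ready_list_insert(t, was_preempted);
    }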
The Flaw

The algorithm employed by KiReadyThread contains a flaw that can cause unnecessary overhead on an SMP system. The flaw is in Step 2, where the scheduler runs the ready thread on the last processor it executed on if the ready thread has a higher priority than the thread currently executing on that processor. It does so regardless of how long it has been since the ready thread last ran; it could have been hours. A second place where unnecessary overhead can arise is Step 3: there the scheduler gives up on preempting another thread and simply places the ready thread in the ready lists.

Why is Step 2 a flaw, and how can Step 3 cause extra thread migration? Let's look at a worst-case scenario, shown below in Figure 3.

[Figure 3. Worst-case thread migration on a 4-way SMP system]

This 4-processor system is running the threads shown. At some point thread 4 becomes ready to execute and KiReadyThread is invoked, which triggers the Step 2 flaw. The algorithm starts by examining the thread executing on the CPU that thread 4 last executed on, CPU 3. The thread there, thread 3, has a priority lower than thread 4's, so the scheduler places thread 4 in standby on CPU 3 and sends CPU 3 an IPI. CPU 3, upon receiving the IPI, preempts thread 3, schedules thread 4, and calls KiReadyThread with thread 3, because it is still ready to execute. Unable to place thread 3 on the last CPU it ran on, CPU 3, because thread 4 now has higher priority there, the scheduler places thread 3 in the ready lists. Until a scheduling event such as a thread exit, a quantum expiration, or a thread block occurs, threads with a lower priority than a thread sitting in the ready lists will be executing. This makes it impossible for NT to provide real-time guarantees to threads on a multiprocessor system. In addition, during this time the NT scheduling rule of running the n highest-priority threads on an n-way SMP is violated.

Step 3 also causes unnecessary thread migration, the first instance of which occurs when the thread running on CPU 2 has its turn end. Here threads are migrated because the scheduler ignores global priorities. The scheduler looks for a new thread for CPU 2; KiFindReadyThread finds thread 3 in the ready lists, so it places it on the processor and puts thread 2 in the ready lists. CPU 1's thread quantum expires next, and the scheduler preempts thread 1 in favor of thread 2. Finally, thread 0's turn on processor 0 ends and KiFindReadyThread replaces it with thread 1. This example shows how the algorithm causes 3 unnecessary context switches: 1 from the flaw in Step 2 of the KiReadyThread algorithm and 2 from Step 3. These 3 switches could have been avoided if KiReadyThread had found the lowest-priority executing thread (in this case thread 0) and, when possible, replaced it with the higher-priority ready thread. The situation is made worse when high-priority threads are bound to each processor by the algorithm's use of soft-affinity and periodically become ready to do work, causing a ping-pong effect on lower-priority threads. We'll see this effect in an example.

A context switch results in the following actions, each of which carries overhead:

* The old processor state must be saved and the new one restored. This may include floating-point/MMX state as well.
* The newly scheduled thread begins executing with a cold cache: none of the data it needs will be in the on-chip cache.
* If the newly scheduled thread belongs to a different process than the one being switched out, the system's page tables must be reloaded and the processor's virtual-memory Translation Lookaside Buffer (TLB) flushed.

Thus, it is in the interest of performance and scalability that the scheduler try to minimize context switches, and therefore thread migrations.

An Example

The flaw of Step 2 in the algorithm is graphically visible in the Performance Monitor screen shot displayed in Figure 4. The picture shows a two-processor system that is running the Cpuhog program but is otherwise idle. Cpuhog is a program with a single thread that executes at priority 15 and sits in an infinite busy loop. Processor utilization is shown along the left side of the graph, the utilization of each processor is assigned a separate color, and 100 seconds of time are pictured lengthwise.

[Figure 4. Cpuhog being forced to migrate from processor to processor]

Because Cpuhog consumes all available CPU cycles in its loop, the processor it executes on will be 100% utilized. Everywhere the colored lines criss-cross, Cpuhog has migrated from one processor to the other. In this particular example the migration is caused by system worker threads becoming ready to execute. System worker threads perform system activity and execute at priority 16. Most of the time the work items are short in duration, taking no more than a few milliseconds to complete. When a system thread becomes ready to execute, KiReadyThread examines the thread executing on the processor the system thread last ran on, and unless another system thread of equal or higher priority happens to be executing there, the system thread will take over that CPU. For example, say Cpuhog is executing on processor 0 and Explorer (priority 8) is executing on processor 1, and a system thread that last ran on processor 0 becomes ready. Cpuhog will be forced off the CPU and rescheduled on processor 1, which results in an unnecessary context switch. This sequence of events can be seen quite regularly in the picture, and there are likely many more switches not visible because of Performance Monitor's low sampling resolution.

Scalability

The Cpuhog example above demonstrates how the flaws in the NT scheduling algorithm can affect a partially loaded, relatively static machine. One can easily envision scenarios where a 4- or 8-processor server runs several enterprise-level server applications, each with a dozen or more threads spread across the priority spectrum: low-priority threads handle background work; mid-range threads update graphical monitors, process user input, and handle connections; and real-time threads process time-critical operations. Under medium to heavy loads, Steps 2 and 3 of KiReadyThread could cause a significant number of needless thread migrations per second, as high-priority threads constantly displace lower-priority threads that are then forced to migrate. The overhead of these switches would be a measurable drag on system performance and lead to server applications not scaling well as processors are added.

What's the Fix?

What's surprising about the Step 2 flaw of KiReadyThread and the unnecessary switches of Step 3 is not so much that they have been there for several years, but that they were designed into NT in the first place. A common-sense way to perform priority-preserving scheduling across multiple processors is to check the lowest-priority executing threads. The algorithm would start with the lowest-priority thread that is running on a CPU the ready thread has hard affinity for and see if the ready thread's priority is higher. If it is, the ready thread preempts the lower-priority thread; if not, the ready thread is placed in the ready list for its priority. This algorithm preserves priority globally across the processors and thus prevents wasted thread migrations. Sun's Solaris operating system employs this algorithm.
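A minimal sketch of that lowest-priority-preemption approach, reusing the earlier hypothetical declarations. It is meant only to show the shape of the fix, not Solaris's or NT's actual code.

    /* Replacement for ready_thread() that preserves priority globally
     * across processors. Idle CPUs are assumed to run an idle thread
     * whose priority is below any real thread's, so they win the
     * "lowest priority" search automatically. */
    void ready_thread_fixed(Thread *t, int was_preempted)
    {
        int victim = -1;

        /* Find the lowest-priority thread currently running on any CPU
         * the ready thread is hard-affine to. */
        for (int cpu = 0; cpu < NCPUS; cpu++) {
            if (!(t->hard_affinity & (1u << cpu)))
                continue;
            if (victim < 0 ||
                current_thread[cpu]->priority < current_thread[victim]->priority)
                victim = cpu;
        }

        /* Preempt only if the ready thread outranks that thread;
         * otherwise queue it for a later scheduling event. */
        if (victim >= 0 && current_thread[victim]->priority < t->priority) {
            standby[victim] = t;
            if (victim != this_cpu())
                send_ipi(victim);
        } else {
            ready_list_insert(t, was_preempted);
        }
    }

In the Figure 3 scenario this version would place thread 4 on CPU 0, displacing only thread 0 and avoiding the three cascading switches.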
An SMP scheduling algorithm could attempt to maintain soft-affinity, but doing so has complex trade-offs that are certainly not reflected in the NT algorithm. For one thing, NT's use of soft-affinity in KiReadyThread ignores elapsed time. Second, the 20ms rule it uses in KiFindReadyThread is a hard-coded value. Both are independent of the factors that determine reasonable time windows, such as processor speed and cache size.
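As a back-of-the-envelope illustration (my own arithmetic, not anything NT does), a machine-derived window could instead be estimated as roughly the time it takes to refill the off-chip cache from memory. With plausible mid-1990s numbers, which are illustrative rather than measured, it comes out well under 20ms:

    #include <stdio.h>

    int main(void)
    {
        double cache_bytes   = 512.0 * 1024;  /* e.g. a 512KB L2 cache        */
        double fill_rate_bps = 200e6;         /* ~200MB/s sustained fill rate */

        /* Past roughly this point the thread's cached working set has
         * been evicted anyway, so soft affinity buys nothing. */
        double window_ms = cache_bytes / fill_rate_bps * 1000.0;
        printf("cache-staleness window: about %.1f ms\n", window_ms); /* ~2.6 */
        return 0;
    }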