Palmer Dabbelt—October 23, 2017
All Aboard, Part 7: Entering and Exiting the Linux Kernel on RISC-V
Continuing our journey into the RISC-V Linux kernel port, this week we'll discuss context switching. Context switching is one of the more important parts of an architecture port: it is all but impossible to completely abstract away the details of entering and exiting the kernel. Since this is on many critical paths (system calls and scheduling), it must go fast, but since it's the one line of protection the kernel has from userspace, it must also be secure.
Traps on RISC-V Systems
One of the more interesting things about the RISC-V ISA is that there is very little that happens when taking an interrupt or exception. In addition to making the ISA simple to implement, this has the advantage of allowing software to have a clean slate on top of which to implement context switching.
The RISC-V supervisor specification defines a single kernel trap entry
point, which can be set by writing the
stvec CSR. The only way to
transfer control to the kernel is via this entry point, and the only
side effects of taking a trap are to change the PC, the exception PC and
exception cause CSRs, and the privilege mode. The supervisor software
is expected to provide a transparent implementation of the userspace ABI.
Just like entering the kernel via a single trap entry point, the only
way to leave the kernel is by executing the
sret instruction. This
mirrors taking a trap: all that happens is the privilege mode is changed
and the PC is reset to the exception PC CSR's value. Again, the
supervisor software is expected to provide a transparent implementation
of the userspace ABI.
The one additional bit of support for supervisor mode trap handling that
exists in the RISC-V ISA is the
sscratch CSR. This CSR provides a
single XLEN-sized save region that has no implicit behavior and thus is
entirely used at the supervisor's discretion. All our software context
switching implementations use this register to store a pointer to a
memory region that contains whatever extra information is actually
required to make the context switch, essentially just pushing the entire
context switching implementation into software.
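To make the trap entry and exit rules above concrete, here is a minimal C model of the state involved. The struct and function names are made up for illustration and are not kernel code. One simplification: real hardware records the previous privilege mode in sstatus, whereas this sketch just passes it to the sret stand-in as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the supervisor-visible state touched by a trap.
 * Field names mirror the CSRs named in the text; the struct itself
 * is an illustration, not a kernel data structure. */
enum priv { PRIV_USER = 0, PRIV_SUPERVISOR = 1 };

struct hart {
    uint64_t pc;
    uint64_t stvec;    /* trap entry point, set by the supervisor */
    uint64_t sepc;     /* exception PC */
    uint64_t scause;   /* exception cause */
    uint64_t sscratch; /* no implicit behavior; meaning is up to software */
    enum priv priv;
};

/* Taking a trap: only pc, sepc, scause and the privilege mode change. */
void take_trap(struct hart *h, uint64_t cause)
{
    h->sepc = h->pc;
    h->scause = cause;
    h->priv = PRIV_SUPERVISOR;
    h->pc = h->stvec;
}

/* sret mirrors the trap: restore the privilege mode and jump to sepc.
 * Assumption: the caller supplies the mode to return to. */
void do_sret(struct hart *h, enum priv previous)
{
    assert(h->priv == PRIV_SUPERVISOR);
    h->priv = previous;
    h->pc = h->sepc;
}
```

Note that sscratch is never touched by either function, which is the whole point: everything else about the context switch is left to software.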
The RISC-V ISA defines that all traps are handled by machine mode by default, with the option to delegate some traps so that they are handled directly in a less privileged mode. The supervisor mode software is oblivious to the mechanism used to enter its trap handler: it simply assumes that the relevant traps eventually make it to supervisor mode.
Machine mode software is expected to filter traps either in hardware or in software. For example, if the machine mode implementation has emulation routines for some unsupported instructions, it would have to handle the illegal instruction trap in software but could delegate the remaining traps via the hardware mechanism.
The trap delegation mechanism allows high performance traps by delegating them directly to supervisor mode in hardware while still allowing the flexibility of handling any trap in machine mode. This allows for a simple implementation of virtual machines while still allowing for high performance on implementations that don't utilize virtualisation.
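As a rough sketch of the hardware side of delegation, the function below decides which mode ends up handling a synchronous exception. It assumes the medeleg CSR layout from the privileged specification (one bit per exception cause); the function and enum names are invented for this example, and interrupts (which use a separate delegation register) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum priv_mode { MODE_USER = 0, MODE_SUPERVISOR = 1, MODE_MACHINE = 3 };

#define CAUSE_ILLEGAL_INSTRUCTION 2
#define CAUSE_USER_ECALL 8

/* Decide which mode handles a synchronous exception.
 * A set bit in medeleg delegates that cause to supervisor mode, but
 * traps taken while already in machine mode always stay there. */
enum priv_mode exception_target(enum priv_mode from, unsigned cause,
                                uint64_t medeleg)
{
    assert(cause < 64);
    bool delegated = (medeleg >> cause) & 1;
    if (from != MODE_MACHINE && delegated)
        return MODE_SUPERVISOR;
    return MODE_MACHINE;
}
```

With only the environment call bit set in medeleg, system calls go straight to the supervisor while illegal instructions still land in machine mode, which matches the emulation example above.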
handle_exception, the Trap Entry Point
Since most of the interesting aspects of context switching on RISC-V systems are handled by the supervisor mode software, the hardware enters the kernel at one single trap entry point and then the supervisor-mode software determines how to handle the trap. There are two categories of traps defined by the RISC-V ISA:
- Interrupts, which are asynchronous. RISC-V defines a software interrupt, a timer interrupt, and an external interrupt.
- Exceptions, which are synchronous. RISC-V defines exceptions to handle instruction, load, store, and AMO access faults; environment calls (used for system calls on Linux); illegal instructions; and breakpoints.
The trap type is determined by the
scause CSR upon entry to the trap
handler. After saving the integer registers to the kernel stack, which
can be looked up via
sscratch, we examine the trap cause and determine
how to handle the trap. RISC-V delineates interrupts by setting the
high bit in
scause, which makes it easy to filter those out and handle
them. As most exceptions result from userspace emitting an
instruction to begin a system call, we then check for that condition and
handle the system call using Linux's generic system call handling
infrastructure. The remainder of the exceptions are handled via a jump
table, each having a fairly straightforward implementation that passes
control back to the kernel's relevant generic exception handling routine.
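The decision order described above can be sketched in C as follows. This is not the kernel's actual entry code (which is written in assembly); the enum, the table of names standing in for the jump table, and classify_trap are all assumptions made for illustration. The cause numbers come from the privileged specification, and the sketch assumes a 64-bit system so the interrupt flag is bit 63.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCAUSE_INTERRUPT (UINT64_C(1) << 63) /* high bit of scause on RV64 */
#define EXC_USER_ECALL 8                     /* environment call from user mode */

enum trap_path {
    PATH_INTERRUPT,
    PATH_SYSCALL,
    PATH_EXCEPTION_TABLE,
    PATH_UNKNOWN
};

/* Stand-in for the jump table: one entry per remaining exception cause. */
static const char *const exception_names[] = {
    [0] = "instruction address misaligned",
    [1] = "instruction access fault",
    [2] = "illegal instruction",
    [3] = "breakpoint",
    [4] = "load address misaligned",
    [5] = "load access fault",
    [6] = "store/AMO address misaligned",
    [7] = "store/AMO access fault",
};

/* Classify a trap in the order the text describes: interrupts first,
 * then system calls, then everything else via the table. */
enum trap_path classify_trap(uint64_t scause, const char **name)
{
    assert(name != NULL);
    *name = NULL;
    if (scause & SCAUSE_INTERRUPT)
        return PATH_INTERRUPT;
    if (scause == EXC_USER_ECALL)
        return PATH_SYSCALL;
    if (scause < sizeof(exception_names) / sizeof(exception_names[0])) {
        *name = exception_names[scause];
        return PATH_EXCEPTION_TABLE;
    }
    return PATH_UNKNOWN;
}
```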
All the exception-type traps are very simple to handle on RISC-V because we essentially just pass control directly back to the relevant generic Linux routine. Interrupts are, however, a different story: as far as I can tell, there isn't any generic infrastructure to handle the first level of interrupt muxing we have on RISC-V so at least for the time being we have our own mechanism. What we have now is a bit messy, so while I'm going to describe what's going on here it's all in flux and may have changed by the time you read this blog entry.
The best way to describe why this is messy is to walk through the timer interrupt as an example:
- The SEE implementation determines a timer interrupt has occurred and enters the supervisor's trap handler, which in Linux is handle_exception, with scause set accordingly.
- Linux determines this is an interrupt by looking at a bit in scause and then calls do_IRQ to handle the interrupt.
- do_IRQ calls in to the RISC-V interrupt controller driver's interrupt handling function, riscv_intc_irq. This is the first bit of messiness: our core arch port is tied to one of our drivers via a RISC-V specific API, which is generally a bad idea.
- riscv_intc_irq determines this is a timer interrupt and then calls back into the core RISC-V arch port to handle the timer interrupt via riscv_timer_interrupt. This is another RISC-V specific API, but this one is a bit less scary because the dependency is from the driver to the arch port, which will be hard to avoid.
- riscv_timer_interrupt looks into the RISC-V timer driver's percpu data structure to find the relevant struct clock_event_device, which it then calls to handle the timer interrupt. This is yet another dependency from our arch port into a driver, which is also a bad idea.
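The back-and-forth between the arch port and the drivers is easier to see laid out as code. The sketch below only mimics the shape of the call chain: every name prefixed with sketch_ is hypothetical, the struct is a stand-in for struct clock_event_device, and a single global replaces the percpu data. The timer cause number (5) is the supervisor timer interrupt from the privileged specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IRQ_CAUSE_MASK (~(UINT64_C(1) << 63)) /* strip the interrupt flag */
#define IRQ_SUPERVISOR_TIMER 5

/* Stand-in for the timer driver's per-CPU struct clock_event_device. */
struct clock_event_sketch {
    void (*event_handler)(struct clock_event_sketch *dev);
    unsigned ticks;
};

void count_tick(struct clock_event_sketch *dev) { dev->ticks++; }

/* Timer driver data (one CPU only in this sketch). */
struct clock_event_sketch this_cpu_timer = {
    .event_handler = count_tick,
    .ticks = 0,
};

/* Arch port: reaches into the timer driver's data to deliver the event. */
void sketch_timer_interrupt(void)
{
    assert(this_cpu_timer.event_handler != NULL);
    this_cpu_timer.event_handler(&this_cpu_timer);
}

/* Interrupt controller driver: decodes the cause, calls back into the arch port. */
void sketch_intc_irq(uint64_t cause)
{
    if (cause == IRQ_SUPERVISOR_TIMER)
        sketch_timer_interrupt();
}

/* Arch port: first-level entry point, calls into the driver. */
void sketch_do_irq(uint64_t scause)
{
    sketch_intc_irq(scause & IRQ_CAUSE_MASK);
}
```

Reading top to bottom, control crosses the arch/driver boundary three times for one timer tick, which is the tangle the walkthrough is complaining about.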
Intertwining our arch port with our drivers is generally a bad idea. While we used to have our drivers much more tightly ingrained with our arch port and are in the process of cleaning this up, there's still a bit more work to do in order to get this all sane.
In stark contrast to how our first-level interrupt handling flow works, the PLIC driver has a fairly clean interrupt handling flow. Registering the PLIC driver is handled entirely via standard device tree mechanisms, and handling interrupts doesn't touch any core RISC-V code.
plic_init, the PLIC driver initialization function, hooks into Linux's
irqchip device tree infrastructure so it can be called when a PLIC
exists in the device tree. The PLIC interrupt mappings are also
specified by the device tree, so in order to register the various
interrupt handlers all the PLIC driver needs to do is hook into Linux's
generic IRQ registration subsystem to link a PLIC interrupt to whatever
device driver it should trigger.
The actual PLIC interrupt handling flow is fairly simple: since the PLIC is designed with modern multi-core interrupt handling flows in mind, there aren't a whole lot of hoops we need to jump through in order to handle an interrupt. The general PLIC interrupt handling flow is as follows:
- A given (hart context, interrupt id) pair is enabled. In the current Linux driver we simply support globally enabling interrupts for the time being, but the PLIC is designed to allow per-hart interrupt routing at some point in the future.
- While the PLIC hardware allows for priorities and thresholds, we currently don't support these in the Linux driver. Thus we simply configure the threshold to allow all interrupts in and enable or disable them upon request.
- Once interrupts are enabled and the device eventually triggers an interrupt, the PLIC will pick one hart context (where the triggered interrupt is enabled and over the context's threshold) to interrupt by raising the local interrupt controller's external interrupt line, thus causing control to enter the kernel's trap entry point and eventually filter to the interrupt handling routine that the PLIC driver registered.
- As is common in many interrupt handling flows, the external interrupt
raised by the PLIC simply means "there may be an arbitrary amount of
work to do, you should go check now". When handling PLIC interrupts,
this triggers an interrupt handling loop that has three phases.
  - First the driver polls this hart's interrupt claim register via the plic_claim function. This memory mapped read serves as a synchronization point: it both informs software of a pending interrupt that should be handled (with the sentinel value of 0 reserved to terminate the handling loop) and allows the PLIC hardware to ensure that an interrupt is processed by only a single hart context.
  - The PLIC driver then maps the PLIC's IRQ number to the corresponding Linux interrupt handler (via Linux's generic IRQ subsystem) and then allows Linux to handle the interrupt.
  - To finish an interrupt, the PLIC driver then informs the PLIC hardware that it's done handling the interrupt via the plic_complete function, which writes the interrupt ID back to the claim register. This re-enables the interrupt in the PLIC, allowing it to be handled again.
- The PLIC driver continues handling interrupts as long as there are
pending interrupts returned by
plic_claim, which allows multiple interrupts to be handled without introducing additional context switches.
- Once the PLIC driver determines that there are no more pending interrupts, it informs the generic IRQ handling subsystem that it's done handling interrupts for now and then returns.
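The claim/handle/complete loop described in the list above can be sketched against a toy PLIC model. Nothing here is the real driver: the struct, the sketch_ functions, and the bitmap representation are assumptions for illustration, and the model picks the lowest pending source ID rather than implementing priorities. Only the loop structure and the 0 sentinel are taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_IRQ 8

/* Toy PLIC for a single hart context: pending and in-service bitmaps. */
struct plic_sketch {
    uint32_t pending;                 /* bit n set: source n is waiting */
    uint32_t in_service;              /* bit n set: claimed, not yet completed */
    unsigned handled[SKETCH_MAX_IRQ]; /* per-source handler invocation count */
};

/* Claim: returns a pending source ID, or 0 when nothing is pending.
 * Marking the source in-service is what keeps a second hart from
 * claiming the same interrupt. */
uint32_t sketch_claim(struct plic_sketch *p)
{
    for (uint32_t id = 1; id < SKETCH_MAX_IRQ; id++) {
        uint32_t bit = UINT32_C(1) << id;
        if ((p->pending & bit) && !(p->in_service & bit)) {
            p->pending &= ~bit;
            p->in_service |= bit;
            return id;
        }
    }
    return 0;
}

/* Complete: writing the ID back lets the source interrupt again. */
void sketch_complete(struct plic_sketch *p, uint32_t id)
{
    assert(id > 0 && id < SKETCH_MAX_IRQ);
    p->in_service &= ~(UINT32_C(1) << id);
}

/* The three-phase loop: claim, dispatch, complete, until claim returns 0.
 * Returns how many interrupts were handled in this one trap. */
unsigned sketch_handle_external(struct plic_sketch *p)
{
    unsigned count = 0;
    uint32_t id;
    while ((id = sketch_claim(p)) != 0) {
        p->handled[id]++; /* stand-in for Linux's generic IRQ dispatch */
        sketch_complete(p, id);
        count++;
    }
    return count;
}
```

If two sources are pending when the external interrupt fires, a single call to sketch_handle_external drains both, which is the "no additional context switches" property mentioned above.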
The PLIC driver as it currently stands is in pretty decent shape, but there are a handful of small cleanups that could be performed:
- The PLIC assumes that the interrupt source IDs can be registered globally in Linux, which may conflict with other interrupt controllers in the system. Right now we only have the local interrupt controller, which doesn't register with the generic IRQ handling subsystem, so this may change.
- The PLIC hardware supports AMOs and is designed to have an efficient software implementation that uses AMOs to enable and disable interrupts. Since we currently can't query PMAs on RISC-V systems to ensure that Linux can actually use AMOs to talk to the PLIC, we don't take advantage of this in the PLIC driver right now.
- The PLIC uses the simple IRQ handling flow, but it may map better to the FastEOI flow. This would allow us to clean up the PLIC driver's implementation by hoisting the interrupt polling loop out of RISC-V specific code and using the generic version instead.
Hopefully we'll have some of these issues resolved by the time the PLIC driver is upstream, but as many of the issues are fairly minor we might take a bit to clean everything up.
Saving and Restoring Extension State
RISC-V was designed to be an extensible ISA, and as a result it has multiple extensions. Linux is largely oblivious to these extensions: it mandates the A extension and will probably run on systems with the M and C extensions, but as those extensions don't have any extra state these are easy things to handle.
The only current extensions defined by the RISC-V standard that add to the user-visible state are the F and D extensions, which add floating point registers. The RISC-V floating point extensions follow the standard pattern: Linux is able to mark the register state as "trap on access" so it can lazily save and restore the F register state. This has one major advantage: since the F register state will never be dirty on systems without F registers, we don't need any explicit code in the Linux kernel to distinguish between systems with or without the F extension.
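Here is a small sketch of why the "never dirty" property removes the need for a special case. It uses the four floating point status values from the privileged specification's FS field (Off, Initial, Clean, Dirty); the struct and function are hypothetical and only model the save half of a context switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Encodings of the FS status field in the privileged specification. */
enum fs_state { FS_OFF = 0, FS_INITIAL = 1, FS_CLEAN = 2, FS_DIRTY = 3 };

/* Hypothetical per-task floating point context. */
struct fp_task {
    enum fs_state fs;
    double saved[32];
};

/* Called when switching away from a task; regs is the live FP register
 * file. Saves only when the state is dirty, then marks it clean.
 * Returns true if a save actually happened. */
bool fp_save_if_dirty(struct fp_task *t, const double regs[32])
{
    assert(t != NULL && regs != NULL);
    if (t->fs != FS_DIRTY)
        return false;
    memcpy(t->saved, regs, sizeof(t->saved));
    t->fs = FS_CLEAN;
    return true;
}
```

On a system without F registers the state can never reach FS_DIRTY, so the same function simply never saves anything and no extra check for the extension is needed.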
Returning to Userspace
Returning to userspace on RISC-V systems is fairly straightforward:
since the hardware does very little on a context switch, we just reverse
everything that happened when entering the kernel and issue an sret
instruction to get back to userspace.
There are a handful of Linux-specific things we need to do before returning to userspace:
- If we're returning from a system call, we need to check to see if syscall tracing is enabled. This allows the ptrace interface to work, which is the kernel interface used by programs like GDB and strace.
- Before returning to userspace in any manner, we check to see if there's other work that should be done before entering userspace, invoking either some signal handlers or the scheduler as necessary.
- tp is swapped with sscratch, so the kernel can find its internal data structures again when it is re-entered.
- Restore userspace's copies of the integer register state and jump back to userspace.
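The pending-work check and the register swap from the list above might look roughly like this in C. The flag names, the struct, and the counters standing in for the scheduler and signal delivery are all invented for this sketch; syscall tracing and the final register restore are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pending-work flags; the real kernel keeps similar bits
 * in its per-thread flags under different names. */
#define SKETCH_WORK_RESCHED 0x1u
#define SKETCH_WORK_SIGNAL  0x2u

struct exit_state {
    unsigned work;      /* pending-work flags */
    unsigned schedules; /* times the scheduler stand-in ran */
    unsigned signals;   /* times the signal delivery stand-in ran */
    uint64_t tp;        /* thread pointer register */
    uint64_t sscratch;  /* scratch CSR */
};

void sketch_return_to_user(struct exit_state *s)
{
    assert(s != NULL);

    /* Loop until nothing is pending, since handling one item can
     * in principle queue another. */
    while (s->work & (SKETCH_WORK_RESCHED | SKETCH_WORK_SIGNAL)) {
        if (s->work & SKETCH_WORK_RESCHED) {
            s->work &= ~SKETCH_WORK_RESCHED;
            s->schedules++;
        } else {
            s->work &= ~SKETCH_WORK_SIGNAL;
            s->signals++;
        }
    }

    /* Swap tp and sscratch: userspace gets its own tp back, and the
     * kernel's pointer is parked where the next trap entry can find it. */
    uint64_t tmp = s->tp;
    s->tp = s->sscratch;
    s->sscratch = tmp;
}
```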
Since signal handling is one of the more complicated aspects of Linux,
we try to avoid deviating too much from the standard asm-generic
mechanisms. The RISC-V signal handling infrastructure is primarily
based around passing a
struct sigcontext, which contains
the user's architectural state at the time of the exception (as saved by
the kernel), to the signal handler function registered by userspace.
Userspace's signal handler function has the same ABI as a regular
function. This makes it easy to enter the signal handler: we simply
set sepc to contain the address of the signal handler and then
return to userspace via the normal mechanisms. Signal handlers are
expected to call
sigreturn when they are done, so in order to maintain
the regular function ABI we set the return link to a VDSO-based
trampoline function that does so.
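Because the handler is called like any other function, setting up the registers for signal delivery is short. The sketch below shows only that register setup, under the standard RISC-V calling convention (first argument in a0, return address in ra). The struct and function names are hypothetical, and writing the saved context out to the user stack is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the saved user register state. */
struct user_regs_sketch {
    uint64_t sepc; /* where sret will resume execution */
    uint64_t ra;   /* return link seen by the handler */
    uint64_t a0;   /* first function argument */
};

/* Arrange for the next return to userspace to enter the signal handler.
 * When the handler returns, it lands in the trampoline, which is
 * expected to call sigreturn. */
void sketch_setup_signal(struct user_regs_sketch *regs, int signo,
                         uint64_t handler, uint64_t trampoline)
{
    assert(regs != NULL);
    regs->sepc = handler;
    regs->ra = trampoline;
    regs->a0 = (uint64_t)signo;
}
```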
Since struct sigcontext contains a copy of all the user-visible
architecture-defined state at the time the signal to be handled was
taken, it needs to be extensible in order to allow for future RISC-V
ISA extensions to cleanly integrate with applications that need to
interrogate these contexts. In order to enable these applications we
have defined an extensible format for
struct sigcontext that allows
these future extensions to be made visible to userspace. We currently
only support the F and D extensions, but we've designed this with the
eventual V extension in mind as well.
Stay tuned for next week, where we'll talk about how the RISC-V kernel port handles memory management.