SiFive - September 18, 2017

All Aboard, Part 5: Per-march and per-mabi Library Paths on RISC-V Systems

A previous blog post described how the -march and -mabi command-line arguments to GCC can be used to control code generation for the sources you compile as a user, but most programs require linking against system libraries in order to function correctly. Since users generally don't want to compile every library along with their program, either because the libraries are too complicated or because they're meant to be shared, a mechanism is needed for linking against the correct set of system libraries to match the ISA of the user's target system and the ABI of the user's generated code.

The mechanism for handling multiple sets of system libraries is known as "multilib". Like most parts of the RISC-V toolchain, the multilib mechanism is shared between all architecture ports, but the details of how it applies to RISC-V are specific to our ISA. As RISC-V is a modular ISA, it was natural to have extensive multilib support from the start. This allows our multilib implementation to be significantly cleaner than that of many other architectures, which is good because the plethora of ISAs and ABIs we have necessitates good multilib support.

The GCC Compiler Wrapper

As discussed in an earlier blog post, the gcc command that users directly interact with is actually just a wrapper that calls each step in the toolchain in order to preprocess, compile, assemble, and link your program. gcc isn't actually a script, but instead a small C program that orchestrates the compilation. The architecture-specific hooks for this program consist of a domain-specific language that is specific to GCC's command-line argument handling and describes how the arguments to the gcc wrapper should be transformed as they are passed to the various other tools that are called.

In order to ensure things are sufficiently complicated, there are three different languages used to describe how paths are mangled between the user's invocation of gcc and the invocation of cc1 or collect2 that actually does the work. All of these are specific to GCC command-line argument parsing. The gcc command-line wrapper uses these tools in various combinations to specify the following multilib-related arguments:

  • The assembler needs to know the ELF class to generate, either ELF32 or ELF64 depending on the target processor's architecture.
  • The linker needs to know the link-time paths that should be searched for libraries.
  • The assembler needs to know the ABI, so it can fill out the relevant ELF flags. This lets the linker disallow linking objects of different ABIs, which would be incompatible.
  • The linker needs to know the path to the dynamic linker, so it can fill out the ELF interpreter field. The dynamic linker has paths built into it to know where to search for libraries.
  • The linker needs to know the C runtime files that should be linked into executables, as well as any additional libraries that should be linked in by default such as libatomic or libgloss.

All these tools are somewhat coupled together, so we'll go over each below and describe which of the above arguments each tool helps specify.

*_SPEC Domain-Specific Language

The lowest-level, and therefore most general, of the three languages used in describing GCC command-line argument handling is the language used in the various *_SPEC macros that targets can define. These macros describe the transformations used to convert the command-line arguments for every tool GCC calls, so while they're not specific to multilib path handling, they're used to produce the full set of arguments to the linker, so I felt they were at least worth mentioning. One macro is defined for each of the target programs: for example ASM_SPEC defines how to transform command-line arguments for the assembler, LINK_SPEC for the linker, etc.

The *_SPEC macros control a string-to-string transformation that converts the command-line arguments of the gcc command to those passed to another command. While I recall at some point having seen some documentation on what can go in these macros, the best I can find right now lives in the Controlling the Compilation Driver section of the GCC documentation. Since that doesn't really specify how any of that works, I'll try to describe the bits we actually use here -- most of our port came from reading the code in other ports and from trial and error.

As an example of how one of these *_SPEC lines behaves, let's look at RISC-V's STARTFILE_PREFIX_SPEC macro, which determines where the linker should look for C runtime startup files like crt0.o:

#define XLEN_SPEC \
  "%{march=rv32*:32}" \
  "%{march=rv64*:64}" \

#define ABI_SPEC \
  "%{mabi=ilp32:ilp32}" \
  "%{mabi=ilp32f:ilp32f}" \
  "%{mabi=ilp32d:ilp32d}" \
  "%{mabi=lp64:lp64}" \
  "%{mabi=lp64f:lp64f}" \
  "%{mabi=lp64d:lp64d}" \

#define STARTFILE_PREFIX_SPEC                   \
   "/lib" XLEN_SPEC "/" ABI_SPEC "/ "           \
   "/usr/lib" XLEN_SPEC "/" ABI_SPEC "/ "       \
   "/lib/ "                                     \
   "/usr/lib/ "

This is a pretty standard *_SPEC definition for RISC-V. These definitions consume the entire set of gcc command-line arguments as a space-separated list, filter that through some pattern matching, perform a substitution and then pass the result as the space-separated argument list to some other command. We only use a handful of patterns:

  • "STRING": pass "STRING" directly into the output. Anything not wrapped in %{ and } is passed directly to the output.
  • %{argument}: if -argument is in the input as a whole word, then pass -argument to the output.
  • %{argument:substitution}: if -argument is in the input as a whole word, then pass the substitution into the output. These recurse, so something like %{arg1:%{arg2:-arg3}} passes -arg3 if both -arg1 and -arg2 are present.
  • %{glob:substitution}: if an argument matches -glob, pass substitution to the output. Like above, substitution can be recursive. The best reference I could find for the glob syntax is that it looks like very simple shell globbing. For example, %{march=rv32*:32} will pass 32 if passed any of -march=rv32i, -march=rv32imafdc, or -march=rv32INVALID_ISA_STRING (though of course GCC will catch the last one as part of command-line argument parsing).
  • %{!glob:substitution}: like the above, but passes substitution if -glob isn't present.

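That's about the extent of what we put in the *_SPEC macros used by the RISC-V port: not all that interesting, just a bit of text pattern matching. For example, the STARTFILE_PREFIX_SPEC macro above produces the following search paths for a few argument combinations: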

  • -march=rv32imafdc -mabi=ilp32d: /lib32/ilp32d/ /usr/lib32/ilp32d/ /lib /usr/lib
  • -march=rv32imafdc -mabi=ilp32: /lib32/ilp32/ /usr/lib32/ilp32/ /lib /usr/lib
  • -march=rv32i -mabi=ilp32: /lib32/ilp32/ /usr/lib32/ilp32/ /lib /usr/lib
  • -march=rv64i -mabi=ilp32: /lib64/ilp32/ /usr/lib64/ilp32/ /lib /usr/lib
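
The pattern matching described above is simple enough to model in a few lines. The following is a minimal Python sketch, not GCC's actual spec parser: the expand_spec function and its regular expression are made up for illustration, it only handles the non-nested %{glob:substitution} and %{!glob:substitution} forms, and it assumes that Python's fnmatch globbing is close enough to the "very simple shell globbing" mentioned above.

```python
import fnmatch
import re

# Hypothetical, simplified model of the spec patterns described in the post.
# It only understands %{glob:text} and %{!glob:text}, without nesting.
SPEC_RE = re.compile(r"%\{(!?)([^:}]+):([^}]*)\}")


def expand_spec(spec, args):
    """Expand a spec string against a list of gcc-style arguments."""
    def substitute(match):
        negated, glob, text = match.groups()
        # A pattern matches when some argument equals "-" followed by the glob.
        present = any(fnmatch.fnmatchcase(arg, "-" + glob) for arg in args)
        # Emit the text when presence agrees with the (optional) negation.
        return text if present != bool(negated) else ""
    return SPEC_RE.sub(substitute, spec)


# The same strings as the C macros above, built by plain concatenation.
XLEN_SPEC = "%{march=rv32*:32}%{march=rv64*:64}"
ABI_SPEC = "".join(
    "%%{mabi=%s:%s}" % (abi, abi)
    for abi in ("ilp32", "ilp32f", "ilp32d", "lp64", "lp64f", "lp64d")
)
STARTFILE_PREFIX_SPEC = (
    "/lib" + XLEN_SPEC + "/" + ABI_SPEC + "/ "
    "/usr/lib" + XLEN_SPEC + "/" + ABI_SPEC + "/ "
    "/lib/ "
    "/usr/lib/"
)

if __name__ == "__main__":
    paths = expand_spec(STARTFILE_PREFIX_SPEC,
                        ["-march=rv32imafdc", "-mabi=ilp32d"])
    print(paths.split())
```

Because this is pure text substitution, the sketch happily expands a nonsensical pair such as -march=rv64i -mabi=ilp32 as well, which is consistent with the last entry in the list above.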

Target Fragments

Since the multilib path descriptions for many targets are too complicated to be described using the spec DSL, GCC contains a second DSL that's used exclusively to specify the library paths in multilib systems. There's a bit more documentation on what this should do in GCC's target fragment section. To sum things up, there are four variables set in this file by the RISC-V port:

  • MULTILIB_OPTIONS: Contains the set of command-line arguments that should be considered when expanding multilib paths. Options that are mutually exclusive are separated by slashes, and groups of those that are unrelated are separated by spaces.
  • MULTILIB_DIRNAMES: A space-separated list of the path components that correspond to each of the above arguments. Multilib paths will be constructed by joining the paths that correspond to the passed arguments with slashes.
  • MULTILIB_REUSE: When two sets of multilib-related arguments are similar enough that we should use the same library paths when linking in both modes, the mappings go in here.
  • MULTILIB_REQUIRED: Without this variable, GCC will build libraries that cover the Cartesian product of what's in MULTILIB_OPTIONS. On systems where that's too many libraries, this variable controls the subset that's actually built.

On RISC-V we have way too many ISA/ABI combinations to build every combination and ship it as a library, so we heavily restrict the set that is actually built via the MULTILIB_REQUIRED variable -- without this we'd end up with hundreds of libraries built, the vast majority of which would never be used because they represent systems that don't make a whole lot of sense -- for example, who would build a system with double-precision floating point but no integer multiplier?

These variables are then provided as arguments to the gcc/genmultilib script, which produces both the tables to decode these arguments that the gcc wrapper uses and the input to various build scripts that instruct GCC to build many copies of each library it installs (for example, libgcc.so).
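
To make the relationship between these variables concrete, here is a rough Python sketch of the expansion. It assumes, following GCC's target fragment documentation, that each space-separated group in MULTILIB_OPTIONS contributes either nothing or exactly one of its slash-separated options, and that directory names are listed in the same order as the options. The function names and the sample values are made up for illustration; the real work is done by genmultilib, which this does not attempt to reproduce.

```python
from itertools import product


def candidate_multilibs(options, dirnames):
    """Map every option combination to its multilib directory.

    `options` and `dirnames` are strings in the style of MULTILIB_OPTIONS
    and the directory-name list. This is a simplified, hypothetical model.
    """
    groups = [group.split("/") for group in options.split()]
    names = iter(dirnames.split())
    # Pair every option with its directory name, in declaration order.
    named_groups = [[(opt, next(names)) for opt in group] for group in groups]
    table = {}
    # Each group contributes one of its options or nothing at all (None).
    for choice in product(*[[None] + group for group in named_groups]):
        picked = [c for c in choice if c is not None]
        key = "/".join(opt for opt, _ in picked)
        table[key] = "/".join(name for _, name in picked) or "."
    return table


def required_multilibs(options, dirnames, required):
    """Keep only the combinations named in a MULTILIB_REQUIRED-style string."""
    table = candidate_multilibs(options, dirnames)
    return {key: table[key] for key in required.split()}


if __name__ == "__main__":
    opts = "march=rv32i/march=rv64imac mabi=ilp32/mabi=lp64"
    dirs = "rv32i rv64imac ilp32 lp64"
    print(len(candidate_multilibs(opts, dirs)), "candidate combinations")
    print(required_multilibs(
        opts, dirs, "march=rv32i/mabi=ilp32 march=rv64imac/mabi=lp64"))
```

Even this two-by-two example yields nine candidate combinations, several of which (such as rv32i with lp64) make no sense, which is why restricting the set matters.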

RISC-V's multilib-generator Script

RISC-V was designed to be a modular ISA. As a result we already have over a hundred ISA and ABI combinations supported by the toolchain, and that number will only ever increase. While we aim to support all these combinations in the toolchain, it would be unreasonable to expect users to build all of these libraries (or even to download all of them as part of a distribution).

To fit this all into GCC's target fragment framework we set MULTILIB_OPTIONS to contain many targets and then set MULTILIB_REQUIRED to the set we actually want to build. We then slightly increase the set of supported ISA/ABI pairs by adding some relevant entries to MULTILIB_REUSE. Since typing all these in by hand is a pain, we instead use a script to generate our target fragment (which in turn is the input to the genmultilib script, which then generates the input to the gcc compiler wrapper, which then generates command-line arguments to collect2 to actually do the linking).

The script is called multilib-generator and is written in Python. It takes a list of arguments on the command line, each made up of dash-separated parts, and produces a target fragment that implements the multilib configuration that those arguments describe. The script isn't really meant to be used by end users so it's not well documented, but if you're trying to produce a toolchain with a different set of multilibs than the default set in GCC then you'll have to deal with it.

Each argument is made up of four dash-separated parts. The first two parts control the multilibs that will actually be built. For example:

# This file was generated by multilib-generator with the command:
#  ./multilib-generator ARCH0-ABI0-- ARCH1-ABI1--
MULTILIB_OPTIONS = march=ARCH0/march=ARCH1 mabi=ABI0/mabi=ABI1
MULTILIB_DIRNAMES = ARCH0 \
ARCH1 ABI0 \
ABI1
MULTILIB_REQUIRED = march=ARCH0/mabi=ABI0 \
march=ARCH1/mabi=ABI1
MULTILIB_REUSE =

will generate two multilibs: "-march=ARCH0 -mabi=ABI0" and "-march=ARCH1 -mabi=ABI1". Any other march/mabi pair will result in GCC using the default multilib (the one just installed in "lib"), which will probably cause an error when linking. This "fallback to the default" behavior is something baked into GCC, and while it can be a bit problematic, we don't have the time to fix it right now. If you want to build an extra multilib, you should add an additional argument to multilib-generator that specifies the ISA/ABI pair for that multilib.

The next two parts control MULTILIB_REUSE, which specifies how GCC searches for multilibs that don't exactly match those built by MULTILIB_REQUIRED. Both specify an additional set of comma-separated '-march' arguments that map to the multilib specified by the first two arguments.

The third part is the simpler of the two: it's a comma-separated list of additional ISA values that should be mapped to the multilib specified by the first two parts. For example:

# This file was generated by multilib-generator with the command:
#  ./multilib-generator ARCH0-ABI0-ARCHa,ARCHb-
MULTILIB_OPTIONS = march=ARCH0/march=ARCHa/march=ARCHb mabi=ABI0
MULTILIB_DIRNAMES = ARCH0 \
ARCHa \
ARCHb ABI0
MULTILIB_REQUIRED = march=ARCH0/mabi=ABI0
MULTILIB_REUSE = march.ARCH0/mabi.ABI0=march.ARCHa/mabi.ABI0 \
march.ARCH0/mabi.ABI0=march.ARCHb/mabi.ABI0

adds two additional ISAs that map to the generated multilib: "-march=ARCH0 -mabi=ABI0" will be used when passed any of "-march=ARCH0 -mabi=ABI0", "-march=ARCHa -mabi=ABI0", or "-march=ARCHb -mabi=ABI0". You can also specify these when there is more than one generated multilib; the additional ISAs apply to the multilib that's in the same argument.

The fourth part is very similar to the third, but rather than specifying the whole ISA that should be mapped to the specified multilib, it just specifies an additional suffix that should be mapped. For example:

# This file was generated by multilib-generator with the command:
#  ./multilib-generator ARCH0-ABI0--c,d
MULTILIB_OPTIONS = march=ARCH0/march=ARCH0c/march=ARCH0d mabi=ABI0
MULTILIB_DIRNAMES = ARCH0 \
ARCH0c \
ARCH0d ABI0
MULTILIB_REQUIRED = march=ARCH0/mabi=ABI0
MULTILIB_REUSE = march.ARCH0/mabi.ABI0=march.ARCH0c/mabi.ABI0 \
march.ARCH0/mabi.ABI0=march.ARCH0d/mabi.ABI0

adds two additional ISAs that map to the generated multilib: "-march=ARCH0 -mabi=ABI0" will be used when passed any of "-march=ARCH0 -mabi=ABI0", "-march=ARCH0c -mabi=ABI0", or "-march=ARCH0d -mabi=ABI0" -- as you can see, largely the same as above.
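
The four-part argument format can be summarized with a short Python sketch. This is not the real multilib-generator: the function names are invented, it models only the behavior shown in the fragments above, and it represents the "fallback to the default" behavior with the directory ".", which is an assumption made here for illustration.

```python
def parse_multilib_arg(arg):
    """Split one ARCH-ABI-EXTRA_ARCHES-SUFFIXES argument into its parts."""
    arch, abi, extra_arches, suffixes = arg.split("-")
    # Third part: whole ISA strings that reuse this multilib.
    aliases = [a for a in extra_arches.split(",") if a]
    # Fourth part: suffixes appended to the multilib's own ISA string.
    aliases += [arch + s for s in suffixes.split(",") if s]
    return arch, abi, aliases


def build_lookup(args):
    """Map every accepted (march, mabi) pair to a multilib directory."""
    lookup = {}
    for arg in args:
        arch, abi, aliases = parse_multilib_arg(arg)
        directory = arch + "/" + abi
        for name in [arch] + aliases:
            lookup[(name, abi)] = directory
    return lookup


def resolve(lookup, march, mabi):
    """Return the multilib directory for a pair, or "." for the default."""
    # Pairs that were never listed silently fall back to the default multilib.
    return lookup.get((march, mabi), ".")


if __name__ == "__main__":
    lookup = build_lookup(["ARCH0-ABI0-ARCHa,ARCHb-", "ARCH1-ABI1--c,d"])
    print(resolve(lookup, "ARCHb", "ABI0"))
    print(resolve(lookup, "ARCH1d", "ABI1"))
    print(resolve(lookup, "ARCHa", "ABI1"))
```

The last lookup shows the fallback case: ARCHa is only listed alongside ABI0, so pairing it with ABI1 lands on the default multilib rather than producing an error.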

Other Multilib-Aware Components

While GCC handles the vast majority of the multilib support, there are a handful of other components of the system that contribute in other ways to our multilib support:

  • ld, the linker, refuses to link objects with incompatible ABIs. While this doesn't directly support multilib, it does prevent it from getting screwed up silently.
  • ld.so, the dynamic loader, has some multilib paths baked into it so it can search for libraries correctly. We compile one dynamic loader for each multilib and then use GCC to fill out the corresponding ELF interpreter field, so there's not much going on in glibc here.
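
To illustrate what the linker has to work with, here is a small Python sketch that reads the ELF class and the floating-point ABI bits out of an ELF header. The header offsets come from the generic ELF format and the 0x6 float-ABI mask from the RISC-V ELF psABI; the function names are hypothetical, and the compatibility check is a deliberate simplification of what ld really does, which involves more than these two fields.

```python
import struct

# Float-ABI field of e_flags, as defined by the RISC-V ELF psABI.
EF_RISCV_FLOAT_ABI = 0x0006
FLOAT_ABI_NAMES = {0x0: "soft", 0x2: "single", 0x4: "double", 0x6: "quad"}


def elf_class_and_float_abi(header):
    """Return (32 or 64, float ABI name) from the first bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    elf_class = {1: 32, 2: 64}[header[4]]    # EI_CLASS
    endian = "<" if header[5] == 1 else ">"  # EI_DATA
    # e_flags sits after three address-sized fields, so its offset depends
    # on the ELF class: 36 for ELF32 and 48 for ELF64.
    flags_offset = 36 if elf_class == 32 else 48
    (e_flags,) = struct.unpack_from(endian + "I", header, flags_offset)
    return elf_class, FLOAT_ABI_NAMES[e_flags & EF_RISCV_FLOAT_ABI]


def looks_link_compatible(header_a, header_b):
    """Simplified check: same ELF class and same float ABI."""
    return elf_class_and_float_abi(header_a) == elf_class_and_float_abi(header_b)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        print(elf_class_and_float_abi(f.read(64)))
```

Run against an object file, this prints something like the XLEN and float ABI that the assembler recorded, which is the information that lets mismatched objects be rejected instead of being silently linked together.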

The Short Way

You might be thinking "that's super complicated, all I really want to do here is just know which library paths are used by my compiler". While you could derive this from looking at the GCC source code, it's simpler to just determine the multilib set experimentally using something like the following script:

#!/bin/bash
for abi in ilp32 ilp32f ilp32d lp64 lp64f lp64d; do
  for isa in rv32e rv32i rv64i; do
    for m in "" m; do
      for a in "" a; do
        for f in "" f fd; do
          for c in "" c; do
            readlink -f $(riscv64-unknown-elf-gcc -march=$isa$m$a$f$c -mabi=$abi -print-search-dirs | grep ^libraries | sed 's/:/ /g') | grep 'riscv64-unknown-elf/lib' | grep -ve 'lib$' | sed 's@^.*/lib/@@' | while read path; do
              echo "riscv64-unknown-elf-gcc -march=$isa$m$a$f$c -mabi=$abi => $path"
            done
          done
        done
      done
    done
  done
done

which produces the entire set of multilibs we support, along with their corresponding arguments:

riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 => rv32i/ilp32
riscv64-unknown-elf-gcc -march=rv32ic -mabi=ilp32 => rv32i/ilp32
riscv64-unknown-elf-gcc -march=rv32iac -mabi=ilp32 => rv32iac/ilp32
riscv64-unknown-elf-gcc -march=rv32im -mabi=ilp32 => rv32im/ilp32
riscv64-unknown-elf-gcc -march=rv32imc -mabi=ilp32 => rv32im/ilp32
riscv64-unknown-elf-gcc -march=rv32imac -mabi=ilp32 => rv32imac/ilp32
riscv64-unknown-elf-gcc -march=rv32imafc -mabi=ilp32f => rv32imafc/ilp32f
riscv64-unknown-elf-gcc -march=rv32imafdc -mabi=ilp32f => rv32imafc/ilp32f
riscv64-unknown-elf-gcc -march=rv64imac -mabi=lp64 => rv64imac/lp64
riscv64-unknown-elf-gcc -march=rv64imafdc -mabi=lp64d => rv64imafdc/lp64d

or for the Linux toolchain:

riscv64-unknown-linux-gnu-gcc -march=rv32ima -mabi=ilp32 => lib32/ilp32
riscv64-unknown-linux-gnu-gcc -march=rv32imac -mabi=ilp32 => lib32/ilp32
riscv64-unknown-linux-gnu-gcc -march=rv32imaf -mabi=ilp32 => lib32/ilp32
riscv64-unknown-linux-gnu-gcc -march=rv32imafc -mabi=ilp32 => lib32/ilp32
riscv64-unknown-linux-gnu-gcc -march=rv32imafd -mabi=ilp32 => lib32/ilp32
riscv64-unknown-linux-gnu-gcc -march=rv32imafdc -mabi=ilp32 => lib32/ilp32
riscv64-unknown-linux-gnu-gcc -march=rv32imafd -mabi=ilp32d => lib32/ilp32d
riscv64-unknown-linux-gnu-gcc -march=rv32imafdc -mabi=ilp32d => lib32/ilp32d
riscv64-unknown-linux-gnu-gcc -march=rv64ima -mabi=lp64 => lib64/lp64
riscv64-unknown-linux-gnu-gcc -march=rv64imac -mabi=lp64 => lib64/lp64
riscv64-unknown-linux-gnu-gcc -march=rv64imaf -mabi=lp64 => lib64/lp64
riscv64-unknown-linux-gnu-gcc -march=rv64imafc -mabi=lp64 => lib64/lp64
riscv64-unknown-linux-gnu-gcc -march=rv64imafd -mabi=lp64 => lib64/lp64
riscv64-unknown-linux-gnu-gcc -march=rv64imafdc -mabi=lp64 => lib64/lp64
riscv64-unknown-linux-gnu-gcc -march=rv64imafd -mabi=lp64d => lib64/lp64d
riscv64-unknown-linux-gnu-gcc -march=rv64imafdc -mabi=lp64d => lib64/lp64d
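
If you only need the list of multilibs that were built, rather than the full mapping of every argument combination, GCC's -print-multi-lib option is another route. Its documented output has one line per multilib, with the directory separated from the switches by ';' and each switch introduced by '@' instead of '-'. The Python sketch below parses that format; the function name and the sample text are made up for illustration and are not the output of any particular toolchain.

```python
def parse_print_multi_lib(output):
    """Parse `gcc -print-multi-lib` style text into {directory: [switches]}."""
    table = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like "directory;@switch1@switch2".
        directory, _, switches = line.partition(";")
        table[directory] = ["-" + s for s in switches.split("@") if s]
    return table


if __name__ == "__main__":
    # Hypothetical sample text, written by hand in the documented format.
    sample = ".;\nrv32i/ilp32;@march=rv32i@mabi=ilp32\n"
    for directory, switches in parse_print_multi_lib(sample).items():
        print(directory, "<=", " ".join(switches) or "(default)")
```

Note that this would only show the multilibs themselves; the extra argument combinations that are mapped onto an existing multilib are what the experimental script above reveals.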

The Rationale Behind Our Multilib Sets

While it may seem like the set of multilibs that are part of our default set is somewhat arbitrary, we actually put quite a lot of thought into each one. Most of the work here went into the embedded set, so let's just go through the list and describe why each one exists:

  • rv32i/ilp32: The simplest RISC-V ISA. While we don't expect this to see much commercial use, we expect that it'll get a lot of educational and hobbyist use. Also, it seems a bit odd not to support the base ISA well -- as otherwise what's the point of one :).
  • rv32iac/ilp32: Despite there being lots of tricks to produce small multipliers that are arbitrarily slow, some people seem to be allergic to hardware multiplication. This target is there to satisfy those people.
  • rv32im/ilp32: This exists largely to support cores retrofitted from other ISAs where simple memory systems preclude the implementation of both the A and C extensions.
  • rv32imac/ilp32: We expect this to get lots of use; it's probably what you'd want to build if you're building a standalone microcontroller chip.
  • rv32imafc/ilp32f: A 32-bit, floating-point target. The other option here would have been rv32imafdc/ilp32d, but we chose this instead under the assumption that if you could deal with having a 64-bit FPU, you'd probably just want to build a 64-bit core.
  • rv64imac/lp64: This will probably be the RISC-V ISA configuration that has the largest number of cores produced for the near future, as there aren't any good options for deeply embedded cores (think power management units, IP control cores, etc) that can talk to SoCs with address spaces larger than 32 bits.
  • rv64imafdc/lp64d: The "full featured" embedded core. These probably won't be produced as embedded cores directly, but we think that people will repurpose Linux-class cores as embedded cores as Linux isn't that expensive on RISC-V.

We didn't want the list to become too large, so we decided to limit it to this set. We put less thought into the Linux configurations, as things tend to be a bit more normal in larger systems. Here we just decided to support four library configurations: the Cartesian product of 32/64 bit and soft/hard float.

Changing the Multilib Sets

While we tried to ensure that a reasonable set of libraries are built as part of the default toolchain build, you might want something slightly different. You have a few options here:

  • Build a non-multilib toolchain so everything will have your ISA/ABI combination. This is the easiest option, but if you're shipping something you should at least run the GCC test suite against your combination of choice as targets outside the default multilib set get less testing.
  • Petition the GCC developers to add a MULTILIB_REUSE entry that maps the ISA you're interested in onto a library compiled with a slightly different set of flags. This is ideal if your desired ISA doesn't get used much in the C library: for example, a good candidate for addition might be to make -march=rv64imafdc -mabi=lp64 match with the rv64imac/lp64 libraries, as newlib doesn't do much floating-point stuff. This is low overhead, so we'll probably accept your suggestion.
  • Petition the GCC developers to add a MULTILIB_REQUIRED entry that provides your desired ISA/ABI combination. This is higher overhead than adding to MULTILIB_REUSE, as it results in a higher support burden. If there's commonly used silicon available for an ISA then we'll strongly consider adding it to the default set, as the whole point of multilib is to avoid the need for multiple toolchains.
  • Fork the toolchain and change the default multilib set. This isn't a desired option, and if you do it we ask that you pick a different tuple to indicate that you have a non-standard build. For example, you might pick riscv64-my_company-elf instead of riscv64-unknown-elf to indicate that "My Company" is providing a non-standard toolchain. As the unknown field isn't really defined, no program should be looking at it so you should be safe. We'd really like to avoid toolchain forks if possible, so please at least contact us to talk first!

I think that's about all there is to the RISC-V multilib implementation, so hopefully there won't be any more coverage on it in this blog series. We'll try to get back to covering slightly more interesting topics next week :).