This document summarizes the design of a global instruction selector for LLVM as proposed on llvmdev.

Design goals

The global instruction selector is intended to replace both SelectionDAG and fast isel. These are some of the goals for a new instruction selector architecture:

  • We want a global instruction selector. Make it possible to match complex patterns across basic blocks to take advantage of addressing modes. Legalization of switches, selects, and atomics can happen as part of the instruction selection process, even if new basic blocks are required. Clean up ext/trunc instructions across basic blocks after legalization of small integer types.
  • We want a faster instruction selector. By selecting directly to MI, we can avoid one IR translation phase. Because the IR is already linearized, scheduling becomes optional.
  • We want a shared code path for fast and good instruction selection. The -O0 instruction selector should simply be a trimmed down version of the full optimizing instruction selector.
  • We want an IR that represents ISA concepts well. The MachineInstr IR is used to represent functions throughout instruction selection, and targets are allowed to insert pre-selected instructions at any time. Virtual registers can carry either an (EVT, register bank) pair or a register class label. Custom opaque types can be added to represent target-specific values.
  • We want a configurable instruction selector. Weird targets have weird requirements, and it should be possible for targets to inject new passes into the instruction selection process. Sometimes, it may even be required to replace a standard pass.

Overall design

The global instruction selector is implemented as a series of machine function passes. Targets can inject their own passes or replace standard passes, using the existing mechanism for configuring code generation pipelines.

The MI intermediate representation is extended with a set of abstract target-independent opcodes very similar to the ISD opcodes used by SelectionDAG. Virtual registers can be typed so they have an EVT instead of a register class. A given type can be mapped to more than one register class, controlled by a register bank label.
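
As a rough illustration of these virtual register labels, a per-register record could look like the following. This is a hypothetical C++ sketch, not the real MachineRegisterInfo interface.

    #include <cstdint>

    // Hypothetical per-virtual-register record. Before instruction selection a
    // register carries a type and a register bank; after selection it carries
    // a register class instead.
    struct VRegInfo {
      enum class Kind : uint8_t { TypedWithBank, RegClass };
      Kind Which;
      unsigned TypeOrClassID;  // EVT identifier, or register class identifier
      unsigned Bank;           // meaningful only in the TypedWithBank state
    };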

The standard passes are:

  1. MI builder.
  2. Switch lowering.
  3. Iterative legalization and cleanup.
  4. Register bank selection.
  5. Common subexpression elimination.
  6. Instruction selection.

I’ll describe these passes in more detail below.

MI builder

The MI builder pass translates LLVM IR to MI IR. This pass will:

  • Expand LLVM IR first-class aggregate types into their constituent parts. This also includes splitting load, store, phi, and select instructions.
  • Replace pointer types with target-specific integers.
  • Expand getelementptr instructions into integer adds and multiplies (see the sketch after this list).
  • Map LLVM instructions to target-independent abstract MI opcodes.
  • Lower ABI boundaries with help from target hooks.
  • Create a 1-1 mapping of the LLVM IR CFG. No new basic blocks are created.
  • Preserve switch instructions.
  • Preserve i1 logic instructions.
  • Not use MERGE_VALUES instructions. We’ll use a value map that can handle aggregates.
  • The aggregate type expansion creates value types that can all be represented by EVTs, and MachineRegisterInfo will be extended to allow virtual registers to have an EVT instead of a register class. EVTs are all the integer, floating point, and vector types from LLVM IR.
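
The getelementptr expansion from the list above can be pictured with a small, self-contained sketch. Everything in it (the opcode enum, AbstractInst, expandGEPStep) is hypothetical and only illustrates turning one index step into a multiply and an add on pointer-sized integers.

    #include <cstdint>
    #include <vector>

    // Hypothetical abstract opcodes; the real set would mirror the ISD opcodes.
    enum class Op { ConstInt, Mul, Add };

    struct AbstractInst {
      Op Opcode;
      unsigned Dst;                // destination virtual register
      std::vector<unsigned> Srcs;  // source virtual registers
      int64_t Imm;                 // only used by ConstInt
    };

    // Expand one GEP index step, NewPtr = Ptr + Index * ElementSize, after the
    // pointer has been replaced by a target-sized integer.
    unsigned expandGEPStep(std::vector<AbstractInst> &Out, unsigned &NextVReg,
                           unsigned PtrReg, unsigned IndexReg,
                           int64_t ElementSize) {
      unsigned SizeReg = NextVReg++;
      Out.push_back({Op::ConstInt, SizeReg, {}, ElementSize});
      unsigned ScaledReg = NextVReg++;
      Out.push_back({Op::Mul, ScaledReg, {IndexReg, SizeReg}, 0});
      unsigned ResultReg = NextVReg++;
      Out.push_back({Op::Add, ResultReg, {PtrReg, ScaledReg}, 0});
      return ResultReg;
    }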

The ABI boundary lowering requires types to be broken down further into ‘legal types’ that can be mapped to registers. This secondary breakdown is currently handled by TargetLowering::LowerCallTo() calling getRegisterType() and getNumRegisters(). Most ABIs are defined in terms of C types rather than LLVM IR types, so there is a close connection between the C frontend and the ABI lowering code in the instruction selector. It would be better for the ABI lowering code to work independently of the type system used during instruction selection, and making ABI lowering part of the LLVM IR to MI translation makes that possible.
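
One way to decouple ABI lowering from the selection-time type system is to express argument lowering purely in terms of register-sized pieces. The interface below is a hypothetical illustration, not an existing LLVM class.

    #include <vector>

    // Hypothetical description of where one piece of an argument lives.
    struct ArgPiece {
      enum class Kind { Register, Stack };
      Kind Where;
      unsigned RegOrOffset;  // physical register id, or byte offset on the stack
      unsigned BitWidth;     // width of this piece in bits
    };

    // Hypothetical target hook: break an argument down into pieces the calling
    // convention can place. It works on widths and a float/integer flag only,
    // so it does not depend on the type system used during selection.
    struct CallLoweringHook {
      virtual ~CallLoweringHook() = default;
      virtual std::vector<ArgPiece> lowerArgument(unsigned BitWidth,
                                                  bool IsFloat) const = 0;
    };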

Switch lowering

The switch lowering pass converts switch instructions into a combination of branch trees and jump tables. A default target-independent implementation is possible, but targets may want to override this pass. For example, the ARM target could try to create jump tables that would work well with the TBB/TBH instructions to reduce code size.
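
To illustrate the kind of decision the target-independent implementation has to make, here is a small sketch of a density check for choosing a jump table over a branch tree. The thresholds are invented for the example, not taken from the proposal.

    #include <cstdint>
    #include <vector>

    // Decide whether a sorted, non-empty list of case values is dense enough
    // for a jump table. Illustrative thresholds: at least 4 cases and at least
    // 40% of the value range covered.
    bool shouldUseJumpTable(const std::vector<int64_t> &SortedCases) {
      if (SortedCases.size() < 4)
        return false;
      uint64_t Range =
          static_cast<uint64_t>(SortedCases.back() - SortedCases.front()) + 1;
      return SortedCases.size() * 10 >= Range * 4;  // density >= 40%
    }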

Legalization

We’ll define legal types precisely:

A type is considered legal if it can be loaded into and stored from registers in some allocatable register class. (And all allocatable register classes must support copy instructions.)

We don’t require any operations other than load and store to be supported for a type to be legal, and all types that can be loaded and stored are legal. This means that most targets will have a much larger set of legal types than they do today. Also note that we don’t require targets to designate a register class to use for each legal type; in fact, TableGen can compute the legal type set automatically. Register classes can be inferred from the selected instructions; MachineRegisterInfo::recomputeRegClass() already knows how to do that.
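
The legality rule can be sketched over a hypothetical per-target summary of the kind TableGen could emit; the struct and function below are illustrative only.

    #include <vector>

    // Hypothetical summary of one register class.
    struct RegClassDesc {
      bool Allocatable;
      std::vector<unsigned> LoadStoreWidths;  // bit widths it can load and store
    };

    // A type is legal if some allocatable register class can load and store
    // values of its width. (EVTs are collapsed to bit widths for brevity.)
    bool isLegalWidth(unsigned Bits, const std::vector<RegClassDesc> &Classes) {
      for (const RegClassDesc &RC : Classes) {
        if (!RC.Allocatable)
          continue;
        for (unsigned W : RC.LoadStoreWidths)
          if (W == Bits)
            return true;
      }
      return false;
    }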

With a much larger set of legal types, a separate type legalization phase becomes superfluous. The operation legalizer must be able to do anything the type legalizer can do anyway, so the type legalizer isn’t adding any value.

The legalizer will work bottom-up and iteratively. As illegal instructions are visited, the legalizer will apply transformations that make the current instruction ‘more legal’. Each transformation is allowed to modify a whole chain of single-use instructions for efficiency, but it is also allowed to modify only the current instruction in hard cases. The instructions created don’t need to be legal; the legalizer iterates until the current instruction is legal before it moves on.

The set of new instructions created while legalizing a single instruction is fed through an instruction simplifier that cleans up redundancies. This replaces DAGCombine.
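
The control flow described above can be sketched as a small C++ loop. The hooks passed in stand for the target-provided legalization rules and the instruction simplifier; all names here are hypothetical, not existing LLVM interfaces.

    #include <functional>
    #include <vector>

    struct Inst;  // stand-in for a MachineInstr-like node

    // Legalize one instruction bottom-up. Assumed hooks:
    //   isLegal       - true if the instruction needs no further work
    //   makeMoreLegal - rewrites I, appends freshly created instructions to
    //                   the vector, and returns the instruction replacing I
    //   simplify      - local cleanup over the new instructions (the role
    //                   DAGCombine plays today)
    Inst *legalizeOne(
        Inst *I, const std::function<bool(Inst *)> &isLegal,
        const std::function<Inst *(Inst *, std::vector<Inst *> &)> &makeMoreLegal,
        const std::function<void(std::vector<Inst *> &)> &simplify) {
      // Keep transforming until the current instruction is legal; the
      // instructions created along the way need not themselves be legal yet.
      while (!isLegal(I)) {
        std::vector<Inst *> NewInsts;
        I = makeMoreLegal(I, NewInsts);
        simplify(NewInsts);
      }
      return I;
    }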

Register bank selection

Many instruction set architectures have multiple register banks. X86 has three: Integer, vector, and x87 floating point registers. (Four if you count MMX registers as a separate bank.) Blackfin and m68k have separate pointer and data register banks, etc. It is also common to have the same operations available in multiple register banks. For example, most ISAs with a vector register bank support bitwise and/or/xor operations in both the integer and vector register banks.

The global instruction selector will assign virtual registers to register banks explicitly. The set of register banks is typically small (2-3) and defined by the target. Modelling register banks explicitly makes it possible to move operations between register banks to minimize cross-bank copies, which are often expensive. SPARC even requires cross-bank copies to go through memory, as does x86 in some cases.

The register bank selection pass computes the optimal bank assignment and inserts copy instructions when a value needs to cross banks. Sometimes it may be beneficial to have the same value available in two register banks simultaneously; this can also be represented with cross-bank copy instructions. The bank selection can also be affected by register pressure concerns. On x86-64, for example, many i32 values could be moved to the SSE registers, freeing up the integer registers.
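
A deliberately simplified sketch of the per-value decision follows: two banks, uses weighted by the cost of a cross-bank copy. A real pass would also look at defs and register pressure, and could decide to keep a value live in both banks at once. All names are illustrative.

    #include <cstdint>
    #include <map>
    #include <vector>

    enum class Bank { Integer, Vector };

    struct UseInfo {
      Bank PreferredBank;  // bank the using instruction wants the value in
      unsigned CopyCost;   // cost of a cross-bank copy at this use
    };

    // Greedy choice for one value: pick the bank that minimizes the total
    // cross-bank copy cost over all of its uses.
    Bank pickBank(const std::vector<UseInfo> &Uses) {
      const Bank AllBanks[] = {Bank::Integer, Bank::Vector};
      std::map<Bank, uint64_t> Cost;
      for (const UseInfo &U : Uses)
        for (Bank B : AllBanks)
          if (B != U.PreferredBank)
            Cost[B] += U.CopyCost;  // placing the value in B forces a copy here
      return Cost[Bank::Integer] <= Cost[Bank::Vector] ? Bank::Integer
                                                       : Bank::Vector;
    }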

Instruction selection

The instruction selection pass replaces most target-independent instructions with real target opcodes, and it ensures that all virtual registers have register classes instead of EVTs. Some target-independent instructions, like COPY, are not lowered until after register allocation.

SelectionDAG instruction selection is controlled only by expression types, and the selected instructions are expected to use the single register bank implied by each type:

(operation, type) -> opcode

We’re representing register banks explicitly, and many operations are available in multiple banks, so the register bank needs to be part of the instruction selection:

(operation, type, bank) -> opcode

On the other hand, when types are no longer used to select register banks, it becomes hard to justify the distinction between load i32 and load f32. The hardware doesn’t care either; it simply knows how to load 32 bits into a given register. We can use a three-level hierarchical type system to describe this more accurately:

  • Bit width. Operations like load, store, select, and the bitwise and/or/xor depend only on the number of bits in their operands. Their effect is independent of the vector lane structure and of whether lanes are ints or floats.
  • Vector lanes + lane width. Operations like shufflevector and insertelement depend on the vector topology of the operand types, but they don’t care whether vector lanes are floats or ints.
  • Full EVTs. Finally, arithmetic instructions actually depend on the full type of their operands. It is worth noting, though, that the int/float distinction is already redundantly encoded in the operations. LLVM IR has separate add and fadd instructions, for example.

The instruction selection will only use the relevant parts of the operand type, depending on the operation being matched. It will consider load i32, load f32, and load v2i16 to be simply 32-bit wide loads. The target is likely to have multiple 32-bit load instructions, and the explicit register bank is used to pick the right one. Note that some big-endian targets (ARM and MIPS in BE mode) still number their vector lanes right-to-left. This means that vector loads get shuffled differently depending on the lane size being loaded, and we’ll need to make it possible to use different load instructions for different vector lane sizes. (On such targets, bitcasts become shuffles too.)
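
One way to picture the resulting matching key is a record that keeps only the type detail the operation needs, together with the register bank of the result. This is a hypothetical sketch to illustrate the (operation, type, bank) -> opcode mapping, not a proposed data structure.

    // Hypothetical level of type detail an operation cares about.
    enum class TypeDetail { BitWidth, VectorShape, FullType };

    struct OperandType {
      unsigned TotalBits;  // always relevant
      unsigned NumLanes;   // relevant at VectorShape and FullType detail
      unsigned LaneBits;   // relevant at VectorShape and FullType detail
      bool IsFloat;        // relevant only at FullType detail
    };

    // Reduce a full type to the detail the operation needs, so that load i32,
    // load f32, and load v2i16 all produce the same 32-bit key and the bank
    // alone decides between, say, an integer load and a vector load.
    OperandType reduceType(TypeDetail Detail, OperandType T) {
      if (Detail == TypeDetail::BitWidth) {
        T.NumLanes = 1;
        T.LaneBits = T.TotalBits;
        T.IsFloat = false;
      } else if (Detail == TypeDetail::VectorShape) {
        T.IsFloat = false;
      }
      return T;
    }

    // Key used to look up target opcodes during selection.
    struct SelectionKey {
      unsigned Operation;  // abstract target-independent opcode
      OperandType Type;    // already reduced to the relevant detail
      unsigned Bank;       // register bank of the result
    };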

Apart from the way operations and types are matched, the instruction selection algorithm is a lot like the current SelectionDAG algorithm. The process is less transactional than it is in SelectionDAG. Specifically, targets are allowed to insert pre-selected instructions at any time.

The instruction selection algorithm is global, which means it can look upwards in the dominator tree when matching patterns. A cost model is required to determine when it is a good idea to fold instructions defined outside the current block or loop. The same applies to matching instructions with multiple uses.
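
A sketch of the kind of profitability check such a cost model might apply before folding a definition from another block into the pattern being matched; the fields and the rules are illustrative assumptions, not part of the proposal.

    // Hypothetical summary of a folding opportunity. Dominance of the
    // definition over the use is assumed to have been checked already.
    struct FoldCandidate {
      unsigned DefLoopDepth;  // loop depth of the defining instruction
      unsigned UseLoopDepth;  // loop depth of the use being matched
      unsigned NumUses;       // total number of uses of the definition
      bool CheapToDuplicate;  // folding re-materializes the value per use
    };

    bool worthFolding(const FoldCandidate &C) {
      if (C.UseLoopDepth > C.DefLoopDepth)
        return false;  // would move work into a more deeply nested loop
      if (C.NumUses > 1 && !C.CheapToDuplicate)
        return false;  // duplicating an expensive computation per use
      return true;
    }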