To loot, you first need to boot!

As most of you know, the linux kernel is stored as a bzImage. This bzImage has been composed of different files over time, but it usually consists of two things:

  1. Some kind of linux boot code
  2. vmlinux.bin.gz, your gzipped kernel

The bit that interests us is the linux boot code, and how it paves the way for the kernel itself. You may consider that once the piggy.o object (see later) has been loaded at offset 0x100000, the basic bootloading job is done. But before tackling UEFI, let's go back a bit to the legacy boot process.

I gave a talk about these matters in March. You can consult the slides at the following address. The prezi slides give a very good idea of where you are in the code, so give them a try!

Legacy boot

Way, way back in 2.5.64

Even before people used window managers and all that fancy stuff, linux actually was a bootable image, meaning you could run dd if=bzImage of=/dev/sda and just boot off the thing. This required the first 512 bytes to be MBR material, able to load the rest of the kernel by itself. With this technique, there was no easy way to specify a command line (and therefore a root filesystem, an initrd file or an init binary).
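To make "MBR material" a little more concrete, here is a minimal sketch of the check a PC BIOS conventionally performs before running a boot sector: the sector must end with the bytes 0x55 0xAA. The function name has_boot_signature is made up for this illustration, and the sketch assumes the first sector of the image has already been read into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE     512u
#define BOOT_SIG_OFFSET 510u  /* 0x1FE: the last two bytes of the sector */

/*
 * Return 1 if the first sector of an image ends with the 0x55 0xAA
 * signature that a PC BIOS conventionally expects in a boot sector,
 * 0 otherwise. (Hypothetical helper, not taken from the kernel tree.)
 */
int has_boot_signature(const uint8_t *sector, size_t len)
{
    assert(sector != NULL);
    if (len < SECTOR_SIZE)
        return 0;
    return sector[BOOT_SIG_OFFSET] == 0x55 &&
           sector[BOOT_SIG_OFFSET + 1] == 0xAA;
}
```

Feeding it the first 512 bytes of a bzImage should return 1, since the image still carries that signature for compatibility.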

The bzImage was composed as follows:

    0x0   +-----------------+
          |   bootsect.o    | <- _start()
          +-----------------+
          |      hdr        | <- setup_sects field
    0x200 +-----------------+
          |     setup.o     | <- start_of_setup()
          +-----------------+
          | vmlinux.bin.gz: |
          |                 |
          |  +-----------+  |
          |  |   head.o  |  | <- startup_32()
          |  +-----------+  |
          |  |   misc.o  |  | <- decompress_kernel()
          |  +-----------+  |
          |  |  piggy.o  |  | <- linux!
          |  +-----------+  |
    a lot +-----------------+

The piggy.o object contains the bulk of the kernel image. misc.o is a bunch of gzip routines for the decompression of the kernel.

The bootsect.o was a 512-byte MBR. Since 2.5.65, it just prints an error message indicating that the feature is not supported anymore. arch/i386/boot was moved into arch/x86/boot in 2.6.24, and bootsect.S and setup.S were replaced by the header.S file in 2.6.23. The bootsect.S file performed only a few basic tasks:

The size of the setup.o code, which needs to be loaded in low-memory, is defined in the setup_sects field [bootsect.S:415].

After loading those two chunks in memory, the processor jumped into the setup.o code, at the symbol start_of_setup [setup.S:173]. From here, it carried out a few tasks:

The code at 0x100000 (1 MiB) is part of the startup_32 [head.S:31] routine, the first protected-mode code in the kernel. It uses routines from misc.c to decompress the kernel in place and then jumps to 0x100000 again [head.S:77], where the code from piggy.o has now been loaded.

The real world

As I said previously, the layout of the arch/i386/boot folder (arch/x86/boot as of today) has changed drastically over time.

The first change to take place was the nullification of the MBR: starting with version 2.5.65, the first 512 bytes were only able to print out a bugger-off message. Between versions 2.6.22 and 2.6.23, the folder was completely reworked. A new file, header.S, was created, containing the now useless 512-byte MBR and a bit of the setup.S code as well. The main change is the creation of a main.c file that performs most of the initializations formerly done by setup.S: BIOS mode, memory detection, video mode and such. The code in main.c then switches to protected mode through the go_to_protected_mode stub in the pm.c file [pm.c:149]. The head_32.S file is still very similar to the original head.S source: its job is to decompress the kernel in place, thus placing piggy.o at 0x100000.

According to my research and the build files of the compressed folder ([Makefile:29] and [vmlinux.lds.S]), the bzImage is laid out as follows:

    0x0   +-----------------+
          |    header.o:    |
          |                 |
          |  +-----------+  |
          |  |  old mbr  |  |
          |  |    hdr    |  |
          |  +-----------+  |
          |  | call main |  | <- start_of_setup()
          |  +-----------+  |
    0x200 +-----------------+
          |    setup.o:     |
          |                 |
          |  +-----------+  |
          |  |   main.o  |  | <- main()
          |  +-----------+  |
          |  |  video.o  |  | <- set_video()
          |  +-----------+  |
          |  |    ...    |  |
          |  +-----------+  |
          +-----------------+
          | vmlinux.bin.gz: |
          |                 |
          |  +-----------+  |
          |  |   head.o  |  | <- startup_32()
          |  +-----------+  |
          |  |   misc.o  |  | <- decompress_kernel()
          |  +-----------+  |
          |  | string.o  |  |
          |  +-----------+  |
          |  | cmdline.o |  |
          |  +-----------+  |
          |  |    ...    |  |
          |  +-----------+  |
          |  |  piggy.o  |  | <- linux
          |  +-----------+  |
    a lot +-----------------+

Usual BIOS-enabled bootloaders startup

Let's take a look at the syslinux sources to understand when and how the bootloader itself loads the linux bzImage into memory. A big thanks to the guys from #syslinux on freenode for their help in finding the code that loads the module and jumps into the linux kernel; the path was not obvious. The bzImage is split in two according to the setup_sects field [load_linux.c:243], which lives in the first 512 bytes of the header.o file [header.S:264].
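The split itself can be sketched in a few lines. The snippet below is not the syslinux code: it is a simplified illustration that assumes the whole bzImage sits in a buffer, and the names bzimage_layout and bzimage_split are invented for the occasion. The offsets (setup_sects at 0x1F1, the "HdrS" magic at 0x202, the protocol version at 0x206) and the rule that a setup_sects of 0 means 4 come from boot.txt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets from the start of the bzImage, as documented in boot.txt. */
#define OFF_SETUP_SECTS 0x1F1  /* 1 byte: size of the setup code, in sectors */
#define OFF_HEADER      0x202  /* 4 bytes: magic "HdrS" */
#define OFF_VERSION     0x206  /* 2 bytes, little-endian: protocol version */

/* Hypothetical result structure for this sketch. */
struct bzimage_layout {
    uint16_t version;     /* boot protocol version, e.g. 0x020b for 2.11 */
    size_t   setup_size;  /* real-mode part, boot sector included, in bytes */
    size_t   pm_size;     /* protected-mode part (everything after it) */
};

/* Fill *out and return 0, or return -1 if the image looks invalid. */
int bzimage_split(const uint8_t *image, size_t len, struct bzimage_layout *out)
{
    unsigned sects;

    assert(image != NULL && out != NULL);
    if (len < OFF_VERSION + 2)
        return -1;
    if (memcmp(image + OFF_HEADER, "HdrS", 4) != 0)
        return -1;

    sects = image[OFF_SETUP_SECTS];
    if (sects == 0)   /* backwards compatibility: 0 really means 4 */
        sects = 4;

    out->version = (uint16_t)(image[OFF_VERSION] |
                              (image[OFF_VERSION + 1] << 8));
    /* The boot sector is not counted in setup_sects, hence the + 1. */
    out->setup_size = ((size_t)sects + 1) * 512;
    if (out->setup_size > len)
        return -1;
    out->pm_size = len - out->setup_size;
    return 0;
}
```

A loader would then copy the first setup_size bytes to low memory and the remaining pm_size bytes to 0x100000.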

Once this is done, the bootloader simply jumps 512 bytes past the beginning of the real-mode code it copied into memory. This code relocates itself to segment 0x9000 (physical address 0x90000) according to the setup_move_size field [header.S:306] if no command-line address has been specified in the cmd_line_ptr field [header.S:338]. From then on, the kernel follows the same route as when it was booted directly as an image.

It is also worth noting that the setup code knows which bootloader loaded it, thanks to the ext_loader_type field [header.S:335] [boot.txt:].
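As an illustration of how that identification can be decoded, here is a small sketch. It assumes, following boot.txt, that the bootloader writes a 0xTV byte in type_of_loader (offset 0x210), and that an ID of 0xE means "look at ext_loader_type (0x227) and add 0x10", with ext_loader_ver (0x226) providing extra version bits. The struct and function names are invented for this example, and the buffer is expected to be the in-memory copy of the header once the bootloader has filled it in, not the file on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFF_TYPE_OF_LOADER  0x210
#define OFF_EXT_LOADER_VER  0x226
#define OFF_EXT_LOADER_TYPE 0x227

/* Hypothetical result structure for this sketch. */
struct loader_id {
    unsigned id;       /* bootloader ID from boot.txt, e.g. 0x3 for Syslinux */
    unsigned version;  /* bootloader-specific version number */
};

/* Decode the loader identification; return 0, or -1 if the buffer is too short. */
int decode_loader(const uint8_t *image, size_t len, struct loader_id *out)
{
    uint8_t type;

    assert(image != NULL && out != NULL);
    if (len <= OFF_EXT_LOADER_TYPE)
        return -1;

    type = image[OFF_TYPE_OF_LOADER];
    out->id = type >> 4;                      /* high nibble: loader ID */
    if (out->id == 0xE)                       /* escape value: extended ID */
        out->id = image[OFF_EXT_LOADER_TYPE] + 0x10u;

    /* Low nibble is the version, extended upwards by ext_loader_ver. */
    out->version = (type & 0x0Fu) |
                   ((unsigned)image[OFF_EXT_LOADER_VER] << 4);
    return 0;
}
```

With type_of_loader set to 0x31, this yields ID 0x3 (Syslinux) and version 1.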

Conclusion

Well, all we have learned thus far is that the BIOS-dependent bootloading process for linux is quite a mess. It is not trivial to follow the control flow, and the bzImage loading is far from obvious. The drastic changes the boot folder underwent did not help me get a sense of what was going on. However, here comes UEFI.

The UEFI model

Introduction

The goal of the UEFI specification is first to unify the boot process and get rid of the mess that BIOS-dependent bootloading is. When the IA64 architecture was designed, engineers from Intel thought it was time to get rid of the legacy 16-bit to 32-bit to 64-bit booting process and go straight into protected mode. Although IA64 lost out to AMD64, the idea of getting rid of the archaic firmware that is the BIOS stuck. After a few years, the EFI firmware became UEFI and development of the specification spread outside Intel.

The idea here is to provide an API that is more user-friendly to the programmer, with simple applications packaged as Portable Executables (PE, from Windows). Most of these applications are services (usually drivers) exposing a bunch of devices to the user, such as a keyboard, a screen or the clock. They are initialized and run automatically by the firmware. Other applications include a shell (enabling the user to start other applications), or bootloaders (it might be useful).

There are three types of application:

Boot services are protocols (APIs, to keep it simple) designed to die when the boot process is done and control is handed to the OS (via the ExitBootServices() routine). These services include drivers such as the text/graphical console, block devices and such. Runtime services, on the other hand, are designed to stay reachable by the OS, even after a call to ExitBootServices(). These services provide access to the NvRAM, for example, or drivers for the clock.

The NvRAM stores a few variables, including the configuration for the boot manager. The boot manager reads the NvRAM to boot a given application automatically. This configuration can be altered with the efibootmgr utility, which allows the user to set up the boot order. This order usually defaults to:

  1. Try to boot on floppy
  2. Try to boot on hard drive
  3. Try to boot on NIC0
  4. Run shell application

The user-defined applications and files are stored on a special FAT32 partition (the EFI System Partition), identified by the partition type 0xEF.
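For a disk using a classic MBR partition table, spotting that partition boils down to reading one byte per entry. The sketch below assumes the first sector of the disk is already in a buffer; find_esp is a name made up for this example. Disks partitioned with GPT identify the partition with a GUID instead, which this sketch does not handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MBR_SIZE         512
#define MBR_TABLE_OFFSET 446   /* 0x1BE: first of the four primary entries */
#define MBR_ENTRY_SIZE   16
#define MBR_ENTRY_TYPE   4     /* offset of the type byte inside an entry */
#define PART_TYPE_EFI    0xEF

/*
 * Return the index (0 to 3) of the first primary partition whose type
 * byte is 0xEF, or -1 if there is none. (Hypothetical helper.)
 */
int find_esp(const uint8_t *mbr, size_t len)
{
    int i;

    assert(mbr != NULL);
    if (len < MBR_SIZE)
        return -1;

    for (i = 0; i < 4; i++) {
        const uint8_t *entry = mbr + MBR_TABLE_OFFSET + i * MBR_ENTRY_SIZE;
        if (entry[MBR_ENTRY_TYPE] == PART_TYPE_EFI)
            return i;
    }
    return -1;
}
```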

UEFI: how to

As mentioned before, the code for an application is encapsulated in the PE format. This means the binary needs both the MZ and PE headers in order to be recognized as a valid EFI executable. It also needs to carry the .efi extension in the filesystem.
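To give an idea of what "both the MZ and PE headers" means in practice, here is a minimal sketch of a validity check. The layout comes from the PE/COFF format: "MZ" at offset 0, the offset of the PE header stored at 0x3C, the "PE\0\0" signature, then the COFF header and the optional header. The pe_info structure and the parse_pe function are names invented for this illustration, and a real firmware certainly checks more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PE_SUBSYSTEM_EFI_APPLICATION 10

/* Hypothetical summary of the fields this sketch extracts. */
struct pe_info {
    uint16_t machine;      /* 0x014c for i386, 0x8664 for x86-64 */
    uint16_t subsystem;    /* 10 for an EFI application */
    uint32_t entry_point;  /* AddressOfEntryPoint, relative to the image base */
};

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return 0 and fill *out if the MZ and PE signatures are present, else -1. */
int parse_pe(const uint8_t *img, size_t len, struct pe_info *out)
{
    uint32_t pe;
    const uint8_t *opt;

    assert(img != NULL && out != NULL);
    if (len < 0x40 || img[0] != 'M' || img[1] != 'Z')
        return -1;

    pe = le32(img + 0x3C);              /* e_lfanew: offset of "PE\0\0" */
    if (pe > len || len - pe < 24 + 70) /* signature + COFF + needed fields */
        return -1;
    if (memcmp(img + pe, "PE\0\0", 4) != 0)
        return -1;

    out->machine = le16(img + pe + 4);  /* first field of the COFF header */
    opt = img + pe + 24;                /* optional header follows COFF */
    out->entry_point = le32(opt + 16);
    out->subsystem = le16(opt + 68);
    return 0;
}
```

For an EFI application, subsystem is expected to equal PE_SUBSYSTEM_EFI_APPLICATION.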

Such binaries can be compiled with the help of the gnu-efi library, whose headers expose the firmware-provided data structures and function prototypes, such as that of the main function. It also includes a basic C I/O library that uses the EFI-defined drivers to reach the peripherals.

The main prototype, as defined by the gnu-efi library [ia32/efibind.h:250] and used in a sample 'hello world' application [apps/t.c:16], is:

    EFI_STATUS efi_main(EFI_HANDLE image_handle, EFI_SYSTEM_TABLE *systab);

On entry to that main, all the EFI features are available via the EFI_SYSTEM_TABLE [efiapi.h:866] structure. The firmware thus directly exposes stdin, stdout and stderr via the systab->{ConIn,ConOut,StdErr} handles.

The EFI_BOOT_SERVICES structure gives the user a reference to the different protocols and drivers via the LocateHandle() and LocateProtocol() functions. The EFI_RUNTIME_SERVICES structure gives direct access to the time and the NvRAM variables.

Booting without a bootloader: the EFI boot stub

As expected, the linux kernel does not use the gnu-efi library. The idea behind the EFI boot stub is to disguise the bzImage seen previously as a valid EFI application. This means setting up an MZ+PE header and all kinds of sneaky, sneaky stuff.

The EFI boot stub became available as of linux 3.3. When the kernel is compiled with the option CONFIG_EFI_STUB=y, the header.S image features made-up MZ+PE headers. The most important field, AddressOfEntryPoint [header.S:144], is named efi_pe_entry in the source tree and is set by the tools/build.c program to either of the following [tools/build.c:274]:

The remaining problem here is that bootloaders usually provide a boot_params data structure. Here, the head_{32,64}.S files use a make_boot_params function [compressed/eboot.c:693] [compressed/head_64.S:214] [compressed/head_32.S:45] in order to set up this structure.

The processor then enters efi_stub_entry [compressed/head_64.S:221] [compressed/head_32.S:55], the offset of which also depends on the architecture the kernel was built for (ia32 or amd64). Since boot protocol 2.11 [boot.txt:57] [boot.txt:1097], the kernel supports EFI handover, meaning bootloaders can hand the remainder of the boot process over to the EFI boot stub. This is where efi_stub_entry comes in: it is that entry point, and its offset is stored in the handover_offset field [header.S:422] [boot.txt:728] if the xloadflags field [header.S:371] is set accordingly [boot.txt:590].
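From the bootloader's point of view, finding that entry point could look like the sketch below. It assumes the offsets given in boot.txt (xloadflags at 0x236, handover_offset at 0x264), the XLF_EFI_HANDOVER_32 and XLF_EFI_HANDOVER_64 flags on bits 2 and 3, and the rule that the 64-bit entry point sits 512 bytes after handover_offset. Because xloadflags only appeared with protocol 2.12, the sketch deliberately refuses older headers; a 2.11 kernel would need a different check. The function name efi_handover_entry is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OFF_HEADER          0x202  /* magic "HdrS" */
#define OFF_VERSION         0x206  /* 2 bytes: boot protocol version */
#define OFF_XLOADFLAGS      0x236  /* 2 bytes */
#define OFF_HANDOVER_OFFSET 0x264  /* 4 bytes */

#define XLF_EFI_HANDOVER_32 (1u << 2)
#define XLF_EFI_HANDOVER_64 (1u << 3)

/* Read an nbytes-wide little-endian value (nbytes <= 4). */
static uint32_t rd_le(const uint8_t *p, int nbytes)
{
    uint32_t v = 0;
    int i;

    for (i = nbytes - 1; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Return the offset of the EFI handover entry point, relative to the
 * start of the protected-mode kernel, or -1 if handover is unavailable
 * for the requested word size. (Hypothetical helper.)
 */
int64_t efi_handover_entry(const uint8_t *image, size_t len, int want_64)
{
    uint32_t version, flags, offset;

    assert(image != NULL);
    if (len < OFF_HANDOVER_OFFSET + 4)
        return -1;
    if (memcmp(image + OFF_HEADER, "HdrS", 4) != 0)
        return -1;

    version = rd_le(image + OFF_VERSION, 2);
    if (version < 0x020c)   /* assumption: rely on xloadflags (2.12+) */
        return -1;

    flags = rd_le(image + OFF_XLOADFLAGS, 2);
    offset = rd_le(image + OFF_HANDOVER_OFFSET, 4);
    if (offset == 0)
        return -1;
    if (!(flags & (want_64 ? XLF_EFI_HANDOVER_64 : XLF_EFI_HANDOVER_32)))
        return -1;

    /* The 64-bit entry point sits 512 bytes after the 32-bit one. */
    return (int64_t)offset + (want_64 ? 512 : 0);
}
```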

The code starting at efi_stub_entry first calls efi_main [compressed/eboot.c:748] (not to be confused with the gnu-efi efi_main we talked about earlier), which performs a basic initialization:

After efi_main returns successfully, the processor just jumps into the newly relocated kernel, according to the values in the boot_params [asm/bootparams.h:111] structure, whose address is stored in %eax at exit [compressed/head_64.S:233].