JOS - PC Bootstrap
01 Nov 2020
Starting from a high-level overview of the JOS lab, we now dig into JOS in this post. We follow Lab 1 of JOS loosely, with the main goal of exploring the PC start-up process up until the kernel is loaded.
OS design is wedded to CPU architecture, and many of its implementation details are CPU architecture specific. For the purpose of this series of posts, we focus on Intel’s x86.
PC bootstrap, often shortened to simply boot, is the start-up routine of a computer. It typically consists of three major phases: BIOS, bootloader and OS kernel.
Physical Address Layout
Before jumping into the execution flow, let’s inspect the physical address space layout:
(4 GiB)   0xffff:ffff +-------------------------+
                      |           ROM           | <--+ 0xffff:fff0 16 bytes reset
                      |  (BIOS, NVRAM, 64 KiB)  |      vector
                      +---         .         ---+
                                   .
                                   .
                      +---                   ---+
                      |       High Memory       |
                      |  (Larger Extended RAM)  |
(1 MiB)   0x0010:0000 +-------------------------+ <---+ Top of 20-bit address space
                      |       aliased ROM       |       in Real Mode.
                      |     (BIOS, 64 KiB)      |
(960 KiB) 0x000f:0000 +-------------------------+
                      |                         |
                      |       I/O Devices       |
                      |                         |
(640 KiB) 0x000a:0000 +-------------------------+
                      |                         |
                      |       Low Memory        |
                      |     (Real Mode RAM)     |
                      |                         | <---+ 0x0000:7c00 First code loaded
          0x0000:0000 +-------------------------+       from disk to RAM.
We will leave the diagram as is for now and come back to it as a reference in the following sections. Before moving on, a quick rundown of the PC start-up: the BIOS runs first, then the bootloader, and finally the OS kernel.
BIOS
After a PC is reset, both main memory (typically DRAM) and the CPU cache (typically SRAM) lose their data and are therefore invalid. The PC therefore fetches its first instruction from firmware (ROM or NVRAM), which retains its contents across a reset. BIOS, the Basic Input/Output System, is the name of the firmware located in this memory. It is the first code that a PC executes on a reset. Generally, the BIOS has the following responsibilities:
- Hardware initialization. The BIOS tests and initializes hardware components (Power-On Self-Test), and initializes the Interrupt Vector Table;
- Searching for a bootable device, loading its Master Boot Record into memory, and passing control to the bootloader.
Modern PCs have replaced BIOS with UEFI, which addresses many design flaws in BIOS.
Reset Vector
The reset vector is the default location from which the processor fetches its first instruction.
By convention, Intel x86 processors place the reset vector 16 bytes below the maximum addressable physical address: the 8086 uses 0xffff0 (16 bytes below 1 MiB), the 80286 uses 0xfffff0 (16 bytes below 16 MiB), and the 80386 and later x86 processors use 0xfffffff0 (16 bytes below 4 GiB).
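The "16 bytes below the top" rule is simple enough to check with a few lines of C. This is only a sketch of the arithmetic; the helper name reset_vector is made up for illustration and is not part of JOS or SeaBIOS.

```c
#include <stdint.h>

/* Reset vector for a CPU with the given number of physical address lines:
 * 16 bytes below the top of the addressable physical space.
 * (Hypothetical helper; assumes address_bits < 64.) */
uint64_t reset_vector(unsigned address_bits)
{
    uint64_t top = (uint64_t)1 << address_bits; /* one past the highest address */
    return top - 16;
}
```

Called with 20, 24 and 32 address bits, it reproduces the three addresses listed above.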
Now if we look at the memory layout, it becomes apparent that the ROM (which holds the BIOS) is mapped to the top 64 KiB of the 32-bit physical address space so that the reset vector lives in non-volatile memory.
Now some may ask ‘But on a PC reset, it starts in real mode, which can only address a 1 MiB/20-bit address space. How can it access memory located near the top of 4 GiB?’ Intel explains that the processor powers on in an Initial Processor Mode similar to Real Mode, but with the top 12 address lines asserted high. So the address of the first instruction becomes:
ffff:fff0
| ||    |
+-+|    |
 | +----+
 |   |
 v   |
fff = Top 12 bits asserted HIGH
     |
     v
   ffff0 = cs<<4 + ip, with cs=0xf000 and ip=0xfff0
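The same computation can be written out in C. This is a sketch under the assumptions stated in the text (cs=0xf000, ip=0xfff0, top 12 address lines forced high for the first fetch); the function names are hypothetical.

```c
#include <stdint.h>

/* Real mode address translation: segment shifted left by 4, plus offset. */
uint32_t real_mode_linear(uint16_t cs, uint16_t ip)
{
    return ((uint32_t)cs << 4) + ip;
}

/* Address of the very first fetch after reset: the same real mode
 * computation, but with the top 12 address lines asserted high. */
uint32_t first_fetch_address(uint16_t cs, uint16_t ip)
{
    return 0xfff00000u | real_mode_linear(cs, ip);
}
```

With cs=0xf000 and ip=0xfff0, the first function yields 0xffff0 and the second 0xfffffff0, the two addresses that appear in the diagram.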
Initial Processor Mode only applies to the very first instruction. After that, the processor goes into normal Real Mode with 20-bit addressing. The exact same question comes right back: how can the processor fetch the second instruction from ROM if it can’t address the memory at the top of the 4 GiB address space? In the same manual, Intel states:
The chipset needs to be able to alias memory below 1 MiB to just below 4 GiB, to continue to access NVRAM.
And that’s the aliased ROM in the layout graph.
Both of these statements can be verified in the lab. To verify that the processor fetches the first instruction from just below 4 GiB and the second instruction from just below 1 MiB, let’s set breakpoints for the first and second instructions in both memory ranges:
# 1st inst high mem: hit
b *0xfffffff0
# 2nd inst high mem: miss
#b *0xffffe05b
# 1st inst low mem: miss
#b *0xffff0
# 2nd inst low mem: hit
b *0xfe05b
To verify the aliased memory, we can inspect the content in both ranges:
(gdb) x/4i 0xffffe05b
0xffffe05b: cmpw $0xffc8,%cs:(%esi)
0xffffe060: jo 0xffffe062
0xffffe062: jne 0xd231d416
0xffffe068: mov %edx,%ss
(gdb) x/4i 0xfe05b
0xfe05b: cmpw $0xffc8,%cs:(%esi)
0xfe060: jo 0xfe062
0xfe062: jne 0xd241d416
0xfe068: mov %edx,%ss
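The relationship between the two dumps is a fixed offset between the two 64 KiB windows. A small sketch, assuming the window bases shown in the layout diagram (0xffff0000 and 0xf0000); the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BIOS_HIGH_BASE 0xffff0000u /* 64 KiB window just below 4 GiB */
#define BIOS_LOW_BASE  0x000f0000u /* 64 KiB window just below 1 MiB */

/* Map an address in the high BIOS window to its alias below 1 MiB. */
uint32_t low_alias(uint32_t addr)
{
    assert(addr >= BIOS_HIGH_BASE); /* only defined for the high window */
    return BIOS_LOW_BASE + (addr - BIOS_HIGH_BASE);
}
```

Applied to 0xffffe05b it gives 0xfe05b, which is why both x/4i listings show the same bytes. Relative jump targets differ between the two listings because the disassembler computes them from the address it was asked to decode at.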
The reset vector transfers execution to the BIOS.
The BIOS is not part of the (virtual) disk image: it resides inside ROM (NVRAM) memory.
QEMU’s default BIOS for i386 is SeaBIOS, which can be overridden with the --bios image option.
To facilitate understanding the BIOS, we can inspect the execution together with the SeaBIOS source code and SeaBIOS documentation.
A SeaBIOS image can be compiled from source (be aware of a potential issue with AMD CPUs).
The Second Instruction
The reset vector sends us to address 0xfe05b with a long jump.
Gdb does not disassemble 16-bit x86 code correctly (even after set architecture i8086, UPDATE).
See how actual execution differs from the disassembled instructions:
(gdb) si # repeated 5 times
0xfe05b: cmpw $0x28,%cs:(%esi)
0xfe062: jne 0xd241d0b2
0xfe066: xor %edx,%edx
0xfe068: mov %edx,%ss
0xfe06a: mov $0x7000,%sp
(gdb) x/5i 0xfe05b
0xfe05b: cmpw $0x28,%cs:(%esi)
0xfe060: bound %eax,(%eax)
0xfe062: jne 0xd241d0b2
0xfe068: mov %edx,%ss
0xfe06a: mov $0x7000,%sp
Notice that the first instruction at 0xfe05b is 7 bytes wide (0xfe062-0xfe05b) according to the execution trace, but only 5 bytes according to the gdb disassembly.
Inspecting the content at the memory location
(gdb) p/x *0xfe05b
$1 = 0x3e83662e
(gdb) p/x *0xfe05f
$2 = 0xf006228
and feeding it to a 16-bit disassembler (or alternatively decoding it by hand with the Intel 8086 Family User’s Manual, page 4-23), we finally get the actual instruction
// objdump -d -M intel -S -mi386 -Maddr16,data16 rom.o
fe05b: 2e 66 83 3e 28 62 00 cmp DWORD PTR cs:0x6228,0x0
which compares the 32-bit value at cs:0x6228 with zero and sets ZF accordingly.
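Going from the two gdb values to the byte sequence shown by objdump only requires remembering that x86 is little-endian, so the least significant byte of each printed word is the first byte in memory. A sketch of that step, with hypothetical helper names:

```c
#include <stddef.h>
#include <stdint.h>

/* Split a 32-bit value, as printed by gdb's p/x on a little-endian
 * target, into the four bytes in the order they sit in memory. */
void unpack_le32(uint32_t word, uint8_t out[4])
{
    for (size_t i = 0; i < 4; i++)
        out[i] = (uint8_t)(word >> (8 * i));
}

/* Rebuild 8 consecutive memory bytes from two adjacent 32-bit dumps. */
void dumps_to_bytes(uint32_t first, uint32_t second, uint8_t out[8])
{
    unpack_le32(first, out);
    unpack_le32(second, out + 4);
}
```

For 0x3e83662e followed by 0x0f006228 this yields 2e 66 83 3e 28 62 00 0f. The first seven bytes are the cmp instruction above, and the trailing 0f is the first byte of the next instruction.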
POST Phase
To overcome the fact that gdb doesn’t provide accurate disassembly in 16-bit mode, we’ll have to reference the SeaBIOS source code instead (note that gdb reports an accurate $ip register even though it fails to disassemble the code correctly, and $ip can be matched against rom16.o/rom.o).
The first executed block:
// romlayout.S
ORG 0xe05b
entry_post:
cmpl $0, %cs:HaveRunPost // Check for resume/reboot
jnz entry_resume
// ENTRY_INTO32(entryfuncs.S):
// Transitions to 32-bit mode and go to the target address.
ENTRY_INTO32 _cfunc32flat_handle_post // Normal entry point
ORG 0xe2c3
.global entry_02
The cmpl was assembled to the cmp instruction in our BIOS image, serving the same purpose of checking the HaveRunPost flag.
This helps the BIOS differentiate a warm boot from a cold boot, where the former does not require POST.
On a cold boot, the normal entry point is selected, which
- transitions to 32-bit mode;
- calls the C function handle_post().
Transitioning to 32-bit protected mode involves several steps:
// romlayout.S
...
transition32:
// Disable irqs (and clear direction flag)
cli
cld
// Disable nmi
...
// enable a20
...
transition32_nmi_off:
// Load IDT and GDT
lidtw %cs:pmode_IDT_info
lgdtw %cs:rombios32_gdt_48
// Enable protected mode
...
movl %ecx, %cr0
// start 32bit protected mode code
...
.code32
// init data segments
...
// goto destination (handle_post() in this case)
jmpl *%edx
Before POST, the BIOS switches the processor into 32-bit protected mode so that it can access the entire physical address space to perform the memory check (and more). After the transition, the processor starts the Power-On Self-Test. Besides hardware testing and setup, a few other interesting events take place:
- Make BIOS writable. After a PC reset, the 64 KiB BIOS ROM is mapped to both 0xf0000 and 0xffff0000 in the physical address space. In this step, 0xf0000 is shadowed to RAM: the region is configured as writable (undone before boot) and the ROM content is copied into it. This allows 1. setting the HaveRunPost flag and 2. faster BIOS execution (RAM speed > ROM speed). See fw/shadow.c for the implementation;
- Set up IVT.
The Interrupt Vector Table is the real mode counterpart of the Interrupt Descriptor Table in protected mode.
The IVT is a simple table where each entry maps an interrupt to its handler, that is, the address of the handling logic.
The IVT is located at 0x0, with each entry occupying 4 bytes. After setting a breakpoint after init_ivt(), we can observe the change in the IVT:
(gdb) x/4x 0
0x0: 0x00000000 0x00000000 0x00000000 0x00000000
(gdb) c
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000e6f2 in ?? ()
(gdb) x/4x 0
0x0: 0xf000ff53 0xf000ff53 0xf000e2c3 0xf000ff53
- Trigger a software interrupt: Interrupt 19.
Interrupt 19 is a special interrupt for the Boot Load Service.
// post.c
void VISIBLE32FLAT
startBoot(void)
{
    ...
    // call16_int() invokes a software interrupt. Notice that the processor is in 32-bit
    // mode, so we need to save current state and set up registers '_farcall16()'.
    call16_int(0x19, &br);
}
When triggered, entry 19 in the IVT instructs the processor to start the boot phase.
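To make the IVT layout concrete: each 4-byte entry holds a real mode far pointer, with the offset in the low 16 bits and the segment in the high 16 bits. The sketch below decodes the values from the gdb dump above; the struct and function names are illustrative only and do not come from SeaBIOS.

```c
#include <stdint.h>

/* A real mode far pointer as stored in one IVT entry. */
struct ivt_target {
    uint16_t segment; /* high 16 bits of the entry */
    uint16_t offset;  /* low 16 bits of the entry */
};

/* Physical address of the IVT entry for a given interrupt vector. */
uint32_t ivt_entry_address(uint8_t vector)
{
    return (uint32_t)vector * 4; /* table starts at 0x0, 4 bytes per entry */
}

/* Split a raw 32-bit entry, as printed by gdb, into segment and offset. */
struct ivt_target ivt_decode(uint32_t raw)
{
    struct ivt_target t = { (uint16_t)(raw >> 16), (uint16_t)(raw & 0xffff) };
    return t;
}

/* Linear address of the handler: segment << 4 + offset. */
uint32_t ivt_handler_linear(uint32_t raw)
{
    struct ivt_target t = ivt_decode(raw);
    return ((uint32_t)t.segment << 4) + t.offset;
}
```

The third entry in the dump, 0xf000e2c3, decodes to f000:e2c3, which lines up with the ORG 0xe2c3 of entry_02 in the romlayout.S excerpt. The entry for interrupt 0x19 sits at address 0x64.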
Boot Phase
Interrupt 19 marks the beginning of boot phase:
// romlayout.S
...
ORG 0xe6f2
.global entry_19_official
entry_19_official: // address stored in IVT entry 19
jmp entry_19
...
entry_19:
// enter 32-bit protected mode and call C function handle_19()
ENTRY_INTO32 _cfunc32flat_handle_19
Then the processor loads the boot sector from the hard disk into 0x7c00, switches to 16-bit mode and jumps there:
// boot.c
// 32-bit code entry for interrupt 19 handling
void VISIBLE32FLAT handle_19(void) {
...
do_boot(0);
}
// Select the boot device and try to boot from it; on failure, invoke the designated boot failure 'interrupt 18'.
do_boot(int seq_nr) {
...
switch (ie->type) {
...
case IPL_TYPE_HARDDISK:
printf("Booting from Hard Disk...\n");
boot_disk(0x80, 1);
break;
...
}
// Boot failed: invoke the boot recovery function
...
call16_int(0x18, &br);
}
// Loads the first sector of boot device and checks for signature.
boot_disk(u8 bootdrv, int checksig) {
u16 bootseg = 0x07c0;
// Invoke hard disk operation 'interrupt 13'
...
call16_int(0x13, &br);
if (checksig) {
struct mbr_s *mbr = (void*)0;
if (GET_FARVAR(bootseg, mbr->signature) != MBR_SIGNATURE) { // 0xaa55
printf("Boot failed: not a bootable disk\n\n");
return;
}
}
...
// bootseg:bootip = 0x0:0x7c00
call_boot_entry(SEGOFF(bootseg, bootip), bootdrv);
}
// Switch to 16-bit mode and jump to 0x7c00.
call_boot_entry(struct segoff_s bootsegip, u8 bootdrv)
{
...
farcall16(&br);
}
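The two checks that matter in boot_disk() can be sketched in plain C: the boot signature 0xaa55 is stored little-endian in the last two bytes of the 512-byte sector, and the segment:offset pair must resolve to 0x7c00. This is a standalone illustration with hypothetical names, not SeaBIOS code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE   512
#define MBR_SIGNATURE 0xaa55

/* True if the sector ends with the boot signature (bytes 0x55, 0xaa). */
bool has_boot_signature(const uint8_t sector[SECTOR_SIZE])
{
    uint16_t sig = (uint16_t)(sector[510] | (sector[511] << 8));
    return sig == MBR_SIGNATURE;
}

/* Real mode address of the boot entry point. */
uint32_t boot_entry_linear(uint16_t seg, uint16_t ip)
{
    return ((uint32_t)seg << 4) + ip;
}
```

Both 0x07c0:0x0000 and 0x0000:0x7c00 resolve to the same linear address 0x7c00, so the two ways of writing the boot entry are equivalent as far as the fetched bytes are concerned.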
Why 0x7c00 though?
There doesn’t seem to be a definitive answer for this somewhat arbitrary location, but the answer might coincide with the explanation of many other peculiarities in x86: legacy reasons.
A commonly cited explanation is that early PCs were expected to have as little as 32 KiB of RAM (top at 0x8000). With the lowest addresses (from 0x0) used for the IVT, the logical approach was to load the boot sector near the top.
0x7c00 is 1 KiB, or two disk sectors, below that top.
JOS Bootloader
Once we reach 0x7c00, we leave BIOS-land and arrive in bootloader territory.
A bootloader is the program that sets up and loads the OS kernel; GNU GRUB, for example, is a bootloader that can be used to load a Linux kernel.
The JOS bootloader is an extremely simplified bootloader with only two jobs:
- Switch to 32-bit protected mode.
This step (boot/boot.S) is not much different from SeaBIOS’s romlayout.S:transition32 procedure: disable interrupts, enable A20, load the GDT, turn on protected mode and transfer control (to bootmain in boot/main.c);
- Load the JOS kernel from disk into memory and hand over control to the kernel. The kernel itself is an Executable and Linkable Format (ELF) binary, and the bootloader loads the segments of the binary into memory as described by the kernel’s ELF headers.
A separate post inspects how a JOS image is generated; from here on we focus on the code.
Assembly Part
The assembly part in boot.S is very similar to the work done before the BIOS POST phase:
1. Disable interrupts (cli), so that hardware interrupts no longer reach the handlers installed in the IVT during BIOS POST. Interrupts are re-enabled after JOS sets up its own interrupt handlers;
2. Zero the segment registers (%ds, %es and %ss).
The %cs register is zeroed in the transition from the BIOS boot phase (farcall16), and it cannot be modified directly;
3. Enable A20. Like the BIOS, the bootloader undoes the A20 wrap-around in order to access memory beyond the 20-bit (1 MiB) limit;
4. Enable Protected Mode.
# Load the Global Descriptor Table
lgdt gdtdesc
# Enable Protected Mode by setting bit in CR0
movl %cr0, %eax
orl $CR0_PE_ON, %eax
movl %eax, %cr0
# After enabling 32-bit Protected Mode in CR0, the actual switch of
# CPU mode occurs after 'ljmp', where %cs register is updated.
ljmp $PROT_MODE_CSEG, $protcseg
The processor keeps executing in 16-bit mode until the ljmp because it needs extra information about the code segment to apply protection, which is only obtained when %cs is reloaded and the segment is looked up in the GDT.
The first argument $PROT_MODE_CSEG is 0b1000.
Its last 3 bits are the Table Indicator (bit 2) and the Requested Privilege Level (bits 1 and 0); in this case the GDT is selected (instead of the LDT) and the RPL is 0.
The details of the Segmentation Mechanism are discussed in a separate post.
Bits 3 and up are the index into the GDT, and in this case it is 1, the code segment:
gdt:
SEG_NULL # null seg
SEG(STA_X|STA_R, 0x0, 0xffffffff) # code seg
SEG(STA_W, 0x0, 0xffffffff) # data seg
Intel Manual Vol. 3A 3.4.2 Segment Selectors mandates that the first (0th) entry in the GDT is not to be used, hence SEG_NULL.
The third entry, the data segment PROT_MODE_DSEG (0b10000), is then loaded into the remaining segment registers (%ds, %es, %fs, %gs, %ss).
Segmentation is mostly unused in JOS. In summary, its setup is:
- The LDT is not used.
- There are three segments in the GDT: one mandatory null segment, one code segment and one data segment. The code segment and data segment are both set up with base 0 and maximum size, essentially not using any segment limit protection.
- The code segment is loaded into CS and the data segment into all other segment registers: PROT_MODE_CSEG points to the code segment and is loaded into the CS register by the ljmp, while PROT_MODE_DSEG is loaded into all other segment registers.
For details of the segmentation mechanism, see Segmentation Mechanism.
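To see what "base 0 and maximum size" looks like in the GDT, the sketch below packs a descriptor into its 8-byte form. It assumes the SEG macro sets the present bit, 4 KiB granularity and the 32-bit flag, and that STA_X, STA_R and STA_W have the values 0x8, 0x2 and 0x2; these are believed to match JOS’s inc/mmu.h but should be treated as assumptions. make_descriptor is a made-up name.

```c
#include <stdint.h>

/* Assumed type bits (believed to match JOS's inc/mmu.h). */
#define STA_X 0x8u /* executable segment */
#define STA_W 0x2u /* writable (data segments) */
#define STA_R 0x2u /* readable (code segments) */

/* Pack a flat-model segment descriptor with 4 KiB granularity. */
uint64_t make_descriptor(uint32_t type, uint32_t base, uint32_t limit)
{
    uint64_t d = 0;
    d |= (uint64_t)((limit >> 12) & 0xffff);              /* limit 15:0, in pages */
    d |= (uint64_t)(base & 0xffffff) << 16;               /* base 23:0 */
    d |= (uint64_t)(0x90u | type) << 40;                  /* present, DPL 0, code/data */
    d |= (uint64_t)(0xc0u | ((limit >> 28) & 0xf)) << 48; /* G=1, 32-bit, limit 19:16 */
    d |= (uint64_t)((base >> 24) & 0xff) << 56;           /* base 31:24 */
    return d;
}
```

Under these assumptions the code segment becomes 0x00cf9a000000ffff and the data segment 0x00cf92000000ffff, the usual pair of flat 4 GiB descriptors.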
5. Stack setup and control transfer.
movl $start, %esp
call bootmain
$start (0x7c00) is selected as the top of the stack.
It grows downwards, away from the bootloader code loaded at 0x7c00.
Lastly, the bootloader switches to its C code part.
C Code Part
We know that
- the bootloader resides in the first sector of the disk image and the kernel in the successive sectors;
- the segments need to be loaded at their load memory address (LMA).
The C code in boot/main.c does essentially this: it reads the kernel part of the disk image and loads its segments to the appropriate locations
// read initial sectors to a scratch space in the memory at ELFHDR
readseg((uint32_t) ELFHDR, SECTSIZE*8, 0);
// Note: readseg automatically shifts 1 sector because kernel starts
// at sector 1
// load each program segment to their LMAs
ph = (struct Proghdr *) ((uint8_t *) ELFHDR + ELFHDR->e_phoff);
eph = ph + ELFHDR->e_phnum;
for (; ph < eph; ph++)
readseg(ph->p_pa, ph->p_memsz, ph->p_offset);
and transfers control over to the kernel
((void (*)(void)) (ELFHDR->e_entry))();
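The program header walk can be tried outside the bootloader with an in-memory "disk image". The sketch below uses reduced, hypothetical stand-ins for JOS’s struct Elf and struct Proghdr (only the fields used in the loop above) and replaces readseg with memcpy; it illustrates the control flow and is not the real on-disk ELF layout.

```c
#include <stdint.h>
#include <string.h>

/* Reduced, hypothetical stand-ins for JOS's struct Elf and struct Proghdr. */
struct mini_elf  { uint32_t e_entry, e_phoff, e_phnum; };
struct mini_phdr { uint32_t p_offset, p_pa, p_memsz; };

/* Copy every program segment from the image to its load address inside
 * 'mem' and return the entry point. Like the bootloader, this trusts the
 * image contents; segments that do not fit in 'mem' are skipped. */
uint32_t load_segments(const uint8_t *image, uint8_t *mem, uint32_t mem_size)
{
    struct mini_elf hdr;
    memcpy(&hdr, image, sizeof hdr);

    for (uint32_t i = 0; i < hdr.e_phnum; i++) {
        struct mini_phdr ph;
        memcpy(&ph, image + hdr.e_phoff + i * sizeof ph, sizeof ph);

        if (ph.p_pa > mem_size || ph.p_memsz > mem_size - ph.p_pa)
            continue; /* would overflow the toy memory */

        /* stands in for readseg(ph->p_pa, ph->p_memsz, ph->p_offset) */
        memcpy(mem + ph.p_pa, image + ph.p_offset, ph.p_memsz);
    }
    return hdr.e_entry;
}
```

The returned value plays the role of ELFHDR->e_entry: in the bootloader it is cast to a function pointer and called, which is the hand-over to the kernel.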
At the kernel entry:
- The ELF e_entry is specified by the _start label in kern/entry.S;
- To resolve the conflict of linking the kernel code at a high address while loading it into low memory, paging is enabled with high memory mapped to low memory in kern/entrypgdir.c. This addresses the disparity between VMA and LMA.
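The VMA/LMA relationship is a constant offset. The sketch below assumes the offset is 0xf0000000, the KERNBASE value believed to be used by JOS in inc/memlayout.h; the value and the helper names are assumptions for illustration.

```c
#include <stdint.h>

/* Assumption: JOS's KERNBASE, the virtual address that maps to physical 0. */
#define KERNBASE 0xf0000000u

/* Link-time (virtual) address to load-time (physical) address. */
uint32_t va_to_pa(uint32_t va)
{
    return va - KERNBASE;
}

/* Load-time (physical) address to link-time (virtual) address. */
uint32_t pa_to_va(uint32_t pa)
{
    return pa + KERNBASE;
}
```

Under that assumption, a kernel linked at 0xf0100000 is loaded at 0x00100000 (1 MiB), just above the aliased ROM in the layout diagram, and the initial page directory makes both addresses reach the same bytes.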