IA-64: The (EFI) boot loader
Notes here apply to a certain extent to the SKI boot loader as well, of course. However, the focus is on the EFI boot loader.
Currently the kernel is loaded at a fixed (physical) address. This is required because the kernel image uses direct-mapped (region 7) virtual addressing. This needs to change for two reasons:
- The EFI environment is dynamic and it is not guaranteed that the memory is free to use. Heck, it may not even be present.
- In a NUMA environment you typically do not want direct-mapped virtual addressing for code because that prevents replication. Replication is important for keeping the code close to the processor that executes it.
The above can be addressed as follows:
- Split region 5 in half. The lower half [0xA00..00-0xAFF..FF] is used for KVA. The upper half [0xB00..00-0xBFF..FF] is used for special-purpose (non-identity) mappings. Note that the notion of halves is dependent on the number of unimplemented virtual address bits.
- Link the kernel at virtual address 0xBFFC000000000000 and have each node map this somewhere in local memory. The reason for this particular address is that there are at least 51 implemented virtual address bits and the unimplemented address bits are sign-extended from the MSB.
- The EFI boot loader typically allocates memory and loads the kernel there, but I'm not sure EFI is NUMA-aware and makes sure the memory is allocated local to the BSP. If not, then we should not allocate. However, we should at least make sure that we load the kernel in free memory.
- The kernel can replicate its own image based on whether NUMA support is enabled.
A scheme that implements the above is outlined below:
- Allocate a 4KB page somewhere in physical memory that's close to the BSP. This page is hardwired at 0xbffffffffffff000 and serves as a page table for the special purpose mapping used by the loader for the kernel, the modules, relocation data and other boot data. The page size used is 1MB and the virtual base address covered by this page table is 0xbffc000000000000. Hence, the 4KB page table allows for 512MB of loaded code and data.
- The loader allocates 1MB-aligned pages of 1MB somewhere in physical memory that's close to the BSP whenever it needs memory for loading modules (including the kernel). It allocates these 1MB pages from the high end of the available physical memory and updates the special purpose page table. The loader hardwires the first 1MB page of the kernel image under the assumption that it contains the IVT and hence contains the code that knows how to map subsequent pages. The kernel is linked at 0xbffc000000000000.
- When a NUMA-capable kernel has booted and wants to replicate itself, it performs steps 1 and 2 for each node in the machine and then fires up the monarch processor in that node. It needs to tell the monarch where the page table is in physical memory.
- When the BSP or monarch in a node fires up the APs, it only needs to tell the AP where its copy of the page table is. The AP can hardwire the page table and then hardwire the first 1MB page given the page table. From that moment on the AP can switch to virtual addressing.
- We should support a loader page table that's larger than 4KB. With a 32KB page table we can map 4GB of loader virtual memory. This should be enough even in the not so immediate future. All we basically need to do is tell the kernel what the size of the page table is, alongside the physical address of the table.
- The loader should cluster text segments and data segments. The text segments of all modules, when clustered, can be replicated whereas the data segments of the modules cannot. By clustering text and data segments, replication is made easy. For this to work, we probably need to switch to object file modules.
Basically, for a processor to switch to virtual addressing, the physical address of the 4KB page table is needed. The BSP gets it from the loader. All monarchs get it from the BSP and all APs (other than monarchs) get it from the monarch (or BSP) in their respective nodes.
Another thing the loader could/should do is switch EFI to virtual addressing. Currently it's the kernel that does that. This works, but requires that the kernel supports calling EFI in physical addressing mode, which is only needed to switch EFI to virtual addressing. If we do it in the loader, we won't have to worry about it in the kernel and it also allows us to simplify the memory map that we pass on to the kernel. We now need to give the complete memory map to the kernel because the kernel needs to pass it along to EFI when EFI is switched to virtual addressing, but most of the entries or details are not of interest to the kernel. In fact, the EFI memory map is too detailed and contains separate entries for memory that the kernel can treat as a single entry. Also, locality information can be collected in the loader and passed on to the kernel in the memory map if we have the freedom to create our own. Anyway, it might be beneficial to create some distance between the EFI memory map and the kernel.
Relocations are currently handled by the kernel and we need to relocate the kernel before we can even think of setting up the system console. This is a bit odd, because we can relocate the kernel in the loader with all the debugging ability EFI gives us. So, why not have the loader relocate the kernel and the preloaded modules? The kernel then only needs to relocate modules that are loaded at runtime. If we allow the kernel to use "services" provided by the loader, such as relocation of code, we won't have to duplicate the code -- interesting, the loader providing services to the kernel where functionality overlaps. This, BTW, would probably work very nicely if we ever need to have an EFI bytecode interpreter to support certain drivers...
The bootinfo structure is running out of spare fields. Maybe we should switch to a discontinuous array of (tag,val) pairs. This is easily extended without having to worry about backward compatibility. This is especially important if we're going to do more of the initialization in the loader. The more we do in the loader, the more data we need to pass on to the kernel. Something like the following is probably a good definition:
- The bootinfo is passed in an array of (tag,val) pairs, with the bootinfo pointer pointing to the first 8-byte aligned tag in the array. The value of a tag immediately follows the corresponding tag. Each tag and value is 64 bits wide. The last tag in an array segment is either the end tag (with an arbitrary value) or a link tag whose value is a pointer to another array of (tag,val) pairs. The end tag signals the end of all the (tag,val) array segments; the link tag starts a new array segment.
The following (tag,val) pairs are some of the pairs we need. See the current bootinfo structure for others:
(BI_TAG_FIRST, <magic>) -- The very first tag of the bootinfo. Used for sanity checks.
(BI_TAG_LAST, <magic>) -- The very last tag of the bootinfo. Signals the end of the bootinfo. The value can be anything.
(BI_TAG_LINK, <pointer>) -- The pointer value points to the first tag of the next array segment. That array segment does not start with BI_TAG_FIRST.
(BI_TAG_LPT_PA, <physaddr>) -- The physical address of the loader page table. The page table is naturally aligned.
(BI_TAG_LPT_SZ, <int>) -- The size in bytes of the loader page table.
(BI_TAG_LPT_TR, <int>) -- Translation register number used for the LPT mapping.
(BI_TAG_KERN_SZ, <int>) -- Kernel size in bytes.
(BI_TAG_KERN_TR, <int>) -- Translation register of initial kernel mapping.
The following are some of the tags we may like to add as a way to simplify the pre-console initialization:
- BI_TAG_NITR -- Number of ITR registers the processor has.
- BI_TAG_NDTR -- Number of DTR registers the processor has.
- BI_TAG_FREQ_ITC -- Frequency of the ITC counter.
- BI_TAG_FREQ_CPU -- Frequency of the processor.
- BI_TAG_FREQ_BUS -- Frequency of the bus.
- BI_TAG_PAL_PA -- Physical address of the PAL code.
- BI_TAG_PAL_VA -- Virtual address of the PAL code.
- BI_TAG_PAL_TR -- Translation register used for the PAL mapping.