arm64 pmap TODO list
Ordered from highest to lowest priority
Enable hardware updates to the accessed flag and dirty state in page table entries. r356207
Implement dirty bit emulation.
I would suggest making the pmap lock an rw lock instead of a simple mutex and using a read lock in pmap_fault(). The read lock guarantees that l2 and l3 tables won't be added to or removed from the pmap, while still allowing pmap_fault() to make atomic changes to individual entries.
I think that it would be easier to implement dirty bit emulation with an ATTR_SW_RDONLY flag (specifically, the changes to pmap_protect() would be simpler), but an ATTR_SW_RW flag would lead to code that could also support the hardware DBM feature in newer revisions of the architecture.
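Roughly, the fault-side update might look like the following sketch, using the ATTR_SW_RW naming from above; PMAP_RLOCK()/PMAP_RUNLOCK() stand in for the proposed rw lock, and the real pmap_fault() would also decode the ESR to confirm this is a write permission fault on a managed mapping:

    /*
     * Sketch only: dirty bit emulation on a write fault.  ATTR_SW_RW,
     * PMAP_RLOCK(), and PMAP_RUNLOCK() are hypothetical names.
     */
    static int
    pmap_emulate_dirty(pmap_t pmap, vm_offset_t va)
    {
        pt_entry_t *pte, old;
        int lvl, rv;

        rv = KERN_FAILURE;
        PMAP_RLOCK(pmap);               /* l2/l3 tables cannot come or go */
        pte = pmap_pte(pmap, va, &lvl);
        if (pte != NULL) {
            old = pmap_load(pte);
            /* A real version would also check the level (block vs. page). */
            if ((old & ATTR_SW_RW) != 0 && (old & ATTR_AP_RW_BIT) != 0) {
                /*
                 * Logically writable but currently mapped read-only:
                 * atomically mark the entry dirty (writable) and accessed,
                 * then discard the stale TLB entry.
                 */
                if (atomic_fcmpset_64(pte, &old,
                    (old & ~ATTR_AP_RW_BIT) | ATTR_AF) != 0) {
                    pmap_invalidate_page(pmap, va);
                    rv = KERN_SUCCESS;
                }
            }
        }
        PMAP_RUNLOCK(pmap);
        return (rv);
    }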
- Eliminate unnecessary I/D-Cache coherence operations.
On a machine with a PIPT L1 I-cache, I tried reenabling selective "ic" operations in arm64_icache_flush_range() (rather than flushing the entire I-cache), but performance got slightly worse.
BUG: pmap_enter_l2() is often used to map code superpages. However, we do not currently call arm64_icache_flush_range(). https://reviews.freebsd.org/D31181
These coherence operations can be disabled on some newer cores, e.g., Graviton 2, which implement coherence in hardware. Use an IFUNC?
On these cores, the "ic" instruction never broadcasts an invalidation, so most of the overhead is automatically eliminated.
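One possible shape for the ifunc approach, assuming the existing flush code is renamed arm64_icache_flush_range_impl and that CTR_EL0.DIC advertises hardware I-cache coherence (the CTR_DIC_MASK name is illustrative):

    /* Sketch: pick the I-cache maintenance routine once, at boot, via ifunc. */
    static void
    arm64_icache_flush_range_nop(vm_offset_t va __unused, vm_size_t len __unused)
    {
        /* Hardware keeps the I-cache coherent (e.g., Graviton 2). */
    }

    DEFINE_IFUNC(, void, arm64_icache_flush_range, (vm_offset_t, vm_size_t))
    {
        /* CTR_EL0.DIC: I-cache invalidation to PoU is not required. */
        if ((READ_SPECIALREG(ctr_el0) & CTR_DIC_MASK) != 0)
            return (arm64_icache_flush_range_nop);
        return (arm64_icache_flush_range_impl);
    }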
Implement access bit emulation.
See dirty bit emulation.
Implement pmap_advise(). (This task has accessed and dirty bit emulation as prerequisites.)
Implement ASID support.
Needed for KPTI (which is itself only needed as a mitigation on Cortex-A75).
BUG: We perform lazy ASID allocation, meaning that we don't replace a pmap's initially invalid ASID until we are about to run the process for the first time. (I did this for the sake of hypothetical hardware that only has 8-bit ASIDs.) fork() prefaults a lot of read-only/executable mappings through pmap_copy() before we ever run the process. Suppose that the page daemon decides to reclaim one of these pages and thus has to destroy one of these prefaulted mappings before the process has run. pmap_remove_all() will call pmap_invalidate_page(), which will fail the "invalid ASID" KASSERT.
Prefaulted mappings never have ATTR_AF set, so teaching pmap_remove_all() to only call pmap_invalidate_page() when ATTR_AF is set would both address the problem and eliminate some unnecessary TLB invalidations (see the sketch below). r354792
BUG: A similar scenario can arise within reclaim_pv_chunk(). r354860
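A minimal sketch of the conditional invalidation suggested for pmap_remove_all() above; the helper name is hypothetical, and tpte is the PTE value that was just removed:

    /*
     * Sketch: a mapping whose accessed flag was never set cannot be held
     * in the TLB, so skip the shootdown.  This avoids the "invalid ASID"
     * assertion for never-run pmaps and some needless invalidations.
     */
    static void
    pmap_invalidate_page_if_accessed(pmap_t pmap, vm_offset_t va, pt_entry_t tpte)
    {
        if ((tpte & ATTR_AF) != 0)
            pmap_invalidate_page(pmap, va);
    }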
BUG: pmap_enter_l2() (and amd64's pmap_enter_pde()) updates the ref_count on kernel page table pages. r355883
- Add support for L2_BLOCK mappings to pmap_kenter().
- Start using the ATTR_CONTIGUOUS flag.
[In progress: alc] Specifically, I would start by writing a "demotion" function and putting in hooks to call it from, e.g., pmap_enter(). Then, start setting ATTR_CONTIGUOUS in pmap_enter_object().
This may reduce the number of access bit emulation faults. Maybe we want to preset ATTR_AF on the contiguous PTEs since we have to treat them as accessed anyway.
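For reference, a sketch of filling an aligned, physically contiguous 64KB run (with 4KB pages, ATTR_CONTIGUOUS covers 16 L3 entries). L3C_ENTRIES and the helper name are made up, and all 16 entries must share attributes and be naturally aligned in both VA and PA:

    #define L3C_ENTRIES     16      /* 64KB contiguous run with 4KB pages */

    /* Sketch: preset ATTR_AF, since the run has to be treated as accessed as a unit. */
    static void
    pmap_fill_l3c(pt_entry_t *l3p, vm_paddr_t pa, pt_entry_t attrs)
    {
        int i;

        for (i = 0; i < L3C_ENTRIES; i++)
            pmap_store(&l3p[i], (pa + ptoa(i)) | attrs |
                ATTR_CONTIGUOUS | ATTR_AF | L3_PAGE);
    }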
- Consider doing promotion in pmap_fault().
- This could eliminate the need to explicitly write protect or destroy mappings in some places just so that a later fault can trigger (re)promotion.
How will this interact with hardware DBM and AF?
- Avoid back-to-back calls to pmap_fault() on a write access to a previously unaccessed virtual page: the first to set ATTR_AF and the second to clear ATTR_AP_RW_BIT.
I suspect that this may not occur often because pmap_enter() presets ATTR_AF. Only pmap_enter_object() and pmap_enter_quick() create mappings without ATTR_AF set, but these are read-only mappings.
Replace many of the pmap_load_store()s, which perform an atomic swap, by a simpler pmap_store().
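Roughly, the distinction (sketched as thin inlines over the 64-bit atomics):

    static __inline pt_entry_t
    pmap_load_store(pt_entry_t *ptep, pt_entry_t pte)
    {
        return (atomic_swap_64(ptep, pte));     /* returns the old entry */
    }

    static __inline void
    pmap_store(pt_entry_t *ptep, pt_entry_t pte)
    {
        atomic_store_64(ptep, pte);             /* old entry is not needed */
    }

Call sites that ignore pmap_load_store()'s return value are the candidates for conversion.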
- pmap_enter() need not perform a TLB invalidation on a wiring change. In places like pmap_enter() and pmap_remove(), we don't need to invalidate the TLB if ATTR_AF was clear in the old entry.
- pmap_invalidate_range() should flush the entire TLB if the range exceeds a threshold (see the sketch below).
With TLBI you can choose whether "page walk cache" entries are flushed. So operations like pmap_protect() that do not deallocate l2 or l3 tables could use a variant of pmap_invalidate_range() that only invalidates leaf-level entries. 4ccd6c137f5b
Given that the TLBI "is" variants broadcast the invalidation to other cores, I don't see why we are pinning to a core at the start of pmap_invalidate_*(). r355145
BUG: Preemption after an "ic" or "tlbi" instruction but before a "dsb" instruction by the same processor is problematic. r355427
On a related note, I would think that pmap_switch() need not use the "is" variant.
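A sketch of the threshold idea from the pmap_invalidate_range() item above; the constant's name and value are guesses, and a real version would issue the per-page "tlbi" instructions inline with a single trailing "dsb"/"isb" rather than calling pmap_invalidate_page() in a loop:

    #define PMAP_INVALIDATE_MAX     32      /* pages; tuning value is a guess */

    void
    pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
    {
        vm_offset_t va;

        if ((eva - sva) >= PMAP_INVALIDATE_MAX * PAGE_SIZE) {
            /* Large range: one full flush beats a long string of "tlbi"s. */
            pmap_invalidate_all(pmap);
            return;
        }
        for (va = sva; va < eva; va += PAGE_SIZE)
            pmap_invalidate_page(pmap, va);
    }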
Simplify the superpage case in pmap_clear_modify(). It can assume that the PTE is valid after demotion. (This simplification can also be made elsewhere, e.g., amd64.)
Optimize pmap_remove_l3_range() by avoiding repeated PHYS_TO_VM_PAGE() calls that recompute the same vm_page_t on every iteration. r355787
Remember that arm64 uses VM_PHYSSEG_SPARSE, so PHYS_TO_VM_PAGE() has to iterate over vm_phys_segs[].
- Should pmap_remove_pages() remember the last l3pg as a hint to avoid PHYS_TO_VM_PAGE() calls?
Change vmspace0's pmap to use a zeroed physical page as the root of the ttbr0 page table. Currently, it inherits an identity mapping from locore.S. https://reviews.freebsd.org/D22606
- Port r352874 from amd64: Defer freeing of PV chunk pages in pmap_remove_pages() until after the last PV list lock is released.
[In progress: andrew]
Use the ttbr CnP (Common not Private) flag.
[In progress: andrew] Not seeing any performance improvements on current (2021) hardware.
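For reference, CnP is bit 0 of TTBR0_EL1/TTBR1_EL1 (FEAT_TTCNP). A sketch of where it would be applied; TTBR_CnP and have_ttcnp are hypothetical names:

    #define TTBR_CnP        (1UL << 0)      /* Common not Private (FEAT_TTCNP) */

    static bool have_ttcnp;                 /* hypothetical: set from the CPU ID registers at boot */

    /* Sketch: OR the flag into the value that pmap_switch() loads into ttbr0_el1. */
    static uint64_t
    pmap_ttbr0_with_cnp(pmap_t pmap)
    {
        uint64_t ttbr;

        ttbr = pmap_to_ttbr0(pmap);     /* root table address | ASID */
        if (have_ttcnp)
            ttbr |= TTBR_CnP;
        return (ttbr);
    }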