List of interesting issues with ZFS (in no particular order)
LOR between page "lock" and z_teardown_lock
One thread can have a page "busy" and then stumble into z_teardown_lock while trying to write the page out. Another thread could be doing a rollback: it would hold z_teardown_lock and then try to discard potentially stale pages of a vnode, for which it would need to wait for each page to become "unbusied".
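A minimal sketch of the two conflicting acquisition orders (the function names are only illustrative of the FreeBSD page-busy and ZFS teardown code paths, not exact call sites):

    /* Thread A: writing out a dirty page of a ZFS vnode */
    vm_page_busy(m);                                        /* the page is now "busy" */
    /* ... enters ZFS to perform the write ... */
    rrw_enter(&zfsvfs->z_teardown_lock, RW_READER, FTAG);   /* blocks behind thread B */

    /* Thread B: rollback, i.e. teardown/suspension of the filesystem */
    rrw_enter(&zfsvfs->z_teardown_lock, RW_WRITER, FTAG);   /* suspends the filesystem */
    /* ... now discard potentially stale pages of the vnode ... */
    vm_object_page_remove(vp->v_object, 0, 0, FALSE);       /* waits for each page to be
                                                                un-busied, i.e. on thread A */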
Memory allocations in ARC
Low memory conditions result in ZFS ARC deadlocks.
There are some memory allocations in ARC that are done while holding locks; the buffer hash lock is the worst offender. Given the structure of the code (e.g. sometimes the lock acquisition and the memory allocation are a couple of functions apart) it is quite hard to move all allocations outside of the lock scope.
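A rough sketch of the problematic pattern (paraphrased; HDR_LOCK and kmem_cache_alloc exist in the ARC code, but the real call chain is more convoluted):

    kmutex_t *hash_lock = HDR_LOCK(hdr);            /* per-bucket buffer hash lock */
    mutex_enter(hash_lock);
    /* ... a couple of functions deeper ... */
    buf = kmem_cache_alloc(buf_cache, KM_SLEEP);    /* may sleep for memory while the
                                                       hash lock is still held */
    mutex_exit(hash_lock);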
Some possible remedies (but no real cure yet):
- the arc_lowmem hook should never block (see the sketch after this list), in order to avoid the following deadlocks between the ARC reclaim thread and:
- a thread doing a memory allocation in ARC and ending up in arc_lowmem
- the pagedaemon calling arc_lowmem while other threads in ARC memory allocations are already waiting on the pagedaemon (via VM_WAIT)
- arc_evict_needed should check for KVA shortage even on amd64
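A sketch of a non-blocking arc_lowmem along the lines of the first remedy above; it only records the shortage and wakes the reclaim thread (names follow the FreeBSD arc.c of this vintage and may differ):

    static void
    arc_lowmem(void *arg __unused, int howto __unused)
    {
            mutex_enter(&arc_reclaim_thr_lock);
            needfree = 1;                           /* noted by arc_reclaim_needed() */
            cv_signal(&arc_reclaim_thr_cv);         /* wake up the ARC reclaim thread */
            mutex_exit(&arc_reclaim_thr_lock);
            /*
             * Crucially, do not wait here for the reclaim to finish: the caller
             * may be the pagedaemon or a thread already inside an ARC allocation,
             * and blocking here re-creates the deadlocks described above.
             */
    }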
Memory allocation in zio aggregation
A new aggregated zio may be created. Its memory is allocated with the KM_SLEEP / M_WAITOK flag, so the allocation may sleep. Because this happens in the ZFS txg_sync thread, no new write operations can go through while it sleeps. Thus e.g. the pagedaemon could get stuck in zfs_write and would not be able to page out and free any memory. This leads to a deadlock.
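The allocation in question boils down to something like this (paraphrased from zio_create(); on FreeBSD KM_SLEEP is translated to M_WAITOK):

    /* zio_create(), reached from vdev queue aggregation in the txg_sync thread */
    zio = kmem_cache_alloc(zio_cache, KM_SLEEP);
    /*
     * If this sleeps waiting for free memory, txg_sync stops making progress;
     * zfs_write() callers, including the pagedaemon trying to clean pages,
     * then block behind the stalled txg, so nobody can free the memory that
     * the allocation above is waiting for.
     */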
ZFS suspension is not visible to VFS
There are situations when VFS actively tries to avoid sleeping on locks. But because ZFS suspension is completely hidden from VFS, it may end up sleeping on z_teardown_lock. This could lead to a deadlock if e.g. the vnlru code gets stuck on a vnode from a suspended filesystem. New vnode allocations may then have to sleep waiting for vnlru. These allocations could be requested by threads that are already in ZFS transactions, so those threads may block txg quiescing and thus the sync thread may get stuck. But the sync thread could be required by the suspension (z_teardown_lock) owner, which wants to perform an operation like a rollback that must go through the sync thread.
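For reference, every ZFS vnode operation enters the filesystem through something like the following (a paraphrase of the ZFS_ENTER() macro); the sleep on z_teardown_lock happens entirely inside ZFS, so VFS has no way to know that the vnode belongs to a suspended filesystem:

    #define ZFS_ENTER(zfsvfs) {                                             \
            rrw_enter(&(zfsvfs)->z_teardown_lock, RW_READER, FTAG);         \
            if ((zfsvfs)->z_unmounted) {                                    \
                    ZFS_EXIT(zfsvfs);                                       \
                    return (EIO);                                           \
            }                                                               \
    }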
Separate KVA submap for ZFS allocations
ZFS usage patterns fragment KVA a lot. It seems to make sense to use a dedicated submap for ZFS buffer allocations. The submap should be sized based on the maximum ARC size. It remains to be determined whether only data buffers should be allocated from the submap, or all of the buffers.
This requires extending the UMA API.
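A sketch of the VM side, assuming the submap is created with kmem_suballoc() and sized by the ARC limit (zfs_map and the zfs_kva_* variables are hypothetical names; the missing piece is the UMA extension that would let the ZFS zones allocate their slabs from this map):

    static vm_offset_t zfs_kva_min, zfs_kva_max;
    static vm_map_t zfs_map;

    /* at ZFS initialization time: size the dedicated submap by the maximum ARC size */
    zfs_map = kmem_suballoc(kernel_map, &zfs_kva_min, &zfs_kva_max,
        zfs_arc_max, FALSE);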
Allow ARC to dig deeper into available memory?
... or at least allow it to be woken up earlier from VM_WAIT.
In other words, try to emulate KM_PUSHPAGE.
Not sure if this is a good idea overall, but may have some merit.
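One possible emulation point is the flag translation in the opensolaris kmem compatibility shim: map KM_PUSHPAGE to the existing M_USE_RESERVE malloc(9) flag (a sketch; kmem_flags_convert is a hypothetical helper, and whether M_USE_RESERVE digs deep enough is exactly the open question):

    static int
    kmem_flags_convert(int kmflags)
    {
            int mflags;

            if (kmflags & KM_NOSLEEP)
                    mflags = M_NOWAIT;
            else
                    mflags = M_WAITOK;
            if (kmflags & KM_PUSHPAGE)
                    mflags |= M_USE_RESERVE;        /* allow digging into the reserve */
            return (mflags);
    }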
Better interaction between contiguous memory allocation and ARC
Currently, if an allocation of contiguous physical (multi-page) memory fails, there is only a single invocation of the vm_lowmem hook, despite the amount of laundering that is performed to free up pages. There should probably be some special side channel to communicate the condition/effort to ARC. Perhaps ARC should go as far as shrinking to its minimum size.
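One possible shape of such a side channel: pass a distinct flag through the existing vm_lowmem eventhandler so that ARC can tell a failed contiguous allocation apart from an ordinary page shortage (VM_LOW_CONTIG below is a hypothetical flag, not an existing one):

    /* in the contiguous-allocation failure path -- hypothetical */
    EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_CONTIG);

    /* in arc_lowmem() -- hypothetical handling of the new flag */
    if (how == VM_LOW_CONTIG) {
            /*
             * Fragmentation rather than a plain shortage: keep shrinking
             * the ARC target, possibly all the way down to arc_c_min.
             */
    }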
Better integration between pageout by page daemon and ARC shrinking
Some questions first:
- should ARC be driven by pagedaemon (its thresholds) or by something else?
- at what threshold should ARC reclaiming start?
- same as pagedaemon pageout?
- a higher threshold?
- a lower threshold?
- why?
- how much memory should ARC give up in one pass (triggered by pageout)?
Use hardware acceleration for checksumming, etc where possible?
Not sure if that would have any visible impact though.
Improve swap on ZVOL by "dumpifying" the ZVOL
ZVOL blocks would be pre-allocated and "pinned": no COW, etc. COW doesn't make much sense for transient data like swap.
Allow dump to ZVOL
This needs support for "hierarchical" dumping in FreeBSD. The dump code should call into the ZVOL code, and the ZVOL code should be able to dump onto the vdev geoms and all the way down to the real disk drivers.
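A very rough sketch of the registration side, using the existing dumperinfo/set_dumper(9) interface (zvol_dumper is hypothetical; the hard, missing part is the "hierarchical" forwarding from it down through the vdev geoms to the real disk drivers):

    /* hypothetical dumper callback for a ZVOL configured as the dump device */
    static int
    zvol_dumper(void *priv, void *virtual, vm_offset_t physical,
        off_t offset, size_t length)
    {
            zvol_state_t *zv = priv;    /* the ZVOL being dumped to */

            /*
             * Translate the ZVOL offset to an offset on the underlying vdev(s)
             * and call the lower geom's dumper.  This forwarding is what does
             * not exist yet.
             */
            return (EOPNOTSUPP);
    }

    /* registration, e.g. from the ZVOL's DIOCSKERNELDUMP handling */
    struct dumperinfo di = {
            .dumper = zvol_dumper,
            .priv = zv,
            .blocksize = DEV_BSIZE,
            .mediaoffset = 0,
            .mediasize = zv->zv_volsize,
    };
    set_dumper(&di);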
Priorities of ZFS worker threads
Compression/decompression/etc potentially takes a lot of CPU.
See a complaint here: http://thread.gmane.org/gmane.os.freebsd.stable/85097/focus=85100
One quick fix is to use PUSER priority, as is done in the GELI driver (see the sketch below). More investigation is needed.
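The GELI-style quick fix would look roughly like this at the top of the CPU-heavy ZFS worker/taskqueue thread bodies (sched_prio() and PUSER are real; identifying all of the right thread-creation points in ZFS is the part that needs investigation):

    /* lower the worker's priority, as g_eli_worker() does */
    thread_lock(curthread);
    sched_prio(curthread, PUSER);
    thread_unlock(curthread);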