List of interesting issues with ZFS (in no particular order)

LOR between page "lock" and z_teardown_lock

One thread could have a page busied and then block on z_teardown_lock while trying to write out the page. Another thread could be doing a rollback: it would hold z_teardown_lock and then try to discard potentially stale pages of a vnode, for which it would need to wait for each page to become "unbusied". Each thread then waits for a resource held by the other.
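The reversal can be illustrated with a small user-space model. The sketch below is not ZFS or FreeBSD code: it is a toy order tracker in the spirit of a lock order checker, and all names in it (lock_id, record_acquire, model_putpages, model_rollback) are hypothetical. It records which lock was held when another one was requested and reports when the opposite order has already been seen.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock identifiers for the model; not kernel symbols. */
enum lock_id { PAGE_BUSY, Z_TEARDOWN_LOCK, NLOCKS };

/* seen_before[a][b] is true once b was requested while a was held. */
static bool seen_before[NLOCKS][NLOCKS];

/*
 * Record that 'wanted' is being acquired while 'held' is owned.
 * Returns true if the opposite order was recorded earlier, i.e. there
 * is a lock order reversal between the two locks.
 */
static bool
record_acquire(enum lock_id held, enum lock_id wanted)
{
	assert(held != wanted);
	seen_before[held][wanted] = true;
	return (seen_before[wanted][held]);
}

/* Pageout path: the page is busied first, then z_teardown_lock is needed. */
static bool
model_putpages(void)
{
	return (record_acquire(PAGE_BUSY, Z_TEARDOWN_LOCK));
}

/* Rollback path: z_teardown_lock is held, then pages must be unbusied. */
static bool
model_rollback(void)
{
	return (record_acquire(Z_TEARDOWN_LOCK, PAGE_BUSY));
}
```

Calling model_putpages() first reports no problem; the subsequent model_rollback() reports the reversal, because the two paths take the same pair of resources in opposite order.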

Memory allocations in ARC

Low memory conditions result in ZFS ARC deadlocks.

There are some memory allocations in ARC that are done while holding locks. The buffer hash lock is the worst offender. Given the structure of the code (e.g. sometimes lock acquisition and memory allocation are a couple of functions apart), it is quite hard to move all allocations outside of the lock scope.
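To show what "moving an allocation outside of the lock scope" means in the simple case, here is a minimal user-space sketch. It is not ARC code: struct entry, hash_lock, chain and insert_prealloc are hypothetical stand-ins, and a pthread mutex stands in for the buffer hash lock. The allocation is done speculatively before the lock is taken and is thrown away if it turns out not to be needed.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hash chain entry; not an ARC structure. */
struct entry {
	int key;
	struct entry *next;
};

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *chain;

/*
 * Insert 'key' unless it is already present.  The allocation, which may
 * block or fail, happens before the lock is taken; under the lock we
 * only link the entry in, or note that the key was already there.
 * Returns 1 if inserted, 0 if already present, -1 on allocation failure.
 */
static int
insert_prealloc(int key)
{
	struct entry *e, *new;
	int inserted = 1;

	new = malloc(sizeof(*new));	/* no locks held here */
	if (new == NULL)
		return (-1);
	new->key = key;

	pthread_mutex_lock(&hash_lock);
	for (e = chain; e != NULL; e = e->next) {
		if (e->key == key) {
			inserted = 0;
			break;
		}
	}
	if (inserted) {
		new->next = chain;
		chain = new;
	}
	pthread_mutex_unlock(&hash_lock);

	if (!inserted)
		free(new);	/* speculative allocation was not needed */
	return (inserted);
}
```

The cost of this pattern is an occasional wasted allocation. It only works when the need for the allocation is known before the lock is taken, which is exactly what is hard to arrange when the two are several functions apart.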

Some possible remedies (but no real cure yet):

Memory allocation in zio aggregation

A new aggregated zio may be created. Its memory is allocated with the KM_SLEEP / M_WAITOK flag, so the allocation may sleep. Because this happens in the ZFS txg_sync thread, no new write operations can go through while it sleeps. Thus, for example, the page daemon could get stuck in zfs_write and would not be able to page out and free any memory. This leads to a deadlock.
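One conceivable way out, given here only as an assumption and not as something this page proposes, is to treat aggregation as optional: try a non-sleeping allocation and fall back to unaggregated I/O when it fails. The user-space sketch below models that idea. alloc_nosleep, issue_writes and the memory_exhausted knob are hypothetical names, not ZFS functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Knob for the model: when true, non-sleeping allocations fail. */
static bool memory_exhausted;

/* Hypothetical non-sleeping allocator: returns NULL instead of waiting. */
static void *
alloc_nosleep(size_t size)
{
	if (memory_exhausted)
		return (NULL);
	return (malloc(size));
}

/*
 * Issue 'n' adjacent writes of 'size' bytes each.  If a buffer for the
 * aggregate can be had without sleeping, issue one large I/O; otherwise
 * give up on aggregation and issue the writes one by one, so that the
 * caller never blocks waiting for memory.  Returns the number of I/Os
 * issued.
 */
static size_t
issue_writes(size_t n, size_t size)
{
	void *agg;

	if (n < 2)
		return (n);	/* nothing to aggregate */
	agg = alloc_nosleep(n * size);
	if (agg == NULL)
		return (n);	/* fall back to unaggregated I/O */
	/* A real implementation would copy the child data and issue it. */
	free(agg);
	return (1);
}
```

Under memory pressure this trades some I/O efficiency for forward progress of the sync thread.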

ZFS suspension is not visible to VFS

There are situations when VFS actively tries to avoid sleeping on locks. But because ZFS suspension is completely hidden from VFS, it may end up sleeping on z_teardown_lock. This could lead to a deadlock if, for example, the vnlru code gets stuck on a vnode from a suspended filesystem. New vnode allocations may then have to sleep waiting for vnlru. These allocations could be requested by threads that are already in ZFS transactions. As such, these threads may block txg quiescing, and thus the sync thread may get stuck. But the sync thread could be required by the suspension lock owner, which wants to perform an operation like a rollback that must go through the sync thread.
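If the suspension state were visible to VFS, a vnlru-like scan could skip vnodes of suspended filesystems instead of sleeping on them. The following is a simplified, single-threaded model of that idea under that assumption. The types and functions (model_fs, model_vnode, fs_try_enter, recycle_pass) are hypothetical and do not correspond to the kernel's struct mount or struct vnode.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a filesystem with a visible suspension state. */
struct model_fs {
	bool suspended;	/* stands in for z_teardown_lock held exclusively */
};

struct model_vnode {
	struct model_fs *fs;
	bool recycled;
};

/*
 * Non-blocking analogue of entering the filesystem: fails immediately
 * instead of sleeping when the filesystem is suspended.
 */
static bool
fs_try_enter(const struct model_fs *fs)
{
	return (!fs->suspended);
}

/*
 * vnlru-like pass: recycle what can be recycled without sleeping and
 * skip vnodes that belong to suspended filesystems.  Returns the number
 * of vnodes recycled in this pass.
 */
static size_t
recycle_pass(struct model_vnode *vns, size_t n)
{
	size_t i, done = 0;

	for (i = 0; i < n; i++) {
		if (vns[i].recycled || !fs_try_enter(vns[i].fs))
			continue;
		vns[i].recycled = true;
		done++;
	}
	return (done);
}
```

In this model the scan keeps making progress on other filesystems, and the skipped vnodes are picked up by a later pass once the suspension ends.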

Separate KVA submap for ZFS allocations

ZFS usage patterns fragment KVA a lot. It seems to make sense to use a dedicated submap for ZFS buffer allocations. The submap should be sized based on the maximum ARC size. It remains to be determined whether only data buffers or all buffers should be allocated from the submap.

This requires extending the UMA API.

Allow ARC to dig deeper into available memory?

... or at least to be woken up earlier from VM_WAIT.

In other words, try to emulate KM_PUSHPAGE.

Not sure if this is a good idea overall, but it may have some merit.

Better interaction between contiguous memory allocation and ARC

Currently, if an allocation of contiguous physical (multi-page) memory fails, there is only a single invocation of the vm_lowmem hook, regardless of the amount of laundering that is performed to free up some pages. There should probably be some special side-channel to communicate the condition/effort to ARC. Perhaps ARC should go as far as shrinking to its minimum size.

Better integration between pageout by page daemon and ARC shrinking

Some questions first:

Use hardware acceleration for checksumming, etc. where possible?

Not sure if that would have any visible impact though.

Improve swap on ZVOL by "dumpifing" the ZVOL

ZVOL blocks would be pre-allocated and "pinned", with no COW, etc. COW doesn't make much sense for transient data like swap.

Allow dump to ZVOL

This needs support for "hierarchical" dumping in FreeBSD. Dump should call into the ZVOL code, and the ZVOL code should be able to dump onto the geoms of its vdevs and all the way down to the real disk drivers.

Priorities of ZFS worker threads

Compression, decompression, etc. can potentially take a lot of CPU.

See a complaint here: http://thread.gmane.org/gmane.os.freebsd.stable/85097/focus=85100

One quick fix is to use PUSER priority, as is done in the GELI driver. More investigation is needed.

AndriyGapon/AvgZfsProblemList (last edited 2016-07-21 11:00:40 by KubilayKocak)