Some notes on ZFS code and its porting from Solaris VFS to FreeBSD VFS
In Solaris it is sufficient to hold a reference to a vnode to prevent it from getting reclaimed. By extension that also keeps its znode around.
In FreeBSD only the vnode lock reliably prevents vnode reclaiming. A reference (v_usecount) usually prevents reclaiming too, but FreeBSD VFS allows an active vnode to be reclaimed, so a reference would not help during, e.g., a forceful unmount. That seems to be the only practical case today, but, again, in theory VFS allows any active vnode to be reclaimed.
This problem is most visible in zfs_zget where we first get a znode under "object hold" lock. We are guaranteed that the znode and by extension its vnode are valid at that moment. Then we vref the node, drop all locks and return. As described above that does not provide any firm guarantee that either the vnode or the znode would stay valid after zfs_zget returns.
Graceful reclaim is done using the following algorithm:
- acquire the vnode lock
- acquire the interlock
- check that the reference count is zero
- set the VI_DOOMED flag
- release the interlock
- call VOP_RECLAIM
Note that checking of the reference count and setting of VI_DOOMED flag are done atomically with respect to the interlock.
zfs_zget increments the reference count and checks VI_DOOMED. The reference is dropped and znode lookup is retried if VI_DOOMED is set. Because of this lockstep the graceful reclaim and zfs_zget can not race. The znode is under protection of the object hold mutex in both cases.
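The lockstep described above can be sketched as a small single-threaded model. This is only an illustration: struct model_vnode, model_try_reclaim and model_zget_ref are made-up names, not FreeBSD or ZFS code, and the interlock is reduced to comments because nothing here runs concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for a vnode; NOT the real FreeBSD struct vnode. */
struct model_vnode {
    int  usecount;  /* models v_usecount */
    bool doomed;    /* models the VI_DOOMED flag */
};

/*
 * Graceful reclaim step: the reference count check and the setting of
 * the doomed flag form one unit, as if done under the interlock.
 * Returns true if the vnode was doomed (VOP_RECLAIM would follow).
 */
bool model_try_reclaim(struct model_vnode *vp)
{
    /* the interlock would be acquired here */
    if (vp->usecount != 0)
        return false;       /* still referenced: reclaim backs off */
    vp->doomed = true;
    /* the interlock would be released here */
    return true;
}

/*
 * zfs_zget-like step: take a reference, then check the doomed flag.
 * Returns true if the reference is good. Returns false if the vnode is
 * doomed; the reference is dropped again and the caller is expected to
 * retry the znode lookup.
 */
bool model_zget_ref(struct model_vnode *vp)
{
    assert(vp->usecount >= 0);
    vp->usecount++;
    if (vp->doomed) {
        vp->usecount--;
        return false;
    }
    return true;
}
```

Whichever step runs first wins: if the reference is taken first, the reclaim sees a non-zero count and backs off; if the reclaim runs first, the lookup sees the doomed flag and retries.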
In Solaris ZFS had a dummy vnode lock. Solaris VFS allows for that. ZFS completely relies on a number of private locks.
z_teardown_lock protects against concurrent execution of most of vnode operations with unmounting. This lock is of "rrw" type — recursive read-write. vnode methods take a read lock, while unmount takes a write lock. When filesystem is forcefully unmounted, operations on the remaining vnodes will be aborted because of z_unmounted flag.
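A rough model of that entry/exit protocol, as I understand it, is below. It is a single-threaded sketch, not the real ZFS_ENTER macro: all model_* names are invented, the rrw lock is reduced to counters, and returning EIO for an unmounted filesystem is my assumption about how the abort is reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for zfsvfs; NOT the real structure. */
struct model_zfsvfs {
    int  teardown_readers;  /* models read holds on z_teardown_lock */
    bool teardown_writer;   /* models the write hold taken by unmount */
    bool unmounted;         /* models z_unmounted */
};

/* Take a read hold; back out with EIO (assumed) if already unmounted. */
int model_zfs_enter(struct model_zfsvfs *zfsvfs)
{
    assert(!zfsvfs->teardown_writer);  /* a real reader would block here */
    zfsvfs->teardown_readers++;
    if (zfsvfs->unmounted) {
        zfsvfs->teardown_readers--;
        return EIO;
    }
    return 0;
}

/* Drop the read hold taken by model_zfs_enter. */
void model_zfs_exit(struct model_zfsvfs *zfsvfs)
{
    assert(zfsvfs->teardown_readers > 0);
    zfsvfs->teardown_readers--;
}

/* Shape of a vnode operation: enter, do the work, exit. */
int model_vop(struct model_zfsvfs *zfsvfs)
{
    int error = model_zfs_enter(zfsvfs);
    if (error != 0)
        return error;
    /* ... the body of the operation would run here ... */
    model_zfs_exit(zfsvfs);
    return 0;
}

/* Forceful unmount: write hold, set the flag, drop the write hold. */
void model_force_unmount(struct model_zfsvfs *zfsvfs)
{
    assert(zfsvfs->teardown_readers == 0);  /* a real writer would wait */
    zfsvfs->teardown_writer = true;
    zfsvfs->unmounted = true;
    zfsvfs->teardown_writer = false;
}
```

After model_force_unmount, every later model_vop fails early without touching any znode, which is the behaviour the flag is there for.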
z_teardown_inactive_lock protects against concurrent execution of vfs_unmount and vop_inactive. Note that the Solaris vop_inactive is closer to the FreeBSD vop_reclaim.
In FreeBSD all vnode operations are invoked with the vnode lock held.
A znode has several special state flags:
- z_unlinked: set when a file is unlinked/removed, but its znode is still valid
- z_sa_hdl: this field is only manipulated under the object hold lock; if it is NULL, then the znode is effectively invalidated:
  - all vnode operations on this znode will terminate with EIO;
  - it will not be returned by any lookups;
  - we are only waiting for its vnode to get reclaimed, so that the znode can be freed.
zfsvfs (a filesystem/mountpoint structure) has the following state flags:
- z_unmounted: indicates that a filesystem has been unmounted; this flag is primarily needed for forceful unmount
As described above zfs_zget returns a potentially unstable znode. Currently only forceful unmount can damage the znode. Most places that call zfs_zget do that under ZFS_ENTER (z_teardown_lock held and z_unmounted checked), so the znode should be stable. Places that return a vnode or make further use of it acquire the vnode lock, which keeps the vnode stable too.
One problematic place is zfs_root, which explicitly opens a hole in the logic by not checking for z_unmounted. So, it may call zfs_zget on a filesystem being actively unmounted (z_teardown_lock write lock is not held around vflush). Although zfs_root tries to obtain the vnode lock, the znode can be destroyed after zfs_zget returns and before the returned znode is accessed.
The hole was opened because zfs_root is called from vflush and is expected to succeed, and vflush is called after z_unmounted is set.
Some thoughts on forceful unmount
In FreeBSD we have the following things:
- zfs_zget re-lookups valid znodes with doomed vnodes, since this means that the vnode is being reclaimed and the znode is about to go away
- vnode operations are called with vnode lock held (and thus vnodes and znodes are guaranteed to be stable during those)
Thus we do not strictly need z_unmounted flag and the accompanying teardown locks as vflush is going to wait on each active vnode with an operation in progress. And after vgone is done no ZFS operation could be invoked on the znode.
The problematic area is still zfs_zget which returns without any locks held.
We could adjust zfs_zget to obtain and keep across returning the vnode lock for the callers like zfs_root and zfs_vget, which have to obtain the vnode lock anyway. The same applies to zfs_lookup.
But there is another complicated place — ZIL commit path. In this path the znodes are obtained by object ID and their vnodes are never explicitly used. The problem is that
- running zil_commit blocks other threads
- zil_commit may work on several object IDs / znodes / vnodes
- the blocked threads may be in vnode operations themselves and thus may hold the locks on the vnodes
This leads to a deadlock.
Some thoughts on solutions
znode reference count
Add a reference count to znode, so it could linger after its vnode is reclaimed if it is referenced internally.
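To make the idea concrete, here is a minimal sketch of what such a count could look like. Everything in it is hypothetical: there is no such field in the code discussed here, and the model_* names, the use of malloc/free, and the convention that the vnode owns one reference are my assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for a znode; NOT the real structure. */
struct model_znode {
    int  refcount;   /* hypothetical internal reference count */
    bool has_vnode;  /* models a valid z_vnode */
};

/* Create a znode whose single initial reference belongs to its vnode. */
struct model_znode *model_znode_alloc(void)
{
    struct model_znode *zp = malloc(sizeof(*zp));
    if (zp == NULL)
        return NULL;
    zp->refcount = 1;
    zp->has_vnode = true;
    return zp;
}

/* Internal users (e.g. a ZIL-commit-like path) take an extra hold. */
void model_znode_hold(struct model_znode *zp)
{
    assert(zp->refcount > 0);
    zp->refcount++;
}

/* Drop one hold; frees the znode and returns true on the last one. */
bool model_znode_rele(struct model_znode *zp)
{
    assert(zp->refcount > 0);
    if (--zp->refcount > 0)
        return false;
    free(zp);
    return true;
}

/*
 * Vnode reclaim: detach the vnode and drop the vnode's reference.
 * The znode lingers if it is still referenced internally.
 */
bool model_vnode_reclaim(struct model_znode *zp)
{
    assert(zp->has_vnode);
    zp->has_vnode = false;
    return model_znode_rele(zp);
}
```

With an internal hold in place, reclaim only detaches the vnode; the memory goes away when the last internal user calls model_znode_rele.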
acquire z_teardown_lock in vop_reclaim
Hold z_teardown_lock as a writer across zfs_freebsd_reclaim (and zfs_zinactive). Thus no vnode operation and no vfs operation like zfs_vget and zfs_root could run while any vnode is being reclaimed and its znode is being destroyed. Thus, there is no potential for a znode to be destroyed while it is in use (within an area covered by ZFS_ENTER). This may have a performance hit though.
vnode lock in zfs_zget plus some trick
We could acquire the vnode lock in zfs_zget by extending its signature with an additional flags parameter. zil_commit would be a special case and would signal not to acquire the lock. In this case we would fully rely on z_teardown_lock to protect the znode after returning.
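A possible shape for the extended call is sketched below. The flag name MODEL_ZGET_NOVNLOCK, the struct and the function are all invented for this note; the real zfs_zget has a different signature and does much more. The vnode lock is modelled as a plain boolean since the sketch is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag: the caller does not want the vnode lock taken. */
#define MODEL_ZGET_NOVNLOCK 0x1

/* Simplified stand-in for a vnode; NOT the real FreeBSD struct vnode. */
struct model_lk_vnode {
    int  usecount;  /* models v_usecount */
    bool locked;    /* models the vnode lock being held by someone */
};

/*
 * zfs_zget-like call with a flags parameter. By default the vnode is
 * returned referenced and locked. With MODEL_ZGET_NOVNLOCK it is only
 * referenced and the lock is left alone, so a caller in the ZIL commit
 * path cannot block on a lock held by a thread that is waiting for it.
 */
int model_zget(struct model_lk_vnode *vp, int flags)
{
    vp->usecount++;
    if ((flags & MODEL_ZGET_NOVNLOCK) == 0) {
        assert(!vp->locked);  /* a real caller would sleep on the lock */
        vp->locked = true;
    }
    return 0;
}
```

Callers like zfs_root and zfs_vget would pass 0 and get the vnode back locked; a zil_commit-like caller would pass MODEL_ZGET_NOVNLOCK and rely on z_teardown_lock alone.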
Some porting rules that I've just made up:
- z_vnode field should be cleared only after z_sa_hdl is cleared (zfs_znode_dmu_fini);
- we should not allow a znode with invalid z_vnode to be visible
Unmount ideas scratchpad
- vflush can't be called with z_teardown_lock and/or z_teardown_inactive_lock exclusively held, because they are not recursive and vfs_root, vop_inactive and vop_reclaim may be called; additionally, vop_write/vop_putpages could be called as well.
- vflush can't be called after "gutting" the znodes in zfsvfs_teardown, because in this case vfs_root == zfs_root called by vflush would fail; this is because zfs_zget would attempt to create a new znode/vnode pair, but would fail to insert the vnode into mount queue because of MNTK_NOINSMNTQ.
- practically, we should not allow any zfs_zget operations to run concurrently with forceful reclaiming of a vnode
- since no locks can be held around vflush, there are the following options:
  - use a flag like z_unmounted, but then a backdoor to zfs_root is needed
  - acquire exclusive z_teardown_lock in vop_reclaim iff v_usecount > 0 has been seen
  - make and use a variant of vflush that does not call vfs_root; the root vnode is provided as a parameter