Some notes on ZFS code and its porting from Solaris VFS to FreeBSD VFS

vnode/znode lifecycle

In Solaris it is sufficient to hold a reference to a vnode to prevent it from getting reclaimed. By extension that also keeps its znode around.

In FreeBSD only a vnode lock reliably prevents a vnode from being reclaimed. A reference (v_usecount) usually prevents reclaiming too, but FreeBSD VFS allows an active vnode to be reclaimed. For example, a reference does not help during a forceful unmount. That seems to be the only practical case today but, again, in theory VFS allows any active vnode to be reclaimed.

This problem is most visible in zfs_zget, where we first get a znode under the "object hold" lock. We are guaranteed that the znode, and by extension its vnode, are valid at that moment. Then we vref the vnode, drop all locks and return. As described above, that does not provide any firm guarantee that either the vnode or the znode stays valid after zfs_zget returns.

Graceful reclaim is done using the following algorithm:

Note that checking the reference count and setting the VI_DOOMED flag are done atomically with respect to the interlock.
zfs_zget increments the reference count and checks VI_DOOMED. If VI_DOOMED is set, the reference is dropped and the znode lookup is retried. Because of this lockstep, graceful reclaim and zfs_zget cannot race. The znode is protected by the object hold mutex in both cases.
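The lockstep between graceful reclaim and zfs_zget can be sketched in user space. This is a minimal model, not the kernel code: struct model_vnode, model_try_reclaim and model_ref_unless_doomed are hypothetical names, a pthread mutex stands in for the vnode interlock, and a plain flag stands in for VI_DOOMED.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a vnode; not struct vnode. */
struct model_vnode {
    pthread_mutex_t interlock; /* plays the role of the vnode interlock */
    int usecount;              /* plays the role of v_usecount */
    bool doomed;               /* plays the role of VI_DOOMED */
};

/*
 * Graceful reclaim side: the reference count check and the setting of
 * the doomed flag happen under one hold of the interlock.
 * Returns true if the vnode was doomed by this call.
 */
static bool model_try_reclaim(struct model_vnode *vp)
{
    bool reclaimed = false;

    pthread_mutex_lock(&vp->interlock);
    if (vp->usecount == 0 && !vp->doomed) {
        vp->doomed = true;
        reclaimed = true;
    }
    pthread_mutex_unlock(&vp->interlock);
    return reclaimed;
}

/*
 * zfs_zget side: take a reference, then check the doomed flag under the
 * same interlock. On failure the reference is dropped again and the
 * caller is expected to retry the znode lookup.
 */
static bool model_ref_unless_doomed(struct model_vnode *vp)
{
    bool ok;

    pthread_mutex_lock(&vp->interlock);
    vp->usecount++;
    ok = !vp->doomed;
    if (!ok)
        vp->usecount--;
    pthread_mutex_unlock(&vp->interlock);
    return ok;
}
```

In this model a referenced vnode is never doomed by the graceful path, and a doomed vnode never gains a lasting reference. Forceful reclaim is deliberately not covered.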

Locking

In Solaris ZFS has a dummy vnode lock, which Solaris VFS allows. ZFS relies completely on a number of private locks.

z_teardown_lock protects against concurrent execution of most vnode operations with unmounting. This lock is of "rrw" type (recursive read-write). vnode methods take a read lock, while unmount takes a write lock. When a filesystem is forcefully unmounted, operations on the remaining vnodes are aborted because of the z_unmounted flag.
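The reader/writer protocol around z_teardown_lock and z_unmounted can be illustrated with a small single-threaded model. All names here (struct model_zfsvfs, model_enter and so on) are hypothetical; the lock is reduced to counters, recursion of the real rrw lock is not modelled, and EBUSY is returned where real code would sleep.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the teardown state; not zfsvfs_t. */
struct model_zfsvfs {
    int readers;    /* vnode operations currently inside the filesystem */
    bool writer;    /* unmount holds the teardown lock as writer */
    bool unmounted; /* plays the role of z_unmounted */
};

/*
 * Entry gate for a vnode operation, in the spirit of ZFS_ENTER:
 * take the lock as reader, then bail out if the filesystem is gone.
 * Returns 0 on success, EBUSY where real code would block, or EIO.
 */
static int model_enter(struct model_zfsvfs *z)
{
    if (z->writer)
        return EBUSY;
    z->readers++;
    if (z->unmounted) {
        z->readers--; /* undo the read hold before failing */
        return EIO;
    }
    return 0;
}

/* Exit gate for a vnode operation: drop the read hold. */
static void model_exit(struct model_zfsvfs *z)
{
    assert(z->readers > 0);
    z->readers--;
}

/* Unmount side: become the writer once no readers remain, then set the flag. */
static int model_teardown_begin(struct model_zfsvfs *z)
{
    if (z->readers > 0 || z->writer)
        return EBUSY;
    z->writer = true;
    z->unmounted = true;
    return 0;
}

/* Unmount side: drop the write hold; the unmounted flag stays set. */
static void model_teardown_end(struct model_zfsvfs *z)
{
    assert(z->writer);
    z->writer = false;
}
```

Once the writer has been released, any later operation still fails with EIO because the flag stays set. That is the property the forceful unmount path depends on.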

z_teardown_inactive_lock protects against concurrent execution of vfs_unmount and vop_inactive, where the Solaris vop_inactive is closer to the FreeBSD vop_reclaim.

In FreeBSD all vnode operations are invoked with the vnode lock held.

States

A znode has several special state flags:

zfsvfs (a filesystem/mountpoint structure) has the following state flags:

Current problems

As described above, zfs_zget returns a potentially unstable znode. Currently only a forceful unmount can damage the znode. Most places that call zfs_zget do so under ZFS_ENTER (z_teardown_lock held and z_unmounted checked), so the znode should be stable. Places that return a vnode or make further use of it acquire a vnode lock, which keeps the vnode stable too.

One problematic place is zfs_root, which explicitly opens a hole in the logic by not checking for z_unmounted. So it may call zfs_zget on a filesystem that is being actively unmounted (the z_teardown_lock write lock is not held around vflush). Although zfs_root tries to obtain the vnode lock, the znode can be destroyed after zfs_zget returns and before the returned znode is accessed.

The hole was opened because zfs_root is called from vflush and is expected to succeed, and vflush is called after z_unmounted is set.

Some thoughts on forceful unmount

In FreeBSD we have the following things:

Thus we do not strictly need the z_unmounted flag and the accompanying teardown locks, as vflush is going to wait on each active vnode with an operation in progress. And after vgone is done, no ZFS operation can be invoked on the znode.

The problematic area is still zfs_zget, which returns without any locks held.

We could adjust zfs_zget to obtain the vnode lock and keep it held on return for callers like zfs_root and zfs_vget, which have to obtain the vnode lock anyway. The same applies to zfs_lookup.

But there is another complicated place: the ZIL commit path. In this path the znodes are obtained by object ID and their vnodes are never explicitly used. The problem is that

  1. running zil_commit blocks other threads
  2. zil_commit may work on several object IDs / znodes / vnodes
  3. the blocked threads may be in vnode operations themselves and thus may hold the locks on the vnodes

This leads to a deadlock.

Some thoughts on solutions

znode reference count

Add a reference count to the znode, so that it can linger after its vnode is reclaimed if it is still referenced internally.
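A minimal sketch of how such a count might behave, assuming the vnode itself holds one reference and internal users take additional holds. None of these names exist in ZFS; struct model_znode and its helpers are made up for illustration, and a destroyed flag replaces the actual freeing so that the state can be inspected.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reference-counted znode; not znode_t. */
struct model_znode {
    int refs;       /* assumed: 1 for the vnode plus any internal holds */
    bool has_vnode; /* false once the vnode has been reclaimed */
    bool destroyed; /* set instead of freeing memory */
};

/* A new znode starts with the single reference owned by its vnode. */
static void model_znode_init(struct model_znode *zp)
{
    zp->refs = 1;
    zp->has_vnode = true;
    zp->destroyed = false;
}

/* Internal user (for example a commit path) pins the znode. */
static void model_znode_hold(struct model_znode *zp)
{
    assert(!zp->destroyed);
    zp->refs++;
}

/* Drop one reference; the last one out destroys the znode. */
static void model_znode_rele(struct model_znode *zp)
{
    assert(zp->refs > 0);
    if (--zp->refs == 0)
        zp->destroyed = true;
}

/* Reclaim detaches the vnode and drops only the vnode's own reference. */
static void model_vnode_reclaim(struct model_znode *zp)
{
    assert(zp->has_vnode);
    zp->has_vnode = false;
    model_znode_rele(zp);
}
```

With an internal hold outstanding, reclaim leaves the znode alive but without a vnode, so every internal user would have to tolerate a znode whose vnode is gone.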

acquire z_teardown_lock in vop_reclaim

Hold z_teardown_lock as a writer across zfs_freebsd_reclaim (and zfs_zinactive). Thus no vnode operation and no vfs operation like zfs_vget and zfs_root can run while any vnode is being reclaimed and its znode is being destroyed. So there is no potential for a znode to be destroyed while it is in use (within an area covered by ZFS_ENTER). This may have a performance hit though.

vnode lock in zfs_zget plus some trick

We could acquire the vnode lock in zfs_zget by extending its signature with an additional flags parameter. zil_commit would be a special case and would signal not to acquire the lock. In this case we would rely fully on z_teardown_lock to protect the znode after returning.
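The shape of such an interface might look like the following single-threaded model. The flag MODEL_ZGET_NOLOCK, struct model_vnode and model_zget are hypothetical and are not the real zfs_zget signature; the vnode lock is reduced to a boolean, and EBUSY is returned where real code would sleep on the lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag: the caller (for example a commit path) wants no vnode lock. */
#define MODEL_ZGET_NOLOCK 0x1

/* Hypothetical stand-in for a vnode; not struct vnode. */
struct model_vnode {
    int usecount; /* plays the role of v_usecount */
    bool locked;  /* plays the role of the vnode lock */
};

/*
 * Take a reference and, unless told otherwise, also the vnode lock.
 * Returns 0 on success or EBUSY where real code would block on the lock.
 */
static int model_zget(struct model_vnode *vp, int flags)
{
    vp->usecount++;
    if ((flags & MODEL_ZGET_NOLOCK) == 0) {
        if (vp->locked) {
            vp->usecount--; /* back out the reference */
            return EBUSY;
        }
        vp->locked = true;
    }
    return 0;
}
```

The no-lock variant never waits on a vnode lock held by another thread, which is what the ZIL commit path would need to avoid the deadlock described earlier. The price is that such a caller has nothing but the teardown lock protecting the znode.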

ZFS rulebook

Some porting rules that I've just made up:

Unmount ideas scratchpad

AvfZfsFreeBSDPort (last edited 2012-10-16 15:16:35 by AndriyGapon)