Some differences between Solaris and FreeBSD VFS, and porting considerations
In Solaris VFS, vnodes have a single reference count field, v_count. Like other vnode fields, it is protected by v_lock, a mutex that must be acquired to access vnode fields. The reference count is managed by VN_HOLD and VN_RELE: VN_HOLD bumps v_count under v_lock, while VN_RELE decrements the count if it is greater than one and calls VOP_INACTIVE if the count is exactly one. Solaris VFS has a simpler vnode life cycle compared to FreeBSD: there is no inactive state, vnodes go from active straight to destroyed, and consequently there is no VOP_RECLAIM and no free list. Moreover, the VFS code does not deallocate vnodes; that task is deferred to the filesystem code. So VOP_INACTIVE is called after the VFS code has seen v_count drop to one. VOP_INACTIVE is called without any VFS locks held and without decrementing v_count (at the VFS level). Because of that, it is possible for v_count to be incremented again before VOP_INACTIVE executes. A typical vop_inactive implementation follows an algorithm similar to this:
- acquire a lock that protects the filesystem's node registry (something akin to vfs_hash_lock)
- acquire v_lock
- check v_count, if it's greater than 1, then simply decrement it (to compensate for the decrement not done at VFS level) and bail out
- remove the vnode from the registry (such as vfs hash in FreeBSD)
- release the locks
- at this point the vnode can no longer get referenced, so it is safe to destroy the vnode and its filesystem node
Note: the vnode has to be explicitly destroyed/freed by the fs code.
The above protocol ensures that the vnode is always freed, that it is not freed more than once, and that it is not freed while there is an outstanding reference to it. A code sketch of the protocol follows.
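Below is a minimal sketch of such a vop_inactive, assuming a hypothetical filesystem with a my_node_t node type, a registry protected by my_registry_lock, and helpers my_registry_remove()/my_node_free(); none of these names come from a real filesystem.

static void
my_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
{
    my_node_t *np = vp->v_data;

    mutex_enter(&my_registry_lock);
    mutex_enter(&vp->v_lock);
    if (vp->v_count > 1) {
        /*
         * The vnode was re-referenced after the VFS saw the count
         * at one; compensate for the decrement that the VFS did
         * not do and keep the vnode alive.
         */
        vp->v_count--;
        mutex_exit(&vp->v_lock);
        mutex_exit(&my_registry_lock);
        return;
    }

    /* Unhash the node so that no new references can appear. */
    my_registry_remove(np);
    mutex_exit(&vp->v_lock);
    mutex_exit(&my_registry_lock);

    /* Nobody can find the vnode now; destroy it and its node. */
    my_node_free(np);
    vn_free(vp);
}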
Note: it may seem very inefficient that a vnode is always freed when its reference count drops to zero, even when multiple independent operations are executed on the vnode in succession. The Solaris answer is that the vnode is most likely entered into the name cache (DNLC) at one point or another, and the cache keeps a reference on the vnode, preventing its immediate destruction. Also note that if the vnode is entered into the cache multiple times, only a single v_count increment occurs; the other references are reflected in a separate counter, v_count_dnlc. This makes it easy to detect the situation where only the name cache references a vnode (v_count == 1 && v_count_dnlc > 0), so the name cache serves as a form of free-list caching. Periodic or on-demand purges of the cache remove vnodes with v_count == 1. The detection could look like the fragment below.
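This fragment is purely illustrative (it is not actual DNLC code):

    mutex_enter(&vp->v_lock);
    if (vp->v_count == 1 && vp->v_count_dnlc > 0) {
        /* Only the name cache holds this vnode; it is a
         * candidate for a cache purge. */
    }
    mutex_exit(&vp->v_lock);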
Forceful unmount handling is entirely the responsibility of a filesystem. The filesystem typically marks all nodes with a special state, which can be equivalent to the removed state. The vnodes are left alone and remain valid. All vnode operations in the filesystem code must check for that state and abort, as sketched below. Reference counts on the vnodes should eventually drain down to zero and the vnodes are then freed in the normal fashion.
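For example, every vnode operation in such a filesystem might start with a check along these lines; my_node_t and its my_unmounted flag are invented for illustration.

static int
my_read(vnode_t *vp, uio_t *uiop, int ioflag, cred_t *cr,
    caller_context_t *ct)
{
    my_node_t *np = vp->v_data;

    /* Abort if the filesystem was forcibly unmounted. */
    if (np->my_unmounted)
        return (EIO);
    /* ... normal read path ... */
    return (0);
}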
In FreeBSD we have v_usecount and v_holdcnt. v_usecount tracks active uses of a vnode, while v_holdcnt tracks references that prevent the vnode's memory from being freed. Since every active reference is supposed to prevent recycling, v_holdcnt is always greater than or equal to v_usecount: v_holdcnt is incremented each time v_usecount is incremented, but not vice versa.
The life cycle of a vnode in FreeBSD is more complex than in Solaris. It can be reasonably described in a simplified form, but the actual mechanics are more complicated. Two operations involve the filesystem code in vnode management: VOP_INACTIVE notifies the filesystem that a vnode has no active users, and VOP_RECLAIM requests the filesystem-specific cleanup right before the vnode is destroyed.
FreeBSD VFS also involves more locks in vnode management. There is an interlock mutex (VI_LOCK/VI_UNLOCK) that protects some vnode fields and helps avoid certain race conditions; v_usecount and v_holdcnt are manipulated under the interlock. Unlike Solaris, FreeBSD VFS actively uses VOP_LOCK not only to protect filesystem operations on a vnode, but also to manage vnode state. While in Solaris VOP_LOCK can be a dummy operation, the same is not possible in FreeBSD.
FreeBSD provides the following functions to manage vnode counts (a typical usage pattern is sketched after the list):
- vref - increments v_usecount and v_holdcnt
- vhold - increments v_holdcnt only
- vget - increments v_usecount and v_holdcnt, plus locks the vnode using VOP_LOCK according to the given lock flags
- vdrop - decrements v_holdcnt
- vput/vrele/vunref - decrement v_usecount and v_holdcnt; they differ in their locking pre- and post-conditions
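The following is only an illustration of how these functions compose; it is not a required pattern.

static int
example_use_vnode(struct vnode *vp)
{
    int error;

    vref(vp);                          /* v_usecount++, v_holdcnt++ */
    error = vn_lock(vp, LK_EXCLUSIVE); /* VOP_LOCK wrapper */
    if (error != 0) {
        vrele(vp);                     /* drop the reference, unlocked */
        return (error);
    }
    /* ... operate on the exclusively locked vnode ... */
    vput(vp);                          /* unlock and release in one call */
    return (0);
}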
If v_usecount reaches zero in vput/vrele/vunref and the current thread is not already in VOP_INACTIVE, then the VFS code tries to obtain the LK_EXCLUSIVE vnode lock (via VOP_LOCK), if it is not already held, and then calls VOP_INACTIVE. The VI_DOINGINACT flag is set on the vnode before the call, so that recursions from VOP_INACTIVE back into the VFS can be detected. Note that VOP_INACTIVE is called with the exclusive vnode lock held but without the interlock.
A detail: if v_usecount reaches zero, but the exclusive lock could not be obtained, then VI_OWEINACT flag is set on the vnode. The flag indicates that VOP_INACTIVE should be called at certain points later on.
VOP_INACTIVE notifies a filesystem that there are no active uses of the vnode at a particular moment. The vnode is still valid, however, and could be re-activated. A filesystem is free to do nothing in VOP_INACTIVE and delay all actions until later (VOP_RECLAIM). Many filesystems detect that the vnode is associated with an unlinked filesystem node, and thus cannot be re-activated, and choose to ask the VFS, using a vrecycle call, to destroy the vnode immediately, if possible; see the sketch below. Note that vrecycle does not necessarily lead to actual immediate recycling: outstanding use references may prevent it from happening. The immediate reclaim can be forced with a vgone() call, if needed. Other filesystems may have to perform more complex actions based on their internal design and implementation. In the general case the vnode and its filesystem node remain valid and re-usable; this is how caching of filesystem nodes and their pages is implemented.
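A sketch of such a VOP_INACTIVE; my_node_t and its my_removed flag are hypothetical, and the exact vrecycle() signature has varied between FreeBSD versions.

static int
my_inactive(struct vop_inactive_args *ap)
{
    struct vnode *vp = ap->a_vp;
    my_node_t *np = vp->v_data;

    if (np->my_removed) {
        /*
         * The underlying node is unlinked and cannot be
         * re-activated; ask the VFS to reclaim the vnode as soon
         * as the counts allow it.
         */
        vrecycle(vp);
    }
    return (0);
}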
After the VOP_INACTIVE call completes, v_holdcnt is examined. If it is zero, the vnode is placed on a free list that contains vnodes subject to recycling/destruction. Vnodes from the list can be chosen for recycling by a special thread, or by any thread calling getnewvnode when certain vnode limits are exceeded. Note that a vnode can be re-used and leave the free list. Vnodes on the free list have v_holdcnt, and by extension v_usecount, equal to zero.
There are two kinds of code that recycle vnodes. One recycles free vnodes from the free list; the other, more expensive, tries to recycle inactive vnodes (v_usecount == 0) that are not on the free list (v_holdcnt != 0). Thus, holding a vnode does not guarantee that the vnode will stick around in its valid state. In both cases the protocol is generally the same:
- increment the hold count
- set the VI_DOOMED flag on the vnode
- call VOP_RECLAIM
- note: if VI_OWEINACT is set, then VOP_INACTIVE is invoked before VOP_RECLAIM
- purge the vnode from the name cache (which could lead to a vdrop call if the cache holds the [directory] vnode)
- reset the vnode fields to the VBAD state
- decrement the hold count
Note that both the vnode interlock and the vnode lock are held while the VI_DOOMED flag is set. This means that the recycling cannot be started while either of those locks is owned, if it is not started already.
Once this sequence is started there is no going back; the filesystem code must clean up the vnode. VOP_RECLAIM is executed under the exclusive vnode lock. VI_DOOMED signifies that the vnode is being, or already has been, recycled to the VBAD state. If after the described sequence the hold count is not zero, the vnode stays around in the VI_DOOMED+VBAD state until the last hold is dropped; the vnode pointer remains valid until then. When the hold count goes to zero (via vdrop), the VFS code checks the VI_DOOMED flag and destroys the vnode.
Yet another case where VOP_RECLAIM (and VOP_INACTIVE) can be called is filesystem unmount. If the unmount is graceful, only vnodes with a v_usecount of zero are reclaimed in the fashion outlined above; any active vnodes stay, and the unmount operation fails with EBUSY. If the unmount is forceful, even active vnodes are recycled. The algorithm is the same as above, but additionally VOP_CLOSE and VOP_INACTIVE are called.
As we can see, the FreeBSD strategy offers some advantages:
- role separation between VOP_INACTIVE and VOP_RECLAIM and so potentially simpler logic for both, together with the centralized handling of vnode cache.
- cached free vnodes allow for faster reactivation when a sequence of operations is executed on the same vnode
- vnodes less prone to filesystem code bugs or leaks
And some disadvantages:
- more complex VFS code
- more complex VFS API (many reference/dereference functions)
- more complex state machine: an additional doomed state
- possibility of forceful unmount makes handling of the doomed state a necessity
- more complex locking considerations: VOP_LOCK must always be a proper lock that supports exclusive acquisition
- recursion in both directions: from fs code via the VFS back into fs code, and from the VFS via fs code back into the VFS
Small FreeBSD summary:
- if you have a vnode lock, then you are guaranteed that nothing important changes in the vnode state
- if you have a reference to a vnode, then you are guaranteed that the vnode stays valid unless a call to vgone() happens, which typically occurs due to a forceful unmount request or the underlying device going away.
- if you have a hold on a vnode, then you are guaranteed that the vnode pointer stays valid, but nothing about the state/validity of the vnode
- The FreeBSD VFS never calls into a filesystem with a doomed vnode if the VOP interface specifies that the vnode lock must be held. On the other hand, if the filesystem code drops the vnode lock, it might find the vnode associated with a filesystem node to be in the doomed state after the relock.
- taking the vnode lock or the interlock, and checking for VI_DOOMED (without dropping the lock), is the only way to get a guaranteed valid (or already recycled) vnode; see the sketch below
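In code, the revalidation after a relock could look like this; the two-argument VOP_UNLOCK shown here matches the FreeBSD versions that still have VI_DOOMED.

static int
my_relock(struct vnode *vp)
{
    /* LK_RETRY makes vn_lock succeed even on a doomed vnode. */
    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
    if (vp->v_iflag & VI_DOOMED) {
        /* The vnode was reclaimed while the lock was dropped. */
        VOP_UNLOCK(vp, 0);
        return (ENOENT);
    }
    return (0);
}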
Some additional problems:
- as described above, to be sure of vnode's state one has to acquire and hold the vnode lock — a reference is not sufficient
- this has a deadlock potential if another lock is already held (another vnode's lock or some filesystem-private lock)
Porting considerations:
- A first step could be as simple as making the Solaris VOP_INACTIVE the FreeBSD VOP_RECLAIM and using VOP_NULL for VOP_INACTIVE.
- If Solaris VOP_INACTIVE has some meaningful / useful actions before it does the bailout check on v_count, then it may make sense to move those actions to FreeBSD VOP_INACTIVE.
- Remember that it doesn't make sense to do any v_usecount or v_holdcnt checks or assertions in either VOP_INACTIVE or VOP_RECLAIM
- In FreeBSD it is the VFS that manages vnodes; the filesystem cannot do anything except obey (or, in some cases, help).
- Don't even think about directly manipulating v_usecount or v_holdcnt fields.
- Don't mess with VFS calls that change the counts or may drop a vnode lock
- VOP_INACTIVE must not do any destructive actions to a vnode and its filesystem node, nor invalidate them in any way.
- vrecycle can be used in VOP_INACTIVE to finish processing of a vnode that references a removed filesystem node.
- The filesystem must arrange that the node / vnode are not returned by any lookups (ENOENT).
- FreeBSD VOP_RECLAIM is expected to call vnode_destroy_vobject (could probably be done at VFS level for all).
- FreeBSD VOP_RECLAIM is expected to clear the v_data field. Both expectations are reflected in the sketch below.
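Putting the last two points together, a minimal VOP_RECLAIM could look like the following sketch; my_hash_remove() and my_node_free() are hypothetical helpers.

static int
my_reclaim(struct vop_reclaim_args *ap)
{
    struct vnode *vp = ap->a_vp;
    my_node_t *np = vp->v_data;

    /* Destroy the VM object backing the vnode's pages. */
    vnode_destroy_vobject(vp);

    /* Make sure no lookup can return this node any more. */
    my_hash_remove(np);

    /* Detach and free the filesystem node; the vnode itself is
     * owned, and eventually freed, by the VFS. */
    vp->v_data = NULL;
    my_node_free(np);
    return (0);
}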
If VOP_INACTIVE is called, then one of the following is true:
- use count is zero — normal inactivation
- VI_DOOMED is set — forceful reclaim or owed/delayed inactivation (VI_OWEINACT could be set too)
- use count is one and VI_OWEINACT is set — owed/delayed inactivation in vget before re-activation
Calling back into fs code:
- In Solaris it is expected that VN_RELE (or functions that may call it, e.g. dnlc_purge_vfsp) may call into VOP_INACTIVE
- In FreeBSD VOP_RECLAIM is never invoked directly from vput-ish/vdrop
- In FreeBSD VOP_INACTIVE is never called directly from vdrop
- In Solaris there are no possible calls into fs code when creating a new vnode
- In FreeBSD getnewvnode may call VOP_RECLAIM and/or VOP_INACTIVE; but see getnewvnode_reserve and the sketch below
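A sketch of the reservation pattern follows. The getnewvnode_reserve() argument convention has differed between FreeBSD versions, and MY_FS_LOCK/my_vnodeops are stand-ins for a private filesystem lock and the filesystem's vop vector.

static int
my_make_vnode(struct mount *mp, struct my_fs *fsp, struct vnode **vpp)
{
    struct vnode *vp;
    int error;

    getnewvnode_reserve(1);     /* may sleep; call before taking locks */
    MY_FS_LOCK(fsp);
    error = getnewvnode("myfs", mp, &my_vnodeops, &vp);
    if (error == 0)
        *vpp = vp;
    MY_FS_UNLOCK(fsp);
    getnewvnode_drop_reserve(); /* release any unused reservation */
    return (error);
}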