Some differences between Solaris and FreeBSD VFS, and porting considerations

vnode recycling

Solaris

In Solaris VFS vnodes have a single reference count field, v_count. The field, like other vnode fields, is protected by v_lock, a mutex that must be acquired to access vnode fields. The reference count is managed by VN_HOLD and VN_RELE. VN_HOLD bumps v_count under v_lock; VN_RELE decrements the count if it is greater than one. If the count is one, then VN_RELE calls VOP_INACTIVE instead. Solaris VFS has a simpler vnode life cycle compared with FreeBSD: there is no inactive state, vnodes go from active straight to destroyed. As such there is no VOP_RECLAIM and no free list. Besides, the VFS code does not deallocate the vnode itself; this task is deferred to the filesystem code. So, VOP_INACTIVE is called after the VFS code has seen v_count go down to one. VOP_INACTIVE is called without any VFS locks held and without decrementing v_count (at the VFS level). Because of that it is possible that v_count gets incremented before VOP_INACTIVE is executed. A typical vop_inactive implementation acts according to an algorithm similar to the following:
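
A hedged sketch of such an implementation (the myfs_* names are hypothetical; real filesystems add more checks, e.g. for the DNLC count):

    static void
    myfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
    {
        mutex_enter(&vp->v_lock);
        if (vp->v_count > 1) {
            /*
             * The vnode was re-referenced after VN_RELE saw the
             * count at one; consume that release and keep the
             * vnode alive.
             */
            vp->v_count--;
            mutex_exit(&vp->v_lock);
            return;
        }
        mutex_exit(&vp->v_lock);
        /*
         * Truly the last reference: tear down the filesystem node
         * and free the vnode itself (e.g. with vn_free()).
         */
        myfs_destroy_node(vp);
    }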

Note: the vnode has to be explicitly destroyed/freed by the fs code.

The above protocol ensures that the vnode is always freed, that it is not freed more than once, and that it is not freed while there is an outstanding reference to it.

Note: it may seem very inefficient that a vnode is always freed when its reference count goes to zero, even when multiple independent operations are executed on the vnode in succession. The Solaris answer is that the vnode will most likely be entered into the name cache in one place or another, and the cache keeps a reference on the vnode, preventing its immediate destruction. Also note that if the vnode is entered into the cache multiple times, only a single v_count increment occurs; the other references are reflected in a separate counter, v_count_dnlc. Thus it is easy to detect the situation where only the name cache holds a reference (v_count == 1 && v_count_dnlc > 0), and so the name cache serves as a form of free-list caching. Periodic or on-demand purges of the cache remove vnodes with v_count == 1.
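
For instance, a purge pass could detect cache-only vnodes roughly like this (a schematic only, not the actual DNLC internals):

    mutex_enter(&vp->v_lock);
    if (vp->v_count == 1 && vp->v_count_dnlc > 0) {
        /*
         * Only the name cache references this vnode: drop the
         * cache entries so that the vnode can be destroyed.
         */
        mutex_exit(&vp->v_lock);
        dnlc_purge_vp(vp);
    } else {
        mutex_exit(&vp->v_lock);
    }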

Forceful unmount handling is entirely the responsibility of the filesystem. The filesystem typically marks all of its nodes with a special state, which can be equivalent to the removed state. The vnodes are left alone and remain valid. All vnode operations in the filesystem code must check for that state and abort. Reference counts on the vnodes should eventually drain down to zero and the vnodes are then freed in the normal fashion.
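
As an illustration only (myfs_* and the fs_unmounted flag are hypothetical, not a real Solaris API), every operation would start with a check along these lines:

    static int
    myfs_read(vnode_t *vp, uio_t *uiop, int ioflag, cred_t *cr,
        caller_context_t *ct)
    {
        myfs_node_t *np = vp->v_data;

        if (np->n_fs->fs_unmounted) {
            /*
             * Forcibly unmounted: refuse the operation, but the
             * vnode itself stays valid until its last VN_RELE.
             */
            return (EIO);
        }
        return (myfs_read_impl(np, uiop, ioflag, cr));
    }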

FreeBSD

In FreeBSD we have v_usecount and v_holdcnt. v_usecount tracks active uses of a vnode, while v_holdcnt tracks references that prevent the vnode memory from being freed. Since every active reference is supposed to prevent recycling, v_holdcnt is always greater than or equal to v_usecount: v_holdcnt is incremented each time v_usecount is incremented, but the opposite is not true.
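
Schematically (ignoring the interlock, the free list, and VOP_INACTIVE), the counters move together like this:

    /* A schematic model only; these are not the real implementations. */
    void vref(struct vnode *vp)  { vp->v_holdcnt++; vp->v_usecount++; }
    void vhold(struct vnode *vp) { vp->v_holdcnt++; }
    void vrele(struct vnode *vp) { vp->v_usecount--; vp->v_holdcnt--; }
    void vdrop(struct vnode *vp) { vp->v_holdcnt--; }

Hence v_holdcnt >= v_usecount holds at all times.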

The life cycle of a vnode in FreeBSD is more complex than in Solaris. It can be reasonably described in a simplified form, but the actual mechanics are more complicated. Two operations are used to involve the filesystem code in the management: VOP_INACTIVE notifies that a vnode has no active users, and VOP_RECLAIM requests the filesystem-specific cleanup right before the vnode is destroyed.

Also, FreeBSD VFS involves more locks in vnode management. There is an interlock mutex (VI_LOCK/VI_UNLOCK) that is used to protect some fields of the vnode and to avoid certain race conditions. v_usecount and v_holdcnt are manipulated under the interlock. Unlike Solaris, FreeBSD VFS actively uses VOP_LOCK not only to protect filesystem operations on a vnode, but also to manage vnode state. While in Solaris it is possible for VOP_LOCK to be a dummy operation, the same is not possible in FreeBSD.

FreeBSD provides the following functions to manage vnode counts: vref() acquires a use reference (incrementing both v_usecount and v_holdcnt); vrele() drops a use reference and must be called with the vnode unlocked; vput() does the same for a locked vnode and unlocks it; vunref() does the same for a locked vnode and leaves it locked; vhold() and vdrop() acquire and drop a pure hold reference (v_holdcnt only).
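
For illustration, a common pattern for obtaining and releasing a vnode (mp and ino are assumed to be a mount point and an inode number already in scope):

    struct vnode *vp;
    int error;

    /* VFS_VGET returns the vnode held and locked as requested. */
    error = VFS_VGET(mp, ino, LK_EXCLUSIVE, &vp);
    if (error == 0) {
        /* ... operate on the exclusively locked vnode ... */
        vput(vp);   /* unlock and drop the use reference in one call */
    }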

If v_usecount reaches zero in vput/vrele/vunref and the current thread is not already in VOP_INACTIVE, then the VFS code tries to obtain the LK_EXCLUSIVE vnode lock (via VOP_LOCK), if it is not already held, and then calls VOP_INACTIVE. The VI_DOINGINACT flag is set on the vnode before the call, so that recursions from VOP_INACTIVE into the VFS can be detected. Note that VOP_INACTIVE is called with the exclusive vnode lock held but without the interlock.

A detail: if v_usecount reaches zero but the exclusive lock could not be obtained, then the VI_OWEINACT flag is set on the vnode. The flag indicates that VOP_INACTIVE should be called at certain points later on (see the sketch below).
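
Putting the two paragraphs together, the tail of vrele()/vput() behaves roughly like the following sketch (simplified from 10.x-era code, not verbatim):

    static void
    vrele_sketch(struct vnode *vp)
    {
        VI_LOCK(vp);
        vp->v_usecount--;
        if (vp->v_usecount > 0 || (vp->v_iflag & VI_DOINGINACT) != 0) {
            vdropl(vp);         /* drop the matching hold; releases interlock */
            return;
        }
        if (vn_lock(vp, LK_EXCLUSIVE | LK_INTERLOCK | LK_NOWAIT) == 0) {
            VI_LOCK(vp);
            vp->v_iflag |= VI_DOINGINACT;   /* detect recursion into VFS */
            VI_UNLOCK(vp);
            VOP_INACTIVE(vp, curthread);    /* vnode lock held, no interlock */
            VI_LOCK(vp);
            vp->v_iflag &= ~VI_DOINGINACT;
            VI_UNLOCK(vp);
            VOP_UNLOCK(vp, 0);
        } else {
            VI_LOCK(vp);
            vp->v_iflag |= VI_OWEINACT;     /* inactivation is owed */
            VI_UNLOCK(vp);
        }
        vdrop(vp);      /* drop the matching hold reference */
    }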

VOP_INACTIVE notifies a filesystem that there are no active uses of the vnode at a particular moment. But the vnode is still valid and can be re-activated. A filesystem is free to do nothing in VOP_INACTIVE and delay all actions until later (VOP_RECLAIM). Many filesystems detect that the vnode is associated with an unlinked filesystem node, and thus cannot be re-activated, and choose to ask the VFS, via the vrecycle() call, to destroy the vnode immediately, if possible. Note that vrecycle() does not necessarily lead to actual immediate recycling: outstanding use references may prevent it from happening. The immediate reclaim can be forced with a vgone() call, if needed. Other filesystems may have to perform more complex actions depending on their internal design and implementation. In the general case the vnode and its filesystem node remain valid and re-usable; this is how caching of inode pages is implemented.
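
For example, a filesystem whose node has been unlinked could implement VOP_INACTIVE along these lines (patterned after UFS; the myfs_* names are illustrative, and vrecycle() is shown with its 10.x-era signature):

    static int
    myfs_inactive(struct vop_inactive_args *ap)
    {
        struct vnode *vp = ap->a_vp;
        struct myfs_node *np = vp->v_data;

        if (np->n_links == 0) {
            /*
             * Unlinked: the vnode can never be re-activated, so ask
             * the VFS to recycle it now, if references permit.
             */
            vrecycle(vp);
        }
        return (0);
    }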

After the VOP_INACTIVE call completes, v_holdcnt is examined. If it has reached zero, then the vnode is placed on a free list, which contains vnodes that are subject to recycling/destruction. vnodes from the list can be chosen for recycling by a special thread, or by any thread calling getnewvnode when certain vnode limits are exceeded. Note that a vnode can be re-used and leave the free list. vnodes on the free list have v_holdcnt, and by extension v_usecount, equal to zero.

There are two kinds of code that recycle vnodes. One recycles free vnodes from the free list; the other, more expensive, kind tries to recycle inactive vnodes (v_usecount == 0) that are not on the free list (v_holdcnt != 0). Thus we can see that holding a vnode does not guarantee that the vnode will stick around in its valid state. In both cases the protocol is generally the same:
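
In outline, the sequence resembles FreeBSD's vgonel() (a simplified sketch; the real code also flushes buffers and notifies lower layers):

    static void
    recycle_sketch(struct vnode *vp)
    {
        int active, oweinact;

        ASSERT_VOP_ELOCKED(vp, "recycle");  /* exclusive vnode lock */
        ASSERT_VI_LOCKED(vp, "recycle");    /* and the interlock */

        vp->v_iflag |= VI_DOOMED;           /* point of no return */
        active = vp->v_usecount;
        oweinact = (vp->v_iflag & VI_OWEINACT) != 0;
        VI_UNLOCK(vp);
        if (active) {
            /* Forced recycling of an active vnode. */
            VOP_CLOSE(vp, FNONBLOCK, NOCRED, curthread);
            VOP_INACTIVE(vp, curthread);
        } else if (oweinact) {
            VOP_INACTIVE(vp, curthread);    /* deliver owed inactivation */
        }
        if (VOP_RECLAIM(vp, curthread) != 0)    /* fs detaches v_data here */
            panic("recycle: cannot reclaim");
        vp->v_type = VBAD;                  /* now a dead vnode */
    }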

Note that both the vnode interlock and the vnode lock are held while the VI_DOOMED flag is set. This means that recycling cannot start while either of the locks is owned, if it has not started already.

Once this sequence is started there is no going back: the filesystem code must clean up the vnode. VOP_RECLAIM is executed under the exclusive vnode lock. VI_DOOMED signifies that the vnode has been recycled and put into the VBAD state. If after the described sequence the hold count is not zero, then the vnode stays around in the VI_DOOMED+VBAD state until the last hold is dropped; the vnode pointer remains valid until then. When the hold count goes to zero (via vdrop) the VFS code checks the VI_DOOMED flag and destroys the vnode.
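
Schematically, the final drop looks as follows (simplified from vdropl(); destroy_vnode() and add_to_free_list() describe the intent, they are not real kernel functions):

    static void
    vdrop_sketch(struct vnode *vp)
    {
        VI_LOCK(vp);
        if (--vp->v_holdcnt > 0) {
            VI_UNLOCK(vp);              /* still held elsewhere */
        } else if ((vp->v_iflag & VI_DOOMED) != 0) {
            destroy_vnode(vp);          /* reclaimed earlier: free for real */
        } else {
            add_to_free_list(vp);       /* valid and unused: cache it */
            VI_UNLOCK(vp);
        }
    }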

Yet another case where VOP_RECLAIM (and VOP_INACTIVE) can be called is filesystem unmount. If the unmount is graceful, then only vnodes with a v_usecount of zero will be reclaimed, in the fashion outlined above; any active vnodes will stay, and the unmount operation will fail with EBUSY. If the operation is forceful, then even active vnodes will be recycled. The algorithm is the same as the one above, but additionally VOP_CLOSE and VOP_INACTIVE will be called.
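
For reference, a filesystem's unmount entry point typically drives this through vflush(9); a sketch under the usual conventions (the myfs_* names and the private-data teardown are illustrative):

    static int
    myfs_unmount(struct mount *mp, int mntflags)
    {
        int error, flags = 0;

        if (mntflags & MNT_FORCE)
            flags |= FORCECLOSE;    /* recycle even active vnodes */
        /*
         * Without FORCECLOSE this fails with EBUSY if active
         * vnodes remain; with it, they are forcibly reclaimed.
         */
        error = vflush(mp, 0, flags, curthread);
        if (error != 0)
            return (error);
        /* ... free filesystem-private mount data here ... */
        return (0);
    }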

As we can see, the FreeBSD strategy offers some advantages:

Some disadvantages:

Small FreeBSD summary:

An additional problem:

Porting steps

Further notes

If VOP_INACTIVE is called, then either of the following is true:

Calling back into fs code:
