Peter Holm (pho@) and I (kib@) have developed a patch, available at svn:users/kib/vm6.
The patch implements rangelocks and VM-based I/O for regular vnodes, based on an idea by Jeff Roberson; see http://people.freebsd.org/~jeff/vm_readwrite.c.
The patch aims to solve the following issues present in FreeBSD:
- the deadlock demonstrated by ups@'s dl.c: a LOR between vnode locks when one thread reads from one vnode into a buffer backed by another vnode, while another thread does the same with the vnodes reversed.
- recursive acquisition of the vnode lock when a thread reads from or writes to a vnode using a buffer backed by the same vnode. An interesting nullfs failure case of the same kind: mount -t nullfs /x /mnt followed by cp /x/a /mnt/a.
The patch adds rangelocks that cover the modified file region for read, write and truncate. The actual read or write is performed by uiomove_fromphys() from/to an array of held pages of the vnode-backed vm object. The pages are allocated and read using VOP_GETPAGES() if necessary.
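To make the locking rule concrete, here is a minimal userspace sketch of the range conflict test any rangelock implementation needs before granting a lock. The names rl_range and rl_conflict are hypothetical, not taken from the patch; ranges are half-open byte intervals [start, end).

```c
#include <stdbool.h>
#include <sys/types.h>	/* off_t */

/*
 * Illustrative sketch only, not code from the vm6 patch: a byte range
 * and the conflict test between a pending lock request and a held lock.
 */
struct rl_range {
	off_t	start;	/* first byte covered */
	off_t	end;	/* one past the last byte covered */
	bool	write;	/* exclusive (write/truncate) vs. shared (read) */
};

/* Two ranges conflict when they overlap and at least one is exclusive. */
static bool
rl_conflict(const struct rl_range *a, const struct rl_range *b)
{

	if (a->end <= b->start || b->end <= a->start)
		return (false);		/* disjoint ranges never conflict */
	return (a->write || b->write);	/* readers may share, writers may not */
}
```

With this rule, concurrent reads of overlapping regions proceed in parallel, while a write or truncate excludes everything that touches its range.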
I cannot busy the pages for the duration of the uiomove, since doing so allows either a LOR between busy pages of different vnodes, or a self-lock where vm_fault() waits for pages busied by the current thread, if the uiomove causes page faults in the situation of ups@'s original deadlock. The latter can be somewhat worked around by informing vm_fault() which pages were busied by vnode_pager_read()/vnode_pager_write(), but the former issue does not seem to have a good solution.
Merely holding the pages allows another thread to take a page fault on a file mapping. This is the reason why I require fully-valid pages in vnode_pager_write(): if a page were newly added to the object, another thread could see zeroes through the mapping. See the comments in vnode_pager_write().
When adding pages for a file region that is extended by a write() call, I need to truncate the file to the larger size. Doing this with VOP_SETATTR() appears to be prohibitively costly, at least on UFS, because ffs_truncate() allocates the block for the last byte and writes it.
I introduced a new VOP_EXTEND() that extends the file to the new size. The default implementation does VOP_GETATTR()/VOP_SETATTR(). On UFS, a custom implementation is provided that sets the new size in the inode without allocating a block for the last byte. When extending a file that ended in a fragment, the whole block, or the necessary number of fragments at the previous end, is reallocated to keep the invariant that fragments are allowed only at the direct blocks at the end of the file. Current fsck would treat such extended files as a filesystem inconsistency and return the size to the pre-extend value.
VOP_EXTEND() also gives the filesystem a convenient place to drop the suid bit, which shall be cleared when a file is modified. Doing that by VOP_GETATTR()/VOP_SETATTR() is inefficient; since VOP_EXTEND() is guaranteed to be called at least once for each write, the filesystem may use the opportunity to clean the bits, and ffs_extend() handles this.
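The fragment invariant above can be illustrated with a little arithmetic. The following sketch (hypothetical names, not UFS code) computes how many fragments the tail of a file of a given size occupies, for a block size that is a multiple of the fragment size; everything before the tail must be full blocks, which is why extending past the old end may force a reallocation of the old tail fragments.

```c
#include <sys/types.h>	/* off_t */

/*
 * Illustrative sketch, not ffs code: with block size bsize and fragment
 * size fsize (bsize a multiple of fsize), return the number of
 * fragments needed to hold the tail of a file of the given size.  UFS
 * allows such a fragment run only as the last direct block of the file.
 */
static int
tail_frags(off_t size, int bsize, int fsize)
{
	off_t tail = size % bsize;

	if (tail == 0)
		return (0);	/* block-aligned size: full blocks only */
	return ((int)((tail + fsize - 1) / fsize));	/* round up */
}
```

For example, with 16K blocks and 2K fragments, growing a file whose tail needs 3 fragments into one whose tail needs 5 means the 3-fragment run at the old end must be reallocated as a 5-fragment run (or a full block), which is the work ffs_extend() has to do.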
Pressure on writers is another critical feature, both for performance and for responsiveness of the system. I tried to keep it simple: count the number of pages dirtied by vnode_pager_write(), and perform a cleaning pass, similar to the pageout launder, if the count of dirty pages becomes too big.
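The counting scheme amounts to a simple threshold trigger; a minimal sketch, with made-up names and a made-up threshold, could look like this:

```c
#include <stdbool.h>

/*
 * Sketch of the write-pressure idea only; the name note_dirtied() and
 * the threshold value are hypothetical, not from the patch.
 */
#define	DIRTY_LAUNDER_THRESHOLD	1024	/* hypothetical trigger, in pages */

static long vnode_dirty_count;

/*
 * Account pages dirtied by vnode_pager_write()-style I/O; returns true
 * when a laundering pass should run.
 */
static bool
note_dirtied(int npages)
{

	vnode_dirty_count += npages;
	if (vnode_dirty_count >= DIRTY_LAUNDER_THRESHOLD) {
		vnode_dirty_count = 0;	/* launder pass requested */
		return (true);
	}
	return (false);
}
```

The real code would of course use atomics or a lock for the counter and integrate with the pagedaemon rather than returning a flag.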
I noted a slight performance improvement and a decrease of the write counter when I allowed vm_pageout_clean() to cluster pages from both the inactive and active queues. This change is kept in the patch for now, but can be reverted.
Performance: a non-scientific installworld time measurement shows no significant change. On some of Peter's hardware, with the stock kernel:
# time make installworld > /dev/null
      210.44 real        35.61 user        59.47 sys
On the same hardware with the patch applied:
# time make installworld > /dev/null
      207.66 real        35.93 user        58.68 sys
Known issues and parts not done:
Jeff's concern is that the patch increases contention on vm_page_queue_lock, which is already highly contended. See the sysbench graph above.
Currently, the behaviour of the system when appending to a file is the same as if the file were created, extended, mmaped into an address space, and then the mapped area filled. It is known that this causes severe file fragmentation, since the buf clustering code does not see adjacent writes. Something similar to VOP_REALLOCBLKS() ought to be implemented for page flushes.
On the other hand, we have measured the current situation, and it is not much worse than for the unpatched kernel. The patch contains sequential write detection code, and the vm6 branch contains a UFS fragmentation measurement tool, see tools/tools/ufs/fragc. We ran the following load:
> I create some 200 files of various size in 1k increments. I then have
> 60 threads that copy the files to a new location, in parallel. The
> files in the new location are then deleted. This is repeated 50 times.
> Finally the original files are copied to a new location and the
> originals are deleted.
Measuring the fragmentation with fragc showed that both the unpatched and the patched kernel give fragmentation <0.2%, with variations of +/- 40% of the absolute value on each side.
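The sequential write detection mentioned above can be sketched as a per-file state that remembers where the previous write ended; this is an illustration with hypothetical names, not the patch's actual heuristic:

```c
#include <stdbool.h>
#include <sys/types.h>	/* off_t */

/*
 * Hypothetical sketch of per-file sequential write detection: a write
 * starting exactly where the previous one ended extends the sequential
 * run, anything else restarts it.
 */
struct seq_state {
	off_t	next_off;	/* expected offset of the next write */
	int	run;		/* consecutive sequential writes seen */
};

static bool
seq_write(struct seq_state *s, off_t off, off_t len)
{
	bool sequential = (off == s->next_off);

	s->run = sequential ? s->run + 1 : 0;
	s->next_off = off + len;
	return (sequential);
}
```

A flushing path could consult the run length to decide whether clustering or block reallocation is worthwhile for this file.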
Filesystem full handling
Similar to the fragmentation item above, this mode of operation makes it very hard to inform write(2) callers of free space exhaustion.
This could be fixed by tracking the space claimed (but not yet allocated) by VOP_EXTEND() on a per-fs basis.
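A minimal sketch of such per-filesystem tracking, with all names hypothetical: extends reserve blocks against the free count up front, so write(2) can fail with ENOSPC immediately instead of discovering the shortage at page flush time.

```c
#include <stdbool.h>
#include <sys/types.h>

/*
 * Illustrative only: per-filesystem accounting of space claimed by
 * VOP_EXTEND() but not yet allocated on disk.  Real code would hold
 * the appropriate per-mount lock around these updates.
 */
struct fs_space {
	int64_t	free_blocks;		/* blocks currently free on disk */
	int64_t	reserved_blocks;	/* claimed by extends, not yet allocated */
};

/* Claim blocks for a pending extend; failure maps to ENOSPC. */
static bool
fs_reserve(struct fs_space *fs, int64_t blocks)
{

	if (fs->free_blocks - fs->reserved_blocks < blocks)
		return (false);		/* would overcommit the filesystem */
	fs->reserved_blocks += blocks;
	return (true);
}

/* Drop a claim once the blocks are actually allocated (or abandoned). */
static void
fs_unreserve(struct fs_space *fs, int64_t blocks)
{

	fs->reserved_blocks -= blocks;
}
```
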
Swap-backed md deadlock
Swap-backed md is deadlock-prone. When a page is not present in the swap object, vm_page_grab() is called; under a low free page condition, it VM_WAITs. On the other hand, the pageout daemon may need to flush a page belonging to the md swap object, causing a producer/consumer deadlock. The patch greatly increases the chances of hitting the issue. Some mitigation is present in the patch: dirtied pages are cleaned immediately under low memory conditions.
Rangelocks implementation is not optimal
Rangelocks are implemented with msleep(9) and a mutex, suffering from the same double wait-queue inefficiency as the old sx and lockmgr locks.
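The pattern, translated to a userspace sketch with pthreads (hypothetical names, exclusive-only locks for brevity), shows the inefficiency: every release broadcasts, and every waiter wakes up and rescans the whole list of held ranges even when its range is still blocked.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>

/* Userspace illustration only, not the patch's rangelock code. */
struct range {
	off_t	start, end;		/* half-open [start, end) */
	struct range *next;
};

struct rangelock {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	struct range	*held;		/* granted ranges, unordered */
};

static bool
range_overlaps(const struct rangelock *rl, off_t start, off_t end)
{
	const struct range *r;

	for (r = rl->held; r != NULL; r = r->next)
		if (start < r->end && r->start < end)
			return (true);
	return (false);
}

static void
rangelock_acquire(struct rangelock *rl, off_t start, off_t end)
{
	struct range *r = malloc(sizeof(*r));

	pthread_mutex_lock(&rl->lock);
	while (range_overlaps(rl, start, end))
		pthread_cond_wait(&rl->cv, &rl->lock);	/* msleep() analogue */
	r->start = start;
	r->end = end;
	r->next = rl->held;
	rl->held = r;
	pthread_mutex_unlock(&rl->lock);
}

static void
rangelock_release(struct rangelock *rl, off_t start, off_t end)
{
	struct range **rp, *r;

	pthread_mutex_lock(&rl->lock);
	for (rp = &rl->held; (r = *rp) != NULL; rp = &r->next)
		if (r->start == start && r->end == end) {
			*rp = r->next;
			free(r);
			break;
		}
	pthread_cond_broadcast(&rl->cv);	/* wake all waiters; all rescan */
	pthread_mutex_unlock(&rl->lock);
}
```

A better implementation would wake only the waiters whose ranges became grantable, which is exactly what the single condition variable (or a single kernel sleep channel) cannot express.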
fsync(2) and O_SYNC efficiency
fsync(2) and O_SYNC are implemented by vm_object_page_clean() for now. vm_object_page_clean() was rewritten recently.
Possible opt-in for vmio from fs
I think that the new I/O path should be explicitly allowed by an mnt_kern_flags bit.