ZFS
Organizer: Matt Ahrens <mahrens@FreeBSD.org>
Abstract:
Topics:
- Implementing the new hashing algorithms (SHA-512t, Skein)
- ARC/VFS Interactions (a find(1) on a large SVN tree will cause the ARC to fill with unevictable metadata and kill the system)
- ABD (ARC Buf Data) - filling the ARC with scatter-gather list of page_t's instead of contiguous kernel VA
- swap/dump (especially dump) to ZVOL
- complete the testing/review of the UEFI ZFS bootloader so it can be enabled by default
- defining the roadmap for fixing ZFS trim in FreeBSD
- secure delete of files (zero blocks as they are freed)
- fixing vdev layout bugs in the FreeBSD installer (e.g. instead of a stripe of two disk mirrors, a 12 disk install resulted in a single 12-way mirror)
- zfs allow {all-properties,all-commands} (future-proof for new properties)
- Upcoming ZFS features
Attending
In order to attend you need to register for the storage summit, email the working group organizer, and have your working group slot confirmed.
Please do NOT add yourself here. Your name will appear automatically once you have received the confirmation email. You need to put your name on the general storage summit attendees list though.
Name | Username / Affiliation | Topics of Interest | Notes
Matt Ahrens | mahrens / Delphix | all | Organizer
Allan Jude | allanjude | all |
Sean Kelly | FlightAware | |
Dan Kimmel | Delphix | |
Discussion Notes
- dedup performance
- limit the size of the DDT (disable dedup when the DDT is full)
- LRU for the DDT, so old entries are undeduped, and only busy blocks are deduped
- DDT on dedicated device (SSD, NVMe)
- Intel's "metadata classes" work (government contract to upstream to ZoL)
- DDT write amplification (all writes have to update the DDT hash table)
- DDT Log, append the changes (inc/dec the reference counts)
- Once the log is x times bigger than the optimal size, coalesce/compact the log
- Should make the in memory DDT smaller
- Shrink DDT entries by removing the 2nd and 3rd block pointers that are rarely used
- DDT limit: evict a random entry with refcount=1 on each new addition. If a dedup block is read and is not found in the DDT, there is only one copy of it
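The DDT-log and DDT-limit ideas above can be sketched roughly as follows. This is a toy illustration, not the real DDT code: the class, field names, and thresholds are all invented for the example. Refcount changes are appended to a log instead of updating the hash table on every write; once the log is some factor bigger than the table, it is coalesced, and entries with refcount=1 are randomly evicted when over the size limit.

```python
import random

class DedupTable:
    def __init__(self, max_entries=1000, log_ratio=2):
        self.refs = {}          # checksum -> refcount (the in-memory DDT)
        self.log = []           # appended (checksum, delta) refcount changes
        self.max_entries = max_entries
        self.log_ratio = log_ratio

    def incref(self, checksum):
        self._append(checksum, +1)

    def decref(self, checksum):
        self._append(checksum, -1)

    def _append(self, checksum, delta):
        # Appending avoids rewriting the hash table on every write.
        self.log.append((checksum, delta))
        # Coalesce once the log is log_ratio times bigger than the table.
        if len(self.log) > self.log_ratio * max(len(self.refs), 1):
            self.compact()

    def compact(self):
        for checksum, delta in self.log:
            self.refs[checksum] = self.refs.get(checksum, 0) + delta
            if self.refs[checksum] <= 0:
                del self.refs[checksum]
        self.log.clear()
        # DDT limit: evict a random single-reference entry while over the cap.
        while len(self.refs) > self.max_entries:
            singles = [c for c, r in self.refs.items() if r == 1]
            if not singles:
                break
            del self.refs[random.choice(singles)]
```

The write-amplification win is that many increments/decrements to hot entries collapse into one table update at compaction time.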
- Followup / Leader: Jordan Hubbard
- New hash algorithms for dedup: SHA-512 truncated, Skein (with a random initializer, to protect against dedup hash collisions)
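A minimal sketch of the salted, truncated checksum idea: mixing a random per-pool value into the hash makes checksums unpredictable to an attacker trying to construct collisions against dedup. Python's hashlib has no Skein, so truncated SHA-512 stands in here; the function and salt names are invented for the example.

```python
import hashlib
import os

# Generated once at pool creation and stored on disk (illustrative only).
POOL_SALT = os.urandom(32)

def dedup_checksum(block: bytes, salt: bytes = POOL_SALT) -> bytes:
    # SHA-512 truncated to 256 bits, the size of a ZFS checksum field.
    return hashlib.sha512(salt + block).digest()[:32]
```

Two pools with different salts produce different checksums for identical data, so a collision crafted against one pool is useless against another.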
- secure delete
- FreeBSD TRIM will lose the list of blocks to be trimmed on a power failure
- Saso's TRIM does not appear to have this problem
- Collects all frees from the last 32 txgs
- Waits a further 32 txgs
- Removes any ranges that have been overwritten in the meantime
- Runs the TRIM command on the ranges
- The SSD's FTL means overwriting a block is not good enough, as the new data may be written to a different physical block
- Current drives do not provide a trustworthy "secure delete"
- iOS/OS X devices control flash NAND directly, and avoid the FTL
- Eager zero: force allocation of sparse storage by passively writing zeros to all blocks
- Could be used to overwrite all unallocated space periodically
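The delayed-TRIM scheme described above (Saso's approach) can be sketched as follows; the class and constants are invented for illustration. Frees are batched per txg, aged, dropped if the range is rewritten in the meantime, and only then issued as TRIM, so a power failure just means some TRIMs are redone rather than lost.

```python
from collections import deque

TRIM_TXG_DELAY = 64  # 32 txgs of collection plus a further 32 txgs of aging

class TrimQueue:
    def __init__(self, delay=TRIM_TXG_DELAY):
        self.delay = delay
        self.pending = deque()  # (free_txg, set of freed block ranges)
        self.trimmed = []       # ranges for which TRIM has been issued

    def note_free(self, txg, block):
        if self.pending and self.pending[-1][0] == txg:
            self.pending[-1][1].add(block)
        else:
            self.pending.append((txg, {block}))

    def note_write(self, block):
        # A range reallocated before TRIM runs must not be trimmed.
        for _, blocks in self.pending:
            blocks.discard(block)

    def sync(self, current_txg):
        # Issue TRIM only for batches that have aged past the delay.
        while self.pending and current_txg - self.pending[0][0] >= self.delay:
            _, blocks = self.pending.popleft()
            self.trimmed.extend(sorted(blocks))
```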
- cpu affinity
- In networking, RSS hashing means incoming packets from source A always go to thread 3
- coordinate with the network, so zfs and the disk controller queue use the same CPU as the nic queue etc.
- ABD
- How it works today:
- Store linear buffers, contiguous kernel memory
- single copy of every block in the cache
- Pointer to the data in the ARC is given to the consumer
- Compressed ARC:
- The ARC copy of the data is compressed
- Memory is not actively compressed; rather, data read from disk that is already compressed is not decompressed until it is used
- An intermediate, uncompressed representation of the data is created on demand (when a consumer reads it)
- The consumer gets a pointer to the uncompressed version; when the consumer is done with it, it is returned
- A small LRU cache (the dbuf cache) holds frequently accessed blocks
- ABD:
- The compressed version of the ARC buffer is a scatter-gather list of 4k buffers containing the data
- Uncompressed buffer is contiguous, but is freed fairly quickly (aside from the dbuf cache)
- Platform specific:
- IllumOS and Linux implementations in progress
- Followup / Leader: Dan Kimmel
- FreeBSD Contact: John Baldwin
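The ABD design above can be sketched in miniature: the cached (compressed) copy lives as a scatter-gather list of page-sized chunks rather than one contiguous kernel buffer, and a contiguous uncompressed buffer is only materialized when a consumer asks for it. This is a conceptual toy, not the real ABD API; zlib stands in for ZFS's compression.

```python
import zlib

CHUNK = 4096  # page-sized pieces, standing in for page_t's

class AbdBuf:
    def __init__(self, data: bytes):
        comp = zlib.compress(data)
        # Scatter the compressed bytes across page-sized chunks; no large
        # contiguous allocation is needed for the cached copy.
        self.chunks = [comp[i:i + CHUNK] for i in range(0, len(comp), CHUNK)]

    def borrow_linear(self) -> bytes:
        # Gather and decompress on demand; the caller frees this copy when
        # done (in ZFS, aside from what the dbuf cache retains).
        return zlib.decompress(b"".join(self.chunks))
```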
- NUMA
- dbuf cache per domain for improved locality
- SMR
- CAM Changes
- GEOM Changes:
- Discover zones, attributes
- Possible to expose zones as separate devices?
- With RAID-5 on 4+1 disks, four 256 MB zones would be exposed as one logical 256 MB * 4 = 1 GB zone. Software would need to manage this logical mapping
- How to deal with garbage collections, compaction
- There are random I/O zones, usually the lowest LBAs. Sometimes a runt zone at the end as well
- Teach ZFS to decide if certain blocks should be written to random I/O zones, and data to sequential zones
- Backpointers
- Use a logical to physical zone mapping. Copy an old zone and fill gaps with new data, to avoid changing the offsets/checksums of the old data
- SMR has multiple zone types: Standard (random), Sequential Write Required (append only), Sequential Write Preferred (slower, write-amplifies old data for you), and a possible new type called Circular Buffer Zone (rewrite in the same LBAs)
- What about when you swap drives for newer/bigger/different zone sized ones? If the layout of the file system is specific to the drives...
- Can you turn on 'SMR support' for a pool that isn't SMR "yet"?
- A GEOM layer to coalesce random writes into zones, backed by persistent memory
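The logical-to-physical zone mapping idea above can be sketched as follows (all names invented): garbage collection copies a zone's live blocks at their original in-zone offsets into a fresh physical zone, interleaving new data into the freed gaps, then swaps the mapping. Logical offsets never move, so ZFS's block pointers and checksums for the old data stay valid.

```python
class ZoneMap:
    def __init__(self, nzones, nspare=1):
        self.l2p = list(range(nzones))                # logical -> physical zone
        self.free = list(range(nzones, nzones + nspare))  # spare physical zones
        self.store = {z: {} for z in range(nzones + nspare)}  # zone -> {offset: data}

    def read(self, lzone, offset):
        return self.store[self.l2p[lzone]].get(offset)

    def compact(self, lzone, new_data):
        # new_data: {offset: data} destined for the holes in this zone.
        old = self.store[self.l2p[lzone]]
        dest = self.free.pop()
        # Sequentially rewrite live blocks at their original offsets,
        # filling the gaps with new data -- offsets/checksums are unchanged.
        self.store[dest] = dict(sorted({**old, **new_data}.items()))
        self.free.append(self.l2p[lzone])
        self.l2p[lzone] = dest
```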
- encryption
- Want per-dataset encryption
- How to control keys
- What metadata to encrypt, and what must be unencrypted
- Encrypt the pool-level metadata, and optionally each dataset
- What are the use cases?
- Want to be able to start the system, and some services without requiring the key
- RAID-Y
- Hybrid of RAID-Z and RAID-5, to avoid performance and padding overhead of RAID-Z
- A whole block would only be on one disk, so multiple readers would conflict less
- Might allow additional devices to be added to the vdev after the fact. Allow upgrading to additional parity
- Better random read performance
- Future Features
- Device Removal. stripes only is ready at Delphix
- To support mirrors, the checksums of all data read must be verified, so that on a checksum failure the data can be re-read from the other side of the mirror
- For RAID-Z there are alignment constraints: make sure blocks are spread across devices, and do not allocate data and parity to the same device. Padding might be different on the destination device
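The mirror requirement above amounts to a checksum-verified read with fallback; a toy sketch (function name invented, SHA-256 standing in for the ZFS block checksum):

```python
import hashlib

def mirror_read(sides, expected_checksum):
    # sides: the candidate copies of a block, e.g. both halves of a mirror.
    # Every copy must be verified; a bad copy triggers a read of the other.
    for data in sides:
        if hashlib.sha256(data).digest() == expected_checksum:
            return data
    raise IOError("all mirror copies failed checksum")
```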
- Channel Programs
- Allows multiple administrative actions to be performed as a single atomic operation
- Going into production with Delphix
- First version will have a limited number of operations
- A ZFS-private variant of Lua
- Sync all actions as a single transaction group
- How will this be pulled into FreeBSD?
- Available Operations: Destroy snap, Destroy dataset, iterate over snapshots and clones.
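A conceptual sketch of why channel programs matter (this is not the real Lua interface; the class and method names are invented): several administrative operations are staged and either all take effect in one sync, or none do.

```python
class Pool:
    def __init__(self):
        self.datasets = {"tank/fs", "tank/fs@snap1", "tank/fs@snap2"}

    def run_channel_program(self, ops):
        # Stage all changes against a copy of the current state.
        staged = set(self.datasets)
        for op, name in ops:             # e.g. ("destroy", "tank/fs@snap1")
            if op == "destroy":
                if name not in staged:
                    raise KeyError(name)  # any failure aborts the whole batch
                staged.discard(name)
        # All actions commit together, as in a single transaction group.
        self.datasets = staged
```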
- Code Review Needed:
- Persistent L2ARC
- Write-back cache
- Large dnode support (from Lustre). System Attributes/metadata too big for the bonus buffer create a spill block; use a 2x or 3x sized dnode instead
- OpenZFS on Ubuntu