ZFS
Organizer: Matt Ahrens <mahrens@FreeBSD.org>
Abstract:
Topics:
- Implementing the new hashing algorithms (SHA-512t, Skein)
- ARC/VFS Interactions (a find(1) on a large SVN tree will cause the ARC to fill with unevictable metadata and kill the system)
- ABD (ARC Buf Data) - filling the ARC with scatter-gather list of page_t's instead of contiguous kernel VA
- swap/dump (especially dump) to ZVOL
- complete the testing/review of the UEFI ZFS bootloader so it can be enabled by default
- defining the roadmap for fixing ZFS trim in FreeBSD
- secure delete of files (zero blocks as they are freed)
- fixing vdev layout bugs in the FreeBSD installer (e.g. instead of a stripe of two disk mirrors, a 12 disk install resulted in a single 12-way mirror)
- zfs allow {all-properties,all-commands} (future-proof for new properties)
- Upcoming ZFS features
Attending
In order to attend you need to register for the storage summit, email the working group organizer, and have your working group slot confirmed.
Please do NOT add yourself here. Your name will appear automatically once you have received the confirmation email. You need to put your name on the general storage summit attendees list though.
Name | Username / Affiliation | Topics of Interest | Notes
Matt Ahrens | mahrens / Delphix | all | Organizer
Allan Jude | allanjude | all |
Sean Kelly | FlightAware | |
Dan Kimmel | Delphix | |
Discussion Notes
- dedup performance
- limit the size of the DDT (disable dedup when the DDT is full)
- LRU for the DDT, so old entries are undeduped, and only busy blocks are deduped
- DDT on dedicated device (SSD, NVMe)
- Intel's "metadata classes" work (government contract to upstream to ZoL)
- DDT write amplification (all writes have to update the DDT hash table)
- DDT Log, append the changes (inc/dec the reference counts)
- Once the log is x times bigger than the optimal size, coalesce/compact the log
- Should make the in memory DDT smaller
- Shrink DDT entries by removing the 2nd and 3rd block pointers that are rarely used
- DDT limit: evict a random entry with refcount=1 on each new addition. If a dedup block is read and is not found in the DDT, there is only one copy of it
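The DDT-log and DDT-limit ideas above can be sketched roughly as follows. This is a toy illustration, not the real DDT code: the class, field names, and thresholds are all invented for the example. Refcount changes are appended to a log instead of updating the hash table on every write; once the log is some factor bigger than the table, it is coalesced, and entries with refcount=1 are randomly evicted when over the size limit.

```python
import random

class DedupTable:
    def __init__(self, max_entries=1000, log_ratio=2):
        self.refs = {}          # checksum -> refcount (the in-memory DDT)
        self.log = []           # appended (checksum, delta) refcount changes
        self.max_entries = max_entries
        self.log_ratio = log_ratio

    def incref(self, checksum):
        self._append(checksum, +1)

    def decref(self, checksum):
        self._append(checksum, -1)

    def _append(self, checksum, delta):
        # Appending avoids rewriting the hash table on every write.
        self.log.append((checksum, delta))
        # Coalesce once the log is log_ratio times bigger than the table.
        if len(self.log) > self.log_ratio * max(len(self.refs), 1):
            self.compact()

    def compact(self):
        for checksum, delta in self.log:
            self.refs[checksum] = self.refs.get(checksum, 0) + delta
            if self.refs[checksum] <= 0:
                del self.refs[checksum]
        self.log.clear()
        # DDT limit: evict a random single-reference entry while over the cap.
        while len(self.refs) > self.max_entries:
            singles = [c for c, r in self.refs.items() if r == 1]
            if not singles:
                break
            del self.refs[random.choice(singles)]
```

The write-amplification win is that many increments/decrements to hot entries collapse into one table update at compaction time.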
- Followup / Leader: Jordan Hubbard
- New hash algorithms for dedup: SHA-512 truncated, Skein (with a random initializer, to protect against dedup hash collisions)
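A minimal sketch of the salted, truncated checksum idea: mixing a random per-pool value into the hash makes checksums unpredictable to an attacker trying to construct collisions against dedup. Python's hashlib has no Skein, so truncated SHA-512 stands in here; the function and salt names are invented for the example.

```python
import hashlib
import os

# Generated once at pool creation and stored on disk (illustrative only).
POOL_SALT = os.urandom(32)

def dedup_checksum(block: bytes, salt: bytes = POOL_SALT) -> bytes:
    # SHA-512 truncated to 256 bits, the size of a ZFS checksum field.
    return hashlib.sha512(salt + block).digest()[:32]
```

Two pools with different salts produce different checksums for identical data, so a collision crafted against one pool is useless against another.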
- secure delete
- FreeBSD TRIM will lose the list of blocks to be trimmed on a power failure
- Saso's TRIM does not appear to have this problem
- Collects all frees from the last 32 txgs
- Waits a further 32 txgs
- Removes any ranges that have been overwritten in the meantime
- Runs the TRIM command on the ranges
- The SSD's FTL means overwriting a block is not good enough, as the new data may be written to a different physical block
- Current drives do not provide a trustworthy "secure delete"
- iOS/OS X devices control flash NAND directly, and avoid the FTL
- Eager zero: force allocation of sparse storage by passively writing zeros to all blocks
- Could be used to overwrite all unallocated space periodically
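The delayed-TRIM scheme described above (Saso's approach) can be sketched as follows; the class and constants are invented for illustration. Frees are batched per txg, aged, dropped if the range is rewritten in the meantime, and only then issued as TRIM, so a power failure just means some TRIMs are redone rather than lost.

```python
from collections import deque

TRIM_TXG_DELAY = 64  # 32 txgs of collection plus a further 32 txgs of aging

class TrimQueue:
    def __init__(self, delay=TRIM_TXG_DELAY):
        self.delay = delay
        self.pending = deque()  # (free_txg, set of freed block ranges)
        self.trimmed = []       # ranges for which TRIM has been issued

    def note_free(self, txg, block):
        if self.pending and self.pending[-1][0] == txg:
            self.pending[-1][1].add(block)
        else:
            self.pending.append((txg, {block}))

    def note_write(self, block):
        # A range reallocated before TRIM runs must not be trimmed.
        for _, blocks in self.pending:
            blocks.discard(block)

    def sync(self, current_txg):
        # Issue TRIM only for batches that have aged past the delay.
        while self.pending and current_txg - self.pending[0][0] >= self.delay:
            _, blocks = self.pending.popleft()
            self.trimmed.extend(sorted(blocks))
```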
- cpu affinity
- In networking, RSS hashing means incoming packets from source A always go to thread 3
- coordinate with the network, so zfs and the disk controller queue use the same CPU as the nic queue etc.
- ABD
- How it works today:
- Store linear buffers, contiguous kernel memory
- single copy of every block in the cache
- Pointer to the data in the ARC is given to the consumer
- Compressed ARC:
- The ARC copy of the data is compressed
- Memory is not actively compressed; rather, data read from disk that is already compressed is not decompressed until it is used
- An intermediate, uncompressed representation of the data is created on demand (when a consumer reads it)
- The consumer gets a pointer to the uncompressed version; when the consumer is done with it, it is returned
- A small LRU cache (the dbuf cache) holds frequently accessed blocks
- ABD:
- The compressed version of the ARC buffer is a scatter-gather list of 4k buffers containing the data
- Uncompressed buffer is contiguous, but is freed fairly quickly (aside from the dbuf cache)
- Platform specific:
- IllumOS and Linux implementations in progress
- Followup / Leader: Dan Kimmel
- FreeBSD Contact: John Baldwin
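The ABD design above can be sketched in miniature: the cached (compressed) copy lives as a scatter-gather list of page-sized chunks rather than one contiguous kernel buffer, and a contiguous uncompressed buffer is only materialized when a consumer asks for it. This is a conceptual toy, not the real ABD API; zlib stands in for ZFS's compression.

```python
import zlib

CHUNK = 4096  # page-sized pieces, standing in for page_t's

class AbdBuf:
    def __init__(self, data: bytes):
        comp = zlib.compress(data)
        # Scatter the compressed bytes across page-sized chunks; no large
        # contiguous allocation is needed for the cached copy.
        self.chunks = [comp[i:i + CHUNK] for i in range(0, len(comp), CHUNK)]

    def borrow_linear(self) -> bytes:
        # Gather and decompress on demand; the caller frees this copy when
        # done (in ZFS, aside from what the dbuf cache retains).
        return zlib.decompress(b"".join(self.chunks))
```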
- NUMA
- dbuf cache per domain for improved locality
- SMR
- CAM Changes
- GEOM Changes:
- Discover zones, attributes
- Possible to expose zones as separate devices?
- With RAID-5 on 4+1 disks, four 256 MB zones would be exposed as one logical 256 MB * 4 = 1 GB zone. Software would need to manage this logical mapping
- How to deal with garbage collections, compaction
- There are random I/O zones, usually the lowest LBAs. Sometimes a runt zone at the end as well
- Teach ZFS to decide if certain blocks should be written to random I/O zones, and data to sequential zones
- Backpointers
- Use a logical to physical zone mapping. Copy an old zone and fill gaps with new data, to avoid changing the offsets/checksums of the old data
- SMR has multiple zone types: Standard (random), Sequential Write Required (append only), Sequential Write Preferred (slower, write-amplifies old data for you), and a possible new type called Circular Buffer Zone (rewrite in the same LBAs)
- What about when you swap drives for newer/bigger/different zone sized ones? If the layout of the file system is specific to the drives...
- Can you turn on 'SMR support' for a pool that isn't SMR "yet"?
- A GEOM layer to coalesce random writes into zones, backed by persistent memory
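The logical-to-physical zone mapping idea above can be sketched as follows (all names invented): garbage collection copies a zone's live blocks at their original in-zone offsets into a fresh physical zone, interleaving new data into the freed gaps, then swaps the mapping. Logical offsets never move, so ZFS's block pointers and checksums for the old data stay valid.

```python
class ZoneMap:
    def __init__(self, nzones, nspare=1):
        self.l2p = list(range(nzones))                # logical -> physical zone
        self.free = list(range(nzones, nzones + nspare))  # spare physical zones
        self.store = {z: {} for z in range(nzones + nspare)}  # zone -> {offset: data}

    def read(self, lzone, offset):
        return self.store[self.l2p[lzone]].get(offset)

    def compact(self, lzone, new_data):
        # new_data: {offset: data} destined for the holes in this zone.
        old = self.store[self.l2p[lzone]]
        dest = self.free.pop()
        # Sequentially rewrite live blocks at their original offsets,
        # filling the gaps with new data -- offsets/checksums are unchanged.
        self.store[dest] = dict(sorted({**old, **new_data}.items()))
        self.free.append(self.l2p[lzone])
        self.l2p[lzone] = dest
```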
- encryption
- Want per-dataset encryption
- How to control keys
- What metadata to encrypt, and what must be unencrypted
- Encrypt the pool-level metadata, and optionally each dataset
- What are the use cases?
- Want to be able to start the system, and some services without requiring the key
- RAID-Y
- Hybrid of RAID-Z and RAID-5, to avoid performance and padding overhead of RAID-Z
- A whole block would only be on one disk, so multiple readers would conflict less
- Might allow additional devices to be added to the vdev after the fact. Allow upgrading to additional parity
- Better random read performance
- Future Features
- Device Removal. stripes only is ready at Delphix
- To support mirrors, the checksums of all data read must be verified, so that on a checksum failure the data can be re-read from the other side of the mirror
- For RAID-Z there are alignment constraints: make sure blocks are spread across devices, and do not allocate data and parity to the same device. Padding might be different on the destination device
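The mirror requirement above amounts to a checksum-verified read with fallback; a toy sketch (function name invented, SHA-256 standing in for the ZFS block checksum):

```python
import hashlib

def mirror_read(sides, expected_checksum):
    # sides: the candidate copies of a block, e.g. both halves of a mirror.
    # Every copy must be verified; a bad copy triggers a read of the other.
    for data in sides:
        if hashlib.sha256(data).digest() == expected_checksum:
            return data
    raise IOError("all mirror copies failed checksum")
```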
- Channel Programs
- Allows multiple administrative actions to be performed as a single atomic operation
- Going into production with Delphix
- First version will have a limited number of operations
- A ZFS-private variant of Lua
- Sync all actions as a single transaction group
- How will this be pulled into FreeBSD?
- Available Operations: Destroy snap, Destroy dataset, iterate over snapshots and clones.
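A conceptual sketch of why channel programs matter (this is not the real Lua interface; the class and method names are invented): several administrative operations are staged and either all take effect in one sync, or none do.

```python
class Pool:
    def __init__(self):
        self.datasets = {"tank/fs", "tank/fs@snap1", "tank/fs@snap2"}

    def run_channel_program(self, ops):
        # Stage all changes against a copy of the current state.
        staged = set(self.datasets)
        for op, name in ops:             # e.g. ("destroy", "tank/fs@snap1")
            if op == "destroy":
                if name not in staged:
                    raise KeyError(name)  # any failure aborts the whole batch
                staged.discard(name)
        # All actions commit together, as in a single transaction group.
        self.datasets = staged
```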
- Code Review Needed:
- Persistent L2ARC
- Write-back cache
- Large dnode support (from Lustre). System Attributes/metadata too big for the bonus buffer create a spill block; use a 2x or 3x sized dnode instead
- OpenZFS on Ubuntu