Highly Available Storage (HAST)
Introduction
HAST allows data to be stored transparently on two physically separated machines connected over a TCP/IP network. Those two machines together are called a cluster and each machine is one cluster node. HAST works in a Primary-Secondary (Master-Backup, Master-Slave) configuration, which means that only one of the cluster nodes can be active at any given time. The active node is called the primary node; it is the node that handles I/O requests to HAST-managed devices. Currently HAST is limited to two cluster nodes in total.
HAST operates on the block level - it provides disk-like devices in the /dev/hast/ directory for use by file systems and/or applications. Working on the block level makes it transparent to file systems and applications. There is no difference between using a HAST-provided device and a raw disk, partition, etc. All of them are just regular GEOM providers in FreeBSD.
HAST can be compared to a RAID1 (mirror) where one of the components is the local disk (on the primary node) and the second component is a disk on the remote machine (the secondary node). Every write, delete or flush operation (BIO_WRITE, BIO_DELETE, BIO_FLUSH) is sent to the local disk and to the remote disk over the TCP connection (if the secondary node is available). Every read operation (BIO_READ) is served from the local disk, unless the local disk isn't up-to-date or an I/O error occurs, in which case the read operation is sent to the secondary node (if it is, of course, available).
It is very important to reduce synchronization time after a node's outage. Synchronizing all data when the connection is lost for just a few minutes is far from optimal. To provide fast synchronization, HAST manages an on-disk bitmap of dirty extents. Every extent is represented by one bit in the bitmap, so an extent is the smallest block that can be marked as dirty by HAST. By default the extent size is 2MB. Even though a write is confirmed to the file system above only when both nodes confirm it, it is still possible for the nodes to get out of sync, for example when a write succeeds on the primary node but is never sent to the secondary node. This is why an extent has to be marked as dirty before writing the data. Of course it would be very slow if HAST had to:
- Mark extent as dirty and write metadata.
- Write data.
- Mark extent as clean and write metadata.
It would mean that each write operation is turned into three write operations. To avoid this, HAST keeps a fixed number of extents marked as dirty all the time. By default those are the 64 most recently touched extents. This of course means that when the nodes connect they have to synchronize by default 128MB of data which doesn't really need to be synchronized, but on a local network this is a very quick operation and it is definitely worth it. Note that the extent size has to be chosen carefully. If the extent size is too small, there will be a lot of metadata updates, which will degrade overall performance. If the extent size is too big, synchronization time can be much longer, which will also degrade performance before synchronization completes.
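If the defaults don't match a workload, the extent size and the number of kept-dirty extents can, to my knowledge, be chosen when the on-disk metadata is created, via the -e and -k options of the hastctl(8) create command. A minimal sketch (the values of 4194304 bytes and 32 extents are purely illustrative; check hastctl(8) on your system for the exact option syntax):

hasta# hastctl create -e 4194304 -k 32 test
hastb# hastctl create -e 4194304 -k 32 test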
HAST is not responsible for selecting a node's role (primary or secondary). A node's role has to be configured by an administrator or by other software like heartbeat or ucarp, using the hastctl(8) utility.
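For example, an administrator (or a failover script driven by ucarp or heartbeat) switches roles with commands like the following; the resource name test matches the example used throughout this page:

hasta# hastctl role primary test
hastb# hastctl role secondary test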
Replication Modes
Not all of the replication modes described below are currently implemented; the unimplemented ones are described as well to show the differences and to note the desire to implement them.
memsync - Report the write operation as completed when the local write completes and when the remote node acknowledges data arrival, but before actually storing the data. The data on the remote node will be stored directly after sending the acknowledgement. This mode is intended to reduce latency, but still provide very good reliability. The only situation where some small amount of data could be lost is the following: data is stored on the primary and sent to the secondary; the secondary acknowledges the data and the primary reports success to the application; before the data is really stored on the secondary node, that node goes down for some period of time; before the secondary returns, the primary node dies entirely; the secondary node comes back to life and becomes the new primary. Unfortunately, some small amount of data that was confirmed to the application as safe has been lost. The risk of such a situation is very low. The memsync replication mode is the default.
fullsync - Report the write operation as completed when the local write completes and when the remote write completes. This is the safest and the slowest replication mode.
async - Report the write operation as completed when the local write completes. This is the fastest and the most dangerous replication mode. It should be used when replicating to a distant node where latency is too high for the other modes. The async replication mode is currently not implemented.
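The replication mode is selected in hast.conf(5) with the replication keyword (fullsync, memsync or async). As a sketch, a resource pinned to the safest mode could be declared like this; the host names, addresses and disk match the example used later on this page:

resource test {
    replication fullsync
    on hasta {
        local /dev/da0
        remote 10.8.0.2
    }
    on hastb {
        local /dev/da0
        remote 10.8.0.1
    }
}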
How does Synchronization work?
When a node works as primary and the secondary is missing, it increases localcnt in its metadata. When the secondary is connected and synced, localcnt is made equal to remotecnt, which means the nodes are more or less in sync.
A split-brain condition arises when both nodes are unable to communicate and both are configured as primary. In that case they can both make incompatible changes to the data, and we have to detect it. Under a split-brain condition we will increase our localcnt on the first write and the remote node will increase its localcnt on its first write. When we connect, we can see that the primary's localcnt is greater than our remotecnt (the primary was modified while we weren't watching) and that our localcnt is greater than the primary's remotecnt (we were modified while the primary wasn't watching).
There are many possible combinations, all gathered below. Don't pay too much attention to the exact numbers; what matters is how they compare. We compare the secondary's localcnt with the primary's remotecnt, and the secondary's remotecnt with the primary's localcnt. Note that every case where the primary's localcnt is smaller than the secondary's remotecnt, and every case where the secondary's localcnt is smaller than the primary's remotecnt, should be impossible in practice; a full synchronization is then performed. Those cases are marked with an asterisk. Regular synchronization means that only extents marked as dirty are synchronized. (See also the sketch of the decision logic after the table.)
Secondary metadata | Primary metadata   | Synchronization type
-------------------|--------------------|-----------------------------------------------------
local=3, remote=3  | local=2, remote=2* | ?! Full sync from secondary.
local=3, remote=3  | local=2, remote=3* | ?! Full sync from primary.
local=3, remote=3  | local=2, remote=4* | ?! Full sync from primary.
local=3, remote=3  | local=3, remote=2  | Primary is out-of-date, regular sync from secondary.
local=3, remote=3  | local=3, remote=3  | Regular sync just in case.
local=3, remote=3  | local=3, remote=4* | ?! Full sync from primary.
local=3, remote=3  | local=4, remote=2  | Split-brain condition.
local=3, remote=3  | local=4, remote=3  | Secondary out-of-date, regular sync from primary.
local=3, remote=3  | local=4, remote=4* | ?! Full sync from primary.
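The decision logic from the table can be written down compactly. The following is only an illustrative sketch in plain sh (not the actual hastd code) that classifies a case from the four counters:

#!/bin/sh
# Illustrative sketch: classify the synchronization decision from the
# counters, following the table above. Not taken from hastd itself.
# Usage: sync_decision <primary_local> <primary_remote> <secondary_local> <secondary_remote>
sync_decision()
{
    plocal=$1; premote=$2; slocal=$3; sremote=$4

    if [ "${plocal}" -lt "${sremote}" ] || [ "${premote}" -gt "${slocal}" ]; then
        # The cases marked with an asterisk: impossible in practice.
        echo "?! Full synchronization required."
    elif [ "${plocal}" -gt "${sremote}" ] && [ "${premote}" -lt "${slocal}" ]; then
        echo "Split-brain condition."
    elif [ "${plocal}" -gt "${sremote}" ]; then
        echo "Secondary out-of-date, regular sync from primary."
    elif [ "${premote}" -lt "${slocal}" ]; then
        echo "Primary out-of-date, regular sync from secondary."
    else
        echo "Nodes in sync, regular sync just in case."
    fi
}

sync_decision 4 2 3 3   # -> Split-brain condition.
sync_decision 3 2 3 3   # -> Primary out-of-date, regular sync from secondary.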
Setting up HAST
This section will describe how to integrate HAST with UCARP (/usr/ports/net/ucarp).
Let's assume we have two machines for our needs (hasta and hastb). Each has one network interface, em0. The hasta node is using IP 10.8.0.1/24 and hastb is using IP 10.8.0.2/24. For ucarp purposes they will share 10.8.0.3. For HAST purposes we will use the local /dev/da0 disk, which is of equal size on both nodes.
First let's configure HAST. We need one identical /etc/hast.conf on both nodes. The config will be as simple as possible:
resource test {
    on hasta {
        local /dev/da0
        remote 10.8.0.2
    }
    on hastb {
        local /dev/da0
        remote 10.8.0.1
    }
}
Below is how the ucarp starting script could look. The script will automatically detect which node it is run on and will act accordingly.
#!/bin/sh

# Shared IP address, unused for now.
addr="10.8.0.3"
# Password for UCARP communication.
pass="password"
# First node IP and interface for UCARP communication.
nodea_srcip="10.8.0.1"
nodea_ifnet="em0"
# Second node IP and interface for UCARP communication.
nodeb_srcip="10.8.0.2"
nodeb_ifnet="em0"

export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

vhid="1"
upscript="/usr/local/bin/vip-up.sh"
downscript="/usr/local/bin/vip-down.sh"

ifconfig "${nodea_ifnet}" 2>/dev/null | grep -q "inet ${nodea_srcip} "
if [ $? -eq 0 ]; then
    srcip="${nodea_srcip}"
    ifnet="${nodea_ifnet}"
    node="node A"
fi
ifconfig "${nodeb_ifnet}" 2>/dev/null | grep -q "inet ${nodeb_srcip} "
if [ $? -eq 0 ]; then
    if [ -n "${srcip}" -o -n "${ifnet}" ]; then
        echo "Unable to determine which node is this (both match)." >/dev/stderr
        exit 1
    fi
    srcip="${nodeb_srcip}"
    ifnet="${nodeb_ifnet}"
    node="node B"
fi
if [ -z "${srcip}" -o -z "${ifnet}" ]; then
    echo "Unable to determine which node is this (none match)." >/dev/stderr
    exit 1
fi

ucarp -B -i ${ifnet} -s ${srcip} -v ${vhid} -a ${addr} -p ${pass} -u "${upscript}" -d "${downscript}"
Ucarp will execute /usr/local/bin/vip-up.sh once the MASTER role is negotiated and will execute /usr/local/bin/vip-down.sh once switched to the BACKUP role.
Those scripts are rather simple. Even though ucarp will pass the interface name and shared IP address to those scripts, we currently do nothing with that information. We could extend the ucarp_up.sh and ucarp_down.sh scripts provided below to also configure/unconfigure the shared IP when the role changes, but that is not done here.
vip-up.sh:
#!/bin/sh

set -m
/usr/local/bin/ucarp_up.sh &
set +m
We need to turn on job control before running /usr/local/bin/ucarp_up.sh, as we may need to kill it from ucarp_down.sh as a process group. We also need to run ucarp_up.sh in the background, as it can perform time-consuming tasks like fsck(8)ing the file system, etc. If run in the foreground it would pause the ucarp process, allowing the other node to take over the MASTER role (as our node is stopped, it is seen by the other node as if it were down).
vip-down.sh:
#!/bin/sh

/usr/local/bin/ucarp_down.sh
Ok. Now let's look at the ucarp_up.sh and ucarp_down.sh scripts.
The ucarp_up.sh script is responsible for several things. When operating with UFS:
- Switch HAST to PRIMARY role for the given resource.
- Run fsck(8) on the given GEOM provider where our UFS file system is placed.
- Mount the file system.
When operating with ZFS:
- Switch HAST to PRIMARY role for the given resource.
- Forcibly import the given ZFS pool.
And this is how it looks:
#!/bin/sh

# Resource name as defined in /etc/hast.conf.
resource="test"
# Supported file system types: UFS, ZFS
fstype="UFS"
# ZFS pool name. Required only when fstype == ZFS.
pool="test"
# File system mount point. Required only when fstype == UFS.
mountpoint="/mnt/test"
# Name of HAST provider as defined in /etc/hast.conf.
device="/dev/hast/${resource}"

export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

# If there is a secondary worker process, it means that the remote primary
# process is still running. We have to wait for it to terminate.
for i in `jot 30`; do
    pgrep -f "hastd: ${resource} \(secondary\)" >/dev/null 2>&1 || break
    sleep 1
done
if pgrep -f "hastd: ${resource} \(secondary\)" >/dev/null 2>&1; then
    logger -p local0.error -t hast "Secondary process for resource ${resource} is still running after 30 seconds."
    exit 1
fi
logger -p local0.debug -t hast "Secondary process is not running."

# Change role to primary for our resource.
out=`hastctl role primary "${resource}" 2>&1`
if [ $? -ne 0 ]; then
    logger -p local0.error -t hast "Unable to change role to primary for resource ${resource}: ${out}."
    exit 1
fi
# Wait a few seconds for the provider to appear.
for i in `jot 50`; do
    [ -c "${device}" ] && break
    sleep 0.1
done
if [ ! -c "${device}" ]; then
    logger -p local0.error -t hast "Device ${device} didn't appear."
    exit 1
fi
logger -p local0.debug -t hast "Role for resource ${resource} changed to primary."

case "${fstype}" in
UFS)
    # Check the file system.
    fsck -y -t ufs "${device}" >/dev/null 2>&1
    if [ $? -ne 0 ]; then
        logger -p local0.error -t hast "File system check for resource ${resource} failed."
        exit 1
    fi
    logger -p local0.debug -t hast "File system check for resource ${resource} finished."
    # Mount the file system.
    out=`mount -t ufs "${device}" "${mountpoint}" 2>&1`
    if [ $? -ne 0 ]; then
        logger -p local0.error -t hast "File system mount for resource ${resource} failed: ${out}."
        exit 1
    fi
    logger -p local0.debug -t hast "File system for resource ${resource} mounted."
    ;;
ZFS)
    # Import ZFS pool. Do it forcibly as it remembers hostid of
    # the other cluster node.
    out=`zpool import -f "${pool}" 2>&1`
    if [ $? -ne 0 ]; then
        logger -p local0.error -t hast "ZFS pool import for resource ${resource} failed: ${out}."
        exit 1
    fi
    logger -p local0.debug -t hast "ZFS pool for resource ${resource} imported."
    ;;
esac

logger -p local0.info -t hast "Successfully switched to primary for resource ${resource}."

exit 0
The ucarp_down.sh script is responsible for several things. When operating with UFS:
- Kill ucarp_up.sh and its children if it's running.
- Forcibly unmount UFS file system.
- Switch HAST to SECONDARY role for the given resource.
When operating with ZFS:
- Kill ucarp_up.sh and its children if it's running.
- Forcibly export the given ZFS pool.
- Switch HAST to SECONDARY role for the given resource.
And the code:
#!/bin/sh

# Resource name as defined in /etc/hast.conf.
resource="test"
# Supported file system types: UFS, ZFS
fstype="UFS"
# ZFS pool name. Required only when fstype == ZFS.
pool="test"
# File system mount point. Required only when fstype == UFS.
mountpoint="/mnt/test"
# Name of HAST provider as defined in /etc/hast.conf.
# Required only when fstype == UFS.
device="/dev/hast/${resource}"

export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

# Kill UP script if it still runs in the background.
sig="TERM"
for i in `jot 30`; do
    pgid=`pgrep -f ucarp_up.sh | head -1`
    [ -n "${pgid}" ] || break
    kill -${sig} -- -${pgid}
    sig="KILL"
    sleep 1
done
if [ -n "${pgid}" ]; then
    logger -p local0.error -t hast "UCARP UP process for resource ${resource} is still running after 30 seconds."
    exit 1
fi
logger -p local0.debug -t hast "UCARP UP is not running."

case "${fstype}" in
UFS)
    mount | egrep -q "^${device} on "
    if [ $? -eq 0 ]; then
        # Forcibly unmount file system.
        out=`umount -f "${mountpoint}" 2>&1`
        if [ $? -ne 0 ]; then
            logger -p local0.error -t hast "Unable to unmount file system for resource ${resource}: ${out}."
            exit 1
        fi
        logger -p local0.debug -t hast "File system for resource ${resource} unmounted."
    fi
    ;;
ZFS)
    zpool list | egrep -q "^${pool} "
    if [ $? -eq 0 ]; then
        # Forcibly export the pool.
        out=`zpool export -f "${pool}" 2>&1`
        if [ $? -ne 0 ]; then
            logger -p local0.error -t hast "Unable to export pool for resource ${resource}: ${out}."
            exit 1
        fi
        logger -p local0.debug -t hast "ZFS pool for resource ${resource} exported."
    fi
    ;;
esac

# Change role to secondary for our resource.
out=`hastctl role secondary "${resource}" 2>&1`
if [ $? -ne 0 ]; then
    logger -p local0.error -t hast "Unable to change role to secondary for resource ${resource}: ${out}."
    exit 1
fi
logger -p local0.debug -t hast "Role for resource ${resource} changed to secondary."

logger -p local0.info -t hast "Successfully switched to secondary for resource ${resource}."

exit 0
Before we start ucarp.sh we have to do some initialization. The following commands will place initial metadata onto the local disks and will start hastd:
hasta# hastctl create test
hasta# service hastd onestart
hastb# hastctl create test
hastb# service hastd onestart
Configure hasta node as primary for resource test:
hasta# hastctl role primary test
And hastb as secondary:
hastb# hastctl role secondary test
The /dev/hast/test device will only appear on the hasta node (primary), so we create the file system there:
hasta# newfs -U /dev/hast/test
Depending on the disk size, synchronization might still be in progress at this point; it can be checked by running the following command and observing the 'dirty' field:
hasta# hastctl status test
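If a script needs to wait for full redundancy before proceeding, it could poll this status until no dirty data remains. This is only a rough sketch and assumes the 'dirty' field of the status output reports 0 once synchronization completes; verify the exact output format with hastctl(8) on your system:

# Rough sketch: wait until no dirty extents are reported (assumed output format).
while hastctl status test | grep -Eq 'dirty:.*[1-9]'; do
    sleep 5
done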
HAST is ready; now we only need to start the ucarp.sh script on both nodes:
hasta# ucarp.sh
hastb# ucarp.sh
Of course it isn't really useful to have an empty redundant file system. The ucarp_up.sh script can be modified to start, and ucarp_down.sh to stop, an application of our choice that will use the redundant HAST device.
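For instance, assuming the application ships with an rc.d script (the name myapp below is only a placeholder), ucarp_up.sh could be extended just before its final exit 0 along these lines, with a matching service myapp onestop added to ucarp_down.sh before the role is switched to secondary:

# Hypothetical addition to the end of ucarp_up.sh: start the application
# that uses the HAST-backed file system ("myapp" is a placeholder rc.d name).
out=`service myapp onestart 2>&1`
if [ $? -ne 0 ]; then
    logger -p local0.error -t hast "Unable to start myapp for resource ${resource}: ${out}."
    exit 1
fi
logger -p local0.debug -t hast "myapp started for resource ${resource}."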
To test HAST failover we can send the USR2 signal to ucarp on the MASTER node, which will make it downgrade to BACKUP:
hasta# pkill -USR2 -f 'ucarp -B'
If it works fine, feel free to enable hastd so it will be automatically started after a reboot. For this, run
sysrc hastd_enable="YES"
on both nodes.
The scripts pasted above can be found in /usr/share/examples/hast/.
More Information
hast.conf(5) - Configuration file for the hastd(8) daemon and the hastctl(8) utility
hastctl(8) - Highly Available Storage Control Utility
hastd(8) - Highly Available Storage daemon
If you have any questions or comments, you can ask on the freebsd-fs mailing list