Rubrik Continuous Data Protection (CDP) helps our customers protect mission-critical VMware workloads with near-zero Recovery Point Objective (RPO). Recovery operations are available at both local and remote locations. CDP also integrates seamlessly with Rubrik Orchestrated Application Recovery to provide near-zero RPO and low Recovery Time Objective (RTO) disaster recovery for our customers.

This was a significant accomplishment, made possible by work from experts across the Rubrik tech stack to build this functionality and make it resilient to challenges such as network congestion, network partitions, and I/O bursts in the protected virtual machines. In this blog, we will break down the components of our solution and how we achieved a lightning-fast, fault-tolerant continuous data protection pipeline.

Getting the setup out of the way

Rubrik CDP is seamlessly integrated into the Rubrik platform: a toggle in the Service Level Agreement (SLA) wizard allows our customers to enable CDP as part of the SLA, on top of the standard backup and replication functionality provided by Rubrik. Rubrik allows recovery to any point in time in the last 24 hours on both local and remote locations (except during the small intervals when we go into Resync, which is described later in this blog). For CDP to start, the customer has to install Rubrik’s I/O filter using CDM (Rubrik Cloud Data Management). Once the I/O filter is installed, any virtual machine protected by a CDP-enabled SLA will automatically start capturing and storing the last few hours of I/O data according to the SLA, starting from the next snapshot.

CDP Architecture

Fig 1: High-level overview of the CDP architecture

CDP can be broken down into two parts: Rubrik I/O filter and CDM.

Rubrik’s I/O filter

The I/O filter is built using VMware’s vSphere APIs for I/O Filtering (VAIO) framework, which allows us to access I/O operations happening on the virtual machine. Unlike some other vendors’ filters, all Rubrik I/O filters go through extensive testing and are certified by VMware as VMware Accepted. The filter has two parts: an I/O filter instance, which is attached to the disks, and a daemon, which runs on the ESXi host. The framework does not allow them to talk to each other directly, so they communicate through shared memory.

The engineering requirements for the I/O filter are:

  • It should not add any significant latency to the I/O path, which would impact the user’s production workload

  • It should send all of the VM’s I/Os to Rubrik, with checks to ensure that I/Os are neither corrupted (bit flips, etc.) nor written out of order (application logic errors, etc.)

  • It should be resilient to I/O bursts and/or network partitions that temporarily prevent it from sending data to Rubrik at the pace at which data is being generated on the virtual machine

Fig 2: A simplified version of how the I/O filter works

The above figure shows the rough architecture of the I/O filter. The VAIO framework provides a callback to the I/O filter instance once an I/O has been processed by the I/O stack. The I/O filter instance captures every successful I/O and writes a few things into shared memory:

  • Compressed I/O

  • Checksum of I/O

  • Sequence number 

  • Monotonic timestamp generated by the ESXi host

The sequence number is incremented on every I/O and, together with the monotonic timestamp, is used to ensure the ordering of the I/Os. The checksum makes the data resilient to corruption: it is verified at every stage on the Rubrik side, such as during persistence to Atlas (Rubrik’s cloud-scale filesystem) and during recovery. The work done by the filter instance was tested extensively to confirm that it is very lightweight, for example by running very heavy FIO workloads and verifying that the I/O filter did not significantly affect the production workload.
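
To make the record format concrete, here is a minimal sketch of what the filter instance might write into shared memory for each acknowledged I/O. The struct and field names (ioRecord, SeqNum, captureIO, etc.) are hypothetical, and the real filter is not written in Go; the point is simply that each entry carries a compressed payload, a checksum of the uncompressed data, a sequence number, and a monotonic timestamp.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"hash/crc32"
	"time"
)

// ioRecord is a hypothetical shape for what the filter instance writes
// into shared memory for every acknowledged write I/O.
type ioRecord struct {
	SeqNum     uint64 // incremented on every I/O; detects reordering or loss
	Timestamp  int64  // monotonic timestamp from the ESXi host
	Offset     int64  // byte offset of the write on the VMDK
	Checksum   uint32 // CRC of the uncompressed payload, verified at every stage
	Compressed []byte // compressed payload
}

// captureIO builds a record from a completed write, as the completion
// callback might do. Field names and compression choice are illustrative.
func captureIO(seq uint64, offset int64, payload []byte) (ioRecord, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(payload); err != nil {
		return ioRecord{}, err
	}
	if err := w.Close(); err != nil {
		return ioRecord{}, err
	}
	return ioRecord{
		SeqNum:     seq,
		Timestamp:  time.Now().UnixNano(), // stand-in for the ESXi monotonic clock
		Offset:     offset,
		Checksum:   crc32.ChecksumIEEE(payload),
		Compressed: buf.Bytes(),
	}, nil
}

func main() {
	rec, _ := captureIO(42, 4096, []byte("example block contents"))
	fmt.Printf("seq=%d offset=%d checksum=%08x compressed=%d bytes\n",
		rec.SeqNum, rec.Offset, rec.Checksum, len(rec.Compressed))
}
```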

The daemon then reads these records from shared memory and sends them to CDM in batches to optimize network performance. The batch size is an important parameter: it trades network utilization against memory consumption on the ESXi host. It can neither be too small, because of the impact on network utilization, nor too big, because of the in-memory buffering it requires. It is a fun exercise for the reader to experiment with this tradeoff and tune the parameter appropriately if you are solving a similar problem.
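
As an illustration of that tradeoff, here is a sketch of a batching loop. The thresholds (maxBatchBytes, maxBatchDelay) and the sendBatch stub are purely illustrative assumptions, not Rubrik’s actual values or RPC: a batch is shipped once it grows large enough or once it has waited long enough.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical thresholds. Larger batches use the network better but hold
// more memory on the ESXi host while they accumulate.
const (
	maxBatchBytes = 4 << 20 // 4 MiB
	maxBatchDelay = 500 * time.Millisecond
)

// sendBatch stands in for the RPC that ships a batch to the CDP service.
func sendBatch(batch [][]byte) {
	total := 0
	for _, r := range batch {
		total += len(r)
	}
	fmt.Printf("sent batch: %d records, %d bytes\n", len(batch), total)
}

// runDaemon drains records (already read from shared memory) and batches them.
func runDaemon(records <-chan []byte) {
	var batch [][]byte
	batchBytes := 0
	timer := time.NewTimer(maxBatchDelay)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		sendBatch(batch)
		batch, batchBytes = nil, 0
	}
	for {
		select {
		case rec, ok := <-records:
			if !ok {
				flush()
				return
			}
			batch = append(batch, rec)
			batchBytes += len(rec)
			if batchBytes >= maxBatchBytes {
				flush()
				timer.Reset(maxBatchDelay)
			}
		case <-timer.C:
			flush()
			timer.Reset(maxBatchDelay)
		}
	}
}

func main() {
	records := make(chan []byte, 16)
	go func() {
		for i := 0; i < 100; i++ {
			records <- make([]byte, 128<<10) // 128 KiB records
		}
		close(records)
	}()
	runDaemon(records)
}
```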

In the happy path, the I/O filter works as mentioned above. This is known as Sync mode (since we are in sync with all the virtual machines’ I/Os).

But there’s a catch here: what happens when network issues, such as network partitions or a high number of TCP drops and retransmits, cause the daemon to get backlogged and run out of shared memory in which to capture the I/Os? To handle such situations, Rubrik came up with a strategy known as Resync.

In cases such as the ones mentioned above, not all I/Os can be sent to the CDP service. When that happens, the I/O filter stops tracking full I/Os and instead tracks only the blocks changed since the last I/O successfully acknowledged by the CDP service. Those changed blocks are sent over until Rubrik is able to catch up and the filter enters Sync mode again. Resync makes CDP very resilient to temporary disruptions. On a stress setup protecting over a hundred virtual machines (all with large, constant data churn and randomly spaced I/O bursts), we verified internally that CDP went into Resync and converged back to Sync mode whenever a network partition or degradation happened, solidifying our confidence in this product.
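
Here is a rough sketch of that mode switch, under the assumption of a simple dirty-block map and a hypothetical tracking granularity (blockSize); the names and limits are illustrative, not the filter’s real data structures. In Sync mode every I/O is queued for the daemon; once the backlog limit is hit, the tracker falls back to remembering only which blocks changed, which stays bounded no matter how long the disruption lasts.

```go
package main

import "fmt"

const blockSize = 64 << 10 // hypothetical tracking granularity: 64 KiB

// tracker sketches the two modes: in Sync mode every I/O is forwarded; in
// Resync mode only the set of changed blocks is remembered.
type tracker struct {
	inResync bool
	queue    [][2]int64         // (offset, length) of pending I/Os in Sync mode
	dirty    map[int64]struct{} // changed block indices in Resync mode
	maxQueue int                // stand-in for "shared memory is full"
}

func newTracker(maxQueue int) *tracker {
	return &tracker{dirty: make(map[int64]struct{}), maxQueue: maxQueue}
}

func (t *tracker) recordWrite(offset, length int64) {
	if !t.inResync && len(t.queue) < t.maxQueue {
		t.queue = append(t.queue, [2]int64{offset, length})
		return
	}
	// Backlogged: drop to Resync and only remember which blocks changed.
	if !t.inResync {
		t.inResync = true
		for _, io := range t.queue { // fold already-queued I/Os into the dirty map
			t.markDirty(io[0], io[1])
		}
		t.queue = nil
	}
	t.markDirty(offset, length)
}

func (t *tracker) markDirty(offset, length int64) {
	for b := offset / blockSize; b <= (offset+length-1)/blockSize; b++ {
		t.dirty[b] = struct{}{}
	}
}

func main() {
	t := newTracker(4)
	for i := int64(0); i < 10; i++ {
		t.recordWrite(i*128<<10, 8<<10)
	}
	fmt.Printf("resync=%v queued=%d dirtyBlocks=%d\n", t.inResync, len(t.queue), len(t.dirty))
}
```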

This is the high-level picture of how Rubrik receives data from the ESXi side. Now let's look at the CDP service on the Rubrik side.

The CDP service on Rubrik receives all the data coming from the filter. It needs to:

  • Have a very efficient data pipeline to fully utilize Atlas, Rubrik’s proprietary file system

  • Guard against the common case of I/O bursts on the protected virtual machines, which sometimes requires processing a lot of I/Os in a short amount of time

  • Replicate data to target locations very efficiently

A couple of terms before we proceed any further. When we deal with the journal files that capture I/Os in Atlas, the CDP service calls two methods on them (a small sketch of the distinction follows the list):

  • Flush - Simply writing data to disk. Note that just after a flush, it is not guaranteed that all replicas of the data have been written in our distributed file system; that happens asynchronously in the background

  • Sync - Making a call to the filesystem to ensure that the data has been replicated to be fault tolerant (for example, after a sync, it is guaranteed that the data will be available even if the node crashes, which is not guaranteed after a flush)
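
The following minimal sketch makes that distinction concrete, using a hypothetical journal interface and a fake in-memory implementation (Atlas’s real API is not shown here): Flush hands bytes to the filesystem, while Sync is the point after which the data is guaranteed to survive a node crash.

```go
package main

import "fmt"

// journal is a hypothetical view of a CDP journal file on Atlas: Flush hands
// the bytes to the filesystem, Sync waits until they are replicated and
// therefore survive a node crash.
type journal interface {
	Flush(data []byte) error // written, but replication may still be in flight
	Sync() error             // data is now fault tolerant across replicas
}

// fakeJournal just counts bytes so the example runs without Atlas.
type fakeJournal struct{ flushed, synced int }

func (j *fakeJournal) Flush(data []byte) error { j.flushed += len(data); return nil }
func (j *fakeJournal) Sync() error             { j.synced = j.flushed; return nil }

func main() {
	j := &fakeJournal{}
	j.Flush(make([]byte, 1<<20))
	fmt.Printf("flushed=%d synced=%d (flushed but not yet crash-safe)\n", j.flushed, j.synced)
	j.Sync()
	fmt.Printf("flushed=%d synced=%d (now replicated)\n", j.flushed, j.synced)
}
```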

To solve these problems, we brainstormed various options and came up with a layered architecture, with each layer handling part of the responsibility and communicating with the others. The CDP service’s data receiver consists of the following parts:

  1. Receiver: This is the outward-facing communication layer for external peers such as the I/O Filter.

  2. Stream Handle: A stream handle corresponds to one I/O filter instance or one VMDK/disk. It stores information relating to the state of the handle.

  3. Stream Handle Manager: This layer consists of the bulk of the logic for asynchronous actions that are scheduled using thread pools. It also manages all the stream handles.

Fig 3: A simplified description of CDP service data pipeline

Received data initially just goes through some basic checks at every layer (such as increasing sequence numbers and timestamps) and is stored in an in-memory buffer. Each handle has a local buffer and can additionally allocate a limited but large amount of memory from a global buffer. These buffers help guard against I/O bursts on the protected virtual machines when the filesystem is temporarily busy and cannot give the application enough throughput to write out all the data.
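
Here is a sketch of how a Stream Handle’s buffering could work, with hypothetical names (streamHandle, globalPool) and sizes: each handle spends its local budget first and then borrows from a bounded global pool, so one bursty disk cannot starve the others, and once both budgets are exhausted we push back on the sender.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// globalPool is a shared memory budget from which every stream handle may
// borrow, on top of its own local buffer. Names and sizes are hypothetical.
type globalPool struct {
	mu        sync.Mutex
	remaining int64
}

func (p *globalPool) tryReserve(n int64) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.remaining < n {
		return false
	}
	p.remaining -= n
	return true
}

// streamHandle buffers I/O batches for one VMDK. It fills its local budget
// first and then borrows from the global pool.
type streamHandle struct {
	localRemaining int64
	pool           *globalPool
	buffered       [][]byte
}

var errBackpressure = errors.New("buffers full: push back on the sender")

func (h *streamHandle) enqueue(batch []byte) error {
	n := int64(len(batch))
	switch {
	case h.localRemaining >= n:
		h.localRemaining -= n
	case h.pool.tryReserve(n): // spill into the shared budget
	default:
		return errBackpressure
	}
	h.buffered = append(h.buffered, batch)
	return nil
}

func main() {
	pool := &globalPool{remaining: 8 << 20}                 // 8 MiB shared
	h := &streamHandle{localRemaining: 2 << 20, pool: pool} // 2 MiB local
	for i := 0; i < 12; i++ {
		if err := h.enqueue(make([]byte, 1<<20)); err != nil {
			fmt.Printf("batch %d: %v\n", i, err)
			break
		}
	}
	fmt.Printf("buffered %d MiB, global pool has %d MiB left\n",
		len(h.buffered), pool.remaining>>20)
}
```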

The Stream Handle Manager layer has thread pools that flush and sync data to Atlas. Whenever a handle has more than a threshold of data in its buffer, or has not been flushed for longer than a certain threshold of time, the scheduler is notified and a task to flush this data is placed in the thread pool’s task queue. Threads pick up tasks from this queue in FIFO order. At any time, only one task can be scheduled per handle in the thread pool; this provides fairness across handles while still writing large chunks together to take advantage of spatial locality.

When a thread is scheduled to flush data to the filesystem, it also syncs the data to Atlas if there is more than a certain threshold of unsynced data or the last sync happened longer ago than a certain threshold. This ensures that we are almost always recoverable to a recent point in time (sub-minute RPO), even in the case of, say, a disaster that suddenly cuts power to the datacenter.
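
Putting those scheduling rules together, here is a sketch with hypothetical names and thresholds (flushBytesThreshold, syncAgeThreshold, etc.); the real values were arrived at through the tuning described below. It shows the one-task-per-handle rule, the FIFO task queue drained by a fixed worker pool, and the sync piggybacked onto a flush when enough unsynced data or time has accumulated.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical thresholds; the real values were tuned against Atlas throughput.
const (
	flushBytesThreshold = 8 << 20          // schedule a flush once this much is buffered
	flushAgeThreshold   = 5 * time.Second  // ...or once the oldest buffered data is this old
	syncBytesThreshold  = 64 << 20         // sync after this much unsynced data
	syncAgeThreshold    = 30 * time.Second // ...or if the last sync is older than this
)

type handle struct {
	name          string
	buffered      int64
	oldestBuffer  time.Time
	unsynced      int64
	lastSync      time.Time
	taskScheduled bool // at most one queued task per handle, for fairness
}

type manager struct {
	mu    sync.Mutex
	queue chan *handle // FIFO task queue drained by a fixed worker pool
}

// maybeSchedule is called whenever data lands on a handle or a timer fires.
func (m *manager) maybeSchedule(h *handle) {
	m.mu.Lock()
	defer m.mu.Unlock()
	due := h.buffered >= flushBytesThreshold ||
		(h.buffered > 0 && time.Since(h.oldestBuffer) >= flushAgeThreshold)
	if due && !h.taskScheduled {
		h.taskScheduled = true
		m.queue <- h
	}
}

// worker flushes a handle's buffer and piggybacks a sync when thresholds are hit.
func (m *manager) worker(wg *sync.WaitGroup) {
	defer wg.Done()
	for h := range m.queue {
		m.mu.Lock()
		n := h.buffered
		h.buffered, h.unsynced = 0, h.unsynced+n
		needSync := h.unsynced >= syncBytesThreshold || time.Since(h.lastSync) >= syncAgeThreshold
		h.taskScheduled = false
		m.mu.Unlock()

		fmt.Printf("flush %s: %d MiB\n", h.name, n>>20)
		if needSync {
			fmt.Printf("sync %s\n", h.name)
			m.mu.Lock()
			h.unsynced, h.lastSync = 0, time.Now()
			m.mu.Unlock()
		}
	}
}

func main() {
	m := &manager{queue: make(chan *handle, 128)}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // small worker pool so writers never thrash the filesystem
		wg.Add(1)
		go m.worker(&wg)
	}
	h := &handle{name: "vmdk-0", oldestBuffer: time.Now(), lastSync: time.Now().Add(-time.Minute)}
	h.buffered = 16 << 20 // pretend 16 MiB just arrived
	m.maybeSchedule(h)
	close(m.queue)
	wg.Wait()
}
```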

A shrewd reader will notice that there are a lot of tunables here. Let’s go over them one-by-one and see how they were tuned:

  • Thread pool size:

    • This cannot be too large, as too many writers writing at once will cause the filesystem to thrash.

    • This was tuned by seeing how much throughput we could get from Atlas by varying this number.

  • In-memory buffer size:

    • This was chosen by measuring how much memory we could take from the Rubrik system without affecting its performance.

    • The CDP service was also built with logic that stops growing the total global buffer when free RAM on the system drops below a certain threshold, ensuring that the CDP service never causes significant memory pressure on the system.

  • Amount of data that we flush

    • This cannot be too large, since we don’t want to spend too much time on one writer; this is especially important when Atlas throughput is low.

    • This was validated by ensuring that we could saturate Atlas and that the scheduling overhead was very small (about 0.01%) compared to the time taken to write this amount, similar to how Linux sizes its time slices to minimize context-switch overhead.

  • Amount of data after which we sync

    • This cannot be too small, since the per-MB cost of a sync increases as we decrease the size.

    • This was tuned by seeing which set of values allows us to reach the maximum throughput we can get from Atlas and choosing a reasonable value.

Tuning these values allowed us to maximize filesystem utilization: the maximum constant throughput from a virtual machine that the CDP service could handle was exactly the maximum constant throughput that Atlas could sustain.

The CDP service also replicates all the I/Os to the remote location specified in the SLA, so that we can recover to a sub-minute RPO on the replication target as well. We use a few simple yet efficient techniques to achieve this:

  • A replication thread pool reads data from the in-memory buffers and sends it to the target. This avoids issuing additional read I/Os to the filesystem, freeing up filesystem throughput.

  • We reuse all of the local I/O-handling code by formatting the request to the remote CDM cluster exactly as the request was received on the source cluster.

In fact, since a remote cluster usually does not have as many protected workloads as a primary cluster, we sometimes see a lower RPO on the remote location than on the source 😄
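
To illustrate the replication path described above, here is a sketch with a hypothetical wire format (writeBatch) and RPC stub (cdmEndpoint): batches are forwarded straight out of the in-memory buffers in the same format the source cluster received them, so the remote cluster can reuse the same ingest code and the local filesystem sees no extra reads.

```go
package main

import "fmt"

// writeBatch is a hypothetical wire format for a batch sent to a CDM cluster.
// The replication path reuses it unchanged.
type writeBatch struct {
	DiskID  string
	SeqFrom uint64
	Payload []byte
}

// cdmEndpoint stands in for the RPC stub to a (local or remote) CDM cluster.
type cdmEndpoint struct{ name string }

func (c *cdmEndpoint) Receive(b writeBatch) {
	fmt.Printf("%s: received %s seq>=%d (%d bytes)\n", c.name, b.DiskID, b.SeqFrom, len(b.Payload))
}

// replicate forwards batches straight from the in-memory buffer to the remote
// cluster, so no extra read I/Os hit the local filesystem, and the remote
// cluster handles the batches exactly as the source cluster did.
func replicate(buffered []writeBatch, remote *cdmEndpoint) {
	for _, b := range buffered {
		remote.Receive(b)
	}
}

func main() {
	buffered := []writeBatch{
		{DiskID: "vmdk-0", SeqFrom: 100, Payload: make([]byte, 64<<10)},
		{DiskID: "vmdk-0", SeqFrom: 164, Payload: make([]byte, 64<<10)},
	}
	replicate(buffered, &cdmEndpoint{name: "remote-cluster"})
}
```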

Conclusion

Building a highly efficient data plane requires a lot of tinkering and experimentation, and combining various computer science concepts with deft execution. In this blog, we saw how we tackled the real-world problems of protecting virtual machines with continuous data protection under various stressors, while building safeguards against data corruption and application bugs into the application logic itself. Turning these problems into engineering requirements, applying standard techniques such as checksumming, thread pools, and spatial locality, and experimenting with multiple combinations of tunables led to a very efficient data plane.