Saturday, 30 April 2016

VDP Deduplication Process

Saving the same data after every iteration of backup is not ideal, because the space consumed on your storage increases rapidly. To make better use of backup storage, deduplication technology is used. During the first full backup, the entire contents of the virtual machine are backed up. The subsequent backups, however, save only the new data or the changes that have occurred since the previous iteration of backup. This is called an incremental backup. The changed data is processed by VDP and saved, whereas pointer files are created for the unchanged data that was already present in the previous backup. This saves storage space and also increases backup efficiency.

Before we jump deeper into deduplication, let us have a look at the two types of deduplication we have at hand: fixed length (also called fixed block) and variable length (also called variable block) deduplication.

This is the raw data that I have at hand right now:
"Welcome to virtuallypeculiar read abot vmware technology"

Fixed length deduplication: I am going to segment this raw data into data sets with a block length of 8, that is, 8 characters per data set. The output will look something like this:
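As a minimal Python sketch of the fixed-length split (the helper name `fixed_chunks` is made up for illustration and is not part of VDP), the raw data can be cut into 8-character blocks:

```python
def fixed_chunks(data: str, size: int = 8) -> list[str]:
    """Split data into consecutive blocks of `size` characters."""
    return [data[i:i + size] for i in range(0, len(data), size)]

raw = "Welcome to virtuallypeculiar read abot vmware technology"
blocks = fixed_chunks(raw)
print(blocks)
# ['Welcome ', 'to virtu', 'allypecu', 'liar rea', 'd abot v', 'mware te', 'chnology']
```

The sentence is 56 characters long, so it fits exactly into 7 blocks of 8 characters.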

Variable length deduplication: Here, we do not have a constant deduplication block length. The algorithm looks at the data set and sets logical boundaries that determine the length of each block. The output will look something like this:
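To get a feel for content-based boundaries, here is a toy Python sketch. It assumes a very simple boundary rule, namely that a chunk ends after every space. This rule and the helper name `variable_chunks` are my own stand-ins for illustration; real products derive boundaries from a hash of the data, not from spaces.

```python
def variable_chunks(data: str) -> list[str]:
    """Cut a chunk after every space (toy boundary rule, an assumption)."""
    chunks, start = [], 0
    for i, ch in enumerate(data):
        if ch == " ":                      # the content decides the boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):                  # whatever is left after the last space
        chunks.append(data[start:])
    return chunks

raw = "Welcome to virtuallypeculiar read abot vmware technology"
chunks = variable_chunks(raw)
print(chunks)
# ['Welcome ', 'to ', 'virtuallypeculiar ', 'read ', 'abot ', 'vmware ', 'technology']
```

The chunks have different lengths because the boundaries follow the content rather than a fixed offset.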

Now, at a high level, my backup software took a backup of the raw data, which is saved in a Notepad file. Since this is the first backup, the entire text is saved on the storage using deduplication technology.

Next, the raw data has a spelling error: the word "abot" should be "about". Upon noticing this, I re-open the Notepad file, make the necessary change, and save the file again. When the next iteration of backup runs, it is going to scan for changes in blocks.

How does fixed length deduplication deal with this?

When the new character is added, all the data after it is shifted to the right by 1. The output would be something like this:

Because of the shift, every block from the edit onward now holds different content. There are also scenarios where the shifted data spills over into a new 8-character data set, so a new storage block is occupied just for one character. This reduces storage efficiency when compared to variable length deduplication.
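The effect of the shift can be checked with a small Python sketch. The helper `fixed_chunks` is hypothetical and only mimics the 8-character split used in this example:

```python
def fixed_chunks(data: str, size: int = 8) -> list[str]:
    """Split data into consecutive blocks of `size` characters."""
    return [data[i:i + size] for i in range(0, len(data), size)]

raw = "Welcome to virtuallypeculiar read abot vmware technology"
fixed = raw.replace("abot", "about")       # the one-character correction

old_blocks = fixed_chunks(raw)
new_blocks = fixed_chunks(fixed)

# Blocks of the second backup that the first backup has never seen.
seen = set(old_blocks)
changed = [b for b in new_blocks if b not in seen]
print(changed)
# ['d about ', 'vmware t', 'echnolog', 'y']
```

One added character leaves the first four blocks untouched but produces four blocks that have to be stored again, including a final block that holds only the letter "y".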

How does variable length deduplication deal with this?

When the changed data is detected, variable length deduplication makes sure that the chunk boundaries around the change still match the chunks or data sets of the previous backup iteration. The output is something like this:

Here the red box shows the changed data. The change is limited to a single block, whereas with fixed length it carried on until the end of the data set.
VDP is based on variable length deduplication, and an algorithm sets the logical boundaries for the raw data.
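A toy Python sketch shows how a content-based boundary keeps the change local. It assumes a deliberately simple rule where a chunk ends after every space; this rule and the helper name `variable_chunks` are illustrative assumptions, not the algorithm VDP uses.

```python
def variable_chunks(data: str) -> list[str]:
    """Cut a chunk after every space (toy boundary rule, an assumption)."""
    chunks, start = [], 0
    for i, ch in enumerate(data):
        if ch == " ":
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

raw = "Welcome to virtuallypeculiar read abot vmware technology"
fixed = raw.replace("abot", "about")       # the one-character correction

seen = set(variable_chunks(raw))
changed = [c for c in variable_chunks(fixed) if c not in seen]
print(changed)
# ['about ']
```

Only the chunk containing the corrected word is new. Every chunk after it keeps its old content, because the boundaries moved together with the data.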

As a final note, variable length deduplication provides better storage density than fixed length, because a change affects only a small part of the data set.

How does VDP deduplication work?

Now that you have a fair understanding of deduplication, we can look into how VDP handles it. Please note that throughout the process, VDP uses only variable block deduplication.

Have a look at the flow chart below for the basic flow of deduplication process:

Before we walk through the flow chart, let's get a basic understanding of the various daemons and processes involved in this backup.

MCS (Management Console Server): This is responsible for managing all your backup requests and the VDRDB database.

There are 8 internal proxies on the VDP appliance. Each proxy runs a process called avAgent. These agents query the MCS every 15 seconds for incoming job requests. Once an avAgent receives a backup request, it in turn calls avVcbImage.

The avVcbImage process is responsible for enabling, browsing, backing up, and restoring the disks.

The avVcbImage process in turn calls avTar, which is the primary process for backup and restore.

The entire deduplication process occurs inside the VDP appliance. When the backup request comes in, the first check is done on the client side, where the appliance determines whether this virtual machine has been backed up before. The .ctk file, which the CBT (Changed Block Tracking) feature creates when a backup is taken, records all the sectors that have changed since the previous backup. When the appliance scans this file and it reports changes, only the changed data is sent on to Sticky Byte Factoring. For older data that is already present, pointer files are created, and that data is excluded from Sticky Byte Factoring.

In Sticky Byte Factoring:

The avTar process running here is responsible for breaking the raw data input that we received earlier into data chunks. The resulting chunks are anywhere from 1 KB to 64 KB in size and average out at about 24 KB.
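The exact sticky byte factoring algorithm is not described here, so the following is only a generic content-defined chunking sketch in Python. It keeps the 1 KB and 64 KB bounds from above; everything else is my assumption for illustration: the CRC32 boundary test, the 48-byte window, the function name `chunk`, and using 24 KB as the boundary divisor (which does not guarantee a 24 KB average).

```python
import random
import zlib

MIN_CHUNK = 1 * 1024     # lower bound from the text
MAX_CHUNK = 64 * 1024    # upper bound from the text
TARGET = 24 * 1024       # boundary divisor (assumption, not a tuned average)
WINDOW = 48              # bytes looked at per boundary test (assumption)

def chunk(data: bytes) -> list[bytes]:
    """Split data where the content of a sliding window says so."""
    chunks, start, n = [], 0, len(data)
    while start < n:
        end = min(start + MAX_CHUNK, n)    # forced cut at the upper bound
        pos = start + MIN_CHUNK            # earliest allowed boundary
        while pos < end:
            # The boundary depends only on the last WINDOW bytes, so the
            # same content tends to produce the same boundary again.
            if zlib.crc32(data[pos - WINDOW:pos]) % TARGET == 0:
                end = pos
                break
            pos += 1
        chunks.append(data[start:end])
        start = end
    return chunks

# Deterministic sample data so the run is repeatable.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(200_000))
chunks = chunk(data)
print(len(chunks), [len(c) for c in chunks[:5]])
```

Every chunk except possibly the last one falls between 1 KB and 64 KB, and joining the chunks gives back the original data.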

Let's take the earlier example we considered for variable block deduplication, represent that data set in terms of KB of data, and review the deduplication process again.

So here, in the first full backup, the raw data is divided into variable length blocks using the VDP algorithm. This produces a set of data chunks of anywhere between 1 KB and 64 KB. Now, suppose the data in the first two blocks has changed after the backup was performed. In the next iteration of backup, sticky byte factoring re-syncs the blocks so that the chunks of the new data set match the chunks of the previous data set. So, no matter where the data has changed, avTar creates chunks that match the previous chunk sizes.


Once sticky byte factoring has divided the raw data into chunks, the chunks are compressed. The compression ratio is anywhere between 30 and 50 percent, and data that does not compress well is left out of compression to prevent a performance impact.
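A minimal Python sketch of "compress only when it pays off" could look like this. The compressor VDP actually uses is not named here, so `zlib` and the helper name `maybe_compress` are assumptions for illustration:

```python
import os
import zlib

def maybe_compress(chunk: bytes) -> tuple[bool, bytes]:
    """Return (compressed?, payload); keep the raw chunk if zlib does not shrink it."""
    packed = zlib.compress(chunk)
    if len(packed) < len(chunk):
        return True, packed
    return False, chunk                    # not worth compressing, store as-is

text_chunk = b"virtual machine disk data " * 200   # repetitive, compresses well
random_chunk = os.urandom(4096)                    # random bytes, effectively incompressible

text_flag, text_payload = maybe_compress(text_chunk)
random_flag, random_payload = maybe_compress(random_chunk)
print(text_flag, len(text_chunk), len(text_payload))
print(random_flag, len(random_chunk), len(random_payload))
```

The flag has to be stored next to the payload so that a restore knows whether to decompress the chunk.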


The compressed data is then hashed using the SHA-1 algorithm, which always outputs a 20-byte value. This hash is unique to each block and serves as a reference for checking whether a previous backup already holds the same hash. If it does, the chunk with the matching hash is excluded from the backup. Hashing does not convert the data into hashes; rather, it creates a hash for each data block. So at the end of hashing, you have your data chunks and the hashes corresponding to them. If a hash is not found in the hash cache, the cache is updated with the new hash.
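The hash-and-compare step can be sketched in Python with the standard `hashlib` module. The in-memory `hash_cache` and `store`, and the function name `backup_chunk`, are simplified stand-ins that I made up for illustration; they do not reflect how VDP lays out its cache or storage.

```python
import hashlib

hash_cache: set[bytes] = set()         # hashes seen in earlier backups
store: dict[bytes, bytes] = {}         # hash -> chunk actually kept on storage

def backup_chunk(chunk: bytes) -> tuple[bytes, bool]:
    """Hash a chunk and store it only if its hash is not in the cache."""
    digest = hashlib.sha1(chunk).digest()      # always 20 bytes
    if digest in hash_cache:
        return digest, False                   # already backed up, skip it
    hash_cache.add(digest)                     # update the cache with the new hash
    store[digest] = chunk
    return digest, True

first = backup_chunk(b"Welcome ")
second = backup_chunk(b"about ")
repeat = backup_chunk(b"Welcome ")             # same content as the first chunk
print(len(first[0]), first[1], second[1], repeat[1], len(store))
# 20 True True False 2
```

The chunk itself is kept alongside its hash; the hash is only the lookup key used to decide whether the chunk needs to be stored again.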

The above hashes are called atomic hashes. The atomic hashes are then combined to form composites, and the composites are in turn hashed to form composite hashes. This process continues until one single root hash is created.
So in the end, we have the actual data stripes, the atomic hashes, the composite hashes, and the root hash, all stored in their own locations on the VDP storage disks.
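The way atomic hashes roll up into a single root hash can be sketched as a simple hash tree in Python. The fan-out of 4 and the function name `root_hash` are assumptions for illustration; how many hashes VDP groups into one composite is not stated here.

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def root_hash(atomic_hashes: list[bytes], fanout: int = 4) -> bytes:
    """Combine hashes level by level until a single root hash remains."""
    if not atomic_hashes or fanout < 2:
        raise ValueError("need at least one hash and a fanout of 2 or more")
    level = list(atomic_hashes)
    while len(level) > 1:
        # A composite is a group of hashes joined together ...
        composites = [b"".join(level[i:i + fanout])
                      for i in range(0, len(level), fanout)]
        # ... and each composite gets its own composite hash.
        level = [sha1(c) for c in composites]
    return level[0]

chunks = [b"chunk-%d" % i for i in range(10)]
atomic = [sha1(c) for c in chunks]
root = root_hash(atomic)

# Changing a single chunk changes its atomic hash and therefore the root.
modified = list(atomic)
modified[3] = sha1(b"chunk-3 edited")
print(root.hex(), root_hash(modified) != root)
```

With 10 atomic hashes and a fan-out of 4, the tree shrinks from 10 hashes to 3 composite hashes and then to the root.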

Hope this was helpful. If you have any questions, please feel free to reply.

All images are copyrights of