Saturday, 28 November 2015

Virtual Machine Disk Consolidation Is Needed

Written by Suhas Savkoor



We have all backed up virtual machines, either with VMware's own backup solutions or with third-party products such as Veeam or NetApp VSC. Most of the time, the backup jobs go well. However, in some cases we see the annoying message "Virtual machine disk consolidation is needed". Here we do not see any snapshots in the Snapshot Manager, but when we right-click the VM, select Edit Settings and choose the hard disk, we notice that the disk is actually running on a snapshot, a vm-name-00000x.vmdk.



Most of the time, we right-click the virtual machine that is displaying this message, select Snapshot, click Consolidate, and it works. In the Tasks and Events section we can see the consolidation progress through to success.

Then there are those sticky situations where it does not work. We receive a ton of errors when we click Consolidate, especially "Cannot consolidate file since it is locked" and "Unable to access file <unspecified filename>".

There are two steps to troubleshoot this:

Step 1:

Make sure that there is no active backup job running for this VM. If a snapshot job is active for the VM, or a backup job is configured for it, the backup application may be holding a lock on the virtual machine's vmdk file, which causes the consolidation to fail.

So first, power off the backup appliance and then carry out the consolidation.

If this works, then great!
If not, we have some more in-depth troubleshooting to do, which takes us to Step 2.

Step 2:

In this step we need to verify the integrity of the snapshot chain. So the question is: what is snapshot chain integrity?

Let's break it down to bits:

I have a CentOS7 VM here with one VMDK, and it is running on the base disk, not on a snapshot disk.



Next I take a snapshot of the VM, and this time you can see that it is running on a snapshot disk.



So with this in mind, let's look into the snapshot chain. To check the snapshot chain we need two things:


  • SSH (PuTTY) access to the host where this VM resides
  • A very interesting command to generate the chain structure


Log in with PuTTY, change to the virtual machine's directory, and run this command. We receive output like this:


Rearranging this as text, so that the base vmdk information comes first and the snapshot vmdk information next, we get:

CentOS7.vmdk
CID=70b0b210
parentCID=ffffffff
RW 33554432 VMFS "CentOS7-flat.vmdk"


CentOS7-000001.vmdk
CID=71a5b396
parentCID=70b0b210
parentFileNameHint="CentOS7.vmdk"
RW 33554432 VMFSSPARSE "CentOS7-000001-delta.vmdk"
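
If you would rather not eyeball the fields, they can be pulled out of the descriptor text with a few lines of Python. This is only a minimal sketch: the function name parse_descriptor and the keys sectors, extent_type and extent_file are my own labels, and it only understands the four kinds of lines shown in the listing above.

```python
import re

def parse_descriptor(text):
    """Extract the chain-related fields from one VMDK descriptor's text.

    Only the line shapes shown in the listing are handled; anything
    else in the descriptor is ignored.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # key=value lines; the value may or may not be quoted
        m = re.match(r'^(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"]*)"?$', line)
        if m:
            fields[m.group(1)] = m.group(2)
            continue
        # extent line: access mode, size in sectors, extent type, data file
        m = re.match(r'^RW\s+(\d+)\s+(\S+)\s+"([^"]+)"', line)
        if m:
            fields["sectors"] = int(m.group(1))      # hypothetical key name
            fields["extent_type"] = m.group(2)       # hypothetical key name
            fields["extent_file"] = m.group(3)       # hypothetical key name
    return fields

snapshot = parse_descriptor('''CID=71a5b396
parentCID=70b0b210
parentFileNameHint="CentOS7.vmdk"
RW 33554432 VMFSSPARSE "CentOS7-000001-delta.vmdk"''')
print(snapshot["parentCID"], snapshot["extent_file"])
```

Run against the snapshot descriptor above, this gives back the parentCID 70b0b210 and the delta file name, which are the two values we care about when walking the chain.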


Chain Structure Analysis:

For CentOS7.vmdk (the base disk), the parent CID is eight f's (ffffffff), and this is the same for every base disk, no matter which VM we look at.

The same disk also has a CID of its own, which is a random 8-digit hexadecimal value. (CID stands for content ID, although it is often loosely read as "child ID".)

Now the CID of the base disk must be the parent CID (PID) of the first snapshot disk.
As a simple formula:
CID(CentOS7.vmdk) = ParentCID(CentOS7-000001.vmdk)
which in our case is true.

For CentOS7-000001.vmdk, the CID is again a random 8-digit hexadecimal value. This CID will be equal to the parent CID of the next snapshot (CentOS7-000002.vmdk), and the structure continues in the same way.

Also, CentOS7-000001.vmdk points to its corresponding -000001-delta.vmdk, with CentOS7.vmdk as its parent file.

The next link in the chain would be CentOS7-000002.vmdk pointing to -000002-delta.vmdk, with CentOS7-000001.vmdk as its parent file.

The structure should always follow this format. If there is a mismatch anywhere in the snapshot chain, the consolidation fails.
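
The rules above are simple enough to check mechanically. Here is a rough Python sketch, not an official tool: check_chain is a name I made up, and it assumes you have already collected the CID, parentCID and parentFileNameHint values for each descriptor and listed them base disk first. It also assumes parentFileNameHint holds just the file name, as in our listing; if the hint on your host is a full path, the comparison would need adjusting.

```python
def check_chain(descriptors):
    """Return a list of problems found in a snapshot chain.

    descriptors: list of (descriptor_filename, fields) tuples, ordered
    base disk first. fields is a dict with "CID" and "parentCID", plus
    "parentFileNameHint" for snapshot disks.
    """
    problems = []

    # Rule 1: the base disk has no parent, so its parentCID is ffffffff.
    base_name, base = descriptors[0]
    if base["parentCID"].lower() != "ffffffff":
        problems.append(f"{base_name}: base disk parentCID is not ffffffff")

    # Rules 2 and 3: each snapshot's parentCID equals its parent's CID,
    # and its parentFileNameHint names the parent descriptor.
    for (parent_name, parent), (child_name, child) in zip(descriptors, descriptors[1:]):
        if child["parentCID"].lower() != parent["CID"].lower():
            problems.append(
                f"{child_name}: parentCID {child['parentCID']} "
                f"!= CID {parent['CID']} of {parent_name}"
            )
        if child.get("parentFileNameHint") != parent_name:
            problems.append(
                f"{child_name}: parentFileNameHint does not point to {parent_name}"
            )
    return problems

chain = [
    ("CentOS7.vmdk", {"CID": "70b0b210", "parentCID": "ffffffff"}),
    ("CentOS7-000001.vmdk", {"CID": "71a5b396", "parentCID": "70b0b210",
                             "parentFileNameHint": "CentOS7.vmdk"}),
]
print(check_chain(chain))  # prints [] because this chain is consistent
```

An empty list means every link lines up. Anything else tells you which descriptor to open and compare by hand before attempting the consolidation again.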

If the chain is very long and has a lot of corruption, the more practical workaround is to clone the VM, as it would be very tedious to work through every link in the chain and make the necessary corrections.


What do we learn from this?

1. If you receive the message saying virtual machine disk consolidation is needed, DO NOT go straight in and remove the snapshot files.
2. Verify whether the VM is running on a snapshot file. In most cases it is.
3. If it is, perform a consolidation.
4. If it works, good! If not, check the chain structure, correct whatever is wrong and run the consolidation again.


Make good use of the command folks!



Additional Tip!

Use Notepad++ to verify chain integrity, as it highlights matching text, making it easier to compare CIDs and PIDs.