Friday, 8 July 2016

Automatic/Manual Backup Verification Fails In VDP

To check the consistency of the restore points you have backup verification jobs. These verification jobs can be either Automatic (ABV) or Manual backup verification. The backup verification flow on a high level basis goes as:

>> Restore: Restores the restore point as a temporary VM on the ESXi host and datastore which is defined on the backup verification job
>> Power On: Powers On the VM.
>> Heartbeat Verification: Verifies the heartbeat for the restored virtual machine
>> Power Off: Powers Off the VM once the verification is done
>> Delete VM: Remove the temporary restored VM from the inventory and delete from disk.

The issue I am going to be discussing here is not a general issue, and this caused due to a very specific cause. However, the troubleshooting steps can be used and you might have similar causes due to which you will run into verification jobs to fail.

All the verification job logs are present under the following directory:
/usr/local/avamarclient/var/

The verification job that I had was something as:
xyz-backup-verify-1467724890971-c2857d179f4b9e67465bf496709d8bc1f43149ef-1016-vmimagel.log

I created it as xyz because I have a VM named xyz, hence the temporary restored VM would have a name similar VDP_VERIFICATION_xyz

So in the verification job the initial logging refers to the following:

>> Which vCenter this VM is going to be restored for verification
>> The ESXi host
>> The VMFS/NFS datastore

Their logging would be in the start of the verification log and would look something as:

2016-07-05T09:21:32.956+04:00 avvcbimage Info <16010>: vCenter 'ABC.vcloud.local' is 192.168.1.1
2016-07-05T09:21:32.956+04:00 avvcbimage Info <11981>: VM's host is ESXi.vcloud.local
2016-07-05T09:21:32.956+04:00 avvcbimage Info <11982>: VM's primary storage location is [Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx
So, to the error. To the very end of the verification log is the final moments of the process in my case were the errors:
2016-07-05T09:35:00.128+04:00 avvcbimage Info <19670>: vmAction runRemote()

2016-07-05T09:35:00.177+04:00 avvcbimage Info <19672>: vmAction powerOnVM()

2016-07-05T09:35:00.187+04:00 avvcbimage Info <17789>: Modifying VmxNet3 adapter: Network adapter 1 to not Connect at Power On

2016-07-05T09:35:00.213+04:00 avvcbimage Info <0000>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' setNics config at PowerOn task still in progress, sleep for 3 sec

2016-07-05T09:35:03.243+04:00 avvcbimage Info <14632>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' setNics config at PowerOn task completed, moref=

2016-07-05T09:35:03.266+04:00 avvcbimage Info <14629>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' Power On task queued, sleep for 1 sec

2016-07-05T09:35:04.287+04:00 avvcbimage Error <16006>: vSphere Task failed: 'The operation is not allowed in the current state.'.

2016-07-05T09:35:04.287+04:00 avvcbimage Error <14635>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' Power On task creation encountered a problem

2016-07-05T09:35:04.287+04:00 avvcbimage Warning <19673>: PowerOnVM failed or cancelled

2016-07-05T09:35:04.287+04:00 avvcbimage Info <19684>: vmAction cleanupVM() DeletingVM=0

2016-07-05T09:35:04.287+04:00 avvcbimage Info <19685>: vmAction poweroffVM()

2016-07-05T09:35:04.311+04:00 avvcbimage Info <0000>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' Power Off task still in progress, sleep for 3 sec

2016-07-05T09:35:07.345+04:00 avvcbimage Error <16006>: vSphere Task failed: 'The attempted operation cannot be performed in the current state (Powered off).'.

2016-07-05T09:35:07.345+04:00 avvcbimage Error <14635>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' Power Off task creation encountered a problem

2016-07-05T09:35:07.345+04:00 avvcbimage Info <19686>: vmAction deleteVM()

2016-07-05T09:35:07.387+04:00 avvcbimage Info <0000>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' deletion task still in progress, sleep for 3 sec

2016-07-05T09:35:10.416+04:00 avvcbimage Info <14632>: VM '[Local-DS-1] VDP_VERIFICATION_xyz_1467724891189/VDP_VERIFICATION_xyz_1467724891189.vmx' deletion task completed, moref=

2016-07-05T09:35:10.416+04:00 avvcbimage Info <9772>: Starting graceful (staged) termination, ABV failed (wrap-up stage)

2016-07-05T09:35:10.416+04:00 avvcbimage Error <19702>: ABV failed

2016-07-05T09:35:10.419+04:00 avvcbimage Info <16038>: Final summary, cancelled/aborted 0, snapview 0, exitcode 170: completed with errors, client log should be examined

So here, the restore was done successfully and the network adapter is always disconnected for the verification VM to avoid IP conflict.
Then there was several tries done to Power On the virtual machine and all of them failed. Since the Power On was not completed the Power off failed as well.
The step to verify the heartbeat is excluded since the virtual machine was not powered On which led to the final state, delete the VM which was completed successfully.

That's pretty much it in the verification logs. This was not sufficient to find a cause, which led me to implement the next couple of tests:

1. For this verification job, I changed the destination host and datastore. Basically, I am doing the restore on a different host and a different datastore and it went through successfully. So something was either wrong with the host or the datastore.

2. So I changed the datastore location to the old path and the host still on the new one. The verification job completed successfully again. And when edited the job back to the old host, it failed with the same error.

So something is going on with this host! So we need to troubleshoot on the host level.

From the vobd.log during this time, I saw the following:
2016-07-06T14:01:49.787Z: [UserWorldCorrelator] 3011315291947us: [vob.uw.core.dumped] /bin/hostd(2038251) /var/core/hostd-zdump.003

2016-07-06T14:04:26.406Z: [UserWorldCorrelator] 3011471909126us: [vob.uw.core.dumped] /bin/hostd(2043706) /var/core/hostd-zdump.000

2016-07-06T14:08:35.096Z: [UserWorldCorrelator] 3011720596785us: [vob.uw.core.dumped] /bin/hostd(2099166) /var/core/hostd-zdump.001

2016-07-06T14:11:12.313Z: [UserWorldCorrelator] 3011877811665us: [vob.uw.core.dumped] /bin/hostd(2849623) /var/core/hostd-zdump.002

2016-07-06T14:13:48.795Z: [UserWorldCorrelator] 3012034293301us: [vob.uw.core.dumped] /bin/hostd(2040079) /var/core/hostd-zdump.003
So here the hostd daemon on the host has crashed and a zdump is created. So the hostd and hostd-worker threads were in a crash state and hence I rebooted the host.

After a reboot the hostd and worker thread were not in an inconsistent state anymore allowing me to perform verification tasks without any issues on this host.

There can be multiple causes for backup verification failure. Well this is one of them!