Wednesday, 8 February 2017

VDP Backup Fails With "Failed To Remove Snapshot"

There are scenarios where a backup of a virtual machine starts successfully, takes a snapshot, and completes the backup process, but fails at the very end when it tries to remove the snapshot for the VM. This failure is seen persistently for one or more virtual machines.

At the very end of the backup job log you would see something like:

2017-02-06T11:53:39.552+04:00 avvcbimage Warning <16004>: Soap fault detected, Query problem, Msg:'SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
"Connection timed out"
Detail: connect failed in tcp_connect()"

2017-02-06T11:53:39.552+04:00 avvcbimage Error <17773>: Snapshot (snapshot-5656) removal for VM '[Suhas-Store-2] VM01/VM01.vmx' task failed to start

2017-02-06T11:53:39.552+04:00 avvcbimage Info <18649>: Removal of snapshot 'VDP-1486397576f70379edb62fb81285abbf68dfadc0bd0758ba83' is not complete, moref 'snapshot-5656'.

2017-02-06T11:53:39.552+04:00 avvcbimage Info <9772>: Starting graceful (staged) termination, Problem with the snapshot removal. (wrap-up stage)
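To pull just these entries out of a long backup job log, a small filter like the one below can help. This is a minimal sketch: the function name is made up for illustration, the log path is passed in by the caller, and the message codes are only the ones that appear in the excerpt above.

```shell
# Hypothetical helper: print the SOAP fault and snapshot-removal entries
# from an avvcbimage backup job log.
# $1: path to the backup job log (supplied by the caller)
find_snapshot_removal_events() {
    # 16004 = SOAP fault warning, 17773 = removal task failed to start,
    # 18649 = removal not complete (codes taken from the excerpt above)
    grep -E 'avvcbimage (Error <17773>|Warning <16004>|Info <18649>)' "$1"
}

# Usage: find_snapshot_removal_events /path/to/backup-job.log
```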

Notice the "Connection timed out" message that appears once the snapshot removal call is handed down to the virtual machine. If you search the VM's vmware.log for this VDP snapshot ID, you will see the following:

2017-02-06T16:13:02.636Z| vmx| I125: SnapshotVMXTakeSnapshotComplete: Done with snapshot 'VDP-1486397576f70379edb62fb81285abbf68dfadc0bd0758ba83': 55

2017-02-06T16:58:30.826Z| vmx| I125: GuestRpcSendTimedOut: message to toolbox-dnd timed out.
2017-02-06T16:59:11.235Z| vmx| I125: GuestRpcSendTimedOut: message to toolbox-dnd timed out.
2017-02-06T16:59:13.117Z| vmx| I125: GuestRpcSendTimedOut: message to toolbox-dnd timed out.
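To get a rough idea of how many of these timeouts were logged, you can count them. The sketch below assumes you have a copy of the VM's vmware.log at a path you pass in; the function name is only illustrative.

```shell
# Hypothetical helper: count GuestRpcSendTimedOut entries in a vmware.log.
# $1: path to the vmware.log file (supplied by the caller)
count_guestrpc_timeouts() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status clean.
    grep -c 'GuestRpcSendTimedOut' "$1" || true
}

# Usage: count_guestrpc_timeouts /path/to/vmware.log
```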

There are many timeouts coming from VMware Tools. At the same time, the backup log shows that the datastore where this VM resides is NFS storage:

2017-02-06T11:13:40.355+04:00 avvcbimage Info <0000>: checking datastore type for special processing, checking for type VVOL, actual type = NFS

The backup log also shows that the backup used the hot-add transport mode:

2017-02-06T11:13:40.337+04:00 avvcbimage Info <9675>: Connected with hotadd transport to virtual disk [Suhas-Store-2] VM01/VM01.vmdk
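Both facts (datastore type and transport mode) can be read from the same backup job log. The following is a sketch only: the function names are hypothetical, and the patterns rely on the log wording shown in the two excerpts above, which may differ between versions.

```shell
# Hypothetical helpers: extract the datastore type and the transport mode
# from an avvcbimage backup job log.
# $1: path to the backup job log (supplied by the caller)
backup_datastore_type() {
    # Matches "... actual type = NFS" and prints "NFS"
    sed -n 's/.*actual type = \([A-Za-z0-9]*\).*/\1/p' "$1" | head -n 1
}

backup_transport() {
    # Matches "Connected with hotadd transport ..." and prints "hotadd"
    sed -n 's/.*Connected with \([a-z]*\) transport.*/\1/p' "$1" | head -n 1
}

# Usage:
#   backup_datastore_type /path/to/backup-job.log
#   backup_transport /path/to/backup-job.log
```

If the first prints NFS and the second prints hotadd, the VM matches the combination described in this post.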

When the VM resides on an NFSv3 datastore, snapshot consolidation can time out because of NFS locking. This KB explains the cause. The workaround is to disable the hot-add mode of backup and switch to NBD or NBDSSL.

1. SSH into the VDP appliance and change to the following directory:
# cd /usr/local/avamarclient/var
2. Edit avvcbimageAll.cmd with the vi editor and add the following line:
--transport=nbd

3. Save the file and restart avagent using:
# service avagent-vmware restart
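If you prefer not to edit the file by hand, step 2 can be sketched as a small function that adds the line only when it is missing, so running it twice does not duplicate the flag. The function name is hypothetical; the file path in the usage comment is the directory and file name from the steps above.

```shell
# Hypothetical helper: add --transport=nbd to a .cmd file only if it is absent.
# $1: path to the file to update
set_nbd_transport() {
    cmd_file="$1"
    # Nothing to do if the exact line is already present.
    if grep -qxF -- '--transport=nbd' "$cmd_file" 2>/dev/null; then
        return 0
    fi
    # If the file ends without a newline, add one so the flag gets its own line.
    if [ -s "$cmd_file" ] && [ -n "$(tail -c 1 "$cmd_file")" ]; then
        printf '\n' >> "$cmd_file"
    fi
    printf '%s\n' '--transport=nbd' >> "$cmd_file"
}

# Usage: set_nbd_transport /usr/local/avamarclient/var/avvcbimageAll.cmd
```

The avagent restart from step 3 is still needed after the file changes.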
After the restart, the backup should complete successfully. Hope this helps.