Thursday, 25 January 2018

SRM Service Crashes During A Recovery Operation With timedFunc BackTrace

In a few scenarios, the SRM service crashes when you run a test recovery or a planned migration. This might happen with one specific recovery plan or with any recovery plan.

If you look into the vmware-dr.log you will notice the following back-trace:

--> Panic: VERIFY d:\build\ob\bora-3884620\srm\public\functional/async/timedFunc.h:210
--> Backtrace:
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
xxxxxxxxxxxxxxxxxxxxx Cut Logs Here xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--> backtrace[36] ntdll.dll[0x000154E4]
--> [backtrace end]

The timedFunc back-trace is seen when "Wait For VMware Tools" is set to 0 minutes and 0 seconds.

A few lines above this back-trace, you will see the faulty VM that caused the crash.

You will see something similar to:

2018-01-21T08:37:05.421-05:00 [44764 info 'VmDomain' ctxID=57d5ae61 opID=21076ff:c402:4147:d883] Waiting for VM '[vim.VirtualMachine:b2ab3f04-c72e-43ca-b93d-de1566e4de14:vm-323]' to reach desired powered state 'poweredOff' within '0' seconds.

The VM ID (vm-323 in this example) is given at the end of the vim.VirtualMachine reference. To find out which VM this ID belongs to, you will need to go to the vCenter MOB page.
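If the log is large, a small script can pull these IDs out for you. The following is only a sketch: it assumes the "Waiting for VM" line always has the same shape as the sample above, and the function name find_zero_timeout_vms is just an illustrative choice, not anything provided by SRM.

```python
import re

# Pattern modeled on the sample log line above (assumed format, not an SRM spec).
# It captures the VM ID (e.g. vm-323), the desired power state and the timeout.
WAIT_LINE = re.compile(
    r"Waiting for VM '\[vim\.VirtualMachine:[0-9a-fA-F-]+:(?P<moid>vm-\d+)\]' "
    r"to reach desired powered state '(?P<state>\w+)' "
    r"within '(?P<seconds>\d+)' seconds"
)


def find_zero_timeout_vms(log_lines):
    """Return the unique VM IDs that were given a 0-second wait, in log order."""
    moids = []
    for line in log_lines:
        match = WAIT_LINE.search(line)
        if not match:
            continue
        moid = match.group("moid")
        if int(match.group("seconds")) == 0 and moid not in moids:
            moids.append(moid)
    return moids


# The log line from this post, used as sample input.
SAMPLE = (
    "2018-01-21T08:37:05.421-05:00 [44764 info 'VmDomain' ctxID=57d5ae61 "
    "opID=21076ff:c402:4147:d883] Waiting for VM "
    "'[vim.VirtualMachine:b2ab3f04-c72e-43ca-b93d-de1566e4de14:vm-323]' "
    "to reach desired powered state 'poweredOff' within '0' seconds."
)

if __name__ == "__main__":
    print(find_zero_timeout_vms([SAMPLE]))
```

To run it against a real log, pass the open file object for vmware-dr.log instead of the sample list; the function simply iterates over lines.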

This is how I correlated the ID to a VM:
1. Log in to the MOB page for vCenter (https://vcenter-ip/mob)
2. Content > group-d1 (Datacenters)
3. Respective datacenter under "Child Entity"
4. Then under vmFolder group-v4 (vm)
5. Expand childEntity and this will list out all the VMs in that vCenter.

In my case, the VM was CentOS7.2
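As a possible shortcut to the browsing steps above, the MOB can usually open an object directly when its ID is passed in a moid query parameter. The sketch below only builds that URL. Treat the direct-link behaviour as an assumption to verify on your vCenter version; the helper name mob_url is made up for illustration, and vcenter-ip is the same placeholder used in step 1.

```python
from urllib.parse import urlencode


def mob_url(vcenter_host, moid):
    """Build a MOB URL that points straight at one managed object.

    Assumes the MOB accepts a 'moid' query parameter; verify on your vCenter.
    """
    return "https://{}/mob/?{}".format(vcenter_host, urlencode({"moid": moid}))


if __name__ == "__main__":
    # vm-323 is the ID taken from the log line in this post.
    print(mob_url("vcenter-ip", "vm-323"))
```

Opening the printed URL in a browser (after logging in) should show the properties of that VM, including its name.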

> Then navigate to the Recovery plan in SRM
> Select the affected Recovery plan this VM is part of > Related Objects > Virtual Machines
> Right-click this VM and select Configure Recovery

Here, the Wait For VMware Tools timeout was set to 0 minutes and 0 seconds. We had to change this to a valid non-zero value.

After this change, the recovery plan completed without crashing the SRM service. Ideally, newer SRM releases should address this by not letting you set a 0 timeout.

Hope this helps!