Wednesday, 6 December 2017

SRM Service Crashes After A Failed Recovery With "abrRecoveryEngine" Backtrace

In some instances, when you are running Array Based Replication with SRM, a failed planned migration might cause the SRM service to crash. In the vmware-dr.log on the SRM machine, you will notice the following backtrace:

2017-12-06T09:55:38.620-05:00 panic vmware-dr[06076] [Originator@6876 sub=Default] 
--> 
--> Panic: Assert Failed: "ok (Dr::Providers::Abr::AbrRecoveryEngine::AbrRecoveryEngineImpl::LoadFromDb: Unable to insert post failover info object 212337205 for group vm-protection-group-121101624 array pair array-pair-7065)" @ d:/build/ob/bora-6014840/srm/src/providers/abr/common/abrRecoveryEngine/abrRecoveryEngine.cpp:244
--> Backtrace:
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
--> backtrace[00] vmacore.dll[0x001F29FA]
--> backtrace[01] vmacore.dll[0x00067D60]
--> backtrace[02] vmacore.dll[0x0006A20E]
--> backtrace[03] vmacore.dll[0x002245A7]
--> backtrace[04] vmacore.dll[0x00224771]
--> backtrace[05] vmacore.dll[0x00059C0D]
--> backtrace[06] dr-abr-recoveryEngine.dll[0x00028A91]
--> backtrace[07] dr-abr-recoveryEngine.dll[0x00015199]
--> backtrace[08] dr-abr-recoveryEngine.dll[0x002DB368]
--> backtrace[09] dr-abr-recoveryEngine.dll[0x002DB913]
--> backtrace[10] vmacore.dll[0x001D6ACC]
--> backtrace[11] vmacore.dll[0x001865AB]
--> backtrace[12] vmacore.dll[0x0018759C]
--> backtrace[13] vmacore.dll[0x002202E9]
--> backtrace[14] MSVCR120.dll[0x00024F7F]
--> backtrace[15] MSVCR120.dll[0x00025126]
--> backtrace[16] KERNEL32.DLL[0x000013D2]
--> backtrace[17] ntdll.dll[0x000154E4]
--> [backtrace end]

This is seen when there are issues unmounting or demoting the source datastore.
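The object ID needed for the fix is buried in the panic line, so it can help to pull it out with a script rather than by eye. The following is a minimal sketch, assuming the panic message is worded exactly as in the log above; the function name and the regular expression are my own illustration and not part of SRM, and other SRM builds may phrase the message differently.

```python
import re

# Pattern modelled on the panic line quoted above (assumption: other
# SRM builds may word this message differently).
PANIC_RE = re.compile(
    r"Unable to insert post failover info object (?P<object_id>\d+) "
    r"for group (?P<group>\S+) array pair (?P<array_pair>[^\s)]+)"
)


def find_failed_objects(log_text):
    """Return unique (object_id, group, array_pair) tuples, in order of first appearance."""
    seen = []
    for match in PANIC_RE.finditer(log_text):
        entry = (
            int(match.group("object_id")),
            match.group("group"),
            match.group("array_pair"),
        )
        if entry not in seen:
            seen.append(entry)
    return seen


# Panic line copied from the log excerpt above.
sample = (
    '--> Panic: Assert Failed: "ok (Dr::Providers::Abr::AbrRecoveryEngine::'
    "AbrRecoveryEngineImpl::LoadFromDb: Unable to insert post failover info "
    "object 212337205 for group vm-protection-group-121101624 "
    'array pair array-pair-7065)"'
)

print(find_failed_objects(sample))
# [(212337205, 'vm-protection-group-121101624', 'array-pair-7065')]
```

To run it against a real log, read the contents of vmware-dr.log into a string and pass that to find_failed_objects.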

Disclaimer: Modifying database tables is normally done by VMware support. Do this at your own risk.

The fix is:

1. Make sure the SRM service is stopped on both sites.
2. Back up the SRM databases on both sites.
3. Log in to the database using either pgAdmin or SQL Server Management Studio, depending on the type of database used.
4. Open the table "pda_grouppostfailoverinfo".
5. Remove the row whose db_id matches the object ID from the backtrace. In my case it is 212337205.
6. Once this is done, start the SRM service. If it crashes again, the new backtrace usually reports a different object ID; repeat the process for that ID.
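If you prefer a query window over editing the table by hand, steps 4 and 5 come down to one SELECT to confirm the row and one DELETE to remove it. The sketch below only builds those two statements as text so you can review them before pasting them into pgAdmin or SQL Server Management Studio. It assumes db_id is the column name in pda_grouppostfailoverinfo, as step 5 suggests; the helper name is hypothetical, and you should verify the column against your own database before running anything.

```python
TABLE = "pda_grouppostfailoverinfo"  # table name from step 4
ID_COLUMN = "db_id"                  # assumed column name, taken from step 5


def build_statements(object_id):
    """Return (check_sql, delete_sql) for one object ID from the panic message."""
    oid = int(object_id)  # raises ValueError if the ID is not numeric
    if oid <= 0:
        raise ValueError("object ID must be a positive integer")
    where = f"WHERE {ID_COLUMN} = {oid}"
    check_sql = f"SELECT * FROM {TABLE} {where};"
    delete_sql = f"DELETE FROM {TABLE} {where};"
    return check_sql, delete_sql


check_sql, delete_sql = build_statements("212337205")
print(check_sql)   # run this first and confirm it returns the expected row
print(delete_sql)  # run this only after the backup in step 2
```

Running the SELECT first is a cheap safeguard: if it returns no row, or more rows than expected, stop and recheck the ID before deleting anything.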

And that should be it.