Thursday, 12 April 2018

SRM Test Recovery Fails: "Failed to create snapshots of replica devices"

When using SRM with array based replication, a test recovery operation will take a snapshot of the replica LUN, present it and mount it on the ESX server to bring up the VMs on an isolated network. 

In many instances, the test recovery would fail at the crucial step, which is taking a snapshot of the replica device. The GUI would mention: 

Failed to create snapshots of replica devices 

In this case, always look into the vmware-dr.log on the recovery site of the SRM. In my case I noticed the below snippet:

2018-04-10T11:00:12.287+01:00 error vmware-dr[16896] [Originator@6876 sub=SraCommand opID=7dd8a324:9075:7d02:758d] testFailoverStart's stderr:
--> java.io.IOException: Couldn't get lock for /tmp/santorini.log
--> at java.util.logging.FileHandler.openFiles(Unknown Source)
--> at java.util.logging.FileHandler.<init>(Unknown Source)
=================BREAK========================
--> Apr 10, 2018 11:00:12 AM com.emc.santorini.log.KLogger logWithException
--> WARNING: Unknown error: 
--> com.sun.xml.internal.ws.client.ClientTransportException: HTTP transport error: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at com.sun.xml.internal.ws.transport.http.client.HttpClientTransport.getOutput(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.process(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.transport.DeferredTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.api.pipe.Fiber.__doRun(Unknown Source)
=================BREAK========================
--> Caused by: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at sun.security.ssl.InputRecord.handleUnknownRecord(Unknown Source)
--> at sun.security.ssl.InputRecord.read(Unknown Source)
--> at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)

2018-04-10T11:00:12.299+01:00 error vmware-dr[21512] [Originator@6876 sub=AbrRecoveryEngine opID=7dd8a324:9075:7d02:758d] Dr::Providers::Abr::AbrRecoveryEngine::Internal::RecoverOp::ProcessFailoverFailure: Failed to create snapshots of replica devices for group 'vm-protection-group-45026' using array pair 'array-pair-2038': (dr.storage.fault.CommandFailed) {
-->    faultCause = (dr.storage.fault.LocalizableAdapterFault) {
-->       faultCause = (vmodl.MethodFault) null, 
-->       faultMessage = <unset>, 
-->       code = "78814f38-52ff-32a5-806c-73000467afca.1049", 
-->       arg = <unset>
-->       msg = ""
-->    }, 
-->    faultMessage = <unset>, 
-->    commandName = "testFailoverStart"
-->    msg = ""
--> }
--> [context]

So here the SRA attempts to establish connection with the RecoverPoint over HTTP which from 3.5.x is disabled. And we need to allow RP and SRM to communicate over HTTPS. 

On the SRM, perform the below:

1. Open CMD in admin mode and navigate to the below location:
c:\Program Files\VMware\VMware vCenter Site Recovery Manager\storage\sra\array-type-recoverpoint

2. Then run the below command:
"c:\Program Files\VMware\VMware vCenter Site Recovery Manager\external\perl-5.14.2\bin\perl.exe" command.pl --useHttps true

In 6.5 I have seen the path to be external\perl\perl]bin\perl.exe 
So verify what the correct path is for the second command. 

You should ideally see an output like:
Successfully changed to HTTPS security mode

3. Perform this on both the SRM sites. 

On the RPA, perform the below:

1. Login to each RPA with boxmgmt account

2. [2] Setup > [8] Advanced Options > [7] Security Options > [1] Change Web Server Mode 
(option number may change)

3. You will be then presented with this message:
Do you want to disable the HTTP server (y/n)?

4. Disable HTTP and repeat this on production and recovery RPA cluster. 

Restart the SRM service on both sites and re-run the test recovery and this should now complete successfully. 

Hope this helps.