Thursday, 26 April 2018

SRM Service Fails To Start: "Could not initialize Vdb connection Data source name not found and no default driver specified"

In a few cases, you might come across a scenario where the Site Recovery Manager service does not start, and in the Event Viewer you will notice the following back trace for the vmware-dr service.

VMware vCenter Site Recovery Manager application error.
class Vmacore::Exception "DBManager error: Could not initialize Vdb connection: ODBC error: (IM002) - [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] vmware-dr.exe[0x00107621]
backtrace[05] MSVCR120.dll[0x00066920]
backtrace[06] MSVCR120.dll[0x0005E36D]
backtrace[07] ntdll.dll[0x00092A63]
backtrace[08] vmware-dr.exe[0x00014893]
backtrace[09] vmware-dr.exe[0x00015226]
backtrace[10] windowsService.dll[0x00002BF5]
backtrace[11] windowsService.dll[0x00001F24]
backtrace[12] sechost.dll[0x00005ADA]
backtrace[13] KERNEL32.DLL[0x000013D2]
backtrace[14] ntdll.dll[0x000154E4]
[backtrace end]  

No logs are generated in vmware-dr.log, and the ODBC connection test completes successfully as well.
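If you want to re-run the DSN test yourself, keep in mind that SRM uses a 64-bit System DSN, so test it from the 64-bit ODBC Data Source Administrator rather than the 32-bit one:

C:\Windows\System32\odbcad32.exe     (64-bit ODBC administrator)
C:\Windows\SysWOW64\odbcad32.exe     (32-bit ODBC administrator)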

However, when you open the vmware-dr.xml file located under C:\Program Files\VMware\VMware vCenter Site Recovery Manager\config and search for the <DBManager> tag, you will notice that the <dsn> name is incorrect.
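For reference, the relevant portion of vmware-dr.xml looks roughly like the sketch below; SRM_DB_DSN is a placeholder and must match the System DSN name configured in ODBC on the SRM server:

<DBManager>
   ...
   <dsn>SRM_DB_DSN</dsn>
   ...
</DBManager>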

After entering the correct DSN name within the <dsn> </dsn> tags, you will notice a new back trace when you attempt to start the service again:

VMware vCenter Site Recovery Manager application error.
class Vmacore::InvalidArgumentException "Invalid argument"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] listener.dll[0x0000BCBC]

My suspicion is that the vmware-dr.xml file itself has become corrupted, and the fix for this is to re-install the SRM application pointing it to the existing database.

After the re-install, the service starts successfully. Hope this helps.

Friday, 20 April 2018

Upgrading vCenter Appliance From 6.5 to 6.7

As you know, vSphere 6.7 is now GA, and this article walks through upgrading an embedded-PSC deployment of the 6.5 vCenter appliance to 6.7. Once you download the 6.7 VCSA ISO installer,
mount the ISO on a local Windows machine and use the UI installer for Windows to begin the upgrade.
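On the mounted ISO, the Windows UI installer is typically located under the vcsa-ui-installer folder, for example:

<ISO drive>:\vcsa-ui-installer\win32\installer.exe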

You will be presented with the below choices:


We will be going with the Upgrade option. The upgrade follows the same path as earlier releases: the process deploys a new 6.7 VCSA, performs a data and configuration migration from the older 6.5 appliance, and then powers down the old appliance once the upgrade is successful.


Accept the EULA to proceed further.


In the next step we will connect to the source appliance, so provide the IP/FQDN of the source 6.5 vCenter server.


Once the Connect To Source step goes through, you will be asked to enter the SSO details and the details of the ESXi host where the 6.5 vCSA is running.


The next step is to provide information about the target appliance, the 6.7 appliance. Select the ESXi host where the target appliance should be deployed.


Then provide the inventory display name for the target vCenter 6.7 appliance along with a root password.


Select the appliance deployment size for the target server. Make sure this matches or is greater than the size of the source 6.5 server.


Then select the datastore where the target appliance should reside.


Next, we will provide a set of temporary network details for the 6.7 appliance. The appliance will inherit the old 6.5 network configuration after a successful migration.
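For example, the temporary settings could look something like the below; these values are purely illustrative and should come from your own environment:

IP address:    192.168.1.50
Subnet mask:   255.255.255.0
Gateway:       192.168.1.1
DNS server:    192.168.1.10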


Review the details and click Finish to begin the Stage 1 deployment process.


Once Stage 1 is done, click Continue to proceed with Stage 2.



In Stage 2 we will be performing a data copy from the source vCenter appliance to the target appliance deployed in Stage 1.


Provide the details to connect to the source vCenter server.


Select the type of data to be copied over to the destination vCenter server. In my case, I just want to migrate the configuration data.


Join the CEIP and proceed further.


Review the details and Finish to begin the data copy.


The source vCenter will be shut down after the data copy.


The data migration will take a while to complete and proceeds in three stages.


And that's it. If all goes well, the migration is complete and you can access your new 6.7 vCenter from its URL.
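Assuming the appliance retained the original FQDN, the 6.7 clients are reachable at the usual URLs, for example:

https://<vcenter-fqdn>/ui                (HTML5 vSphere Client)
https://<vcenter-fqdn>/vsphere-client    (Flash-based vSphere Web Client)
https://<vcenter-fqdn>:5480              (appliance management interface)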

Hope this helps.

Thursday, 12 April 2018

SRM Test Recovery Fails: "Failed to create snapshots of replica devices"

When using SRM with array-based replication, a test recovery operation takes a snapshot of the replica LUN, then presents and mounts it on the ESXi server to bring up the VMs on an isolated network.

In many instances, the test recovery fails at this crucial step of taking a snapshot of the replica device. The GUI reports:

Failed to create snapshots of replica devices 

In this case, always look into the vmware-dr.log on the recovery-site SRM server. In my case I noticed the below snippet:

2018-04-10T11:00:12.287+01:00 error vmware-dr[16896] [Originator@6876 sub=SraCommand opID=7dd8a324:9075:7d02:758d] testFailoverStart's stderr:
--> java.io.IOException: Couldn't get lock for /tmp/santorini.log
--> at java.util.logging.FileHandler.openFiles(Unknown Source)
--> at java.util.logging.FileHandler.<init>(Unknown Source)
=================BREAK========================
--> Apr 10, 2018 11:00:12 AM com.emc.santorini.log.KLogger logWithException
--> WARNING: Unknown error: 
--> com.sun.xml.internal.ws.client.ClientTransportException: HTTP transport error: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at com.sun.xml.internal.ws.transport.http.client.HttpClientTransport.getOutput(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.process(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.transport.DeferredTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.api.pipe.Fiber.__doRun(Unknown Source)
=================BREAK========================
--> Caused by: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at sun.security.ssl.InputRecord.handleUnknownRecord(Unknown Source)
--> at sun.security.ssl.InputRecord.read(Unknown Source)
--> at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)

2018-04-10T11:00:12.299+01:00 error vmware-dr[21512] [Originator@6876 sub=AbrRecoveryEngine opID=7dd8a324:9075:7d02:758d] Dr::Providers::Abr::AbrRecoveryEngine::Internal::RecoverOp::ProcessFailoverFailure: Failed to create snapshots of replica devices for group 'vm-protection-group-45026' using array pair 'array-pair-2038': (dr.storage.fault.CommandFailed) {
-->    faultCause = (dr.storage.fault.LocalizableAdapterFault) {
-->       faultCause = (vmodl.MethodFault) null, 
-->       faultMessage = <unset>, 
-->       code = "78814f38-52ff-32a5-806c-73000467afca.1049", 
-->       arg = <unset>
-->       msg = ""
-->    }, 
-->    faultMessage = <unset>, 
-->    commandName = "testFailoverStart"
-->    msg = ""
--> }
--> [context]

So here the SRA attempts to establish a connection with RecoverPoint over HTTP, which is disabled from RecoverPoint 3.5.x onwards. We need to configure RecoverPoint and SRM to communicate over HTTPS instead.

On the SRM, perform the below:

1. Open CMD in admin mode and navigate to the below location:
c:\Program Files\VMware\VMware vCenter Site Recovery Manager\storage\sra\array-type-recoverpoint

2. Then run the below command:
"c:\Program Files\VMware\VMware vCenter Site Recovery Manager\external\perl-5.14.2\bin\perl.exe" command.pl --useHttps true

In 6.5 I have seen the path to be external\perl\perl\bin\perl.exe,
so verify the correct perl.exe path before running the second command.
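A quick way to confirm which perl layout your build has is to simply list the external folder and adjust the path accordingly, for example:

dir "c:\Program Files\VMware\VMware vCenter Site Recovery Manager\external"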

You should ideally see an output like:
Successfully changed to HTTPS security mode

3. Perform this on both the SRM sites. 

On the RPA, perform the below:

1. Login to each RPA with boxmgmt account

2. [2] Setup > [8] Advanced Options > [7] Security Options > [1] Change Web Server Mode 
(option number may change)

3. You will then be presented with this message:
Do you want to disable the HTTP server (y/n)?

4. Disable HTTP and repeat this on both the production and recovery RPA clusters.

Restart the SRM service on both sites and re-run the test recovery; it should now complete successfully.
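If you prefer to restart the service from an elevated command prompt, it would look something like the below (assuming the default Windows service name vmware-dr, which matches the tag seen in the back traces earlier):

net stop vmware-dr
net start vmware-dr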

Hope this helps. 

Wednesday, 4 April 2018

Maintenance Task Fails On VDP When Connected To Data Domain

There are many instances where the maintenance task fails on VDP. This article is specific to VDP integrated with Data Domain, and in particular to environments where the DD OS version is 6.1 or above.

The checkpoint and HFS tasks were completing fine without issues:
# dumpmaintlogs --types=cp | grep "<4"

2018/03/19-12:01:04.44235 {0.0} <4301> completed checkpoint maintenance
2018/03/19-12:04:17.71935 {0.0} <4300> starting scheduled checkpoint maintenance
2018/03/19-12:04:40.40012 {0.0} <4301> completed checkpoint maintenance

# dumpmaintlogs --types=hfscheck | grep "<4"

2018/03/18-12:00:59.49574 {0.0} <4002> starting scheduled hfscheck
2018/03/18-12:04:11.83316 {0.0} <4003> completed hfscheck of cp.20180318120037
2018/03/19-12:01:04.49357 {0.0} <4002> starting scheduled hfscheck
2018/03/19-12:04:16.59187 {0.0} <4003> completed hfscheck of cp.20180319120042

The garbage collection task was the one that was failing:
# dumpmaintlogs --types=gc --days=30 | grep "<4"

2018/03/18-12:00:22.29852 {0.0} <4200> starting scheduled garbage collection
2018/03/18-12:00:36.77421 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR
2018/03/19-12:00:23.91138 {0.0} <4200> starting scheduled garbage collection
2018/03/19-12:00:41.77701 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR

The ddrmaint.log, located under /usr/local/avamar/var/ddrmaintlogs, had the following entries:

Mar 18 12:00:31 VDP01 ddrmaint.bin[14667]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 18 12:00:34 VDP01 ddrmaint.bin[14667]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set

Mar 19 12:00:35 VDP01 ddrmaint.bin[13409]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 19 12:00:39 VDP01 ddrmaint.bin[13409]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set
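A quick way to pull just these gc-finish failures out of the log from the VDP appliance (assuming the default ddrmaint.log file name) is a simple grep:

# grep -i "gc-finish" /usr/local/avamar/var/ddrmaintlogs/ddrmaint.log | grep -i "error"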

It was failing to retrieve the checkpoint list from the Data Domain.
The get-checkpoint-list operation itself was also failing:

Mar 20 11:16:50 VDP01 ddrmaint.bin[27852]: Error: cplist::body - auto checkpoint list failed result code: 0

Mar 20 11:16:50 VDP01 ddrmaint.bin[27852]: Error: <4750>Datadomain get checkpoint list operation failed.

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::execute_cplist: Failed to retrieve snapshot checkpoints from LSU: avamar-1488469814, ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::body - auto checkpoint list failed result code: 0

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: <4750>Datadomain get checkpoint list operation failed.

From the MTree LSU of this VDP server on the Data Domain, we noticed that the checkpoints were not being expired:
# snapshot list mtree /data/col1/avamar-1488469814

Snapshot Information for MTree: /data/col1/avamar-1488469814
----------------------------------------------
Name                Pre-Comp (GiB)   Create Date         Retain Until   Status
-----------------   --------------   -----------------   ------------   ------
cp.20171220090039         128533.9   Dec 20 2017 09:00
cp.20171220090418         128543.0   Dec 20 2017 09:04
cp.20171221090040         131703.8   Dec 21 2017 09:00
cp.20171221090415         131712.9   Dec 21 2017 09:04
.
cp.20180318120414         161983.7   Mar 18 2018 12:04
cp.20180319120042         162263.9   Mar 19 2018 12:01
cp.20180319120418         162273.7   Mar 19 2018 12:04
cur.1515764908            125477.9   Jan 12 2018 13:49
-----------------   --------------   -----------------   ------------   ------
Snapshot Summary
-------------------
Total:          177
Not expired:    177
Expired:          0

Due to this, all the recent checkpoints on VDP were invalid:
# cplist

cp.20180228120038 Wed Feb 28 12:00:38 2018 invalid --- ---  nodes   1/1 stripes     76
.
cp.20180318120414 Sun Mar 18 12:04:14 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120042 Mon Mar 19 12:00:42 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120418 Mon Mar 19 12:04:18 2018 invalid --- ---  nodes   1/1 stripes     76

In this case, the VDP version was 6.1.x and the Data Domain OS version was 6.1:
# ddrmaint read-ddr-info --format=full

====================== Read-DDR-Info ======================
System name        : xxx.xxxx.xxxx
System ID          : Bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx4
DDBoost user       : ddboost
System index       : 1
Replication        : True
CP Backup          : True
Model number       : DDxxx
Serialno           : Cxxxxxxxx
DDOS version       : 6.1.0.21-579789
System attached    : 1970-01-01 00:00:00 (0)
System max streams : 16

DD OS 6.1 is not supported with VDP 6.1.x; 6.0.x is the last DD OS version supported for VDP.
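To cross-check the DD OS version from the Data Domain side as well, you can run the below from the Data Domain CLI (the exact output format may vary between releases):

# system show version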

So if your DD OS is on 6.1.x, the choices are to:
> Migrate the VDP to Avamar Virtual Edition (Recommended)
> Rollback DD OS to 6.0.x

Hope this helps!