Wednesday, 30 May 2018

Unable To Configure Or Reconfigure Protection Groups In SRM: java.net.SocketTimeoutException: Read timed out

When you try to create a new protection group or reconfigure an existing one, you might run into the following message when you click the Finish option.

 java.net.SocketTimeoutException: Read timed out

Below is a screenshot of this error:


In the web client logs for the respective vCenter, you will see the following entries: 

[2018-05-30T13:49:14.548Z] [INFO ] health-status-65 com.vmware.vise.vim.cm.healthstatus.AppServerHealthService Memory usage: used=406,846,376; max=1,139,277,824; percentage=35.7109010137285%. Status: GREEN
[2018-05-30T13:49:14.549Z] [INFO ] health-status-65   c.v.v.v.cm.HealthStatusRequestHandler$HealthStatusCollectorTask Determined health status 'GREEN' in 0 ms
[2018-05-30T13:49:20.604Z] [ERROR] http-bio-9090-exec-16 70002318 100003 200007 c.v.s.c.g.wizard.addEditGroup.ProtectionGroupMutationProvider Failed to reconfigure PG [DrReplicationVmProtectionGroup:vm-protection-group-11552:67
2e1d34-cbad-46a4-ac83-a7c100547484]:  com.vmware.srm.client.topology.client.view.PairSetup$RemoteLoginFailed: java.net.SocketTimeoutException: Read timed out
        at com.vmware.srm.client.topology.impl.view.PairSetupImpl.remoteLogin(PairSetupImpl.java:126)
        at com.vmware.srm.client.infraservice.util.TopologyHelper.loginRemoteSite(TopologyHelper.java:398)
        at com.vmware.srm.client.groupservice.wizard.addEditGroup.ProtectionGroupMutationProvider.apply(ProtectionGroupMutationProvider.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.vmware.vise.data.provider.DelegatingServiceBase.invokeProviderInternal(DelegatingServiceBase.java:400)


Caused by: com.vmware.vim.vmomi.client.exception.ConnectionException: java.net.SocketTimeoutException: Read timed out
        at com.vmware.vim.vmomi.client.common.impl.ResponseImpl.setError(ResponseImpl.java:252)
        at com.vmware.vim.vmomi.client.http.impl.HttpExchange.run(HttpExchange.java:51)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingBase.executeRunnable(HttpProtocolBindingBase.java:226)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingImpl.send(HttpProtocolBindingImpl.java:110)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.sendCall(MethodInvocationHandlerImpl.java:613)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.executeCall(MethodInvocationHandlerImpl.java:594)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl.completeCall(MethodInvocationHandlerImpl.java:345)

There might be an issue with the ports between the vCenter and SRM servers; you can validate those ports using this KB here.
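
If you want a quick command-line check before walking through the full KB list, the below is a minimal sketch. It assumes you can run it from the vCenter appliance shell (or any Linux host on the same network segment); the host names and ports are only examples, so substitute your own values and the complete port list from the KB:

# timeout 5 bash -c "</dev/tcp/srm.example.local/9086" && echo "9086 open" || echo "9086 blocked"
# timeout 5 bash -c "</dev/tcp/vcenter.example.local/443" && echo "443 open" || echo "443 blocked"

A successful connect means the port is reachable; a timeout or refusal points at the network path or a firewall in between.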

If the ports are fine, then validate that no guest-level security agents on the SRM or vCenter (Windows) servers are blocking this traffic.

In my case the network connection and firewall/security settings were fine, and the fix was to perform a Modify of the SRM installation on both sites. Once this was done, the site pairing was reconfigured, and after that we were able to reconfigure the protection groups successfully.

Tuesday, 22 May 2018

VDP 6.1.8 Upgrade Stalls At 97 Percent

When upgrading a vSphere Data Protection appliance to 6.1.8, there might be a case where the upgrade stalls at 97 percent with the below task stuck in progress:


The avinstaller.log.0 file, located under /usr/local/avamar/var/avi/server_logs, has the following message:

May 21, 2018 12:07:38 PM org.hibernate.transaction.JDBCTransaction commit
SEVERE: JDBC commit failed
java.sql.SQLException: database is locked
        at org.sqlite.DB.throwex(DB.java:855)
        at org.sqlite.DB.exec(DB.java:138)
        at org.sqlite.SQLiteConnection.commit(SQLiteConnection.java:512)
        at org.hibernate.transaction.JDBCTransaction.commitAndResetAutoCommit(JDBCTransaction.java:170)
        at org.hibernate.transaction.JDBCTransaction.commit(JDBCTransaction.java:146)
        at org.jbpm.pvm.internal.tx.HibernateSessionResource.commit(HibernateSessionResource.java:64)
        at org.jbpm.pvm.internal.tx.StandardTransaction.commit(StandardTransaction.java:139)
        at org.jbpm.pvm.internal.tx.StandardTransaction.complete(StandardTransaction.java:64)
        at org.jbpm.pvm.internal.tx.StandardTransactionInterceptor.execute(StandardTransactionInterceptor.java:57)

This is because the avidb is locked. To resume the upgrade, restart the avidb by issuing the below command:
# avinstaller.pl --restart

Post the restart, the upgrade should resume successfully. Since this is on the 126th package, the upgrade would complete fairly quickly after the service restart and you might not notice any further progress. In this case, you can go back to the avinstaller.log.0 and verify whether the upgrade has completed.
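
A simple way to keep an eye on this from the shell is to follow the same log mentioned above until the final packages report completion (stop the tail with Ctrl+C once done):

# tail -f /usr/local/avamar/var/avi/server_logs/avinstaller.log.0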

Hope this helps.

Tuesday, 15 May 2018

Bad Exit Code: 1 During Upgrade Of vSphere Replication To 8.1

With the release of vSphere Replication 8.1 comes a ton of new upgrade and deployment issues. One common issue is the Bad Exit Code: 1 error during the upgrade phase. This applies to upgrades from 6.1.2 or 6.5.x to 8.1.

The first thing you will notice in the GUI is the following error message.


If you Retry, the upgrade will still fail; if you Ignore, the upgrade will proceed, but you will then notice a failure during the configuration section.


Only after a "successful" failed upgrade can we access the logs to see what the issue is.

There is a log called hms-boot.log which records all of this information; it can be found under /opt/vmware/hms/logs.

Here, the first error was this:

----------------------------------------------------
# Upgrade Services
Stopping hms service ... OK
Stopping vcta service ... OK
Stopping hbr service ... OK
Downloading file [/opt/vmware/hms/conf/hms-configuration.xml] to [/opt/vmware/upgrade/oldvr] ...Failure during upgrade procedure at Upgrade Services phase: java.io.IOException: inputstream is closed

com.jcraft.jsch.JSchException: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:315)
        at com.jcraft.jsch.Channel.connect(Channel.java:152)
        at com.jcraft.jsch.Channel.connect(Channel.java:145)
        at com.vmware.hms.apps.util.upgrade.SshUtil.getSftpChannel(SshUtil.java:66)
        at com.vmware.hms.apps.util.upgrade.SshUtil.downloadFile(SshUtil.java:88)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.downloadConfigFiles(Vr81MigrationUpgradeWorkflow.java:578)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$3(Vr81MigrationUpgradeWorkflow.java:1222)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)
Caused by: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2911)
        at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2935)
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:262)
        ... 24 more

Then, when I proceeded with Ignore, the error was this:

# Reconfigure VR
Failure during upgrade procedure at Reconfigure VR phase: null

java.lang.NullPointerException
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.vrReconfig(Vr81MigrationUpgradeWorkflow.java:1031)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$5(Vr81MigrationUpgradeWorkflow.java:1253)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)

When we proceeded with Ignore once more, the last stack was this:

Initialization error: Bad exit code: 1
Traceback (most recent call last):
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 178, in main
    __ROUTINES__[name]()
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 714, in get_default_sitename
    ovf.hms_cache_sitename()
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 686, in hms_cache_sitename
    cache_f.write(hms_get_sitename(ext_key, jks, passwd, alias))
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 679, in hms_get_sitename
    ext_key, jks, passwd, alias
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 412, in get_sitename
    output = commands.execute(cmd, None, __HMS_HOME__)[0]
  File "/opt/vmware/share/htdocs/service/hms/cgi/commands.py", line 324, in execute
    raise Exception('Bad exit code: %d' % proc.returncode)
Exception: Bad exit code: 1

So it looks like there is an issue with copying files from the old vR server to the new one. In the sshd_config file under /etc/ssh/ on the old vR server, the following entry was present:

Subsystem sftp /usr/lib64/ssh/sftp-server

Edit this line so that it reads:
Subsystem sftp /usr/lib/ssh/sftp-server
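
If you prefer doing this from the shell of the old vR appliance, the below is a minimal sketch. It assumes the sftp-server binary really does live under /usr/lib/ssh on your appliance, so confirm that with the first command before making the change:

# ls -l /usr/lib/ssh/sftp-server
# sed -i 's|/usr/lib64/ssh/sftp-server|/usr/lib/ssh/sftp-server|' /etc/ssh/sshd_config
# service sshd restart

The sed swaps the lib64 path for the lib path on the Subsystem line, and restarting sshd (use systemctl restart sshd if the service wrapper is not available) makes the new SFTP subsystem path take effect for the next connection from the upgrade workflow.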

Then retry the upgrade by deploying a fresh 8.1 appliance and going through the "upgrade" process again; this time it should complete successfully.

Hope this helps!

Upgrading vSphere Replication From 6.5 To 8.1

With the release of vSphere Replication 8.1, the upgrade path is no longer what it used to be. The 8.1 vR server is now based on Photon OS, and the upgrade is similar to a vCenter migration. In this case, you deploy a new 8.1 vR server via the OVF template with a temporary IP and then follow a series of upgrade/migrate steps to transfer data from the old vR server to the new one.

1. Proceed with the regular deployment of the vSphere Replication appliance: download the 8.1 ISO, mount it on a Windows server, and choose the support.vmdk, system.vmdk, certificate, manifest, and OVF files for deployment. A temporary IP is needed for the appliance to be on the network.

2. Once the deployment is done, power on the 8.1 appliance and open a VM console. During the boot you will be presented with the below options.


192.168.1.110 is my 6.5 vSphere Replication appliance, and it was already registered to the vCenter Server. Select Option 3 to proceed with the Upgrade.

NOTE: For the Bad Exit Code: 1 error during upgrade, refer to this article here.

3. Provide the root password of the old replication server to proceed.


4. The upgrade process begins to install the necessary RPMs. This might take about 10 minutes to complete.


5. You will then be prompted to enter the SSO user name and password of the vCenter Server this vR appliance is registered to.


6. After a bit of configuration progress in the window, the upgrade is done and you will be presented with the 8.1 banner page.


That should be it. Hope this helps!

Tuesday, 1 May 2018

VDP Backup And Maintenance Tasks Fail With "DDR result code: 5004, desc: nothing matched"

I came across a scenario where backups on a VDP appliance connected to a Data Domain failed constantly, and the backup logs for the avtar service had the following snippet:

2018-04-24T20:01:16.180+07:00 avtar Info <19156>: - Establishing a connection to the Data Domain system with encryption (Connection mode: A:3 E:2).
2018-04-24T20:01:26.473+07:00 avtar Error <10542>: Data Domain server "data-domain.vcloud.local" open failed DDR result code: 5004, desc: nothing matched
2018-04-24T20:01:26.473+07:00 avtar Error <10509>: Problem logging into the DDR server:'', only GSAN communication was enabled.
2018-04-24T20:01:26.474+07:00 avtar FATAL <17964>: Backup is incomplete because file "/ddr_files.xml" is missing
2018-04-24T20:01:26.474+07:00 avtar Info <10642>: DDR errors caused the backup to not be posted, errors=0, fatals=0
2018-04-24T20:01:26.474+07:00 avtar Info <12530>: Backup was not committed to the DDR.
2018-04-24T20:01:26.475+07:00 avtar FATAL <8941>: Fatal server connection problem, aborting initialization. Verify correct server address and login credentials.
2018-04-24T20:01:26.475+07:00 avtar Info <6149>: Error summary: 4 errors: 10542, 8941, 10509, 17964
2018-04-24T20:01:26.476+07:00 avtar Info <8468>: Sending wrapup message to parent
2018-04-24T20:01:26.476+07:00 avtar Info <5314>: Command failed (4 errors, exit code 10008: cannot establish connection with server (possible network or DNS failure))

Whenever we have issues with a VDP system backed by a Data Domain, we usually have a look at the ddrmaint logs located under /usr/local/avamar/var/ddrmaintlogs.
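
For a quick first pass, pulling the recent errors out of the current log in that directory is usually enough, for example:

# cd /usr/local/avamar/var/ddrmaintlogs
# grep -i error ddrmaint.log | tail -n 50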

Under this, I saw the following:

Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: Data Domain Engine (3.1.0.1 build 481386)
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::enumerate_ddrconfig ddr_info version is 5.
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::enumerate_ddrconfig found 1 ddrconfig records in ddr_info
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: dpnid 1494515613 from flag
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: LSU: avamar-1494515613
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: server=data-domain.vcloud.local(1),id=4D752F9C9A984D026520ACA64AA465388352BAB1,user=vmware_vdp
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: Establishing a connection to the Data Domain system with basic authentication (Connection mode: A:0 E:0).
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: cplist::open_ddr: DDR_Open failed: data-domain.vcloud.local(1) lsu: avamar-1494515613, DDR result code: 5004, desc: nothing matched
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: cplist::body - ddr_info file from the persistent store has no ddr_config entries
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: <4750>Datadomain get checkpoint list operation failed.
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Info: ============================= cplist finished in 11 seconds
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Info: ============================= cplist cmd finished =============================

Out of curiosity, I tried to list the Data Domain checkpoints, and that failed too:

root@vdp:~/#: ddrmaint cplist
<4750>Datadomain get checkpoint list operation failed.

The DD OS version seemed to be supported for the VDP node:

root@vdp:/usr/local/avamar/var/ddrmaintlogs/#: ddrmaint read-ddr-info --format=full
====================== Read-DDR-Info ======================

 System name        : data-domain.vcloud.local
 System ID          : 4D752F9C9A984D026520ACB64AA465388352BAB1
 DDBoost user       : vmware_vdp
 System index       : 1
 Replication        : True
 CP Backup          : True
 Model number       : DD670
 Serialno           : 3FA0807274
 DDOS version       : 6.0.1.10-561375
 System attached    : 1970-01-01 00:00:00 (0)
 System max streams : 16

You can SSH into the Data Domain and view the ddfs logs using the below command:
# log view debug/ddfs.info

I noticed the following in this log:

04/25 12:08:57.141 (tid 0x7fbf4fa68490): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:57.141 (tid 0x7fbf4fa68490): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1
04/25 12:08:58.147 (tid 0x6314fb0): nfs3_fm_lookup_by_path("/data/col1/backup/ost") failed: 5183
04/25 12:08:58.147 (tid 0x6314fb0): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:58.147 (tid 0x6314fb0): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1
04/25 12:08:59.153 (tid 0x7fc24501bd40): nfs3_fm_lookup_by_path("/data/col1/backup/ost") failed: 5183
04/25 12:08:59.153 (tid 0x7fc24501bd40): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:59.153 (tid 0x7fc24501bd40): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1

The VDP server relies on the ost folder to perform maintenance tasks, and this folder was missing from the Data Domain, causing the maintenance and backup tasks to fail.

To fix this, we need to recreate the ost folder and export it over NFS:

1. Log in to the Bash shell of the Data Domain. You can view this article for the steps.

2. Navigate to the below directory:
# cd /data/col1/backup

3. Verify that there is no ost folder when you list the files and folders.

4. Create the directory and set the ownership and permissions with the below commands:
# mkdir ost
# chmod 777 ost
# chown <your_ddboost_user> ost

5. Exit the bash shell with exit

6. Create the NFS mount with:
# nfs add /backup/ost *

You will see the below message:
NFS export for "/backup/ost" added.

Exit the Data Domain and, back in the VDP SSH session, run ddrmaint cplist; it should now return the checkpoint list successfully. Maintenance and backup tasks should then complete without issues.
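
As a quick verification sketch from the VDP side, both of the commands used earlier in this post should now come back clean:

# ddrmaint cplist
# ddrmaint read-ddr-info --format=full

cplist should list the checkpoints instead of returning the <4750> failure, and read-ddr-info should show the Data Domain details as before.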

Hope this helps!

Thursday, 26 April 2018

SRM Service Fails To Start: "Could not initialize Vdb connection Data source name not found and no default driver specified"

In a few cases, you might come across a scenario where the Site Recovery Manager service does not start, and in the Event Viewer you will notice the following back trace for the vmware-dr service.

VMware vCenter Site Recovery Manager application error.
class Vmacore::Exception "DBManager error: Could not initialize Vdb connection: ODBC error: (IM002) - [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] vmware-dr.exe[0x00107621]
backtrace[05] MSVCR120.dll[0x00066920]
backtrace[06] MSVCR120.dll[0x0005E36D]
backtrace[07] ntdll.dll[0x00092A63]
backtrace[08] vmware-dr.exe[0x00014893]
backtrace[09] vmware-dr.exe[0x00015226]
backtrace[10] windowsService.dll[0x00002BF5]
backtrace[11] windowsService.dll[0x00001F24]
backtrace[12] sechost.dll[0x00005ADA]
backtrace[13] KERNEL32.DLL[0x000013D2]
backtrace[14] ntdll.dll[0x000154E4]
[backtrace end]  

There are no logs generated in vmware-dr.log and the ODBC connection test completes successfully too. 

However, when you go to the vmware-dr.xml file located under C:\Program Files\VMware\VMware vCenter Site Recovery Manager\config and search for the <DBManager> tag, you will notice that the <dsn> name is incorrect.

Upon providing the right DSN name within the <dsn> </dsn> tags, you will then notice a new back trace when you attempt to start the service again:

VMware vCenter Site Recovery Manager application error.
class Vmacore::InvalidArgumentException "Invalid argument"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] listener.dll[0x0000BCBC]

What I suspect is that something has gone wrong with the vmware-dr.xml file, and the fix for this is to re-install the SRM application against the existing database. 

Post this, the service starts successfully. Hope this helps. 

Friday, 20 April 2018

Upgrading vCenter Appliance From 6.5 to 6.7

So as you know, vSphere 6.7 is now GA, and this article will cover upgrading an embedded PSC deployment of the 6.5 vCenter appliance to 6.7. Once you download the 6.7 VCSA ISO installer, mount the ISO on a local Windows machine; you can then use the UI installer for Windows to begin the upgrade.

You will be presented with the below choices:


We will be going with the Upgrade option. The upgrade follows the earlier path, wherein the process deploys a new 6.7 VCSA, performs a data and configuration migration from the older 6.5 appliance, and then powers down the old server once the upgrade is successful.


Accept the EULA to proceed further.


In the next step we will connect to the source appliance, so provide the IP/FQDN of the source 6.5 vCenter Server.


Once the Connect To Source step goes through, you will be asked to enter the SSO details and the details of the ESX host where the 6.5 vCSA is running.


Then the next step is to provide information about the target appliance, the 6.7 appliance. You will select the ESX where the target appliance should be deployed.


Then provide the inventory display name for the target vCenter 6.7 along with a root password.


Select the appliance deployment size for the target server. Make sure this matches or is greater than the size of the source 6.5 appliance.


Then select the datastore where the target appliance should reside.


Next, we will provide a set of temporary network details for the 6.7 appliance. The appliance will inherit the old 6.5 network configuration post a successful migration.


Review the details and click Finish to begin the Stage 1 deployment process.


Once Stage 1 is done, you can click Continue to proceed with Stage 2.



In Stage 2, we will be performing a data copy from the source vCenter appliance to the target deployed in Stage 1.


Provide the details to connect to the source vCenter Server.


Select the type of data to be copied over to the destination vCenter server. In my case, I just want to migrate the configuration data.


Join the CEIP and proceed further


Review the details and Finish to begin the data copy.


The source vCenter will be shut down post the data copy.


The data migration will take a while to complete and is in 3 stages.


And that's it. If all goes well, the migration is complete and you can access your new vCenter from the URL.

Hope this helps.

Thursday, 12 April 2018

SRM Test Recovery Fails: "Failed to create snapshots of replica devices"

When using SRM with array-based replication, a test recovery operation takes a snapshot of the replica LUN, then presents and mounts it on the ESX server to bring up the VMs on an isolated network. 

In many instances, the test recovery would fail at the crucial step, which is taking a snapshot of the replica device. The GUI would mention: 

Failed to create snapshots of replica devices 

In this case, always look into the vmware-dr.log on the recovery-site SRM server. In my case, I noticed the below snippet:

2018-04-10T11:00:12.287+01:00 error vmware-dr[16896] [Originator@6876 sub=SraCommand opID=7dd8a324:9075:7d02:758d] testFailoverStart's stderr:
--> java.io.IOException: Couldn't get lock for /tmp/santorini.log
--> at java.util.logging.FileHandler.openFiles(Unknown Source)
--> at java.util.logging.FileHandler.<init>(Unknown Source)
=================BREAK========================
--> Apr 10, 2018 11:00:12 AM com.emc.santorini.log.KLogger logWithException
--> WARNING: Unknown error: 
--> com.sun.xml.internal.ws.client.ClientTransportException: HTTP transport error: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at com.sun.xml.internal.ws.transport.http.client.HttpClientTransport.getOutput(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.process(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.transport.DeferredTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.api.pipe.Fiber.__doRun(Unknown Source)
=================BREAK========================
--> Caused by: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at sun.security.ssl.InputRecord.handleUnknownRecord(Unknown Source)
--> at sun.security.ssl.InputRecord.read(Unknown Source)
--> at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)

2018-04-10T11:00:12.299+01:00 error vmware-dr[21512] [Originator@6876 sub=AbrRecoveryEngine opID=7dd8a324:9075:7d02:758d] Dr::Providers::Abr::AbrRecoveryEngine::Internal::RecoverOp::ProcessFailoverFailure: Failed to create snapshots of replica devices for group 'vm-protection-group-45026' using array pair 'array-pair-2038': (dr.storage.fault.CommandFailed) {
-->    faultCause = (dr.storage.fault.LocalizableAdapterFault) {
-->       faultCause = (vmodl.MethodFault) null, 
-->       faultMessage = <unset>, 
-->       code = "78814f38-52ff-32a5-806c-73000467afca.1049", 
-->       arg = <unset>
-->       msg = ""
-->    }, 
-->    faultMessage = <unset>, 
-->    commandName = "testFailoverStart"
-->    msg = ""
--> }
--> [context]

So here the SRA attempts to establish a connection to the RecoverPoint over HTTP, which is disabled from 3.5.x onwards, so we need to allow RP and SRM to communicate over HTTPS. 

On the SRM, perform the below:

1. Open CMD in admin mode and navigate to the below location:
c:\Program Files\VMware\VMware vCenter Site Recovery Manager\storage\sra\array-type-recoverpoint

2. Then run the below command:
"c:\Program Files\VMware\VMware vCenter Site Recovery Manager\external\perl-5.14.2\bin\perl.exe" command.pl --useHttps true

In 6.5 I have seen the path to be external\perl\perl\bin\perl.exe, so verify the correct path before running the second command. 

You should ideally see an output like:
Successfully changed to HTTPS security mode

3. Perform this on both the SRM sites. 

On the RPA, perform the below:

1. Log in to each RPA with the boxmgmt account.

2. [2] Setup > [8] Advanced Options > [7] Security Options > [1] Change Web Server Mode 
(option number may change)

3. You will be then presented with this message:
Do you want to disable the HTTP server (y/n)?

4. Disable HTTP and repeat this on the production and recovery RPA clusters. 

Restart the SRM service on both sites and re-run the test recovery; it should now complete successfully. 

Hope this helps. 

Wednesday, 4 April 2018

Maintenance Task Fails On VDP When Connected To Data Domain

There are many instances where the maintenance task fails on VDP. This article is specific to VDP when integrated with Data Domain, and more so when the DD OS version is 6.1 and above.

The checkpoint and HFS tasks were completing fine without issues:
# dumpmaintlogs --types=cp | grep "<4"

2018/03/19-12:01:04.44235 {0.0} <4301> completed checkpoint maintenance
2018/03/19-12:04:17.71935 {0.0} <4300> starting scheduled checkpoint maintenance
2018/03/19-12:04:40.40012 {0.0} <4301> completed checkpoint maintenance

# dumpmaintlogs --types=hfscheck | grep "<4"

2018/03/18-12:00:59.49574 {0.0} <4002> starting scheduled hfscheck
2018/03/18-12:04:11.83316 {0.0} <4003> completed hfscheck of cp.20180318120037
2018/03/19-12:01:04.49357 {0.0} <4002> starting scheduled hfscheck
2018/03/19-12:04:16.59187 {0.0} <4003> completed hfscheck of cp.20180319120042

The garbage collection task was the one that was failing:
# dumpmaintlogs --types=gc --days=30 | grep "<4"

2018/03/18-12:00:22.29852 {0.0} <4200> starting scheduled garbage collection
2018/03/18-12:00:36.77421 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR
2018/03/19-12:00:23.91138 {0.0} <4200> starting scheduled garbage collection
2018/03/19-12:00:41.77701 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR

The ddrmaint.log, located under /usr/local/avamar/var/ddrmaintlogs, had the following entries:

Mar 18 12:00:31 VDP01 ddrmaint.bin[14667]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 18 12:00:34 VDP01 ddrmaint.bin[14667]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set

Mar 19 12:00:35 VDP01 ddrmaint.bin[13409]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 19 12:00:39 VDP01 ddrmaint.bin[13409]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set

It was basically failing to retrieve the checkpoint list from the Data Domain. The get-checkpoint-list operation was failing as well:

Mar 20 11:16:50 VDP01 ddrmaint.bin[27852]: Error: cplist::body - auto checkpoint list failed result code: 0

Mar 20 11:16:50 VDP01 ddrmaint.bin[27852]: Error: <4750>Datadomain get checkpoint list operation failed.

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::execute_cplist: Failed to retrieve snapshot checkpoints from LSU: avamar-1488469814, ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::body - auto checkpoint list failed result code: 0

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: <4750>Datadomain get checkpoint list operation failed.

From the MTree LSU of this VDP server, we noticed that the checkpoints were not expired:
# snapshot list mtree /data/col1/avamar-1488469814

Snapshot Information for MTree: /data/col1/avamar-1488469814
----------------------------------------------
Name                Pre-Comp (GiB)   Create Date         Retain Until   Status
-----------------   --------------   -----------------   ------------   ------
cp.20171220090039         128533.9   Dec 20 2017 09:00
cp.20171220090418         128543.0   Dec 20 2017 09:04
cp.20171221090040         131703.8   Dec 21 2017 09:00
cp.20171221090415         131712.9   Dec 21 2017 09:04
.
cp.20180318120414         161983.7   Mar 18 2018 12:04
cp.20180319120042         162263.9   Mar 19 2018 12:01
cp.20180319120418         162273.7   Mar 19 2018 12:04
cur.1515764908            125477.9   Jan 12 2018 13:49
-----------------   --------------   -----------------   ------------   ------
Snapshot Summary
-------------------
Total:          177
Not expired:    177
Expired:          0

Due to this, all the recent checkpoints on VDP were invalid:
# cplist

cp.20180228120038 Wed Feb 28 12:00:38 2018 invalid --- ---  nodes   1/1 stripes     76
.
cp.20180318120414 Sun Mar 18 12:04:14 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120042 Mon Mar 19 12:00:42 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120418 Mon Mar 19 12:04:18 2018 invalid --- ---  nodes   1/1 stripes     76

The case here is that the VDP version was 6.1.x and the Data Domain OS version was 6.1:
# ddrmaint read-ddr-info --format=full

====================== Read-DDR-Info ======================
System name        : xxx.xxxx.xxxx
System ID          : Bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx4
DDBoost user       : ddboost
System index       : 1
Replication        : True
CP Backup          : True
Model number       : DDxxx
Serialno           : Cxxxxxxxx
DDOS version       : 6.1.0.21-579789
System attached    : 1970-01-01 00:00:00 (0)
System max streams : 16

DD OS 6.1 is not supported for VDP 6.1.x; 6.0.x is the last DD OS version supported for VDP.
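
If you need to re-confirm the DD OS version the appliance is attached to, the same read-ddr-info output shown above can be filtered down to just that line:

# ddrmaint read-ddr-info --format=full | grep "DDOS version"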

So if your DD OS is on 6.1.x, then the choices would be to:
> Migrate the VDP to Avamar Virtual Edition (Recommended)
> Roll back DD OS to 6.0.x

Hope this helps!