Wednesday, 30 May 2018

Unable To Configure Or Reconfigure Protection Groups In SRM: java.net.SocketTimeoutException: Read timed out

When you try to create or reconfigure a protection group, you might run into the following message when you click Finish:

 java.net.SocketTimeoutException: Read timed out

Below is a screenshot of this error:


In the web client logs for the respective vCenter, you will see the following entries:

[2018-05-30T13:49:14.548Z] [INFO ] health-status-65 com.vmware.vise.vim.cm.healthstatus.AppServerHealthService Memory usage: used=406,846,376; max=1,139,277,824; percentage=35.7109010137285%. Status: GREEN
[2018-05-30T13:49:14.549Z] [INFO ] health-status-65   c.v.v.v.cm.HealthStatusRequestHandler$HealthStatusCollectorTask Determined health status 'GREEN' in 0 ms
[2018-05-30T13:49:20.604Z] [ERROR] http-bio-9090-exec-16 70002318 100003 200007 c.v.s.c.g.wizard.addEditGroup.ProtectionGroupMutationProvider Failed to reconfigure PG [DrReplicationVmProtectionGroup:vm-protection-group-11552:672e1d34-cbad-46a4-ac83-a7c100547484]:  com.vmware.srm.client.topology.client.view.PairSetup$RemoteLoginFailed: java.net.SocketTimeoutException: Read timed out
        at com.vmware.srm.client.topology.impl.view.PairSetupImpl.remoteLogin(PairSetupImpl.java:126)
        at com.vmware.srm.client.infraservice.util.TopologyHelper.loginRemoteSite(TopologyHelper.java:398)
        at com.vmware.srm.client.groupservice.wizard.addEditGroup.ProtectionGroupMutationProvider.apply(ProtectionGroupMutationProvider.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.vmware.vise.data.provider.DelegatingServiceBase.invokeProviderInternal(DelegatingServiceBase.java:400)


Caused by: com.vmware.vim.vmomi.client.exception.ConnectionException: java.net.SocketTimeoutException: Read timed out
        at com.vmware.vim.vmomi.client.common.impl.ResponseImpl.setError(ResponseImpl.java:252)
        at com.vmware.vim.vmomi.client.http.impl.HttpExchange.run(HttpExchange.java:51)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingBase.executeRunnable(HttpProtocolBindingBase.java:226)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingImpl.send(HttpProtocolBindingImpl.java:110)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.sendCall(MethodInvocationHandlerImpl.java:613)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.executeCall(MethodInvocationHandlerImpl.java:594)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl.completeCall(MethodInvocationHandlerImpl.java:345)

There might be an issue with the required ports between the vCenter and SRM servers; you can validate those ports using the relevant VMware KB article.
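As a quick first check, TCP reachability of the documented ports can be probed from a shell. This is a sketch only: the hostname is a placeholder, and 443 (vCenter) and 9086 (SRM server) are the commonly cited ports, so confirm the full list against the KB for your SRM version.

```shell
# check_port HOST PORT -> succeeds if a TCP connection opens within 5 seconds
check_port() {
    timeout 5 bash -c ">/dev/tcp/$1/$2" 2>/dev/null
}

# Placeholder hostname and an assumed port list -- substitute your own
# and verify the ports against the VMware KB for your SRM version.
for port in 443 9086; do
    if check_port srm-server.example.com "$port"; then
        echo "port $port reachable"
    else
        echo "port $port blocked or closed"
    fi
done
```

A blocked or closed port points at a firewall or agent in the path; a reachable port shifts suspicion elsewhere.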

If the ports are fine, then validate that no guest-level security agents on the SRM or vCenter (Windows) servers are blocking this traffic.

In my case, the network connection and firewall / security settings were fine. The fix was to perform a Modify on the SRM installation at both sites; once this was done, we reconfigured the site pairing, and after that we were able to reconfigure the protection groups successfully.

Tuesday, 22 May 2018

VDP 6.1.8 Upgrade Stalls At 97 Percent

When upgrading a vSphere Data Protection appliance to 6.1.8, there might be a case where the upgrade stalls at 97 percent with the below task stuck in progress:


The avinstaller.log.0 file, located under /usr/local/avamar/var/avi/server_logs, has the following message:

May 21, 2018 12:07:38 PM org.hibernate.transaction.JDBCTransaction commit
SEVERE: JDBC commit failed
java.sql.SQLException: database is locked
        at org.sqlite.DB.throwex(DB.java:855)
        at org.sqlite.DB.exec(DB.java:138)
        at org.sqlite.SQLiteConnection.commit(SQLiteConnection.java:512)
        at org.hibernate.transaction.JDBCTransaction.commitAndResetAutoCommit(JDBCTransaction.java:170)
        at org.hibernate.transaction.JDBCTransaction.commit(JDBCTransaction.java:146)
        at org.jbpm.pvm.internal.tx.HibernateSessionResource.commit(HibernateSessionResource.java:64)
        at org.jbpm.pvm.internal.tx.StandardTransaction.commit(StandardTransaction.java:139)
        at org.jbpm.pvm.internal.tx.StandardTransaction.complete(StandardTransaction.java:64)
        at org.jbpm.pvm.internal.tx.StandardTransactionInterceptor.execute(StandardTransactionInterceptor.java:57)

This is because the avidb is locked; to resume the upgrade, restart it by issuing the below command:
# avinstaller.pl --restart

Post the restart, the upgrade should resume successfully. Since this is already at the 126th package, the upgrade completes fairly quickly after the service restart, and you might not notice any further progress. In this case, go back to avinstaller.log.0 and verify that the upgrade has completed.
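One way to confirm completion from the shell is to scan the tail of the installer log for status lines. The log path is from this post; the exact wording of the final status message can vary between builds, so the pattern below matches broadly.

```shell
# Assumed log location (from the post); adjust if your build differs.
LOG=/usr/local/avamar/var/avi/server_logs/avinstaller.log.0

# Scan the end of the installer log for completion or failure markers.
if [ -f "$LOG" ]; then
    tail -n 100 "$LOG" | grep -Ei 'complete|success|failed|severe' \
        || echo "no status lines yet"
fi
```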

Hope this helps.

Tuesday, 15 May 2018

Bad Exit Code: 1 During Upgrade Of vSphere Replication To 8.1

With the release of 8.1, vSphere Replication comes with a ton of new upgrade and deployment issues. A common one is the Bad Exit Code: 1 error during the upgrade phase. This applies to upgrades from 6.1.2 or 6.5.x to 8.1.

The first thing you will notice in the GUI is the following error message.


If you Retry, the upgrade will still fail; if you Ignore, the upgrade will proceed, but you will then notice a failure during the configuration section.


Only after the upgrade has "successfully" failed can we access the logs to see what the issue is.

There is a log called hms-boot.log which records all of this information; it can be found under /opt/vmware/hms/logs.

Here, the first error was this:

----------------------------------------------------
# Upgrade Services
Stopping hms service ... OK
Stopping vcta service ... OK
Stopping hbr service ... OK
Downloading file [/opt/vmware/hms/conf/hms-configuration.xml] to [/opt/vmware/upgrade/oldvr] ...Failure during upgrade procedure at Upgrade Services phase: java.io.IOException: inputstream is closed

com.jcraft.jsch.JSchException: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:315)
        at com.jcraft.jsch.Channel.connect(Channel.java:152)
        at com.jcraft.jsch.Channel.connect(Channel.java:145)
        at com.vmware.hms.apps.util.upgrade.SshUtil.getSftpChannel(SshUtil.java:66)
        at com.vmware.hms.apps.util.upgrade.SshUtil.downloadFile(SshUtil.java:88)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.downloadConfigFiles(Vr81MigrationUpgradeWorkflow.java:578)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$3(Vr81MigrationUpgradeWorkflow.java:1222)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)
Caused by: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2911)
        at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2935)
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:262)
        ... 24 more

Then, when I proceeded with Ignore, the error was this:

# Reconfigure VR
Failure during upgrade procedure at Reconfigure VR phase: null

java.lang.NullPointerException
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.vrReconfig(Vr81MigrationUpgradeWorkflow.java:1031)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$5(Vr81MigrationUpgradeWorkflow.java:1253)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)

When we proceeded with Ignore once more, the last stack was this:

Initialization error: Bad exit code: 1
Traceback (most recent call last):
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 178, in main
    __ROUTINES__[name]()
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 714, in get_default_sitename
    ovf.hms_cache_sitename()
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 686, in hms_cache_sitename
    cache_f.write(hms_get_sitename(ext_key, jks, passwd, alias))
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 679, in hms_get_sitename
    ext_key, jks, passwd, alias
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 412, in get_sitename
    output = commands.execute(cmd, None, __HMS_HOME__)[0]
  File "/opt/vmware/share/htdocs/service/hms/cgi/commands.py", line 324, in execute
    raise Exception('Bad exit code: %d' % proc.returncode)
Exception: Bad exit code: 1

So it looks like there is an issue with copying files from the old vR server to the new one. In the sshd_config file under /etc/ssh/ on the old vR server, there was the following entry:

Subsystem sftp /usr/lib64/ssh/sftp-server

Edit this line so that it reads:
Subsystem sftp /usr/lib/ssh/sftp-server
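If you prefer, the same edit can be scripted. This is a sketch: it keeps a backup before rewriting the path, and the service restart command may differ on the SLES-based 6.5 appliance.

```shell
# Rewrite the sftp Subsystem entry from the lib64 path to the lib path.
CFG=${CFG:-/etc/ssh/sshd_config}
if [ -f "$CFG" ] && grep -q '/usr/lib64/ssh/sftp-server' "$CFG"; then
    cp "$CFG" "$CFG.bak"                                       # keep a backup
    sed -i 's|/usr/lib64/ssh/sftp-server|/usr/lib/ssh/sftp-server|' "$CFG"
    service sshd restart || true   # restart command may vary by appliance OS
fi
```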

Then retry the upgrade by deploying a fresh 8.1 appliance and going through the "upgrade" process again; this time it should complete successfully.

Hope this helps!

Upgrading vSphere Replication From 6.5 To 8.1

With the release of vSphere Replication 8.1, the upgrade path is not what it used to be. The 8.1 vR appliance now runs PhotonOS, and the upgrade is similar to a vCenter migration: you deploy a new 8.1 vR server from the OVF template with a temporary IP, then follow a series of upgrade / migrate steps to transfer data from the old vR server to the new one.

1. Proceed with the regular deployment of the vSphere Replication appliance: download the 8.1 ISO, mount it on a Windows server, and choose the support.vmdk, system.vmdk, certificate, manifest, and OVF files for deployment. A temporary IP is needed for the appliance to be on the network.

2. Once the deployment is done, power on the 8.1 appliance and open a VM console. During the boot you will be presented with the below options.


192.168.1.110 is my 6.5 vSphere Replication appliance, already registered to the vCenter server. Select Option 3 to proceed with the upgrade.

NOTE: If you hit the Bad Exit Code: 1 error during the upgrade, refer to the previous article.

3. Provide the root password of the old replication server to proceed.


4. The upgrade process begins to install the necessary RPMs. This might take about 10 minutes to complete.


5. You will then be prompted to enter the SSO user name and password for the vCenter this vR is registered to.


6. After a few configuration steps progress in the window, the upgrade is done and you will be presented with the 8.1 banner page.


That should be it. Hope this helps!

Tuesday, 1 May 2018

VDP Backup And Maintenance Tasks Fail With "DDR result code: 5004, desc: nothing matched"

I had come across a scenario where the backups on a VDP appliance connected to a Data Domain failed constantly, and the backup logs for the avtar service had the following snippet:

2018-04-24T20:01:16.180+07:00 avtar Info <19156>: - Establishing a connection to the Data Domain system with encryption (Connection mode: A:3 E:2).
2018-04-24T20:01:26.473+07:00 avtar Error <10542>: Data Domain server "data-domain.vcloud.local" open failed DDR result code: 5004, desc: nothing matched
2018-04-24T20:01:26.473+07:00 avtar Error <10509>: Problem logging into the DDR server:'', only GSAN communication was enabled.
2018-04-24T20:01:26.474+07:00 avtar FATAL <17964>: Backup is incomplete because file "/ddr_files.xml" is missing
2018-04-24T20:01:26.474+07:00 avtar Info <10642>: DDR errors caused the backup to not be posted, errors=0, fatals=0
2018-04-24T20:01:26.474+07:00 avtar Info <12530>: Backup was not committed to the DDR.
2018-04-24T20:01:26.475+07:00 avtar FATAL <8941>: Fatal server connection problem, aborting initialization. Verify correct server address and login credentials.
2018-04-24T20:01:26.475+07:00 avtar Info <6149>: Error summary: 4 errors: 10542, 8941, 10509, 17964
2018-04-24T20:01:26.476+07:00 avtar Info <8468>: Sending wrapup message to parent
2018-04-24T20:01:26.476+07:00 avtar Info <5314>: Command failed (4 errors, exit code 10008: cannot establish connection with server (possible network or DNS failure))

Whenever we have issues with a VDP system attached to a Data Domain, we usually look into the ddrmaint logs located under /usr/local/avamar/var/ddrmaintlogs.

Under this, I saw the following:

Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: Data Domain Engine (3.1.0.1 build 481386)
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::enumerate_ddrconfig ddr_info version is 5.
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::enumerate_ddrconfig found 1 ddrconfig records in ddr_info
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: dpnid 1494515613 from flag
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: LSU: avamar-1494515613
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: server=data-domain.vcloud.local(1),id=4D752F9C9A984D026520ACA64AA465388352BAB1,user=vmware_vdp
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: Establishing a connection to the Data Domain system with basic authentication (Connection mode: A:0 E:0).
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: cplist::open_ddr: DDR_Open failed: data-domain.vcloud.local(1) lsu: avamar-1494515613, DDR result code: 5004, desc: nothing matched
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: cplist::body - ddr_info file from the persistent store has no ddr_config entries
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: <4750>Datadomain get checkpoint list operation failed.
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Info: ============================= cplist finished in 11 seconds
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Info: ============================= cplist cmd finished =============================

Out of curiosity, I tried to list the Data Domain checkpoints, and that fails too:

root@vdp:~/#: ddrmaint cplist
<4750>Datadomain get checkpoint list operation failed.

The DD OS version seemed to be supported for the VDP node:

root@vdp:/usr/local/avamar/var/ddrmaintlogs/#: ddrmaint read-ddr-info --format=full
====================== Read-DDR-Info ======================

 System name        : data-domain.vcloud.local
 System ID          : 4D752F9C9A984D026520ACB64AA465388352BAB1
 DDBoost user       : vmware_vdp
 System index       : 1
 Replication        : True
 CP Backup          : True
 Model number       : DD670
 Serialno           : 3FA0807274
 DDOS version       : 6.0.1.10-561375
 System attached    : 1970-01-01 00:00:00 (0)
 System max streams : 16

You can SSH into the Data Domain and view the ddfs logs using the below command:
# log view debug/ddfs.info

I noticed the following in this log:

04/25 12:08:57.141 (tid 0x7fbf4fa68490): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:57.141 (tid 0x7fbf4fa68490): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1
04/25 12:08:58.147 (tid 0x6314fb0): nfs3_fm_lookup_by_path("/data/col1/backup/ost") failed: 5183
04/25 12:08:58.147 (tid 0x6314fb0): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:58.147 (tid 0x6314fb0): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1
04/25 12:08:59.153 (tid 0x7fc24501bd40): nfs3_fm_lookup_by_path("/data/col1/backup/ost") failed: 5183
04/25 12:08:59.153 (tid 0x7fc24501bd40): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:59.153 (tid 0x7fc24501bd40): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1

The VDP server relies on the ost folder to perform its maintenance tasks, and this folder was missing from the Data Domain, causing maintenance and backup tasks to fail.

To fix this, we need to recreate the ost folder and export the NFS mount:

1. Login to the Bash shell of Data Domain. You can view this article for the steps.

2. Navigate to the below directory:
# cd /data/col1/backup

3. Verify that there is no ost folder when you list the files and folders.

4. Create the directory and set the ownership and permissions with the below commands:
# mkdir ost
# chmod 777 ost
# chown <your_ddboost_user> ost

5. Exit the bash shell with exit

6. Create the NFS mount with:
# nfs add /backup/ost *

You will see the below message:
NFS export for "/backup/ost" added.

Exit the Data Domain, and back in a VDP SSH session run ddrmaint cplist; this should now return the checkpoint list successfully. Maintenance and backup tasks should then complete successfully.
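Steps 2 through 4 above can be collapsed into one idempotent snippet, run from the Data Domain bash shell. The paths are from this post, and the DDBoost user vmware_vdp is this environment's name, so substitute your own; the NFS export from step 6 still has to be done from the Data Domain CLI afterwards.

```shell
# Recreate the ost directory if it is missing; BACKUP_DIR and the
# DDBoost user name are environment-specific assumptions.
BACKUP_DIR=${BACKUP_DIR:-/data/col1/backup}
if [ -d "$BACKUP_DIR" ] && [ ! -d "$BACKUP_DIR/ost" ]; then
    mkdir "$BACKUP_DIR/ost"
    chmod 777 "$BACKUP_DIR/ost"
    chown vmware_vdp "$BACKUP_DIR/ost" || true   # user may differ per site
fi
```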

Hope this helps!