Tuesday, 24 July 2018

Understanding Perfbeat Logging In GSAN

If you have ever looked at the GSAN logs in VDP, located under /data01/cur, you will sometimes notice the below logging:

2018/05/28-10:34:31.01397 {0.0} [perfbeat.3:196]  WARN: <1060> perfbeat::outoftolerance mask=[gc,flush,gcmark] average=1397.27 limit=139.7273 mbpersec=0.79
2018/05/28-10:35:38.67619 {0.0} [perfbeat.2:194]  WARN: <1060> perfbeat::outoftolerance mask=[gc,flush,gcmark] average=53.72 limit=5.3722 mbpersec=0.88
The perfbeat outoftolerance warning is logged against various processes. In the above example, the tasks running are garbage collection, flush and gcmark (the mask field). This can also be hfscheck, backup, restore and so on. You will see this logging whenever that particular task is performing slowly, causing the respective maintenance or backup jobs to take a long time to complete. If a backup, restore or filesystem check is taking a suspiciously long time to complete, this is the best place to look.
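To quickly find these events across the GSAN logs, a simple search works (the exact log file names under /data01/cur can vary between versions, so the wildcard here is just an assumption):

# grep -h "perfbeat::outoftolerance" /data01/cur/*.log | tail -20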

At a high level, GSAN compares the current throughput against an average measured over a previous period.

A simple explanation of the second line above is this: the average throughput for the tasks within [] was 53.72 MB/s, measured over a period of time. The warning limit is 10 percent of that average (10 percent of 53.72 is 5.372), and the current throughput of 0.88 MB/s has fallen below that limit, which triggers the out-of-tolerance warning.

This indicates that the underlying storage is under stress or has a performance problem. Since VDP runs as a virtual machine, the troubleshooting flow would be:

> Check the load on the VDP appliance itself. See if there is unusual load on the system and, if yes, determine whether a process is hogging the resources.
> If the VM level checks out, see if there are any issues with the DAVG or the VMFS file system. Perhaps there are multiple high I/O VMs running on this storage and resource contention is occurring? I would start with the vobd.log and vmkernel.log for that particular datastore naa.ID and then verify the device average (DAVG) latency for that device.
> If this checks out too, the last part would be the storage array itself. Moving VDP to another datastore is not an ideal test since these appliances are fairly large in size.

Hope this helps!

Sunday, 8 July 2018

Unable To Pair SRM Sites: "Server certificate chain not verified"

So first things first: as of this post, you might know that I have moved out of VMware and ventured further into the backup and recovery solutions domain. Currently, I work as a solutions engineer at Rubrik.

There are a lot of instances where you are unable to manage anything in Site Recovery Manager, regardless of the version (this also applies to vSphere Replication), and the common error that pops up at the bottom right of the web client is Server certificate chain not verified:

Failed to connect to vCenter Server at vCenter_FQDN:443/sdk. Reason:
com.vmware.vim.vmomi.core.exception CertificateValidationException: Server certificate chain not verified.

This article covers only the embedded Platform Services Controller deployment model. Similar logic can be extrapolated to external deployments.

These issues are typically seen when:
> PSC is migrated from embedded to external
> Certificates are replaced for the vCenter

I will be simplifying this KB article here for reference. Before proceeding, take a powered-off snapshot of the PSC and vCenter servers involved.

So, for embedded deployment of VCSA:

1. SSH into the VCSA and run the below command:
# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --no-check-cert --ep-type com.vmware.cis.cs.identity.sso 2>/dev/null

This command will give you the SSL trust that is currently stored in the lookup service on your PSC. Now, consider you are using an embedded PSC deployment in production and another embedded deployment in DR (no Enhanced Linked Mode). In this case, when you run the above command, you are expected to see just one output, where the URL section is the FQDN of your current PSC node and associated with it is its current SSL trust.

URL: https://current-psc.vmware.local/sts/STSService/vsphere.local
SSL trust: MIIDWDCCAkCgAwIBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...Reducing output...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10ggClaP8=

If this is your case, proceed to step (2); if not, jump to step (3).

2. Run the next command:
# echo | openssl s_client -connect localhost:443

This is the certificate that is actually in use by your deployment post the certificate replacement. Here, look at the extract which shows the server certificate chain.

Server certificate
-----BEGIN CERTIFICATE-----
MIIDWDCCyAHikleBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10fGhhDDqm=
-----END CERTIFICATE-----

So from here, the chain obtained from the first command (the SSL trust stored in the PSC lookup service) does not match the chain from the second command (the certificate actually in use). Due to this mismatch, you see the chain not verified message in the UI.
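To compare the two chains without eyeballing the Base64, you can print the SHA1 thumbprint of the certificate currently being served on port 443 (a quick sketch, assuming openssl is available on the appliance):

# echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -fingerprint -sha1

This is the new thumbprint; the stale one comes from the trust string obtained in step 1 (see step E below for one way to compute it).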

To fix this, the logic is: find all the services registered with the thumbprint of the old SSL trust (step 1) and update them with the new certificate from step 2. The steps in the KB article are a bit confusing, so this is what I follow to fix it.

A) Copy the SSL trust you obtained from the first command into Notepad++; everything starting from MIID... in my case (no need to include the "SSL trust:" label itself).

B) Each line of the chain should contain 64 characters. In Notepad++, place the cursor after a character and check what the Col indicator reads at the bottom; hit Enter at the point where it reads Col: 65 (that is, after the 64th character). Format this for the complete chain (the last line may be shorter, which is okay).

C) Add -----BEGIN CERTIFICATE----- before the chain and -----END CERTIFICATE----- after it (five hyphens on each side).

D) Save the Notepad++ document with a .cer extension.

E) Open the certificate file that you saved and navigate to Details > Thumbprint. You will notice a hexadecimal string with a space after every 2 characters. Copy this into Notepad++ and replace each space with a colon, so you end up with a thumbprint similar to: 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88
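As an alternative to doing steps B through E by hand in Notepad++, the same result can be produced on the appliance itself. A minimal sketch, assuming the Base64 trust string from step 1 was saved (without the "SSL trust:" label) to /tmp/old_trust.b64; the file names are only examples:

# ( echo "-----BEGIN CERTIFICATE-----"; fold -w 64 /tmp/old_trust.b64; echo "-----END CERTIFICATE-----" ) > /tmp/old_trust.cer
# openssl x509 -in /tmp/old_trust.cer -noout -fingerprint -sha1

The fingerprint printed by the second command is the old thumbprint that will be passed to ls_update_certs.py in step G.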

F) Next, we will export the current certificate using the below command
# /usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output /certificates/new_machine.crt

This will export the current certificate to the /certificates directory. 

G) Run the update thumbprint option using the below command:
# cd /usr/lib/vmidentity/tools/scripts/
# python ls_update_certs.py --url https://FQDN_of_Platform_Services_Controller/lookupservice/sdk --fingerprint Old_Certificate_Fingerprint --certfile New_Certificate_Path_from_/Certificates --user Administrator@vsphere.local --password 'Password'

So a sample command would be:
# python ls_update_certs.py --url https://vcsa.vmware.local/lookupservice/sdk --fingerprint 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 --certfile /certificates/new_machine.crt --user Administrator@vsphere.local --password 'Password'

This will look for all the services registered with the old thumbprint (the one ending in 04:83:88) and update them with the current certificate from new_machine.crt.

Note: Once you paste the thumbprint into the SSH session, remove any extra spaces at the beginning and end of the thumbprint. I have seen the update fail because the paste picks up a hidden special character in a few cases (characters you would not see in the terminal). So remove the space after the --fingerprint value and re-add it, and do the same before the --certfile switch.

Re-run the commands in steps 1 and 2 and the SSL trust should now match. If it does, log back in to the web client and you should be good to go.

--------------------------------------------------------------------------------------------------------------------------
3. In this case, you might see two PSC entries in the output of the step 1 command. Both entries have the same PSC URL, and they might or might not have the same SSL trust.

Case 1:
Multiple URL with different SSL trust.

This is the easy one. In the output from step 1, you will see two entries with the same PSC URL but different SSL trusts, and one of the SSL trusts will match the current certificate from step 2. This means the one that does not match is the stale one and can be removed from the STS store.

You can remove it from the CLI; however, I stick to using the Jxplorer tool to remove it from the GUI. You can connect to the PSC from Jxplorer using this KB article here.

Once connected, navigate to Configuration > Sites > LookupService > Service Registrations. 
One of the fields in the output of the command from step 1 is the Service ID, which looks similar to:
Service ID: 04608398-1493-4482-881b-b35961bf5141

Locate the matching service ID in the service registrations and you should be good to remove it.

Case 2:
Multiple URL with same SSL trust. 

In this case, the output from step 1 shows two entries with the same PSC URL and the same SSL trust, and these might or might not match the output from step 2.

The first step of this fix is:

Note down both of the service IDs from the output of step 1 and connect to Jxplorer as mentioned above. Select a service ID and, on the right side, click the Table Editor view and click Submit; you can then view the last modified date of that service registration. The service ID with the older last modified date is the stale registration and can be removed via Jxplorer. Now, when you run the command from step 1, it should return one entry. If this matches the thumbprint from step 2, great! If not, an additional step of updating the thumbprint (steps F and G above) needs to be performed.

In the event of an external PSC deployment, say one PSC in the production site and one in the recovery site in ELM, the command from step 1 is expected to return two entries with two different URLs (the production and DR PSCs) since they are replicating. This of course changes if there are multiple PSCs replicating with or without a load balancer. The process is too complex to explain in text, so in that event it is best to involve VMware Support for assistance.

Hope this helps!

Wednesday, 30 May 2018

Unable To Configure Or Reconfigure Protection Groups In SRM: java.net.SocketTimeoutException: Read timed out

When you try to reconfigure or create a new protection group, you might run into the following message when you click the Finish option:

 java.net.SocketTimeoutException: Read timed out

Below is a screenshot of this error:


In the web client logs for the respective vCenter you will see the below logging: 

[2018-05-30T13:49:14.548Z] [INFO ] health-status-65 com.vmware.vise.vim.cm.healthstatus.AppServerHealthService Memory usage: used=406,846,376; max=1,139,277,824; percentage=35.7109010137285%. Status: GREEN
[2018-05-30T13:49:14.549Z] [INFO ] health-status-65   c.v.v.v.cm.HealthStatusRequestHandler$HealthStatusCollectorTask Determined health status 'GREEN' in 0 ms
[2018-05-30T13:49:20.604Z] [ERROR] http-bio-9090-exec-16 70002318 100003 200007 c.v.s.c.g.wizard.addEditGroup.ProtectionGroupMutationProvider Failed to reconfigure PG [DrReplicationVmProtectionGroup:vm-protection-group-11552:67
2e1d34-cbad-46a4-ac83-a7c100547484]:  com.vmware.srm.client.topology.client.view.PairSetup$RemoteLoginFailed: java.net.SocketTimeoutException: Read timed out
        at com.vmware.srm.client.topology.impl.view.PairSetupImpl.remoteLogin(PairSetupImpl.java:126)
        at com.vmware.srm.client.infraservice.util.TopologyHelper.loginRemoteSite(TopologyHelper.java:398)
        at com.vmware.srm.client.groupservice.wizard.addEditGroup.ProtectionGroupMutationProvider.apply(ProtectionGroupMutationProvider.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.vmware.vise.data.provider.DelegatingServiceBase.invokeProviderInternal(DelegatingServiceBase.java:400)


Caused by: com.vmware.vim.vmomi.client.exception.ConnectionException: java.net.SocketTimeoutException: Read timed out
        at com.vmware.vim.vmomi.client.common.impl.ResponseImpl.setError(ResponseImpl.java:252)
        at com.vmware.vim.vmomi.client.http.impl.HttpExchange.run(HttpExchange.java:51)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingBase.executeRunnable(HttpProtocolBindingBase.java:226)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingImpl.send(HttpProtocolBindingImpl.java:110)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.sendCall(MethodInvocationHandlerImpl.java:613)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.executeCall(MethodInvocationHandlerImpl.java:594)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl.completeCall(MethodInvocationHandlerImpl.java:345)

There might be an issue with ports between the vCenter and SRM servers; you can validate those ports using this KB here.

If the ports are fine, then validate that no guest level security agents on SRM or vCenter (Windows) are blocking this traffic. 
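As a quick connectivity check from the Windows SRM server, something like the below can confirm the vCenter port is reachable (a sketch; port 443 is taken from the error above, and your environment may need the additional ports listed in the KB):

PS C:\> Test-NetConnection -ComputerName vCenter_FQDN -Port 443

If pairing calls towards the remote site are the ones timing out, repeat the test against the remote site's vCenter and SRM servers as well.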

In my case the network connection and firewall/security settings were fine, and the fix was to perform a Modify of the SRM installation on both sites. Once this was done, the site pairing was reconfigured, and after that we were able to reconfigure the protection groups successfully.

Tuesday, 22 May 2018

VDP 6.1.8 Upgrade Stalls At 97 Percent

When upgrading a vSphere Data Protection appliance to 6.1.8, there might be a case where the upgrade stalls at 97 percent with the below task stuck in progress:


The avinstaller.log.0 file located under /usr/local/avamar/var/avi/server_logs has the following message:

May 21, 2018 12:07:38 PM org.hibernate.transaction.JDBCTransaction commit
SEVERE: JDBC commit failed
java.sql.SQLException: database is locked
        at org.sqlite.DB.throwex(DB.java:855)
        at org.sqlite.DB.exec(DB.java:138)
        at org.sqlite.SQLiteConnection.commit(SQLiteConnection.java:512)
        at org.hibernate.transaction.JDBCTransaction.commitAndResetAutoCommit(JDBCTransaction.java:170)
        at org.hibernate.transaction.JDBCTransaction.commit(JDBCTransaction.java:146)
        at org.jbpm.pvm.internal.tx.HibernateSessionResource.commit(HibernateSessionResource.java:64)
        at org.jbpm.pvm.internal.tx.StandardTransaction.commit(StandardTransaction.java:139)
        at org.jbpm.pvm.internal.tx.StandardTransaction.complete(StandardTransaction.java:64)
        at org.jbpm.pvm.internal.tx.StandardTransactionInterceptor.execute(StandardTransactionInterceptor.java:57)

This is because the avidb is locked. To resume the upgrade, restart the avinstaller service by issuing the below command:
# avinstaller.pl --restart

Post the restart, the upgrade should resume successfully. Since the stall is at the 126th package, the upgrade completes fairly quickly after the service restart and you might not notice any further progress. In that case, you can go back to avinstaller.log.0 and verify that the upgrade has completed.
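To watch for the completion message, you can simply follow the log at the path mentioned above:

# tail -f /usr/local/avamar/var/avi/server_logs/avinstaller.log.0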

Hope this helps.

Tuesday, 15 May 2018

Bad Exit Code: 1 During Upgrade Of vSphere Replication To 8.1

With the release of vSphere Replication 8.1 comes a ton of new upgrade and deployment issues. One common issue is the Bad Exit Code: 1 error during the upgrade phase. This applies to both 6.1.2 and 6.5.x to 8.1 upgrades.

The first thing you will notice in the GUI is the following error message.


If you Retry, the upgrade will still fail; if you Ignore, the upgrade will proceed but you will then notice the failure again during the configuration section.


Only after a "successful" failed upgrade can we access the logs to see what the issue is.

There is a log called hms-boot.log which records all of this information; it can be found under /opt/vmware/hms/logs.

Here, the first error was this:

----------------------------------------------------
# Upgrade Services
Stopping hms service ... OK
Stopping vcta service ... OK
Stopping hbr service ... OK
Downloading file [/opt/vmware/hms/conf/hms-configuration.xml] to [/opt/vmware/upgrade/oldvr] ...Failure during upgrade procedure at Upgrade Services phase: java.io.IOException: inputstream is closed

com.jcraft.jsch.JSchException: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:315)
        at com.jcraft.jsch.Channel.connect(Channel.java:152)
        at com.jcraft.jsch.Channel.connect(Channel.java:145)
        at com.vmware.hms.apps.util.upgrade.SshUtil.getSftpChannel(SshUtil.java:66)
        at com.vmware.hms.apps.util.upgrade.SshUtil.downloadFile(SshUtil.java:88)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.downloadConfigFiles(Vr81MigrationUpgradeWorkflow.java:578)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$3(Vr81MigrationUpgradeWorkflow.java:1222)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)
Caused by: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2911)
        at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2935)
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:262)
        ... 24 more

Then when I proceeded with an ignore, the error was this:

# Reconfigure VR
Failure during upgrade procedure at Reconfigure VR phase: null

java.lang.NullPointerException
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.vrReconfig(Vr81MigrationUpgradeWorkflow.java:1031)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$5(Vr81MigrationUpgradeWorkflow.java:1253)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)

When we still proceeded with ignore, the last stack was this:

Initialization error: Bad exit code: 1
Traceback (most recent call last):
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 178, in main
    __ROUTINES__[name]()
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 714, in get_default_sitename
    ovf.hms_cache_sitename()
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 686, in hms_cache_sitename
    cache_f.write(hms_get_sitename(ext_key, jks, passwd, alias))
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 679, in hms_get_sitename
    ext_key, jks, passwd, alias
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 412, in get_sitename
    output = commands.execute(cmd, None, __HMS_HOME__)[0]
  File "/opt/vmware/share/htdocs/service/hms/cgi/commands.py", line 324, in execute
    raise Exception('Bad exit code: %d' % proc.returncode)
Exception: Bad exit code: 1

So it looks like there is an issue with copying files from the old vR server to the new one over SFTP. In the sshd_config file under /etc/ssh/ on the old vR server, the following entry was present:

Subsystem sftp /usr/lib64/ssh/sftp-server

Edit this line so that it reads:
Subsystem sftp /usr/lib/ssh/sftp-server
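It is worth confirming which of the two paths actually exists on the old appliance before making the edit, and the change typically needs an sshd restart to take effect (the exact service command may vary with the SLES build the old appliance runs):

# ls -l /usr/lib/ssh/sftp-server /usr/lib64/ssh/sftp-server
# service sshd restart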

Then retry the upgrade by deploying a fresh 8.1 appliance and going through the "upgrade" process again; this time it should complete successfully.

Hope this helps!

Upgrading vSphere Replication From 6.5 To 8.1

With the release of vSphere Replication 8.1, the upgrade path is not how it was earlier. The 8.1 vR server now runs on Photon OS and the upgrade is similar to a vCenter migration: you deploy a new 8.1 vR server via the OVF template with a temporary IP and then follow a series of upgrade/migrate steps to transfer data from the old vR server to the new one.

1. Proceed with the regular deployment of the vSphere Replication appliance: download the 8.1 ISO, mount it on a Windows server and choose the support.vmdk, system.vmdk, certificate, manifest and OVF files for deployment. A temporary IP is needed for the appliance to be on the network.

2. Once the deployment is done, power on the 8.1 appliance and open a VM console. During the boot you will be presented with the below options.


192.168.1.110 is my 6.5 vSphere Replication appliance, which was already registered to the vCenter server. Select Option 3 to proceed with the upgrade.

NOTE: For the Bad Exit Code: 1 error during upgrade, refer to this article here.

3. Provide the root password of the old replication server to proceed.


4. The upgrade process begins to install the necessary RPMs. This might take about 10 minutes to complete.


5. You will then be prompted to enter the SSO user name and password of the vCenter this vR server is registered to.


6. After a few configuration steps progress in the window, the upgrade is done and you will be presented with the 8.1 banner page.


That should be it. Hope this helps!

Tuesday, 1 May 2018

VDP Backup And Maintenance Tasks Fail With "DDR result code: 5004, desc: nothing matched"

I came across a scenario where backups on a VDP appliance connected to a Data Domain failed constantly, and the backup logs for the avtar service had the following snippet:

2018-04-24T20:01:16.180+07:00 avtar Info <19156>: - Establishing a connection to the Data Domain system with encryption (Connection mode: A:3 E:2).
2018-04-24T20:01:26.473+07:00 avtar Error <10542>: Data Domain server "data-domain.vcloud.local" open failed DDR result code: 5004, desc: nothing matched
2018-04-24T20:01:26.473+07:00 avtar Error <10509>: Problem logging into the DDR server:'', only GSAN communication was enabled.
2018-04-24T20:01:26.474+07:00 avtar FATAL <17964>: Backup is incomplete because file "/ddr_files.xml" is missing
2018-04-24T20:01:26.474+07:00 avtar Info <10642>: DDR errors caused the backup to not be posted, errors=0, fatals=0
2018-04-24T20:01:26.474+07:00 avtar Info <12530>: Backup was not committed to the DDR.
2018-04-24T20:01:26.475+07:00 avtar FATAL <8941>: Fatal server connection problem, aborting initialization. Verify correct server address and login credentials.
2018-04-24T20:01:26.475+07:00 avtar Info <6149>: Error summary: 4 errors: 10542, 8941, 10509, 17964
2018-04-24T20:01:26.476+07:00 avtar Info <8468>: Sending wrapup message to parent
2018-04-24T20:01:26.476+07:00 avtar Info <5314>: Command failed (4 errors, exit code 10008: cannot establish connection with server (possible network or DNS failure))

Whenever we have issues with a VDP system backed by a Data Domain, we usually look at the ddrmaint logs located under /usr/local/avamar/var/ddrmaintlogs.

Under this, I saw the following:

Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: Data Domain Engine (3.1.0.1 build 481386)
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::enumerate_ddrconfig ddr_info version is 5.
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::enumerate_ddrconfig found 1 ddrconfig records in ddr_info
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: dpnid 1494515613 from flag
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: LSU: avamar-1494515613
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: cplist::open_all_ddrs: server=data-domain.vcloud.local(1),id=4D752F9C9A984D026520ACA64AA465388352BAB1,user=vmware_vdp
Apr 25 09:03:49 slc2pdvdp01 ddrmaint.bin[84192]: Info: Establishing a connection to the Data Domain system with basic authentication (Connection mode: A:0 E:0).
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: cplist::open_ddr: DDR_Open failed: data-domain.vcloud.local(1) lsu: avamar-1494515613, DDR result code: 5004, desc: nothing matched
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: cplist::body - ddr_info file from the persistent store has no ddr_config entries
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Error: <4750>Datadomain get checkpoint list operation failed.
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Info: ============================= cplist finished in 11 seconds
Apr 25 09:03:59 slc2pdvdp01 ddrmaint.bin[84192]: Info: ============================= cplist cmd finished =============================

Out of curiosity, when I tried to list the Data Domain checkpoints, that failed too.

root@vdp:~/#: ddrmaint cplist
<4750>Datadomain get checkpoint list operation failed.

The DD OS version seemed to be supported for the VDP node:

root@vdp:/usr/local/avamar/var/ddrmaintlogs/#: ddrmaint read-ddr-info --format=full
====================== Read-DDR-Info ======================

 System name        : data-domain.vcloud.local
 System ID          : 4D752F9C9A984D026520ACB64AA465388352BAB1
 DDBoost user       : vmware_vdp
 System index       : 1
 Replication        : True
 CP Backup          : True
 Model number       : DD670
 Serialno           : 3FA0807274
 DDOS version       : 6.0.1.10-561375
 System attached    : 1970-01-01 00:00:00 (0)
 System max streams : 16

You can SSH into the Data Domain and view the ddfs logs using the below command:
# log view debug/ddfs.info

I noticed the following in this log:

04/25 12:08:57.141 (tid 0x7fbf4fa68490): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:57.141 (tid 0x7fbf4fa68490): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1
04/25 12:08:58.147 (tid 0x6314fb0): nfs3_fm_lookup_by_path("/data/col1/backup/ost") failed: 5183
04/25 12:08:58.147 (tid 0x6314fb0): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:58.147 (tid 0x6314fb0): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1
04/25 12:08:59.153 (tid 0x7fc24501bd40): nfs3_fm_lookup_by_path("/data/col1/backup/ost") failed: 5183
04/25 12:08:59.153 (tid 0x7fc24501bd40): exp_find_export: mount failed from (10.67.167.96/slc2pdvdp01.take.out): export point /backup/ost does not exist
04/25 12:08:59.153 (tid 0x7fc24501bd40): nfsproc3_ost_mnt_3_svc: connection failed mount error 2 from plugin 10.67.167.96 version 3.1

The VDP server relies on the ost folder to perform its maintenance tasks, and this folder was missing from the Data Domain, causing the maintenance and backup tasks to fail.

To fix this, we need to recreate the ost folder and export it over NFS:

1. Log in to the Bash shell of the Data Domain. You can view this article for the steps.

2. Navigate to the below directory:
# cd /data/col1/backup

3. Verify that there is no ost folder when you list the directory contents.

4. Create the directory and set the ownership and permissions with the below commands:
# mkdir ost
# chmod 777 ost
# chown <your_ddboost_user> ost

5. Exit the bash shell with exit

6. Create the NFS mount with:
# nfs add /backup/ost *

You will see the below message:
NFS export for "/backup/ost" added.
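Optionally, you can confirm from the VDP appliance that the new export is visible before retrying anything (assuming the showmount utility is present on the appliance; the hostname is the Data Domain from the logs above):

# showmount -e data-domain.vcloud.local

The export list should now include /backup/ost.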

Exit the Data Domain and, back in the VDP SSH session, run ddrmaint cplist; this should now return the checkpoint list successfully. Maintenance and backup tasks should then proceed successfully.

Hope this helps!

Thursday, 26 April 2018

SRM Service Fails To Start: "Could not initialize Vdb connection Data source name not found and no default driver specified"

In a few cases, you might come across a scenario where the Site Recovery Manager service does not start, and in the Event Viewer you will notice the following back trace for the vmware-dr service.

VMware vCenter Site Recovery Manager application error.
class Vmacore::Exception "DBManager error: Could not initialize Vdb connection: ODBC error: (IM002) - [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] vmware-dr.exe[0x00107621]
backtrace[05] MSVCR120.dll[0x00066920]
backtrace[06] MSVCR120.dll[0x0005E36D]
backtrace[07] ntdll.dll[0x00092A63]
backtrace[08] vmware-dr.exe[0x00014893]
backtrace[09] vmware-dr.exe[0x00015226]
backtrace[10] windowsService.dll[0x00002BF5]
backtrace[11] windowsService.dll[0x00001F24]
backtrace[12] sechost.dll[0x00005ADA]
backtrace[13] KERNEL32.DLL[0x000013D2]
backtrace[14] ntdll.dll[0x000154E4]
[backtrace end]  

There are no logs generated in vmware-dr.log and the ODBC connection test completes successfully too. 

However, when you open the vmware-dr.xml file located under C:\Program Files\VMware\VMware vCenter Site Recovery Manager\config and search for the <DBManager> tag, you will notice that the <dsn> name is incorrect.
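For illustration, the section in question looks roughly like the hypothetical excerpt below (your file will contain more elements and your DSN name will differ); the value inside <dsn> must match the System DSN name defined in the Windows ODBC Data Source Administrator:

<DBManager>
   <dsn>SRM_DB_DSN</dsn>
</DBManager>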

Upon providing the right DSN name within <dsn></dsn>, you will then notice a new back trace when you attempt to start the service again:

VMware vCenter Site Recovery Manager application error.
class Vmacore::InvalidArgumentException "Invalid argument"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] listener.dll[0x0000BCBC]

What I suspect is that something more has gone wrong with the vmware-dr.xml file, and the fix for this is to re-install the SRM application pointing to the existing database.

Post this, the service starts successfully. Hope this helps. 

Friday, 20 April 2018

Upgrading vCenter Appliance From 6.5 to 6.7

So as you know, vSphere 6.7 is now GA, and this article will cover upgrading an embedded PSC deployment of the 6.5 vCenter appliance to 6.7. Once you download the 6.7 VCSA ISO installer, mount the ISO on a local Windows machine and use the UI installer for Windows to begin the upgrade.

You will be presented with the below choices:


We will be going with the Upgrade option. The upgrade follows the familiar path: the process deploys a new 6.7 VCSA, performs a data and configuration migration from the older 6.5 appliance, and then powers down the old appliance when the upgrade is successful.


Accept the EULA to proceed further.


In the next step we connect to the source appliance, so provide the IP/FQDN of the source 6.5 vCenter server.


Once the Connect To Source step goes through, you will be asked to enter the SSO details and the details of the ESXi host where the 6.5 VCSA is running.


The next step is to provide information about the target 6.7 appliance. Select the ESXi host where the target appliance should be deployed.


Then provide the inventory display name for the target vCenter 6.7 along with a root password.


Select the appliance deployment size for the target server. Make sure this matches or is greater than the source 6.5 appliance.


Then select the datastore where the target appliance should reside.


Next, provide a set of temporary network details for the 6.7 appliance. The appliance will inherit the old 6.5 network configuration after a successful migration.


Review the details and click Finish to begin the Stage 1 deployment process.


Once Stage 1 is done, click Continue to proceed with Stage 2.



In Stage 2, we perform a data copy from the source vCenter appliance to the target deployed in Stage 1.


Provide the details to connect to the source vCenter server.


Select the type of data to be copied over to the destination vCenter server. In my case, I just want to migrate the configuration data.


Join the CEIP and proceed further


Review the details and Finish to begin the data copy.


The source vCenter will be shut down after the data copy.


The data migration takes a while to complete and runs in three stages.


And that's it. If all goes well, the migration is complete and you can access your new vCenter from the URL.

Hope this helps.