Thursday, 25 January 2018

SRM Service Crashes During A Recovery Operation With timedFunc BackTrace

In a few scenarios, the SRM service crashes when you run a test recovery or a planned migration. This might happen with one specific recovery plan or with any recovery plan.

If you look into the vmware-dr.log you will notice the following back-trace:

--> Panic: VERIFY d:\build\ob\bora-3884620\srm\public\functional/async/timedFunc.h:210
--> Backtrace:
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
xxxxxxxxxxxxxxxxxxxxx Cut Logs Here xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--> backtrace[36] ntdll.dll[0x000154E4]
--> [backtrace end]

The timedFunc back-trace is seen when "Wait For VMware Tools" is set to 0 minutes and 0 seconds.

A few lines above this back-trace, you will see the faulty VM that caused the crash.

You will see something similar to:

2018-01-21T08:37:05.421-05:00 [44764 info 'VmDomain' ctxID=57d5ae61 opID=21076ff:c402:4147:d883] Waiting for VM '[vim.VirtualMachine:b2ab3f04-c72e-43ca-b93d-de1566e4de14:vm-323]' to reach desired powered state 'poweredOff' within '0' seconds.

The VM MoRef ID is given at the end of this line (vm-323 here). To map this ID to a VM name, you will need to go to the vCenter MOB page.
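To pull just this MoRef out of the log without reading the whole line, a quick grep works; the line below is a trimmed stand-in for the actual vmware-dr.log entry:

```shell
# Stand-in copy of the vmware-dr.log line shown above
line="Waiting for VM '[vim.VirtualMachine:b2ab3f04-c72e-43ca-b93d-de1566e4de14:vm-323]' to reach desired powered state 'poweredOff' within '0' seconds."

# The MoRef is the last colon-separated token inside the brackets
moref=$(echo "$line" | grep -oE 'vm-[0-9]+')
echo "$moref"
```

Against the real log you would feed `grep -oE 'vm-[0-9]+'` the "Waiting for VM" lines just above the panic instead of a hard-coded string.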

To correlate this ID to a VM name:
1. Log in to the MOB page for vCenter (https://vcenter-ip/mob)
2. Content > group-d1 (Datacenters)
3. Select the respective datacenter under "Child Entity"
4. Then, under vmFolder, select group-v4 (vm)
5. Expand childEntity to list all the VMs in that vCenter along with their names.

In my case, the childEntity list showed that this ID belonged to a VM named CentOS7.2.

> Then navigate to the Recovery plan in SRM
> Select the affected Recovery plan this VM is part of > Related Objects > Virtual Machines
> Right click this VM and select Configure Recovery

Here, the Wait For VMware Tools timeout was set to 0 minutes and 0 seconds. We had to change this to a valid non-zero value.

After this, the recovery plan completed without crashing the SRM service. Newer SRM releases should ideally prevent this by rejecting a 0 timeout.

Hope this helps!

VDP Restore Fails: Virtual Machine Must Be Powered Off To Restore

There are a few cases where you try to perform an in-place restore of a virtual machine and it fails, complaining that the virtual machine must be powered off. The virtual machine is indeed powered off; however, VDP does not recognize this.

There are two possible causes.

1. The MCS cache might not be updated with the power state.
Run the below command:
# mccli vmcache show --name=/vc-domain-name/VirtualMachines/<vm-name> | grep -i "power status"

Example command:
# mccli vmcache show --name=/cartman.southpark.local/VirtualMachines/Test-1 | grep -i "power status"

The output you should ideally see on a powered off VM is:
Power Status         poweredOff

In the above case, you might see:
Power Status         poweredOn

If this is the issue, then update the MCS VM cache by issuing the below command:
# mccli vmcache sync --name=/vc-domain-name/VirtualMachines/<vm-name>

Example command:
# mccli vmcache sync --name=/cartman.southpark.local/VirtualMachines/Test-1

Then the power state should be updated and the in place restore should work. 
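If you want to script this check, the logic can be sketched as below; the status line is a hard-coded stand-in for the real `mccli vmcache show` output, since mccli only exists on the appliance:

```shell
# Stand-in for: mccli vmcache show --name=... | grep -i "power status"
status_line="Power Status         poweredOn"

# If the cache says poweredOn for a VM that is actually off, a sync is needed
if echo "$status_line" | grep -q 'poweredOn'; then
    # would run: mccli vmcache sync --name=/vc-domain-name/VirtualMachines/<vm-name>
    action="sync"
else
    action="none"
fi
echo "$action"
```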

2. The MCS cache might be updated, but Tomcat is out of sync with MCS.

In many cases, MCS and Tomcat fall out of sync. Due to this, the CLI shows one set of results while the GUI says otherwise. To sync them up, simply restart the Tomcat service by issuing:
# --restart

Note that after restarting the Tomcat service, it will take a while for the appliance to reconnect in the web client, as it has to rebuild the cache.
If you tail the vdr-server.log located at /usr/local/avamar/var/vdr/server_logs/vdr-server.log, then once you see the below logging, the connection has completed successfully:

2018-01-25 09:05:57,566 INFO  [Timer_PhoneHomeCollect]-schedule.PhonehomeCollectTask: Writing Phome data to location  /usr/local/avamar/var/vdr/phonehome/vdp_state
2018-01-25 09:05:57,567 INFO  [Timer_PhoneHomeCollect]-schedule.PhonehomeCollectTask: Writing Phome data to location  /usr/local/avamar/var/vdr/phonehome/vdp_state

The restore should then work successfully. 

If it still continues to fail, then try performing a restore to a different location as a workaround. 

Hope this helps!

Tuesday, 23 January 2018

File Level Restore - Select Destination Is Empty

In a few cases, when you try to perform a File Level Restore on a Windows server, the "Select Destination" window is empty.

This can happen for a few reasons. If this is happening on ESXi 6.0, it would imply that the vSphere Guest API is being used instead of the VIX API. We need to modify the config.xml to use the VIX API only.

Make sure the values in the config.xml file are what is expected. Run the below commands:

# egrep "vmmgrflags|mountmgr" /usr/local/avamarclient/bin/config.xml
# egrep "vmmgrflags|mountmgr" /usr/local/avamarclient/bin/MountPoint/config.xml

Both commands should show the vmmgrflags and mountmgr entries configured to use the VIX API only. If you see this, skip to the "Other possible solutions" section. If not, edit these two files and make sure the values reflect the VIX-only configuration.

Then restart the FLR service using:
# service vmwareflr restart

Then re-try the FLR operation. 

Other possible solutions:
In the most recent VDP releases this value should already be set to use the VIX API only, yet we might still run into the same issue. In that case, likely causes are:

> Outdated VMware Tools on that Windows machine. Update VMware Tools, restart the VM, and then re-attempt the FLR task. 
> Also, add the FLR URL to the browser's trusted website list. 

If the issue still persists, open a case with VMware Support.

Hope this helps!

Friday, 19 January 2018

MCS Service Crashes Due To Locked User Account

This issue is only seen in upgraded 6.1.5 instances, where the MCS service constantly crashes due to a locked user account. In some cases the MCS might be running, but none of the mccli commands work; in a few cases the backup scheduler service will not start.

If you try to start MCS from the admin mode using --start --verbose, it starts successfully but crashes immediately.

If you run the below command:
# grep locked /usr/local/avamar/var/mc/server_log/mcserver.log.0

You will notice the account being locked:

WARNING: The user MCUser@/ is locked. Product VDP
WARNING: The user MCUser@/ is locked. Product VDP
WARNING: The user MCUser@/ is locked. Product MCCLI
WARNING: The user MCUser@/ is locked. Product VDP
WARNING: The user MCUser@/ is locked. Product VDP

When you start the backup scheduler, you might see it fail with:

root@Jimbo:/usr/local/avamar-tomcat/lib/#: dpnctl start sched
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Resuming backup scheduler...
dpnctl: ERROR: error return from "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mccli mcs resume-scheduler" - exit status 1
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

In the dpnctl.log you will notice the following:

2018/01/19-06:13:55 output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mccli mcs resume-scheduler":
2018/01/19-06:13:55 - - - - - - - - - - - - - - - BEGIN
2018/01/19-06:13:55 1,22801,User login failure.
2018/01/19-06:13:55 Attribute Value
2018/01/19-06:13:55 --------- --------------------
2018/01/19-06:13:55 reason    Locked user account.
2018/01/19-06:13:55 - - - - - - - - - - - - - - - END

This is caused by a missing Java archive symlink for the MCS service, due to which the account locks out.

If you run the below command you should be able to see the symbolic link. 
# ls -lh /usr/local/avamar-tomcat/lib/mccommons.jar

lrwxrwxrwx 1 root root 35 Jan 19 11:44 /usr/local/avamar-tomcat/lib/mccommons.jar -> /usr/local/avamar/lib/mccommons.jar

If this is missing, then you will run into the above issue. To fix this:

1. Be logged in as root user into VDP
2. Navigate to the below directory:
# cd /usr/local/avamar-tomcat/lib/

3. Then run the below command to recreate the symbolic link:
# /bin/ln -fs /usr/local/avamar/lib/mccommons.jar .

The trailing . is required, as it creates the link in the current working directory.

4. Restart tomcat service using:
# --restart

5. Restart MCS using:
# --restart --verbose
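If you want to rehearse steps 2 and 3 safely first, the same commands can be run against a scratch directory; every path below is a throwaway stand-in for the real /usr/local/avamar locations:

```shell
# Rehearse the symlink fix in a scratch directory (stand-in paths)
tmp=$(mktemp -d)
mkdir -p "$tmp/avamar/lib" "$tmp/avamar-tomcat/lib"
touch "$tmp/avamar/lib/mccommons.jar"            # stands in for the real jar

cd "$tmp/avamar-tomcat/lib"
/bin/ln -fs "$tmp/avamar/lib/mccommons.jar" .    # trailing . = link into cwd
readlink mccommons.jar                            # prints the link target
```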

That should fix it. Hope this helps!

Wednesday, 17 January 2018

MCS Start Up Fails In VDP With: Permission Denied At

In a few rare cases, the management services might fail to start up with a permission-denied message.
If you run dpnctl start mcs, you will notice an almost generic failure:

root@vdp:/#: dpnctl start mcs
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-21990
dpnctl: ERROR: error return from "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/ --start" - exit status 13
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

If you start MCS from admin mode using --start --verbose you will notice:

admin@vdp:~/>: --start --verbose
Permission deniedPermission denied at /usr/lib/perl5/vendor_perl/5.10.0/x86_64-linux-thread-multi/XML/ line 587.
Could not create file parser context for file "/usr/local/avamar/var/mc/server_data/prefs/mcserver.xml": Permission denied at /usr/local/avamar/lib/ line 128

The same error is seen if you run --test.

In the dpnctl.log you will notice the same:

2018/01/17-15:13:55 dpnctl: INFO: - - - - - - - - - - - - - - - BEGIN
2018/01/17-15:13:55 Permission deniedPermission denied at /usr/lib/perl5/vendor_perl/5.10.0/x86_64-linux-thread-multi/XML/ line 587.
2018/01/17-15:13:55 Could not create file parser context for file "/usr/local/avamar/var/mc/server_data/prefs/mcserver.xml": Permission denied at /usr/local/avamar/lib/ line 128
2018/01/17-15:13:55 dpnctl: INFO: - - - - - - - - - - - - - - - END
2018/01/17-15:13:55 /bin/cat /tmp/dpnctl-get-mcs-status-status-21389 2>&1
2018/01/17-15:13:55 [ "/bin/cat /tmp/dpnctl-get-mcs-status-status-21389 2>&1" exit status = 0 ]
2018/01/17-15:13:55 dpnctl: INFO: "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/ --test" exit status = 13
2018/01/17-15:13:55 rm -f /tmp/dpnctl-get-mcs-status-status-21389 /tmp/dpnctl-get-mcs-status-output-21389
2018/01/17-15:13:55 dpnctl: INFO: "rm -f /tmp/dpnctl-get-mcs-status-status-21389 /tmp/dpnctl-get-mcs-status-output-21389" - exit status 0
2018/01/17-15:13:55 dpnctl: INFO: MCS status - - - - - - - - - - END
2018/01/17-15:13:55 dpnctl: INFO: MCS status: down.

This is caused by incorrect permissions and ownership on the mcserver.xml file. I do not know why the permissions and ownership were changed.

The ideal permission and ownership on this file should be:
-rwxrwx--- 1 admin admin 50320 Jan 13 01:48 /usr/local/avamar/var/mc/server_data/prefs/mcserver.xml

To change ownership (Be logged in as root)
# chown admin:admin mcserver.xml

To change permission run the below:
# chmod u+rwx,g+rwx mcserver.xml
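The chmod flags can be sanity-checked on a scratch file before touching mcserver.xml; the expected resulting mode is 770, matching the -rwxrwx--- listing above:

```shell
# Verify what u+rwx,g+rwx produces, using a scratch file
f=$(mktemp)
chmod 600 "$f"             # simulate a locked-down starting mode
chmod u+rwx,g+rwx "$f"     # same flags used on mcserver.xml above
mode=$(stat -c '%a' "$f")
echo "$mode"               # 770 corresponds to -rwxrwx---
```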

Then you should be able to proceed further with the restart of the MCS service.

Hope this helps!

VDP Upgrade ISO Not Detected

There are a ton of reasons why a VDP ISO might not be available in the vdp-configure page (a few of those are covered on my blog here). This is another one of the possible causes.

> When we run df -h, the ISO shows as mounted.
> The avinstaller.log.0 under /usr/local/avamar/var/avi/server_log/ shows that package extraction and package checksum verification completed successfully.

The avinstaller is responsible for all the upgrade-related tasks, and a restart of this service should help put a few things back in place. However, when I restarted this service, I ended up with the below backtrace:

2017-07-19 10:29:32.441 1776 [main] WARN  o.e.jetty.webapp.WebAppContext - Failed startup of context o.e.j.w.WebAppContext@5d11346a{/avi,file:/usr/local/avamar/lib/jetty/work/jetty-,STARTING}{/usr/local/avamar/lib/jetty/avi.war}
java.lang.IllegalStateException: java.lang.IllegalStateException: Failed to locate resource[] in ClassLoader
        at com.avamar.avinstaller.util.LifecycleListener.contextInitialized( ~[na:na]
        at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized( ~[jetty-server-9.0.6.v20130930.jar:9.0.6.v20130930]
        at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized( ~[jetty-servlet-9.0.6.v20130930.jar:9.0.6.v20130930]
Cutting down log snippet
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start( [jetty-util-9.0.6.v20130930.jar:9.0.6.v20130930]
        at com.avamar.avinstaller.AVIMain.start( [avistart.jar:na]
        at com.avamar.avinstaller.AVIMain.main( [avistart.jar:na]
Caused by: java.lang.IllegalStateException: Failed to locate resource[] in ClassLoader
        at com.avamar.avinstaller.util.AppConfig.getInstance( ~[na:na]
        at com.avamar.avinstaller.util.LifecycleListener.contextInitialized( ~[na:na]

So there was a missing repo symlink in the webapp work directory. You can verify this by navigating to:
# cd /usr/local/avamar/lib/jetty/work/jetty-

If you ls -lh you should see this symbolic link here:
lrwxrwxrwx 1 root root    19 Jan 11 20:25 datalink -> /data01/avamar/repo

If this is missing, recreate it by using:
# ln -s /data01/avamar/repo datalink

Next, navigate to the below location:
# cd /usr/local/avamar/lib/jetty/work/jetty-

Here you should have 13 files; in my case one was missing and had to be recreated. If you have a second VDP appliance, copy the file from that server. Otherwise, create a new file with the same name using the vi editor and paste in the below content:


# db settings

# db settings for EM

# local repository

#client package

The avinstaller was still failing with this backtrace:

java.lang.IllegalStateException: org.jbpm.api.JbpmException: resource jbpm.cfg.xml does not exist
        at com.avamar.avinstaller.util.LifecycleListener.contextInitialized( ~[na:na]
        at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized( ~[jetty-server-9.0.6.v20130930.jar:9.0.6.v20130930]
        at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized( ~[jetty-servlet-9.0.6.v20130930.jar:9.0.6.v20130930]
Cutting down log snippet
        at com.avamar.avinstaller.AVIMain.start( [avistart.jar:na]
        at com.avamar.avinstaller.AVIMain.main( [avistart.jar:na]
Caused by: org.jbpm.api.JbpmException: resource jbpm.cfg.xml does not exist
        at ~[na:na]
        at org.jbpm.pvm.internal.xml.Parse.getInputSource( ~[na:na]
        at org.jbpm.pvm.internal.xml.Parser.buildDocument( ~[na:na]
Cutting down log snippet
        at com.avamar.avinstaller.process.PVM.getInstance( ~[na:na]
        at com.avamar.avinstaller.util.LifecycleListener.contextInitialized( ~[na:na]
        ... 20 common frames omitted

The jbpm.cfg.xml file was also missing. Recreate this file and enter the below contents:

<?xml version="1.0" encoding="UTF-8"?>

<jbpm-configuration>

  <import resource="jbpm.default.cfg.xml" />
  <import resource="jbpm.businesscalendar.cfg.xml" />
  <import resource="jbpm.tx.hibernate.cfg.xml" />
  <import resource="jbpm.jpdl.cfg.xml" />
  <import resource="jbpm.identity.cfg.xml" />

  <!-- Job executor is excluded for running the example test cases. -->
  <!-- To enable timers and messages in production use, this should be included. -->
  <import resource="jbpm.jobexecutor.cfg.xml" />

  <import resource="jbpm.mail.templates.examples.xml" />

</jbpm-configuration>
Restart the avinstaller again using --restart; this time everything went fine and the ISO was detected in the vdp-configure page. 

Hope this helps!

Resetting Site Recovery Manager's Embedded DB Password

This article applies only to the embedded Postgres database option available during the SRM install. If you have forgotten the database password, you will not be able to log in to the DB to alter tables or perform a repair/modify on the SRM instance.

Before resetting the password, make sure the SRM machine is on a snapshot and a backup is available for the DB.

1. First, we need to edit the pg_hba.conf file to treat all local users as trusted so that password-less authentication is performed. The pg_hba.conf file is located under:
C:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\data\pg_hba.conf
Make a backup of the file before editing it.

Locate this section in the conf file:

# IPv4 local connections:
host    all    all    md5
# IPv6 local connections:
host    all    all    ::1/128         md5
# Allow replication connections from localhost, by a user with the
# replication privilege.
#host    replication    postgres    md5
#host    replication    postgres    ::1/128         md5

Replace that complete set with this:

# IPv4 local connections:
host    all    all    trust
# IPv6 local connections:
host    all    all    ::1/128         trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
#host    replication    postgres    md5
#host    replication    postgres    ::1/128         md5

Save the file.
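If you prefer to script the md5-to-trust flip (and the later revert), a sed one-liner works; the sketch below runs against a scratch copy of the relevant section rather than the live pg_hba.conf:

```shell
# Flip md5 -> trust on a scratch copy of the pg_hba.conf section
conf=$(mktemp)
cat > "$conf" <<'EOF'
# IPv4 local connections:
host    all    all    md5
# IPv6 local connections:
host    all    all    ::1/128         md5
EOF

cp "$conf" "$conf.bak"                 # keep a backup, as advised above
sed -i 's/md5[[:space:]]*$/trust/' "$conf"
grep -c 'trust$' "$conf"               # both host lines now end in trust
```

To revert after the password reset, run the same sed with the pattern and replacement swapped, or restore the .bak copy.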

2. Restart the SRM Embedded database service from services.msc

3. Open a command prompt as administrator; we now need to log in to the database. Navigate to the below bin directory:
C:\Program Files\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin

4. Connect to the postgres database using:
psql -U postgres -p 5678

Port might vary if you had a custom install of the database. 

5. Run the below query to change the password:
ALTER USER "enter srm db user here" PASSWORD 'new_password';

The SRM DB user and port information can be found in the 64-bit ODBC connection settings. 
A successful execution will return the output: "ALTER ROLE"

6. Revert the changes made to the pg_hba.conf file so that md5 authentication is again required for users to log in to the SRM database. 

7. Restart the SRM Embedded DB service again

After this, the SRM service will fail to start, and you will notice the following backtrace in the vmware-dr.log:

2018-01-17T02:25:15.628Z [01748 error 'WindowsService'] Application error:
--> std::exception 'class Vmacore::Exception' "DBManager error: Could not initialize Vdb connection: ODBC error: (08001) - FATAL:  password authentication failed for user "srmadmin"
--> "
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x0001362B]
--> backtrace[04] vmware-dr.exe[0x0015C59A]
--> backtrace[05] MSVCR90.dll[0x00074830]
--> backtrace[06] MSVCR90.dll[0x00043B3C]
--> backtrace[07] ntdll.dll[0x0009CED3]
--> backtrace[08] vmware-dr.exe[0x000060AF]
--> backtrace[09] vmware-dr.exe[0x00006A5E]
--> backtrace[10] windowsService.dll[0x00002BCE]
--> backtrace[11] windowsService.dll[0x000020DD]
--> backtrace[12] sechost.dll[0x000081D5]
--> backtrace[13] KERNEL32.DLL[0x0000168D]
--> backtrace[14] ntdll.dll[0x00074629]
--> [backtrace end]

Run a Modify on the SRM instance from Add/Remove Programs, provide the new database password during this process, and the service will start up just fine. 

Hope this helps.

Monday, 15 January 2018

vSphere Replication 6.5.1 With vRealize Orchestrator 7.3

Here we will look at how to configure and use vSphere Replication with vRealize Orchestrator. The versions in my setup:

vCenter Appliance 6.5 U1
vSphere Replication 6.5.1
vRealize Orchestrator 7.3

In brief, deploy the vRealize Orchestrator OVA template. Then navigate to "https://<vro-fqdn>:8283/vco-controlcenter/" to begin the configuration.

I have a standalone Orchestrator deployment with vSphere Authentication mode.

SSO user name and password is required to complete the registration. A restart of vRO would be needed to complete the configuration.

Next, download the vSphere Replication vmoapp file from this link here.

To install this file, click on the Manage Plugins tab in the Orchestrator control center and browse for the downloaded vmoapp file.

Then accept the EULA to Install the Plugin.

If prompted, click Save Changes and this should show the vR plugin is available and enabled in the plugin page.

Next, register the vCenter Site for the replication server using the below "Register VC Site" Workflow. All the next tasks are done from the Orchestrator client.

Once done, you can verify the vSphere Replication site is now visible under Administer mode of vRO.

Next, we will configure replication for one virtual machine. With the Run mode execute the "Configure Replication" workflow.

The Site (source) will be selected first.

Selecting virtual machine will be the next task.

Target site vR selection will be next. I am replicating within the same vCenter, so the source and target vR site is the same machine.

Next, we will select the target datastore where the replicated files should reside.

Lastly, we will choose the RPO and other required parameters to complete the replication task and click Submit.

Finally, you can see the VM under Outgoing Replication tab for vCenter.

That's pretty much it!

Friday, 12 January 2018

VDP 6.1.6 Does Not Connect To Web Client After An Upgrade

Recently, after I upgraded my VDP to 6.1.6, there were issues connecting the appliance to the web client. The screen stayed grayed out forever, and the vdr-server.log did not have any information about the cause.

Running the below command, we saw there were 5 vCenter connections down:
# mccli server show-services

Name                               Status
---------------------------------- ---------------------------
/cartman.southpark.local           5 vCenter connection(s) down.

The MCS restart failed with the following in the mcserver.out log file, located at /usr/local/avamar/var/mc/server_log/mcserver.out:

Caught Exception :  Exception : org.apache.axis.AxisFault Message : ; nested exception is: PKIX path building failed: Could not build a validated path. StackTrace :
 faultCode: {}Server.userException
 faultString: PKIX path building failed: Could not build a validated path.
        {} PKIX path building failed: Could not build a validated path.

Caused by: PKIX path building failed: Could not build a validated path.
        at Source)
        at Source)
        at Source)
        at Source)
        at Source)
        at Source)
        ... 30 more
Caused by: Could not build a validated path.
        at com.rsa.cryptoj.o.qb.engineBuild(Unknown Source)

This is because ignore_vc_cert is set to false:
# grep ignore_vc /usr/local/avamar/var/mc/server_data/prefs/mcserver.xml
<entry key="ignore_vc_cert" value="false" />

To fix this, either edit the mcserver.xml file manually and change the value from false to true or run the below command (Make sure a backup of mcserver.xml is taken):
# sed -i -e 's/entry key="ignore_vc_cert" value="false"/entry key="ignore_vc_cert" value="true"/g'  /usr/local/avamar/var/mc/server_data/prefs/mcserver.xml
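Before running the sed against the live file, you can dry-run the same expression on a scratch copy of the entry:

```shell
# Dry-run the flip on a scratch copy of the mcserver.xml entry
xml=$(mktemp)
echo '<entry key="ignore_vc_cert" value="false" />' > "$xml"

sed -i -e 's/entry key="ignore_vc_cert" value="false"/entry key="ignore_vc_cert" value="true"/g' "$xml"
cat "$xml"    # the value should now read "true"
```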

Restart the MCS (From admin mode) using:
# --restart

That should fix the connection issue. Hope this helps!

Thursday, 11 January 2018

vSphere Replication Sync Fails With Exception: com.vmware.hms.replication.sync.DeltaAbortedException

There are a few instances where a vSphere Replication sync (RPO-based or manual) fails with a DeltaAbortedException. This in turn also affects a test or planned migration performed with Site Recovery Manager.

In the hms.log located under /opt/vmware/hms/logs on the vSphere Replication Server, you will notice something like:

2018-01-10 14:35:59.950 ERROR com.vmware.hms.replication.sync.ReplicationSyncManager [hms-sync-progress-thread-0] (..replication.sync.ReplicationSyncManager) operationID=fd66efca-f070-429c-bc89-f2164e9dbb7a-HMS-23613 | Completing sync operation because of error: {OnlineSyncOperation, OpId=fd66efca-f070-429c-bc89-f2164e9dbb7a-HMS-23613, GroupMoId=GID-2319507d-e668-4eea-aea9-4d7d241dd886, ExpInstSeqNr=48694, TaskMoId=HTID-56fd57dd-408b-4861-a124-70d8c53a1194, InstanceId=2f900595-2822-4f2b-987d-4361f7035
05c, OpState=started, VcVmMoid=vm-28686, createInstanceRetryCount=2, fullSyncOngoing=false, operationId=null}
        at com.vmware.hms.replication.sync.SyncOperation.checkHealth(
        at com.vmware.hms.replication.sync.SyncOperation$
        at com.vmware.hms.util.executor.LoggerOpIdConfigurator$
        at com.vmware.hms.util.executor.LoggerOpIdConfigurator$
        at java.util.concurrent.Executors$
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(
        at java.util.concurrent.ScheduledThreadPoolExecutor$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$

This occurs when the outgoingeventlogentity and incomingeventlogentity tables in the vR database contain a large number of entries.

The following fix should be applied at your own risk. Take a snapshot and/or a backup of the vR server before performing the change.

1. Navigate to the VRMS database's bin directory:
# cd /opt/vmware/vpostgres/9.x/bin
The postgres version varies depending on the replication server release.

2. Backup the replication database using the below command:
# ./pg_dump -U vrmsdb -Fp -c > /tmp/DBBackup.bak

3. Connect to the vR database using:
# ./psql -U vrmsdb

4. Run the below queries to extract the number of events for the logentity tables:

select count(*) from outgoingeventlogentity; 
select count(*) from incomingeventlogentity;

In my case, the output on the production site vR was:

vrmsdb=# select count(*) from incomingeventlogentity;
(1 row)

vrmsdb=# select count(*) from outgoingeventlogentity;
(1 row)

And on the recovery site, the outgoingeventlogentity table had 21k+ events. 

5. First, you can change the max event age limit to 10800 in the hms-configuration.xml file located at /opt/vmware/hms/conf/hms-configuration.xml.

6. Next, we will have to purge the old event logs from the above-mentioned tables. The tables have a lot of columns (run select * from <table-name>; to see them all); the one we need is the "timestamp" column.

The timestamp column holds a value like: 1515479242006
To convert this to a human-readable date:

> Remove the last 3 digits (the milliseconds) from the value.
So 1515479242006 becomes 1515479242. Then convert this EPOCH time to a normal date with any epoch converter.
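The two conversion steps can also be done right on the appliance shell instead of a web converter, using the sample value from above:

```shell
# Convert a vR eventlog timestamp (milliseconds) to a readable UTC date
ts=1515479242006
epoch=${ts%???}                              # strip the last 3 digits (ms)
date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC'   # → 2018-01-09 06:27:22 UTC
```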

Now choose a cutoff timestamp: anything older than it will be purged from the database. In the DELETE queries, use the complete value as stored in the timestamp column (milliseconds included). Then run the below queries:

DELETE from incomingeventlogentity WHERE timestamp < 1515479242006;
DELETE from outgoingeventlogentity WHERE timestamp < 1515479242006;

7. Then restart the hms service using:
# systemctl stop hms
# systemctl start hms

The above is applicable from 6.1.2 vR onward. For lower versions:
# service hms restart

8. Re-pair the sites and then perform a sync now operation and we should be good to go. 

Hope this helps!

Wednesday, 10 January 2018

ISO Undetected During VDP 6.1.6 Upgrade

With the release of VDP 6.1.6 come more upgrade issues, especially the main one: the ISO is not detected in the vdp-configure page. You might see something like:

To upgrade your VDP appliance, please connect a valid upgrade ISO image to the appliance. 

On the command line, you will see the ISO is already mounted:

root@Jimbo:~/#: df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        32G  5.9G   25G  20% /
udev            1.9G  148K  1.9G   1% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
/dev/sda1       128M   37M   85M  31% /boot
/dev/sda7       1.5G  158M  1.3G  12% /var
/dev/sda9       138G  5.1G  126G   4% /space
/dev/sdd1       256G  208M  256G   1% /data02
/dev/sdc1       256G  208M  256G   1% /data03
/dev/sdb1       256G  4.3G  252G   2% /data01
/dev/sr0        4.1G  4.1G     0 100% /mnt/auto/cdrom

If it is not, use the below command to mount:
# mount /dev/sr0 /mnt/auto/cdrom

If it still does not help, then proceed further.

For all upgrade issues, we look at the avinstaller.log located under /usr/local/avamar/var/avi/server_log/. However, I noticed that the last update time on the log was Nov 22:

-rw-r--r-- 1 root  root  1686327 Nov 22 21:12 avinstaller.log.0

This indicated that the avinstaller process is not updating the log. A restart of the avinstaller using the below command will not help either:
# --restart

The process is running fine though:

root@Jimbo:/usr/local/avamar/var/avi/server_log/#: --test
Avistart process: 31230

INFO: AVI is running.

For this, we will have to rename the web server folder "jetty-" located under:
# cd /usr/local/avamar/lib/jetty/work

To rename use:
# mv jetty- jetty-

Then restart the avinstaller script using --restart

Now the avinstaller.log is updated:
-rw-r--r-- 1 root  root  1750549 Jan 10 19:38 avinstaller.log.0

And the ISO is detected in the vdp-configure page.

If this still does not work, then we will have to look deeper into the avinstaller.log, which can point to a ton of causes. Best to reach out to VMware Support.

Hope this helps.

Tuesday, 9 January 2018

Unable To Connect VDP To Web Client

In a couple of cases, either on a fresh VDP deployment or an existing one, the connection from VDP to the web client via the plugin might fail. 

It would generically say, 

Unable to connect to the requested VDP Appliance. Would you like to be directed to the VDP Configuration utility to troubleshoot the issue? 

The vdp-configure page does not say much about these errors, and all the services seem to be running fine. You can also try restarting the Tomcat service on VDP using --restart; however, that might not help. 

For this error, the first logs to look at are the web client logs on the vCenter Server. There, the following was logged:

[2018-01-09T19:42:06.634Z] [INFO ] http-bio-9443-exec-27         com.emc.vdp2.api.impl.BaseApi                                     Connecting to VDP at: [https://x.x.x.x:8543/vdr-server/auth/login]
[2018-01-09T19:42:06.646Z] [INFO ] http-bio-9443-exec-27         com.emc.vdp2.api.impl.BaseApi                                     Setting the session ID to: null
[2018-01-09T19:42:06.656Z] [WARN ] http-bio-9443-exec-27         org.springframework.flex.core.DefaultExceptionLogger              The following exception occurred during request processing by the BlazeDS MessageBroker and will be serialized back to the client:  flex.messaging.MessageException: java.lang.NullPointerException : null

[2018-01-09T19:42:06.889Z] [WARN ] http-bio-9443-exec-26 org.springframework.flex.core.DefaultExceptionLogger The following exception occurred during request processing by the BlazeDS MessageBroker and will be serialized back to the client: flex.messaging.MessageException: org.eclipse.gemini.blueprint.service.ServiceUnavailableException : service matching filter
=[(objectClass=com.emc.vdp2.api.ActionApiIf)] unavailable
at com.vmware.vise.messaging.remoting.JavaAdapterEx.invoke(
at flex.messaging.MessageBroker.routeMessageToService(
at flex.messaging.endpoints.AbstractEndpoint.serviceMessage(
at flex.messaging.endpoints.AbstractEndpoint$$FastClassByCGLIB$$1a3ef066.invoke(<generated>)
at net.sf.cglib.proxy.MethodProxy.invoke(
at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(
at org.springframework.flex.core.MessageInterceptionAdvice.invoke(
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(
at org.springframework.aop.framework.adapter.ThrowsAdviceInterceptor.invoke(
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(
at org.springframework.aop.framework.Cglib2AopProxy$FixedChainStaticTargetInterceptor.intercept(
at flex.messaging.endpoints.AMFEndpoint$$EnhancerByCGLIB$$72c7df65.serviceMessage(<generated>)
at flex.messaging.endpoints.amf.MessageBrokerFilter.invoke(
at flex.messaging.endpoints.amf.LegacyFilter.invoke(
at flex.messaging.endpoints.amf.SessionFilter.invoke(
at flex.messaging.endpoints.amf.BatchProcessFilter.invoke(
at flex.messaging.endpoints.amf.SerializationFilter.invoke(
at flex.messaging.endpoints.BaseHTTPEndpoint.service(
at flex.messaging.endpoints.AMFEndpoint$$EnhancerByCGLIB$$72c7df65.service(<generated>)
at org.springframework.flex.servlet.MessageBrokerHandlerAdapter.handle(
at org.springframework.web.servlet.DispatcherServlet.doDispatch(
at org.springframework.web.servlet.DispatcherServlet.doService(

When we see this back-trace referencing com.emc.vdp2.api.ActionApiIf, the service backing the connection call is unavailable when VDP sends the login request, and so the connection fails. 

To resolve this clear the SerenityDB on the vCenter Server. Follow this Knowledge base article for the procedure.

After this, log back in to the web client and connect to the VDP appliance; it should go through successfully. 

Hope this helps.