Thursday, 11 January 2018

vSphere Replication Sync Fails With Exception: com.vmware.hms.replication.sync.DeltaAbortedException

There are few instances when a vSphere Replication Sync (RPO based or a manual sync) fails with Delta Aborted Exception. This in turn will also affect a test / planned migration when performed with Site Recovery Manager.

In the hms.log located under /opt/vmware/hms/logs on the vSphere Replication Server, you will notice something like:

2018-01-10 14:35:59.950 ERROR com.vmware.hms.replication.sync.ReplicationSyncManager [hms-sync-progress-thread-0] (..replication.sync.ReplicationSyncManager) operationID=fd66efca-f070-429c-bc89-f2164e9dbb7a-HMS-23613 | Completing sync operation because of error: {OnlineSyncOperation, OpId=fd66efca-f070-429c-bc89-f2164e9dbb7a-HMS-23613, GroupMoId=GID-2319507d-e668-4eea-aea9-4d7d241dd886, ExpInstSeqNr=48694, TaskMoId=HTID-56fd57dd-408b-4861-a124-70d8c53a1194, InstanceId=2f900595-2822-4f2b-987d-4361f7035
05c, OpState=started, VcVmMoid=vm-28686, createInstanceRetryCount=2, fullSyncOngoing=false, operationId=null}
com.vmware.hms.replication.sync.DeltaAbortedException
        at com.vmware.hms.replication.sync.SyncOperation.checkHealth(SyncOperation.java:911)
        at com.vmware.hms.replication.sync.SyncOperation$4.run(SyncOperation.java:735)
        at com.vmware.hms.util.executor.LoggerOpIdConfigurator$RunnableWithDiagnosticContext.run(LoggerOpIdConfigurator.java:133)
        at com.vmware.hms.util.executor.LoggerOpIdConfigurator$2.run(LoggerOpIdConfigurator.java:100)
        at com.vmware.jvsl.sessions.net.impl.TlsPreservingWrapper$2.run(TlsPreservingWrapper.java:47)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

This occurs when the outgoingeventlogentity and incomingeventlogentity tables on the vR database are having a large number of entries.

The following fix should be applied at own risk. Have a snapshot and/or a backup of the vR server before performing the change.

1. Navigate to the VRMS database's bin directory:
# cd /opt/vmware/vpostgres/9.x/bin
The postgres version varies depending on the replication server release.

2. Backup the replication database using the below command:
# ./pg_dump -U vrmsdb -Fp -c > /tmp/DBBackup.bak

3. Connect to the vR database using:
# ./psql -U vrmsdb

4. Run the below queries to extract the number of events for the logentity tables:

select count(*) from outgoingeventlogentity; 
select count(*) from incomingeventlogentity;

In my case, the output on the production site vR was:

vrmsdb=# select count(*) from incomingeventlogentity;
 count
-------
 21099
(1 row)

vrmsdb=# select count(*) from outgoingeventlogentity;
 count
-------
   146
(1 row)

And on the recovery site, the outgoingeventlogentity was having 21k+ events. 

5. First, you can change the max event age limit to 10800 in the hms-configuration.xml file located at:
# cd /opt/vmware/hms/conf/hms-configuration.xml

This should be the output after the edit:
<hms-eventlog-maxage>10800</hms-eventlog-maxage>

6. Next, we will have to purge the event logs from the above mentioned tables. There are lot of fields in the table if you run select * from <table-name>; 
The one column we need is the "timestamp" column. 

The timestamp column would have a value like this: 1515479242006
To convert this to human readable date, you will have to:

> Remove the last 3 digits from the above output. 
So 1515479242006 will be 1515479242. Then convert this EPOCH time to normal convention using this link here.

Now, you will have to use a timestamp in such a way that anything before that would be purged from the database. During the purge, the timestamp has to be the complete value obtained from the timestamp column. Then, run the below query:

DELETE from incomingeventlogentity WHERE timestamp < 1515479242006;
DELETE from outgoingeventlogentity WHERE timestamp < 1515479242006;

7. Then restart the hms service using:
# systemctl stop hms
# systemctl start hms

The above is applicable from 6.1.2 vR onward. For lower versions:
# service hms restart

8. Re-pair the sites and then perform a sync now operation and we should be good to go. 

Hope this help!