Friday, 16 March 2018

Unable To Make Changes On A Virtual Machine - The operation is not allowed in the current state of the datastore

Recently, while working on a case we noticed that we were unable to make changes to any virtual machine on a particular NetApp NFS datastore. We were unable to add disks, increase existing VMDKs or create virtual machines on that datastore.

The error we received was:

When we login to the host directly via UI client then we were able to perform all the above mentioned changes. This pointed out that, there seemed to be an issue with the vCenter server and not the datastore.

So looking into the vpxd.log for vCenter, this is what we saw:

2018-03-16T09:11:48.485Z info vpxd[7FB8F56ED700] [Originator@6876 sub=Default opID=VmConfigFormMediator-applyOnMultiEntity-93953-ngc:70007296-f3] [VpxLRO] -- ERROR task-252339 -- vm-44240 -- vim.VirtualMachine.reconfigure: vim.fault.InvalidDatastoreState:
--> Result:
--> (vim.fault.InvalidDatastoreState) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    datastoreName = "dc1_vmware01"
-->    msg = ""
--> }
--> Args:
--> Arg spec:
--> (vim.vm.ConfigSpec) {
-->    changeVersion = "2018-03-15T08:04:06.266218Z",
-->    name = <unset>,

To fix this we had to change the thin_prov_space_flag from 1 to 0 on the vCenter server database. In my case, the vCenter was an appliance (The process remains more or less same for Windows based vCenter as well) 

The fix:

1. Always have a snapshot of the vCenter server before making any changes within it.

2. Stop the vCenter server service using:
# service-control --stop vmware-vpxd

3. Connect to the vCenter database using:
# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U vc

The password for the vCenter DB can be found in the below file:
/etc/vmware-vpx/ file

Run this below query to list out the datastores with this vCenter:
select * from vpx_datastore where name='<enter-your-datastore-name>';

The output would be something like:
   11 | dc1_vmware01        | ds:///vmfs/volumes/7a567de9-3e3c0969/                   | 5277655814144 | 1745005346816 | NFS  |      | <obj xmlns:xsd="" xmlns:xsi=""
 xmlns="urn:vim25" versionId="6.5" xsi:type="DatastoreCapability"><directoryHierarchySupported>true</directoryHierarchySupported><rawDiskMappingsSupported>false</rawDiskMappingsSupported><perFileThinProvisioningSupported>true</perFileThi
orted><vmfsSparseSupported>true</vmfsSparseSupported><vsanSparseSupported>false</vsanSparseSupported></obj> |             2 |            0 |                        30 |                0 |                    1 | automatic
  |                        90 |                  1

Too much information here, so we can filter it out using the below query
select id,thin_prov_space_flag from vpx_datastore;

Now you can see:

VCDB=> select id,thin_prov_space_flag from vpx_datastore;
  id  | thin_prov_space_flag
 5177 |                    0
 6449 |                    0
 5178 |                    0
   12 |                    0
  795 |                    0
  149 |                    0
  793 |                    0
   11 |                    1

Now, we need to change the thin_prov_space_flag from 1 to 0 for id=11

So run this query:
update vpx_datastore set thin_prov_space_flag=0 where id=<enter-your-id>;

Quit the database view using \q

Start the web client service using:
# service-control --start vmware-vpxd

Re-login back to the vCenter and now you should be able to make the necessary changes. 

Hope this helps!

Tuesday, 13 March 2018

Creating A VMFS Volume From Command Line

One of the alternate methods to formatting a new VMFS volume from the GUI is to create the same from the SSH of the ESXi host.

The process is quite simple and you can follow them as mentioned below:

1. Make sure the device is presented to the ESX and visible. If not, perform a Rescan Storage and check if the device is visible.

2. You can get the device identifier from the SSH of the ESX by navigating to:
# cd /vmfs/devices/disks

In my case, the device I was interested was mpx.vmhba1:C0:T3:L0

3. Next, we need to create a partition on this device and we no longer use fdisk for ESX as this is deprecated. So we will use partedUtil

So, we will create a partition (Number=1) at an offset of 128. The partition identifier is 0xfb which is a VMFS partition. 0xfb = 251. Along with this we will specify the ending sector.

To calculate ending sector:
The disk has 512 bytes per sector. In my case the device is 12 GB.
So number of bytes is 12884901887.99998
Dividing this by 512 is 25165824 sectors.

Do not use the complete sector value as it might complain out of bound sector value, so use one number less.

The command would then be:
# partedUtil set /vmfs/devices/disks/device-name "1 128 <ending-sector> 251 0"

Sample command:
# partedUtil set /vmfs/devices/disks/mpx.vmhba1:C0:T3:L0 "1 128 25165823 251 0"

A successful output would be:
0 0 0 0
1 128 25165823 251 0

4. Next, you format a VMFS volume using the vmkfstools -C command. 

The command would be:
# vmkfstools -C <vmfs-version> -b <block-size> -S <name-of-datastore> /vmfs/devices/disks/<device-name>:<partition-number> 

So the command for me would be (For a VMFS5 partition with 1 mb block size)
# vmkfstools -C vmfs5 -b 1m -S Test /vmfs/devices/disks/mpx.vmhba1:C0:T3:L0:1

A successful output would be:
Checking if remote hosts are using this device as a valid file system. This may take a few seconds...
Creating vmfs5 file system on "mpx.vmhba1:C0:T3:L0:1" with blockSize 1048576 and volume label "Test".
Successfully created new volume: 5aa7d4e8-1e99a608-f609-000c292cd901

Now, back in the GUI just do a refresh on the storage section and this volume is visible for the host. 

Hope this helps!

Friday, 2 March 2018

SRM CentOS 7.4 IP Customization Fails

If you are using SRM 6.0.x or SRM 6.1.x and you are trying to test failover a CentOS 7.4 machine with IP Customization the Customize IP Section of the recovery fails with the message

The guest operating system '' is not supported

In the vmware-dr.log on the DR site SRM, you will notice the following:

2018-03-02T02:10:43.405Z [01032 error 'Recovery' ctxID=345cedf opID=72d8d85a] Plan 'CentOS74' failed: (vim.fault.UnsupportedGuest) {
-->    faultCause = (vmodl.MethodFault) null, 
-->    property = "guest.guestId", 
-->    unsupportedGuestOS = "", 
-->    msg = ""
--> }

This is because the CentOS7.4 is not a part of supported guest in the imgcust binaries of the 6.0 release. For CentOS 7.4 customization to work, the SRM needs to be on a 6.5 release. In my case, I upgraded vCenter to 6.5 Update 1 and SRM to 6.5.1 post which the test recovery completed without issues.

If there is no plan for immediate upgrade of your environment, but would still like to have the customizations completing, then use this workaround.

If you look at the redhat-release file
# cat /etc/redhat-release

The contents are:
CentOS Linux release 7.4.1708 (Core)

So you remove this and then add:
Red Hat Enterprise Linux Server release 7.0 (Maipo)

Since RHEL 7.0 is supported in imgcust for 6.0 the test recovery completes fine. Hope this helps!

Friday, 23 February 2018

VDP Backup Fails With Time Out - End

So you might be aware that a backup job in VDP can run upto 24 hours, post which it ideally times out with Time Out end status. If this is what you are facing, then you can extend the backup window overtime using this KB article here

In some cases, this might not be the situation. You might not hit the 24 hour window but still the backup fails with Time Out - End

ID Status Error Code Start Time Elapsed End Time Type Progress Bytes New Bytes Client Domain
9151917120002509 Timed Out - End 0 2018-02-21 00:01 IST 02h:59m:31s 2018-02-21 03:00 GMT Scheduled Backup 179.9 GB 15.4% Test-VM /vcenter.southpark.local/VirtualMachines

The backup logs will have the same logging that one would expect to see when the time out issue occurs.

The avtar.log will have something similar to:

2018-02-21T03:00:23.253-01:00 avtar Info <7061>: Canceled by '3016-vmimagew' - exiting...
2018-02-21T03:00:23.258-01:00 avtar Info <9772>: Starting graceful (staged) termination, cancel request (wrap-up stage)
2018-02-21T03:00:23.347-01:00 [avtar] ERROR: <0001> backstreamdir::childdone error merging history stream data during phase_do_hidden while processing directory 'VMFiles/2'
2018-02-21T03:00:23.392-01:00 [avtar] ERROR: <0001> backstreamdir::childdone error merging history stream data during phase_do_hidden while processing directory 'VMFiles'
2018-02-21T03:00:23.438-01:00 avtar Info <7202>: Backup CANCELED, wrapping-up session with Server

The backup.log will have:

2018-02-21T03:00:23.169-01:00 avvcbimage Info <9740>: skipping 2048 unchanged sectors in PAX stream at sector offset 241468416
2018-02-21T03:00:23.196-01:00 avvcbimage Info <9772>: Starting graceful (staged) termination, MCS cancel (wrap-up stage)
2018-02-21T03:00:23.212-01:00 avvcbimage Info <19692>: Cancelled by MCS with timeout=0 sec
2018-02-21T03:00:23.647-01:00 avvcbimage Info <40654>: isExitOK()=169
2018-02-21T03:00:23.647-01:00 avvcbimage Info <16022>: Cancel detected(externally cancelled by Administrator), isExitOK(0).
2018-02-21T03:00:23.657-01:00 avvcbimage Info <16041>: VDDK:VixDiskLib: VixDiskLib_Close: Close disk.

The avagent.log will have the following:

2018-02-21T03:00:23.137-01:00 avagent Info <5964>: Requesting work from
2018-02-21T03:00:23.162-01:00 avagent Info <8636>: Starting CTL cancel of workorder "BackupTest-BackupTest-1519171200009".
2018-02-21T03:00:31.454-01:00 avagent Info <6688>: Process 127319 (/usr/local/avamarclient/bin/avvcbimage) finished (code 169: externally cancelled by Administrator)
2018-02-21T03:00:31.454-01:00 avagent Warning <6690>: CTL workorder "BackupTest-BackupTest-1519171200009" non-zero exit status 'code 169: externally cancelled by Administrator'

When I checked the backup duration window, it was set only to 3 hours. Not sure how that changed

root@vdp:/usr/local/avamar/var/vdr/server_logs/#: mccli schedule show --name=/vcenter.southpark.local/VirtualMachines/BackupTest | grep "Backup Window Duration" 
Backup Window Duration 3 hours and 0 minutes

So we had to flip it back to 24 hours:

root@vdp:/usr/local/avamar/var/vdr/server_logs/#: mccli schedule edit --name=/vcenter.southpark.local/VirtualMachines/BackupTest --duration=24:00
0,22214,Schedule modified

root@vdp:/usr/local/avamar/var/vdr/server_logs/#: mccli schedule show --name=/vcenter.southpark.local/VirtualMachines/BackupTest | grep "Backup Window Duration" 
Backup Window Duration 24 hours and 0 minutes

The backup should not time out before the 24 hour window anymore. Hope this helps!

VDP Tomcat Service Crashes With "line 1, column 0, byte 0"

While working with one of my colleague the last day, there was an issue where the tomcat service would not start up.

root@vdp:/home/admin/#: dpnctl status
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: gsan status: up
dpnctl: INFO: MCS status: up.
dpnctl: INFO: emt status: down.
dpnctl: INFO: Backup scheduler status: down.
dpnctl: INFO: axionfs status: down.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Unattended startup status: enabled.
dpnctl: INFO: avinstaller status: down.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

The core services seemed to be fine, it was just the EM tomcat that was unresponsive. When we tried to restart the tomcat service using the below command as the "root" user of VDP it used to fail:
# --start

Error trace:

syntax error at line 1, column 0, byte 0:
Identity added: /home/admin/.ssh/dpnid (/home/admin/.ssh/dpnid)
<entry key="clean_emdb_cm_queue_info_days" value="365" />
at /usr/lib/perl5/vendor_perl/5.10.0/x86_64-linux-thread-multi/XML/ line 187

If we ran the alternative command to start the service, it would fail to and the dpnctl.log would have the similar Error trace.
# dpnctl start emt

Error trace:

2018/02/22-17:11:42 syntax error at line 1, column 0, byte 0:
2018/02/22-17:11:42 Identity added: /home/admin/.ssh/dpnid (/home/admin/.ssh/dpnid)
2018/02/22-17:11:42 tomcatctl: ERROR: problem running command "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/ --start" - exit status 255

So, it looks like the emserver.xml file had been corrupted and we just need to restore this particular file from a previous EM Flush.

To perform this, the steps would be:

1. List out the latest 5 available EM backups using:
# avtar --backups --path=/EM_BACKUPS --count=5 | less

The output would be:

Date      Time    Seq       Label           Size     Plugin    Working directory         Targets
---------- -------- ----- ----------------- ---------- -------- --------------------- -------------------
2018-02-21 08:00:05   213                          57K Linux    /usr/local/avamar     var/em/server_data
2018-02-20 08:00:05   212                          57K Linux    /usr/local/avamar     var/em/server_data
2018-02-19 08:00:08   211                          57K Linux    /usr/local/avamar     var/em/server_data
2018-02-18 08:00:04   210                          57K Linux    /usr/local/avamar     var/em/server_data
2018-02-17 08:00:11   209                          57K Linux    /usr/local/avamar     var/em/server_data

2. Next, we will choose a label number and then restore the EM Flush to a temp directory. The command would be:
# avtar -x --labelnum=209 --path=/EM_BACKUPS --target=/tmp/emback

A successful restore would end with a similar snippet:

avtar Info <5259>: Restoring backup to directory "/tmp/emback"
avtar Info <5262>: Restore completed
avtar Info <7925>: Restored 55.55 KB from selection(s) with 55.55 KB in 10 files, 6 directories
avtar Info <6090>: Restored 55.55 KB in 0.01 minutes: 306.5 MB/hour (56,489 files/hour)
avtar Info <7883>: Finished at 2018-02-22 21:26:53 GST, Elapsed time: 0000h:00m:00s
avtar Info <6645>: Not sending wrapup anywhere.
avtar Info <5314>: Command completed (1 warning, exit code 0: success)

3. Rename the older emserver.xml file using:
# mv /space/avamar/var/em/server_data/prefs/emserver.xml /space/avamar/var/em/server_data/prefs/emserver.xml.old

4. Copy to restored file to the actual location using:
# cp -p /tmp/emback/var/em/server_data/prefs/emserver.xml /space/avamar/var/em/server_data/prefs/

5. List out the permissions on the file using:
# ls -lh  /space/avamar/var/em/server_data/prefs/

The permissions should be:

-rw------- 1 admin admin 9.4K Jul 26  2017 emserver.xml
-rw------- 1 admin admin 9.3K Feb 22 09:09 emserver.xml.old
-r-------- 1 admin admin 4.5K Jul 26  2017 preferences.dtd

6. Restart the tomcat service using:
# --start

It should now start up successfully. Hope this helps!

Tuesday, 20 February 2018

SRM Service Crashes Due To CR/LF Conversion

In a postgreSQL SRM deployment, the service might crash if the Carriage Return / Line Feed bit is enabled. The back trace would not tell much, even with trivia logging I could not make much out of it. This is what I saw in vmware-dr.log:

2018-02-20T01:05:26.359Z [02712 verbose 'Replication.Folder'] Reconstructing folder '_replicationRoot':'DrReplicationRootFolder' from the database
2018-02-20T01:05:26.468Z [02712 panic 'Default'] 
--> Panic: TerminateHandler called
--> Backtrace:
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
--> backtrace[05] vmacore.dll[0x001D9FD0]
--> backtrace[40] vmacore.dll[0x00065FEB]
--> backtrace[41] vmacore.dll[0x0015BC50]
--> backtrace[42] vmacore.dll[0x001D2A5B]
--> backtrace[43] MSVCR90.dll[0x00002FDF]
--> backtrace[44] MSVCR90.dll[0x00003080]
--> backtrace[45] KERNEL32.DLL[0x0000168D]
--> backtrace[46] ntdll.dll[0x00074629]
--> [backtrace end]

The last OpID is 02712 and even if I search with this while trivia logging being enabled, it would not give me much information.

Apparently, the CR/LF bit in ODBC driver causing some kind of truncation causing SRM service to crash.

This setting is available Under:
ODBC 64 bit > System DSN > SRM DSN (Configure) > Datasource > Page 2

Uncheck this option and then the service should start successfully.

Hope this kind of helps!

Monday, 19 February 2018

Unable To Send Scheduled Email Reports In VDP "com.vmware.vim25.ManagedObjectNotFound"

When I was working on a 6.1.5 VDP deployment, the test emails were successful, however, the scheduled email reports used to fail and in the vdr-server.log the below back trace was seen:

2018-02-14 09:00:00,157 ERROR [VDP-email-report-timer-task]-schedule.EmailReportTimerTask: Failed to send the email summary report. Reason: java.rmi.RemoteException: VI SDK invoke exception:com.vmware.vim25.Manage
dObjectNotFound; nested exception is:
java.lang.RuntimeException: java.rmi.RemoteException: VI SDK invoke exception:com.vmware.vim25.ManagedObjectNotFound; nested exception is:
        at com.emc.vdp2.common.converter.VmClientConverterImpl.getRealName(
        at com.emc.vdp2.common.converter.VmClientConverterImpl.convert(
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addVmClientToBackupJob(
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addClientsToBackupJob(
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addTargetItemsToBackupJob(
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(
        at java.util.TimerThread.mainLoop(Unknown Source)
        at Source)

The main part of the back trace we are interested is this:

        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addTargetItemsToBackupJob(
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(

The backup job conversion to group was having an exception. So out of curiosity, I ran the below command:
# mccli group show --recursive=true

This showed 2 backups jobs whereas the GUI reported one backup job. 

So to fix this, I had to restart the tomcat service using:
# --restart

Post this, the GUI updated the right number of backup jobs and the scheduled email reports were sent successfully. 

If you run into something similar hope this helps!

Wednesday, 14 February 2018

VDP Cannot Connect To Web Client "MIME media type application/octet-stream was not found"

Yet another issue. You will see this generic message in web client when you connect to the web client VDP plugin:

As usual, we always check the web client logs on the vCenter for more information. And in this case, below was the back trace:

[2018-01-24T15:02:43.924Z] [ERROR] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.sun.jersey.api.client.ClientResponse A message body reader for Java class com.emc.vdp2.model.error.VdrError, and Java type class com.emc.vdp2.model.error.VdrError, and MIME media type application/octet-stream was not found
[2018-01-24T15:02:43.924Z] [ERROR] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.sun.jersey.api.client.ClientResponse The registered message body readers compatible with the MIME media type are:
application/octet-stream ->

[2018-01-24T15:02:43.924Z] [WARN ] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.emc.vdp2.api.impl.ActionApi Caught UniformInterfaceException [POST returned a response status of 400 Bad Request], recieved HTTP response: [400] com.sun.jersey.api.client.UniformInterfaceException: POST returned a response status of 400 Bad Request
at com.sun.jersey.api.client.WebResource.handle(
at com.sun.jersey.api.client.WebResource.access$200(
at com.sun.jersey.api.client.WebResource$
at com.emc.vdp2.api.impl.ActionApi.connectVDR(
at sun.reflect.GeneratedMethodAccessor2344.invoke(Unknown Source)

[2018-01-24T15:02:43.933Z] [WARN ] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 org.springframework.flex.core.DefaultExceptionLogger The following exception occurred during request processing by the BlazeDS MessageBroker and will be serialized back to the client: flex.messaging.MessageException: com.sun.jersey.api.client.ClientHandlerException : A message body reader for Java class com.emc.vdp2.model.error.VdrError, and Java type class com.emc.vdp2.model.error.VdrError, and MIME media type application/octet-stream was not found
at com.vmware.vise.messaging.remoting.JavaAdapterEx.invoke(
at flex.messaging.MessageBroker.routeMessageToService(

The issue is due to a corrupted server.xml for the tomcat library. Due to this, none of the vCenter users would be able to connect or only domain users might face this issue.

To fix this:
1. Download the patched server.xml file from this link here:

2. Copy the file to the /root of VDP via WinSCP

3. Backup the original file:
# cp -p /usr/local/avamar-tomcat/conf/server.xml ~/server.xml.bak

4. Remove the old server.xml file:
# rm -f /usr/local/avamar-tomcat/conf/server.xml

5. Replace the patch file in the conf path:
# cp -p ~/server.xml /usr/local/avamar-tomcat/conf

6. Update permissions and ownership:
# chown root:root server.xml && chmod 644 server.xml

7. Restart tomcat service
# --restart

The connection will take a while now since it needs to rebuild the cache, but it will be successful.

Hope this helps.

Friday, 9 February 2018

vSphere Replication Jobs Fail Due to NFC_NO_MEMORY

There might be instances where the replication will run into an Error State or RPO violation state with NFC errors. When you click on the vCenter object in web client and navigate to the summary tab you can view the list of issues and when you highlight the vSphere Replication issues you will notice the NFC errors.

You will notice the below in the logs.
Note: The GID and other values will be different for each environment.

In source ESX where the virtual machine having issue is hosted, you will notice the below in vmkernel.log

2018-02-09T12:07:02.728Z cpu2:3055234)Hbr: 2998: Command: INIT_SESSION: error result=Failed gen=-1: Error for (datastoreUUID: "4723769b-f34bce3e"), (diskId: "RDID-0aaaa0e1-66e1-447f-97f5-19072c00d01e"), (hostId: "host-575"), (pathname: "Test-VM/hbrdis$
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 3007: Command INIT_SESSION failed (result=Failed) (isFatal=FALSE) (Id=0) (GroupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6)
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 4570: Failed to establish connection to []:31031(groupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6): Failure

In the hbrsrv.log under /var/log/vmware you will notice:

2018-02-09T13:12:17.024+01:00 warning hbrsrv[7FF152B01700] [Originator@6876 sub=Libs] [NFC ERROR] NfcFssrvrClientOpen: received unexpected message 4 from server
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-525.
2018-02-09T13:12:17.024+01:00 verbose hbrsrv[7FF152B01700] [Originator@6876 sub=HostPicker] AffinityHostPicker forgetting host affinity for context '[] /vmfs/volumes/4723769b-f34bce3e/Test-VM2'
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main] HbrError for (datastoreUUID: "4723769b-f34bce3e"), (hostId: "host-525"), (pathname: "Test-VM2/Tes-VM2.vmdk"), (flags: retriable, pick-new-host) stack:
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [0] Class: NFC Code: 8
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [1] NFC error: NFC_SESSION_ERROR
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [2] Code set to: Host unable to process request.
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [3] Set error flag: retriable
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [4] Set error flag: pick-new-host
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [5] Can't open remote disk /vmfs/volumes/4723769b-f34bce3e/Test-VM2/Test-VM2.vmdk

Now, you can run the below command to check if there is one affected host or multiple:
# grep -i "Destroying NFC connection" /var/log/vmware/hbrsrv.log | awk '{ $1="";print}' | sort -u

This will give you the list of host MoID. Something like this, neatly sorted out:

 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-660.
 info hbrsrv[7FF1531E6700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-352.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-390.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-487.

Then use this host-id to see which host name it corresponds to from the vCenter MOB page. 

Then on that affected host, you will see this in the hostd.log

2018-02-09T12:17:21.339Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.399Z info hostd[4B040B70] [Originator@6876 sub=Nfcsvc] PROXY connection to NFC(useSSL=0): found session ticket:[N9VimShared15NfcSystemTicketE:0x4c224d24]
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Successfully initialized nfc callback for a  write to the socket to be invoked on a separate thread
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Plugin started
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 50331560 and allocating 119 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcProcessStreamMsg: fssrvr failed with NFC error code = 5
2018-02-09T12:17:21.448Z error hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Read error from the nfcLib: NFC_NO_MEMORY (done=yep)

To fix this, you will need to increase the hostd NFC memory on the target affected ESX host.

1. SSH to the host and navigate to the below location;
# etc/vmware/hostd/config.xml

You will want the following snippet (backup the file before edit)


So here change value for maxMemory to 62914560

So after edit:


2. Restart the hostd service using:
# /etc/init.d/hostd restart

3. Then initiate a force sync on the replication and it should resume successfully.

Hope this helps!