Friday, 9 February 2018

vSphere Replication Jobs Fail Due to NFC_NO_MEMORY

There might be instances where the replication will run into an Error State or RPO violation state with NFC errors. When you click on the vCenter object in web client and navigate to the summary tab you can view the list of issues and when you highlight the vSphere Replication issues you will notice the NFC errors.

You will notice the below in the logs.
Note: The GID and other values will be different for each environment.

In source ESX where the virtual machine having issue is hosted, you will notice the below in vmkernel.log

2018-02-09T12:07:02.728Z cpu2:3055234)Hbr: 2998: Command: INIT_SESSION: error result=Failed gen=-1: Error for (datastoreUUID: "4723769b-f34bce3e"), (diskId: "RDID-0aaaa0e1-66e1-447f-97f5-19072c00d01e"), (hostId: "host-575"), (pathname: "Test-VM/hbrdis$
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 3007: Command INIT_SESSION failed (result=Failed) (isFatal=FALSE) (Id=0) (GroupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6)
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 4570: Failed to establish connection to [10.254.2.37]:31031(groupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6): Failure

In the hbrsrv.log under /var/log/vmware you will notice:

2018-02-09T13:12:17.024+01:00 warning hbrsrv[7FF152B01700] [Originator@6876 sub=Libs] [NFC ERROR] NfcFssrvrClientOpen: received unexpected message 4 from server
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-525.
2018-02-09T13:12:17.024+01:00 verbose hbrsrv[7FF152B01700] [Originator@6876 sub=HostPicker] AffinityHostPicker forgetting host affinity for context '[] /vmfs/volumes/4723769b-f34bce3e/Test-VM2'
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main] HbrError for (datastoreUUID: "4723769b-f34bce3e"), (hostId: "host-525"), (pathname: "Test-VM2/Tes-VM2.vmdk"), (flags: retriable, pick-new-host) stack:
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [0] Class: NFC Code: 8
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [1] NFC error: NFC_SESSION_ERROR
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [2] Code set to: Host unable to process request.
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [3] Set error flag: retriable
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [4] Set error flag: pick-new-host
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [5] Can't open remote disk /vmfs/volumes/4723769b-f34bce3e/Test-VM2/Test-VM2.vmdk

Now, you can run the below command to check if there is one affected host or multiple:
# grep -i "Destroying NFC connection" /var/log/vmware/hbrsrv.log | awk '{ $1="";print}' | sort -u

This will give you the list of host MoID. Something like this, neatly sorted out:

 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-660.
 info hbrsrv[7FF1531E6700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-352.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-390.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-487.

Then use this host-id to see which host name it corresponds to from the vCenter MOB page. 

Then on that affected host, you will see this in the hostd.log

2018-02-09T12:17:21.339Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.399Z info hostd[4B040B70] [Originator@6876 sub=Nfcsvc] PROXY connection to NFC(useSSL=0): found session ticket:[N9VimShared15NfcSystemTicketE:0x4c224d24]
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Successfully initialized nfc callback for a  write to the socket to be invoked on a separate thread
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Plugin started
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 50331560 and allocating 119 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcProcessStreamMsg: fssrvr failed with NFC error code = 5
2018-02-09T12:17:21.448Z error hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Read error from the nfcLib: NFC_NO_MEMORY (done=yep)

To fix this, you will need to increase the hostd NFC memory on the target affected ESX host.

1. SSH to the host and navigate to the below location;
# etc/vmware/hostd/config.xml

You will want the following snippet (backup the file before edit)

<nfcsvc>
    <path>libnfcsvc.so</path>
    <enabled>true</enabled>
    <maxMemory>50331648</maxMemory>
    <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

So here change value for maxMemory to 62914560

So after edit:

<nfcsvc>
    <path>libnfcsvc.so</path>
    <enabled>true</enabled>
    <maxMemory>62914560</maxMemory>
    <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

2. Restart the hostd service using:
# /etc/init.d/hostd restart

3. Then initiate a force sync on the replication and it should resume successfully.

Hope this helps!