Tuesday, 20 February 2018

SRM Service Crashes Due To CR/LF Conversion

In a postgreSQL SRM deployment, the service might crash if the Carriage Return / Line Feed bit is enabled. The back trace would not tell much, even with trivia logging I could not make much out of it. This is what I saw in vmware-dr.log:

2018-02-20T01:05:26.359Z [02712 verbose 'Replication.Folder'] Reconstructing folder '_replicationRoot':'DrReplicationRootFolder' from the database
2018-02-20T01:05:26.468Z [02712 panic 'Default'] 
--> 
--> Panic: TerminateHandler called
--> Backtrace:
--> 
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
--> backtrace[05] vmacore.dll[0x001D9FD0]
<<<<SHORTENED BACKTRACE>>>>
--> backtrace[40] vmacore.dll[0x00065FEB]
--> backtrace[41] vmacore.dll[0x0015BC50]
--> backtrace[42] vmacore.dll[0x001D2A5B]
--> backtrace[43] MSVCR90.dll[0x00002FDF]
--> backtrace[44] MSVCR90.dll[0x00003080]
--> backtrace[45] KERNEL32.DLL[0x0000168D]
--> backtrace[46] ntdll.dll[0x00074629]
--> [backtrace end]

The last OpID is 02712 and even if I search with this while trivia logging being enabled, it would not give me much information.

Apparently, the CR/LF bit in ODBC driver causing some kind of truncation causing SRM service to crash.


This setting is available Under:
ODBC 64 bit > System DSN > SRM DSN (Configure) > Datasource > Page 2

Uncheck this option and then the service should start successfully.

Hope this kind of helps!

Monday, 19 February 2018

Unable To Send Scheduled Email Reports In VDP "com.vmware.vim25.ManagedObjectNotFound"

When I was working on a 6.1.5 VDP deployment, the test emails were successful, however, the scheduled email reports used to fail and in the vdr-server.log the below back trace was seen:

2018-02-14 09:00:00,157 ERROR [VDP-email-report-timer-task]-schedule.EmailReportTimerTask: Failed to send the email summary report. Reason: java.rmi.RemoteException: VI SDK invoke exception:com.vmware.vim25.Manage
dObjectNotFound; nested exception is:
        com.vmware.vim25.ManagedObjectNotFound
java.lang.RuntimeException: java.rmi.RemoteException: VI SDK invoke exception:com.vmware.vim25.ManagedObjectNotFound; nested exception is:
        com.vmware.vim25.ManagedObjectNotFound
        at com.vmware.vim25.mo.ManagedObject.retrieveObjectProperties(ManagedObject.java:158)
        at com.vmware.vim25.mo.ManagedObject.getCurrentProperty(ManagedObject.java:179)
        at com.vmware.vim25.mo.ManagedEntity.getName(ManagedEntity.java:99)
        at com.emc.vdp2.common.converter.VmClientConverterImpl.getRealName(VmClientConverterImpl.java:113)
        at com.emc.vdp2.common.converter.VmClientConverterImpl.convert(VmClientConverterImpl.java:62)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addVmClientToBackupJob(BackupJobConverterImpl.java:329)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addClientsToBackupJob(BackupJobConverterImpl.java:280)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addTargetItemsToBackupJob(BackupJobConverterImpl.java:258)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(BackupJobConverterImpl.java:181)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(BackupJobConverterImpl.java:204)
        at com.emc.vdp2.services.BackupQueryServiceWS.getAllVmBackupJobs(BackupQueryServiceWS.java:80)
        at com.emc.vdp2.services.BackupQueryServiceWS.getAllVmBackupJobs(BackupQueryServiceWS.java:64)
        at com.emc.vdp2.email.EmailSummaryReport.createReport(EmailSummaryReport.java:295)
        at com.emc.vdp2.email.EmailSummaryReport.createReport(EmailSummaryReport.java:260)
        at com.emc.vdp2.email.EmailSummaryReport.createAndSendReport(EmailSummaryReport.java:142)
        at com.emc.vdp2.schedule.EmailReportTimerTask.run(EmailReportTimerTask.java:105)
        at java.util.TimerThread.mainLoop(Unknown Source)
        at java.util.TimerThread.run(Unknown Source)

The main part of the back trace we are interested is this:

com.emc.vdp2.common.converter.BackupJobConverterImpl.addClientsToBackupJob(BackupJobConverterImpl.java:280)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addTargetItemsToBackupJob(BackupJobConverterImpl.java:258)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(BackupJobConverterImpl.java:181)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(BackupJobConverterImpl.java:204)

The backup job conversion to group was having an exception. So out of curiosity, I ran the below command:
# mccli group show --recursive=true

This showed 2 backups jobs whereas the GUI reported one backup job. 

So to fix this, I had to restart the tomcat service using:
# emwebapp.sh --restart

Post this, the GUI updated the right number of backup jobs and the scheduled email reports were sent successfully. 

If you run into something similar hope this helps!

Wednesday, 14 February 2018

VDP Cannot Connect To Web Client "MIME media type application/octet-stream was not found"

Yet another issue. You will see this generic message in web client when you connect to the web client VDP plugin:

As usual, we always check the web client logs on the vCenter for more information. And in this case, below was the back trace:

[2018-01-24T15:02:43.924Z] [ERROR] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.sun.jersey.api.client.ClientResponse A message body reader for Java class com.emc.vdp2.model.error.VdrError, and Java type class com.emc.vdp2.model.error.VdrError, and MIME media type application/octet-stream was not found
[2018-01-24T15:02:43.924Z] [ERROR] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.sun.jersey.api.client.ClientResponse The registered message body readers compatible with the MIME media type are:
application/octet-stream ->
com.sun.jersey.core.impl.provider.entity.ByteArrayProvider
com.sun.jersey.core.impl.provider.entity.FileProvider

[2018-01-24T15:02:43.924Z] [WARN ] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.emc.vdp2.api.impl.ActionApi Caught UniformInterfaceException [POST https://192.168.246.10:8543/vdr-server/auth/login returned a response status of 400 Bad Request], recieved HTTP response: [400] com.sun.jersey.api.client.UniformInterfaceException: POST https://192.168.246.10:8543/vdr-server/auth/login returned a response status of 400 Bad Request
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:688)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:560)
at com.emc.vdp2.api.impl.ActionApi.connectVDR(ActionApi.java:40)
at sun.reflect.GeneratedMethodAccessor2344.invoke(Unknown Source)

[2018-01-24T15:02:43.933Z] [WARN ] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 org.springframework.flex.core.DefaultExceptionLogger The following exception occurred during request processing by the BlazeDS MessageBroker and will be serialized back to the client: flex.messaging.MessageException: com.sun.jersey.api.client.ClientHandlerException : A message body reader for Java class com.emc.vdp2.model.error.VdrError, and Java type class com.emc.vdp2.model.error.VdrError, and MIME media type application/octet-stream was not found
at flex.messaging.services.remoting.adapters.JavaAdapter.invoke(JavaAdapter.java:412)
at com.vmware.vise.messaging.remoting.JavaAdapterEx.invoke(JavaAdapterEx.java:72)
at flex.messaging.services.RemotingService.serviceMessage(RemotingService.java:180)
at flex.messaging.MessageBroker.routeMessageToService(MessageBroker.java:1472)

The issue is due to a corrupted server.xml for the tomcat library. Due to this, none of the vCenter users would be able to connect or only domain users might face this issue.

To fix this:
1. Download the patched server.xml file from this link here:
https://github.com/happycow92/Patches

2. Copy the file to the /root of VDP via WinSCP

3. Backup the original file:
# cp -p /usr/local/avamar-tomcat/conf/server.xml ~/server.xml.bak

4. Remove the old server.xml file:
# rm -f /usr/local/avamar-tomcat/conf/server.xml

5. Replace the patch file in the conf path:
# cp -p ~/server.xml /usr/local/avamar-tomcat/conf

6. Update permissions and ownership:
# chown root:root server.xml && chmod 644 server.xml

7. Restart tomcat service
# emwebapp.sh --restart

The connection will take a while now since it needs to rebuild the cache, but it will be successful.

Hope this helps.

Friday, 9 February 2018

vSphere Replication Jobs Fail Due to NFC_NO_MEMORY

There might be instances where the replication will run into an Error State or RPO violation state with NFC errors. When you click on the vCenter object in web client and navigate to the summary tab you can view the list of issues and when you highlight the vSphere Replication issues you will notice the NFC errors.

You will notice the below in the logs.
Note: The GID and other values will be different for each environment.

In source ESX where the virtual machine having issue is hosted, you will notice the below in vmkernel.log

2018-02-09T12:07:02.728Z cpu2:3055234)Hbr: 2998: Command: INIT_SESSION: error result=Failed gen=-1: Error for (datastoreUUID: "4723769b-f34bce3e"), (diskId: "RDID-0aaaa0e1-66e1-447f-97f5-19072c00d01e"), (hostId: "host-575"), (pathname: "Test-VM/hbrdis$
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 3007: Command INIT_SESSION failed (result=Failed) (isFatal=FALSE) (Id=0) (GroupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6)
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 4570: Failed to establish connection to [10.254.2.37]:31031(groupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6): Failure

In the hbrsrv.log under /var/log/vmware you will notice:

2018-02-09T13:12:17.024+01:00 warning hbrsrv[7FF152B01700] [Originator@6876 sub=Libs] [NFC ERROR] NfcFssrvrClientOpen: received unexpected message 4 from server
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-525.
2018-02-09T13:12:17.024+01:00 verbose hbrsrv[7FF152B01700] [Originator@6876 sub=HostPicker] AffinityHostPicker forgetting host affinity for context '[] /vmfs/volumes/4723769b-f34bce3e/Test-VM2'
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main] HbrError for (datastoreUUID: "4723769b-f34bce3e"), (hostId: "host-525"), (pathname: "Test-VM2/Tes-VM2.vmdk"), (flags: retriable, pick-new-host) stack:
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [0] Class: NFC Code: 8
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [1] NFC error: NFC_SESSION_ERROR
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [2] Code set to: Host unable to process request.
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [3] Set error flag: retriable
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [4] Set error flag: pick-new-host
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [5] Can't open remote disk /vmfs/volumes/4723769b-f34bce3e/Test-VM2/Test-VM2.vmdk

Now, you can run the below command to check if there is one affected host or multiple:
# grep -i "Destroying NFC connection" /var/log/vmware/hbrsrv.log | awk '{ $1="";print}' | sort -u

This will give you the list of host MoID. Something like this, neatly sorted out:

 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-660.
 info hbrsrv[7FF1531E6700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-352.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-390.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-487.

Then use this host-id to see which host name it corresponds to from the vCenter MOB page. 

Then on that affected host, you will see this in the hostd.log

2018-02-09T12:17:21.339Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.399Z info hostd[4B040B70] [Originator@6876 sub=Nfcsvc] PROXY connection to NFC(useSSL=0): found session ticket:[N9VimShared15NfcSystemTicketE:0x4c224d24]
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Successfully initialized nfc callback for a  write to the socket to be invoked on a separate thread
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Plugin started
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 50331560 and allocating 119 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcProcessStreamMsg: fssrvr failed with NFC error code = 5
2018-02-09T12:17:21.448Z error hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Read error from the nfcLib: NFC_NO_MEMORY (done=yep)

To fix this, you will need to increase the hostd NFC memory on the target affected ESX host.

1. SSH to the host and navigate to the below location;
# etc/vmware/hostd/config.xml

You will want the following snippet (backup the file before edit)

<nfcsvc>
    <path>libnfcsvc.so</path>
    <enabled>true</enabled>
    <maxMemory>50331648</maxMemory>
    <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

So here change value for maxMemory to 62914560

So after edit:

<nfcsvc>
    <path>libnfcsvc.so</path>
    <enabled>true</enabled>
    <maxMemory>62914560</maxMemory>
    <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

2. Restart the hostd service using:
# /etc/init.d/hostd restart

3. Then initiate a force sync on the replication and it should resume successfully.

Hope this helps!

Thursday, 8 February 2018

VDP Expired MCSSL, Reports 7778, 7779, 7780, 7781, 9443 As Vulnerable In Nessus Scan

In one of my case, there was a report that 7778, 7779, 7780, 7781, 9443 where reported as vulnerable on VDP 6.1.6. All these are MCS java based ports and you can confirm them by running:
# netstat -nlp | grep <enter-port>

To check your MCS SSL validity perform the below commands:
# /usr/java/default/bin/keytool -list -keystore /usr/local/avamar/lib/rmi_ssl_keystore -storepass changeme

The output:

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 2 entries

mcssl, Feb 1, 2008, PrivateKeyEntry,
Certificate fingerprint (SHA1): F1:61:A7:FE:36:A9:E9:7E:DB:92:AE:89:05:52:13:B6:3C:FA:55:A7
vcenterrootca, Jan 8, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): F0:46:B4:00:B8:52:24:6E:A2:94:6B:17:CE:83:23:49:54:9A:3A:49

Then export the cert to root directory:
# /usr/java/default/bin/keytool -exportcert -v -alias mcssl -keystore /usr/local/avamar/lib/rmi_ssl_keystore -storepass changeme -file /root/mcssl.cer -rfc

The output:
Certificate stored in file </root/mcssl.cer>

Then read the certificate:
# /usr/java/default/bin/keytool -printcert -v -file /root/mcssl.cer

The output:

Owner: CN=Administrator, OU=Avamar, O=EMC, L=Irvine, ST=California, C=US
Issuer: CN=Administrator, OU=Avamar, O=EMC, L=Irvine, ST=California, C=US
Serial number: 47a25760
Valid from: Fri Feb 01 00:18:56 CET 2008 until: Mon Jan 29 00:18:56 CET 2018
Certificate fingerprints:
MD5: 61:42:FC:CD:FC:CB:6E:59:CC:48:5E:D9:71:05:F0:B4
SHA1: F1:61:A7:FE:36:A9:E9:7E:DB:92:AE:89:05:52:13:B6:3C:FA:55:A7
SHA256: B4:E6:71:77:58:9B:58:64:E2:F7:3A:A0:2A:07:F8:7B:2E:CA:1B:22:2B:C3:98:A8:90:F8:D8:7A:8E:0A:EE:F9
Signature algorithm name: SHA1withDSA
Version: 1

Due to this expired cert, the java ports are vulnerable. To fix this, you will have to regenerate the certs. The process would be:

1. Backup existing keystore:
# cp /usr/local/avamar/lib/rmi_ssl_keystore ~root/rmi_ssl_keystore_backup-`date -I`

2. Regenerate the mcssl:
# /usr/java/latest/bin/keytool -genkeypair -v -alias mcssl -keyalg RSA -sigalg SHA256withRSA -keystore /usr/local/avamar/lib/rmi_ssl_keystore -storepass changeme -keypass changeme -validity 3650 -dname "CN=`hostname -f`, OU=Avamar, O=EMC, L=Irvine, S=California, C=US" -keysize 2048

Generates a SHA256 SSL which is valid for 10 years. 

3. Update the permissions on the rmi_ssl_keystore
# chmod 444 /usr/local/avamar/lib/rmi_ssl_keystore

4. Update owners for the keystore:
# chown root:admin /usr/local/avamar/lib/rmi_ssl_keystore

5. Switch to admin mode and restart MCS:
# mcserver.sh --stop 
# mcserver.sh --start --verbose

6. Verify all vCenter Connections are OK:
# mccli server show-services

That should be it. Now when you re-run the scan these ports are no longer vulnerable. 

Hope this helps!

Unable To Connect VDP To vCenter: "login returned a response status of 204 No Content"

When connecting a newly deployed VDP or an existing VDP to the web client, you might run into the following error:

This is a very generic message and if you have a look at the web client logs, you will notice the following back trace:

[2018-02-08T09:03:57.295Z] [WARN ] http-bio-9090-exec-5         70000222 100008 200003 org.springframework.flex.core.DefaultExceptionLogger The following exception occurred during request processing by the BlazeDS MessageBroker and will be serialized back to the client:  flex.messaging.MessageException: com.sun.jersey.api.client.UniformInterfaceException : POST https://10.10.0.12:8543/vdr-server/auth/login returned a response status of 204 No Content
        at flex.messaging.services.remoting.adapters.JavaAdapter.invoke(JavaAdapter.java:444)
        at com.vmware.vise.messaging.remoting.JavaAdapterEx.invoke(JavaAdapterEx.java:50)
        at flex.messaging.services.RemotingService.serviceMessage(RemotingService.java:183)


Caused by: com.sun.jersey.api.client.UniformInterfaceException: POST https://10.10.0.12:8543/vdr-server/auth/login returned a response status of 204 No Content
        at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:609)
        at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:586)
        at com.emc.vdp2.api.impl.BaseApi.convertToFlexException(BaseApi.java:184)

Looking further into the vdr-server.log, you will notice this:

2018-02-08 10:04:44,850 ERROR [http-nio-8543-exec-9]-rest.AuthenticationService: Failed To Get VDR Info
java.lang.NullPointerException
        at com.emc.vdp2.common.appliance.ApplianceServiceImpl.getApplianceState(ApplianceServiceImpl.java:47)
        at com.emc.vdp2.services.VdrInfoServiceImpl.getVdrInfo(VdrInfoServiceImpl.java:171)
        at com.emc.vdp2.services.ApplianceInfoServiceWS.getVdrInfo(ApplianceInfoServiceWS.java:50)
        at com.emc.vdp2.rest.AuthenticationService.login(AuthenticationService.java:87)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)

The next piece of stack trace might vary, but if you see the above messages then you are bound to see a similar trace as below:

2018-02-08 10:04:44,727 INFO  [http-nio-8543-exec-9]-rest.AuthenticationService: Logging into appliance with type: vdp
2018-02-08 10:04:44,768 INFO  [http-nio-8543-exec-9]-connection.Mcsdk10StubManager: ServiceInstanceMoref desc=Service Id: urn:uuid:9FBE7B2DFEF05750401518080690404 name=urn:uuid:9FBE7B2DFEF05750401518080690404 value=SERVICE
2018-02-08 10:04:44,771 INFO  [http-nio-8543-exec-9]-connection.McAccessManager: Creating new mcsdk stub handler for connection key: [2091248218, Service Id: urn:uuid:9FBE7B2DFEF05750401518080690404] on Thread: [http-nio-8543-exec-9]
2018-02-08 10:04:44,849 ERROR [http-nio-8543-exec-9]-db.ApplianceStateDAO: ApplianceStateDAO.getApplianceState failed to execute ApplianceState query.
java.sql.SQLException: ERROR: relation "appliance_state" does not exist Query: select * from appliance_state Parameters: []
        at org.apache.commons.dbutils.AbstractQueryRunner.rethrow(AbstractQueryRunner.java:320)
        at org.apache.commons.dbutils.QueryRunner.query(QueryRunner.java:349)
        at org.apache.commons.dbutils.QueryRunner.query(QueryRunner.java:305)

Right after it initiates Authentication, it queries the vdr database. And in this case, appliance_state, table is missing from vdrdb. 

To connect to vdrdb on VDP, run:
# psql -p 5555 -U admin vdrdb

Type \d to list all tables. You should see 26 tables here:

                          List of relations
 Schema |                  Name                   |   Type   | Owner
--------+-----------------------------------------+----------+-------
 public | appliance_state                         | table    | admin
 public | compatibility                           | table    | admin
 public | container_group_membership              | table    | admin
 public | container_group_membership_id_seq       | sequence | admin
 public | email_report_settings                   | table    | admin
 public | entity_display_path                     | table    | admin
 public | entity_display_path_id_seq              | sequence | admin
 public | esx_hosts                               | table    | admin
 public | esx_hosts_id_seq                        | sequence | admin
 public | group_app_client_targets                | table    | admin
 public | group_app_client_targets_id_seq         | sequence | admin
 public | identity                                | table    | admin
 public | identity_id_seq                         | sequence | admin
 public | job_migration_history                   | table    | admin
 public | job_migration_history_id_seq            | sequence | admin
 public | locked_backup_retentions                | table    | admin
 public | mc_activity_monitor                     | table    | admin
 public | mc_replication_activity_monitor         | table    | admin
 public | user_log                                | table    | admin
 public | user_log_id_seq                         | sequence | admin
 public | v_vm_group_membership_by_container      | view     | admin
 public | vcenter_event_monitor                   | table    | admin
 public | vdp_migration_history                   | table    | admin
 public | vdp_migration_history_id_seq            | sequence | admin
 public | vm_group_membership_by_container        | table    | admin
 public | vm_group_membership_by_container_id_seq | sequence | admin

If you are missing one or more tables, the vdr service is not initialized and the connection fails.
To recreate the missing tables, open a case with VMware Support. I had to fix these tables manually. If someone has a better way, I'm open for suggestions.

Post recreating the tables, restart the tomcat service using:
# emwebapp.sh --restart
That's it!

Tuesday, 6 February 2018

SRM Plugin Not Available In Web Client

Today while working on a 6.1.1 fresh SRM deployment we were unable to see the Site Recovery Manager plugin in the web client. The first thing, we do in this case is to go to the Managed Object Browser page and check if the SRM extension is registered successfully. The URL for MOB page is https://vcenter-ip-or-fqdn/mob

Here we browse further to content > ExtensionManager. Under the properties section, we should have an SRM extension, which is com.vmware.vcDr, by default. If you have installed SRM with a custom identifier then you would see something like, com.vmware.vcDr-<your-custom-identifier-name>
In our case, the extension was available.

Next, looking at the web client logs, in our case a vCenter appliance, we noticed the following:

[2018-02-06T12:00:13.283+03:00] [ERROR] vc-extensionmanager-pool-81  70000046 100002 200001 com.vmware.vise.vim.extension.VcExtensionManager Package com.vmware.vcDr-custom was not installed!
Error downloading https://SRM-Local-IP:9086/srm-client.zip. Make sure that the URL is reachable then logout/login to force another download. java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:668)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)

So the vCenter was unable to pull the plugin manually from that URL. So under the plugin package folder we did not have any SRM plugin folder. The location to this plugin package folder on vCenter appliance is:
# /etc/vmware/vsphere-client/vc-packages/vsphere-client-serenity

Here you should have a "com.vmware.vcDr-<version-ID>" folder which in our case was missing. So we had to manually dump this package in this location.

To fix this:
1. Navigate to the URL from the log from a browser,  https://SRM-Local-IP:9086/srm-client.zip
This will prompt you for a download of the plugin zip file. Download this file and put into the above mentioned vsphere-client-serenity location via a WinSCP

2. Now, we will have to manually create this plugin folder. There are few catches to this.

If you are using default plugin identifier for SRM, then the naming convention would be:
com.vmware.vcDr-<srm-version-string>

If you are using custom identifier for SRM, then the naming convention would be:
com.vmware.vcDr-customName-<srm-version-string> 

How do you find this exact SRM version string?

A) Go back to the MOB page where you had left off in ExtensionManager. Click the com.vmware.vcDr extension. This will in turn open a new page.

B) Here click on the client under VALUE 

C) Now you can see the version string and the value. In a 6.1.1 SRM for example, the version string is 6.1.1.1317

So the plugin folder now will be:
Default:
com.vmware.vcDr-6.1.1.1317

Custom:
com.vmware.vcDr-custom-6.1.1.1317

3. Copy the zip file into this folder and then extract it. The outcome would be a plugin-package.xml and a plugins folder.

4. Restart the web client service for the vCenter. The command varies for 6.5 and 6.0 vCenter.

5. Re-login back to the web client once the web client loads up and you should have the plugin.

Hope this helps!

Thursday, 25 January 2018

SRM Service Crashes During A Recovery Operation With timedFunc BackTrace

In few scenarios when you run a test recovery or a planned migration, the SRM service will crash. This might happen when you run a specific recovery plan or any recovery plan.

If you look into the vmware-dr.log you will notice the following back-trace:

--> Panic: VERIFY d:\build\ob\bora-3884620\srm\public\functional/async/timedFunc.h:210
-->
--> Backtrace:
-->
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
xxxxxxxxxxxxxxxxxxxxx Cut Logs Here xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--> backtrace[36] ntdll.dll[0x000154E4]
--> [backtrace end]
-->

The timedFunc back-trace is seen when "Wait For VMware Tools" is set to 0 minutes and 0 seconds

And just about few lines top of this back trace you will see the faulty VM which caused this crash.

You will see something similar to:

2018-01-21T08:37:05.421-05:00 [44764 info 'VmDomain' ctxID=57d5ae61 opID=21076ff:c402:4147:d883] Waiting for VM '[vim.VirtualMachine:b2ab3f04-c72e-43ca-b93d-de1566e4de14:vm-323]' to reach desired powered state 'poweredOff' within '0' seconds.

The VM ID is given here. To find this VM ID you will need to go to the vCenter MOB page.

The way I found out to correlate this is:
1. Login to MOB page for vCenter (https://vcenter-ip/mob)
2. Content > group-d1 (Datacenters)
3. Respective datacenter under "Child Entity"
4. Then under vmFolder group-v4 (vm)
5. Expand childEntity and this will list out all the VMs in that vCenter.

My output was similar to:


The VM was CentOS7.2

> Then navigate to the Recovery plan in SRM
> Select the affected Recovery plan this VM is part of > Related Objects > Virtual Machines
> Right click this VM and select Configure Recovery

Here the Wait For VMware Tools were set to 0,0 timeout. We had to change this to a valid non zero value. 


Post this, the recovery plan completed fine without crashing the SRM service. This should ideally be fixed in the newer SRM releases as it would not let you set a 0 timeout. 

Hope this helps!

VDP Restore Fails: Virtual Machine Must Be Powered Off To Restore

There are few cases where you try to perform an in place restore of a virtual machine and it fails and complains the virtual machine is not powered off. The virtual machine is indeed powered off, however, the VDP does not understand this. You will see something like:


There might be two possible causes.

1. The MCS cache might not be updated with the power state.
Run the below command:
# mccli vmcache show --name=/vc-domain-name/VirtualMachines/<vm-name> | grep -i "power status"

Example command:
# mccli vmcache show --name=/cartman.southpark.local/VirtualMachines/Test-1 | grep -i "power status"

The output you should ideally see on a powered off VM is:
Power Status         poweredOff

In the above case, you might see:
Power Status         poweredOn

If this is the issue, then update the MCS VM cache by issuing the below command:
# mccli vmcache sync --name=/vc-domain-name/VirtualMachines/<vm-name>

Example command:
# mccli vmcache sync --name=/cartman.southpark.local/VirtualMachines/Test-1

Then the power state should be updated and the in place restore should work. 

2. The MCS cache might be updated but the tomcat is out of sync with MCS. 

In many cases the MCS and tomcat don't sync together. Due to this, the CLI will show one set of results and the GUI will say otherwise. To sync them up simply restart the tomcat service by issuing:
# emwebapp.sh --restart

Note that post restarting the tomcat service, it will take a while to connect the appliance back in the web client as it has to rebuild the cache. 
If you tail the vdr-server.log located under /usr/local/avamar/var/vdr/server_logs/vdr-server.log then once you see the below logging, it would indicate the connection has completed successfully:

2018-01-25 09:05:57,566 INFO  [Timer_PhoneHomeCollect]-schedule.PhonehomeCollectTask: Writing Phome data to location  /usr/local/avamar/var/vdr/phonehome/vdp_state
2018-01-25 09:05:57,567 INFO  [Timer_PhoneHomeCollect]-schedule.PhonehomeCollectTask: Writing Phome data to location  /usr/local/avamar/var/vdr/phonehome/vdp_state

The restore should then work successfully. 

If it still continues to fail, then try performing a restore to a different location as a workaround. 

Hope this helps!