Tuesday, 24 July 2018

Understanding Perfbeat Logging In GSAN

If you have ever gone through the GSAN logs in VDP, located under /data01/cur, you will sometimes notice logging like the below:

2018/05/28-10:34:31.01397 {0.0} [perfbeat.3:196]  WARN: <1060> perfbeat::outoftolerance mask=[gc,flush,gcmark] average=1397.27 limit=139.7273 mbpersec=0.79
2018/05/28-10:35:38.67619 {0.0} [perfbeat.2:194]  WARN: <1060> perfbeat::outoftolerance mask=[gc,flush,gcmark] average=53.72 limit=5.3722 mbpersec=0.88
The perfbeat outoftolerance warning is logged against various processes. In the above example, the tasks running are garbage collection, flush and GC mark (the mask field); it can equally be hfscheck, backup, restore and so on. You will typically see this logging whenever that particular task is performing slowly, causing the respective maintenance or backup jobs to take a long time to complete. So if a backup, restore or file system check is taking a suspiciously long time to complete, this is the best place to look.
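
To get a quick feel for how often these warnings occur, you can simply grep for them. A minimal sketch, assuming the GSAN logs live under /data01/cur as mentioned above (the exact log file names vary, hence the recursive grep):

# grep -rc "perfbeat::outoftolerance" /data01/cur/ | grep -v ":0$"
# grep -rh "perfbeat::outoftolerance" /data01/cur/ | tail -5

The first command shows the per-file hit count, and the second prints the last few occurrences so you can see which mask (task) is affected and how low mbpersec is dropping.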

At a high level, GSAN compares the current performance of a task against an average measured over a preceding period.

A simple explanation of the above logging, using the second line as an example: the average throughput for the tasks within the mask [] was 53.72, measured over a period of time. The limit is 10 percent of that average (10 percent of 53.72 is 5.3722), and the warning is logged because the current throughput, mbpersec=0.88, has dropped below that limit.
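
Plugging in the numbers from the first log line above makes the relationship clearer (the values are throughput figures, in line with the mbpersec field):

average  = 1397.27                  (measured average for the gc/flush/gcmark tasks)
limit    = 1397.27 x 0.10 = 139.7273
mbpersec = 0.79                     (current rate, well below the limit, hence the WARN)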

This indicates that there is stress on the underlying storage, or that something is wrong with that particular storage in terms of performance. Since VDP runs as a virtual machine, the troubleshooting flow would be:

> Check the load on the VDP appliance itself. See if there is unusual load on the system and, if yes, determine whether a process is hogging the resources (a few starting-point commands are sketched after this list).
> If the VM level checks out, then see if there are any issues with the DAVG or the VMFS file system. Perhaps there are multiple high I/O VMs running on this storage and resource contention is occurring? I would start with the vobd.log and vmkernel.log for that particular datastore naa.ID and then verify the device average (DAVG) for that device.
> If this checks out too, then the last part would be the storage array itself. Moving VDP to another datastore is not an ideal test, since these appliances are fairly large in size.
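
A minimal sketch of those first two checks; the naa ID below is a placeholder for the device backing the VDP datastore:

On the VDP appliance itself:
# uptime
# top

On the ESXi host running the VDP VM:
# grep "naa.xxxxxxxxxxxxxxxx" /var/log/vobd.log /var/log/vmkernel.log
# esxtop

In esxtop, press u to switch to the disk device view and keep an eye on the DAVG/cmd column for the device in question.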

Hope this helps!

Sunday, 8 July 2018

Unable To Pair SRM Sites: "Server certificate chain not verified"

So first things first: as of this post, you might know that I have moved on from VMware and ventured further into the backup and recovery solutions domain. Currently, I work as a Solutions Engineer at Rubrik.

There are a lot of instances where you are unable to manage anything in Site Recovery Manager, regardless of the version (this is also applicable to vSphere Replication), and the common error that pops up at the bottom right of the web client is "Server certificate chain not verified":

Failed to connect to vCenter Server at vCenter_FQDN:443/sdk. Reason:
com.vmware.vim.vmomi.core.exception CertificateValidationException: Server certificate chain not verified.

This article will cover only the embedded Platform Services Controller deployment model; similar logic can be extrapolated to external deployments.

These issues are typically seen when:
> PSC is migrated from embedded to external
> Certificates are replaced for the vCenter

I will be simplifying this KB article here for reference. Before proceeding, take a powered-off snapshot of the PSC and vCenter Server nodes involved.

So, for an embedded deployment of the VCSA:

1. SSH into the VCSA and run the below command:
# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --no-check-cert --ep-type com.vmware.cis.cs.identity.sso 2>/dev/null

This command will give you the SSL trust that is currently stored in your PSC's lookup service. Now, consider that you are using an embedded PSC deployment in production and another embedded deployment in DR (no Enhanced Linked Mode). In this case, when you run the above command, you are expected to see just one output, where the URL section is the FQDN of your current PSC node, and associated with it is its current SSL trust.

URL: https://current-psc.vmware.local/sts/STSService/vsphere.local
SSL trust: MIIDWDCCAkCgAwIBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...Reducing output...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10ggClaP8=
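
If you want a quick count of how many registrations came back, you can filter the same output on the URL: lines shown above; for a standalone embedded deployment, anything more than one entry points you towards step (3):

# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --no-check-cert --ep-type com.vmware.cis.cs.identity.sso 2>/dev/null | grep -c "URL:"

Dropping the -c prints the URLs themselves instead of just the count.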

If this is your case, proceed to step (2); if not, jump to step (3).

2. Run the next command:
# echo | openssl s_client -connect localhost:443

This shows the certificate that is actually in use by your deployment after the certificate replacement. Here, look at the extract of the output that shows the server certificate:

Server certificate
-----BEGIN CERTIFICATE-----
MIIDWDCCyAHikleBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10fGhhDDqm=
-----END CERTIFICATE-----

So from here, the chain obtained from the first command (the SSL trust stored in the PSC lookup service) does not match the chain from the second command (the certificate actually in use). Due to this mismatch, you see the chain not verified message in the UI.
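
Rather than eyeballing the two Base64 blobs, you can compare thumbprints. A small sketch, run on the VCSA itself, that prints the SHA1 thumbprint of the certificate currently being served on port 443:

# echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -fingerprint -sha1

The thumbprint of the stale SSL trust from step 1 can be generated the same way once it is wrapped into a certificate file, which is exactly what steps A) through E) below achieve.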

To fix this, the logic is: find all the services registered with the thumbprint of the old SSL trust (step 1) and update them with the thumbprint of the certificate from step 2. The steps in the KB article are a bit confusing, so this is what I follow to fix it.

A) Copy the SSL trust you obtained from the first command into Notepad++: everything starting from MIID... in my case (there is no need to include the "SSL trust:" label itself).

B) Each line of the chain should contain 64 characters. In Notepad++, place the cursor after a character and check what the Col indicator at the bottom reads; the cursor sits at Col: 65 right after the 64th character, so hit Enter at that mark.
Format the complete chain this way (the last line may have fewer than 64 characters, which is okay). A shell alternative to steps A) through E) is sketched after this procedure.

C) Add -----BEGIN CERTIFICATE----- before the chain and -----END CERTIFICATE----- after it (five hyphens are used on each side).

D) Save the Notepad++ document with a .cer extension.

E) Open the certificate file that you saved and navigate to Details > Thumbprint. You will notice a string of hexadecimal characters with a space after every 2 characters. Copy this into Notepad++ and replace each space with a :, so you end up with a thumbprint similar to: 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88

F) Next, we will export the current certificate using the below command
# /usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output /certificates/new_machine.crt

This will export the current certificate to the /certificates directory. 

G) Run the update thumbprint option using the below command:
# cd /usr/lib/vmidentity/tools/scripts/
# python ls_update_certs.py --url https://FQDN_of_Platform_Services_Controller/lookupservice/sdk --fingerprint Old_Certificate_Fingerprint --certfile New_Certificate_Path_from_/Certificates --user Administrator@vsphere.local --password 'Password'

So a sample command would be:
# python ls_update_certs.py --url https://vcsa.vmware.local/lookupservice/sdk --fingerprint 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 --certfile /certificates/new_machine.crt --user Administrator@vsphere.local --password 'Password'

This will look for all the services registered with the old thumbprint (the one ending in 04:83:88) and update them with the current thumbprint from new_machine.crt.

Note: once you paste the thumbprint into the SSH session, remove any extra spaces at the beginning and end of the thumbprint. I have seen the update fail because of this, as the paste picks up special characters in a few cases (characters that you would not see in the terminal). So remove the space after --fingerprint and re-add it, and do the same before the --certfile switch too.

Re-run the commands in steps 1 and 2 and the SSL trust should now match. If it does, log back into the web client and you should be good to go.
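
For reference, the manual formatting in steps A) through E) can also be done directly on the appliance with fold and openssl instead of Notepad++ and the Windows certificate viewer. A minimal sketch, assuming the Base64 SSL trust from step 1 has been pasted as a single line into /tmp/old_trust.b64 (the file name is just an example):

# { echo "-----BEGIN CERTIFICATE-----"; fold -w 64 /tmp/old_trust.b64; echo "-----END CERTIFICATE-----"; } > /tmp/old_trust.cer
# openssl x509 -in /tmp/old_trust.cer -noout -fingerprint -sha1
# openssl x509 -in /certificates/new_machine.crt -noout -fingerprint -sha1

The first fingerprint is the old one you pass to --fingerprint in step G (the colons are already in place), and the second, taken from the certificate exported in step F, is what the lookup service should report once the update has gone through.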

--------------------------------------------------------------------------------------------------------------------------
3. In this case, you might see two PSC URL entries in the output of the step 1 command. Both entries have the same PSC URL, and they might or might not have the same SSL trust.

Case 1:
Multiple entries with the same URL but different SSL trusts.

This one is easy. In the output from step 1, you will see two entries with the same PSC URL but different SSL trusts, and one of those SSL trusts will match the current certificate from step 2. This means the one that does not match is stale and can be removed from the STS store.

You can remove the stale entry from the CLI; however, I stick to using the JXplorer tool to remove it from the GUI. You can connect to the PSC from JXplorer using this KB article here.

Once connected, navigate to Configuration > Sites > LookupService > Service Registrations. 
One of the fields in the output of the step 1 command is the Service ID, which is something similar to:
Service ID: 04608398-1493-4482-881b-b35961bf5141

Locate the matching service ID under the service registrations and you should be good to remove it.
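
If you want to double-check the entry from the CLI before deleting anything in JXplorer, vmdir can be queried over LDAP. This is only a sketch and makes a few assumptions: the SSO domain is vsphere.local, the ldapsearch client shipped under /opt/likewise/bin is available on the PSC, and the registration entries are named after the Service ID:

# /opt/likewise/bin/ldapsearch -h localhost -p 389 -x -D "cn=administrator,cn=users,dc=vsphere,dc=local" -w 'Password' -b "cn=Configuration,dc=vsphere,dc=local" "(cn=04608398-1493-4482-881b-b35961bf5141)" dn

If the query returns a DN, that is the same object you would see under Service Registrations in JXplorer.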

Case 2:
Multiple entries with the same URL and the same SSL trust.

In this case, the output from step 1 shows the same PSC URL twice, along with the same SSL trust, and these might or might not match the output from step 2.

The fix in this case is as follows:

Note down both of the service IDs from the output of step 1 and connect JXplorer as mentioned above. Select a service ID, and on the right side switch to the Table Editor view and click Submit; you can then view the last modified date of that service registration. The service ID with the older last-modified date is the stale registration and can be removed via JXplorer. Now, when you run the command from step 1 again, it should return a single entry. If its SSL trust matches the certificate from step 2, great! If not, then the additional step of updating the thumbprint (steps A through G above) needs to be performed.
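
If you went down the ldapsearch route from Case 1, the same query can also be asked for the timestamps that JXplorer shows in the Table Editor, which makes it easy to spot the older of the two registrations. Whether vmdir returns these operational attributes for your build is an assumption here, so treat JXplorer as the source of truth (the Service_ID_from_step_1 value is a placeholder):

# /opt/likewise/bin/ldapsearch -h localhost -p 389 -x -D "cn=administrator,cn=users,dc=vsphere,dc=local" -w 'Password' -b "cn=Configuration,dc=vsphere,dc=local" "(cn=Service_ID_from_step_1)" createTimestamp modifyTimestamp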

In the event of an external PSC deployment, say one PSC in the production site and one in the recovery site in Enhanced Linked Mode, the command from step 1 is expected to return two entries with two different URLs (the production and DR PSCs), since they are replicating. This will of course change if there are multiple PSCs replicating, with or without a load balancer. That process is too complex to explain in text, so in that event it would be best to involve VMware Support for assistance.

Hope this helps!