Tuesday, 24 July 2018

Understanding Perfbeat Logging In GSAN

If you have ever looked through the GSAN logs in VDP, located under /data01/cur, you may have noticed logging like the following:

2018/05/28-10:34:31.01397 {0.0} [perfbeat.3:196]  WARN: <1060> perfbeat::outoftolerance mask=[gc,flush,gcmark] average=1397.27 limit=139.7273 mbpersec=0.79
2018/05/28-10:35:38.67619 {0.0} [perfbeat.2:194]  WARN: <1060> perfbeat::outoftolerance mask=[gc,flush,gcmark] average=53.72 limit=5.3722 mbpersec=0.88
The perfbeat outoftolerance warning is logged against various processes. In the above example, the tasks running are garbage collection and flush. It can also be hfscheck, backup, restore and so on. You will see this logging whenever that particular task is performing slowly, which causes the respective maintenance or backup jobs to take a long time to complete. If a backup, a restore or a filesystem check is taking a suspiciously long time to complete, this is a good place to look.
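To see which tasks are being flagged most often, the warnings can be pulled out of a log with a small script. The following is a minimal Python sketch, not a VDP tool. The regular expression is built only from the two sample lines above, so other GSAN builds may format the line differently, and the function names parse_perfbeat and count_by_task are made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the two sample lines in this post (an assumption,
# not an official GSAN log format definition).
PERFBEAT_RE = re.compile(
    r"perfbeat::outoftolerance\s+mask=\[(?P<mask>[^\]]*)\]\s+"
    r"average=(?P<average>[\d.]+)\s+"
    r"limit=(?P<limit>[\d.]+)\s+"
    r"mbpersec=(?P<mbpersec>[\d.]+)"
)


def parse_perfbeat(line):
    """Return the perfbeat fields of one log line, or None if it is not a perfbeat warning."""
    match = PERFBEAT_RE.search(line)
    if match is None:
        return None
    return {
        "tasks": match.group("mask").split(","),
        "average": float(match.group("average")),
        "limit": float(match.group("limit")),
        "mbpersec": float(match.group("mbpersec")),
    }


def count_by_task(lines):
    """Count how many perfbeat warnings mention each task name."""
    counts = Counter()
    for line in lines:
        record = parse_perfbeat(line)
        if record is not None:
            counts.update(record["tasks"])
    return counts
```

Feeding the lines of a GSAN log file into count_by_task gives a quick picture of whether the slowness is tied to one task or spread across all of them.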

At a high level, GSAN compares the current performance against the average performance it measured previously over a period of time.

A simple explanation of the above logging is this. The average performance for the tasks within [] was 53.72, measured over a period of time. The limit is 10 percent of that average (10 percent of 53.72 is 5.372), and the current mbpersec of 0.88 is well below that limit.
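The arithmetic can be checked with a few lines of Python. This is only a sketch of the relationship visible in the sample lines: the 10 percent fraction matches both samples, but the rule that the warning fires when mbpersec drops below the limit is my reading of the log, not something taken from GSAN documentation. The function names are made up for illustration.

```python
def tolerance_limit(average, fraction=0.10):
    """Limit as a fraction of the measured average (10 percent in the sample lines)."""
    return average * fraction


def is_out_of_tolerance(mbpersec, average, fraction=0.10):
    """Assumed rule: the current rate is out of tolerance when it is below the limit."""
    return mbpersec < tolerance_limit(average, fraction)


# Values from the second sample line: average=53.72, mbpersec=0.88
print(round(tolerance_limit(53.72), 3))
print(is_out_of_tolerance(0.88, 53.72))
```

The same check holds for the first sample line, where 10 percent of 1397.27 is about 139.727 and the current rate is only 0.79.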

This indicates that the underlying storage is under stress, or that something is wrong with that particular storage in terms of performance. Since VDP runs as a virtual machine, the troubleshooting flow would be:

> Check the load on the VDP itself. See if there is unusual load on the system and, if yes, determine whether a process is hogging the resources.
> If the VM level checks out, then see if there are any issues with the DAVG or the VMFS file system. Perhaps there are multiple high I/O VMs running on this storage and resource contention is occurring? I would start with the vobd.log and vmkernel.log for that particular datastore naa.ID and then verify the Device Avg for that device.
> If this checks out too, then the last part would be the storage array itself. Moving VDP to another datastore is not an ideal test since these appliances are fairly large in size.

Hope this helps!