Wednesday, 6 December 2017

SRM Service Crashes After A Failed Recovery With "abrRecoveryEngine" Backtrace

In some instances, when you are running array-based replication with SRM, a failed planned migration can cause the SRM service to crash. The vmware-dr.log on the SRM machine then shows the following backtrace:

2017-12-06T09:55:38.620-05:00 panic vmware-dr[06076] [Originator@6876 sub=Default] 
--> 
--> Panic: Assert Failed: "ok (Dr::Providers::Abr::AbrRecoveryEngine::AbrRecoveryEngineImpl::LoadFromDb: Unable to insert post failover info object 212337205 for group vm-protection-group-121101624 array pair array-pair-7065)" @ d:/build/ob/bora-6014840/srm/src/providers/abr/common/abrRecoveryEngine/abrRecoveryEngine.cpp:244
--> Backtrace:
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
--> backtrace[00] vmacore.dll[0x001F29FA]
--> backtrace[01] vmacore.dll[0x00067D60]
--> backtrace[02] vmacore.dll[0x0006A20E]
--> backtrace[03] vmacore.dll[0x002245A7]
--> backtrace[04] vmacore.dll[0x00224771]
--> backtrace[05] vmacore.dll[0x00059C0D]
--> backtrace[06] dr-abr-recoveryEngine.dll[0x00028A91]
--> backtrace[07] dr-abr-recoveryEngine.dll[0x00015199]
--> backtrace[08] dr-abr-recoveryEngine.dll[0x002DB368]
--> backtrace[09] dr-abr-recoveryEngine.dll[0x002DB913]
--> backtrace[10] vmacore.dll[0x001D6ACC]
--> backtrace[11] vmacore.dll[0x001865AB]
--> backtrace[12] vmacore.dll[0x0018759C]
--> backtrace[13] vmacore.dll[0x002202E9]
--> backtrace[14] MSVCR120.dll[0x00024F7F]
--> backtrace[15] MSVCR120.dll[0x00025126]
--> backtrace[16] KERNEL32.DLL[0x000013D2]
--> backtrace[17] ntdll.dll[0x000154E4]
--> [backtrace end]

This is seen when there are issues unmounting or demoting the source datastore.

Disclaimer: Modifying database tables is not supported by VMware. Do this at your own risk.

The fix is:

1. Make sure the SRM service is stopped on both sites.
2. Back up the SRM databases on both sites.
3. Log in to the database using pgAdmin or SQL Server Management Studio, depending on the type of database used.
4. Open the table "pda_grouppostfailoverinfo".
5. Delete the row whose db_id matches the object ID from the backtrace. In my case it is 212337205.
6. Once this is done, start the SRM service. If it crashes again, the backtrace usually reports another object ID; repeat the process for that ID.

And that should be it.
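
Steps 5 and 6 can be sketched in shell, assuming the vmware-dr.log has been copied to a machine with a POSIX shell. The function names below are hypothetical, and the DELETE statement is only an illustration built from the table and column named in the steps. It is printed, not executed, so it can be reviewed against your backup first.

```shell
#!/bin/sh
# Hypothetical helper: print every object ID reported by the
# "Unable to insert post failover info object" assert in a log file.
failed_object_ids() {
    sed -n 's/.*Unable to insert post failover info object \([0-9][0-9]*\) for group.*/\1/p' "$1" | sort -u
}

# Print (do not execute) one DELETE per ID found in the log.
# Assumes db_id is the numeric column described in step 5.
print_cleanup_sql() {
    failed_object_ids "$1" | while read -r id; do
        printf 'DELETE FROM pda_grouppostfailoverinfo WHERE db_id = %s;\n' "$id"
    done
}
```

Running `print_cleanup_sql vmware-dr.log` after each crash lists the statements for any newly reported IDs.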

Thursday, 30 November 2017

Unable To Protect a VM In SRM: "Object not found"

There is a rare instance where you are unable to protect a VM and the error thrown is:
Internal error: class Vmacore::NotFoundException "Object not found"

Under Protection Groups > Related Objects > Virtual Machines, you will see the VM coming up as Not Configured.


When you right-click the VM and select Configure Protection, the Device Status shows as Non-replicated.



If you then browse the recovery location and provide the path of the replicated VMDK, you run into this error.

In the web client logs, you will see:

[2017-11-28T09:27:50.156-06:00] [ERROR] srm-client-thread-1253 70015389 101315 201173 com.vmware.srm.client.infraservice.tasks.FakeTaskImpl [DrVmodlFakeTask:srm-fake-task-11:fake-server-guid]: com.vmware.vim.binding.dr.fault.DrRuntimeFault: Task Failed
at com.vmware.srm.client.infraservice.util.ExceptionUtil.newRuntimeFault(ExceptionUtil.java:92)
at com.vmware.srm.client.infraservice.util.ExceptionUtil.newRuntimeFault(ExceptionUtil.java:68)
at com.vmware.srm.client.infraservice.tasks.MultiTaskProgressUpdaterImpl.getSingleError(MultiTaskProgressUpdaterImpl.java:89)
at com.vmware.srm.client.infraservice.tasks.MultiTaskProgressUpdaterImpl.updateProgress(MultiTaskProgressUpdaterImpl.java:222)
at com.vmware.srm.client.infraservice.tasks.MultiTaskProgressUpdaterImpl$3.run(MultiTaskProgressUpdaterImpl.java:431)
at $java.lang.Runnable$$FastClassByCGLIB$$36fc6471.invoke(<generated>)
at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:149)
at com.vmware.srm.client.topology.impl.osgi.aop.HttpRequestContextAdvice$CallInterceptor.intercept(HttpRequestContextAdvice.java:53)
at com.vmware.srm.client.topology.impl.osgi.aop.HttpRequestContextAdvice$Base$$EnhancerByCGLIB$$b6ab80b4.run(<generated>)
at com.vmware.srm.client.infraservice.tasks.MultiTaskProgressUpdaterImpl$4.run(MultiTaskProgressUpdaterImpl.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.vmware.vim.binding.dr.fault.InternalError: Internal error: class Vmacore::NotFoundException "Object not found"
[context]zKq7AVMEAQAAAHjHWwAUdm13YXJlLWRyAACoLwpkci1yZXBsaWNhdGlvbi5kbGwAAGEbCgASaT8AAy5BAOv/QACT9EABuSMCY29ubmVjdGlvbi1iYXNlLmRsbAABx3QCAccrAgGg8AABPUMBAccrAgGSLgMBdwgDARb3AgHHKwIBuSMCAXcIAwEW9wIBxysC[/context].
at sun.reflect.GeneratedConstructorAccessor614.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)



One of the causes is corrupt or incorrect entries in the source VM's VMX file, so let's have a look at that file.

I will be looking for lines in this file that reference a datastore path, like:

vmx.log.filename = "/vmfs/volumes/58780b1d-045e1100-0efa-0025b5e01a45/Test-1/vmware.log"
sched.swap.derivedName = "/vmfs/volumes/59a30e4d-647fd9f2-2e66-000c295e9f61/Test-1/Test-1-932448b9.vswp"

I have two UUIDs here, 58780b1d-045e1100-0efa-0025b5e01a45 and 59a30e4d-647fd9f2-2e66-000c295e9f61

But, when I run:

[root@Wendy:/vmfs/volumes/59a30e4d-647fd9f2-2e66-000c295e9f61/Test-1] esxcfg-scsidevs -m
mpx.vmhba1:C0:T0:L0:3                                            /vmfs/devices/disks/mpx.vmhba1:C0:T0:L0:3 599ffcb3-d9ece508-7576-000c295e9f61  0  Wendy-Local
mpx.vmhba1:C0:T1:L0:1                                            /vmfs/devices/disks/mpx.vmhba1:C0:T1:L0:1 59a30e4d-647fd9f2-2e66-000c295e9f61  0  VDP-Storage

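The host has only these two datastore UUIDs. Of the two UUIDs in the VMX file, 59a30e4d-647fd9f2-2e66-000c295e9f61 matches VDP-Storage, but 58780b1d-045e1100-0efa-0025b5e01a45 does not match any mounted datastore. Such incorrect references cause the device status to show as non-replicated, which in turn causes issues with VM protection.
You might have one or more such entries in the VMX file.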
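
With many entries, the comparison can be scripted. This is a minimal sketch, not something from the SRM tooling: the function names are hypothetical, and it assumes the mounted UUIDs have been saved one per line to a file (the third column of the esxcfg-scsidevs -m output shown above).

```shell
#!/bin/sh
# List the unique datastore UUIDs referenced by /vmfs/volumes/ paths in a VMX file.
vmx_datastore_uuids() {
    grep -o '/vmfs/volumes/[0-9a-f]\{8\}-[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}' "$1" \
        | cut -d/ -f4 | sort -u
}

# Print the VMX UUIDs that are absent from a file of mounted UUIDs (one per line).
stale_vmx_uuids() {
    vmx_datastore_uuids "$1" | while read -r uuid; do
        grep -qxF "$uuid" "$2" || echo "$uuid"
    done
}
```

For example: `esxcfg-scsidevs -m | awk '{print $3}' > /tmp/mounted.txt` followed by `stale_vmx_uuids Test-1.vmx /tmp/mounted.txt`. Paths that use a datastore name instead of a UUID are ignored by this sketch.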

Power off the virtual machine on the source site, back up the VMX file, and then edit it so that each path uses the UUID of the datastore where the VM (or the respective file) actually resides. In my case the Test-1 VM runs on VDP-Storage, which is 59a30e4d-647fd9f2-2e66-000c295e9f61.

So the new VMX entries look like this:

vmx.log.filename = "/vmfs/volumes/59a30e4d-647fd9f2-2e66-000c295e9f61/Test-1/vmware.log"
sched.swap.derivedName = "/vmfs/volumes/59a30e4d-647fd9f2-2e66-000c295e9f61/Test-1/Test-1-932448b9.vswp"

Reload the VMX using:

# vim-cmd vmsvc/reload <vm-id>

The vm-id can be obtained from

# vim-cmd vmsvc/getallvms
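
If you would rather not read the ID off the list by eye, a small filter can pick it out. This is a rough sketch with a hypothetical function name. It assumes the usual getallvms layout (a header line, then Vmid in the first column and Name in the second) and a VM name without spaces.

```shell
#!/bin/sh
# Print the Vmid (first column) of the row whose Name (second column)
# matches exactly. Reads `vim-cmd vmsvc/getallvms` output on stdin.
vmid_for_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}
```

Usage would be along the lines of `vim-cmd vmsvc/getallvms | vmid_for_name Test-1`, with the result passed to vim-cmd vmsvc/reload.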

Then power on the VM, right-click it in the protection group, and configure protection again. This time the hard disk status is displayed as replicated.


And that's pretty much it. This is usually seen when vmware.log files are configured to a different datastore and that datastore is no longer available.

Hope this helps.

Wednesday, 8 November 2017

VDP Expired Certificate

There have been a lot of issues with VDP deployment due to an expired certificate issued to the OVF template.

Basically, if you are running vCenter 6.5, the Web Client is the only option to deploy OVA files, and you cannot move past the step where the certificate is displayed as expired. If you are using a pre-6.5 vCenter, you can deploy the OVA through the Windows C# client. Even though it reports an "Invalid" certificate, you can still click Next and proceed further.

If you are on 6.5, then the workaround is this:
1. Download the required version of VDP Server. All of them have certificates that expired around September.
2. Use a utility such as 7-Zip to extract the OVA template. This gives you 4 files: the VMDK, OVF, MF and CER.
3. In the Web Client, when you deploy the OVA, you can multi-select files. Select the 3 files (VMDK, OVF and MF) and leave out the .cer file.
4. The deployment then displays No Certificate and lets you proceed further.
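
Since an OVA is a tar archive, step 2 can also be done from a shell instead of 7-Zip. The sketch below is only an alternative under that assumption; the function name and file names are hypothetical.

```shell
#!/bin/sh
# Extract an OVA (a tar archive) into a directory, leaving out the .cer file.
extract_ova_without_cert() {
    ova=$1
    dest=$2
    mkdir -p "$dest" && tar --exclude='*.cer' -xf "$ova" -C "$dest"
}
```

For example, `extract_ova_without_cert vdp.ova ./vdp-files` leaves only the VMDK, OVF and MF files to select in the deployment wizard.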

This certificate is signed just for the OVA template and not for any particular port / service for the VDP itself.

EMC is currently working to update the certificate information for these templates. Hope this helps!

Monday, 28 August 2017

Bash Script To Extract vSphere Replication Job Information

Below is a bash script that extracts replication information for the configured VMs. It displays the name of the virtual machine and whether Network Compression and Quiesce Guest OS are enabled, followed by the RPO and the datastore MoRef ID. The RPO is shown in minutes because "bc" is not available on the vR SUSE appliance to do the floating-point conversion to hours.

The complete updated script can be accessed from my GitHub Repo:
https://github.com/happycow92/shellscripts/blob/master/vR-jobs.sh

As and when I add more information or reformat the output, the script in the link will be updated.

#!/bin/bash
clear
echo -e " -----------------------------------------------------------------------------------------------------------"
echo -e "| Virtual Machine | Network Compression | Quiesce | RPO | Datastore MoRef ID |"
echo -e " -----------------------------------------------------------------------------------------------------------"
cd /opt/vmware/vpostgres/9.3/bin
./psql -U vrmsdb << EOF
\o /tmp/info.txt
select name from groupentity;
select networkcompressionenabled from groupentity;
select rpo from groupentity;
select quiesceguestenabled from groupentity;
select configfilesdatastoremoid from virtualmachineentity;
EOF
cd /tmp
# For each query: start after the column header, skip the dashed separator,
# and stop at psql's "(N rows)" footer.
name_array=($(awk '/^ *name *$/{i=1;next}/^\([0-9]+ rows?\)/{i=0}{if (i==1){i++;next}}i' info.txt))
compression_array=($(awk '/^ *networkcompressionenabled *$/{i=1;next}/^\([0-9]+ rows?\)/{i=0}{if (i==1){i++;next}}i' info.txt))
quiesce_array=($(awk '/^ *quiesceguestenabled *$/{i=1;next}/^\([0-9]+ rows?\)/{i=0}{if (i==1){i++;next}}i' info.txt))
rpo_array=($(awk '/^ *rpo *$/{i=1;next}/^\([0-9]+ rows?\)/{i=0}{if (i==1){i++;next}}i' info.txt))
datastore_array=($(awk '/^ *configfilesdatastoremoid *$/{i=1;next}/^\([0-9]+ rows?\)/{i=0}{if (i==1){i++;next}}i' info.txt))
length=${#name_array[@]}
for ((i=0;i<length;i++));
do
printf "| %-32s | %-23s | %-10s | %-10s| %-20s|\n" "${name_array[$i]}" "${compression_array[$i]}" "${quiesce_array[$i]}" "${rpo_array[$i]}" "${datastore_array[$i]}"
done
done
rm -f info.txt
echo && echo
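
Even without "bc", the RPO in minutes can be shown as whole hours and leftover minutes using the shell's integer arithmetic. This is a small sketch of that idea, not part of the script above, and the function name is hypothetical.

```shell
#!/bin/sh
# Convert an RPO given in minutes to "Hh MMm" using integer arithmetic only.
rpo_to_hours() {
    printf '%dh %02dm\n' $(($1 / 60)) $(($1 % 60))
}
```

For instance, `rpo_to_hours 90` prints `1h 30m`.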

For any questions, do let me know. Hope this helps. Thanks.

Wednesday, 9 August 2017

Bash Script To Export VDP Backup Job Details

You can use this script to export your current backup and replication job configurations to a text file and save it to your local desktop. If you ever have to redeploy and do not remember the backup configuration, you can refer to the exported text file.

The script exports the job name, state of the job, clients in the job, schedule, retention and type.
It currently does not export agent-level backup jobs such as SQL, Exchange and SharePoint.

The script needs the MCS service to be up, as it relies on it. I am planning to export the details from psql instead, which can be used even when MCS is down.

This is what I have for right now. The script can be accessed from the below link:
https://github.com/happycow92/shellscripts/blob/master/backup-job-detail.sh

Suggestions and bugs are always welcome. Drop a comment for anything.

Hope this helps!

Sunday, 30 July 2017

Bash Script To Determine Retired Clients

While VDP has a built-in feature to list unprotected VMs (that is, VMs not added to any VDP backup job), you might need a script to determine whether VMs have gone missing from a backup job.

The script has a simple algorithm:
> The first time it runs, it creates a file containing the list of all protected clients.
> The next time it runs, it checks what is missing compared to that protected client list.
> Newly added VMs are not considered missing; however, on the next iteration the script checks whether the new clients are missing.
> If you remove the generated protected list after your second execution, the third iteration reports nothing, as it generates a new protected client list.
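
The comparison at the heart of this can be sketched with comm -23, which prints only the lines unique to the first of two sorted files. The function name and file names here are hypothetical, and this is a simplified illustration of the idea.

```shell
#!/bin/sh
# Print clients present in the baseline list but absent from the current list.
# Both files must be sorted, as comm requires.
missing_clients() {
    comm -23 "$1" "$2"
}
```

With a baseline of vm-a, vm-b, vm-c and a current list of vm-a, vm-c, vm-d, only vm-b is reported; the newly added vm-d is not.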

The script has an email feature to send the output to a mailing address. If you want to exclude this, discard line 21 to line 32. To run the script as a cron job, you can add it with crontab -e, but the script then cannot prompt for an email address interactively. You will have to define a constant for your email address and use it in the EOF block.

The script can be accessed from my repository here:
https://github.com/happycow92/shellscripts/blob/master/missing-client.sh

The code {}

#!/bin/bash
IFS=$(echo -en "\n\b")
FILE=/tmp/protected_client.txt
if [ ! -f $FILE ]
then
client_list=$(mccli client show --recursive=true | grep -i /$(cat /usr/local/vdr/etc/vcenterinfo.cfg | grep vcenter-hostname | cut -d '=' -f 2)/VirtualMachines | awk -F/ '{print $(NF-2)}')
echo "$client_list" &> /tmp/protected_client.txt
sort /tmp/protected_client.txt -o /tmp/protected_client.txt
else
new_list=$(mccli client show --recursive=true | grep -i /$(cat /usr/local/vdr/etc/vcenterinfo.cfg | grep vcenter-hostname | cut -d '=' -f 2)/VirtualMachines | awk -F/ '{print $(NF-2)}')
echo "$new_list" &> /tmp/new_list.txt
sort /tmp/new_list.txt -o /tmp/new_list.txt
# -23 keeps only the clients that were in the protected list but are gone now
missing=$(comm -23 /tmp/protected_client.txt /tmp/new_list.txt)
if [ -z "$missing" ]
then
printf "\nNo clients missing\n"
else
printf "\nMissing Client is:\n" | tee -a /tmp/email_list.txt
printf '%s\n\n' "$missing" | tee -a /tmp/email_list.txt
printf "Emailing the list\n"
FILE=/tmp/email_list.txt
read -p "Enter Your Email: " TO
FROM=admin@$(hostname)
(cat - $FILE)<< EOF | /usr/sbin/sendmail -f $FROM -t $TO
Subject: Missing VMs from Jobs
To: $TO
EOF
sleep 2s
printf "\nEmail Sent. Exiting Script\n\n"
fi
rm /tmp/new_list.txt
rm -f /tmp/email_list.txt
fi
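
For the cron case mentioned before the script, the interactive prompt can be swapped for a constant. A minimal sketch, assuming a placeholder address (admin@example.com) and a hypothetical build_message helper:

```shell
#!/bin/sh
# Hypothetical constant replacing the interactive "read -p" prompt.
TO="admin@example.com"

# Write the full message (headers, blank line, then the body file) to stdout.
build_message() {
    printf 'Subject: Missing VMs from Jobs\nTo: %s\n\n' "$TO"
    cat "$1"
}
```

The output can then be piped to sendmail the same way the script does, for example `build_message /tmp/email_list.txt | /usr/sbin/sendmail -f "$FROM" -t "$TO"`.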

Feel free to reply for any issues. Hope this helps!

Monday, 17 July 2017

Bash Script To Determine Backup Protocol

In vSphere Data Protection, you have several backup protocols: SAN mode, HotAdd, NBD and NBD over SSL. HotAdd is the recommended protocol, as data handling and transfer are much faster than with the rest. If your backups are running slow, the first thing to check is the backup protocol mode, then the VDP load, and finally the VMFS / array performance.

If you have only a few VMs, you can easily find the protocol type in the logs. However, if you have a large number of VMs and would like to determine the protocol for each, you can use this script that I have written.
https://github.com/happycow92/shellscripts/blob/master/backup-protocol-type.sh

#!/bin/bash
clear
IFS=$(echo -en "\n\b")
echo "This script should be executed on a proxy machine"
echo "Checking current Machine......"
directory="/usr/local/vdr"
if [ ! -d "$directory" ]
then
printf "Current machine is Proxy machine"
else
printf "Current machine is VDP Server"
fi
echo && echo
sleep 2s
echo -e "--------------------------------------------------------"
echo -e "| Client Name | Backup Type | Proxy Used |"
echo -e "--------------------------------------------------------"
cd /usr/local/avamarclient/var
backupLogList=$(ls -lh | grep -i "vmimagew.log\|vmimagel.log" | awk '{for (i=1; i<=8; i++) $i=""; print $0}' | sed 's/^ *//')
for i in $backupLogList
do
clientName=$(cat $i | grep -i "<11982>" | awk '{print $NF}' | cut -d '/' -f 1)
protocolType=$(cat $i | grep -i "<9675>" | awk '{print $7}' | head -n 1)
proxyName=$(cat $i | grep -i "<11979>" | cut -d ',' -f 2)
if [ "$protocolType" == "hotadd" ]
then
protocol="hotadd"
elif [ "$protocolType" == "nbdssl" ]
then
protocol="nbdssl"
elif [ "$protocolType" == "nbd" ]
then
protocol="nbd"
else
protocol="SAN Mode"
fi
printf "| %-20s| %14s| %12s|\n" "$clientName" "$protocol" "$proxyName"
done
echo && echo

A few things:
> The script must always be executed on a proxy machine. If your VDP is using the internal proxy, run it on the VDP machine itself.
> If you are using one or more external proxies, you need to run this on each of the proxy machines.
> Note that this works on VDP 6.x and above.

I have added an IFS (Internal Field Separator) setting to handle spaces in backup job names. The rough version of the script had issues handling them.
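
With a large number of VMs, a per-protocol count can be easier to read than the full table. The filter below is a sketch that assumes the "| name | type | proxy|" row layout printed by the script; the function name is hypothetical.

```shell
#!/bin/sh
# Count table rows per backup type. Reads the script's table output on stdin
# and ignores the header, separator and any other non-row lines.
tally_protocols() {
    awk -F'|' 'NF >= 4 && $3 !~ /Backup Type/ {
        gsub(/^ +| +$/, "", $3)          # trim padding around the type column
        if ($3 != "") count[$3]++
    }
    END { for (p in count) printf "%s %d\n", p, count[p] }' | sort
}
```

Something like `./backup-protocol-type.sh | tally_protocols` would then print one line per protocol with its client count.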

It's a very lightweight script, takes seconds to execute and does not make any changes to your system.

Hope this helps.