Friday, 25 December 2015

Understanding VMkernel.log for vMotion Operation

Written by Suhas Savkoor



Let's decode the vMotion logging in VMkernel.log.

Open an SSH session (PuTTY) to the host where this virtual machine currently resides. Change the directory to:
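The command from the original screenshot is not preserved; on an ESXi host the logs live under /var/log, so the directory change is simply:

    # cd /var/log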



Capture the live logging of VMkernel using the following command:
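As a sketch, assuming the default log location (the live VMkernel log on ESXi 5.x is /var/log/vmkernel.log):

    # tail -f /var/log/vmkernel.log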



Perform a vMotion of a virtual machine residing on this host to any other available host with shared storage. You will see logging similar to the below:

I will break down the logging with " // " for comments.

2015-12-25T16:39:25.565Z cpu4:2758489)Migrate: vm 2758492: 3284: Setting VMOTION info: Source ts = 1451061663105920, src ip = <192.168.1.176> dest ip = <192.168.1.177> Dest wid = 1830931 using SHARED swap

//The first line, Migrate: vm 2758492, does not tell you which virtual machine is being migrated by name; it gives the world ID of the virtual machine that is about to be migrated. To find the world ID of a virtual machine before migrating, run the command # esxcli vm process list. This command lists the world IDs of all the virtual machines residing on the host.
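// For reference, the output of that command looks something like this (the VM name and values here are illustrative, not taken from this environment):

    WindowsVM1
       World ID: 2758492
       Process ID: 0
       VMX Cartel ID: 2758491
       UUID: 42 0c xx xx xx ...
       Display Name: WindowsVM1
       Config File: /vmfs/volumes/Recovery_LUN/WindowsVM1/WindowsVM1.vmx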

// The Setting VMOTION info value 1451061663105920 is the vMotion ID. This ID is useful because grepping for it in hostd.log or in vmware.log (residing in the virtual machine's directory) gives you further information about the vMotion. In vmware.log you can see the transitioning states of the vMotion, with each state performing a set of steps.

// The source IP, where this virtual machine currently resides, is 192.168.1.176, and the destination to which the virtual machine is being migrated is 192.168.1.177.

// The dest wid 1830931 is the world ID this virtual machine will have on the destination host once the vMotion is completed.


2015-12-25T16:39:25.567Z cpu4:2758489)Tcpip_Vmk: 1288: Affinitizing 192.168.1.176 to world 2772001, Success
2015-12-25T16:39:25.567Z cpu4:2758489)VMotion: 2734: 1451061663105920 S: Set ip address '192.168.1.176' worldlet affinity to send World ID 2772001
2015-12-25T16:39:25.567Z cpu4:2758489)Hbr: 3340: Migration start received (worldID=2758492) (migrateType=1) (event=0) (isSource=1) (sharedConfig=1)

// Here the source host is prepared for migration: the vMotion IP address is affinitized to the world that will send the migration data.

//The Migration start received line logs the vMotion type. The world ID 2758492 is recorded, and migrateType=1 indicates a host migration (vMotion).

//The host where I am currently logged in via SSH is the source host, which is reflected by isSource=1 and sharedConfig=1.


2015-12-25T16:39:25.567Z cpu5:2771999)CpuSched: 583: user latency of 2771999 vmotionStreamHelper0-2758492 0 changed by 2771999 vmotionStreamHelper0-2758492 -1
2015-12-25T16:39:25.568Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'

//Here the connection is bound to the source host's vMotion vmkernel interface (192.168.1.176).

2015-12-25T16:39:25.570Z cpu5:33435)MigrateNet: vm 33435: 2096: Accepted connection from <::ffff:192.168.1.177>

// Here the destination host has accepted the connection for the vMotion.


2015-12-25T16:39:25.570Z cpu5:33435)MigrateNet: vm 33435: 2166: dataSocket 0x410958a8dc00 receive buffer size is -565184049
2015-12-25T16:39:25.570Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'
2015-12-25T16:39:25.571Z cpu4:2772001)VMotionUtil: 3396: 1451061663105920 S: Stream connection 1 added.
2015-12-25T16:39:25.571Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'
2015-12-25T16:39:25.572Z cpu4:2772001)VMotionUtil: 3396: 1451061663105920 S: Stream connection 2 added.

//Both the source and destination have established their connections, and the vMotion data transfer takes place. The VMkernel.log does not record the details of this phase; if you check the vmware.log for this virtual machine, you can see the states and progress of the vMotion in detail.

2015-12-25T16:39:25.848Z cpu3:2758492)VMotion: 4531: 1451061663105920 S: Stopping pre-copy: only 0 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~2.116 MB/s, 52403100% t2d)

//In short, this is how vMotion works:


  • A shadow VM is created on the destination host.
  • Copy each memory page from the source to the destination via the vMotion network. This is known as preCopy.
  • Perform another pass over the VM’s memory, copying any pages that changed during the last preCopy iteration
  • Continue the pre-copy iterations until no changed pages remain
  • Stun the source VM and resume the shadow VM on the destination
//Basically, the memory state of the virtual machine is transferred to the shadow virtual machine created on the destination host. Memory is nothing but pages, and the pages are transferred to the shadow VM over the vMotion network. The more actively the VM changes its memory, the longer the pre-copy, and therefore the vMotion, takes.

//Towards the end of the vMotion the source VM must be destroyed and operations must continue on the destination. For this, ESXi has to determine that the last few memory pages can be transferred to the destination quickly, which is the switch-over goal of 0.5 seconds.

//So here, when it says only 0 pages left to send, which can be sent within the switchover time goal of 0.500 seconds, it means there are no more changed memory pages left to transfer. The host therefore declares that the source VM can be destroyed, the vMotion completed, and the destination VM resumed, all within the feasible switchover time.


2015-12-25T16:39:25.952Z cpu5:2772001)VMotionSend: 3643: 1451061663105920 S: Sent all modified pages to destination (no network bandwidth estimate)

//This tells us that, for this vMotion ID, the source ("S") has sent all the modified memory pages to the destination.


2015-12-25T16:39:26.900Z cpu0:2758489)Hbr: 3434: Migration end received (worldID=2758492) (migrateType=1) (event=1) (isSource=1) (sharedConfig=1)
2015-12-25T16:39:26.908Z cpu3:32820)Net: 3354: disconnected client from port 0x200000c
2015-12-25T16:39:26.967Z cpu3:34039)DLX: 3768: vol 'Recovery_LUN', lock at 116094976: [Req mode 1] Checking liveness:

//Here the migration has ended for the given world ID and migration type. The virtual machine, which in my case resides on the Recovery_LUN datastore, is now locked by the destination host, under the new world ID that was assigned during the vMotion.


So now you know what a successful vMotion looks like in the vmkernel.log.
A more in-depth view of the vMotion can be found in the vmware.log, which is fairly self-explanatory once you know what to look for and where to look.

Tuesday, 22 December 2015

Unable To Delete Orphaned/Stale VMDK File

Written by Suhas Savkoor



So today I got a case where we were trying to delete an orphaned flat.vmdk file.

A brief background of what was being experienced here:

There were three ESXi hosts and two shared datastores among these hosts. A couple of folders on these two shared datastores contained only flat.vmdk files. These flat files were not associated with any virtual machines, and their last modified date was about a year old.

However, every time we tried to delete the file from the datastore browser GUI, we got the error:

Cannot Delete File [Datastore Name] File_Name.vmdk


So, when we tried to delete this file from the command line using " rm -f <file_name> ", we got the error:

rm: cannot remove 'File.vmdk': No such file or directory

Also:
We were able to move the file to another datastore and remove it there successfully, but a stale copy of the file was still left behind in the original datastore.

So, how do we remove this stale file?

Step 1:

  • Take an SSH session to all the hosts that have access to the datastore where the stale file resides. 
  • In my case all the three hosts in the cluster.

Step 2:

  • Run the below command against the stale file. This command has to be executed from the SSH (PuTTY) session of every host having connectivity to that datastore.
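The exact command from the original screenshot is not preserved; given the "Command release failed" errors below, it was most likely a vmkfstools lock-release attempt against the flat file, something along these lines (the path is a placeholder):

    # vmkfstools -L release /vmfs/volumes/<datastore>/<folder>/<file>-flat.vmdk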

This can result in two error outputs:

First error:
Could not open /vmfs/volumes/xxxxxxxx/xxxxxxx/xxxxxx-flat.vmdk 
Command release failed Error: Device or resource busy

Second error:
Command release failed 
Error: Inappropriate ioctl for device


In my case it was the second error.

The host that gives you the second error holds the stale lock on the file. In my case all three hosts returned the second error, and I had to reboot all three hosts. 

Once the hosts are rebooted, you can successfully remove the stale flat.vmdk files.

Note:
If the remove operation still fails, you will have to Storage vMotion all the VMs off the affected datastore, then delete the VMFS volume and reformat it.

Sunday, 20 December 2015

Configure Remote Syslog for ESXi host

Written by Suhas Savkoor



When you installed and set up an ESXi host, you would have configured a scratch location for all the host logging to go to; this might be on a local datastore or on SAN storage.
You can also preserve your host logging on a remote machine, and configure log rotation to retain logs for a longer time, by using syslog. 

Here, I am going to configure my host logging in such a way that all the ESXi logging must go to a remote machine, in my case, a vCenter Windows machine. 

Step 1:

Installing the Syslog Collector:

In the ISO that you used to install vCenter Server, you will have an option for the Syslog Collector. 



Go Next and accept the EULA


Once you go Next, you get the option to configure a few things:

  • First, where you want the Syslog Collector to be installed
  • Second, where the syslog data should be stored
  • The log rotation file size for the host logs, which are created in .txt format
  • And how many log rotations should be retained. 

So basically, once the syslog text file reaches the rotation size, which by default is 2 MB, it is zipped and new logging goes to a new text file, and 8 rotated zipped files are retained at any one time.


Choose a type of installation that is required and go Next


The default TCP and UDP port used for syslog is 514; specify a custom port if required. If you use a custom port, document it, as you will need it later when configuring the hosts.


You can choose how your Syslog Collector should be identified on the network: by either the vCenter IP or the FQDN.


Click Next > Install and Finish once the installation is complete. 

Step 2:

Once the syslog collector is installed, it is then time to configure syslog for the required ESXi host. 

Take an SSH session to the host that requires the syslog configuration. Run the following command:
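The command screenshot is not preserved; the standard esxcli command to view the syslog configuration is:

    # esxcli system syslog config get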


This shows the current logging configuration of the ESXi host. The output is something like the below:
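A representative output on a 5.x host (the fields and values below are illustrative and will vary slightly by build):

    Default Network Retry Timeout: 180
    Local Log Output: /scratch/log
    Local Log Output Is Configured: false
    Local Log Output Is Persistent: true
    Local Logging Default Rotation Size: 1024
    Local Logging Default Rotations: 8
    Log To Unique Subdirectory: false
    Remote Host: <none>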


Notice that I do not have Remote Host syslog configuration done yet. 

Next, run the following command to configure syslog to point at the required machine, on the required protocol and port:

For udp:
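A sketch of the standard esxcli form (substitute your collector's IP/FQDN and port):

    # esxcli system syslog config set --loghost='udp://<vCenter_IP_or_FQDN>:514'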


For tcp:
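And the TCP equivalent (same assumptions):

    # esxcli system syslog config set --loghost='tcp://<vCenter_IP_or_FQDN>:514'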


If you are using a custom port, then specify that custom port in the above command. 

Next, run the command to perform a syslog reload for the changes to take effect:
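The reload is the standard esxcli call:

    # esxcli system syslog reload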


Now, you may need to manually open the firewall for syslog traffic when redirecting logs. For this, we enable the syslog rule-set in the host firewall and refresh the rules:
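Using the standard esxcli firewall commands:

    # esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
    # esxcli network firewall refresh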



Now, let's check the directory to see if syslog is available for the host. 
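On the collector side this means looking in the data directory chosen during installation; on Windows the default is typically under ProgramData, something like the path below (this path is an assumption based on the default install location, so use whatever directory you selected in the installer):

    C:\ProgramData\VMware\VMware Syslog Collector\Data\<ESXi_IP>\syslog.txt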


The log file is created and when you review the syslog configuration for the host, you can now see the remote server IP.


Cheers!

Thursday, 17 December 2015

How To Analyze PSOD

Written by Suhas Savkoor



The Purple Screen of Death, commonly known as a PSOD, is something you are likely to come across at some point when running ESXi hosts.

Usually when we experience a PSOD, we reboot the host (which is a must), then gather the logs and upload them to VMware support for analysis (where I spend a good amount of time going through them).

Why not take a look at the dumps by yourself?

Step 1:
I am going to simulate a PSOD on my ESXi host. You need to be logged in to the host over SSH. The command is:
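This is the vsish knob documented for deliberately crashing a host; run it only on a test host:

    # vsish -e set /reliability/crashMe/Panic 1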



And when you open a DCUI to the ESXi host, you can see the PSOD


Step 2:
Sometimes, we might miss out on the screenshot of PSOD. Well that's alright! If we have core-dump configured for the ESXi, we can extract the dump files to gather the crash logs.

Reboot the host if it is sitting at the PSOD screen. Once the host is back up, log in to the host via SSH/PuTTY and go to the core directory, which is where the PSOD core dumps are written:
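On ESXi the extracted dumps land in /var/core:

    # cd /var/core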



Then list out the files here:
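For example (the exact file name will vary; it is typically of the form vmkernel-zdump.<N>):

    # ls -lh
    vmkernel-zdump.1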



Here you can see the vmkernel dump file, and the file is in the zdump format.

Step 3:
How do we extract it?

Well, we have a nice extract utility that does all the work, " vmkdump_extract ". This command must be executed against the zdump.1 file, which looks something like this:
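A sketch of the invocation, assuming the dump file from the previous step is named vmkernel-zdump.1:

    # vmkdump_extract vmkernel-zdump.1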



It creates four files:
a) vmkernel-log.1
b) vmkernel-core.1
c) visorFS.tar
d) vmkernel-pci

All we require for analysis is the vmkernel-log.1 file

Step 4:
Open the vmkernel-log.1 file using the below command:
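The original screenshot is not preserved; less (or vi) works here, and both support the Shift+G and PageUp navigation described below:

    # less vmkernel-log.1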



Skip to the end of the file by pressing Shift+G. Now let's slowly go to the top by pressing PageUp.
You will come across a line that says @BlueScreen: <event>

In my case, the dumps were:




  • The first line, @BlueScreen:, tells you the crash exception, such as Exception 13 or 14; in my case it is CrashMe, which indicates a manually triggered crash. 
  • The VMK uptime line tells you the kernel up-time before the crash.
  • The logging after that is the information we need to look at: the cause of the crash. 
Now, the crash dump varies for every crash. The causes can range from hardware errors to driver issues to problems with the ESXi build, and a lot more.

Each dump analysis would be different. But the basic is the same. 

So, you can try analyzing the dumps by yourself. However, if you are entitled to VMware support, I will do the job for you.


Cheers!







Wednesday, 16 December 2015

Unable To Take A Snapshot Of A VM - An error occurred while taking a snapshot: Change tracking target file already exists.



Written by Suhas Savkoor



You had a virtual machine which was scheduled for a backup task. The backup job completed and now you want to perform a manual snapshot operation for this virtual machine.

When you take a manual snapshot of this VM, it fails with the following error:

An error occurred while taking a snapshot: Change tracking target file already exists.
An error occurred while saving the snapshot: Change tracking target file already exists.

There are no snapshots in the snapshot manager. The virtual machine disks are running on the base disk and not a snapshot disk. 

However, when you browse the datastore for this virtual machine, you notice certain ctk.vmdk files. There will be one ctk.vmdk file for each VMDK. 

A VMware CTK file contains a list of all the changes made to a VMware virtual machine (VM) since it was last backed up. 
CTK change tracking files exist on all VMs where VMware's Changed Block Tracking (CBT) technology is enabled. CBT relies on the CTK file's information to back up only the VM information blocks that have changed, instead of backing up the entire VM. 

There is one CTK file per virtual disk, carrying the -ctk.vmdk suffix. The CTK file is always much smaller than the virtual machine disk (VMDK) it describes.

What needs to be done?

1. Either browse the virtual machine's folder on the datastore from the GUI, or open an SSH/command-line session to the host where this virtual machine resides.

2. Delete the ctk.vmdk files (a sample command-line approach is shown after these steps)

3. Perform the snapshot operation. 
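As referenced in step 2, a sketch of the command-line approach (the paths are placeholders for your environment):

    # cd /vmfs/volumes/<datastore>/<vm_folder>
    # rm *-ctk.vmdk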

Success!

Monday, 14 December 2015

Configuring VFRC From An Emulated SSD Drive

Written by Suhas Savkoor



If you have read my previous article, you know how to emulate an SSD drive from a non-SSD disk.

In this article, I will show you the use case of the SSD emulation. I will be configuring VFRC from an emulated SSD drive.

Here, I have created a 2 GB VFRC-Test datastore (VMFS-5) and tagged it as an SSD drive.


Now, if I log in to the web client, select the host, and under the Manage tab select Virtual Flash Resource Management, I see an empty list.
Also, if I go to Add Capacity, the list is empty.


Now, even though we have an SSD datastore, why is it showing an empty list here?
Well, the requirement for VFRC is a raw SSD disk; a raw disk means there should be no partition on it.
When we added the non-SSD disk to vSphere and tagged it as an SSD drive, a VMFS partition was created on it, and this VMFS partition is preventing the disk from being used by VFRC.

So, what we need to do now is remove the partition from this disk. Please make sure there is no data on this datastore, as it will be lost when the partition is removed.

First, we need to determine the partition number for our drive.
For this:

Log in to the ESXi host via SSH/PuTTY and run the following command:
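The command screenshot is not preserved; the standard way to list a device's partition table with partedUtil is (substitute your naa./mpx. device identifier):

    # partedUtil getptbl /vmfs/devices/disks/<device_identifier>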



The output is similar to below:


Then, we need to delete the partition. For this, run the below command:
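Again as a sketch, where the trailing number is the partition number found in the previous output (1 is just an example):

    # partedUtil delete /vmfs/devices/disks/<device_identifier> 1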



The output is something as seen below


Here we see that we no longer have a partition table listed for this disk. Also, when you go back to the GUI, you no longer see the datastore.

Now, let's log in to the web client, and we can see that this SSD disk now shows up in the VFRC add resource window.



Select the disk and click OK, and you can see the VFRC SSD disk is now added.



Emulate all the way!

Sunday, 13 December 2015

Emulating a SSD Drive in VMware 5.x

Written by Suhas Savkoor



You can emulate an SSD disk from a non-SSD hard disk.

Just a couple of steps:


Here, I have my local datastore, Suhas-Local-4, which is a VMFS-5 datastore on a non-SSD drive.

Let's tag it.

Step 1:
Take an SSH (PuTTY) session to the host which has access to the datastore that you want to tag.

Run the following command to see whether the device is an SSD or not:
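The standard command is the esxcli device list, whose output includes an "Is SSD" line (the device identifier is a placeholder):

    # esxcli storage core device list -d <device_identifier>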



You can find the device name from the GUI (the Devices view under Storage), or you can right-click the datastore > Copy to clipboard and paste it into a notepad to obtain the device name.

The output is similar to this:


Step 2:
We need to add a SATP rule to the device that we want to tag as SSD.

The command is
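A sketch of the documented SATP claim rule, assuming the device is a local disk claimed by VMW_SATP_LOCAL (check the SATP reported in the previous output and substitute if different):

    # esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --device=<device_identifier> --option="enable_ssd"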



Step 3:
Reboot the host.
Once the host is rebooted, you can see from the GUI that the datastore is now tagged as SSD.


Verify the same from the command line using the command in Step 1



That's it. We have tagged our non-SSD drive as an SSD disk.
You can now try configuring VFRC (vFlash Read Cache) without an actual SSD.


Upgrading vCenter Appliance 5.x

Written by Suhas Savkoor



Upgrading the vCenter Appliance is quite different from upgrading your Windows-based vCenter.
There are just a few steps to perform and a lot of waiting for the upgrade to complete.

Currently, I have a 5.5 Update 3 vCenter and I will be updating it to 5.5 U3a


How to get here?

1. Open a browser and enter:
 
                   https://<Appliance_IP_or_FQDN>:5480

2. Login with root credentials
3. Click Update Tab and Select the Status sub-tab and expand the Details option.


Step 1:
  • Login to the host/vCenter hosting this vCenter Appliance virtual machine.
  • Right click the Appliance VM and select Open Console




  • Select the CD icon in the toolbar, then CD/DVD Drive.
  • Since I have my 5.5 Update 3a ISO on the local drive, I will select the option Connect to ISO image on local disk
  • Browse to the datastore and select the ISO Image


Step 2:
  • Go back to the VCSA management page > Update
  • Click the Settings tab and make sure that Use CDROM Updates is selected, as we are using the ISO from the local drive for the upgrade.
  • Under Actions > Click Save Settings


Step 3:
  • Now the updates are on the CD and the appliance is set to look for updates in its CDROM folder. 
  • Go back to Status Tab and click Check Updates
  • A task will run for a couple of seconds and a new Available Updates option is now seen
  • If you click the Details here, you can see what update is available. Here, since we mounted the 5.5 U3a update, we can see this in the available updates.
  • Click Install Updates




Step 4:

  • The update process can take from 90 minutes to 120 minutes. Do NOT reboot or Power OFF the appliance.
  • Reboot the machine ONLY after the update is complete. 


During the update process you can see this pop-up in the web management page.



After rebooting the appliance, you can see the new Version Under Appliance Version in Update Tab.
You can also open a console to the Appliance virtual machine and notice the updated Build Number.

Tuesday, 8 December 2015

Error Logging Into vCenter 6.0 Via vSphere Client With Windows Authentication

Written by Suhas Savkoor



So, you got your new vCenter Appliance 6.0 U1 setup, and you want to dive into managing the environment.
So you fire up a vSphere client, enter the vCenter IP details, and try logging in with Windows Authentication.

And it fails! With this error:



Here, we need to add the vCenter Appliance to a domain.

This is what you have to do:

1. Log in to the vCenter Appliance console or open an SSH session to it. Then change the directory to:
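On the 6.0 appliance the Likewise tools live under /opt/likewise/bin:

    # cd /opt/likewise/bin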



2. Run this command to join the appliance to a domain:
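A sketch of the join command (the domain name and user are placeholders for your environment):

    # ./domainjoin-cli join <domain_name> <domain_admin_user>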



3. After executing the above command, you will be prompted for the domain user's password. Enter the password and press Enter.

4. Verify the join by running:
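The query sub-command of the same tool reports the current join status:

    # ./domainjoin-cli query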



Now you will be able to log in to the appliance via the vSphere Client using Windows session credentials.

Monday, 7 December 2015

Configuring VDP 6.x

Written by Suhas Savkoor



Once you have deployed VDP 6.x from the OVF template, you will have to configure the appliance before using it.

To configure VDP you will have to login to the VDP management page. The address would be:

                       https://<vdp_IP_or_FQDN>:8543/vdp-configure

If you receive the error message, "Server has weak ephemeral Diffie-Hellman public key", then follow this article to resolve it.

Step 1:
Once you open the management web page, you will come across this:


Login to the management webpage for VDP configuration.
Username: root
Password: changeme

Once you log in, you will be taken to the VDP configuration wizard.


Step 2:
Click Next, and under Network Settings:

1. Enter IPv4 address of VDP appliance
2. Subnet mask and gateway address
3. Primary DNS and an optional secondary DNS.

Add the appliance host-name and Domain. Click Next.



You might receive the following error.


This issue occurs when Fully Qualified Domain Names (FQDN), forward lookup, and reverse lookup are not configured, or they are not configured correctly.

To resolve this, add the VDP appliance name and domain in Forward Lookup in your DNS.


Login to DNS manager > Right click Forward Lookup > Add New Host and provide the Name and IP address of the VDP appliance. Make sure Create associated pointer (PTR) record is checked and select Add Host.

Step 3:
Once the record is successfully added, select a time zone.



Step 4:
Now, you will get the option to change the default password "changeme" to one of your choosing.



Step 5:
In the next step, you have to register your VDP appliance to your vCenter machine.

1. Enter vCenter Username. This user must have administrative privileges on the vCenter machine
2. Password for this user
3. vCenter IP and click Test connection.

Once Connection is tested successfully, click Next.



Step 6:
The next step is specific to your environment: create new storage for the appliance with the desired size.



Step 7:
Select a datastore where your VDP drives must reside. 3 drives are required, so choose an appropriate datastore with sufficient space.



Step 8:
Enter a required value of CPU and Memory or accept the default ones. Do not lower the default CPU and memory values.



Step 9:
Review the changes and apply them. Once that is done, the VDP appliance has to be rebooted for the changes to take effect.



And that's it. You got your VDP appliance configured. 

Sunday, 6 December 2015

Upgrading ESXi Host Via Update Manager

Written by Suhas Savkoor



If you have Update Manager (VUM), you can upgrade/patch your ESXi hosts with it. For GUI lovers, VUM is an easy way out. 
My ESXi host is currently 5.5 Build 1331820, which corresponds to 5.5 GA. You can find out about VMware build numbers from this link


Step 1:

Select the host and click the Update Manager tab. Under the baseline section, notice that I do not have anything yet. Click the Admin View option.



Step 2:
Click ESXi Images and select Import ESXi Image.


Step 3:
Browse to the Datastore and Upload the file and click Next



The ESXi image upload will take a quick minute or two. 



Once the upload is done, click Next and Provide a Baseline Name for this ESXi ISO that was uploaded. Here, the baseline name given is Host_Upgrade. Click Finish



Step 4:
Go back to Compliance View. Right-click the ESXi host and select Enter Maintenance Mode. Once the host enters maintenance mode, click Attach, check the baseline that was just created with the ESXi image, and click Attach.


Step 5:
Click Scan and check Upgrades. Once the scan task is complete, the compliance status should come up as Non Compliant (red).


If it comes up as Incompatible, then reboot the host and perform the Scan again. 

Step 6:
If HA is enabled on the cluster, disable it prior to performing the Remediate option. Once HA is disabled on the cluster where this ESXi host resides, click Remediate.


Step 7:
This is a simple wizard driven process with no changes required.
  • First, verify the baseline is seen Under Baseline section of Remediate wizard. Click Next
  • Accept End User License Agreement. Click Next.
  • Enter a Task Description. Click Next.
  • Uncheck DPM if it is enabled for any of the selected clusters. Click Next, review, and Finish.




If you observe the upgrade by opening a KVM console to the server hosting this ESXi host, you will see the upgrade in progress.



The upgrade takes about 10-15 minutes and you can verify the upgrade by observing the updated build number.



Simple isn't it?