Saturday, 29 October 2016

Migrating VDP From 5.8 and 6.0 To 6.1.x With Data Domain

You cannot upgrade a vSphere Data Protection appliance from 5.8.x or 6.0.x to 6.1.x in place because the underlying SUSE Linux version differs. The earlier versions of vSphere Data Protection used SLES 11 SP1 and 6.1.x uses SLES 11 SP3, so we will perform a migration instead.

This article only discusses migrating a VDP appliance from 5.8.x or 6.0.x with a Data Domain attached. If you had a VDP appliance without a Data Domain, you would choose the "Migrate" option in the vdp-configure wizard during the setup of the new 6.1.x appliance. However, this is not the path we will follow when the destination storage is an EMC Data Domain. A VDP appliance with a Data Domain is migrated through a process called checkpoint restore. Let's discuss the steps below.

For this instance let's consider the following setup:
1. A vSphere Data Protection 5.8 appliance
2. A Virtual Edition of EMC Data Domain Appliance (Process is still the same for physical as well)
3. The 5.8 VDP was deployed as a 512GB deployment.
4. The IP address of this VDP appliance was
5. The IP address of the Data Domain appliance is

1. In point (3) above you saw that the 5.8 VDP appliance was set up with 512 GB of local storage. The first question that comes up here is: why have a local drive when the backups reside on the Data Domain?
A vSphere Data Protection appliance with a Data Domain still has local VMDKs to store the meta-data of the client backups. The actual client data is deduplicated and stored on the DD appliance, and the meta-data of each backup is stored under the /data0?/cur directory on the VDP appliance. So, if your source appliance was a 512 GB deployment, the destination has to be equal to or greater than the source deployment.

2. The IP address, DNS name, domain and all other networking configuration of the destination appliance should be the same as the source.

3. It is best to keep the same password on the destination appliance during the initial setup process.

4. On the source appliance, make sure Checkpoint Copy is enabled. To verify this, go to the https://vdp-ip:8543/vdp-configure page, select the Storage tab, click the Gear icon and click Edit Data Domain. The first page displays this option. If it is not checked, the checkpoints on the source appliance are not copied over to the Data Domain, and you will not be able to perform a checkpoint restore.
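The first two prerequisites can be written down as a small pre-flight comparison before you deploy anything. This is only a sketch: the dictionary keys, the `check_destination` helper and the 192.0.2.10 address are hypothetical placeholders chosen for illustration, not values read from a VDP appliance.

```python
def check_destination(source, destination):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Networking identity must be identical on both appliances.
    for key in ("ip", "dns_name", "domain"):
        if source[key] != destination[key]:
            problems.append(
                f"{key} differs: {source[key]!r} vs {destination[key]!r}"
            )
    # Local meta-data storage must be equal to or larger than the source.
    if destination["capacity_gb"] < source["capacity_gb"]:
        problems.append(
            f"destination capacity {destination['capacity_gb']} GB is smaller "
            f"than source capacity {source['capacity_gb']} GB"
        )
    return problems


# Placeholder values (192.0.2.10 is a documentation-only address).
source = {"ip": "192.0.2.10", "dns_name": "vdp58.vcloud.local",
          "domain": "vcloud.local", "capacity_gb": 512}
destination = dict(source, capacity_gb=1024)

print(check_destination(source, destination) or "destination plan looks consistent")
```

A destination planned with less than 512 GB here would be reported as a problem instead.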

The migration process:
1. Open an SSH session to the source VDP appliance and run the command below to get the checkpoint list:
# cplist

The output would be similar to:
cp.20161011033032 Tue Oct 11 09:00:32 2016   valid rol ---  nodes   1/1 stripes     25
cp.20161011033312 Tue Oct 11 09:03:12 2016   valid --- ---  nodes   1/1 stripes     25

Make a note of this output.
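If you save the cplist output to a file, a few lines of Python can pull out the checkpoint tags for later comparison. This is a rough sketch that assumes the column layout shown in the sample above (tag, five date fields, then what appear to be the status and type columns); the `parse_cplist` helper is my own name, not part of VDP.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    tag: str      # e.g. "cp.20161011033032"
    status: str   # e.g. "valid"
    cp_type: str  # e.g. "rol", "hfs" or "---"


def parse_cplist(text):
    """Parse cplist-style lines, assuming the layout of the sample output."""
    checkpoints = []
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that does not look like a checkpoint row.
        if len(fields) < 8 or not fields[0].startswith("cp."):
            continue
        checkpoints.append(Checkpoint(fields[0], fields[6], fields[7]))
    return checkpoints


SAMPLE = """\
cp.20161011033032 Tue Oct 11 09:00:32 2016   valid rol ---  nodes   1/1 stripes     25
cp.20161011033312 Tue Oct 11 09:03:12 2016   valid --- ---  nodes   1/1 stripes     25
"""

valid_tags = [cp.tag for cp in parse_cplist(SAMPLE) if cp.status == "valid"]
print(valid_tags)
```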

2. Run the below command to obtain the Avamar System ID:
# avmaint config --ava | grep -i "system"
The output would be similar to:

Make a note of this output as well. 1476126720 is the Avamar System ID. It is used to determine which mTree on the Data Domain corresponds to this VDP appliance.

3. Run the command below to obtain the hashed Avamar root password. It is only needed by VMware Support to test the GSAN login if the migration fails, so you can skip this step.
# grep ap /usr/local/avamar/etc/usersettings.cfg
The output would be similar to:

4. Power off the source appliance

5. Deploy the VDP 6.1.x appliance from the OVA template, provide the same networking details during the deployment, and power on the 6.1.x appliance once the deployment completes successfully.

6. Go to the https://vdp-ip:8543/vdp-configure page and complete the configuration process for the new appliance. As mentioned above, during the "Create Storage" section in the wizard specify the local storage space, either equal to or greater than the source VDP appliance system. Once the appliance configuration completes, it will reboot the new 6.1.x system.

7. Once the reboot is completed, open an SSH session to the 6.1.x appliance and run the command below to list the available checkpoints on the Data Domain.
# ddrmaint cp-backup-list --full --ddr-server=<data-domain-IP> --ddr-user=<ddboost-user-name> --ddr-password=<ddboost-password>

Sample command from my lab:
# ddrmaint cp-backup-list --full --ddr-server= --ddr-user=ddboost-user --ddr-password=VMware123!
The output would be similar to:
================== Checkpoint ==================
 Avamar Server Name           : vdp58.vcloud.local
 Avamar Server MTree/LSU      : avamar-1476126720
 Data Domain System Name      :
 Avamar Client Path           : /MC_SYSTEM/avamar-1476126720
 Avamar Client ID             : 200e7808ddcde518fe08b6778567fa4f397e97fc
 Checkpoint Name              : cp.20161011033032
 Checkpoint Backup Date       : 2016-10-11 09:02:07
 Data Partitions              : 3
 Attached Data Domain systems :

The "Avamar Server MTree/LSU" and "Checkpoint Name" fields are what we need. avamar-1476126720 is the Avamar mTree on the Data Domain; the number in it is the Avamar System ID we obtained in step 2. The checkpoint cp.20161011033032 is the checkpoint from the source VDP appliance that was copied over to the Data Domain.
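To cross-check the two values without reading the output by eye, the listing can be parsed into key/value records. The sketch below assumes each checkpoint starts with a line of "=" characters and that every field is a "name : value" pair, as in the sample above; `parse_cp_backup_list` is a hypothetical helper name.

```python
def parse_cp_backup_list(text):
    """Split cp-backup-list style output into one dict per checkpoint."""
    records = []
    for line in text.splitlines():
        if line.strip().startswith("="):
            # A "==== Checkpoint ====" banner starts a new record.
            records.append({})
        elif ":" in line and records:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            records[-1][key.strip()] = value.strip()
    return records


SAMPLE = """\
================== Checkpoint ==================
 Avamar Server Name           : vdp58.vcloud.local
 Avamar Server MTree/LSU      : avamar-1476126720
 Data Domain System Name      :
 Avamar Client Path           : /MC_SYSTEM/avamar-1476126720
 Checkpoint Name              : cp.20161011033032
 Checkpoint Backup Date       : 2016-10-11 09:02:07
 Data Partitions              : 3
"""

system_id = "1476126720"  # the Avamar System ID noted on the source appliance
records = parse_cp_backup_list(SAMPLE)
matches = [r for r in records
           if r.get("Avamar Server MTree/LSU") == f"avamar-{system_id}"]
print(matches[0]["Checkpoint Name"])
```

If `matches` comes back empty, the checkpoints listed on the Data Domain belong to a different appliance than the one whose system ID you noted.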

8. Now, we will perform a cprestore to this checkpoint. The command to perform the cprestore is:
# cprestore --hfscreatetime=<avamar-ID> --ddr-server=<data-domain-IP> --ddr-user=<ddboost-user-name> --cptag=<checkpoint-name>

Sample command from my lab:
# cprestore --hfscreatetime=1476126720 --ddr-server= --ddr-user=ddboost-user --cptag=cp.20161011033032
Here, 1476126720 is the Avamar System ID and cp.20161011033032 is a valid checkpoint. Do not roll back to a checkpoint that is not valid. If the checkpoint is not validated, you will have to run an integrity check on the source VDP appliance to generate a valid checkpoint and copy it over to the Data Domain system.
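Since a typo in the system ID or checkpoint tag is easy to make, you may want to assemble the command from the values you noted earlier. The following is a minimal sketch that only uses the four flags shown above; the `build_cprestore_command` helper and the `dd.example.local` host name are placeholders of mine, and the format checks are assumptions based on the sample values in this article.

```python
import re
import shlex


def build_cprestore_command(system_id, ddr_server, ddr_user, cptag):
    """Return the cprestore argument list after basic sanity checks."""
    if not system_id.isdigit():
        raise ValueError(f"system ID should be numeric: {system_id!r}")
    # Checkpoint tags in this article look like "cp." plus 14 digits.
    if not re.fullmatch(r"cp\.\d{14}", cptag):
        raise ValueError(f"unexpected checkpoint tag: {cptag!r}")
    return [
        "cprestore",
        f"--hfscreatetime={system_id}",
        f"--ddr-server={ddr_server}",
        f"--ddr-user={ddr_user}",
        f"--cptag={cptag}",
    ]


cmd = build_cprestore_command(
    "1476126720", "dd.example.local", "ddboost-user", "cp.20161011033032"
)
print(shlex.join(cmd))
```

The script only prints the command; run it on the appliance yourself once you have confirmed the checkpoint is valid.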

The output would be:
Version: 1.11.1
Current working directory: /space/avamar/var
Log file: cprestore-cp.20161011033032.log
Checking node type.
Node type: single-node server
Create DD NFS Export: data/col1/avamar-1476126720/GSAN
ssh ddboost-user@ nfs add /data/col1/avamar-1476126720/GSAN "(ro,no_root_squash,no_all_squash,secure)"
Execute: ssh ddboost-user@ nfs add /data/col1/avamar-1476126720/GSAN "(ro,no_root_squash,no_all_squash,secure)"
Warning: Permanently added '' (RSA) to the list of known hosts.
Data Domain OS

Enter the Data Domain password when prompted. Once the password is authenticated, the cprestore starts. It copies the meta-data of the backups for the displayed checkpoint onto the 6.1.x appliance.

The output would be similar to:
[Thu Oct  6 08:24:44 2016] (22497) 'ddnfs_gsan/cp.20161011033032/data01/0000000000000015.chd' -> '/data01/cp.20161011033032/0000000000000015.chd'
[Thu Oct  6 08:24:44 2016] (22498) 'ddnfs_gsan/cp.20161011033032/data02/0000000000000019.wlg' -> '/data02/cp.20161011033032/0000000000000019.wlg'
[Thu Oct  6 08:24:44 2016] (22497) 'ddnfs_gsan/cp.20161011033032/data01/0000000000000015.wlg' -> '/data01/cp.20161011033032/0000000000000015.wlg'
[Thu Oct  6 08:24:44 2016] (22499) 'ddnfs_gsan/cp.20161011033032/data03/0000000000000014.wlg' -> '/data03/cp.20161011033032/0000000000000014.wlg'
[Thu Oct  6 08:24:44 2016] (22498) 'ddnfs_gsan/cp.20161011033032/data02/checkpoint-complete' -> '/data02/cp.20161011033032/checkpoint-complete'
[Thu Oct  6 08:24:44 2016] (22499) 'ddnfs_gsan/cp.20161011033032/data03/0000000000000016.chd' -> '/data03/cp.20161011033032/0000000000000016.chd'

This keeps going until all the meta-data is copied over. The duration of the cprestore process depends on the amount of backup data. Once the process is complete you will see the message below.

Restore data01 finished.
Cleanup restore for data01
Changing owner/group and permissions: /data01/cp.20161011033032
PID 22497 returned with exit code 0
Restore data03 finished.
Cleanup restore for data03
Changing owner/group and permissions: /data03/cp.20161011033032
PID 22499 returned with exit code 0
Finished restoring files in 00:00:04.
Restoring ddr_info.
Copy: 'ddnfs_gsan/cp.20161011033032/ddr_info' -> '/usr/local/avamar/var/ddr_info'
Unmount NFS path 'ddnfs_gsan' in 3 seconds
Execute: sudo umount "ddnfs_gsan"
Remove DD NFS Export: data/col1/avamar-1476126720/GSAN
ssh ddboost-user@ nfs del /data/col1/avamar-1476126720/GSAN
Execute: ssh ddboost-user@ nfs del /data/col1/avamar-1476126720/GSAN
Data Domain OS

Once the data domain password is entered, the cprestore process completes with a kthxbye message.
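On a large appliance the copy output scrolls past quickly, so it can help to scan the saved output (or the log file named at the start of the run) for restore workers that did not exit cleanly. This sketch assumes the "PID ... returned with exit code ..." wording shown above; `failed_workers` is a hypothetical helper.

```python
import re

EXIT_RE = re.compile(r"^PID (\d+) returned with exit code (\d+)$")


def failed_workers(log_text):
    """Map PID -> exit code for every worker that returned non-zero."""
    failures = {}
    for line in log_text.splitlines():
        match = EXIT_RE.match(line.strip())
        if match and int(match.group(2)) != 0:
            failures[int(match.group(1))] = int(match.group(2))
    return failures


SAMPLE = """\
Restore data01 finished.
PID 22497 returned with exit code 0
Restore data03 finished.
PID 22499 returned with exit code 0
Finished restoring files in 00:00:04.
"""

print(failed_workers(SAMPLE) or "all restore workers exited cleanly")
```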

9. Run the cplist command on the 6.1.x appliance and you should notice that the checkpoint displayed in the cp-backup-list output is now listed among the 6.1.x checkpoints:

cp.20161006013247 Thu Oct  6 07:02:47 2016   valid hfs ---  nodes   1/1 stripes     25
cp.20161011033032 Tue Oct 11 09:00:32 2016   valid rol ---  nodes   1/1 stripes     25

cp.20161006013247 is the 6.1.x appliance's local checkpoint and cp.20161011033032 is the source appliance's checkpoint, which was copied over from the Data Domain during the cprestore.

10. Once the restore is complete, we need to roll back to this checkpoint. First, stop all core services on the 6.1.x appliance using the command below:
# dpnctl stop
11. Initiate the force rollback using the below command:
# dpnctl start --force_rollback

You will see the following output:
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Action: starting all
Have you contacted Avamar Technical Support to ensure that this
  is the right thing to do?
Answering y(es) proceeds with starting all;
          n(o) or q(uit) exits
y(es), n(o), q(uit/exit):

Select yes (y) to initiate the rollback. The next set of output you will see is:

dpnctl: INFO: Checking that gsan was shut down cleanly...
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Here is the most recent available checkpoint:
  Tue Oct 11 03:30:32 2016 UTC Validated(type=rolling)
A rollback was requested.
The gsan was shut down cleanly.

The choices are as follows:
  1   roll back to the most recent checkpoint, whether or not validated
  2   roll back to the most recent validated checkpoint
  3   select a specific checkpoint to which to roll back
  4   restart, but do not roll back
  5   do not restart
  q   quit/exit

Choose option 3 and the next set of output you will see is:

Here is the list of available checkpoints:

     2   Thu Oct  6 01:32:47 2016 UTC Validated(type=full)
     1   Tue Oct 11 03:30:32 2016 UTC Validated(type=rolling)

Please select the number of a checkpoint to which to roll back.

     q   return to previous menu without selecting a checkpoint
(Entering an empty (blank) line twice quits/exits.)

In the earlier cplist output, cp.20161011033032 had a time-stamp of Oct 11, which corresponds to entry 1 in this list. So choose option (1) and the next output you will see is:
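Note that the menu prints times in UTC while cplist printed local time, so the two do not match character for character. Judging by the samples in this article, the digits in the checkpoint tag appear to be the UTC creation time, which makes it easy to map a tag to its menu entry. A small sketch under that assumption (the UTC+5:30 offset is inferred from the lab output here and will differ on your system):

```python
from datetime import datetime, timedelta, timezone


def cptag_to_utc(cptag):
    """Interpret the 14 digits of a checkpoint tag as a UTC timestamp."""
    if not cptag.startswith("cp."):
        raise ValueError(f"not a checkpoint tag: {cptag!r}")
    stamp = datetime.strptime(cptag[3:], "%Y%m%d%H%M%S")
    return stamp.replace(tzinfo=timezone.utc)


utc_time = cptag_to_utc("cp.20161011033032")
# Same layout as the rollback menu entry.
print(utc_time.strftime("%a %b %d %H:%M:%S %Y UTC"))

# Assumed lab offset of UTC+5:30, which reproduces the cplist time-stamp.
LAB_OFFSET = timezone(timedelta(hours=5, minutes=30))
local_time = utc_time.astimezone(LAB_OFFSET)
print(local_time.strftime("%a %b %d %H:%M:%S %Y"))
```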
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
You have selected this checkpoint:
  name:       cp.20161011033032
  date:       Tue Oct 11 03:30:32 2016 UTC
  validated:  yes
  age:        -7229 minutes

Roll back to this checkpoint?
Answering y(es)  accepts this checkpoint and initiates rollback
          n(o)   rejects this checkpoint and returns to the main menu
          q(uit) exits

Verify that this is indeed the right checkpoint and answer yes (y) to confirm. The GSAN and MCS rollback begins and you will notice this in the console:

dpnctl: INFO: rolling back to checkpoint "cp.20161011033032" and restarting the gsan succeeded.
dpnctl: INFO: gsan started.
dpnctl: INFO: Restoring MCS data...
dpnctl: INFO: MCS data restored.
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-24536
dpnctl: WARNING: 1 warning seen in output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/ --start"
dpnctl: INFO: MCS started.

**If this process fails, open a ticket with VMware support. I cannot provide the troubleshooting steps because they are confidential. If needed, add a note in your support ticket asking the assigned engineer to run it past me.**

If the rollback goes through successfully, you might be prompted to restore the local EMS (EM Tomcat) data.

Do you wish to do a restore of the local EMS data?

Answering y(es) will restore the local EMS data
          n(o) will leave the existing EMS data alone
          q(uit) exits with no further action.

Please consult with Avamar Technical Support before answering y(es).

Answer n(o) here unless you have a special need to restore
  the EMS data, e.g., you are restoring this node from scratch,
  or you know for a fact that you are having EMS database problems
  that require restoring the database.

y(es), n(o), q(uit/exit):

I would choose no as long as the database is not causing issues in my environment. After this, the remaining services are started. The output:

dpnctl: INFO: EM Tomcat started.
dpnctl: INFO: Resuming backup scheduler...
dpnctl: INFO: Backup scheduler resumed.
dpnctl: INFO: AvInstaller is already running.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

That should be pretty much it. When you log in to the https://vdp-ip:8543/vdp-configure page, you should see the Data Domain listed automatically in the Storage tab. If not, open a support ticket with VMware.

There are a couple of post-migration steps:
1. If you are using the internal proxy, un-register the proxy and re-register it from the VDP configure page.
2. External proxies (if used) will be orphaned, so you will have to delete the external proxies, change the VDP root password and re-add the external proxies.
3. If you are using guest-level backups, the agents for SQL, Exchange and SharePoint have to be re-installed.
4. If this appliance is replicating to another VDP appliance, the replication agent needs to be re-registered. Run the 4 commands below in the same order to perform this:
# service avagent-replicate stop
# service avagent-replicate unregister /MC_SYSTEM
# service avagent-replicate register /MC_SYSTEM
# service avagent-replicate start
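Because the order matters, you could wrap the four commands above in a small script that stops at the first failure. The following is just a sketch: `reregister_replication` is a made-up helper, it defaults to a dry run that only lists the commands, and it would need to run as root on the appliance to actually execute them.

```python
import subprocess

# The four commands from the list above, in the required order.
REPLICATE_STEPS = [
    ["service", "avagent-replicate", "stop"],
    ["service", "avagent-replicate", "unregister", "/MC_SYSTEM"],
    ["service", "avagent-replicate", "register", "/MC_SYSTEM"],
    ["service", "avagent-replicate", "start"],
]


def reregister_replication(dry_run=True):
    """Run (or, in a dry run, just list) the re-registration commands."""
    executed = []
    for cmd in REPLICATE_STEPS:
        executed.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError and stops the sequence
            # as soon as one command fails.
            subprocess.run(cmd, check=True)
    return executed


for line in reregister_replication(dry_run=True):
    print("#", line)
```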

And that should be it...