Tuesday, 25 October 2016

MCS Fails To Start On VDP. ERROR: gsan rollbacktime: xxxxxxx does not match stored rollbacktime: xxxxxxxx

Recently while working on a case, I came across the following issue. The MCS service was not coming up on a newly deployed VDP with existing drives. If I tried to start the MCS manually, the error received during this process was:

root@vdp58:#: dpnctl start mcs

Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-26291
dpnctl: ERROR: error return from "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start" - exit status 1
dpnctl: ERROR: 1 error seen in output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start"
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

And if I tailed the log that was displayed during the start attempt:
tail -f /tmp/dpnctl-mcs-start-output-26291

The actual error message was displayed:
ERROR: gsan rollbacktime: 1475722913 does not match stored rollbacktime: 1475722911

This occurs when GSAN has rolled back to a particular checkpoint but the MCS has not. 
Since these are not on the same rollbacktime the MCS service will not start. 

There are a couple of fixes available for this, and I would recommend you to start in the following order.

Fix 1:
Restore MCS

Run the below command to being MCS restore:
# dpnctl start mcs --force_mcs_restore
In most cases, this too fails. For me, it did, with the error:

root@vdp58:#: dpnctl start mcs --force_mcs_restore

Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Restoring MCS data...
dpnctl: ERROR: 1 error seen in output of "[ -r /etc/profile ] && . /etc/profile ; echo 'Y' | /usr/local/avamar/bin/mcserver.sh --restore --id='root' --hfsport='27000' --hfsaddr='' --password='*************'"
dpnctl: ERROR: MCS restore did not succeed, so not restarting MCS
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

If this worked for you and the MCS is restored and started successfully, then stop here. Else, move further. 

Fix 2:
Restore MCS to an older Flush

Basically, your MCS data is constantly backed up, and this is what is called as MCS flush. This is to protect the MCS from server or any hardware failures.
MCS flushes its data to the avamar server every 60 minutes as a part of system checkpoints. This is why, I would recommend you to roll back to a MCS flush which has a valid local checkpoint on that VDP server. So the more older MCS flush you roll back to, the more MCS data is lost. 

The local checkpoints in my case were:

root@vdp58:#: cplist

cp.20161020033059 Thu Oct 20 09:00:59 2016   valid rol ---  nodes   1/1 stripes     25
cp.20161020033339 Thu Oct 20 09:03:39 2016   valid --- ---  nodes   1/1 stripes     25

To list your MCS Flush, run the below command:
avtar --archives --path=MC_BACKUPS
The output is similar to:

   Date      Time    Seq       Label           Size     Plugin    Working directory         Targets
 ---------- -------- ----- ----------------- ---------- -------- --------------------- -------------------
 2016-10-20 15:25:20   372                      369201K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 14:45:20   371                      368582K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 13:45:18   370                      367645K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 12:45:17   369                      366716K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 11:45:19   368                      365779K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 10:45:17   367                      364842K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 09:45:17   366                      363762K Linux    /usr/local/avamar     var/mc/server_data

Here the numbers 372, 371.....are the MCS Flush labels. This list keeps going on till the day where the VDP appliance was deployed. 

I will rollback my appliance to Label 366

The command would be:
mcserver.sh --restore --labelnum=<flush_ID>
In my case:
mcserver.sh --restore --labelnum=366
This will start a small interactive script, where you need to accept the restore, provide the VDP IP to proceed further. Sample output:

root@vdp58:#: mcserver.sh --restore --labelnum=366

mcserver.sh must be run as admin, please login as admin and retry
root@vdp58:/usr/local/avamar/var/log/#: su admin
admin@vdp58:/usr/local/avamar/var/log/#: mcserver.sh --restore --labelnum=366
=== BEGIN === check.mcs (prerestore)
check.mcs                        passed
=== PASS === check.mcs PASSED OVERALL (prerestore)
--restore will modify your Administrator Server database and preferences.
Do you want to proceed with the restore Y/N? [Y]: y
Enter the Avamar Server IP address or fully qualified domain name to
restore from (i.e. dpn.your_company.com):
Enter the Avamar Server IP port to restore from [27000]:

The port will be default 27000. Post this, you will see a long list of logging of the mcsrestore task.
This is going to make certain changes to your MCS database.

If the restore to an older flush completes successfully, then start the MCS using:
mcserver.sh --start --verbose
This started the MCS successfully for me.

Now, I have also worked on a case, where the mcserver restore to an older flush completed with error / warnings causing the mcserver.sh --start to fail with the same error:

ERROR: gsan rollbacktime: 1475722913 does not match stored rollbacktime: 1475722911

You can try rolling to an even older MCS Flush and see how that goes. But, the chances are less that the MCS will ever come up. 

So if this fails, move to the next step:

Fix 3:
Update the MCS Database Manually. 

The last fix for this is to manually update the MCS database with the correct rollbacktime.

**This is a very tricky fix, and is not a best practice or a recommended method. If you are running a lab environment, then go ahead and try this. If you have production data at stake, stop! Involve EMC to check for other alternatives**

With that out of the way, the final fix would be in the order.

1. Connect to the MCS database. 

VDP is a SUSE box, and it runs a PostgreSQL database. The command would be the same as any to connect to the psql DB:
psql -p 5555 -U admin mcdb

The port for MCS database is 5555
We are connecting with admin user as we want to make certain changes on the MCS database. If you want to be in a view only mode then use the "viewuser" to connect to "mcdb"

2. Once you connect, you see the following message:

admin@vdp58:#: psql -p 5555 -U admin mcdb

Welcome to psql 8.3.23, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

3. Run \d to list the MCS tables. The one we are interested in is "property_value"

4. Run the below query to list all the contents of this table:
select * from property_value;
The output is similar to:

      property       |            value
 morning_cron_start  | -1
 evening_cron_start  | -1
 mcsnmp_cron_start   | 1
 clean_db_cron_start | 3
 rollbacktime        | 1475722911
 systemid            | 1476126720@00:50:56:B9:3E:6D
 hfscreatetime       | 1476126720
 systemname          | vdp58.vcloud.local
 restoredFlushTime   | 2016-10-10 19:45:00 PDT
 license_period_day  | 14
 license_buffer_pct  | 10
(11 rows)

The row that we are interested in is rollbacktime. Here we see the rollbacktime is 1475722911 which is not matching the GSAN rollback time of 1475722913

5. To update this, run the below query:
update property_value set value = <GSAN_rollbacktime> where property = 'rollbacktime';
So my query would look like:
update property_value set value = 1475722913 where property = 'rollbacktime'; 
Verify if the rollbacktime parameter is updated with the correct GSAN rollbacktime. 

6. Switch to admin mode of VDP appliance (su admin) and then start the MCS using:
mcserver.sh --start --verbose
This has to start the MCS as we have force synced the MCS. 

If this does not work too, then I do not know what else will.