Wednesday, 4 April 2018

Maintenance Task Fails On VDP When Connected To Data Domain

Maintenance tasks can fail on VDP for many reasons. This article is specific to VDP integrated with Data Domain, where the DD OS version is 6.1 or above.

The checkpoint and hfscheck tasks were completing without issues:
# dumpmaintlogs --types=cp | grep "<4"

2018/03/19-12:01:04.44235 {0.0} <4301> completed checkpoint maintenance
2018/03/19-12:04:17.71935 {0.0} <4300> starting scheduled checkpoint maintenance
2018/03/19-12:04:40.40012 {0.0} <4301> completed checkpoint maintenance

# dumpmaintlogs --types=hfscheck | grep "<4"

2018/03/18-12:00:59.49574 {0.0} <4002> starting scheduled hfscheck
2018/03/18-12:04:11.83316 {0.0} <4003> completed hfscheck of cp.20180318120037
2018/03/19-12:01:04.49357 {0.0} <4002> starting scheduled hfscheck
2018/03/19-12:04:16.59187 {0.0} <4003> completed hfscheck of cp.20180319120042

The garbage collection task was the one failing:
# dumpmaintlogs --types=gc --days=30 | grep "<4"

2018/03/18-12:00:22.29852 {0.0} <4200> starting scheduled garbage collection
2018/03/18-12:00:36.77421 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR
2018/03/19-12:00:23.91138 {0.0} <4200> starting scheduled garbage collection
2018/03/19-12:00:41.77701 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR
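
If you have many days of output to go through, a short script can pull out just the failed runs. The following is a minimal sketch, not an Avamar or VDP tool: it assumes you have saved the dumpmaintlogs output as text, and the function name failed_gc_runs is made up for illustration. It relies only on the line layout shown above.

```python
import re

# Matches lines shaped like the dumpmaintlogs output above:
# <timestamp> {<node>} <<event code>> <message>
LINE_RE = re.compile(r"^(\S+)\s+\{[^}]*\}\s+<(\d+)>\s+(.*)$")
FAILED_RE = re.compile(r"failed garbage collection with error (\S+)")


def failed_gc_runs(log_text):
    """Return (timestamp, error name) pairs for failed garbage collection runs."""
    failures = []
    for line in log_text.splitlines():
        line_match = LINE_RE.match(line.strip())
        if not line_match:
            continue
        failed = FAILED_RE.search(line_match.group(3))
        if failed:
            failures.append((line_match.group(1), failed.group(1)))
    return failures


SAMPLE = """\
2018/03/18-12:00:22.29852 {0.0} <4200> starting scheduled garbage collection
2018/03/18-12:00:36.77421 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR
2018/03/19-12:00:23.91138 {0.0} <4200> starting scheduled garbage collection
2018/03/19-12:00:41.77701 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR
"""

for timestamp, error in failed_gc_runs(SAMPLE):
    print(timestamp, error)
```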

The ddrmaint.log file, located under /usr/local/avamar/var/ddrmaintlogs, had the following entries:

Mar 18 12:00:31 VDP01 ddrmaint.bin[14667]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 18 12:00:34 VDP01 ddrmaint.bin[14667]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set

Mar 19 12:00:35 VDP01 ddrmaint.bin[13409]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 19 12:00:39 VDP01 ddrmaint.bin[13409]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set

In other words, garbage collection was failing to retrieve the checkpoint list from the Data Domain.
The get checkpoint list operation was failing as well:

Mar 20 11:16:50 VDP01 ddrmaint.bin[27852]: Error: cplist::body - auto checkpoint list failed result code: 0

Mar 20 11:16:50 VDP01 ddrmaint.bin[27852]: Error: <4750>Datadomain get checkpoint list operation failed.

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::execute_cplist: Failed to retrieve snapshot checkpoints from LSU: avamar-1488469814, ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::body - auto checkpoint list failed result code: 0

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: <4750>Datadomain get checkpoint list operation failed.
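
To see at a glance which Data Domain result codes show up in the error lines, the log can be tallied with a few lines of Python. This is only a sketch that assumes the ddrmaint.log line format shown above; ddr_error_codes is a hypothetical helper name, and the sample text is copied from the entries in this article.

```python
import re
from collections import Counter

# Captures the trailing "DDR result code: <n>, desc: <text>" part of a line.
RESULT_RE = re.compile(r"DDR result code: (\d+), desc: (.+)$")


def ddr_error_codes(log_text):
    """Count (result code, description) pairs found on Error lines only."""
    counts = Counter()
    for line in log_text.splitlines():
        # Info lines also carry a result code, so skip anything not tagged Error.
        if " Error: " not in line:
            continue
        match = RESULT_RE.search(line.strip())
        if match:
            counts[(int(match.group(1)), match.group(2).strip())] += 1
    return counts


SAMPLE = """\
Mar 18 12:00:31 VDP01 ddrmaint.bin[14667]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error
Mar 18 12:00:34 VDP01 ddrmaint.bin[14667]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set
Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::execute_cplist: Failed to retrieve snapshot checkpoints from LSU: avamar-1488469814, ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error
Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: <4750>Datadomain get checkpoint list operation failed.
"""

for (code, desc), count in ddr_error_codes(SAMPLE).items():
    print(code, desc, count)
```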

Listing the snapshots on the MTree (LSU) of this VDP server on the Data Domain showed that none of the checkpoints had expired:
# snapshot list mtree /data/col1/avamar-1488469814

Snapshot Information for MTree: /data/col1/avamar-1488469814
----------------------------------------------
Name                Pre-Comp (GiB)   Create Date         Retain Until   Status
-----------------   --------------   -----------------   ------------   ------
cp.20171220090039         128533.9   Dec 20 2017 09:00
cp.20171220090418         128543.0   Dec 20 2017 09:04
cp.20171221090040         131703.8   Dec 21 2017 09:00
cp.20171221090415         131712.9   Dec 21 2017 09:04
.
cp.20180318120414         161983.7   Mar 18 2018 12:04
cp.20180319120042         162263.9   Mar 19 2018 12:01
cp.20180319120418         162273.7   Mar 19 2018 12:04
cur.1515764908            125477.9   Jan 12 2018 13:49
-----------------   --------------   -----------------   ------------   ------
Snapshot Summary
-------------------
Total:          177
Not expired:    177
Expired:          0
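
If you capture this output regularly, the summary block is easy to check with a script. The sketch below assumes the "Total / Not expired / Expired" layout shown above and nothing else; snapshot_summary is a made-up helper name, not a Data Domain command.

```python
import re

# Matches the three counter lines of the "Snapshot Summary" block.
SUMMARY_RE = re.compile(r"^\s*(Total|Not expired|Expired):\s+(\d+)\s*$")


def snapshot_summary(output):
    """Return the summary counters from 'snapshot list mtree' output as a dict."""
    summary = {}
    for line in output.splitlines():
        match = SUMMARY_RE.match(line)
        if match:
            summary[match.group(1)] = int(match.group(2))
    return summary


SAMPLE = """\
Snapshot Summary
-------------------
Total:          177
Not expired:    177
Expired:          0
"""

summary = snapshot_summary(SAMPLE)
if summary.get("Total") and summary.get("Expired") == 0:
    print("No snapshots have expired out of", summary["Total"])
```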

Because of this, all the recent checkpoints on VDP were marked invalid:
# cplist

cp.20180228120038 Wed Feb 28 12:00:38 2018 invalid --- ---  nodes   1/1 stripes     76
.
cp.20180318120414 Sun Mar 18 12:04:14 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120042 Mon Mar 19 12:00:42 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120418 Mon Mar 19 12:04:18 2018 invalid --- ---  nodes   1/1 stripes     76
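
A quick way to list the invalid checkpoints from saved cplist output is sketched below. It assumes the column layout shown above, where the state is the seventh whitespace-separated field after the checkpoint tag and the date; checkpoint_states is a hypothetical helper name.

```python
def checkpoint_states(cplist_output):
    """Map each checkpoint tag (cp.*) to the state column of its cplist line."""
    states = {}
    for line in cplist_output.splitlines():
        fields = line.split()
        # Expected fields: tag, weekday, month, day, time, year, state, ...
        if len(fields) >= 7 and fields[0].startswith("cp."):
            states[fields[0]] = fields[6]
    return states


SAMPLE = """\
cp.20180318120414 Sun Mar 18 12:04:14 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120042 Mon Mar 19 12:00:42 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120418 Mon Mar 19 12:04:18 2018 invalid --- ---  nodes   1/1 stripes     76
"""

invalid = [tag for tag, state in checkpoint_states(SAMPLE).items() if state == "invalid"]
print(len(invalid), "invalid checkpoints:", ", ".join(invalid))
```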

In this case, the VDP version was 6.1.x and the Data Domain OS version was also 6.1:
# ddrmaint read-ddr-info --format=full

====================== Read-DDR-Info ======================
System name        : xxx.xxxx.xxxx
System ID          : Bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx4
DDBoost user       : ddboost
System index       : 1
Replication        : True
CP Backup          : True
Model number       : DDxxx
Serialno           : Cxxxxxxxx
DDOS version       : 6.1.0.21-579789
System attached    : 1970-01-01 00:00:00 (0)
System max streams : 16

DD OS 6.1 is not supported with VDP 6.1.x. DD OS 6.0.x is the last version supported with VDP.
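
The version check can be scripted against saved "ddrmaint read-ddr-info --format=full" output. The sketch below is only an illustration: the helper names are made up, and the rule it encodes (anything newer than 6.0.x is unsupported) is simply the statement from this article, not an official compatibility matrix. It does not check for a minimum supported version.

```python
import re

# Picks the major and minor numbers out of a line like:
# DDOS version       : 6.1.0.21-579789
VERSION_RE = re.compile(r"^DDOS version\s*:\s*(\d+)\.(\d+)", re.MULTILINE)


def ddos_version(read_ddr_info_output):
    """Return (major, minor) of the DD OS version, or None if the line is missing."""
    match = VERSION_RE.search(read_ddr_info_output)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))


def within_vdp_limit(version):
    """True if the version is 6.0.x or older, the upper limit stated in this article."""
    return version <= (6, 0)


SAMPLE = """\
Serialno           : Cxxxxxxxx
DDOS version       : 6.1.0.21-579789
System max streams : 16
"""

version = ddos_version(SAMPLE)
if version is not None and not within_vdp_limit(version):
    print("DD OS %d.%d is newer than 6.0.x" % version)
```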

So if your DD OS is on 6.1.x, the choices are:
> Migrate the VDP to Avamar Virtual Edition (Recommended)
> Roll back DD OS to 6.0.x

Hope this helps!