Apr 05

Auditing and repairing a deduplicated file storage pool

Technote (troubleshooting)

Problem (Abstract)

Under some circumstances, an audit of a deduplicated storage pool might be necessary. Two utilities, the dedupAuditTool.pl and dedupRepairTool.pl scripts, are now available. These can be used to audit the deduplication tables to verify referential integrity and proper chunk linkage, and to clean up files that were previously marked as damaged.

Symptom

The following messages are primary indicators of a potential problem within the deduplication catalog:

ANR4895E
ANR1165E
ANR1529I
ANR1162W
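
As a rough illustration, the indicator messages listed above can be counted in an exported activity log. The following Python sketch assumes the log has been saved as plain text with one message per line; the function name and the sample lines are hypothetical and are not part of the attached tools.

```python
import re
from collections import Counter

# Message IDs listed above as indicators of a deduplication catalog problem.
INDICATORS = ("ANR4895E", "ANR1165E", "ANR1529I", "ANR1162W")


def count_indicators(lines):
    """Count how often each indicator message ID appears in the given lines."""
    pattern = re.compile(r"\b(" + "|".join(INDICATORS) + r")\b")
    counts = Counter()
    for line in lines:
        for match in pattern.findall(line):
            counts[match] += 1
    return counts


# Hypothetical sample lines; real activity log text will look different.
sample = [
    "ANR4895E (placeholder message text)",
    "ANR0984I (placeholder message text)",
    "ANR1162W (placeholder message text)",
    "ANR4895E (placeholder message text)",
]
print(dict(count_indicators(sample)))
```

A non-zero count is only a hint that an audit may be worthwhile; it says nothing about the extent of any damage.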

 

Cause

The most common cause of invalidated links within a deduplicated storage pool is the forcible removal of data chunks by using the “DELETE VOLUME VolumeName DISCARDDATA=YES” command. This command removes data chunks from the database and media regardless of dependencies within the catalog, which creates invalidated links. As a result, you must follow the DELETE VOLUME DISCARDDATA=YES command with an audit against the ANR4895E message to clean up the invalidated links. Until the audit is run, you will see ANR4895E or ANR1162W messages in the activity log that indicate that there are invalidated links. See the “Recovering from lost or damaged FILE volumes in deduplicated storage pool” technote for in-depth cleanup instructions for this situation.

Another potential cause of invalidated links and referential integrity errors is DB2® table damage resulting from a hardware failure. In this case, you can use the DB2DART utility to restore structural integrity. Then use the dedupAuditTool Perl script attached to this document to identify referential integrity problems.

The following list includes APARs that describe how links can be invalidated during data movement operations:

IC90390
IC96993

To prevent the problems which are documented in these APARs from occurring in your deduplication environment, ensure you are using server version 6.3.5.100 or later.
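
The version requirement above can be checked with a simple comparison of dotted version strings. This Python sketch is illustrative only: the function names are hypothetical, and a plain numeric comparison does not verify that a particular later release stream actually contains the APAR fixes.

```python
def parse_version(text):
    """Turn a dotted version string such as '6.3.5.100' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum server level named in this document.
MINIMUM_LEVEL = parse_version("6.3.5.100")


def meets_minimum(server_version):
    """Return True if server_version compares as 6.3.5.100 or later."""
    version = parse_version(server_version)
    # Pad the shorter tuple with zeros so '7.1.3' compares like '7.1.3.0'.
    width = max(len(version), len(MINIMUM_LEVEL))
    padded = version + (0,) * (width - len(version))
    minimum = MINIMUM_LEVEL + (0,) * (width - len(MINIMUM_LEVEL))
    return padded >= minimum


print(meets_minimum("6.3.5.0"))    # False
print(meets_minimum("6.3.5.100"))  # True
print(meets_minimum("7.1.3"))      # True
```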

Environment

This applies to IBM Spectrum Protect™ servers with a DEVType=FILE storage pool that has deduplicated data. The data can be deduplicated either by client-side or server-side (identify processing) deduplication.

Beginning with Version 7.1.3, IBM® Tivoli® Storage Manager is now IBM Spectrum Protect. Some applications such as the software fulfillment systems and IBM License Metric Tool use the new product name. However, the software and its product documentation continue to use the Tivoli Storage Manager product name. To learn more about the rebranding transition, see technote 1963634.

Diagnosing the problem

Use the dedupAuditTool Perl script that is attached to this document.

Resolving the problem

Overview

Note: The dedupRepairTool.pl script currently does not support container storage pools. If you are running server version 7.1.3 or later and have configured container storage pools, do not run the dedupRepairTool.pl script against any storage pools on such servers.

There can be various reasons for encountering server messages that indicate damage within a deduplicated storage pool. The first step to resolving any deduplication catalog inconsistency is to run an analyzer against the storage pool to determine where the problem might be. Then you would complete the steps to resolve the error. The attached dedupAuditTool.pl script does the analyzing.

It is important to note that the dedupAuditTool.pl script is not intended to be an overall health check for deduplication-enabled storage pools. You should only run the tool if you have experienced one of the symptoms described in this article.

The dedupAuditTool.pl is a Perl script that performs the analysis phase. The analysis phase accepts environment and symptom information from the user. With this information, the tool interrogates the IBM Spectrum Protect tables that are associated with the specified symptom. At the end of the analysis phase, an audit report is generated. Send this report to IBM support for analysis. If IBM support determines that the problems identified in the report need correcting, they will provide you with another tool, dedupRepairTool.pl, to perform the recovery phase. The recovery phase removes the damage, restores from a copy source, or potentially both, depending on whether each object can be salvaged.

Script Details

General Script Information:

  • Perl must be installed on the system where the IBM Spectrum Protect server resides. Because the Perl scripts interrogate the DB2 database directly, they must be run from this system. To obtain code for the Perl installation, go to http://www.perl.org
  • The tools must be run with a user ID that has access to the DB2 instance.
  • The instance user must have FULL read/write access to the directory where the scripts are being executed and the directory where the client is installed (for example, /opt/tivoli/tsm/client/ba/bin64).
  • The tools require access to both DB2 and IBM Spectrum Protect by using the dsmadmc and DB2 CLIs.
  • On Windows systems, initialize the DB2 command line environment prior to executing the scripts by issuing “db2cmd” from a Windows administrative command prompt.
  • You must use Perl 5.8.0 or later (a Perl interpreter can be downloaded from www.perl.org).
  • Different versions of the dedupAuditTool scripts are available depending on which interpreter is available
    – dedupAuditTool.pl (5.10 or later)
    – dedupAuditTool_perl_v580 (5.8.0 – 5.10)
  • The tools are interactive scripts. They do not accept command line parameters.
  • The tools are multi-phase scripts with the following phases: Environment Setup, Audit Tool Setup, Audit Analysis and Report Generation, and Cleanup of Deduplicated Storage Pool. The dedupAuditTool.pl script does not include the final phase.
  • It may be necessary to run the dedupAuditTool.pl and dedupRepairTool.pl multiple times in order to restore the referential integrity of the database.
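
The choice between the two audit script variants depends on the Perl level. The Python sketch below assumes the version is available in Perl's numeric form, as in the "runtime perl version: 5.010001" line of the script examples, and it assumes that Perl 5.10 itself maps to dedupAuditTool.pl. The function names and the "5.8.0 variant" label are hypothetical.

```python
def parse_perl_numeric_version(text):
    """Convert Perl's numeric version form (e.g. '5.010001') to (5, 10, 1)."""
    major, _, fraction = text.strip().partition(".")
    fraction = fraction.ljust(6, "0")  # '008' becomes '008000'
    return int(major), int(fraction[:3]), int(fraction[3:6])


def audit_script_variant(perl_version):
    """Pick the audit script variant for a Perl level, or None if too old."""
    version = parse_perl_numeric_version(perl_version)
    if version >= (5, 10, 0):
        return "dedupAuditTool.pl"
    if version >= (5, 8, 0):
        return "5.8.0 variant"  # label only; use the attached 5.8.0 script
    return None


print(audit_script_variant("5.010001"))
print(audit_script_variant("5.008008"))
print(audit_script_variant("5.006001"))
```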

How Are The Scripts Executed?:
perl dedupAuditTool.pl
perl dedupRepairTool.pl

Important Usage Notes:

  • An AUDIT VOLUME command should be run against all suspect volume(s) PRIOR to running the script dedupRepairTool.pl in RECOVERY mode. This will ensure that all files currently marked as damaged are validated as truly damaged before they are forcibly deleted.
  • If a DELETE VOLUME DISCARDDATA=YES command has been issued against a volume that is located in a deduplicated storage pool, the dedupRepairTool.pl script should be run immediately afterward with the ANR4895E symptom to resolve all of the invalidated links. Contact IBM support for access to dedupRepairTool.pl.
  • The script should be run only in a deduplicated environment for which the following requirements are all true (the script will prompt for these as well):
    – A current FULL database backup is available.
    – The IBM Spectrum Protect server is at 6.3.5.100 or later.
    – The dedupRepairTool.pl script should only be run against primary storage pools. Do not run the script for repairs against deduplicated copy pools and active data pools. The dedupAuditTool.pl script can be run against these pools, but the CHUNKS_NOT_CATALOGED category in the report it generates may include false positives for active data pools.
    – All data movement and expiration activity for the deduplicated storage pool has been quiesced; refer to the section “Preparation for running script” below for details. If the storage pool is not quiesced, the script may report false-positive results, run slowly or severely degrade server performance, or cause deadlocks in server operations.
  • It is typically recommended to run the RECOVERY mode with BOTH so that all restorable objects are recovered from a copy source and all non-restorable objects are removed. This will result in the fastest and most straightforward cleanup possible.

Audit Details
Audit Symptom Categories and Coverage Details:
– ANR4895E
An invalidated deduplication chunk link has been found.

– ANR1165E and ANR1162W
A damaged file has been found in a deduplicated storage pool.

– ANR1529I
Damaged and expired base chunks have been found in a deduplicated storage pool.

– MISSING_EXTENT_ENTRY
Category that checks for any deduplicated object that is no longer cataloged.

– ORPHANED_EXTENT
Category that checks for an invalidated chunk that only exists in the deduplication catalog table.

– INVALIDATED_LINKS
Category that checks for any invalidated links that might exist in the deduplicated storage pool.

– MISSING_EXTENT_ENTRY
Category that checks for any deduplicated object which no longer has an extent in the deduplication catalog.

– MISSING_AF_ENTRIES
Category that checks for any deduplicated extent that no longer has a corresponding entry in the volume tracking table.

– MISSING_VOL_ENTRIES
Category that checks for a deduplicated object that no longer has a valid volume associated to it.

– ZERO_LENGTH_CHUNKS
Category that checks for any base data chunks (not links) that no longer contain actual storage references.

– MISSING_CHUNKS
Category that checks for an object that has been deduplicated but no longer has entries in the database deduplication catalog.

– MISSING_CHUNKS_EXTENDED
Category that performs additional resource intensive checks beyond what the MISSING_CHUNKS symptom covers.

– ALL
The ALL category will go through all of the above symptoms plus a few others that are not documented above. This category can run for a long period of time, depending on the size of the database and the deduplicated storage pool.

Audit Phase Options (dedupAuditTool.pl, dedupRepairTool.pl):
* The deduplication catalog can be analyzed either by auditing a symptom or by processing a previously generated audit file.

Recovery Phase Options (dedupRepairTool.pl only):
* The following cleanup phase options are available. The cleanup phase occurs after the analysis is completed:
– Either an audit file can be generated for future processing, or the script can enter a recovery procedure. If the recovery procedure is chosen, the following modes are available:
– Restore/Delete
Restoration is attempted for all invalid data. If data cannot be restored, it is then removed from the IBM Spectrum Protect server.
– Restore Only
Restoration from a copy pool is attempted for all invalid data.
– Delete Only
All invalid data is removed from the IBM Spectrum Protect server.
– Preview
All recovery steps, including restoration and deletion, are walked through, but no action is performed. The console displays each step that would be taken if the recovery were executed without PREVIEW.
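
To make the difference between the recovery modes explicit, the Python sketch below models them as combinations of three flags. The letters follow the prompt shown in the script examples (restore/delete[B], restore[R], delete[D], preview[P]); the data structure and function are hypothetical and do not reflect how dedupRepairTool.pl is implemented.

```python
from typing import NamedTuple


class RecoveryMode(NamedTuple):
    restore: bool  # attempt restoration from a copy pool
    delete: bool   # remove invalid data from the server
    preview: bool  # only display the steps, perform no action


# One entry per mode letter offered by the recovery prompt.
MODES = {
    "B": RecoveryMode(restore=True, delete=True, preview=False),
    "R": RecoveryMode(restore=True, delete=False, preview=False),
    "D": RecoveryMode(restore=False, delete=True, preview=False),
    "P": RecoveryMode(restore=True, delete=True, preview=True),
}


def describe(letter):
    """Return a one-line summary of what the chosen mode does."""
    mode = MODES[letter.upper()]
    actions = [name for name in ("restore", "delete") if getattr(mode, name)]
    prefix = "would " if mode.preview else "will "
    return prefix + " and ".join(actions)


for letter in "BRDP":
    print(letter, "->", describe(letter))
```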

Preparation for running script

    Before proceeding with the cleanup procedure, perform the following:

    • Take a current FULL database backup.
    • Quiesce all data movement and deletion activity in the deduplicated storage pool. This includes the following processes:
        • Expiration
        • Reclamation
        • Migration
        • Move Data
        • Move NodeData
        • Identify Duplicates
        • Backup/Archive
    • To accomplish this, add the following options to the server options file and restart the server:
      • NOMIGRRECL
      • EXPINT 0
      • DISABLESCHEDS YES
    • Prevent all processing in the deduplication enabled storage pool by issuing the following commands:
      • DISABLE SESSIONS
      • IDENTIFY DUPLICATES storage_pool_name NUMPROC=0
    • IMPORTANT NOTE: Once you have resolved all issues, reset all options and allow processing in the storage pool to resume by performing the following:
      • Remove the following options you added above:
        • NOMIGRRECL
        • EXPINT 0
        • DISABLESCHEDS YES
      • Restart the server and issue the following command:
        • ENABLE SESSIONS
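
The preparation and reset steps above can be summarized as data. This Python sketch only assembles the option lines and administrative commands named in this section; it does not edit any file or contact a server. The function names are hypothetical, and the cleanup helper assumes that none of the three options were present in the options file before the quiesce.

```python
# Server options named in the preparation steps above.
QUIESCE_OPTIONS = ["NOMIGRRECL", "EXPINT 0", "DISABLESCHEDS YES"]


def quiesce_commands(pool_name):
    """Administrative commands to issue after restarting with the options."""
    return ["DISABLE SESSIONS", f"IDENTIFY DUPLICATES {pool_name} NUMPROC=0"]


def resume_commands():
    """Administrative commands to issue once the options are removed."""
    return ["ENABLE SESSIONS"]


def strip_quiesce_options(option_lines):
    """Return the option file lines without the temporary quiesce options."""
    wanted = {option.upper() for option in QUIESCE_OPTIONS}
    return [
        line for line in option_lines
        if " ".join(line.split()).upper() not in wanted
    ]


print(quiesce_commands("FILEPOOL"))
print(strip_quiesce_options(["COMMMETHOD TCPIP", "NOMIGRRECL", "EXPINT 0"]))
```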

Script Examples (contains both dedupAuditTool.pl and dedupRepairTool.pl output)

AUDIT OF ANR4895E SYMPTOM (Invalid Links Found)

***********************************************************************
Welcome to dedupAuditTool!
The runtime perl version: 5.010001

Attention!!

Before proceeding with the cleanup procedure, make sure the
following statements are true:

– A current FULL database backup is available!

– All data movement and deletion activity in the deduplicated
pool has been quiesced!
This would include the following processes:
Expiration
Reclamation
Migration
Move Data
Move NodeData
Backup/Archive

– The Tivoli Storage Manager server is at 6.3.5.100 or higher!
************************************************************************
Continue with the deduplication audit? (Y/N): <def: N>: Y

***********************************************************
Phase 1: Environment Setup
***********************************************************
Input the db2 database name <def: TSMDB1>: MGSUNC
Input the db2 schema name <def: TSMDB1>: TSMDB1
The connection to DB2 succeeded
Input the dsmadmc path(double quote the path) < def: “/opt/tivoli/tsm/client/ba/bin64”>:
Input the admin name <def: admin>:admin
Input the password for admin <def: admin>:admin
Input the option file(double quote the path) < def: “/opt/tivoli/tsm/client/ba/bin64/dsm.opt”>:
Connection to DISCO_A_SRV succeeded.

Finding all deduplicated storage pools
.
.

***********************************
DEDUPLICATED STORAGE POOLS:

FILEPOOL
FILENEXTPOOL
***********************************

Input the poolid or poolname to process <def: deduppool>: FILEPOOL
The DB2 command has started
The DB2 command succeeded
Conversion between poolid(4) AND poolname(FILEPOOL) succeeded

***********************************************************
Phase 2: Dedup Audit Tool Setup
***********************************************************

Auditing with script

Would you like to see all available symptoms that can be audited? <def: Y>: Y

*********AUDIT SYMPTOMS*********
ANR1165E
ANR1529I
ANR1162W
ALL (INCLUDES EVERYTHING BELOW):
ANR4895E
MISSING_EXTENT_ENTRIES
MISSING_SEGMENT_ENTRIES
MISSING_AF_ENTRIES
ORPHANED_EXTENT
MISSING_VOL_ENTRIES
INVALIDATED_LINKS
ZERO_LENGTH_CHUNKS
MISSING_CHUNKS
MISSING_CHUNKS_EXTENDED
*********AUDIT SYMPTOMS*********

Input the symptom, or error message, you want to check <def: ANR4895E>: ANR4895E

Checking to see if there are active dedup deletions occurring.
If there are, the audit will wait until the deletions are finished.

Command #1 to perform is:
tsm show dedupdeleteinfo
The TSM command has started
The TSM command succeeded
Deduplicated chunks are not being deleted at this time, proceeding….

***********************************************************
Phase 3: Auditing Deduplicated Storage Pool FILEPOOL
***********************************************************
The dedup audit tool is running in ANR4895E mode
The audit is in processing phase 1: ZERO_LENGTH_BASE_CHUNKS
The DB2 command has started
The DB2 command succeeded
The audit scan found nothing wrong in phase 1: ZERO_LENGTH_BASE_CHUNKS

The audit is in processing phase 2: INVALIDATED_DEDUP_LINKS
The DB2 command has started
The DB2 command succeeded
The audit scan has detected a potential problem in phase 2: INVALIDATED_DEDUP_LINKS!

The audit is in processing phase 3: DAMAGED_DATA
The DB2 command has started
The DB2 command succeeded
The audit scan found nothing wrong in phase 3: DAMAGED_DATA

***********************************************************
Phase 4: Cleanup of the Deduplicated Storage Pool
Cleanup procedures to perform include: INVALIDATED_DEDUP_LINKS
***********************************************************

Perform recovery procedure[RP] or generate an audit file[AF] <def: AF>: AF
Input the report file name <def: reportdedup.out>:
The report file has been generated.
Check reportdedup.out for the results.

EXAMPLE OF AUDIT FILE (Invalid Links Found)

******************************************************************
SYMPTOM: INVALIDATED_DEDUP_LINKS
DB2SQL:
"select count(*) from bf_bitfile_extents bfbe where bfbe.srvid=0 and bfbe.poolid=4 and bfbe.linkbfid=9223372036854775807"
RESULT: 10
STORAGE POOL NAME(POOLID): FILEPOOL(4)
******************************************************************
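
An audit file like the one above is simple enough to read programmatically before sending it to IBM support. The Python sketch below assumes the layout shown in this example (SYMPTOM, DB2SQL, RESULT, and STORAGE POOL NAME(POOLID) fields between rows of asterisks) and assumes that a hard wrap may split the SQL text mid-word. The function name and the dictionary keys are hypothetical.

```python
def parse_audit_report(text):
    """Parse audit report text into a list of dicts, one per SYMPTOM block."""
    records, current, in_sql = [], None, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"*"}:
            continue  # skip blank lines and separator rows
        if line.startswith("SYMPTOM:"):
            current = {"symptom": line.split(":", 1)[1].strip(), "sql": ""}
            records.append(current)
            in_sql = False
        elif current is None:
            continue  # ignore anything before the first SYMPTOM line
        elif line.startswith("DB2SQL:"):
            current["sql"] = line.split(":", 1)[1].strip()
            in_sql = True
        elif line.startswith("RESULT:"):
            current["result"] = int(line.split(":", 1)[1])
            in_sql = False
        elif line.startswith("STORAGE POOL NAME(POOLID):"):
            name, _, pool_id = line.split(":", 1)[1].strip().partition("(")
            current["pool_name"] = name
            current["pool_id"] = int(pool_id.rstrip(")"))
            in_sql = False
        elif in_sql:
            current["sql"] += line  # rejoin hard-wrapped SQL text
    return records


sample = """\
******************************************************************
SYMPTOM: INVALIDATED_DEDUP_LINKS
DB2SQL:
"select count(*) from bf_bitfile_extents bfbe where bfbe.srvid=0 and bfb
e.poolid=4 and bfbe.linkbfid=9223372036854775807"
RESULT: 10
STORAGE POOL NAME(POOLID): FILEPOOL(4)
******************************************************************
"""
for record in parse_audit_report(sample):
    print(record["symptom"], record["pool_name"], record["result"])
```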

RECOVERY OF INVALID LINKS (Using Audit File with dedupRepairTool ONLY)

***********************************************************************
Welcome to dedupAuditTool!
The runtime perl version: 5.010001

Attention!!

Before proceeding with the cleanup procedure, make sure the
following statements are true:

– A current FULL database backup is available!

– All data movement and deletion activity in the deduplicated
pool has been quiesced!
This would include the following processes:
Expiration
Reclamation
Migration
Move Data
Move NodeData
Backup/Archive

– The Tivoli Storage Manager server is at 6.3.5.100 or higher!
************************************************************************
Continue with the deduplication audit? (Y/N): <def: N>: Y

***********************************************************
Phase 1: Environment Setup
***********************************************************
Input the db2 database name <def: TSMDB1>: MGSUNC
Input the db2 schema name <def: TSMDB1>: TSMDB1
The connection to DB2 succeeded
Input the dsmadmc path(double quote the path) < def: “/opt/tivoli/tsm/client/ba/bin64”>:
Input the admin name <def: admin>: ADMIN
Input the password for admin <def: admin>: ADMIN
Input the option file(double quote the path) < def: “/opt/tivoli/tsm/client/ba/bin64/dsm.opt”>:
Connection to MSISCO_A_SRV succeeded.

Finding all deduplicated storage pools
.
.

***********************************
DEDUPLICATED STORAGE POOLS:

FILEPOOL
FILENEXTPOOL
***********************************

Input the poolid or poolname to process <def: deduppool>: FILEPOOL
The DB2 command has started
The DB2 command succeeded
Conversion between poolid(4) AND poolname(FILEPOOL) succeeded

***********************************************************
Phase 2: Dedup Audit Tool Setup
***********************************************************

Audit with script[A] or by previously generated audit file[F]? <def: A>: F
Input the report file name < def: reportdedup.out>: reportdedup.out
Read report file reportdedup.out …
The report file( reportdedup.out ) processing was successful

***********************************************************
Phase 4: Cleanup of the Deduplicated Storage Pool
Cleanup procedures to perform include: INVALIDATED_DEDUP_LINKS
***********************************************************

Perform recovery procedure now? <def: N>: Y
showdamagedoutput.out exists from previous audit. Remove the file? (Y/N): <def: Y>: Y

Perform INVALIDATED_DEDUP_LINKS recovery procedure? (Y/N):< def: Y> Y

***********************************************************
Starting recovery procedure for INVALIDATED_DEDUP_LINKS
***********************************************************

Perform restore/delete[B], restore[R], delete[D], preview[P] operation <def: R>: D

Performing DB setup required for recovery operation of the INVALIDATED_DEDUP_LINKS.

Command #1 to perform is:
db2 "select 'show bfo ' || bfid from af_damaged where poolid=4" => showobject.mac
The DB2 command has started
The DB2 command succeeded

Command #2 to perform is:
tsm -itemcommit macro showobject.mac => showdamagedoutput.out
The TSM command has started
The TSM command succeeded

Command #3 to perform is:
db2 "delete from AF_DAMAGED where srvid=0 and poolid=4"
The DB2 command has started
The DB2 command succeeded

Check the showdamagedoutput.out file for any previously marked damaged files.

Performing the recovery operation for INVALIDATED_DEDUP_LINKS corruption

Command #4 to perform is:
db2 "delete from AF_DAMAGED where srvid=0 and poolid=4"
The DB2 command has started
The DB2 command succeeded

Command #5 to perform is:
tsm restore stgpool FILEPOOL w=y
The TSM command has started
The TSM command succeeded

Command #6 to perform is:
db2 "select 'delete object ' || cast( bfbf.owner as char(24) ) || ' force=yes' from bf_bitfile_extents bfbe left join bf_aggregated_bitfiles bfbf on ( bfbe.srvid=bfbf.srvid and bfbe.bfid=bfbf.bfid and bfbe.superbfid=bfbf.superbfid ) where bfbe.srvid=0 and bfbe.poolid=4 and bfbe.linkbfid=9223372036854775807 and bfbf.srvid is not NULL group by bfbf.owner" => deleteobject.mac
The DB2 command has started
The DB2 command succeeded

Command #7 to perform is:
tsm -itemcommit macro deleteobject.mac
The TSM command has started
The TSM command succeeded

Command #8 to perform is:
tsm show dedupdeleteinfo
The TSM command has started
The TSM command succeeded
Deduplicated chunks are not being deleted at this time, proceeding….

Command #9 to perform is:
db2 "insert into AF_DAMAGED ( srvid, bfid, poolid, updator ) (select distinct 0, superbfid, poolid, 2 from bf_bitfile_extents where srvid=0 and poolid=4 and linkbfid=9223372036854775807 and bfid!=9223372036854775807 )"
The DB2 command has started
The DB2 command succeeded

Command #10 to perform is:
db2 "select 'show bfo ' || bfid from af_damaged where poolid=4" => showobject.mac
The DB2 command has started
The DB2 command succeeded

Command #11 to perform is:
tsm -itemcommit macro showobject.mac => showdamagedoutput.out
The TSM command has started
The TSM command succeeded

Check current entries in showdamagedoutput.out for any remaining invalid objects.
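
The linkbfid value that recurs in the queries above is the largest signed 64-bit integer, which the queries appear to treat as the marker for an invalidated link. The Python sketch below checks that arithmetic and rebuilds the count query from the example audit file for an arbitrary pool ID. It is meant only to make the structure of the query easier to read; the function name is hypothetical, and running SQL against the server database outside of the tools should be left to IBM support guidance.

```python
# The link value used in the queries above equals 2**63 - 1,
# the largest value a signed 64-bit integer can hold.
INVALID_LINK_BFID = 9223372036854775807
assert INVALID_LINK_BFID == 2**63 - 1


def invalidated_link_count_query(pool_id, srv_id=0):
    """Build the count query shown in the example audit file for one pool."""
    pool_id, srv_id = int(pool_id), int(srv_id)  # reject non-numeric input
    return (
        "select count(*) from bf_bitfile_extents bfbe "
        f"where bfbe.srvid={srv_id} and bfbe.poolid={pool_id} "
        f"and bfbe.linkbfid={INVALID_LINK_BFID}"
    )


print(invalidated_link_count_query(4))
```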

AUDIT OF ANR4895E SYMPTOM (No Problems Found)

***********************************************************************
Welcome to dedupAuditTool!
The runtime perl version: 5.010001

Attention!!

Before proceeding with the cleanup procedure, make sure the
following statements are true:

– A current FULL database backup is available!

– All data movement and deletion activity in the deduplicated
pool has been quiesced!
This would include the following processes:
Expiration
Reclamation
Migration
Move Data
Move NodeData
Backup/Archive

– The Tivoli Storage Manager server is at 6.3.5.100 or higher!
************************************************************************
Continue with the deduplication audit? (Y/N): <def: N>: Y

***********************************************************
Phase 1: Environment Setup
***********************************************************
Input the db2 database name <def: TSMDB1>: MGSUNC
Input the db2 schema name <def: TSMDB1>: TSMDB1
The connection to DB2 succeeded
Input the dsmadmc path(double quote the path) < def: “/opt/tivoli/tsm/client/ba/bin64”>:
Input the admin name <def: admin>: ADMIN
Input the password for ADMIN <def: admin>: ADMIN
Input the option file(double quote the path) < def: “/opt/tivoli/tsm/client/ba/bin64/dsm.opt”>:
Connection to MSISCO_A_SRV succeeded.

Finding all deduplicated storage pools
.
.

***********************************
DEDUPLICATED STORAGE POOLS:

FILEPOOL
FILENEXTPOOL
***********************************

Input the poolid or poolname to process <def: deduppool>: FILEPOOL
The DB2 command has started
The DB2 command succeeded
Conversion between poolid(4) AND poolname(FILEPOOL) succeeded

***********************************************************
Phase 2: Dedup Audit Tool Setup
***********************************************************

Auditing with script

Would you like to see all available symptoms that can be audited? <def: Y>: Y

*********AUDIT SYMPTOMS*********
ANR1165E
ANR1529I
ANR1162W
ALL (INCLUDES EVERYTHING BELOW):
ANR4895E
MISSING_EXTENT_ENTRIES
MISSING_SEGMENT_ENTRIES
MISSING_AF_ENTRIES
ORPHANED_EXTENT
MISSING_VOL_ENTRIES
INVALIDATED_LINKS
ZERO_LENGTH_CHUNKS
MISSING_CHUNKS
MISSING_CHUNKS_EXTENDED
*********AUDIT SYMPTOMS*********

Input the symptom, or error message, you want to check <def: ANR4895E>: ANR4895E

Checking to see if there are active dedup deletions occurring.
If there are, the audit will wait until the deletions are finished.

Command #1 to perform is:
tsm show dedupdeleteinfo
The TSM command has started
The TSM command succeeded
Deduplicated chunks are not being deleted at this time, proceeding….

***********************************************************
Phase 3: Auditing Deduplicated Storage Pool FILEPOOL
***********************************************************
The dedup audit tool is running in ANR4895E mode
The audit is in processing phase 1: ZERO_LENGTH_BASE_CHUNKS
The DB2 command has started
The DB2 command succeeded
The audit scan found nothing wrong in phase 1: ZERO_LENGTH_BASE_CHUNKS

The audit is in processing phase 2: INVALIDATED_DEDUP_LINKS
The DB2 command has started
The DB2 command succeeded
The audit scan found nothing wrong in phase 2: INVALIDATED_DEDUP_LINKS

The audit is in processing phase 3: DAMAGED_DATA
The DB2 command has started
The DB2 command succeeded
The audit scan found nothing wrong in phase 3: DAMAGED_DATA

No problems were detected by the dedup audit tool for FILEPOOL!
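
When console output from several audit runs has been captured, the per-phase result lines can be summarized automatically. The Python sketch below relies on the wording of the "The audit scan ..." lines exactly as they appear in the examples above; other script versions may word them differently, and the function name is hypothetical.

```python
import re

# Matches the two result line forms shown in the example console output.
PHASE_RESULT = re.compile(
    r"The audit scan (found nothing wrong|has detected a potential problem)"
    r" in phase (\d+): (\w+)"
)


def phase_results(console_text):
    """Map each audit phase name to True (problem detected) or False (clean)."""
    results = {}
    for outcome, _number, name in PHASE_RESULT.findall(console_text):
        results[name] = outcome.startswith("has detected")
    return results


sample = """\
The audit scan found nothing wrong in phase 1: ZERO_LENGTH_BASE_CHUNKS
The audit scan has detected a potential problem in phase 2: INVALIDATED_DEDUP_LINKS!
The audit scan found nothing wrong in phase 3: DAMAGED_DATA
"""
print(phase_results(sample))
```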

Attached Scripts
dedupAuditTool.pl – Script that requires Perl version 5.10 or later
dedupAuditTool_v580.pl – Script that requires Perl version 5.8.0 – 5.10

NOTE: It is recommended that dedupAuditTool.pl be used if a Perl 5.10 or later interpreter is available for the given operating environment.

written by Bosse