Commvault Backup System Automated Maintenance – Setup & Best Practices

What Is Commvault Automated Backup Maintenance?

Automated maintenance in Commvault helps administrators keep backup environments healthy without manual intervention.

By scheduling automated tasks, organizations can:

  • Clean expired backup jobs

  • Optimize database performance

  • Verify backup integrity

  • Manage storage efficiently

  • Improve overall backup reliability

Automation is especially important in large environments where hundreds of backup jobs run daily.

This article shares the development process of a Python-based inspection tool using the Commvault REST API. It supports one-click generation of HTML inspection reports with visual charts, suitable for various online and offline deployment scenarios.


As a practitioner in the data protection field, have you ever faced these frustrations:


- Having dozens or even hundreds of clients in your backup system makes manual daily status checks time-consuming and exhausting.

- Certain clients are offline for long periods without being noticed in time.

- Job failure reasons are scattered across different places, lacking a unified analysis.

- RPO (Recovery Point Objective) compliance is difficult to assess quantitatively.

- Producing an inspection report for management review means collating large amounts of data by hand.


This article will share my entire process of developing an automated inspection tool for Commvault backup systems.


Requirements Analysis


Core Inspection Metrics

| Type              | Details                             | Threshold reference                    |
|-------------------|-------------------------------------|----------------------------------------|
| Job health        | Success rate, number of failed jobs | Industry standard: > 95%               |
| Client status     | Online/offline count                | Offline rate < 10%                     |
| Protection status | Unprotected clients                 | Should be 0 or explicitly stated       |
| RPO compliance    | Time since last backup              | Set according to business requirements |
| Failure analysis  | Specific reason for job failure     | Used for root cause analysis           |
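
The reference thresholds above can be checked mechanically once the raw counts are known. The following is a minimal sketch, not the tool's actual code: the function name `evaluate_metrics` and its parameters are hypothetical, and the default limits simply mirror the 95% and 10% figures quoted in the table.

```python
def evaluate_metrics(total_jobs, failed_jobs, total_clients, offline_clients,
                     unprotected_clients, min_success_rate=95.0,
                     max_offline_rate=10.0):
    """Compare raw counts with the reference thresholds (hypothetical helper)."""
    # Guard against division by zero when no jobs or clients were returned.
    success_rate = 100.0 * (total_jobs - failed_jobs) / total_jobs if total_jobs else 0.0
    offline_rate = 100.0 * offline_clients / total_clients if total_clients else 0.0
    return {
        'success_rate': round(success_rate, 1),
        'offline_rate': round(offline_rate, 1),
        'job_health_ok': success_rate > min_success_rate,
        'client_status_ok': offline_rate < max_offline_rate,
        'protection_ok': unprotected_clients == 0,
    }


# Illustrative numbers only, not taken from a real environment.
result = evaluate_metrics(total_jobs=200, failed_jobs=6, total_clients=50,
                          offline_clients=2, unprotected_clients=0)
print(result)
```

RPO compliance is left out here because its threshold depends on business requirements rather than on a fixed reference value.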

Deployment Scenario Versatility

  • Online environments: Third-party dependencies can be installed.
  • Offline environments: Only the Python standard library is available, with no network and no pip.
  • MCP integration: Used as an MCP tool for Claude Code.


Technical Selection

API Call Strategy

Commvault provides a RESTful API. The tool reads the job list and the client list from it and derives the inspection metrics from those responses.

Dependency Strategy

Three versions were designed for different scenarios:

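| Version                  | Dependencies                 | Applicable scenarios                           |
|--------------------------|------------------------------|------------------------------------------------|
| health_check_html.py     | None (standard library only) | Offline and production environments            |
| health_check_portable.py | requests                     | Development environment, networked environment |
| health_check_pro.py      | Commvault MCP Server         | Claude Code, Openclaw integration              |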
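
As an alternative to maintaining separate scripts, a single script can detect at import time whether `requests` is installed and fall back to the standard library otherwise. This is only a sketch of that general pattern, not how the three versions above are built; the names `HAVE_REQUESTS` and `fetch_json` are hypothetical.

```python
import json
import urllib.request

try:
    import requests  # optional third-party dependency
    HAVE_REQUESTS = True
except ImportError:
    requests = None
    HAVE_REQUESTS = False


def fetch_json(url, headers=None, timeout=30):
    """GET a URL and decode the JSON body with whichever HTTP stack is available."""
    headers = headers or {}
    if HAVE_REQUESTS:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.json()
    # Standard-library path for hosts without pip or network access to PyPI.
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

The trade-off is one code path that is exercised less often than the other, so both branches need testing before the script is shipped to an offline site.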

No-Dependency Implementation Tips

Using Python's standard library `urllib` instead of `requests`:

 
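import json
import ssl
import urllib.parse
import urllib.request


class CommvaultClient:
    def __init__(self, base_url, access_token):
        self.base_url = base_url.rstrip('/')
        self.access_token = access_token

    def get(self, endpoint, params=None):
        """Send a GET request and return the decoded JSON response."""
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        if params:
            url += "?" + urllib.parse.urlencode(params)

        request = urllib.request.Request(url)
        request.add_header('Accept', 'application/json')
        request.add_header('Authtoken', self.access_token)

        # Handle self-signed certificates. This disables certificate
        # verification, so use it only on trusted networks.
        context = ssl._create_unverified_context()
        with urllib.request.urlopen(request, context=context) as response:
            return json.loads(response.read().decode('utf-8'))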
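
The client above expects `self.access_token` to be set already. The sketch below shows one way the token might be obtained with the standard library. It rests on assumptions that should be checked against the REST API documentation for your Commvault version: that login is a POST to a `Login` endpoint, that the password is sent Base64-encoded in a JSON body, and that the response carries the token in a `token` field. The helper names are hypothetical.

```python
import base64
import json
import urllib.request


def build_login_request(base_url, username, password):
    """Build (but do not send) a login request. Endpoint and payload shape are assumptions."""
    payload = {
        'username': username,
        # Assumed: the API expects the password Base64-encoded.
        'password': base64.b64encode(password.encode('utf-8')).decode('ascii'),
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/Login",
        data=json.dumps(payload).encode('utf-8'),
        headers={'Accept': 'application/json', 'Content-Type': 'application/json'},
        method='POST',
    )


def extract_token(login_response):
    """Pull the token out of a decoded login response (assumed field name: 'token')."""
    token = login_response.get('token')
    if not token:
        raise ValueError('Login response did not contain a token')
    return token


# Placeholder host and credentials for illustration only.
request = build_login_request('https://cv.example.com/webconsole/api/', 'admin', 'secret')
print(request.get_method(), request.full_url)
```

Sending the request works the same way as in `get`: pass it to `urllib.request.urlopen`, decode the JSON body, and hand the result to `extract_token`.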


Core Function Implementation

Client Status Detection

Problem: The Commvault API job list does not directly return the online/offline status of a client.

Solution: Use backup activity as a proxy metric.

 
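def get_client_status(jobs, clients, lookback_days=7):
    """Split clients into online/offline using backup activity as a proxy.

    `jobs` is expected to contain only jobs from the last `lookback_days`
    days; the function itself does not filter by date.
    """
    # Get clients with backup activity (considered online)
    clients_with_activity = set()
    for job in jobs:
        client_name = job.get('jobSummary', {}).get('subclient', {}).get('clientName')
        if client_name:
            clients_with_activity.add(client_name)

    # Clients without any job in the window are treated as offline
    online = [c for c in clients if c in clients_with_activity]
    offline = [c for c in clients if c not in clients_with_activity]

    return online, offline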


Note: A disclaimer is added to the report: "* Online status is determined based on backup activity within the last X days."
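
Because `get_client_status` does not apply `lookback_days` itself, the job list has to be limited to that window before it is passed in. A minimal sketch of such a filter follows. It assumes each job carries a Unix timestamp in `jobSummary.jobStartTime`; that field name is an assumption about the API response, and `filter_recent_jobs` is a hypothetical helper.

```python
import time


def filter_recent_jobs(jobs, lookback_days=7, now=None):
    """Keep only jobs that started within the last `lookback_days` days."""
    now = time.time() if now is None else now
    cutoff = now - lookback_days * 86400
    recent = []
    for job in jobs:
        # Assumed field: Unix timestamp of the job start.
        started = job.get('jobSummary', {}).get('jobStartTime')
        if started is not None and started >= cutoff:
            recent.append(job)
    return recent


# Sample data for illustration; a fixed "now" keeps the example deterministic.
now = 1_700_000_000
jobs = [
    {'jobSummary': {'jobStartTime': now - 2 * 86400,
                    'subclient': {'clientName': 'db01'}}},
    {'jobSummary': {'jobStartTime': now - 30 * 86400,
                    'subclient': {'clientName': 'web01'}}},
]
recent = filter_recent_jobs(jobs, lookback_days=7, now=now)
print([j['jobSummary']['subclient']['clientName'] for j in recent])
```

With this filter in front, a client whose last job ran 30 days ago is reported as offline for a 7-day window, which matches the disclaimer printed in the report.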


RPO Compliance Analysis


from datetime import datetime


def analyze_rpo(jobs, clients, threshold_hours=24):
    last_backup = {}  # {client_name: timestamp}

    # Find the last successful backup time for each client.
    for job in jobs:
        if job['status'] == 'Completed':
            client = job['client']
            last_backup[client] = max(last_backup.get(client, 0), job['time'])

    # Check for violations
    now = datetime.now().timestamp()
    violations = []
    for client in clients:
        if client not in last_backup:
            # No successful backup found at all
            violations.append({'client': client, 'hours': None, 'status': 'Never'})
        else:
            hours = (now - last_backup[client]) / 3600
            if hours > threshold_hours:
                violations.append({'client': client, 'hours': hours})

    return violations
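
Since the RPO threshold is meant to follow business requirements, a single `threshold_hours` for every client may be too coarse. The sketch below is a hypothetical variation, not part of the original tool: it takes a default limit plus per-client overrides, and it accepts `now` as an argument so the result is reproducible.

```python
def find_rpo_violations(last_backup, clients, now, default_hours=24, overrides=None):
    """Flag clients whose last successful backup is older than their own limit."""
    overrides = overrides or {}
    violations = []
    for client in clients:
        limit = overrides.get(client, default_hours)
        if client not in last_backup:
            # Never backed up successfully: always a violation.
            violations.append({'client': client, 'hours': None, 'limit': limit})
            continue
        hours = (now - last_backup[client]) / 3600
        if hours > limit:
            violations.append({'client': client, 'hours': round(hours, 1), 'limit': limit})
    return violations


# Illustrative data: both clients were backed up 6 hours ago,
# but db01 has a stricter 4-hour objective.
now = 1_700_000_000
last_backup = {'db01': now - 6 * 3600, 'web01': now - 6 * 3600}
violations = find_rpo_violations(last_backup, ['db01', 'web01', 'old01'], now,
                                 overrides={'db01': 4})
print(violations)
```

Here `last_backup` has the same shape as the dictionary built in the first half of `analyze_rpo`.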

Failure Reason Extraction


Failure information returned by the API is located in the `pendingReason` field:

 
failed_jobs = []
for job in jobs:
    if job['status'] == 'Failed':
        # pendingReason contains detailed error messages, which may include HTML tags
        error = job.get('pendingReason', 'Unknown')

        # Replace HTML line breaks with a visible separator
        error = error.replace('<br>', ' | ').replace('<br/>', ' | ')
        failed_jobs.append({
            'job_id': job['jobId'],
            'client': job['client'],
            'error': error[:500]  # Limit length
        })
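
If `pendingReason` carries other markup besides line breaks, a slightly more general cleanup may be useful. The following is a sketch using only the standard library; `clean_reason` is a hypothetical helper, and the regular expressions are a pragmatic approximation rather than a full HTML parser.

```python
import html
import re

_BREAK = re.compile(r'<br\s*/?>', re.IGNORECASE)  # <br>, <br/>, <BR />
_TAG = re.compile(r'<[^>]+>')                      # any remaining tag


def clean_reason(text, max_length=500):
    """Turn an HTML-flavoured error message into a single plain-text line."""
    text = _BREAK.sub(' | ', text or 'Unknown')
    text = _TAG.sub('', text)
    text = html.unescape(text)                 # &amp; -> &, &lt; -> <, ...
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace runs
    return text[:max_length]


# Made-up message for illustration.
sample = 'Client <b>db01</b> is unreachable.<br/>Network &amp; firewall check required'
print(clean_reason(sample))
```

Entities are decoded after the tags are stripped so that an escaped `&lt;` in the message text is not mistaken for markup.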


Error Pattern Recognition

By analyzing error messages, common issues can be automatically identified:

 
def analyze_error_patterns(failed_jobs):
    # Lower-case once so every keyword check is case-insensitive
    errors = [j['error'].lower() for j in failed_jobs]

    offline_errors = sum(1 for e in errors
                         if any(k in e for k in
                                ['unreachable', 'offline', 'cannot connect']))
    timeout_errors = sum(1 for e in errors
                         if 'timeout' in e)
    storage_errors = sum(1 for e in errors
                         if any(k in e for k in
                                ['storage', 'media agent', 'library']))

    return {
        'offline': offline_errors,
        'timeout': timeout_errors,
        'storage': storage_errors
    }
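
In `analyze_error_patterns`, one message can count toward several categories, and messages that match nothing are not counted at all. Where a breakdown that sums to the number of failed jobs is preferred, each job can be assigned to exactly one category instead. The sketch below is a hypothetical variation that reuses the same keywords; the category order decides ties, and the `other` bucket is an addition.

```python
from collections import Counter

# First matching category wins; keywords are the ones used above.
CATEGORIES = [
    ('offline', ('unreachable', 'offline', 'cannot connect')),
    ('timeout', ('timeout',)),
    ('storage', ('storage', 'media agent', 'library')),
]


def classify_error(message):
    """Return the first category whose keyword appears in the message."""
    text = message.lower()
    for name, keywords in CATEGORIES:
        if any(keyword in text for keyword in keywords):
            return name
    return 'other'


def count_categories(failed_jobs):
    """Count failed jobs per category; totals add up to len(failed_jobs)."""
    return Counter(classify_error(job['error']) for job in failed_jobs)


# Made-up messages for illustration.
failed = [
    {'error': 'Client is OFFLINE'},
    {'error': 'Connection timeout to media agent'},
    {'error': 'License expired'},
]
counts = count_categories(failed)
print(dict(counts))
```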


Practical Application Case


Typical Inspection Results

Below is an actual inspection result:

 
=== Health Check Summary ===
  Success Rate: 8.7%
  Total Clients: 63
  Offline Clients: 57
  Unprotected Clients: 58
  RPO Violations: 62
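
A summary like this can be turned into a small HTML page with bar indicators using only the standard library. This is a minimal sketch of the idea, not the tool's actual report template; `render_summary_html` and the markup are hypothetical, and the bars are plain styled `div` elements rather than a charting library.

```python
import html

ROW = ('<tr><td>{label}</td><td>{value} / {total}</td>'
       '<td><div style="background:#c0392b;height:12px;width:{percent:.0f}%">'
       '</div></td></tr>')
PAGE = ('<!DOCTYPE html><html><head><meta charset="utf-8">'
        '<title>{title}</title></head><body><h1>{title}</h1>'
        '<table style="width:480px">{rows}</table></body></html>')


def render_summary_html(title, metrics):
    """Render {label: (value, total)} pairs as a table with proportional bars."""
    rows = []
    for label, (value, total) in metrics.items():
        percent = 100.0 * value / total if total else 0.0
        rows.append(ROW.format(label=html.escape(label), value=value,
                               total=total, percent=percent))
    return PAGE.format(title=html.escape(title), rows=''.join(rows))


# Counts taken from the summary above (63 clients in total).
report = render_summary_html('Health Check Summary', {
    'Offline Clients': (57, 63),
    'Unprotected Clients': (58, 63),
    'RPO Violations': (62, 63),
})
print(len(report), 'characters of HTML')
```

Writing `report` to a file with `open('report.html', 'w', encoding='utf-8')` gives a page that opens in any browser without network access.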


Failure Reason Analysis:

  • Most failures are due to clients being continuously offline for more than 10,080 minutes (7 days).
  • Commvault automatically terminates backup jobs for clients that have been offline for an extended period.

Recommended Actions:

  • Check client power status.
  • Confirm network connectivity.
  • Clean up unnecessary zombie clients.

Summary

Through the development of this tool, we have achieved:

✅ One-click generation of visual inspection reports
✅ Support for offline environment deployment (no third-party dependencies)
✅ Intelligent error analysis and recommendations
✅ Multiple configuration methods to flexibly adapt to different scenarios
✅ Out-of-the-box Python scripts

Target Audience:

  • Commvault backup administrators
  • Data protection engineers
  • IT operations personnel who need to regularly report backup status
