Commvault Backup System Automated Maintenance – Setup & Best Practices

What Is Commvault Automated Backup Maintenance?

Automated maintenance in Commvault helps administrators keep backup environments healthy without manual intervention.

By scheduling automated tasks, organizations can:

- Clean expired backup jobs
- Optimize database performance
- Verify backup integrity
- Manage storage efficiently
- Improve overall backup reliability

Automation is especially important in large environments where hundreds of backup jobs run daily.

This article shares the development process of a Python-based inspection tool using the Commvault REST API. It supports one-click generation of HTML inspection reports with visual charts, suitable for various online and offline deployment scenarios.

If you work in data protection, you have probably run into these frustrations:

- With dozens or even hundreds of clients in the backup system, manual daily status checks are time-consuming and exhausting.

- Some clients stay offline for long periods without anyone noticing in time.

- Job failure reasons are scattered across different places, with no unified analysis.

- RPO (Recovery Point Objective) compliance is difficult to assess quantitatively.

- Producing an inspection report for management review means organizing massive amounts of data by hand.

The rest of this article walks through how I built an automated inspection tool for Commvault backup systems to address these problems.

Requirements Analysis

Core Inspection Metrics

Metric	Details	Reference threshold
Job health	Success rate, number of failed jobs	Success rate > 95% (common industry target)
Client status	Online/offline count	Offline rate < 10%
Protection status	Unprotected clients	Should be 0, or exceptions explicitly documented
RPO compliance	Time since last backup	Set according to business requirements
Failure analysis	Specific reason for each job failure	Used for root cause analysis
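The two numeric thresholds in the table can be checked mechanically once the counts are collected. The following is a minimal sketch, not the tool's actual code: the function name `evaluate_thresholds` and its parameters are hypothetical, and the default limits are simply the reference values from the table.

```python
def evaluate_thresholds(total_jobs, failed_jobs, total_clients, offline_clients,
                        min_success_rate=95.0, max_offline_rate=10.0):
    """Compare job and client counts with the reference thresholds.

    Hypothetical helper: names and defaults are assumptions for illustration.
    Rates are percentages; empty inputs yield 0% to avoid division by zero.
    """
    success_rate = 100.0 * (total_jobs - failed_jobs) / total_jobs if total_jobs else 0.0
    offline_rate = 100.0 * offline_clients / total_clients if total_clients else 0.0
    return {
        'success_rate': round(success_rate, 1),
        'offline_rate': round(offline_rate, 1),
        'success_ok': success_rate > min_success_rate,
        'offline_ok': offline_rate < max_offline_rate,
    }


result = evaluate_thresholds(total_jobs=200, failed_jobs=6,
                             total_clients=50, offline_clients=2)
print(result)
```

Keeping the limits as parameters lets each environment tighten or relax them, in line with the "set according to business requirements" entry for RPO.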

Deployment Scenario Versatility

- Online environments — Third-party dependencies can be installed.
- Offline environments — Only the Python standard library is available, with no network and no pip.
- MCP Integration — Used as an MCP tool for Claude Code.

Technical Selection

API Call Strategy

Commvault provides a RESTful API. The tool uses it to retrieve the job and client data that the checks below are built on.

Dependency Strategy

Three versions were designed for different scenarios:

Version	Dependencies	Applicable scenarios
`health_check_html.py`	None (standard library only)	Offline and production environments
`health_check_portable.py`	`requests`	Development and networked environments
`health_check_pro.py`	Commvault MCP Server	Claude Code, Openclaw integration
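For comparison, a single script could also cover the first two rows by detecting `requests` at import time and falling back to `urllib`. This is only a sketch of that alternative, not how the three versions above are organized; `fetch_json` and `HAVE_REQUESTS` are hypothetical names.

```python
import json
import urllib.request

try:
    import requests  # optional third-party dependency
    HAVE_REQUESTS = True
except ImportError:
    requests = None
    HAVE_REQUESTS = False


def fetch_json(url, headers=None, timeout=30):
    """Fetch a URL and decode its JSON body, using requests when installed."""
    headers = headers or {}
    if HAVE_REQUESTS:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.json()
    # Standard-library fallback for hosts without pip or network access
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

Separate scripts, as in the table, keep the offline version easier to audit; the fallback approach trades that for having one file to maintain.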

No-Dependency Implementation Tips

Using Python's standard library `urllib` instead of `requests`:

 
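import json
import ssl
import urllib.parse
import urllib.request


class CommvaultClient:
    def __init__(self, base_url, access_token, verify_ssl=True):
        self.base_url = base_url.rstrip('/')
        self.access_token = access_token
        # Skip verification only for trusted hosts with self-signed certificates
        self.context = (ssl.create_default_context() if verify_ssl
                        else ssl._create_unverified_context())

    def get(self, endpoint, params=None):
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        if params:
            url += "?" + urllib.parse.urlencode(params)

        request = urllib.request.Request(url)
        request.add_header('Accept', 'application/json')
        request.add_header('Authtoken', self.access_token)

        with urllib.request.urlopen(request, context=self.context) as response:
            return json.loads(response.read().decode('utf-8'))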

Core Function Implementation

Client Status Detection

Problem: The Commvault API job list does not directly return the online/offline status of a client.

Solution: Use backup activity as a proxy metric.

 
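def get_client_status(jobs, clients, lookback_days=7):
    """Split clients into (online, offline) using backup activity as a proxy.

    `jobs` is expected to already be limited to the last `lookback_days` days;
    the parameter documents the window and is not used for filtering here.
    """
    # Clients with at least one job in the window are considered online
    clients_with_activity = set()
    for job in jobs:
        client_name = job.get('jobSummary', {}).get('subclient', {}).get('clientName')
        if client_name:
            clients_with_activity.add(client_name)

    # Classify every known client
    online = [c for c in clients if c in clients_with_activity]
    offline = [c for c in clients if c not in clients_with_activity]

    return online, offline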

Note: A disclaimer is added to the report: "* Online status is determined based on backup activity within the last X days."
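The disclaimer only holds if the job list really covers the stated window. Below is a minimal sketch of trimming the job list to that window before classifying clients. It assumes each job's `jobSummary` carries a `jobStartTime` Unix timestamp in seconds, which is an assumption about the response shape; `filter_recent_jobs` is a hypothetical helper.

```python
import time


def filter_recent_jobs(jobs, lookback_days=7, now=None):
    """Keep only jobs that started within the last `lookback_days` days.

    Assumption: each job has jobSummary.jobStartTime as a Unix timestamp
    in seconds. Jobs without that field are dropped.
    """
    now = time.time() if now is None else now
    cutoff = now - lookback_days * 86400
    recent = []
    for job in jobs:
        started = job.get('jobSummary', {}).get('jobStartTime')
        if started is not None and started >= cutoff:
            recent.append(job)
    return recent


# Example with a fixed "now" so the result is reproducible
now = 1_700_000_000
jobs = [
    {'jobSummary': {'jobStartTime': now - 2 * 86400}},   # 2 days old: kept
    {'jobSummary': {'jobStartTime': now - 10 * 86400}},  # 10 days old: dropped
    {'jobSummary': {}},                                   # no start time: dropped
]
recent = filter_recent_jobs(jobs, lookback_days=7, now=now)
print(len(recent))
```

Passing the same `lookback_days` value to the filter and to the report disclaimer keeps the "last X days" wording consistent with the data.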

RPO Compliance Analysis

from datetime import datetime


def analyze_rpo(jobs, clients, threshold_hours=24):
    last_backup = {}  # {client_name: timestamp}

    # Find the last successful backup time for each client
    for job in jobs:
        if job['status'] == 'Completed':
            client = job['client']
            last_backup[client] = max(last_backup.get(client, 0), job['time'])

    # Check for violations
    now = datetime.now().timestamp()
    violations = []
    for client in clients:
        if client not in last_backup:
            violations.append({'client': client, 'hours': None, 'status': 'Never'})
        else:
            hours = (now - last_backup[client]) / 3600
            if hours > threshold_hours:
                violations.append({'client': client, 'hours': hours})

    return violations
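For a report, the raw violation list is usually condensed into a few headline figures. The sketch below assumes the list has the shape built by `analyze_rpo` (dicts with `client` and `hours`, where `hours` is `None` for clients that have never had a successful backup); `summarize_rpo` is a hypothetical helper, not part of the original tool.

```python
def summarize_rpo(violations, total_clients):
    """Condense RPO violations into a compliance rate and key offenders."""
    never = [v['client'] for v in violations if v['hours'] is None]
    overdue = [v for v in violations if v['hours'] is not None]
    compliant = total_clients - len(violations)
    rate = 100.0 * compliant / total_clients if total_clients else 0.0
    # Client with the longest gap since its last successful backup, if any
    worst = max(overdue, key=lambda v: v['hours'], default=None)
    return {
        'compliance_rate': round(rate, 1),
        'never_backed_up': never,
        'worst_client': worst['client'] if worst else None,
    }


sample = [
    {'client': 'db01', 'hours': 30.5},
    {'client': 'web02', 'hours': 72.0},
    {'client': 'old03', 'hours': None, 'status': 'Never'},
]
summary = summarize_rpo(sample, total_clients=10)
print(summary)
```

Here 3 of 10 clients are in violation, so the compliance rate is 70%; clients that were never backed up are listed separately because they need a different follow-up than clients that are merely overdue.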

Failure Reason Extraction

Failure information returned by the API is located in the `pendingReason` field:

 
failed_jobs = []
for job in jobs:
    if job['status'] == 'Failed':
        # pendingReason contains detailed error messages, which may include HTML tags
        error = job.get('pendingReason', 'Unknown')
        # Flatten HTML line breaks and newlines into a single line
        error = error.replace('<br>', ' | ').replace('\n', ' | ')
        failed_jobs.append({
            'job_id': job['jobId'],
            'client': job['client'],
            'error': error[:500]  # Limit length
        })
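Because the failure text may contain HTML, a slightly more thorough cleanup can help before the text goes into a report. The following is a sketch using only the standard library; `clean_reason` is a hypothetical helper, and the regular-expression approach is adequate for simple markup but is not a full HTML parser.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')


def clean_reason(raw, max_length=500):
    """Convert an HTML-flavored failure reason into one plain-text line."""
    text = re.sub(r'(?i)<br\s*/?>', ' | ', raw)   # line-break tags become separators
    text = TAG_RE.sub('', text)                    # drop any remaining tags
    text = html.unescape(text)                     # decode entities such as &amp;
    text = re.sub(r'\s*\n\s*', ' | ', text)        # newlines become separators
    text = re.sub(r'[ \t]+', ' ', text).strip()    # collapse repeated spaces
    return text[:max_length]


print(clean_reason('Client <b>srv01</b> unreachable<br/>Check network &amp; services\nRetry later'))
```

Tags are stripped before entities are decoded so that an escaped `&lt;` in the message is not mistaken for markup.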

Error Pattern Recognition

By analyzing error messages, common issues can be automatically identified:

 
def analyze_error_patterns(failed_jobs):
    # Lowercase once so that keyword matching is case-insensitive
    errors = [j['error'].lower() for j in failed_jobs]

    offline_errors = sum(1 for e in errors
                         if any(k in e for k in
                                ['unreachable', 'offline', 'cannot connect']))
    timeout_errors = sum(1 for e in errors
                         if 'timeout' in e)
    storage_errors = sum(1 for e in errors
                         if any(k in e for k in
                                ['storage', 'media agent', 'library']))

    return {
        'offline': offline_errors,
        'timeout': timeout_errors,
        'storage': storage_errors
    }
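The category counts can then be turned into report recommendations. The sketch below is illustrative only: `RECOMMENDATIONS` and `recommend` are hypothetical names, and the advice strings for the timeout and storage categories are assumptions rather than text from the original tool.

```python
# Hypothetical mapping from error category to suggested action
RECOMMENDATIONS = {
    'offline': 'Check client power status and network connectivity.',
    'timeout': 'Review network latency and job timeout settings.',
    'storage': 'Check storage capacity and the state of the media agent or library.',
}


def recommend(pattern_counts, min_count=1):
    """Return advice lines for categories seen at least min_count times,
    most frequent category first."""
    ranked = sorted(pattern_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{name} ({count}): {RECOMMENDATIONS[name]}"
            for name, count in ranked
            if count >= min_count and name in RECOMMENDATIONS]


counts = {'offline': 12, 'timeout': 0, 'storage': 3}
lines = recommend(counts)
for line in lines:
    print(line)
```

Sorting by count puts the dominant failure cause at the top of the report, which is usually where root cause analysis should start.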

Practical Application Case

Typical Inspection Results

Below is an actual inspection result:

 
=== Health Check Summary ===
  Success Rate: 8.7%
  Total Clients: 63
  Offline Clients: 57
  Unprotected Clients: 58
  RPO Violations: 62

Failure Reason Analysis:

- Most failures are due to clients being continuously offline for more than 10,080 minutes (7 days).
- Commvault automatically terminates backup jobs for clients that have been offline for an extended period.

Recommended Actions:

- Check client power status.
- Confirm network connectivity.
- Clean up unnecessary zombie clients.
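Summary figures like these can be rendered into a standalone HTML page without third-party packages. The following is a minimal sketch under that constraint, not the tool's actual report code: `render_report` is a hypothetical function, and the "chart" is just a CSS bar whose width reflects a percentage.

```python
import html


def render_report(title, metrics):
    """Render metrics as a minimal standalone HTML page.

    `metrics` maps a label to (value, percent). A percent between 0 and 100
    sets the width of a CSS bar; pass None to show the value only.
    """
    rows = []
    for label, (value, percent) in metrics.items():
        bar = ''
        if percent is not None:
            width = max(0.0, min(100.0, float(percent)))
            bar = (f'<div style="background:#ddd;width:200px">'
                   f'<div style="background:#c0392b;height:12px;width:{width:.1f}%"></div></div>')
        rows.append(f'<tr><td>{html.escape(label)}</td>'
                    f'<td>{html.escape(str(value))}</td><td>{bar}</td></tr>')
    return ('<!DOCTYPE html><html><head><meta charset="utf-8">'
            f'<title>{html.escape(title)}</title></head><body>'
            f'<h1>{html.escape(title)}</h1><table>{"".join(rows)}</table>'
            '</body></html>')


page = render_report('Health Check Summary', {
    'Success Rate': ('8.7%', 8.7),
    'Offline Clients': ('57 / 63', 100 * 57 / 63),
    'RPO Violations': (62, None),
})
print(len(page))
```

Inline styles keep the page self-contained, so the report can be opened on a machine with no network access.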

Summary

Through the development of this tool, we have achieved:

✅ One-click generation of visual inspection reports

✅ Support for offline environment deployment (no third-party dependencies)

✅ Intelligent error analysis and recommendations

✅ Multiple configuration methods to flexibly adapt to different scenarios

✅ Out-of-the-box Python scripts

Target Audience:

- Commvault backup administrators

- Data protection engineers

- IT operations personnel who need to regularly report backup status
