
Commvault Backup System Automated Maintenance – Setup & Best Practices


What Is Commvault Automated Backup Maintenance?

Automated maintenance in Commvault helps administrators keep backup environments healthy without manual intervention.

By scheduling automated tasks, organizations can:

  • Clean expired backup jobs

  • Optimize database performance

  • Verify backup integrity

  • Manage storage efficiently

  • Improve overall backup reliability

Automation is especially important in large environments where hundreds of backup jobs run daily.

This article shares the development process of a Python-based inspection tool using the Commvault REST API. It supports one-click generation of HTML inspection reports with visual charts, suitable for various online and offline deployment scenarios.


As a practitioner in the data protection field, have you ever run into these frustrations?


- With dozens or even hundreds of clients in the backup system, manual daily status checks are time-consuming and exhausting.

- Some clients stay offline for long periods without anyone noticing in time.

- Job failure reasons are scattered across different places, with no unified analysis.

- RPO (Recovery Point Objective) compliance is hard to assess quantitatively.

- Producing an inspection report for management review means collating massive amounts of data by hand.


This article walks through how I developed an automated inspection tool for Commvault backup systems.


Requirements Analysis


Core Inspection Metrics

| Type | Details | Threshold reference |
| --- | --- | --- |
| Job health | Success rate, number of failed jobs | Industry standard: > 95% |
| Client status | Online/offline count | Offline rate < 10% |
| Protection status | Unprotected clients | Should be 0 or explicitly stated |
| RPO compliance | Time since last backup | Set according to business requirements |
| Failure analysis | Specific reason for job failure | Used for root cause analysis |
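
The reference thresholds above can be turned into a simple pass/fail check. The following is a minimal sketch rather than the tool's actual code: the function name `evaluate_metrics` and the result keys are hypothetical, and the 95% and 10% limits are the ones listed in the table.

```python
def evaluate_metrics(total_jobs, failed_jobs, total_clients, offline_clients,
                     unprotected_clients, min_success_rate=95.0,
                     max_offline_rate=10.0):
    """Compare raw counts with the reference thresholds from the table."""
    # Guard against division by zero when no jobs ran or no clients exist.
    success_rate = 100.0 * (total_jobs - failed_jobs) / total_jobs if total_jobs else 0.0
    offline_rate = 100.0 * offline_clients / total_clients if total_clients else 0.0
    return {
        'success_rate': round(success_rate, 1),
        'offline_rate': round(offline_rate, 1),
        'job_health_ok': success_rate > min_success_rate,
        'client_status_ok': offline_rate < max_offline_rate,
        'protection_ok': unprotected_clients == 0,
    }


# Made-up counts, used only to show the shape of the result.
result = evaluate_metrics(total_jobs=200, failed_jobs=4, total_clients=50,
                          offline_clients=2, unprotected_clients=0)
print(result)
```

The RPO threshold is left out on purpose, since the table says it depends on business requirements.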

Deployment Scenario Versatility

  • Online environments: third-party dependencies can be installed.
  • Offline environments: only the Python standard library is available, with no network and no pip.
  • MCP integration: used as an MCP tool for Claude Code.


Technical Selection

API Call Strategy

Commvault provides a RESTful API; the tool uses it to retrieve the job and client data behind each metric.



Dependency Strategy

Three versions were designed for different scenarios:

| Version | Dependencies | Applicable scenarios |
| --- | --- | --- |
| health_check_html.py | None (standard library only) | Offline and production environments |
| health_check_portable.py | requests | Development and networked environments |
| health_check_pro.py | Commvault MCP Server | Claude Code and Openclaw integration |

No-Dependency Implementation Tips

Using Python's standard library `urllib` instead of `requests`:

 
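import json
import ssl
import urllib.parse
import urllib.request


class CommvaultClient:
    def __init__(self, base_url, access_token):
        self.base_url = base_url.rstrip('/')
        self.access_token = access_token  # value sent in the Authtoken header

    def get(self, endpoint, params=None):
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        if params:
            url += "?" + urllib.parse.urlencode(params)

        request = urllib.request.Request(url)
        request.add_header('Accept', 'application/json')
        request.add_header('Authtoken', self.access_token)

        # Handle self-signed certificates (this disables certificate verification)
        context = ssl._create_unverified_context()
        with urllib.request.urlopen(request, context=context) as response:
            return json.loads(response.read().decode('utf-8'))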
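
Disabling verification is convenient in a lab, but `ssl._create_unverified_context()` is a private helper and switches off all server certificate checks. Where the server's self-signed or internal CA certificate can be exported as a PEM file, a sketch like the following keeps verification on. The function name `build_ssl_context` and the `ca_file` argument are hypothetical.

```python
import ssl


def build_ssl_context(ca_file=None):
    """Return a verifying TLS context, optionally trusting one extra CA file."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    if ca_file:
        # Additionally trust the exported self-signed / internal CA certificate.
        context.load_verify_locations(cafile=ca_file)
    return context


context = build_ssl_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

The returned object can be passed as the `context` argument of `urllib.request.urlopen`.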


Core Function Implementation

Client Status Detection

Problem: The Commvault API job list does not directly return the online/offline status of a client.

Solution: Use backup activity as a proxy metric.

 
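def get_client_status(jobs, clients, lookback_days=7):
    """Split clients into online/offline, using backup activity as a proxy.

    `jobs` is expected to cover only the last `lookback_days` days;
    the parameter is not used for filtering inside this function.
    """
    # Clients with backup activity in the job list are considered online
    clients_with_activity = set()
    for job in jobs:
        client_name = job.get('jobSummary', {}).get('subclient', {}).get('clientName')
        if client_name:
            clients_with_activity.add(client_name)

    # Everything without activity is reported as offline
    online = [c for c in clients if c in clients_with_activity]
    offline = [c for c in clients if c not in clients_with_activity]

    return online, offline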


Note: A disclaimer is added to the report: "* Online status is determined based on backup activity within the last X days."
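
The proxy only holds if the job list is limited to the lookback window. A minimal sketch of that filtering step follows. It assumes each job carries an epoch-seconds start time under `jobSummary.jobStartTime`, which should be checked against the actual API response, and the helper name `filter_recent_jobs` is hypothetical.

```python
import time


def filter_recent_jobs(jobs, lookback_days=7, now=None):
    """Keep only jobs that started within the last `lookback_days` days."""
    now = time.time() if now is None else now
    cutoff = now - lookback_days * 86400
    recent = []
    for job in jobs:
        # Assumed field (epoch seconds); jobs without it are treated as too old.
        started = job.get('jobSummary', {}).get('jobStartTime', 0)
        if started >= cutoff:
            recent.append(job)
    return recent


# Made-up jobs: one from an hour ago, one from ten days ago.
now = 1_700_000_000
jobs = [
    {'jobSummary': {'jobStartTime': now - 3600,
                    'subclient': {'clientName': 'db01'}}},
    {'jobSummary': {'jobStartTime': now - 10 * 86400,
                    'subclient': {'clientName': 'web01'}}},
]
recent = filter_recent_jobs(jobs, lookback_days=7, now=now)
print([job['jobSummary']['subclient']['clientName'] for job in recent])
```

Passing `now` explicitly keeps the function easy to test; in production it defaults to the current time.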


RPO Compliance Analysis


from datetime import datetime


def analyze_rpo(jobs, clients, threshold_hours=24):
    last_backup = {}  # {client_name: timestamp}

    # Find the last successful backup time for each client.
    for job in jobs:
        if job['status'] == 'Completed':
            client = job['client']
            last_backup[client] = max(last_backup.get(client, 0), job['time'])

    # Check for violations
    now = datetime.now().timestamp()
    violations = []
    for client in clients:
        if client not in last_backup:
            violations.append({'client': client, 'hours': None, 'status': 'Never'})
        else:
            hours = (now - last_backup[client]) / 3600
            if hours > threshold_hours:
                violations.append({'client': client, 'hours': hours})

    return violations
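
For the report, the violation list usually reads best when the worst cases come first. The sketch below is one possible way to do that, using the same dictionary shape that `analyze_rpo` returns; the helper name `format_rpo_rows` and the row wording are hypothetical.

```python
def format_rpo_rows(violations):
    """Sort RPO violations by severity and render one text row per client."""
    def severity(violation):
        # Never-backed-up clients sort first, then the longest gaps.
        hours = violation['hours']
        return (hours is not None, -(hours or 0))

    rows = []
    for violation in sorted(violations, key=severity):
        if violation['hours'] is None:
            rows.append(f"{violation['client']}: never backed up")
        else:
            rows.append(f"{violation['client']}: last backup "
                        f"{violation['hours']:.1f} h ago")
    return rows


# Made-up violations in the shape produced by analyze_rpo.
sample = [
    {'client': 'web01', 'hours': 30.0},
    {'client': 'db01', 'hours': None, 'status': 'Never'},
    {'client': 'app01', 'hours': 72.5},
]
for row in format_rpo_rows(sample):
    print(row)
```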

Failure Reason Extraction


Failure information returned by the API is located in the `pendingReason` field:

 
failed_jobs = []
for job in jobs:
    if job['status'] == 'Failed':
        # pendingReason contains the detailed error message, which may include HTML tags
        error = job.get('pendingReason', 'Unknown')
        # Replace HTML line breaks with a visible separator
        error = error.replace('<br>', ' | ').replace('<br/>', ' | ')
        failed_jobs.append({
            'job_id': job['jobId'],
            'client': job['client'],
            'error': error[:500]  # Limit length
        })
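
Since the reason text may carry other HTML tags and entities as well, a slightly more general cleanup can help. The following is a minimal standard-library sketch, not the tool's own code; the helper name `clean_reason` is hypothetical, and the regular expression is a simple tag stripper rather than a full HTML parser.

```python
import html
import re

_TAG_RE = re.compile(r'<[^>]+>')


def clean_reason(text, max_len=500):
    """Turn an HTML-flavoured failure reason into one plain-text line."""
    # Convert line-break tags and raw newlines into a visible separator.
    text = re.sub(r'(?i)<br\s*/?>|\r?\n', ' | ', text)
    # Drop any remaining tags, then decode entities such as &amp;.
    text = html.unescape(_TAG_RE.sub('', text))
    # Collapse runs of whitespace and cap the length for the report.
    return re.sub(r'\s+', ' ', text).strip()[:max_len]


print(clean_reason('Client <b>db01</b> is offline.<br/>Error &amp; retry'))
```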


Error Pattern Recognition

By analyzing error messages, common issues can be automatically identified:

 
def analyze_error_patterns(failed_jobs):
    # Compare in lowercase so that 'Offline' or 'Media Agent' also match
    offline_errors = sum(1 for j in failed_jobs
                         if any(k in j['error'].lower() for k in
                                ['unreachable', 'offline', 'cannot connect']))
    timeout_errors = sum(1 for j in failed_jobs
                         if 'timeout' in j['error'].lower())
    storage_errors = sum(1 for j in failed_jobs
                         if any(k in j['error'].lower() for k in
                                ['storage', 'media agent', 'library']))

    return {
        'offline': offline_errors,
        'timeout': timeout_errors,
        'storage': storage_errors
    }
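
As the keyword list grows, a table-driven variant can be easier to maintain. The sketch below is an alternative under stated assumptions, not the article's implementation: it assigns each job to the first matching category (so a job is counted once), adds an `other` bucket, and the extra keyword `timed out` is illustrative.

```python
from collections import Counter

# Hypothetical keyword table; extend it with patterns seen in your environment.
PATTERNS = {
    'offline': ('unreachable', 'offline', 'cannot connect'),
    'timeout': ('timeout', 'timed out'),
    'storage': ('storage', 'media agent', 'library'),
}


def classify_error(message):
    """Return the first matching category, or 'other' if none matches."""
    text = message.lower()
    for category, keywords in PATTERNS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return 'other'


# Made-up error messages for illustration.
failed_jobs = [
    {'error': 'Client is Offline for more than 10080 minutes'},
    {'error': 'Connection timed out while contacting Media Agent'},
    {'error': 'Insufficient license'},
]
counts = Counter(classify_error(job['error']) for job in failed_jobs)
print(dict(counts))
```

Because categories are checked in dictionary order, the order of `PATTERNS` decides which label wins when a message matches several of them.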


Practical Application Case


Typical Inspection Results

Below is an actual inspection result:

 
=== Health Check Summary ===
  Success Rate: 8.7%
  Total Clients: 63
  Offline Clients: 57
  Unprotected Clients: 58
  RPO Violations: 62


Failure Reason Analysis:

  • Most failures are due to clients being continuously offline for more than 10,080 minutes (7 days).
  • Commvault automatically terminates backup jobs for clients that have been offline for an extended period.

Recommended Actions:

  • Check client power status.
  • Confirm network connectivity.
  • Clean up unnecessary zombie clients.

Summary

Through the development of this tool, we have achieved:

✅ One-click generation of visual inspection reports
✅ Support for offline environment deployment (no third-party dependencies)
✅ Intelligent error analysis and recommendations
✅ Multiple configuration methods to flexibly adapt to different scenarios
✅ Out-of-the-box Python scripts

Target Audience:

Commvault backup administrators
Data protection engineers
IT operations personnel who need to regularly report backup status


AnyBackup Failed to Back Up UIS Virtual Machine – Fix & Troubleshooting Guide



Problem Description

A user reported that the Aishu AnyBackup system failed when attempting to back up a specific business system. The failure reason was: The virtual machine does not support CBT (Changed Block Tracking).


Symptoms:

AnyBackup shows a backup failure. Reason: The VM does not support CBT.


Versions:

User Business System: openEuler 22.03 (LTS-SP3)

AnyBackup Version: 7.0.18.4.165

HCI Version: H3C UIS V8.0 (R0886P03)

Standard 4-node HCI.


Troubleshooting and Analysis:

1. On-site, the CBT backup button in the VM console was grayed out and could not be selected.

2. Checked the VM settings in the console: the disk format was set to "Intelligent" and no multi-level images were shown, but the VM had existing snapshots.

3. Logged into the backend terminal to check disk information. The backend showed the VM was using only one disk but had snapshots, and it was currently linked to a snapshot file.

[Screenshot: checking why AnyBackup cannot back up the UIS VM]


4. After cloning the original VM, the cloned VM only had a single-level image.
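
Step 3 does not name the command that was used on the backend. On KVM-based platforms, `qemu-img info --backing-chain --output=json <disk>` prints one JSON object per image in the chain, and the sketch below counts them. That UIS disks can be inspected this way is an assumption, and the file paths in the sample are made up.

```python
import json


def chain_depth(qemu_img_json):
    """Count the images listed by `qemu-img info --backing-chain --output=json`."""
    images = json.loads(qemu_img_json)
    return len(images), [image.get('filename') for image in images]


# Illustrative output for a disk that is linked to a backing (snapshot) file.
sample = json.dumps([
    {'filename': '/vms/images/vm01', 'format': 'qcow2',
     'backing-filename': '/vms/images/vm01_base_1'},
    {'filename': '/vms/images/vm01_base_1', 'format': 'qcow2'},
])
depth, files = chain_depth(sample)
print(depth, 'multi-level' if depth > 1 else 'single-level')
```

A depth greater than 1 corresponds to the multi-level image situation described in this case.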


Why Did AnyBackup Fail to Back Up the UIS Virtual Machine?

More generally, when AnyBackup fails to back up a UIS virtual machine, the cause is usually one of the following:

  • Snapshot failure

  • Permission or credential errors

  • Insufficient storage space

  • Network interruption

  • Incompatible hypervisor version

AnyBackup relies on stable connectivity and correct VM permissions to complete backup jobs successfully.

If your VM is hosted on UIS infrastructure, integration issues may cause snapshot timeout errors.

Root Cause:

The VM's existing snapshots left its disk as a multi-level image. In that state CBT cannot be enabled for the VM, so the backup device cannot back it up.


Solutions:

Power off the VM and merge the multi-level image disk files.

The specific steps are as follows:

1. Clone the VM as a safety copy, then shut the original VM down cleanly.
2. Click "Merge Image."
3. Confirm the image merge. Once the merge is successful, restart the VM.
4. Check the VM disk status; the files have been merged. Backup should now work normally.

General Step-by-Step Fix for AnyBackup VM Backup Errors

1. Check VM Snapshot Status

  • Log in to the hypervisor console

  • Verify that the snapshot is not locked

  • Delete old or failed snapshots

  • Retry the backup task

Snapshot conflicts are one of the most common causes of backup failure.


2. Verify Credentials & Permissions

Ensure the backup account has:

  • Administrator privileges

  • Snapshot permission

  • Storage access rights

Incorrect role assignments often cause silent backup failures.


3. Check Storage Availability

Low storage space may interrupt the backup write process.

Check free space on:

  • Backup repository

  • Target NAS or SAN

  • Local disk path
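
For paths that are mounted on the machine running the check, free space can be read with the Python standard library. This is a minimal sketch; the helper name `free_space_report` and the 10% floor are hypothetical, and a NAS or SAN target has to be mounted locally for this to see it.

```python
import shutil


def free_space_report(path, min_free_percent=10.0):
    """Return free space for `path` and whether it is above a chosen floor."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_percent = 100.0 * usage.free / usage.total if usage.total else 0.0
    return {
        'path': path,
        'free_gib': round(usage.free / 1024 ** 3, 1),
        'free_percent': round(free_percent, 1),
        'ok': free_percent >= min_free_percent,
    }


# The current directory stands in for a repository or local disk path.
print(free_space_report('.'))
```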


4. Confirm Network Stability

Test connection between:

  • Backup server

  • UIS host

  • Backup repository

High packet loss or firewall restrictions can block backup transfer.
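
A basic TCP reachability test can be scripted with the standard library. The sketch below only shows whether a port accepts connections; it does not measure packet loss. The helper name `can_connect` is hypothetical, and the hosts and service ports to test depend on the environment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Self-check against a throwaway local listener instead of real hosts.
listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)
port = listener.getsockname()[1]
print(can_connect('127.0.0.1', port))
listener.close()
```

In practice, call `can_connect` from the backup server against the UIS host and the backup repository, using the ports those services actually listen on.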
