Business Continuity Metrics Every Risk Manager Dreams Of…

By Carlos Escapa, PHD Virtual – Founding Member DRP Council

Managing risk in IT is a multidisciplinary science that brings together business processes, applications and infrastructure. Each area is composed of multiple layers that IT tends to manage in isolation.

At a high level, risk managers use Business Impact Analysis (BIA) and Risk Assessments (RA) to identify and prioritize the protection requirements of business processes. The most common metric used here is dollars per hour, such as lost sales or lost profits. Some companies also measure the accruals required to handle investigations, possible litigation and public relations. Not all business impact is directly quantifiable; damage to reputation or brand image is one example.

IT has typically measured risk bottom-up. Infrastructure metrics such as Mean Time Between Failures (MTBF) and Mean Time to Repair or Replace (MTTR) for specific components have been around for decades and are very well understood. These metrics are incorporated into service level contracts with major suppliers. Because data centers today are highly redundant, and because many virtualized installations have adopted fail-in-place methodologies, MTBF and MTTR have lost relevance for small components like servers and routers. They remain critical to risk management for large components, particularly power supply, power generation, batteries and cooling systems.
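
As a rough illustration of how MTBF and MTTR combine, the sketch below uses the standard steady-state availability formula, MTBF / (MTBF + MTTR), for a single non-redundant component. The function names and the 8,760-hour year are assumptions made for this example, not figures from any supplier contract.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year, component in continuous service


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a single component is expected to be in service."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours per year that the component is out of service."""
    unavailability = 1.0 - steady_state_availability(mtbf_hours, mttr_hours)
    return unavailability * HOURS_PER_YEAR
```

For example, a cooling unit with an MTBF of 999 hours and an MTTR of 1 hour would have an availability of 0.999, or roughly 8.76 hours of expected downtime per year.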

Storage has a metric called Recovery Point Objective (RPO) that purportedly measures the maximum loss of data in case of a disaster. In the context of replication, it measures the time that it takes to replicate a block of data reliably across two data centers. In the context of backup processes, it measures the maximum age of blocks of data that can be restored.
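
In the backup context, one way to check an RPO against reality is to look at the longest stretch of time not covered by a restorable backup. The sketch below assumes that every timestamp in the list marks a completed, restorable backup; the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def worst_case_backup_gap(backup_times: Sequence[datetime], now: datetime) -> timedelta:
    """Longest interval without a restorable backup, up to and including 'now'.

    This is the maximum age the newest restorable data could have had at any
    moment in the observed window, which can be compared against the RPO.
    """
    if not backup_times:
        raise ValueError("at least one completed backup is required")
    times = sorted(backup_times)
    # Gaps between consecutive backups, plus the gap since the latest one.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    gaps.append(now - times[-1])
    return max(gaps)
```

If the result exceeds the stated RPO, the backup schedule did not actually deliver that objective during the period examined.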

RPO is not a good metric for risk managers because it does not take into account application consistency, and therefore cannot guarantee protection from data loss without manual intervention (for instance, to examine transaction journals in n-tier applications with multiple data repositories). A more meaningful metric for risk managers would be Application Recovery Point Objective, which would establish the maximum age of reliable checkpoints that a business process can be rolled back to.
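
To make the idea of an Application Recovery Point concrete, here is a minimal sketch. It assumes, purely for illustration, that each data repository records the timestamps of coordinated, application-wide checkpoints, and that the application can only be rolled back to a checkpoint that every repository holds. The repository names and data layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Set


def latest_common_checkpoint(
    checkpoints_by_repo: Dict[str, Set[datetime]]
) -> Optional[datetime]:
    """Newest checkpoint present in every repository, or None if there is none."""
    if not checkpoints_by_repo:
        return None
    common = set.intersection(*checkpoints_by_repo.values())
    return max(common) if common else None


def application_recovery_point_age(
    checkpoints_by_repo: Dict[str, Set[datetime]], now: datetime
) -> Optional[timedelta]:
    """Age of the newest checkpoint the whole application can be rolled back to."""
    checkpoint = latest_common_checkpoint(checkpoints_by_repo)
    return None if checkpoint is None else now - checkpoint
```

Note that a repository with a very recent storage-level copy does not improve the result unless every other repository has a checkpoint at that same point, which is the gap between storage RPO and application consistency described above.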

Recovery Time Objective (RTO) is fairly self-explanatory and can be applied to multiple layers of the IT stack, from storage (time until I/O operations can start) to servers (time to boot the operating system) to applications (time until the application is available). Application Recovery Time Actual (RTA) is useful for risk managers as it establishes empirically the time to service.

Maximum Allowable Downtime (MAD) (also Maximum Tolerable Outage) refers to the length of time that a business process can be interrupted. This is a business metric based on the BIA. The MAD includes the time to diagnose the nature of a disruption, decision-making time, and the worst-case RTA – typically a failover across data centers or clouds.
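
The relationship between MAD and its components can be sketched as a simple check: diagnosis time plus decision-making time plus the worst measured RTA should not exceed the MAD set by the BIA. The function names and the sample durations below are assumptions for illustration only.

```python
from datetime import timedelta
from typing import Sequence


def worst_case_outage(
    diagnosis: timedelta, decision: timedelta, measured_rtas: Sequence[timedelta]
) -> timedelta:
    """Diagnosis time + decision time + the longest recovery time observed."""
    if not measured_rtas:
        raise ValueError("at least one measured RTA is required")
    return diagnosis + decision + max(measured_rtas)


def meets_mad(
    mad: timedelta,
    diagnosis: timedelta,
    decision: timedelta,
    measured_rtas: Sequence[timedelta],
) -> bool:
    """True if the worst-case outage fits within the Maximum Allowable Downtime."""
    return worst_case_outage(diagnosis, decision, measured_rtas) <= mad


# Hypothetical figures: 30 minutes to diagnose, 15 minutes to decide, and
# recovery tests that took 45 minutes (local restart) and 2 hours (failover).
rtas = [timedelta(minutes=45), timedelta(hours=2)]
within_four_hours = meets_mad(
    timedelta(hours=4), timedelta(minutes=30), timedelta(minutes=15), rtas
)
```

With these sample figures the worst-case outage is 2 hours 45 minutes, so a 4-hour MAD would be met while a 2-hour MAD would not.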

Most of those metrics used to be calibrated yearly if not more often. In clouds, it is becoming much easier to measure and report on them. IT Risk Management is poised to make significant gains in terms of precision and relevance to overall Corporate Risk Management.
