Several years back, we were involved with a large Canadian organization that experienced a complete system outage. The outage was the result of two city-wide blackouts that occurred within 30 minutes of each other. Long story short, all of the systems went down.
Do I need to mention that the business was unhappy and that senior management was looking for heads?
Shortly after all of the systems failed and the “Major Incident” was declared, IT management was recalled to the command and control room. This “war” room had all the brains needed to recover the systems, but it was obvious that total chaos had taken over.
A lack of IT Disaster Recovery (ITDR) documentation and procedures, the staff’s unfamiliarity with the organization's critical systems, and senior management’s inability to make the right decisions all contributed to the prolonged system outage.
A false sense of security
On the surface, this organization did everything right. The DR data centre was fully operational, and the production systems were replicated to it. This secondary data centre was not near the primary site, and in theory, it was ready to go. Even so, it took more than 24 hours to recover all of the systems and resume business operations.
There were many lessons learned during this incident. The addition of dedicated business and IT systems continuity teams and a re-examination of the documentation and procedures are just two that I will mention here.
Not long after this incident, all IT operations were outsourced to a third-party provider.
What should the IT Organization do to minimize IT system outages?
There are a few key activities and approaches that organizations should implement immediately to ensure IT Systems availability and continuity. It should all start with the development of the IT Disaster Recovery plan. This plan should follow these 6 high-level steps:
- Asset definition – IT should collaborate with the business to identify mission-critical applications and IT systems within the organization, in alignment with the findings of a Business Impact Analysis (BIA).
- Determine Recovery Objectives – Once critical systems are identified, the organization should develop Recovery Time Objectives (RTO), the maximum tolerable downtime for each system, and Recovery Point Objectives (RPO), the maximum tolerable window of lost data. More on these can be found here.
- Determine Backup and Recovery Systems – Develop and implement adequate backup and recovery procedures and approaches for the critical systems (e.g. secondary sites, replication/mirroring, cloud backup, appliance backups, tape backups).
- Develop an ITDR plan – Document DR procedures for applications and systems, communication plans, roles and responsibilities matrix, etc.
- Test the ITDR plan – The plan should be tested at least once a year by simulating a disaster.
- Develop an ITDR plan continuity framework – The DR plan should be revisited regularly. Potential triggers include the addition of new applications, recovery hardware changes, changes in Business Impact Analysis requirements, etc.
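To make the recovery objectives and the yearly test more concrete, here is a minimal Python sketch of how the results of a DR test could be compared against each critical system's recovery time objective and recovery point objective. Everything in it is an assumption made for illustration: the system names, the objective values, the measured results, and the names `CriticalSystem`, `DrTestResult` and `find_objective_breaches` are hypothetical and do not come from the incident described above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CriticalSystem:
    name: str
    recovery_time_objective: timedelta   # maximum tolerable downtime
    recovery_point_objective: timedelta  # maximum tolerable window of lost data


@dataclass(frozen=True)
class DrTestResult:
    system: str
    time_to_recover: timedelta   # downtime measured during the DR test
    data_loss_window: timedelta  # age of the newest data that could be restored


def find_objective_breaches(systems, results):
    """Return one message per missed objective or untested critical system."""
    results_by_system = {r.system: r for r in results}
    breaches = []
    for s in systems:
        r = results_by_system.get(s.name)
        if r is None:
            breaches.append(f"{s.name}: not covered by the DR test")
            continue
        if r.time_to_recover > s.recovery_time_objective:
            breaches.append(
                f"{s.name}: recovery took {r.time_to_recover}, "
                f"objective is {s.recovery_time_objective}"
            )
        if r.data_loss_window > s.recovery_point_objective:
            breaches.append(
                f"{s.name}: lost {r.data_loss_window} of data, "
                f"objective is {s.recovery_point_objective}"
            )
    return breaches


# Hypothetical inventory and test results, for illustration only.
systems = [
    CriticalSystem("payroll", timedelta(hours=4), timedelta(minutes=15)),
    CriticalSystem("email", timedelta(hours=8), timedelta(hours=1)),
    CriticalSystem("intranet", timedelta(hours=24), timedelta(hours=12)),
]
results = [
    DrTestResult("payroll", timedelta(hours=6), timedelta(minutes=5)),
    DrTestResult("email", timedelta(hours=2), timedelta(minutes=30)),
]

breaches = find_objective_breaches(systems, results)
for line in breaches:
    print(line)
```

With this sample data, payroll misses its recovery time objective and the intranet is flagged because the test never covered it. A report like this can feed directly into the regular review of the plan.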
StratoGrid Advisory is a Business Continuity Management (BCM) Advisory firm in the Ottawa/Gatineau region that can provide you with the experience and knowledge needed to implement the IT Disaster Recovery Plan successfully and to implement a BCM Program in your organization.