A number of years back, I was involved with a large Canadian organization that experienced a complete system outage. The outage was the result of two city-wide blackouts that occurred within the same 30 minutes. Long story short, all of the systems went down. Do I need to mention that the business was unhappy and that senior management was looking for heads?
Shortly after all of the systems failed and the “Major Incident” was declared, IT management was recalled to the command and control room. This “war” room had all the brains needed to recover the systems, but it was obvious that total chaos had ensued. The lack of IT Disaster Recovery (ITDR) documentation and procedures, the staff’s unfamiliarity with the organization’s critical systems and the inability of senior management to make the right decisions all contributed to the prolonged system outage.
On the surface, this organization did everything right. The DR data centre was fully operational, and the production systems were replicated to it. This secondary data centre was not in close proximity to the primary site, and in theory, it was ready to go. In the end, it took more than 24 hours to recover all of the systems and resume business operations.
There were many lessons learned during this incident. The addition of dedicated business and IT systems continuity teams, along with a re-examination of the documentation and procedures, are just two that I will mention here. Not long after this incident, all IT operations were outsourced to a third-party provider.
So, what should an IT organization do to minimize IT system outages?
There are a few key activities and approaches which organizations should implement immediately to ensure IT systems availability and continuity. It should all start with the development of an IT Disaster Recovery plan. This plan should follow these six high-level steps:
- Asset definition – IT should collaborate with the business to identify mission-critical applications and IT systems within the organization.
- Determine Recovery Objectives – Once critical systems are identified, the organization should develop Recovery Time Objectives (RTO), the longest a system can acceptably stay down, and Recovery Point Objectives (RPO), the largest window of data loss the business can accept.
- Determine Backup and Recovery Systems – Develop and implement adequate backup and recovery procedures and approaches for the critical systems (e.g. secondary sites, replication/mirroring, cloud backup, appliance backups, tape backups, etc.)
- Develop DR plan – Document DR procedures for applications and systems, communication plans, roles and responsibilities matrix, etc.
- Test DR plan – At least once a year, the plan should be tested against a simulated disaster.
- Develop a DR plan continuity framework – The DR plan should be revisited on a regular basis. Potential triggers include the addition of new applications, changes to recovery hardware, changes in Business Impact Analysis requirements, etc.
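
To make the recovery objectives and the annual test more concrete, here is a minimal Python sketch of how the results of an outage or a DR test could be compared against agreed targets. Everything in it is an assumption for illustration: the `RecoveryObjectives` class, the `evaluate_recovery` function, the "payroll" system, and the 4-hour and 15-minute targets are hypothetical. Only the 24-hour recovery time echoes the incident described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business for one critical system (hypothetical)."""
    system: str
    rto: timedelta  # Recovery Time Objective: longest acceptable downtime
    rpo: timedelta  # Recovery Point Objective: longest acceptable window of lost data


def evaluate_recovery(objectives, outage_start, last_good_backup, service_restored):
    """Compare one outage (or DR test) against the objectives.

    Returns a list of human-readable breaches; an empty list means both
    objectives were met.
    """
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_backup

    breaches = []
    if downtime > objectives.rto:
        breaches.append(
            f"{objectives.system}: RTO missed, down for {downtime}, target {objectives.rto}"
        )
    if data_loss_window > objectives.rpo:
        breaches.append(
            f"{objectives.system}: RPO missed, lost {data_loss_window} of data, "
            f"target {objectives.rpo}"
        )
    return breaches


# Illustrative values only: a 4-hour RTO and a 15-minute RPO for a made-up system.
payroll = RecoveryObjectives("payroll", rto=timedelta(hours=4), rpo=timedelta(minutes=15))

outage_start = datetime(2020, 1, 1, 9, 0)  # arbitrary example timestamp
breaches = evaluate_recovery(
    payroll,
    outage_start=outage_start,
    last_good_backup=outage_start - timedelta(minutes=5),  # replication was recent
    service_restored=outage_start + timedelta(hours=24),   # recovery took a full day
)

for breach in breaches:
    print(breach)
```

In this made-up scenario the replicated data is recent enough to satisfy the RPO, yet the RTO is missed by a wide margin. That is roughly the situation from the story: replication was in place, but recovery still took far too long. Recording results this way after every test gives the continuity team a simple, repeatable measure of whether the plan still meets the business requirements.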
StratoGrid Advisory's Business Continuity Management practice can help your organization find the right fit and the right vendor for your backup and IT disaster recovery services.