Crafting a Lean IT Business Continuity Strategy by Balancing Cost and Resilience
By Joanne Weng, Director of International Business Department, Synology
In the ever-evolving digital landscape, businesses face increasing challenges in ensuring the safety and continuity of their data. A string of disruptions experienced by major corporations has only heightened the need for robust backup and recovery mechanisms. At the heart of modern businesses lies data, and its security and risk management play a pivotal role in ensuring business continuity.
However, while the importance of backups and disaster recovery plans is universally acknowledged, executing them can become prohibitively expensive. This financial challenge underscores the necessity of prioritization and the ability to architect a lean yet resilient IT infrastructure.
A clear checklist is required
While the causes, impacts, and solutions of data-related incidents may vary, the overarching principles remain consistent. Your organization likely already has some backups to counter ransomware or equipment failure. So answer this: What recovery point objectives (RPOs) and recovery time objectives (RTOs) can you achieve with your current backup plan if your production servers or cloud instances suddenly vanish? Put another way, how much money will the downtime and lost data cost the business if you need to perform a complete disaster recovery process?
If that question makes you uneasy, and you’re in a role that should know the answer, it might be time to review your backup and disaster recovery (DR) plans.
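One rough way to put a number on that question is sketched below. It assumes the worst case is a full restore that begins just before the next scheduled backup would have run. The class, field names, and all figures are hypothetical placeholders, not measurements from any real environment.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEstimate:
    backup_interval_hours: float   # gap between backups = worst-case data loss (achievable RPO)
    restore_hours: float           # time to restore and validate the service (achievable RTO)
    downtime_cost_per_hour: float  # assumed revenue or productivity lost per hour offline
    rework_cost_per_hour: float    # assumed cost to recreate one hour of lost data


def worst_case_cost(estimate: RecoveryEstimate) -> float:
    """Estimate the cost of one full disaster recovery in the worst case."""
    downtime_cost = estimate.restore_hours * estimate.downtime_cost_per_hour
    data_loss_cost = estimate.backup_interval_hours * estimate.rework_cost_per_hour
    return downtime_cost + data_loss_cost


# Hypothetical example: nightly backups, eight-hour restore.
nightly = RecoveryEstimate(
    backup_interval_hours=24,
    restore_hours=8,
    downtime_cost_per_hour=5000,
    rework_cost_per_hour=300,
)
print(worst_case_cost(nightly))
```

Running the same calculation for a shorter backup interval or a faster restore path gives a simple basis for comparing what a tighter RPO or RTO is worth against what it costs.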
Starting with the fundamentals, businesses need to map out and identify which systems are responsible for which real-world “work”. While some companies may use siloed infrastructure per department, there are likely countless dependencies that need to be mapped out. For example, it’s obvious that a directory server disruption will knock out authentication for every service and endpoint (which is huge but expected), but what about your internal ERP system?
If you don’t already have a map of your IT infrastructure, get it done. Ensure that system dependencies are clearly documented and well understood. Next, list the primary real-world processes based on your business (e.g. product manufacturing, e-commerce, logistics) and most importantly, stack-rank them based on their financial impact if disruptions happen.
Each business will have vastly different requirements based on its structure and technology stack. However, there will always be a cost that can be associated with downtime. This process needs to be routinely reviewed and kept up to date.
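The mapping and stack-ranking exercise described above can be sketched in a few lines. Everything here is an illustrative assumption: the system names, the dependency edges, the business processes, and the hourly cost figures are invented for the example and would come from your own inventory.

```python
# Hypothetical dependency map: each system lists the systems it depends on.
DEPENDS_ON = {
    "directory": [],
    "storage": [],
    "database": ["storage"],
    "erp": ["database", "directory"],
    "webshop": ["database"],
}

# Hypothetical business processes, the systems they need, and an assumed
# cost per hour of disruption.
PROCESSES = {
    "manufacturing": {"systems": ["erp"], "cost_per_hour": 20000},
    "e-commerce": {"systems": ["webshop"], "cost_per_hour": 8000},
    "logistics": {"systems": ["erp"], "cost_per_hour": 5000},
}


def all_dependencies(system, seen=None):
    """Return the system plus everything it transitively depends on."""
    seen = set() if seen is None else seen
    if system in seen:
        return seen
    seen.add(system)
    for dependency in DEPENDS_ON.get(system, []):
        all_dependencies(dependency, seen)
    return seen


def impact_of_outage(failed_system):
    """List affected processes, most expensive first, if one system fails."""
    affected = {
        name: process["cost_per_hour"]
        for name, process in PROCESSES.items()
        if any(failed_system in all_dependencies(s) for s in process["systems"])
    }
    return sorted(affected.items(), key=lambda item: item[1], reverse=True)


print(impact_of_outage("directory"))
print(impact_of_outage("storage"))
```

In this made-up map, a directory outage stops manufacturing and logistics through the ERP system while the webshop keeps running, which is the kind of non-obvious dependency the mapping is meant to surface.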
Building solid foundations
Building a dependable and resilient IT infrastructure isn’t easy, but it becomes manageable once we break it down into components.
- High availability (HA) for production environments: In the event of a server failure, the HA system should automatically take over, minimizing downtime. For companies that self-host their systems, this is usually done through HA hypervisor clusters paired with similarly HA-clustered storage systems. Cloud deployments can likewise leverage load balancers and self-monitoring tools to ensure services remain online.
- On-site and off-site backups: Regular backup schedules for critical operational tools like file servers, databases, ERP systems, core service virtual machines, and offline servers should be documented. Depending on the importance of the operational service, appropriate Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) need to be carefully defined. An out-of-date database backup is better than nothing but will still cause a significant headache. Finally, off-site backups and disaster recovery (DR) capacity shouldn’t be an afterthought. To keep costs in check, retention periods can be shortened and the DR equipment or cloud instances can be sized smaller than production.
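Once RPOs are defined per service, checking them against the age of the newest backup is straightforward. The following is a minimal sketch under stated assumptions: the service names, RPO targets, and timestamps are hypothetical, and in practice the timestamps would be read from your backup tool's reports.

```python
from datetime import datetime, timedelta

# Hypothetical RPO targets per service.
RPO = {
    "erp-database": timedelta(hours=1),
    "file-server": timedelta(hours=24),
    "core-vm": timedelta(hours=4),
}


def rpo_violations(last_backup, rpo, now):
    """Return the services whose newest backup is older than their RPO."""
    return sorted(
        name for name, taken_at in last_backup.items()
        if now - taken_at > rpo[name]
    )


# Fixed timestamps so the example is reproducible.
now = datetime(2024, 1, 10, 12, 0)
last_backup = {
    "erp-database": datetime(2024, 1, 10, 9, 30),  # 2.5 hours old, target is 1 hour
    "file-server": datetime(2024, 1, 9, 23, 0),    # 13 hours old, target is 24 hours
    "core-vm": datetime(2024, 1, 10, 7, 0),        # 5 hours old, target is 4 hours
}
print(rpo_violations(last_backup, RPO, now))
```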
Restoring shouldn’t be stressful
In the unfortunate event of a disruption, a three-tier restoration process can ensure business continuity:
- Automatic failover: HA clusters should be designed to fail over automatically, so that no manual intervention is required during critical moments. This should be enough to take care of simple equipment failures.
- Restore from snapshots or fail over to backup systems: Local snapshots and similar technology allow servers to roll back an unintended change very quickly. If the problem persists or stems from a larger issue (e.g. the entire cluster is down), full restores or failovers to another system should be considered.
- Restore from remote backups or fail over to the DR site: In case of major disruptions like natural disasters, remote backup solutions come into play. Businesses can restore from these backups or, if necessary, fail over to a DR site to resume operations.
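The three tiers above can be written down as an escalation ladder for a runbook. This is only a sketch: the incident categories and the tier each one starts at are assumptions made for illustration, and a real runbook would define them from the organization's own infrastructure.

```python
# The three restoration tiers, in escalation order, with the options in each.
TIERS = [
    ("tier 1", ["automatic HA failover"]),
    ("tier 2", ["roll back to a local snapshot",
                "full restore or fail over to a backup system"]),
    ("tier 3", ["restore from remote backup",
                "fail over to the DR site"]),
]

# Hypothetical incident categories mapped to the first tier worth trying.
START_TIER = {
    "equipment_failure": 0,
    "unintended_change": 1,
    "cluster_down": 1,
    "site_disaster": 2,
}


def escalation_path(incident):
    """Return the tiers to attempt, in order, for a given incident type."""
    return TIERS[START_TIER[incident]:]


for tier, options in escalation_path("cluster_down"):
    print(tier, "->", " / ".join(options))
```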
Stay ahead of the curve
Effective monitoring is the backbone of a resilient infrastructure. The approach should focus on:
- Filtering out the noise: Monitoring solutions need to ensure that only critical notifications are sent out, preventing information overload and ensuring that the right people are alerted promptly when critical events inevitably happen.
- Acting quickly and decisively: Time is of the essence during disruptions. IT, DevOps, SIRT, and even PR teams need to be well coordinated for various types of events. From security breaches to data centre fires or even just mundane equipment failures, anything that might disrupt customers or operations will involve cross-team communication and collaboration. The only way to get better at handling these events is to document what should be done, establish a clear chain of command, and run practice drills.
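As an illustration of filtering out the noise, the sketch below forwards only alerts at or above a severity threshold and suppresses repeats of the same alert inside a time window. The alert fields, severity levels, and ten-minute window are assumptions chosen for the example, not the behaviour of any particular monitoring product.

```python
from datetime import datetime, timedelta

# Assumed severity scale; higher numbers are more urgent.
SEVERITY = {"info": 0, "warning": 1, "critical": 2}


def filter_alerts(alerts, min_severity="critical", window=timedelta(minutes=10)):
    """Keep alerts at or above min_severity, dropping repeats within the window."""
    last_sent = {}
    forwarded = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if SEVERITY[alert["severity"]] < SEVERITY[min_severity]:
            continue  # below the threshold: treat as noise
        key = (alert["source"], alert["message"])
        previous = last_sent.get(key)
        if previous is not None and alert["time"] - previous < window:
            continue  # same alert was already forwarded recently
        last_sent[key] = alert["time"]
        forwarded.append(alert)
    return forwarded


t0 = datetime(2024, 1, 10, 12, 0)
alerts = [
    {"time": t0, "severity": "info", "source": "nas-1", "message": "backup finished"},
    {"time": t0 + timedelta(minutes=1), "severity": "critical", "source": "nas-1", "message": "volume degraded"},
    {"time": t0 + timedelta(minutes=2), "severity": "warning", "source": "hv-2", "message": "high CPU"},
    {"time": t0 + timedelta(minutes=3), "severity": "critical", "source": "nas-1", "message": "volume degraded"},
    {"time": t0 + timedelta(minutes=20), "severity": "critical", "source": "nas-1", "message": "volume degraded"},
]
for alert in filter_alerts(alerts):
    print(alert["time"], alert["source"], alert["message"])
```

With this sample data, only the first "volume degraded" alert and the one that recurs 19 minutes later are forwarded; the informational, warning, and duplicate entries are dropped.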
In conclusion, a comprehensive backup and recovery strategy is essential for businesses aiming for uninterrupted operations. While there are many solutions available in the market, it’s crucial to find one that aligns with your business needs. Over the years, companies like Synology have demonstrated expertise in storage and data protection, with numerous success stories that attest to their capabilities. For more insights on these success stories, check out the following case studies:
Medina County Clerk of Courts Case Study
Thai Nippon Foods Case Study
Imperial War Museum Case Study
Footnotes:
Sharwood, Simon. 2023 ‘Toyota servers ran out of storage, crashed production at 14 plants in Japan’ The Register, September 7 2023. https://www.theregister.com/2023/09/07/toyota_outage_storage_server/
Sharwood, Simon. 2023 ‘Microsoft admits slim staff and broken automation contributed to Azure outage’ The Register, September 4 2023. https://www.theregister.com/2023/09/04/microsoft_australia_outage_incident_report/
Claburn, Thomas. 2023 ‘Cisco’s Duo Security suffers major authentication outage’ The Register, August 31 2023. https://www.theregister.com/2023/08/21/ciscos_duo_outage/