
Part of cyber resilience is considering what to do when the worst happens, and that worst-case scenario is, sadly, all but inevitable at some point. It will take the form of a significant incident, a disaster from which the school or college needs to recover, and planning for this should have produced a Disaster Recovery (DR) plan. But what should such a plan look like?
I have given this quite a bit of thought. Should a disaster recovery plan be a long and detailed document, or something much simpler and more digestible?
On one hand we might want the long document with all the detail: in the event of a disaster we will want as much information as possible, first to isolate and manage the incident and later to recover from it. The issue is that when a fire has been lit under the IT Services team by an incident, the last thing anyone wants to do is wade through a long and complex document. I have seen a disaster plan which included lots of Gantt charts with estimated timelines for different parts of the recovery, but how can we predict these with any accuracy against the multitude of potential scenarios? Additionally, the information you will actually need is likely to depend very much on the nature of the incident.
The flip side is a much more manageable document, one which is easier to digest and turn to in a high-stress crisis, but whose brevity means it will lack some of the detail you may want. That said, a shorter document is easier to rehearse and prepare with when running simulated and desktop incidents, so that staff remember the structure and are largely able to act without needing to refer too often to the supporting DR plan. It is also more likely to be applicable across a wider range of scenarios.
The above, however, suggests only two options, detail or brevity and ease of use, but my thinking on DR has led me to conclude that we need both. We need a brief incident plan which is general enough to fit almost every possible incident. It should set out how an incident is called and which roles will need to be filled, including contact details for the various people who might fill each role. It should cover the initial steps only: getting the incident team together so that they can respond to the specific nature of the incident in hand. It is the outline process for calling an incident and for its initial management.
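By way of illustration only, the sketch below shows the sort of skeleton such a brief plan might have, expressed as a small Python structure simply so it can be version-controlled and printed as a one-page reference. Every role, name, number and step here is a placeholder of my own, not a template from any standard, and your plan may well look quite different.

```python
# A minimal sketch of what a brief incident plan might capture: how an
# incident is called, who fills each role, and the first few steps only.
# All roles, names and contact details below are placeholders.

incident_plan = {
    "how_to_call_an_incident": (
        "Any member of IT Services or SLT can call an incident by phoning "
        "the incident lead; if unavailable, phone the deputy."
    ),
    "roles": {
        "incident_lead": {"primary": "A. Name, 01234 000001", "deputy": "B. Name, 01234 000002"},
        "technical_lead": {"primary": "C. Name, 01234 000003", "deputy": "D. Name, 01234 000004"},
        "communications": {"primary": "E. Name, 01234 000005", "deputy": "F. Name, 01234 000006"},
    },
    "initial_steps": [
        "Confirm the incident and note the time it was called.",
        "Assemble the incident team, in person or via an out-of-band channel.",
        "Agree initial isolation and containment actions for this specific incident.",
        "Agree the next check-in time and who is communicating with whom.",
    ],
}

if __name__ == "__main__":
    # Print the plan as a quick one-page reference.
    print(incident_plan["how_to_call_an_incident"])
    for role, contacts in incident_plan["roles"].items():
        print(f"{role}: {contacts['primary']} (deputy: {contacts['deputy']})")
    for i, step in enumerate(incident_plan["initial_steps"], 1):
        print(f"{i}. {step}")
```

The point is less the format and more the brevity: everything above fits on a single page that can be rehearsed until it is largely remembered.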
Then we need the reference information which will aid in the identification and management of, and eventual recovery from, an incident. Most of this should already exist in proper documentation of systems, setup and processes, but it is often missing. When things are busy the focus is on setting things up, deploying technology or fixing issues, and documenting activities, configurations and so on is put off for another day, a day which often never comes. I think the creation of this documentation may actually be key.
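As a rough sketch of what I mean by documentation, the record below shows the kind of detail worth capturing per system. The field names and the example values are my own suggestions, not a standard; the test is simply whether someone other than the person who built the system could begin a recovery from it.

```python
# A sketch of a per-system documentation record. Field names and example
# values are illustrative assumptions, not a recognised template.

from dataclasses import dataclass
from typing import List

@dataclass
class SystemRecord:
    name: str                  # what the system is (placeholder name below)
    owner: str                 # who in IT Services looks after it
    vendor_support: str        # support contact and contract/SLA reference
    depends_on: List[str]      # other systems or services it needs to run
    backup_location: str       # where backups live and how to reach them
    restore_steps: List[str]   # outline recovery steps, in order
    last_restore_test: str     # when a restore was last actually tested

example = SystemRecord(
    name="Example-MIS",        # hypothetical system
    owner="Network Manager",
    vendor_support="Vendor helpdesk, contract reference TBC",
    depends_on=["core switch", "domain controller", "internet access"],
    backup_location="Nightly backup to an off-site/cloud store",
    restore_steps=[
        "Provision a replacement server or VM.",
        "Restore the latest known-good backup.",
        "Re-point DNS and test staff and student access.",
    ],
    last_restore_test="Not yet tested",
)
```

If the "last restore test" field makes you wince, that is rather the point: the documentation and the testing go hand in hand.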
Conclusion
The specifics of a DR plan will vary with your context, so I don’t think there is a single solution. For me there are three key factors.
- Having a basic plan which is well understood in relation to calling an “incident” and the initial phases of managing it. This needs to be clear and accessible so that it is useful in a potentially high-stress situation.
- Having documentation of your systems and setup to aid recovery. This is often forgotten during setup or when changes are made; however, when responding to an incident, detailed documentation can be key.
- Testing your processes to build familiarity, to ensure they work as intended, and to adjust them as needed.
DR planning is critical because we increasingly need to treat an incident as inevitable; the better prepared we are, the greater our potential for minimising the impact of that incident on our school or college.


As we use more and more cloud services, internet access and the school’s internet provision become critically important. Given this, when looking at internet service provision, firewalls and core switches, the two main areas I would consider are doubling up where finances allow, or carefully examining the service level agreement (SLA) along with any penalties proposed where service levels are not met. For firewalls and core switches, cold spares with a lower specification may also be an option, minimising cost while still allowing for quick recovery in the event of an issue. When looking at providers’ SLAs and their support offering for when things go wrong, consider whether it is next-business-day on-site support or return to base, for example, and how long their anticipated recovery period is.
In the case of edge switches and Wi-Fi access points we are likely to have large numbers, especially on larger sites. I would suggest that heat mapping is key at the outset of a Wi-Fi deployment, to make sure Wi-Fi will work across the site. For resiliency when things go wrong, my view is an N+1 approach: establishing a spare, or a quantity of spares, based on the total number of units in use and the level of risk deemed acceptable. A high acceptance of risk means fewer spares, whereas a low acceptance of risk may lead to a greater number of spares.
Cables break too, and various small animals love to chew on them given half a chance, so spare cables are worth keeping as well.
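To put a rough number on that N+1 idea, the sketch below shows one way of turning fleet size and risk appetite into a spares count. The ratios are purely my own assumptions for illustration, not an industry standard, and the item counts are made up; adjust both to your own context and budget.

```python
# Illustrative N+1 spares calculation. The spare ratios are assumptions,
# not a standard: higher risk acceptance means fewer spares.

import math

SPARE_RATIOS = {"high": 0.02, "medium": 0.05, "low": 0.10}

def recommended_spares(units_in_use: int, risk_appetite: str) -> int:
    """Return a suggested spares count, always at least one (the '+1')."""
    ratio = SPARE_RATIOS[risk_appetite]
    return max(1, math.ceil(units_in_use * ratio))

if __name__ == "__main__":
    # Hypothetical fleet sizes for a larger site.
    for item, count in [("Wi-Fi access points", 120), ("edge switches", 30)]:
        for appetite in ("high", "medium", "low"):
            print(f"{item} ({count} in use), {appetite} risk acceptance: "
                  f"{recommended_spares(count, appetite)} spare(s)")
```

Whatever figures you settle on, the useful discipline is writing them down and revisiting them when the estate grows or the budget changes.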