About six years ago we (the ITS Leadership Team) put in place a formal process for how we handle outages. This “incident management” process has evolved into a well-documented and more focused process that relies on specific roles, like an Incident Manager, Technical Lead, and Communications Lead.
It’s good to have a process for managing outages, but it’s even better to avoid them. ITS conducts post-incident reviews of the largest issues and of recurring issues to find lessons for our people, processes, or technology environment. Examples of changes that came from those debriefs include: 1) collaborating with St. Olaf to create more redundancy in our Internet path to avoid the impact of fiber cuts during construction season and 2) moving from on-campus Zimbra email to cloud-hosted Gmail. These changes resulted in a significant reduction in network and email outages.
In January 2020, ITS implemented a new ticket system (TDX), which has provided a lot of new functionality. One of its features is change control, i.e., a way to record when changes are made to the technology environment. That has proven invaluable. When something starts behaving strangely, we can check the change log and identify any unintended consequences of our actions more quickly.
In late October, ITS added its next layer of process maturity: a Change Control Board and a companion policy document based on the industry-standard ITIL guidelines. This board consists of representatives from all five groups within ITS and meets weekly to discuss both the timing of planned changes and the preparations for making them. The Board will also review remaining lessons and actions from open “post-incident reports.”
The ITIL guidelines acknowledge that there are too many changes, and too little value in doing so, to review every change with the Change Control Board. Would installing software on a single computer need to be reviewed? How about activating a single network port? Usually, no. Nonetheless, unintended consequences do happen, and we need to take further steps to reduce them.
Two of the lessons from a recent outage are reminders: to make even routine changes outside of the busiest class periods whenever possible, and to double-check our sense of what we consider a routine change.
After some discussion, we decided to avoid production changes from 9:45am to 3:15pm during the last two weeks of classes this term and to make no changes at all during the three days of final exams. Our goal with this “change freeze” is threefold:
- to avoid any unintended impact on teaching and coursework for the rest of this term,
- to provide an opportunity to learn about our production changes and whether we could reorganize our work to reduce the amount of change while classes are in session, and
- to refine our process for determining what is a routine change.
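For readers who like to see a rule written down precisely, the change freeze described above could be sketched in Python roughly as follows. This is an illustration only, not a description of any tool ITS uses: the `FreezePolicy` class, its field names, and the dates are hypothetical placeholders, and treating the 3:15pm boundary as the moment changes are allowed again is an assumption.

```python
from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass(frozen=True)
class FreezePolicy:
    """Hypothetical model of the change freeze; all names are illustrative."""
    classes_start: date  # first day of the last two weeks of classes
    classes_end: date    # last day of classes
    exams_start: date    # first of the three final exam days
    exams_end: date      # last of the three final exam days
    window_start: time = time(9, 45)   # 9:45am, from the article
    window_end: time = time(15, 15)    # 3:15pm, from the article

    def is_frozen(self, when: datetime) -> bool:
        """Return True if a production change at `when` should be avoided."""
        day = when.date()
        # No changes at all during final exams.
        if self.exams_start <= day <= self.exams_end:
            return True
        # During the last two weeks of classes, only the daytime window is frozen.
        if self.classes_start <= day <= self.classes_end:
            # Assumption: the start is inclusive and the end is exclusive.
            return self.window_start <= when.time() < self.window_end
        return False


# Placeholder dates, chosen only to make the example runnable.
policy = FreezePolicy(
    classes_start=date(2000, 1, 3),
    classes_end=date(2000, 1, 14),
    exams_start=date(2000, 1, 15),
    exams_end=date(2000, 1, 17),
)
```

With this sketch, `policy.is_frozen(datetime(2000, 1, 5, 10, 0))` reports a frozen period (mid-morning during classes), while an early-morning or evening change on the same day would be allowed.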
Given the complexity of the Carleton environment, we can’t completely eliminate unscheduled interruptions, but we take very seriously our responsibility to learn from them.