After a recent outage, we thought the incident would make a good case study, showing both the technical complexities involved and the process improvement lessons for ITS.
Moodle when it started
For about 10 years, Carleton’s Moodle environment was a single, monolithic server hosted on campus, with a new instance created every year. Courses were downloaded from the prior year’s instance and brought into the current one, a practice that stopped in 2014.
Skyrocketing use of Moodle during COVID revealed that Moodle, as it was configured at the time, didn’t have enough capacity to handle the influx of users, resulting in a series of outages. In Spring ’21, ITS redesigned how Moodle worked by adding a load balancing feature — essentially, putting more than one server in place, all looking at the same underlying Moodle data, which would share the active users among them. This offered more capacity and more flexibility. If one server failed (or needed maintenance), the users would be automatically moved to a working server. During regular times, more than one Moodle server would be in use, helping to share the load.
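The load balancing idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique (round-robin rotation that skips failed servers), not a description of the product or configuration ITS actually uses; the class name and the server names `moodle-1` through `moodle-3` are hypothetical.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        """Take a server out of rotation (failure or maintenance)."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Return a known server to rotation."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server in rotation."""
        # One full pass around the ring is enough to find a healthy server.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


# Hypothetical server names, for illustration only.
balancer = LoadBalancer(["moodle-1", "moodle-2", "moodle-3"])
first_round = [balancer.pick() for _ in range(3)]

# Simulate one server failing: new requests go only to the other two.
balancer.mark_down("moodle-2")
after_failure = [balancer.pick() for _ in range(4)]
```

In this sketch, requests are spread evenly while all servers are up, and a server that is marked down simply stops receiving new users until it is marked up again.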
Sharing data across three servers
In order for the data to be available to all of the servers, the servers and the data were connected through a VLAN (Virtual Local Area Network). The “virtual” approach has been an IT technique for decades, making it possible for hundreds of virtual servers to live on one physical computer. In this case, the Moodle servers were virtually connected to the backend storage. Without the VLAN, the security firewall would stop that communication in the same way that it stops intruders from moving laterally between the servers and databases on our network.
This type of virtual data sharing was not an active part of our infrastructure when the storage and server environment was put into place years ago. The system administrators who originally configured our virtual data sharing infrastructure created documentation, which has been used successfully for a range of maintenance activities.
A change that revealed an old error
However, the documentation contained a mistake: it had the wrong ID number for one of the VLANs. In all these years, we had never configured virtual machines to communicate on that particular VLAN, so no one knew there was an error.
To make matters more complex, the host environment for the servers had recently been moved to a new private IP address space created as part of selling the college’s public IP address space (which is expected to yield $2M). Because of that change, the servers couldn’t simply be moved: their new IP addresses required them to be rebuilt, using the old documentation.
In this case, because of the virtual sharing, we used the portion of the documentation that contained the error. When one of the Moodle servers was rebuilt in its new address space, it lost contact with its data. The people trying to use Moodle on that server were unable to log in.
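One way to catch this kind of latent documentation error is to compare the documented values against what is actually configured before relying on them. The sketch below assumes both sets of VLAN IDs are available as simple name-to-ID mappings; the function name, VLAN names, and ID numbers are all made up for illustration and are not Carleton's real values.

```python
def find_vlan_mismatches(documented, configured):
    """Return {name: (documented_id, configured_id)} for every disagreement.

    A VLAN missing from one side shows up with None on that side.
    """
    mismatches = {}
    for name in sorted(set(documented) | set(configured)):
        doc_id = documented.get(name)
        live_id = configured.get(name)
        if doc_id != live_id:
            mismatches[name] = (doc_id, live_id)
    return mismatches


# Hypothetical values: the documentation has two digits transposed
# for the storage VLAN, while the web VLAN is recorded correctly.
documented_vlans = {"storage": 210, "web": 120}
configured_vlans = {"storage": 201, "web": 120}

mismatches = find_vlan_mismatches(documented_vlans, configured_vlans)
for name, (doc_id, live_id) in mismatches.items():
    print(f"VLAN '{name}': documented {doc_id}, configured {live_id}")
```

A check like this only helps if the configured values can be read from the network itself, which is an assumption here.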
The outage on October 20th
In general, moving servers in this manner is a routine change and is done frequently to balance resources as usage grows. As is standard practice, our system administrator was moving only one Moodle server at a time, not all three together. Because of this, only some of the users on Moodle were affected; those on the other Moodle servers did not experience any problems.
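The one-server-at-a-time practice can be sketched as a loop that changes a single server, checks it, and stops at the first failure so the remaining servers are left untouched. This is a simplified illustration, not ITS's actual procedure; the function names, server names, and the simulated storage check are all hypothetical.

```python
def rolling_change(servers, apply_change, is_healthy):
    """Change servers one at a time; stop as soon as one fails its check.

    Returns (completed, failed), where failed is None if every server passed.
    """
    completed = []
    for server in servers:
        apply_change(server)
        if not is_healthy(server):
            return completed, server
        completed.append(server)
    return completed, None


# Simulated state: whether each (hypothetical) server can reach its storage.
reachable_storage = {"moodle-1": True, "moodle-2": True, "moodle-3": True}


def rebuild(server):
    # Simulate the first rebuilt server losing contact with its data.
    if server == "moodle-1":
        reachable_storage[server] = False


def can_reach_storage(server):
    return reachable_storage[server]


done, failed = rolling_change(
    ["moodle-1", "moodle-2", "moodle-3"], rebuild, can_reach_storage
)
```

Because the loop stops at the first failed check, the problem stays confined to one server while the others keep serving users.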
The error was found, connectivity was restored in 15 minutes, and the issue was resolved within an hour, but some classes had been interrupted. That called for reaching out with apologies, as well as finding ways to keep this from happening again.
Lessons for server changes
The lessons from this incident are many, but the top two are that we need to be more careful about the timing of “routine” changes, and that a broader set of ITS staff needs to review upcoming changes, including people who might know the implications and distinguishing factors of a particular change. Both of these lessons will be folded into the next phase of ITS’s change management practices, which are described in this article.