Issuing Authority: CTO
Program Coordinator
Overview
The intent of this document is to provide guidance on Carleton College’s (Carleton’s) information technology (IT) business continuity/disaster recovery capabilities and plans. The essential elements captured are the critical systems and business expectations around Recovery Time Objective (RTO) and Recovery Point Objective (RPO). This document incorporates other documents by reference, including recovery plans for business-critical systems.
Scope
This plan covers only the IT portion of the business and should be referenced in a Carleton business continuity plan. It does not address complete IT irrecoverability, in which the business processes that depend on IT systems would need to be recreated; consequently, the plan currently relies on primary data center availability. The plan addresses the on-premises services that Carleton relies on to conduct business. Cloud-based services have their own vendor-provided plans, which should be incorporated into each business area’s continuity plan.
Philosophy
Carleton selects technology solutions that support a robust infrastructure environment. Systems are designed and deployed to support resiliency, redundancy, and availability. Information Technology Services (ITS) maintains backups and retention schedules to meet Carleton’s RPOs and to maintain geographic diversity where cost-effective. As a small liberal arts college, however, Carleton must choose financially responsible solutions that staff can effectively support.
Definitions
Several terms are used throughout this document. They are defined in this section to ensure a common understanding.
- Outage — Unavailability of one or more business-critical systems in the information technology environment. This document considers three levels of severity:
- Level 1 (L1) — any outage (including planned maintenance) that impacts the service
- Level 2 (L2) — an outage that requires application or data recovery for a service
- Level 3 (L3) — an outage that requires multiple systems or services to be recovered, potentially at a different location because of the nature of the outage cause
- Recovery Time Objective (RTO) — The target time, defined in collaboration with the business, between an L2 or L3 outage and the return of the service to availability.
- Recovery Point Objective (RPO) — The maximum amount of data, measured in time and defined in collaboration with the business, that can be lost in an outage.
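As a rough illustration of how the severity levels and the RTO definition fit together, the sketch below classifies an outage by how many services need application or data recovery. The `Severity` enum, the `classify_outage` and `rto_applies` names, and the reading of "multiple" as two or more are assumptions made for illustration; they are not part of Carleton's procedures.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Outage severity levels as defined in this section."""
    L1 = 1  # any outage (including planned maintenance) that impacts the service
    L2 = 2  # application or data recovery is required for a service
    L3 = 3  # multiple systems or services must be recovered


def classify_outage(services_needing_recovery: int) -> Severity:
    """Map a count of services needing recovery to a severity level.

    Assumption: "multiple" is read as two or more services.
    """
    if services_needing_recovery < 0:
        raise ValueError("count cannot be negative")
    if services_needing_recovery == 0:
        return Severity.L1
    if services_needing_recovery == 1:
        return Severity.L2
    return Severity.L3


def rto_applies(severity: Severity) -> bool:
    """The RTO is defined for L2 and L3 outages only."""
    return severity >= Severity.L2
```

Under these assumptions, planned maintenance with no recovery work is classified as L1 and carries no RTO, while any outage needing recovery of one or more services does.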
Service Inventory
The following services have been identified as part of Carleton’s critical infrastructure, and as such, each has a service recovery document. These are listed in order of impact to Carleton and therefore priority in the event of a multiple service outage.
- DNS
- Palo Alto Firewall
- Core network services (router configs, DHCP, etc.)
- Moodle
- www.carleton.edu (main website)
- Internet service
- Identity and directory services (authentication, IDM, etc.)
- OneCard
- Advance
- OnBase
- Facilities Systems
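Because the list above doubles as a recovery priority order, a multiple-service outage can be sequenced by position in the list. The following is a minimal sketch of that idea; the `RECOVERY_PRIORITY` constant, the shortened service names, and the `recovery_order` function are hypothetical and exist only for illustration.

```python
# Service names follow the inventory above, in priority order.
# Parenthetical details are dropped; the shortened names are an assumption.
RECOVERY_PRIORITY = [
    "DNS",
    "Palo Alto Firewall",
    "Core network services",
    "Moodle",
    "www.carleton.edu",
    "Internet service",
    "Identity and directory services",
    "OneCard",
    "Advance",
    "OnBase",
    "Facilities Systems",
]


def recovery_order(affected: list[str]) -> list[str]:
    """Return the affected services sorted by their inventory priority."""
    rank = {name: position for position, name in enumerate(RECOVERY_PRIORITY)}
    unknown = [name for name in affected if name not in rank]
    if unknown:
        # A service outside the inventory has no defined priority.
        raise ValueError(f"not in the service inventory: {unknown}")
    return sorted(affected, key=lambda name: rank[name])
```

For example, `recovery_order(["OnBase", "DNS", "Moodle"])` places DNS first, then Moodle, then OnBase. Rejecting unknown names, rather than appending them at the end, is a design choice of this sketch so that a service without a recovery document is noticed early.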
Cloud-hosted services, such as the following, are not in scope for this policy. For these services, Carleton benefits from the vendors’ investment in data redundancy in multiple data centers across the country and from the economies of scale of their staff, who support the reliability and availability of their services for a large number of customers.
- Workday
- Slate
- Google Workspace
Capabilities and Gaps
To support our goal of information technology business continuity, Carleton has the following capabilities.
- Full block-level backups of all production systems
- Replicated on-site to a backup data center
- Database transaction logs for critical systems are replicated outside of block-level system backups
- Snapshots of all production systems on a rolling 7-hour basis
- Implied RPO of 1 hour for all systems using this snapshot rotation, if the outage is discovered within the rolling window.
- Individual business units need to account for reconciliation and validation of data across integrated systems in their plans.
- Redundancy and clustering where available
- Data center
- Fully redundant power, with unlimited generator run time given fuel availability
- OneCard-protected entry, limited to authorized personnel
- Some critical services (such as our main website) are hosted outside of our data center.
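The snapshot capability above implies a simple rule: the 1-hour RPO holds only when the outage is discovered inside the rolling 7-hour window. The sketch below encodes that stated condition. The `implied_rpo` function and its parameter names are hypothetical, and it assumes the 1-hour figure and 7-hour window apply uniformly to the affected system.

```python
from datetime import datetime, timedelta
from typing import Optional

ROLLING_WINDOW = timedelta(hours=7)  # snapshot retention stated in this plan
SNAPSHOT_RPO = timedelta(hours=1)    # implied RPO stated in this plan


def implied_rpo(outage_at: datetime, discovered_at: datetime) -> Optional[timedelta]:
    """Return the implied snapshot RPO, or None if the window was missed.

    None means snapshots can no longer be relied on and recovery would
    fall back to other capabilities, such as block-level backups.
    """
    delay = discovered_at - outage_at
    if delay < timedelta(0):
        raise ValueError("discovery cannot precede the outage")
    return SNAPSHOT_RPO if delay <= ROLLING_WINDOW else None
```

For an outage that began at 08:00 and was discovered at 11:00 the same day, the function returns a one-hour RPO; if discovery came eight hours later, it returns `None`.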
We also recognize the following gaps and document them here as understood and accepted.
- No true geographic data center diversity
- Backups are replicated to cloud backup sites as well as to a small backup data center in the Cassat basement.
- In the unlikely event of a complete loss of our primary data center, the RTO and RPO will not be met.
- No “live” backup data center for rapid recovery from major L3 outages.
- No backup data from cloud-hosted services
Recovery
At every incident level, the Incident Management Plan will be followed, including its clearly defined roles and responsibilities. The time to recovery, priority, and required personnel will be guided by this plan and its supporting documentation.
Off-site work capabilities
All ITS staff who would be engaged in recovery work can work remotely if the nature of the disaster requires it. In the event of a larger disaster that makes the data center and campus work locations unavailable, coordination with facilities and other business units would be required to procure temporary data center and work space.
Testing Schedule
- Backup recovery is tested at least annually, in connection with the annual audit
- Incident management plan is tested through annual tabletop exercises
- ITS staff participate in the campus emergency response training and testing
Review and Updates
This policy will be reviewed annually and updated as necessary.
Revision History
2025.06.02 – Initial revision – kgeorge, jscannell, dstephans