HIGH-AVAILABILITY SERVICES
Overview
This document establishes the considerations that must be addressed to provide effective high-availability services. It does not specifically address disaster recovery, which is a related but distinct topic.
At Information Services, we strive to deploy high-availability equipment using the methods below. In general, these come down to placing servers at two physical locations: the secure data center in room 180 of the Computing Center and the data center in the basement of Oregon Hall. High-availability services deployed by Information Services meet the following physical, network, systems, storage, application, and procedural requirements.
Physical
- 24x7 temperature- and humidity-controlled space with automated monitoring and alerting
- Access restricted and secured facility
- Multiple power feeds
- On-site Uninterruptible Power Supply (UPS) with power conditioning and monitoring
- Dual power feeds to each rack
- Automatic power transfer switch for site and all dependent infrastructure equipment (switches, monitors, KVMs, out-of-band management)
- Independent out-of-band management system to ensure remote access in case of in-band management failure (IPMI 2.0 compliant, centralized management)
- Centralized network operations center to coordinate emergency response and monitor performance
- Equipment should be sited at two physically separated locations that both meet the site requirements for high availability
Network
- Must have independent routes to multiple border routers (uonet1 and uonet2)
- Redundant paths to each border router
- Trunking of network ports to both physically separated sites
- Redundant pathways for any out-of-band management networks
- Allocation of fiber pairs through redundant and physically separated fiber pathways (north and south routes)
- Network infrastructure is monitored 24x7 with automated alerting and escalation
- Network incidents are reported to responsible parties for appropriate notification
- Network secured and monitored to detect intrusion or suspicious activity
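
The redundancy requirements above can be checked mechanically against a link inventory. The following is a minimal sketch, not a description of the actual campus topology: it removes one link at a time and reports any link whose loss would cut a site off from a border router. The node names `site_a`, `core_north`, and `core_south` are hypothetical; only `uonet1` and `uonet2` come from this document.

```python
from collections import deque


def reachable(links, start):
    """Return the set of nodes reachable from start over undirected links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_link_failures(links, source, borders):
    """Return the links whose loss cuts source off from any border router."""
    critical = []
    for index, link in enumerate(links):
        remaining = links[:index] + links[index + 1:]
        seen = reachable(remaining, source)
        if not all(border in seen for border in borders):
            critical.append(link)
    return critical


# Hypothetical topology: one site, two core switches, two border routers.
links = [
    ("site_a", "core_north"), ("site_a", "core_south"),
    ("core_north", "uonet1"), ("core_north", "uonet2"),
    ("core_south", "uonet1"), ("core_south", "uonet2"),
]
print(single_link_failures(links, "site_a", ["uonet1", "uonet2"]))
```

An empty result means no single link failure isolates the site from either border router. The check covers logical links only; whether two links share a physical fiber pathway has to be verified separately.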
Systems
- Systems allocated with redundant power supplies
- Systems allocated with redundant network interfaces bonded or trunked
- Systems allocated with multiple CPUs
- Systems allocated with Baseboard Management Controllers to preemptively notify of hardware anomalies
- Systems operation and errors reported 24x7 to centralized network operations center and escalated to on-call staff
- No single system provides a service (e.g. database server, storage server, application server); rather, every dependent service and function is abstracted from the underlying hardware
- Capacity is spread across servers located at separate sites, with 100% capacity available at each site to handle peak loads
- Systems performance monitoring implemented to establish performance trends and identify possible bottlenecks
- Systems should be rack-mountable and server rated to provide high uptimes
- Support contracts maintained to ensure appropriate level of vendor response in case of equipment failure
- Systems secured and monitored to detect intrusion and suspicious activity (e.g. filesystem integrity checking, Logwatch, COPS, AFICK, syslog)
- Complex configurations are tested on dedicated test servers before being applied to production
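
The capacity requirement above (100% of peak load available at each site) reduces to simple arithmetic. The sketch below is illustrative only: the site names, the capacity figures, and the choice of requests per second as the unit are all assumptions, not measurements from our environment.

```python
def site_can_carry_peak(site_capacity, peak_load):
    """Map each site to True if that site alone can serve the peak load."""
    return {site: capacity >= peak_load
            for site, capacity in site_capacity.items()}


def survives_single_site_loss(site_capacity, peak_load):
    """Return True if the remaining sites cover peak load after any one site fails."""
    total = sum(site_capacity.values())
    return all(total - capacity >= peak_load
               for capacity in site_capacity.values())


# Hypothetical figures, in requests per second.
capacity = {"computing_center": 1200, "oregon_hall": 900}
peak = 1000

print(site_can_carry_peak(capacity, peak))
print(survives_single_site_loss(capacity, peak))
```

With these assumed numbers the second site falls short of peak load on its own, so losing the first site would leave the service under-provisioned.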
Storage
- Storage hardware is physically located at more than one location
- Production-critical data is stored on enterprise-class storage hardware (SAN or NAS) only
- Storage data is block-level copied as close to real-time as possible to more than one location (synchronous writes preferred)
- Data should be regularly backed up using "hot" (online) backup techniques whenever possible
- Backups should be tested routinely
- Storage hardware has the capacity to handle peak I/O operations and throughput requirements at more than one location
- Storage data employs RAID systems to protect from single disk failures
- Storage systems employ hot spares and hot swap disks to recover from disk failures without downtime
- Storage controllers notify of disk malfunctions in an automated manner
- Storage system integrity is monitored by system controller and reported on a scheduled basis for review
- Storage systems performance monitoring implemented to establish performance trends and identify possible bottlenecks
- Support contracts maintained to ensure appropriate level of vendor response in case of equipment failure
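
One way to test backups routinely is to restore them to a scratch location and compare checksums against the source. The following is a minimal sketch under that assumption; it is not tied to any particular backup product, and the function names are hypothetical. For "hot" backups of data that is still changing, the comparison should be made against a snapshot taken at backup time rather than the live tree.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored tree."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if (not restored_file.is_file()
                or sha256_of(source_file) != sha256_of(restored_file)):
            problems.append(relative.as_posix())
    return problems
```

An empty list from `verify_restore` indicates that every file in the source tree was restored with identical content. It does not detect extra files in the restored tree.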
Application
- Software configuration decoupled from underlying systems hardware
- Application layer content switching employed to allow for load balancing and automated fail-over capabilities (LVS, F5 Content Switches, Load balancers)
- Storage technology employed to keep data writes in sync at multiple locations (e.g. Metro-cluster, Oracle Data Guard and ASM mirroring, NFS)
- Application changes thoroughly tested in a suitable test environment before being pushed into production
- Systems on supported OS with automated patching and updating
- Changes to OS configuration are made in a change management environment to ensure oversight and back-out plan
- Applications are secured and audited using common security and auditing tools (e.g. tcpdump, nmap, iptables, wireshark, snort)
- Application performance is routinely monitored and problems remedied quickly
- Server platform and configuration should follow application vendor recommendations
- Online or "rolling" upgrade techniques should be used whenever the application allows it
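
The load-balancing and automated fail-over behavior described above can be illustrated with a small sketch. This is not how LVS or an F5 content switch is implemented; it only shows the idea of rotating requests across backends while skipping any that a health check has marked down. The class name `FailoverPool` and the backend names are hypothetical.

```python
from itertools import cycle


class FailoverPool:
    """Round-robin selection over backends, skipping unhealthy ones."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._order = cycle(self._backends)

    def mark(self, backend, healthy):
        """Record the result of a health check for one backend."""
        if healthy:
            self._healthy.add(backend)
        else:
            self._healthy.discard(backend)

    def next_backend(self):
        """Return the next healthy backend, or raise if none remain."""
        for _ in range(len(self._backends)):
            candidate = next(self._order)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


# Hypothetical backends, one per site.
pool = FailoverPool(["app-cc", "app-oh"])
pool.mark("app-oh", healthy=False)   # simulated failed health check
print(pool.next_backend())           # traffic goes to the surviving backend
```

The same rotation supports "rolling" upgrades: a backend is marked unhealthy, upgraded, verified, and marked healthy again before the next one is taken out of service.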
Procedures
- Change management procedures ensure that changes are made at appropriate times with professional oversight
- Configurations are maintained centrally in a backed up repository
- Installation and configuration procedures are documented and are accessible to support staff
- Emergency procedures are documented and accessible to support staff
- Notification network is maintained and actively tested to ensure that notifications reach the appropriate parties
- Resource life cycle management ensures that obsolete equipment is replaced in a timely manner to ensure system reliability
- Standardization on commodity equipment ensures that failed equipment can be easily replaced
- Administrative access should be controlled and provided on an "as needed" basis