The FEMA Incident Command System (ICS) is used to respond to wide-area disasters such as earthquakes, fires, floods, hurricanes, and tornadoes, while ITIL is used for digital services and applications. Large organizations typically have both a facilities team and a data center team. FEMA's approach is associated with the facilities team, and ITIL with the smaller data center team. What characteristics do the two share, and what are the main differences?
We can learn from FEMA, and augment our digital incident management programs, by studying the extensive incident response program it has built for large-scale disasters.
Over the next few posts, we will share articles that compare the two systems, including terminology, communication, resource management, and more. Today's first installment compares goals.
ITSM Goals vs FEMA Goals
According to ITIL (Information Technology Infrastructure Library), "the incident management process ensures that normal service operation is restored as quickly as possible, and the business impact is minimized."
The Federal Emergency Management Agency (FEMA) specifies that the Incident Command System (ICS) “is a management system designed to enable effective and efficient domestic incident management by integrating a combination of facilities, equipment, personnel, procedures, and communications operating within a common organizational structure.”
FEMA's primary purpose is to coordinate the response to a wide area disaster that has occurred in the United States and that overwhelms the resources of local and state authorities. FEMA's 2023 budget is $29.5 billion, dwarfing any data center incident management program.
Businesses use ITIL incident management to support their business goals. They design their response to satisfy clients first and to fix servers and applications second. Client satisfaction, not system health, is the dominant key performance indicator. Why does this matter? It determines what you do first when responding to an incident.
If you have a load-balanced pool of five client-facing web servers and one of them fails, what do you do first? The temptation is to restart it, assuming that corrects the problem. But it is better to remove it from the server pool immediately, which restores the client experience sooner than a restart would. Fix the server later.
FEMA does the equivalent by prioritizing the safety of people over buildings and property. If a building is on fire, which comes first: extinguishing the fire or extracting the people?
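To make the web server example concrete, here is a minimal Python sketch of the "remove from the pool first, repair later" ordering. The `ServerPool` class, its method names, and the server names are hypothetical stand-ins for whatever your load balancer actually exposes. They are assumptions for illustration, not a real load balancer API.

```python
from dataclasses import dataclass, field


@dataclass
class ServerPool:
    """Hypothetical in-memory stand-in for a load balancer's server pool."""

    active: set[str]
    repair_queue: list[str] = field(default_factory=list)

    def handle_failure(self, server: str) -> None:
        # Step 1: restore the client experience by taking the failed
        # server out of rotation so no further requests reach it.
        self.active.discard(server)
        # Step 2: defer the technical fix (restart, root cause) until later.
        if server not in self.repair_queue:
            self.repair_queue.append(server)

    def return_to_service(self, server: str) -> None:
        # Called only after the server has been repaired and verified.
        self.repair_queue.remove(server)
        self.active.add(server)


# Assumed example: five client-facing web servers, one of which fails.
pool = ServerPool(active={"web1", "web2", "web3", "web4", "web5"})
pool.handle_failure("web3")
print(sorted(pool.active))  # ['web1', 'web2', 'web4', 'web5']
print(pool.repair_queue)    # ['web3']
```

The point of the sketch is the order of operations inside `handle_failure`: clients stop seeing the failed server immediately, and the restart happens on the team's schedule rather than the client's.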
Don’t let competing priorities derail client satisfaction. Below are examples of action pairs in which both actions need to be done, but the order matters.
Incident Response Action Pairs in Priority Order
- Failed database application
  - Remove from pool or activate hot standby
  - Correct the problem [kill blocking data archiving job]
- Failed ISP
  - Switch to backup ISP
  - Work with ISP until service is restored
- Failed application patch
  - Revert patch change
  - Correct patch code and reinstall later
- Expired security certificate installed on server
  - Route traffic away from impacted server to server with valid certificate
  - Update certificate
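The action pairs above could also be captured as a small runbook so that the order is explicit rather than left to judgment in the moment. This is only a sketch: the `ActionPair` type, the `RUNBOOK` dictionary, and the `plan` function are hypothetical names invented for illustration, and the entries simply restate the list above.

```python
from typing import NamedTuple


class ActionPair(NamedTuple):
    """Two actions that both must happen, in this order."""

    restore: str  # done first: restores the client experience
    repair: str   # done second: fixes the underlying technical problem


# Hypothetical runbook keyed by incident type, taken from the list above.
RUNBOOK = {
    "failed database application": ActionPair(
        "Remove from pool or activate hot standby",
        "Correct the problem [kill blocking data archiving job]",
    ),
    "failed ISP": ActionPair(
        "Switch to backup ISP",
        "Work with ISP until service is restored",
    ),
    "failed application patch": ActionPair(
        "Revert patch change",
        "Correct patch code and reinstall later",
    ),
    "expired security certificate": ActionPair(
        "Route traffic away from impacted server to server with valid certificate",
        "Update certificate",
    ),
}


def plan(incident: str) -> list[str]:
    """Return the actions for an incident, client-restoring action first."""
    pair = RUNBOOK[incident]
    return [pair.restore, pair.repair]


for step_number, action in enumerate(plan("failed ISP"), start=1):
    print(f"{step_number}. {action}")
```

Naming the fields `restore` and `repair` keeps the priority visible in the data itself, so a responder never has to decide the order under pressure.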
The race is to take the action that restores the client experience to normal the quickest. I’ve done this to the chagrin of a team who then had to work all weekend to fix the actual technical error, which was no fun. During incidents, client experience is prioritized above employee experience, and this determines the order of the actions in the situations above. I would suggest doing likewise.
A special thanks to Alain Trottier, guest writer, for sharing his thoughts today. You can find him here: www.linkedin.com/in/alaintrottier