Many good sources describe how to manage a critical incident. One of them is the chapter on incident management in Google’s book, “Site Reliability Engineering”, in which the folks at Google present their approach to a well-designed critical incident management process.
1. Recursive Separation of Responsibilities: “A clear separation of responsibilities allows individuals more autonomy than they might otherwise have, since they need not second-guess their colleagues.”
This entails making sure that everyone knows their role and responsibilities, and that there are no overlaps or incongruities between them. Among the distinct roles that should be assigned are:
Incident Command: The person charged with the high-level overview of managing the incident. This function assigns responsibilities according to needs and priorities and orchestrates the entire process.
Operations: The person responsible for applying operational tools, and the only team member who should make the system modifications required for resolution.
Communication: “The Public Face” of the incident, responsible for issuing periodic updates and ensuring that the incident document is up to date.
Planning: The person who supports Ops by taking care of long-term issues, filing system bugs, arranging hand-offs, and documenting what went wrong with the systems as well as how the wrongs were righted.
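To make the "no overlaps" idea concrete, here is a minimal sketch of a role registry that allows exactly one holder per role and lets only the Operations holder change the system. The class and method names (`RoleAssignments`, `may_modify_system`, and so on) are hypothetical and chosen for illustration; they do not come from Google's book or from any particular tool.

```python
from enum import Enum


class Role(Enum):
    INCIDENT_COMMAND = "Incident Command"
    OPERATIONS = "Operations"
    COMMUNICATION = "Communication"
    PLANNING = "Planning"


class RoleAssignments:
    """Tracks who holds each incident role, one person per role (illustrative only)."""

    def __init__(self):
        self._holders = {}

    def assign(self, role, person):
        # Refuse to overwrite an existing holder so responsibilities never overlap.
        if role in self._holders:
            raise ValueError(f"{role.value} is already held by {self._holders[role]}")
        self._holders[role] = person

    def holder(self, role):
        # Returns None when the role has not been assigned yet.
        return self._holders.get(role)

    def unassigned(self):
        # Roles nobody holds yet: gaps the incident commander still has to fill.
        return [role for role in Role if role not in self._holders]

    def may_modify_system(self, person):
        # Only the Operations holder should change the system during the incident.
        return person is not None and self._holders.get(Role.OPERATIONS) == person


roles = RoleAssignments()
roles.assign(Role.INCIDENT_COMMAND, "alice")
roles.assign(Role.OPERATIONS, "bob")
print([role.value for role in roles.unassigned()])
print(roles.may_modify_system("alice"))
```

Rejecting a second assignment to an occupied role, rather than silently replacing the holder, is a design choice made here so that any change of responsibility has to be an explicit step.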
2. A Shared Live Incident State Document: “This can live in a wiki, but should ideally be editable by several people concurrently.”
Among the tips provided are to keep the most important information at the top of the document and, of course, to retain the document for post-mortem analysis.
3. A Clear, Live Hand-Off
When the incident commander has to go to sleep, and they eventually will need some shut-eye, a clear and accurate update must be provided to the stand-in. The hand-off should also be clearly communicated to the rest of the team, so they know whom to contact when needed (and so they don’t wake the incident commander, who is catching up on rest that is generally desperately needed when coordinating incident resolution).
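The hand-off described above can be sketched as a single function that refuses to proceed without a status update, records the change in the incident timeline, and produces a notice for every team member. This is only an illustration: the `Incident` class, the `hand_off` function, and the explicit `acknowledged` flag are assumptions made for the example, not something the text above prescribes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    commander: str
    team: list
    timeline: list = field(default_factory=list)

    def log(self, message):
        # Timestamp every entry so the timeline can be reused in the post-mortem.
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.timeline.append(f"{stamp} {message}")


def hand_off(incident, incoming, status_update, acknowledged):
    """Transfer command and return the notice each team member should receive."""
    if not status_update.strip():
        raise ValueError("a hand-off needs a status update for the incoming commander")
    # Assumption: the incoming commander must explicitly accept the role.
    if not acknowledged:
        raise ValueError(f"{incoming} has not acknowledged the hand-off")
    outgoing = incident.commander
    incident.commander = incoming
    incident.log(f"Command handed from {outgoing} to {incoming}: {status_update}")
    notice = f"{incoming} is now incident commander; contact them instead of {outgoing}."
    return {member: notice for member in incident.team}


incident = Incident(commander="alice", team=["bob", "carol"])
notices = hand_off(incident, "dave", "Database failover in progress", acknowledged=True)
for member, text in notices.items():
    print(member, "<-", text)
```

Returning the notices instead of sending them keeps the sketch independent of any particular chat or paging system; a real setup would pass them to whatever channel the team already uses.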
4. A Recognized Command Post: “Interested parties need to understand where they can interact with the incident commander.”
The ideal is to centralize the incident team in a designated physical situation room, though in practice this rarely happens. There are always individuals located at other sites and in other time zones. Then there are external stakeholders such as vendors and consultants. Not to mention that there are always those who prefer to work at their desk and communicate via email.
Needless to say, a “recognized command post” is certainly a best practice, but not one that can be easily facilitated... Until now.
Among the unique capabilities that the Exigence platform provides to incident commanders and teams is the ability to gather all stakeholders in a virtual situation room, regardless of how distributed the team is, whether members are inside or outside the organization, and regardless of their geography and time zone.
In the Exigence Situation Room, a virtual war room, the team can execute the full scope of Google’s best practices:
Clearly assigning roles and responsibilities
Keeping a live incident document in the form of a timeline
Executing a clear hand-off that is immediately communicated to all relevant parties
Providing unprecedented clarity on the who, what, when, and where of the situation
But it doesn’t end there.
The Exigence Critical Incident Management Platform also automates task assignment and the delivery of updates. In fact, the platform manages the complete incident workflow, from alert to post-mortem, and enables full command and control of every critical incident, turning an unstructured situation into one that is structured, clear, and easy to manage.
We invite you to reach out to us with any questions at info@exigence.com.