No matter how you design your architecture or what technologies you implement, critical incidents will happen. When things go wrong, it is easy to get carried away and forget about the bigger picture. But your work isn’t done after you fix the immediate problem; now is the time to take a look at how the incident actually happened so that you can learn from it.
Postmortem documentation is therefore an essential piece of the puzzle for handling future incidents. While formats vary between organizations, postmortem analysis determines the how, the what, the who, the when, and the why of the incident for future reference and pattern discovery. It is an essential tool that makes learning from past incidents possible, and in a broader sense, postmortems empower organizations to learn from failure.
When critical incidents happen, you have one mission: to mitigate the impact on your users as quickly as humanly possible. Postmortem documentation should be drafted as soon as possible after the incident has been handled and the issue is no longer impacting the user experience.
Postmortem analysis drives focus on continuous improvement, instills a culture of learning, and identifies areas for improvement that might otherwise be ignored. By documenting your incident response processes, you ensure efficiency and consistency the next time an incident happens.
According to the Google SRE handbook, "When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place."
For postmortems to work, companies need to embrace failure. Avoiding finger-pointing for outages and accidents is part and parcel of that attitude; the focus must remain on fixing broken processes, not placing blame.
The analysis must dive deeply into the who, what, when, where, why, and how of an incident. Responders must give a detailed account of their actions, timelines, assumptions, expectations, and understanding of the root causes of the incident. It is crucial to start the work as soon as possible after the incident is handled, while the events are still fresh in mind.
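To make the list above concrete, here is a minimal sketch of how a postmortem record could be structured in Python. The class and field names (`Postmortem`, `TimelineEntry`, `missing_sections`) are hypothetical and only illustrate one possible layout; they are not a standard format or part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEntry:
    """One responder action, with the assumption behind it (hypothetical layout)."""
    at: datetime
    actor: str
    action: str
    assumption: str = ""


@dataclass
class Postmortem:
    """Illustrative record covering the who, what, when, and why of an incident."""
    title: str
    detected_at: datetime
    resolved_at: datetime
    timeline: list[TimelineEntry] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)

    def duration(self) -> timedelta:
        # Time between detection and resolution.
        return self.resolved_at - self.detected_at

    def missing_sections(self) -> list[str]:
        # Names of the sections that are still empty.
        missing = []
        if not self.timeline:
            missing.append("timeline")
        if not self.root_causes:
            missing.append("root_causes")
        return missing


# Example values are made up for illustration.
draft = Postmortem(
    title="Checkout errors",
    detected_at=datetime(2024, 1, 1, 10, 0),
    resolved_at=datetime(2024, 1, 1, 10, 45),
    timeline=[
        TimelineEntry(
            at=datetime(2024, 1, 1, 10, 5),
            actor="on-call engineer",
            action="Rolled back the latest deploy",
            assumption="The deploy caused the errors",
        )
    ],
)
```

A check such as `draft.missing_sections()` gives the team a simple way to see which parts of the account still need to be written while the events are fresh.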
Postmortems are a collaborative effort that thrives on feedback. From the engineering department to the business side, there are a lot of moving parts involved when handling a breach or a critical incident. Routine postmortem analysis ensures that every person in your organization is laser-focused on learning from every single critical incident that hits.
After the threat has been contained and neutralized, all the parties involved must contribute to the analysis to improve the end-to-end incident response pipeline.
Collect all tickets, logs, timelines, reports, and other relevant notes before the postmortem session. No detail is too small when it comes to figuring out the root cause of a critical incident.
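As a rough illustration of pulling those sources together, the sketch below merges events from several inputs into one chronological timeline. The `Event` type, the `build_timeline` helper, and the sample data are all hypothetical; real tickets and logs would need their own parsing first.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    """A single timestamped note from any source (hypothetical shape)."""
    at: datetime
    source: str
    note: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    # Flatten every source into one list, then order it by timestamp.
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: event.at)


# Sample data invented for illustration only.
tickets = [
    Event(datetime.fromisoformat("2024-01-01T10:05:00"), "ticket",
          "Customer reports checkout errors"),
]
logs = [
    Event(datetime.fromisoformat("2024-01-01T10:01:00"), "log",
          "Error rate rises on payment service"),
    Event(datetime.fromisoformat("2024-01-01T10:20:00"), "log",
          "Error rate back to normal"),
]

timeline = build_timeline(tickets, logs)
for event in timeline:
    print(event.at.isoformat(), event.source, event.note)
```

Keeping the source name on every event makes it easier to trace a line in the timeline back to the original ticket or log during the session.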
Once the postmortem analysis is complete, share it with as many people as possible. Put clear procedures in place that will help your team recognize and act upon the established corrective and preventive actions the next time a similar incident hits. Train your employees to recognize this type of incident or breach so that they can spot it if it happens again. In the end, failing to understand the causes that contributed to the incident all but guarantees that it will repeat.
Postmortems are the secret sauce that turns incidents into learning experiences. Instead of dwelling on individual failures, focus on creating a culture of continuous learning. But that's easier said than done. Some of the most common pitfalls are:
When conducted properly, postmortems create the right mindset. They have the potential to make the company as a whole more resilient and efficient.
While fully analyzing a critical incident after it has been solved can be tedious, a fully integrated, holistic platform that documents the incident and enhances collaboration between all the parties involved can prepare the organization for what lies ahead. Failure isn't a disaster; it's a learning opportunity for the whole company and should be treated as such.
The Exigence platform provides just that. It coordinates all stakeholders and systems, simplifies the postmortem, and applies lessons learned so that the next response goes even better. For more information and to schedule a demo, contact us.