According to a recent report by IBM, the damage caused by major IT incidents is greater than ever. An incident that results from a data breach will cost the organization an average of $3.86 million, with the average time to breach containment coming in at 280 days!
And according to the ITIC, hourly downtime costs come in at over $300,000, with some reaching $1 million per hour.
Clearly, reining in incidents, resolving them as quickly and efficiently as possible, and learning from past mistakes to optimize the resolution of future events are top priorities for anyone whose day-to-day work involves major incidents.
Insights from the ITIL
Aligning with industry standards for efficient resolution has long been the strategy in focus for addressing this objective, with the ITIL serving as the preferred source for insightful methodologies and processes.
When it comes specifically to “learning from past mistakes,” nothing serves up knowledge and insights better than the right incident report.
It facilitates the incident review, including how the incident unfolded and how well (or not so well) the processes were executed, as well as the post-mortem, root cause analysis, and risk mitigation for future incidents.
According to the ITIL, the incident report should explain the following:
- What was the incident about?
- When did it occur?
- Where did it occur?
- How much time did it take to resolve?
- Who resolved it?
- Who was involved in handling the incident?
- What troubleshooting steps were taken?
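The questions above map naturally onto a simple record. Here is a minimal sketch in Python; the class and field names (`IncidentReport`, `summary`, `occurred_at`, and so on) and the sample data are illustrative assumptions, not an ITIL-defined schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentReport:
    """Core incident-report fields; every name here is an illustrative assumption."""
    summary: str            # What was the incident about?
    occurred_at: datetime   # When did it occur?
    location: str           # Where did it occur?
    resolved_at: datetime   # Used to derive the time to resolve
    resolved_by: str        # Who resolved it?
    handlers: List[str] = field(default_factory=list)               # Who was involved?
    troubleshooting_steps: List[str] = field(default_factory=list)  # What steps were taken?

    @property
    def time_to_resolve(self) -> timedelta:
        """How much time did it take to resolve?"""
        return self.resolved_at - self.occurred_at


# Made-up example data, for illustration only.
report = IncidentReport(
    summary="Checkout API returning HTTP 500",
    occurred_at=datetime(2021, 3, 1, 9, 0),
    location="eu-west payment cluster",
    resolved_at=datetime(2021, 3, 1, 11, 30),
    resolved_by="payments on-call engineer",
    handlers=["service desk", "database team"],
    troubleshooting_steps=["restarted API pods", "rolled back last deploy"],
)
print(report.time_to_resolve)  # 2:30:00
```

Deriving the resolution time from two timestamps, rather than storing it separately, keeps the record from contradicting itself.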
Answering just these questions, though, is not enough. The report should do more than establish the ‘what,’ ‘when,’ and ‘who.’
It should cover a much broader set of incident parameters, as follows:
This part of the report provides a holistic overview of all the incident parameters that require analysis. These are needed for the team to arrive at conclusions that will enable it to optimize resolution for forthcoming incidents.
Among these parameters are:
- Which services were impacted, and which related services were not?
- What were the symptoms, including errors and their impact on performance?
- What is the baseline state of performance, and what was the delta during the incident?
- Which geographies and time windows were affected?
- How consistent was the interruption?
- How did the impact on affected processes correlate with the impact on other business processes?
- Which escalation steps were taken?
- Which steps proved helpful toward a speedier, more efficient resolution?
- What documentation was created to support major incident management, and with whom was it shared?
- Which actions are mandatory for restoring specific services during such an incident?
- What were the costs involved with these actions?
For each of these parameters, it is also important to note which stakeholders were involved.
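The baseline-versus-incident comparison in the list above comes down to simple arithmetic. A hedged sketch, assuming a hypothetical response-time metric and made-up sample values (the function name `performance_delta` is also just an illustration):

```python
from statistics import mean


def performance_delta(baseline_samples, incident_samples):
    """Return (baseline, incident, absolute delta, percent change) for one metric."""
    baseline = mean(baseline_samples)
    incident = mean(incident_samples)
    delta = incident - baseline
    percent = delta / baseline * 100
    return baseline, incident, delta, percent


# Made-up response times in milliseconds, for illustration only.
baseline_ms = [200, 210, 190, 200]
incident_ms = [800, 780, 820, 800]

baseline, incident, delta, percent = performance_delta(baseline_ms, incident_ms)
print(f"baseline={baseline} ms, incident={incident} ms, delta={delta} ms ({percent:.0f}%)")
```

Recording both the absolute delta and the percent change makes incidents on fast and slow services easier to compare.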
One of the most critical pillars on which the success or failure of major incident management rests is communication.
Accordingly, it is mandatory to document and report the effectiveness of each of the communication channels that were involved throughout the incident lifecycle. Examples include email, conference calls, and Slack.
Moreover, it must also be noted which stakeholders were or were not available, along with the steps that were taken to reach them (and whether or not those steps were successful).
This is key to understanding how to ensure seamless communication, which is one of the core capabilities required for accelerating resolution and learning for future optimization.
It is not enough to note ‘who resolved’ and ‘who was involved.’ The report should also record which critical roles were actively involved and who actually fulfilled each of them, including:
- Service desk, typically the first to be made aware of a potential or actual major incident. They are one of the main points of contact for supporting affected users.
- Technical resolution groups, who bring the essential technical skills, knowledge, and tools required for implementing a resolution to the major incident.
- Technical lead, the senior technical stakeholder who reports to the overall major incident manager. They are responsible for centralizing and controlling technical analyses, fixes, and other related efforts.
- Service continuity manager, who owns service continuity and may invoke a disaster recovery workflow when the service outage cannot be recovered in a timely fashion.
- Service manager/director, who liaises with customer service and PR to ensure that all the proper steps are being taken to avoid any damage to customer satisfaction.
- Change manager, responsible for ensuring that standardized methods and procedures are adhered to when making changes to the IT infrastructure.
- Major incident manager, the stakeholder responsible for overall, end-to-end major incident management in the organization.
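One way to make role coverage checkable is to list the roles above explicitly and compare them against who was actually recorded for an incident. A minimal sketch in Python; the `Role` enum, the `unfilled_roles` helper, and the names in the sample data are hypothetical:

```python
from enum import Enum


class Role(Enum):
    """The critical roles named above."""
    SERVICE_DESK = "service desk"
    TECHNICAL_RESOLUTION_GROUP = "technical resolution group"
    TECHNICAL_LEAD = "technical lead"
    SERVICE_CONTINUITY_MANAGER = "service continuity manager"
    SERVICE_MANAGER = "service manager/director"
    CHANGE_MANAGER = "change manager"
    MAJOR_INCIDENT_MANAGER = "major incident manager"


def unfilled_roles(assignments):
    """Return the roles that nobody fulfilled, in declaration order."""
    return [role for role in Role if not assignments.get(role)]


# Hypothetical assignments recorded for one incident.
assignments = {
    Role.SERVICE_DESK: ["A. Analyst"],
    Role.TECHNICAL_LEAD: ["B. Engineer"],
    Role.MAJOR_INCIDENT_MANAGER: ["C. Manager"],
}

missing = unfilled_roles(assignments)
print([role.value for role in missing])
```

A role mapped to an empty list counts as unfilled, so "we had the slot but nobody took it" shows up in the report as well.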
Initial Call, Rollcall
It can often be very difficult to get every incident stakeholder on board and aligned in a timely manner. This is especially true when individuals are spread across multiple geographies and time zones.
However, few phases are more critical to incident management effectiveness than making sure that the right people are reached and briefed immediately. Only then can you ensure that they know exactly what they need to do to accelerate resolution.
This is why it is so important to document the initial call, including each participant, their team affiliation, and the details of their onboarding.
In this part of the report each meeting is documented, including tactical information such as meeting date, meeting name, participants, key findings, and action items assigned (and to whom).
Doing so makes it possible to determine whether all the right people participated in each meeting, whether action items were optimally assigned, and whether all incident stakeholders were coordinated in an efficient and organized manner.
This is the only way to determine whether the meetings, and not just the overall flow of incident management, were effective. Meeting notes are necessary to measure impact and performance for each phase in the resolution of the incident.
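The meeting details described above (date, name, participants, key findings, and assigned action items) can be captured in a small structure that also answers two of the review questions: who was missing, and which action items had no owner. This is a sketch only; `MeetingRecord`, `ActionItem`, and the sample values are assumptions made for illustration:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # None means nobody was assigned


@dataclass
class MeetingRecord:
    meeting_date: date
    name: str
    participants: List[str]
    key_findings: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def absent(self, expected):
        """Expected participants who did not attend this meeting."""
        return [p for p in expected if p not in self.participants]

    def unassigned_items(self):
        """Descriptions of action items that have no owner."""
        return [a.description for a in self.action_items if a.owner is None]


# Made-up example data, for illustration only.
meeting = MeetingRecord(
    meeting_date=date(2021, 3, 1),
    name="Initial call",
    participants=["major incident manager", "technical lead"],
    key_findings=["database failover did not trigger"],
    action_items=[
        ActionItem("force manual failover", owner="technical lead"),
        ActionItem("notify affected customers"),
    ],
)

print(meeting.absent(["major incident manager", "technical lead", "service desk"]))
print(meeting.unassigned_items())
```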
The single source of truth
With an accurate and comprehensive incident record that includes actionable takeaways, future collaboration will be much more effective among the various incident stakeholders, both within and outside the organization.
Such a report will serve as the single source of truth and will guide future processes from discovery to maintenance and KPI analysis.
While every organization and its IT department has its own priorities and processes, a robust major incident report can serve them all equally in their aim to ensure that SLAs are met and that potential risk is mitigated.
Exigence presents: Your go-to incident report template
This is why we put together a go-to incident report template, designed to be comprehensive enough to address the diverse needs of any organization.
It includes clear instructions on how to provide detailed information regarding multiple considerations in each of the above-noted incident categories.
This way, by tracking and documenting the full scope of the incident, organizations can learn, understand, and optimize.
In addition, if you’re looking to see how the reporting process can be taken to a whole new level through automation, we invite you to set up a brief demo of the Exigence platform. With it, you can create the ever-strategic major incident report with just the click of a button.
To book the demo click here.