Measuring Site Reliability Maturity
So, your company has heard of Site Reliability Engineering and thinks it would be highly beneficial to do “all of the SRE things”. But where on earth should you begin in such an endeavour and, of equal importance, how do you measure your progress?
This article provides an overview of the Site Reliability Engineering Maturity rating system that the British Telecom SRE team formulated. We wanted to rate our own performance in a frank and honest way to gain insight into how to plot our path to SRE elite status. This system really helped focus the team on our SRE goals and aided us in understanding where we were on the journey of reaching those goals. Disclaimer: The author is based in the UK and so apologises in advance for British spelling!
I am assuming you are reading this article because one or more of the following may ring true for you:
- You are wondering how best to lay out your SRE roadmap.
- You may have a new SRE team that is a little green around the gills.
- You may have limited spare SRE resources or budget.
- You probably have a number of burgeoning cloud accounts to support.
- You have an army of developers clawing at the cloud door, eager to experiment with cloud solutions. You would like them to be able to self-serve in a secure and cost-efficient manner.
If so, you find yourself in a situation similar to the one our SRE team at BT faced several months ago. We had the added complication of setting up our SRE function at the beginning of the Covid pandemic, two weeks before the first lockdown was imposed, to be precise. Since then, the team has not been in the office together, so we needed a way to ensure we stayed focused on our team goals.
You have a Shiny New SRE team… Now what?
We had to decide how our SRE team could add value quickly, and we had to devise some standards that would allow developers to work at pace in the cloud, but… in a safe and cost-efficient way.
Firstly, we felt it was important to define what SRE meant to us, and identify the problems we wanted it to help address in our organisation. And, like its sibling term ‘DevOps,’ the definition of ‘Site Reliability Engineering’ can become a little pliable. While, as we all know, Google literally wrote the book on SRE, we felt it should not be viewed as a bible; our SRE practices must be specific to our organisation. SRE is not a one-size-fits-all discipline and we wanted it to be intimately aligned to our ways of working and the systems we are building.
As it was still early days for our SRE team, we decided to run a maturity report on ourselves every 6 months to help us understand the direction we needed to travel in and start formulating some plans of action. The findings of these maturity reports help to shape and provide input to the SRE roadmap. In fact, the outputs of the report come in the form of Epics which feed into our 12-month roadmap view.
By adapting this assessment to your own organisational needs, I hope you will get a better understanding of whereabouts on the SRE path your team or organisation stands. This will shed some light on what your next move needs to be, so SRE can be a force in improving your organisational capabilities and performance.
The Maturity Report
Remember… being honest with oneself is critical in this exercise. Don’t be concerned about achieving a low score. Self-reflection and impartiality are our tools for achieving an accurate view of where we are and the path to where we want to be.
We broke the maturity review down into two SRE activity types: SRE activities specific to our organisation and “traditional” SRE activities.
SRE Activities Specific to the Organisation
These scores demonstrate a maturity level for the organisation-specific goals of the SRE team. You may not see these activities in the Google SRE handbook, but they are responsibilities the SRE team believes will add value for the business. For us at BT, we felt these goals would also provide a solid foundation for adopting more traditional SRE responsibilities in future. The organisational activities shown here are examples, and should be amended to reflect your own needs.
Traditional SRE Activities
This score demonstrates your organisation’s maturity level measured against traditional SRE goals and activities. Again, this is not an exhaustive list by any means, and not all of the activities may relate to the SRE plans for your organisation.
The tables below show how the two SRE activity types (organisational and traditional) can be scored (please note this data is an example and any resemblance to an organisation living or deceased is purely coincidental):
Ratings Explained
The following explains the method of SRE maturity scoring and how it translates into an overall SRE maturity score:
1. Not performed or planned
This activity does not take place and there are no plans to introduce it within the SRE backlog of work.
2. Inception phase (planned or being developed)
This activity is currently a work in progress or is in the planning phase i.e. there are stories in the backlog or current sprint.
3. Active / manual steps
This activity currently takes place but involves manual steps or is undertaken when the team “has time” (i.e. it is not to a planned schedule).
4. Active / automated
This activity currently takes place and is automated.
5. Continuous Improvement phase
This activity takes place and is subject to continuous improvement i.e. the activity is regularly reviewed for efficiency and reliability.
Overall Score
This is a blended score of organisational and traditional SRE goals and activities. This score helps us gauge where we are on our journey towards an “Elite” SRE maturity level. It also guides us on which areas require more focus.
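To make the blending concrete, here is a minimal sketch of how an overall score could be derived as a simple average of per-activity ratings across both categories. The activity names, the example ratings, and the equal weighting are all illustrative assumptions, not BT’s actual scoring sheet; your organisation may well weight certain activities more heavily.

```python
# Hypothetical sketch: activity names, ratings, and equal weighting
# are illustrative assumptions, not BT's real scoring data.

RATINGS = {
    1: "Not performed or planned",
    2: "Inception phase (planned or being developed)",
    3: "Active / manual steps",
    4: "Active / automated",
    5: "Continuous Improvement phase",
}

# Example scores per activity (each rated 1-5 on the scale above).
organisational = {"cloud account vending": 3, "cost guardrails": 2}
traditional = {"SLO monitoring": 2, "incident postmortems": 4}


def blended_score(*score_groups: dict) -> float:
    """Average every activity rating across all the given groups."""
    scores = [s for group in score_groups for s in group.values()]
    return round(sum(scores) / len(scores), 2)


overall = blended_score(organisational, traditional)
print(f"Overall SRE maturity: {overall} / 5")  # 2.75 for this example
```

A weighted variant (e.g. multiplying each activity’s rating by a business-value weight before averaging) would follow the same shape; the key point is that the blended number is only a compass heading, while the per-activity scores tell you where to focus.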
How to Summarise the Report?
That is the easy bit done. Now how can we summarise our maturity report and act upon our findings?
Hopefully your maturity level was pretty much what your team expected. The scoring will give you an insight into which SRE disciplines have not been touched at all yet, which require more work or investigation, and which would perhaps benefit from closer collaboration with other teams. A summary of the actions for these findings could be as follows. Again, these are purely examples:
- The activity types scoring one will need Epics raised to begin the process of addressing and evolving each area. Prioritisation will be based on the existing organisational SRE goals.
- The activity types scoring two will require additional stories to be raised to progress the activity to a higher maturity level. Prioritisation will be based on the existing organisational SRE goals.
- The SRE roadmap should be amended to reflect which activities should take priority over the coming 6 months. Senior management input should be sought to help determine which activities are of most value to the business at this time.
- If you have a low “traditional” SRE score, perhaps SRE training should be sought to help the team become more productive in terms of developing and implementing traditional SRE activities.
- There may be events such as game days / DR tests etc that need to be captured in an SRE calendar. Get these events mapped out in advance for the next six months.
- Select two high value activities to pursue immediately that relate to Observability. These activities will be essential for development teams to have a better understanding of how their apps are behaving in production. This might include:
- Ensuring development teams are provided with observability of production performance in real time.
- Ensuring development teams are provided with observability of production incidents.
- Etc…
At BT, we felt the best way to track and act upon our findings was to generate Epics for the upcoming 6 month period. We chose a score we wanted to reach at our next review and based our Epics upon the tasks we would need to undertake to achieve that score. I hope this helps you on your SRE journey. Please feel free to reach out to learn more!