Should production support always be separate from sprint execution?
When I was running engineering and operations, one of the things that we did was dedicate 20% of the developers' time to proactive repair maintenance for any of the user stories that come out of production incidents. We called them improvement sprints and that time was baked into their day-to-day work. That was the only way that we were able to reduce while increasing the reliability on our platform and this was aside from all the innovation.
It was exciting for a lot of my developers because they were happy to be squashing so many bugs that could potentially come out in production. So we do need to have two different teams but at the same time, we still want to bring that agility of the combined approach and have visibility into problems all the way up to development so they can come in and fix them proactively.
That's a very good point because the IT operations side won’t be in a position to go back in and do lots of development, so it will go back to the bug fixes that need to be done by the development side of the house. But I think the key point is that the combined approach is also an incentive for developers to get it right the first time; otherwise, they'll have to fix the mess they created.
It's okay to balance the two. What I'm proposing for now is separation because my team is not able to focus enough or have the discipline for 80/20. We're splitting the two but with the promise that we will have options to rotate in and out, since we're growing 50% plus year over year. So what I've said is, "As we grow and as you excel, you have an opportunity to go into the scrum teams, and then we’ll get new people in on the production support side." Some people just want to be in production support and that's fine too.
In an enterprise, if you're embracing something like the Scaled Agile Framework (SAFe) or LeSS, you have an improvement sprint that lets you pull in all those user stories and work through them methodically. But it becomes mechanical after a while.
People want to work on things that are exciting and actually bring delight. Gamifying it is something that our developers and operators really loved. Every quarter we used to have an entire day of games where we would pull the user stories, squash as many bugs as possible, and then celebrate over beer. So try to gamify production support; otherwise, it becomes a chore and people will not embrace the true spirit of why we're doing it. That's something I would caution.
It gets complicated after that, but here are a few things to consider in the answer.
If it is a question of developers wanting to push code and ops wanting to preserve stability, then you have a problem. This is what DevOps was trying to solve. This is not a good approach.
If you think ops takes too much time away from projects, that isn't a good answer either. All work is work. You want to work on the most valuable work. That may be ops or feature development. Creating a bucket based only on the source of the work means that you are expending resources on less valuable things, reducing the total value delivered.
BUT... unplanned work is very disruptive. Production support has a lot of false alarms and busy work.
1. I create a team to catch the noise. I try to automate the noise away, but until I do, I don't want a bunch of it to hit my development teams.
2. Once we get past the noise, if there is only a small number of incidents, then I can have the development team support it. I generally reserve capacity for this each sprint based on the expected amount, then have the product owner prioritize anything in excess.
For more complex environments....
3. For modern applications, I create an SRE team that is focused on the operational/scaling/instrumentation/automation side of things. Developers often aren't good with these things. This team starts by converting all of the operational decisions into business-facing metrics that the business will care about if they are failing. I use this to balance work between running down technical debt and feature development. Otherwise the SRE team focuses on toil reduction, automation, scalability, latency planning, etc. and can change the code to increase resilience.
4. The SRE and development teams are a single resource bucket. If resilience is below business requirements, resources shift to the SRE team until it is back on target. If resilience is above target, resources shift back to the feature team to accelerate development.
In both cases, though, the development team owns accountability for the resilience and product performance of their application. All people involved with development/operations are measured by speed, throughput, ops cost per unit of value, product stability, and customer and employee satisfaction. So we don't have different measurement systems.
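To make items 2 and 4 concrete, here is a rough sketch of how the two rules could be written down. This is only an illustration, not how any particular team implements it: the names `Allocation`, `rebalance`, and `plan_support_capacity`, the one-person `step`, the `min_sre` floor, and the use of story points are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Allocation:
    """Headcount split across the single SRE/feature resource bucket (hypothetical)."""
    sre: int
    feature: int


def rebalance(alloc: Allocation, resilience: float, target: float,
              step: int = 1, min_sre: int = 1) -> Allocation:
    """Item 4: shift people toward SRE when resilience is below the
    business target, and back toward feature work when it is above.

    `step` and `min_sre` are assumptions of this sketch, not part of the rule.
    """
    if resilience < target and alloc.feature > 0:
        moved = min(step, alloc.feature)
        return Allocation(alloc.sre + moved, alloc.feature - moved)
    if resilience > target and alloc.sre > min_sre:
        moved = min(step, alloc.sre - min_sre)
        return Allocation(alloc.sre - moved, alloc.feature + moved)
    return alloc  # on target, or nothing left to move


def plan_support_capacity(sprint_points: int, expected_incident_points: int,
                          actual_incident_points: int) -> Tuple[int, int, int]:
    """Item 2: reserve capacity for the expected incident load each sprint.

    Returns (feature_points, reserved_points, overflow_points). Overflow is
    the incident work beyond the reservation, which the product owner has
    to prioritize against feature work.
    """
    reserved = min(expected_incident_points, sprint_points)
    feature = sprint_points - reserved
    overflow = max(0, actual_incident_points - reserved)
    return feature, reserved, overflow
```

For example, with a 40-point sprint and 8 points expected for incidents, `plan_support_capacity(40, 8, 12)` reserves 8 points and leaves 4 points of overflow for the product owner to prioritize. The resilience figure passed to `rebalance` would be whatever business-facing metric the SRE team defined in item 3.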
So as you start to scale you have to separate that out. You have an IT operations group that runs your infrastructure, network, and applications on a day-to-day basis. You have your solution delivery piece that is all project-related. Apart from avoiding any disruption of projects, the key reason for doing that is actually different mindsets. The mindset of a group running IT operations is around metrics of uptime, number of incidents, number of problems solved—solving the root cause, the Pareto principle of common things that are happening, etc. The metrics of managing a project are around scope, budgets, timeline, etc. Because they're very different mindsets, if you have the two together then you have an inherent conflict.
I would agree with that. Developers are focused on pushing as many new features as possible, while the IT operations team wants to maintain stability, resiliency and scalability.