How can engineering management effectively address the escalating risks of system failure caused by growing system complexity and technical debt?
This may sound obvious, but consider reducing complexity. A great measure of software quality is: how easy is it to change? Sometimes engineers tend to build overly complex systems in anticipation of future scale or business requirements that never materialize. An important part of addressing technical debt is to recognize where this has happened and whether it's really needed.
Sometimes microservices are not worth the complexity; a simpler monolithic application that does not need multiple copies of data could be faster, more reliable, and easier to change. Over-normalization of a database can get in the way of efficiency. Too much code abstraction or code reuse can make changes harder or cause bugs elsewhere in the system. The simpler an application can be and still meet its requirements, the better.
Engineering management can mitigate the risk of system failure caused by growing complexity and the demand for quicker delivery by making the case to business leaders and clients that their products, processes, and services should be certified against recognized standards.
For example, the most rapidly growing risk area for many IT and telecom systems is cyber threats. As AI tools become more available, malicious actors will see their capabilities explode. It's very important for manufacturers and service providers to ensure their products and processes get certified against standards such as the ISO 27000 family or NIST SP 800-53. Certification against these standards requires independent auditors to verify that products and services are as safe as they can be. It keeps engineers and designers “honest” and can act as a marketing advantage with nervous clients who read daily about the latest DoS or ransomware attacks.
An effective way to address system complexity and technical debt is to first acknowledge and prioritize the issue. As in the story of the woodcutter who never stops to sharpen his axe, failing to take a step back to address system complexity increases the risk of system failure. To manage competing priorities, it's crucial to allocate some bandwidth for addressing technical debt in every sprint.
You'll need to understand your teams' ability to manage complexity and scale.
Some companies successfully manage size and complexity that would crush others.
You need at least some level of system observability, and you should keep track of your risks.
If your team is at all mature, you would likely be able to start with a survey asking what your largest risks are.
Perhaps you have only one person who understands critical areas, or who knows how to deploy or change critical infrastructure or services.
Ensure you have those risks covered.
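One way to make the "only one person" risk visible is to tabulate the survey answers. This is only a minimal sketch: the survey format, the names, and the `single_owner_areas` helper are all made up for illustration.

```python
from collections import defaultdict


def single_owner_areas(survey):
    """Return {area: person} for areas that only one person can operate.

    `survey` is a hypothetical format: it maps each person to the set of
    areas they say they can deploy or change.
    """
    owners = defaultdict(set)
    for person, areas in survey.items():
        for area in areas:
            owners[area].add(person)
    # Keep only the areas with exactly one knowledgeable person.
    return {
        area: next(iter(people))
        for area, people in sorted(owners.items())
        if len(people) == 1
    }


# Illustrative data, not taken from any real team.
survey = {
    "alice": {"billing", "deploy-pipeline"},
    "bob": {"billing"},
    "carol": {"search"},
}
risks = single_owner_areas(survey)
print(risks)
```

Anything that shows up in `risks` is a candidate for pairing, documentation, or a rotation.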
Test your response processes so the team knows how to work in high-stress situations (downtime, outages, disaster recovery).
Track delivery and reliability metrics to see whether you are getting slower over time (deployment frequency, change failure rate, downtime, etc.).
Also survey your team on their attitude and morale as an unhealthy technical system often leads to frustration in the team. Let them fix the things that cause the most frustration.
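To watch for the slowdown mentioned above, you could compute deployment frequency and change failure rate per month from a deployment log. The following is a rough sketch under assumptions: the `Deployment` record, the `monthly_metrics` helper, and the sample data are hypothetical, and in practice the data would come from your CI/CD or incident tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date
    failed: bool  # True if the change caused an incident or a rollback


def monthly_metrics(deployments):
    """Return {(year, month): (deploy_count, change_failure_rate)}."""
    buckets = defaultdict(list)
    for d in deployments:
        buckets[(d.day.year, d.day.month)].append(d)
    return {
        month: (len(ds), sum(d.failed for d in ds) / len(ds))
        for month, ds in sorted(buckets.items())
    }


# Illustrative history, not real data.
history = [
    Deployment(date(2024, 1, 3), False),
    Deployment(date(2024, 1, 10), False),
    Deployment(date(2024, 1, 17), True),
    Deployment(date(2024, 1, 24), False),
    Deployment(date(2024, 2, 7), True),
    Deployment(date(2024, 2, 21), False),
]
metrics = monthly_metrics(history)
for (year, month), (count, failure_rate) in metrics.items():
    print(f"{year}-{month:02d}: {count} deploys, {failure_rate:.0%} failed")
```

A falling deploy count together with a rising failure rate, as in this made-up sample, is the kind of trend worth raising with the team.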