As engineering leaders, we are ultimately responsible for the quality of our product, the most important factor in user happiness and business success. Feature delivery gets most of the regular attention, but the complexity of modern software means that even small performance deficiencies can have substantially larger impacts. That reality pushed my teams to persist through difficult architectural work, and the tightest scrutiny I experienced was often on team culture.
Leading a group of highly skilled engineers requires more than engineering competence; it also means guiding both systems and people to stay adaptable. Whether you are leading a large organization or gearing up to lead for the first time, scaling these behaviors is a challenge I remember vividly.
I have managed engineers building time-critical financial systems, where even brief downtime could have serious consequences. I want to share lessons that should be useful wherever your software's stability and performance are essential. This article presents them in four areas: building a reliability culture, navigating technical trade-offs, developing resilient teams, and sustaining progress.
Establishing reliability as a mindset in engineering leadership
I recall when we launched a new product globally and kept a close watch on our primary trading system. In the early days of the launch, we noticed increased latency in some of our services. An engineer who had only just joined the team did not hesitate to walk the incident responders through what had recently changed. That was not random luck; it was the consequence of months of culture-building and collective practice.
Reliability is not a one-off event; I learned that it has to be part of the culture and constantly modeled by the leader. For us, that meant:
Making performance visible
Visibility mattered to us: we put our primary metrics (p95 and p99 latency, error rates, and SLO attainment) on team dashboards. We reviewed performance and reliability trends together in standups and team meetings, not just after an incident. Making reliability goals explicit and visible made them real.
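As a rough illustration of the kind of check behind those dashboards, here is a minimal sketch, with made-up numbers and a hypothetical 250 ms p99 target, that computes latency percentiles from raw request durations and compares them to an SLO threshold:

```python
# Minimal sketch: compute p95/p99 latency from raw request durations
# and compare them to a hypothetical SLO threshold.
import statistics

def latency_percentiles(durations_ms: list[float]) -> dict[str, float]:
    """Return p95 and p99 latency for a list of request durations (ms)."""
    # n=100 yields the 1st..99th percentile cut points; "inclusive" treats
    # the sample as the full population, so values stay within the data range.
    q = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {"p95": q[94], "p99": q[98]}

if __name__ == "__main__":
    durations = [12, 18, 22, 25, 31, 44, 52, 75, 120, 410]  # sample data
    p = latency_percentiles(durations)
    P99_SLO_MS = 250  # hypothetical SLO target
    print(f"p95={p['p95']:.0f}ms p99={p['p99']:.0f}ms "
          f"{'OK' if p['p99'] <= P99_SLO_MS else 'SLO BREACH'}")
```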
Analyzing beyond available metrics
We became much more direct when analyzing why we were breaching SLOs: was it code quality, a dependency, or infrastructure? We used error budget burn rates during the quarter to actively discuss the right trade-off between features and stability.
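For readers newer to error budgets, here is a minimal sketch of the arithmetic behind a burn-rate discussion; the SLO target and error rate are hypothetical:

```python
# Minimal sketch of an error-budget burn rate (numbers are hypothetical).
def burn_rate(slo_target: float, window_error_rate: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning.

    slo_target: e.g. 0.999 means 99.9% of requests must succeed.
    window_error_rate: fraction of failed requests observed in the window.
    A burn rate of 1.0 consumes the budget exactly over the SLO period;
    anything well above 1.0 is a signal to slow feature work.
    """
    error_budget = 1.0 - slo_target  # allowed failure fraction
    return window_error_rate / error_budget

if __name__ == "__main__":
    # 99.9% availability SLO, 0.5% of requests failing in the last window
    print(f"burn rate: {burn_rate(0.999, 0.005):.1f}x")  # -> 5.0x
```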
Design for failure
During design reviews, I pushed teams to analyze failure scenarios (e.g., what if this dependency fails?) and to identify where the design would stop scaling and need a different architectural approach. We were all in on fault-tolerance patterns such as circuit breakers and retries, so the design holds up to what it is intended to do.
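To make those two patterns concrete, here is a minimal sketch of retries with exponential backoff and a simple circuit breaker. It is illustrative only; the thresholds and timings are assumptions rather than production values.

```python
# Minimal sketch of two fault-tolerance patterns: retries with exponential
# backoff, and a simple circuit breaker. Thresholds and timings are illustrative.
import random
import time

def retry_with_backoff(fn, attempts: int = 3, base_delay: float = 0.1):
    """Retry fn(), doubling the delay (with jitter) after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

class CircuitBreaker:
    """Open the circuit after N consecutive failures; retry after a cooldown."""
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # re-open the circuit
            raise
        self.failures = 0  # success resets the failure count
        return result
```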
Building a team culture of accountability
Reliability needs to be everyone's job. We established well-supported on-call rotations, training, and safe channels (such as Slack retros) for raising reliability concerns, and we made sure reliability work was reflected in performance metrics.
Questions to ask yourself
- How visible are your performance metrics?
- Does your team consider reliability concerns before starting a project?
- How do you separate reliability work from feature work?
Guide teams through technical trade-offs, providing context as needed
“Should we include mandatory risk checks, adding 15ms, or can we stay at our original target of 50ms latency?”
This decision was not made in isolation. We used our framework to quantify and discuss impact, cost, risk, and goals together, and we arrived at a hybrid solution.
Building software is a continuous balancing act. My role evolved into facilitating trade-off decisions in an explicit, transparent, collaborative, and agile way.
Creating structures for decision-making: Establishing a lightweight framework
When I facilitated these discussions, I deliberately asked the team to evaluate options against key criteria: user impact, implementation cost, operational risk, scalability, and goal alignment. Documenting decisions and their rationale in simple architecture decision records (ADRs) helped the team capture the 'why' and avoid re-litigating choices. I remember the discussions about consistency models; we agreed that eventual consistency could be acceptable in certain cases, but strong consistency was critical for the financial system.
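As a hypothetical sketch of that kind of lightweight evaluation, the snippet below scores two options against weighted criteria. The weights and scores are invented; in practice the conversation and the written ADR mattered far more than the resulting number.

```python
# Hypothetical sketch of a lightweight decision matrix for trade-off reviews.
# Weights and 1-5 scores are invented for illustration only.
CRITERIA_WEIGHTS = {
    "user_impact": 0.3,
    "implementation_cost": 0.2,   # higher score = cheaper to build
    "operational_risk": 0.2,      # higher score = lower risk
    "scalability": 0.15,
    "goal_alignment": 0.15,
}

def score(option: dict[str, int]) -> float:
    """Weighted sum of 1-5 scores for each criterion."""
    return sum(CRITERIA_WEIGHTS[c] * option[c] for c in CRITERIA_WEIGHTS)

options = {
    "sync risk checks (+15ms)": dict(user_impact=3, implementation_cost=4,
                                     operational_risk=5, scalability=3,
                                     goal_alignment=5),
    "async risk checks":        dict(user_impact=5, implementation_cost=3,
                                     operational_risk=3, scalability=4,
                                     goal_alignment=4),
}
for name, opt in options.items():
    print(f"{name}: {score(opt):.2f}")
```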
Facilitating collaborative decision making
I facilitated so that we respectfully captured a wide range of perspectives, including engineering, product, design, and security. Active listening and a safe space for dissent helped the team converge on a shared understanding of the disagreements, even when we did not reach full consensus.
Evolution of decisions
One team I facilitated, working on financial transactions, documented in an ADR the decision to use an asynchronous anti-cheat approach for better performance. When new cheating patterns emerged months later, the ADR meant the team could add synchronous checks intelligently without a complete redesign.
For leaders experimenting with frameworks
When I first introduced these frameworks, I focused on making trade-offs visible and on documenting decisions wherever possible. The transition turned out to be a good deal smoother than I expected: even lightweight documentation helped the team see that we were building up a shared design rationale.
Balancing safety and speed
I helped optimize the checks that were legally mandatory, guided deliberation and decisions based on risk acceptance and business context, and often supported a hybrid of the two approaches. Where failures could not be prevented, fast recovery became the primary emphasis.
Questions to ask yourself
- How explicitly are performance trade-off discussions happening for your team?
- Are you documenting the design rationale so you can all refer back to it?
- Do you consistently consider different viewpoints within the team?
Hire, coach, and invest in developing high-impact, resilient teams
During an outage, I watched a recently hired senior engineer manage the failure, methodically diagnosing the cause and communicating with a calm demeanor. That was by design: the hiring process had screened for temperament, and regular incident drills had built the team's confidence. Developing this sort of systematic approach took focused attention on selection and development.
Complex systems require deliberate approaches to hiring, coaching, and evolving practices, with an emphasis on human factors.
Hiring for performance-critical environments
I focused heavily on temperament and ways of working. I used behavioral questions, and I paid as much attention to how candidates communicated under pressure as to the facts I could glean from their answers. I looked for systems thinking, debugging confidence, and a proactive eye for failure modes.
Build a sense of trust
Blameless post-mortems were imperative: they kept the focus on improving the system rather than on blame avoidance or blame games. Building trust required consistency from me: admitting mistakes, inviting feedback, working through exercises that suggested improvements, and responding constructively. At the heart of this was creating the conditions for the team to feel safe taking interpersonal risks, so my role was to steer conversations toward the systemic factors that contributed to failures ("What process or procedural change could prevent this?") and to look regularly for patterns across incidents so we could make higher-order improvements.
Making ‘war games’ effective
Starting out: We began with simple exercises: talking through scenarios, using runbooks to find holes and gaps, and focusing on roles and communication.
Maturing: As the team matured and embraced "war games," I set a regular monthly rhythm for them. We began using chaos engineering tooling in staging to simulate realistic failures (deliberately failed databases, bad pull requests). I learned that an immediate 30-minute blameless debrief afterwards was vital: it made improvements to procedures, runbooks, monitoring, and how people communicate obvious, often leading to MTTR reductions in real incidents (a toy fault-injection drill is sketched after this list).
Learning from mistakes: Our first complex war game completely overwhelmed the team. We went back to simpler scenarios and increased the complexity as the team gained confidence.
Success metrics: Along the way we measured incident metrics (MTTD/MTTR), tracked the velocity of post-mortem action items, and checked in on the team's confidence (a toy MTTD/MTTR calculation also appears below). The most effective practice was maintaining a ritual of coordinated review and hindsight after every real incident.
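Here is the toy fault-injection drill referenced above: a minimal sketch, not a real chaos-engineering tool, in which a hypothetical flaky_database dependency is wrapped with random latency and errors so a team can rehearse detection and response in staging.

```python
# Toy "war game" sketch: wrap a dependency call with a fault injector that
# randomly adds latency or raises errors. The flaky_database function and
# the failure rates are hypothetical.
import random
import time

def flaky_database(query: str) -> str:
    """Stand-in for a real dependency call in a staging environment."""
    return f"rows for {query!r}"

def inject_faults(fn, error_rate=0.3, max_extra_latency_s=0.2):
    """Return a wrapped version of fn that randomly fails or slows down."""
    def wrapped(*args, **kwargs):
        time.sleep(random.uniform(0, max_extra_latency_s))  # injected latency
        if random.random() < error_rate:
            raise ConnectionError("injected failure (game day)")
        return fn(*args, **kwargs)
    return wrapped

if __name__ == "__main__":
    db = inject_faults(flaky_database)
    failures = 0
    for _ in range(20):
        try:
            db("SELECT 1")
        except ConnectionError:
            failures += 1
    print(f"{failures}/20 calls failed under injected faults")
```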
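And the toy MTTD/MTTR calculation: a minimal sketch over hypothetical incident records, measuring detection time from incident start and recovery time from detection.

```python
# Toy MTTD/MTTR calculation over a list of incident records.
# Field names and timestamps are hypothetical.
from datetime import datetime
from statistics import mean

incidents = [
    {"started": "2024-03-01T10:00", "detected": "2024-03-01T10:04",
     "resolved": "2024-03-01T10:35"},
    {"started": "2024-04-12T22:10", "detected": "2024-04-12T22:12",
     "resolved": "2024-04-12T22:41"},
]

def minutes_between(a: str, b: str) -> float:
    """Elapsed minutes between two ISO-format timestamps."""
    return (datetime.fromisoformat(b) - datetime.fromisoformat(a)).total_seconds() / 60

# MTTD: start -> detection; MTTR here: detection -> resolution.
mttd = mean(minutes_between(i["started"], i["detected"]) for i in incidents)
mttr = mean(minutes_between(i["detected"], i["resolved"]) for i in incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```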
Questions to ask yourself
- How comfortable are team members sharing reliability concerns?
- Do your incident reviews look for ways to prevent recurrence, or for someone to blame?
- How often does your team practice responding to failure?
Evolving reliability practices sustainably
"Remember when performance dashboards were 'nice to have'? Now they are the first place everyone looks." That comment from a teammate confirmed that our reliability practices had become habits. Sustaining them still takes deliberate effort: reinforcing the habits, acknowledging the performance gains we have achieved, and keeping improvement visible through the measures of success we track.
I learned that building resilience once is not enough; it has to be sustained and allowed to evolve. I employed several strategies:
Evolve how we set targets and metrics
We periodically revisited our reliability targets and SLOs with stakeholders. I learned it was important to schedule these reviews formally each quarter so the SLOs stay relevant. I also learned to calibrate what reliability looks like to the starting maturity of each team and of the wider center of excellence (CoE), rather than treating it as an implicit target every team is assumed to have the capacity and familiarity to hit.
Sustain how we manage technical debt
We formalized a cadence for routinely addressing tech debt in our sprints, which translates directly into extended reliability. I learned you have to state explicitly that you are allocating 20-25% of sprint engineering capacity to this proactive work.
Institutionalize and evolve your learning loops
We tracked action items from our post-mortems, shared the outcomes across teams, and quantified how effective the resulting improvements were. We made it a priority for teams to communicate their progress on continuous learning at a reasonable cadence, so they did not suffer from post-mortem fatigue.
Signalling leadership consistency
I learned that if reliability was to keep its importance under resource constraints, I had to consistently signal it, recognizing how teams were building reliability capacity alongside feature development. I could not simply opt in to that occasionally; earning trust required that my day-to-day prioritization and recognition consistently backed up the importance I claimed for reliability and performance work.
For teams just starting out, my advice is to take a staged approach: pick one or two practices to begin with, plan how you will evolve them, and choose a few metrics so the team realizes early value.
Questions to ask yourself
- What have you changed about your reliability practices recently?
- What explicit actions do you take to ensure reliability is a priority when you have a lot on your plate?
- How effectively does your team make note of and apply lessons learned from incidents?
Building for today and tomorrow
In my experience, leading top engineering teams requires a combination of skills: building a strong technical culture, focusing on people, guiding teams through difficult times, and establishing durable practices.
The practices we reinforce now will become even more important as we adopt and implement AI and machine learning. We will need to manage the added complexity, ensure reliability and security, and improve the developer experience while maintaining good performance.
Getting underway: Your next three actions
If I were to start this journey again today, I would think about these immediate steps:
- Evaluate your current state. Collect and quantify your baseline for reliability metrics, safety, and decision-making processes.
- Choose one area of focus. Pick the one lesson from this article that would most immediately benefit your team.
- Establish visibility. Make one change that elevates your performance and reliability data and puts it in front of the team daily (a dashboard, a discussion, a metric).
I believe that by encouraging learning and supporting our teams, we not only create resilient systems but also empower our organizations to meet the future with confidence.
Rahul Chandel is an engineering leader with over 15 years of experience in software engineering, distributed systems, cloud computing, blockchain technologies, payment systems, and large-scale trading platforms. He has led high-performing teams at Coinbase, Twilio, and Citrix, driving innovation, scalability, and operational excellence across mission-critical systems. Rahul is passionate about fostering innovation and designing systems that thrive under real-world pressure.
This article is published as part of the Foundry Expert Contributor Network.