In large, complex organizations, sometimes the only metric that seems to matter is mean time to innocence (MTTI). When a system breaks down, MTTI is the tongue-in-cheek measure of how long it takes to prove that the breakdown was not your fault. Somehow, MTTI never makes it into the slide deck for the quarterly board meeting.
With the explosion of tools available today—observability platforms for gathering system telemetry, CI/CD pipelines with test suite timings and application build times, and real user monitoring to track performance for the end user—organizations are blessed with a wealth of metrics. And cursed with a lot of noise.
Every team has its own set of metrics. While every metric might matter to that team, only a few of those metrics may have significant value to other teams and the organization at large. We’re left with two challenges:
1. Metrics within a team are often siloed. Nobody outside the team has access to them or even knows that they exist.
2. Even if we can break down the silos, it’s unclear which metrics actually matter.
Breaking down silos is a complex topic for another post. In this one, we’ll focus on the easier challenge: highlighting the metrics that matter. What metrics does a technology organization need to ensure that, in the big picture, things are working well? Are we good to push that change, or could the update make things worse?
Availability Metrics
Humans like big, simple metrics: the Dow Jones, heartbeats per minute, number of shoulder massages you get per week. To get the big picture in IT, we also have simple, easily understandable metrics.
Uptime
Expressed as the percentage of time a system is available, uptime is the simplest metric of all. We would all guess that anything less than 99% is considered poor. But chasing those last few nines can get expensive. Complex systems designed to avoid failure can cause failure in their own right, and the cost of implementing 99.999% availability (“five nines”) may not be worth it.
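To make the cost of each extra nine concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for a target availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {budget:,.1f} minutes of downtime per year")
```

At 99% the budget is roughly 5,256 minutes (about 3.65 days) per year. At five nines it shrinks to a little over 5 minutes.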
Mean Time Between Failures (MTBF)
MTBF is the average time between failures in a system. The beauty of MTBF is that you can actually watch your boss start to twitch as you approach MTBF: Will the system fail before the MTBF? After? Perhaps it’s less stressful to throw the breakers intentionally, just to enjoy another 87 days!
Mean Time To Recovery (MTTR)
MTTR is the average time to fix a failure and can be thought of as the flip side of MTBF. Both Martin Fowler and Jez Humble have quoted the phrase, “If it hurts, do it more often,” and that principle seems like it could apply to MTTR as well. Rather than avoiding changes—and generally treating your systems with kid gloves to try and keep MTBF high—why not get better at recovery? Work to reduce your MTTR. Paradoxically, you could enjoy more uptime by caring about it less.
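The relationship between MTBF, MTTR, and uptime can be sketched with a few lines of arithmetic. The incident timestamps and the reporting window below are hypothetical, and the formulas are the commonly used definitions (uptime divided by number of failures, downtime divided by number of failures), not something prescribed by a particular tool.

```python
from datetime import datetime, timedelta

# Hypothetical incidents: (failure began, service restored)
incidents = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 10, 0)),
    (datetime(2024, 2, 20, 14, 0), datetime(2024, 2, 20, 14, 30)),
    (datetime(2024, 3, 30, 3, 0), datetime(2024, 3, 30, 5, 0)),
]
# Hypothetical reporting window
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 4, 1)


def total_downtime(incidents):
    """Sum the duration of every incident."""
    return sum((restored - failed for failed, restored in incidents), timedelta())


def mttr(incidents):
    """Mean time to recovery: total downtime divided by number of failures."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: total uptime divided by number of failures."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(incidents, window_start, window_end):
    """Fraction of the window the system was up: MTBF / (MTBF + MTTR)."""
    between = mtbf(incidents, window_start, window_end)
    repair = mttr(incidents)
    return between / (between + repair)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents, window_start, window_end))
print(f"Availability: {availability(incidents, window_start, window_end):.4%}")
```

Because availability is MTBF / (MTBF + MTTR), halving MTTR improves uptime about as much as doubling MTBF, which is the arithmetic behind getting better at recovery.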
Development Metrics
For years, an important improvement metric used by developers was Product Owner Glares Per Day. Development in the 21st century has given us new ways to understand developer productivity, and a growing body of research points to the metrics we need to focus on.
Deployment Frequency
The outstanding work of Nicole Forsgren, Jez Humble, and Gene Kim in Accelerate demonstrates that teams that can deploy frequently experience fewer change failures than teams that deploy infrequently. It would be a brave move to try and game this metric by deploying every hour from your CI/CD pipeline. However, capturing and understanding this metric will help your team investigate its impediments.
Cycle Time
Cycle time is measured from the time a ticket is created to the healthy deployment of the resulting fix in production. If you needed to fix an HTML tag, how long would it take to get that single change deployed? If you need to start calling meetings about deployment windows and outages, you know that the value of that metric, for your organization, is too high.
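As a rough sketch, cycle time can be computed from two timestamps per ticket. The ticket data here is hypothetical. In practice the timestamps would come from your issue tracker and deployment pipeline.

```python
from datetime import datetime
from statistics import median

# Hypothetical tickets: (ticket created, fix deployed and healthy in production)
tickets = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 15, 0)),
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 4, 10, 0)),
    (datetime(2024, 5, 3, 8, 0), datetime(2024, 5, 3, 20, 0)),
]


def cycle_times_hours(tickets):
    """Return the cycle time of each ticket in hours."""
    return [
        (deployed - created).total_seconds() / 3600
        for created, deployed in tickets
    ]


hours = cycle_times_hours(tickets)
print(f"Median cycle time: {median(hours):.1f} h, slowest: {max(hours):.1f} h")
```

The median is used here rather than the mean so that one stuck ticket does not hide how quickly a typical change ships. Reporting the slowest ticket alongside it keeps the outliers visible.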
Change Failure Rate
Of all your organization’s deployments, how many need to be rolled back or followed up with an emergency bugfix? This is your change failure rate, and it’s an excellent metric to try to improve. Improving your change failure rate helps developers to proceed more confidently. This will improve the deployment frequency in turn.
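Deployment frequency and change failure rate can both be derived from the same deployment log. The sketch below assumes a hypothetical record with a date and a flag marking whether the deployment was rolled back or needed an emergency fix.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date
    failed: bool  # True if rolled back or followed by an emergency bugfix


# Hypothetical two-week deployment log
deployments = [
    Deployment(date(2024, 6, 3), False),
    Deployment(date(2024, 6, 4), True),
    Deployment(date(2024, 6, 6), False),
    Deployment(date(2024, 6, 10), False),
    Deployment(date(2024, 6, 14), False),
]


def change_failure_rate(deployments):
    """Fraction of deployments that needed a rollback or emergency fix."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def deployments_per_week(deployments, period_days):
    """Average number of deployments per 7-day week over the period."""
    return len(deployments) / (period_days / 7)


print(f"Change failure rate: {change_failure_rate(deployments):.0%}")
print(f"Deployment frequency: {deployments_per_week(deployments, 14):.1f} per week")
```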
Error Rate
How many errors per hour does your code create at runtime? Is that better or worse since the last deployment? This is a great metric to expose to stakeholders: Since many demos only show the UI of an application, it’s helpful to see what is blowing up behind the scenes.
Platform Team Metrics
Metrics often originate from the platform team because metrics help raise the maturity level of their team and other teams. So, which metrics are most helpful? While uptime and error rate matter here too, monthly active users and latency are also important.
Monthly Active Users
Being able to plan capacity for infrastructure is a gift. Monthly active users is the metric that can make this happen. Developers need to understand the load their code will have at runtime, and the marketing team will be incredibly thankful for those metrics.
Latency
Just like ordering coffee at Starbucks, sometimes you need to wait a little while. The more you value your coffee, the longer you might be willing to wait. But your patience has limits.
For application requests, latency can destroy the end-user experience. What’s worse than latency is unpredictable latency: If a request takes 100ms one time but 30s another time, then the impact is multiplied across every system that makes the request and has to wait on it.
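Unpredictable latency is easier to see in percentiles than in an average. The following sketch uses the nearest-rank method on hypothetical samples that include one 30-second outlier. The sample values and function name are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one 30 s outlier
latencies_ms = [100, 95, 110, 105, 98, 102, 30000, 97, 101, 99]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.1f} ms")
for pct in (50, 90, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

With these samples the median is 100 ms while the mean is over 3 seconds, so the mean describes no real request. Tracking p50 alongside p99 shows both the typical experience and the painful tail.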
UX Metrics
Senior and non-technical leadership tend to focus on what they can see in demos. They can be prone to nitpicking the frontend because that is what’s visible to them and the end users. So, how does a UX team nudge leadership to focus on the achievements of the UX instead of the placement of pixels?
Conversion Rate
The organization always has a goal for the end user: register an account, log in, place an order, buy some coins. It’s important to track these goals and see how users perform. Test different versions of your application with A/B testing. An improvement in conversion rate can mean the difference between profit and loss.
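A minimal sketch of comparing conversion rates between two A/B variants is shown below. The visitor and conversion counts are hypothetical. The sketch reports the raw rates and the relative lift only, and it does not test whether the difference is statistically significant.

```python
def conversion_rate(conversions, visitors):
    """Fraction of visitors who completed the goal."""
    if visitors == 0:
        return 0.0
    return conversions / visitors


def relative_lift(baseline_rate, variant_rate):
    """Relative change of the variant over the baseline (0.25 means +25%)."""
    return (variant_rate - baseline_rate) / baseline_rate


# Hypothetical results: variant -> (conversions, visitors)
results = {"A": (120, 4000), "B": (150, 4000)}

rate_a = conversion_rate(*results["A"])
rate_b = conversion_rate(*results["B"])
print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  lift: {relative_lift(rate_a, rate_b):+.0%}")
```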
Time on Task
Even if you’re not making an application for employees, the amount of time spent on a task matters. If your users are being distracted by colleagues, children, or pets, it helps if their interactions with you are as efficient as possible. If your end user can complete an order before they need to help the kids with their homework or get Bob unstuck, that’s one less shopping cart abandoned.
Net Promoter Score (NPS)
NPS comes from asking an incredibly simple question: On a scale of 0 to 10, how likely is it that you would recommend this website (or application or system) to a friend or colleague? Embedding this survey into checkout processes or receipt emails is easy. Given enough responses, you can work out if a recent change compromised the experience of using a product or service.
If you can compare NPS results for different versions of your application, then that’s even more helpful. For example, maybe the navigation that the marketing manager insisted on really is less intuitive than the previous version. NPS comparisons can help identify these impacts on the end user.
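For reference, here is a sketch of the usual NPS calculation, assuming the standard 0-to-10 scale: respondents scoring 9 or 10 count as promoters, those scoring 0 through 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The survey responses for the two versions are hypothetical.

```python
def net_promoter_score(scores):
    """Return NPS in the range -100..100 from a list of 0-10 survey answers."""
    if not scores:
        raise ValueError("net_promoter_score() needs at least one response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical responses collected before and after a navigation redesign
old_navigation = [10, 9, 9, 8, 7, 6, 10, 9, 5, 8]
new_navigation = [9, 7, 6, 5, 8, 10, 4, 6, 7, 9]

print("old navigation NPS:", net_promoter_score(old_navigation))
print("new navigation NPS:", net_promoter_score(new_navigation))
```

In this made-up data the score drops from 30 to -10 after the redesign. Ten responses per version is far too few to act on, so treat a real comparison with caution until the volume is there.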
Security Metrics
Security is a discipline that touches everything and everyone—from the developer inadvertently creating an SQL injection flaw because Jenna can’t let the product launch slip, to Bob allowing the physical pen tester into the data center because they smiled and asked him about his day. Fortunately, several security metrics can help an organization get a handle on threats.
Number of Vulnerabilities
Security teams are used to playing whack-a-mole with vulnerabilities. Vulnerabilities are built into new code, discovered in old code, and sometimes inserted deliberately by unscrupulous developers. Tracking the vulnerabilities that are discovered and fixed is a great way to show management that the security team is on the job squashing threats. This metric can also show, for example, how pushing the devs to hit that summer deadline caused dozens of vulnerabilities to crop up.
Mean Time To Detect (MTTD)
MTTD measures how long, on average, an issue has been in production before it is discovered. An organization should always be striving to improve how it handles security incidents. Detecting an incident is the first priority. The more time an adversary has inside your systems, the harder it will be to say that the incident is closed.
Mean Time To Acknowledge (MTTA)
Sometimes, the smallest signal that something is wrong turns out to be the red-alert indicator that a system has been compromised. MTTA measures the average time between the triggering of an alert and the start of work to address that issue. If a junior team member raises concerns but is told to put those on ice until after the big release, then MTTA goes up. As MTTA goes up, potential security incidents have more time to escalate.
Mean Time To Contain (MTTC)
MTTC is the average time, per incident, it takes to detect, acknowledge, and resolve a security incident. Ultimately, this is the end-to-end metric for the overall handling of an incident.
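The three security timing metrics can be sketched from four timestamps per incident. The record layout and the incident data below are hypothetical. The calculations follow the definitions used in this post: MTTD from start to detection, MTTA from detection (the alert) to acknowledgement, and MTTC end to end from start to resolution.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SecurityIncident:
    started: datetime       # when the issue first existed in production
    detected: datetime      # when an alert fired or someone noticed
    acknowledged: datetime  # when work on the issue began
    resolved: datetime      # when the incident was contained and closed


def mean_delta(deltas):
    """Average a sequence of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical incident history
incidents = [
    SecurityIncident(datetime(2024, 3, 1, 8, 0), datetime(2024, 3, 1, 12, 0),
                     datetime(2024, 3, 1, 12, 30), datetime(2024, 3, 1, 18, 0)),
    SecurityIncident(datetime(2024, 3, 9, 22, 0), datetime(2024, 3, 10, 0, 0),
                     datetime(2024, 3, 10, 1, 30), datetime(2024, 3, 10, 6, 0)),
]

mttd = mean_delta(i.detected - i.started for i in incidents)
mtta = mean_delta(i.acknowledged - i.detected for i in incidents)
mttc = mean_delta(i.resolved - i.started for i in incidents)

print(f"MTTD: {mttd}  MTTA: {mtta}  MTTC: {mttc}")
```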
Signal, Not Noise
Amidst the noise of countless metrics available to teams today, we’ve highlighted specific metrics at different points in the application stack. We’ve looked at availability metrics for the IT team, followed by metrics for the developer, platform, UX, and security teams. Metrics are a fantastic tool for turning chaos into managed systems, but they’re not a free ride.
First, setting up your systems to gather metrics can require a significant amount of work. However, data gathering tools and automation can help free up teams from the task of collecting metrics.
Second, metrics can be gamed, and metrics can be confounded by other metrics. It’s always worth checking out the full story before making business decisions solely based on metrics. Sometimes, the appearance of rigor in data-driven decision-making is just that.
At the end of the day, the goal for your organization is to track down those metrics that truly matter, and then build processes for illuminating and improving them.
Source: cisco.com