Strategic Planning: The path to improved Infrastructure system stability and reliability

July 2, 2025
The horrible year

The year 2023 was an annus horribilis for the Infrastructure team, as physical failures and maintenance upgrades led to three P1 incidents.  At the end of 2023, the spotlight was on the Infrastructure team, with a clear need for improvement.  This is the story of the improvement journey taken and the outcomes achieved.

Clear objectives

Omise uses OKRs to make plans for each year.  OKRs (Objectives and Key Results) define what you want to achieve and how you will measure progress towards it.  For example, a golfer might set an objective to “improve my golf handicap by five by the end of the year”, with key results of “increase ‘greens hit in regulation’ by three per round”, “reduce the number of three putts to three per round”, and so on.  If you meet those key results, you should have met your objective.

At the end of 2023, the Infrastructure team was given two clear objectives with clear key results:

  • Improve system stability
    • Achieve 99.99% availability
    • P1/P2 incidents reduced by 50%
  • Reduce client impacts from infrastructure changes
    • Change success rate > 80%
    • Incidents caused by infrastructure changes reduced by more than 10%
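
To give a feel for how demanding the availability key result is, here is a minimal sketch that converts an availability percentage into a yearly downtime budget.  It assumes availability is measured against every minute of a 365-day year; this post does not say how the figure was actually measured, and the function name is purely illustrative.

```python
# Assumption: availability is measured over all minutes in a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


budget = downtime_budget_minutes(99.99)
print(f"99.99% availability allows about {budget:.2f} minutes of downtime per year")
```

Under that assumption, 99.99% leaves a little under an hour of downtime for the whole year, so a single long incident can use up the entire budget.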

Bridging the gap

With our OKRs in place, we knew our goals for the year, but while they tell you where you want to be, they don’t give any guidance on how to get there.  Indeed, Google's OKR Playbook says key results “must describe outcomes, not activities”.  

For example, if we were to meet the key result “P1/P2 incidents reduced by 50%”, how would we do that?

We knew from previous experience using the OGSM model that the answer lies in developing a framework that combines strategic direction with tactical execution.  In this case, we chose to use the terms “Which paths to take?” and “How to overcome obstacles?”.

A planning workshop

With our structure in place, we could start our team’s planning workshop.  To “get the ball rolling”, we started by reviewing the “Driving forces”, the “business needs” that required us to reach our objectives, and why the specified key results were important.  

For example, what was driving “Improve system stability” / “P1/P2 incidents reduced by 50%”?  One part was that our Canary process didn’t cover enough deployments, and another part was that our revert process wasn’t fast enough.

With those driving forces in place to remind us that we were no longer facing a “blank piece of paper”, ideas started flowing.  We came up with over two dozen paths to take or ways to overcome obstacles.  

The strategic plan

There was a lot to consider, so we took a break, came back with fresh eyes, and looked at what we had.  It was then that we realised everything fitted into two main themes: “Don’t go wrong” and “Recover quickly.”

Within those strategies, we placed the matching tactics and then voted on those that would give the most “bang for the buck” or were key to other improvements.  Finally, our plan looked like this:

Strategies / Tactics

  • Don’t go wrong
    • Implement Canary deployment for all applications
    • Define Pre/Post checks in all Infrastructure team runbooks
    • Refresh the Production Readiness Review process and authority
  • Recover quickly
    • Implement an alternate revert for deployments; empower developers to revert
    • Audit and refresh our alerts and monitors
    • Implement Smoke testing

Making it happen

In the week following the workshop, we created Project Plans with milestones and deadlines, but it was when we created Epics and Tasks that we saw how to implement everything.  We then rolled up our sleeves and got to work, implementing everything over the course of 2024.

The results

As the saying goes, the proof is in the pudding, so did we meet our OKRs?  Did we achieve the Key Results to show we met our Objectives?  In short, yes.

As a reminder, our OKRs were:

  • Improve system stability
    • Achieve 99.99% availability
    • P1/P2 incidents reduced by 50%
  • Reduce client impacts from infrastructure changes
    • Change success rate > 80%
    • Incidents caused by infrastructure changes reduced by more than 10%

In 2023, a total of eleven P1/P2 infrastructure incidents occurred; in 2024, this number decreased to three.  We achieved ~70% reduction in P1/P2 incidents.  If we count those related to infrastructure maintenance, there was a ~60% reduction.

In 2024, there were eight infrastructure software upgrades, with only one failure.  We achieved a success rate of ~90% in infrastructure changes.  If you count all tasks under “Infrastructure maintenance,” only two failed out of 36, a success rate of ~94%.
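
The rounded figures above can be reproduced from the raw counts.  The sketch below is only a worked check of the arithmetic; the function names are illustrative, and the targets are the key results listed earlier.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


def success_rate(total: int, failed: int) -> float:
    """Percentage of changes that did not fail."""
    return (total - failed) / total * 100


# Counts quoted in this post.
incident_reduction = percent_reduction(before=11, after=3)  # P1/P2 incidents, 2023 vs 2024
upgrade_success = success_rate(total=8, failed=1)           # infrastructure software upgrades
maintenance_success = success_rate(total=36, failed=2)      # all infrastructure maintenance tasks

print(f"P1/P2 incident reduction: {incident_reduction:.1f}% (target: 50%)")
print(f"Upgrade success rate: {upgrade_success:.1f}% (target: > 80%)")
print(f"Maintenance success rate: {maintenance_success:.1f}% (target: > 80%)")
```

The exact values are 72.7%, 87.5%, and 94.4%, which the text rounds to ~70%, ~90%, and ~94%.  All three clear their targets with room to spare.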

Takeaways

The main takeaway was the importance of the planning process.  Whether you call them “Which paths to take” / “How to overcome obstacles”, “Strategies” / “Tactics”, or something else, developing a framework that combines strategic direction with tactical execution is key to creating a plan that everyone can have confidence in and sign up to.

Other takeaways include the usefulness of considering “Driving forces” when you are stuck, and the usefulness of Epics, Tasks, and Sub-tasks over Project Plans for getting work done.

If you are facing ambitious OKRs and need to bridge the gap to results and operational excellence, we hope our experience can inspire and guide you, too.