SREs: The Engine Driving the Future Forward
The concept of Site Reliability Engineering (SRE) originated at Google, where it was the secret sauce for managing one of the world's largest software platforms: google.com. SREs are meant to redefine the relationship between software developers, operations staff, and the business, helping everyone work together to build sturdy, flexible systems, with constant improvement and automation as core principles.
As Honeywell has transformed into an industrial software company that is revolutionizing the industrial world with Honeywell Forge, an Enterprise Performance Management (EPM) Software-as-a-Service offering purpose-built on a native edge-to-cloud, data-driven architecture to accelerate our customers’ digital transformation, one common element has emerged across our customers, be they buildings, commercial aircraft, or oil and gas refineries: they emit a ton of IIoT data.
For our customers, the reliability of their software systems is critical. If problems occur, operations across the enterprise are impacted and the implications can be enormous. That’s why we have Site Reliability Engineers (SREs) within Honeywell Connected Enterprise. Our SRE team is responsible for building, managing, and maintaining the reliability of our production systems.
Today, we are not worried about high-touch, high-performance servers, because a distributed software architecture is far more resilient to failure than high-touch servers. Architecting for failure and expecting it is how we achieve high reliability. Hence, our focus has shifted completely from hardware to software-defined infrastructure, and from inconsistent, error-prone manual deployment processes to consistent, reliable, repeatable, fully automated one-click deployments.
SRE @ HONEYWELL
Honeywell Connected Enterprise’s SRE team is responsible for building, managing, and maintaining the programmable global cloud infrastructure, in accordance with customer-provided requirements, with the goal of enhancing the reliability, availability, and resilience of the managed products and services. The team brings software engineering principles to infrastructure and operations problems in order to create highly scalable and reliable systems.
The team’s main responsibilities include writing infrastructure code, building production environments, and establishing service-level thresholds, often expressed as service-level objectives (SLOs), which help determine whether a release gets the green light. With error budget adoption, there should be better uptime, giving developers more rope to launch cool new features and letting SREs get more sleep! In turn, a more symbiotic relationship can form between the functions, which is a far cry from the old days of developer and operations antagonism.
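To make the link between SLOs, error budgets, and release decisions concrete, here is a minimal sketch in Python. It assumes a request-based availability SLO; the class name, field names, and the `min_remaining` threshold are hypothetical and chosen only for illustration, not a description of Honeywell's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based availability SLO (illustrative only)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served during the SLO window
    failed_requests: int   # requests that failed during the same window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def can_release(budget: ErrorBudget, min_remaining: float = 0.0) -> bool:
    """Green-light a release only while error budget remains (assumed policy)."""
    return budget.remaining_fraction > min_remaining


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
    print(can_release(healthy), can_release(exhausted))
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed. A service with 400 failures still has budget to spend on new releases, while one with 1,200 failures has overspent and would hold releases until reliability recovers.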
An SRE function will typically be measured on a set of key software reliability metrics, including performance, availability, efficiency, observability, capacity management and emergency response. These are often expressed as SLOs – the measures of the products they have been charged with – and are a focus of our SREs.
The Focus Areas:
Honeywell Connected Enterprise has a three-pronged approach to solving reliability challenges for our software without impacting the agility with which we produce features and products for our customers.
1. Shift Left - On the left-hand side of product development, our SREs are embedded into the release train and work closely with the product teams to ensure every line of code and every user story has been evaluated for reliability. We make sure that our products have the resiliency we need and that they emit the right metrics for us to find and fix issues in production more efficiently and effectively.
2. Shift Right - On the right-hand side of the SDLC, we are streamlining our incident management processes, automating the change and emergency response processes, building an observability platform along with tools and accelerators that help detect issues and restore service quickly, doing one-click deployments, creating a central repo for all our post-mortem data, and building intelligent RCCAs.
3. In the middle, we are building global capabilities designed to make our overall operations more efficient and to reduce toil, such as the self-healing engine, anomaly detection systems, global observability and fleet management platforms, a capacity engineering service, and other tools and accelerators.
The Workload: 50/50 rule
Unexpected work is to be expected for SREs, who often do not have the luxury of planning their entire day. That being said, we don’t want to just survive; we want to be in a continuous cycle of improvement. That’s why we have the 50/50 rule, under which 50% of our SRE team’s capacity is meant to be spent on operations work and 50% on proactive work. This helps solve two problems:
1. The team invests time in reducing operational toil by automating cumbersome manual work.
2. The team looks for opportunities and new viewpoints to solve problems in a creative, long-term way, rather than just dealing with them as they come up.
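One simple way to keep an eye on the 50/50 rule is to compare logged hours. The sketch below assumes the team tracks hours in two buckets, operations and proactive work; the function names and the idea of an hour-based log are hypothetical.

```python
def operations_share(ops_hours: float, proactive_hours: float) -> float:
    """Return the fraction of logged capacity spent on operations work."""
    total = ops_hours + proactive_hours
    if total <= 0:
        raise ValueError("no hours logged for this period")
    return ops_hours / total


def exceeds_ops_cap(ops_hours: float, proactive_hours: float, cap: float = 0.5) -> bool:
    """True when operations work takes more than the cap (50% by default)."""
    return operations_share(ops_hours, proactive_hours) > cap


if __name__ == "__main__":
    # Hypothetical week: 30 hours of operations work, 10 hours of proactive work.
    print(operations_share(30, 10), exceeds_ops_cap(30, 10))
```

A period that exceeds the cap is a signal to push more toil into automation before taking on additional operational load.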
The Methodology: Reliability Maturity
One of the biggest challenges we face is making sure our products are ready for commercial release. If you read Google’s SRE book, you will find a lot about production readiness reviews (PRRs) and their limitations. We firmly believe in the early SRE engagement model for new services, where we are part of the design reviews in the early development stages. For existing services, we do release readiness reviews to ensure reliability. Both approaches use the same reliability maturity model, created in-house, that helps define product readiness. At the beginning of product development, we look at critical super themes like SLOs, SLIs, SLAs, error budgets, observability, high availability, emergency response, PSR, security, one-click deployments, etc., and rate every service that makes up a product to ensure we meet our objectives. We provide the stories with the definition of “done” for each item as part of product development. SREs also enable the product engineering teams by providing templates, code snippets, tools, and frameworks that show what good looks like for each item we are going to evaluate as part of reliability readiness for the product.
As we grow to meet this demand, we are focused on ensuring we have a common framework for managing requirements, tracking and overseeing projects, and automating our integrations and deployments in a repeatable manner. For that, we have built an SRE platform that brings all the tools, technology, and processes together to unify us around a single goal: making our products reliable for our customers.
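The idea of rating every service against a set of themes can be sketched as follows. This is an illustration under stated assumptions, not the in-house model: the theme subset, the integer maturity levels, and the required level of 2 are all hypothetical.

```python
from __future__ import annotations

# Hypothetical subset of the themes named above.
THEMES = (
    "slo",
    "error_budget",
    "observability",
    "high_availability",
    "emergency_response",
    "one_click_deployment",
)


def service_gaps(scores: dict[str, int], required_level: int = 2) -> list[str]:
    """Return the themes where a service scores below the required level.

    A theme missing from `scores` is treated as level 0 (not yet assessed).
    """
    return [theme for theme in THEMES if scores.get(theme, 0) < required_level]


def product_gaps(
    services: dict[str, dict[str, int]], required_level: int = 2
) -> dict[str, list[str]]:
    """Map each service that is not ready to the themes it still has to address."""
    gaps = {name: service_gaps(scores, required_level) for name, scores in services.items()}
    return {name: missing for name, missing in gaps.items() if missing}


def product_ready(services: dict[str, dict[str, int]], required_level: int = 2) -> bool:
    """A product is ready only when every one of its services has no gaps."""
    return not product_gaps(services, required_level)


if __name__ == "__main__":
    services = {
        "ingest": {theme: 2 for theme in THEMES},
        "api": {**{theme: 3 for theme in THEMES}, "observability": 1},
    }
    print(product_gaps(services))
    print(product_ready(services))
```

Rating per service rather than per product keeps one weak component from hiding behind strong neighbors, and the returned gaps can be turned directly into stories with a definition of "done".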
The Team: Building the team is as important as building the software
Hiring and building the team is one of the biggest challenges. It’s a competitive talent market in a relatively new field. Here are the skills and experience we look for in an SRE:
• First and foremost, software development skills and mindset
• Cloud and system administration
• Infrastructure as code
• Observability - knowing what can go wrong in a complex distributed system, coupled with a strong desire to prevent it
• Performance and test engineering
• And of course, passion!
At Honeywell Connected Enterprise, we are aiming to develop our talent for the future, and we provide training aimed at enhancing our SREs’ skills.
As the saying goes, culture beats strategy on any given day. Our focus is on creating a strong mix of institutional knowledge combined with a fresh perspective to help drive change. Our goal is to build an SRE team of individuals who write software to replace previously manual work, even when the solution is complicated. As Honeywell Connected Enterprise continues to grow, SREs are a critical component of keeping the engine running and driving the future forward.