Site Reliability Engineering

Site Reliability Engineering (SRE) is as much a methodology as a culture, similar in that respect to DevOps.

Changing Culture of Teams to Incorporate SRE Practices

Psychological Profiles of Team Members

There are four basic psychological profiles into which you can generally categorize your team members.

  1. Navigators: Those that want to move forward and help you succeed
  2. Critics: Those that have valid fears about change and have passion and energy about their thoughts and positions
  3. Victims: Those that see change as an attack on them personally
  4. Bystanders: Those that are generally apathetic. For them, you need to work out what their feelings are regarding this change and see if/how you can engage them in the process

The best ways to manage potentially negative emotional responses to change

  1. Involve people in the change
  2. Set realistic expectations
  3. Identify opportunities for co-creation and coach instead of providing a complete solution. Ask questions that lead people to the conclusions that you are looking for to give them the opportunity to discover the answer themselves and then own the ideas
  4. Simplify messaging and focus on key concepts on a group-by-group basis
  5. Ensure that communications are engaging and that training is interactive
  6. Allow people time to build new habits

Measure Everything

  1. Reliability: error budget, SLI, SLO, indicators of user happiness
  2. Toil: how much time is spent on toil
  3. Monitoring: monitor symptoms not causes
    1. Four Golden Signals
      1. Latency
      2. Traffic
      3. Errors
      4. Saturation
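
As a rough illustration of the four golden signals, the sketch below summarises them for a single observation window. The Request record, the golden_signals function, and the choice of a nearest-rank 95th percentile for latency are all assumptions made for this example; a real monitoring system would normally compute these from its own metrics.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one served request.
    latency_ms: float  # time taken to serve the request
    ok: bool           # False when the request ended in an error


def golden_signals(requests, window_s, used, capacity):
    """Summarise the four golden signals for one observation window."""
    if not requests:
        raise ValueError("need at least one request")
    latencies = sorted(r.latency_ms for r in requests)
    # Latency: nearest-rank 95th percentile of observed latencies.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "latency_p95_ms": latencies[p95_index],
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed.
        "error_rate": sum(1 for r in requests if not r.ok) / len(requests),
        # Saturation: how full the constrained resource is (0.0 to 1.0).
        "saturation": used / capacity,
    }


sample = [Request(100.0, True)] * 9 + [Request(900.0, False)]
print(golden_signals(sample, window_s=10, used=3, capacity=4))
```

Here used and capacity stand in for whichever resource is most constrained (CPU cores, memory, connections); which one to watch depends on the service.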

SRE Skills

  • Operations and Software Engineering
  • Monitoring principles
  • Production automation
  • System architecture
  • Troubleshooting and debugging
  • Culture of trust
  • Incident management and communication

Key Concepts

  • SLI: Service Level Indicator. INDICATOR. A quantitative measurement, typically a metric, that expresses the health of a given component in the system.
  • SLO: Service Level Objective. GOAL. A target value for a service's availability or performance as measured by the SLIs.
  • SLA: Service Level Agreement. PROMISE. A guarantee that defines the consequences of missing your SLOs.

Stakeholders

  • Product Managers
  • Executives
  • Customers
  • Developers and SREs

The key is to focus on the user experience.

Choosing a good SLI

  • Define a quality target
  • How do users interact with the product
  • SLI = (good events / valid events) × 100 = %
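
The good-over-valid ratio can be sketched as follows. The event fields, the /healthz path, and the 300 ms latency threshold are hypothetical choices for this example; what counts as "valid" and "good" depends on how users interact with the product.

```python
def is_valid(event):
    # Assumption: health-check probes are not user traffic, so exclude them.
    return event["path"] != "/healthz"


def is_good(event):
    # Assumption: a good event is served without a server error
    # and within a 300 ms quality target.
    return event["status"] < 500 and event["latency_ms"] <= 300


def sli_percent(events):
    """SLI = good events / valid events, expressed as a percentage."""
    valid = [e for e in events if is_valid(e)]
    if not valid:
        raise ValueError("no valid events in this window")
    good = sum(1 for e in valid if is_good(e))
    return 100.0 * good / len(valid)


events = [
    {"path": "/healthz", "status": 200, "latency_ms": 1},   # not valid
    {"path": "/cart", "status": 200, "latency_ms": 120},    # good
    {"path": "/cart", "status": 200, "latency_ms": 250},    # good
    {"path": "/pay", "status": 200, "latency_ms": 90},      # good
    {"path": "/pay", "status": 503, "latency_ms": 40},      # valid, not good
]
print(sli_percent(events))
```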

SLO = an applied SLI; it must include a target and a time window
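
One way to picture an SLO is as an SLI paired with a target and a time window, from which an error budget follows. The SLO class below, its field names, and the 99.9% over 30 days figures are illustrative assumptions, not a standard API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    sli_name: str     # which SLI this objective applies to
    target: float     # required fraction of good events, e.g. 0.999
    window_days: int  # rolling time window the target applies to

    def is_met(self, good_events, valid_events):
        """True when the measured SLI is at or above the target."""
        return good_events / valid_events >= self.target

    def error_budget(self, valid_events):
        """Number of bad events the target tolerates in the window."""
        return (1 - self.target) * valid_events

    def budget_remaining(self, good_events, valid_events):
        """Error budget left after subtracting the bad events seen so far."""
        bad_events = valid_events - good_events
        return self.error_budget(valid_events) - bad_events


slo = SLO(sli_name="availability", target=0.999, window_days=30)
print(slo.is_met(999_500, 1_000_000))
print(round(slo.budget_remaining(999_500, 1_000_000)))
```

A negative remaining budget would mean the objective has been missed for the window.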

SLE = environments

SLU = updates – communications to “customers”

SLY = why?

  Monitor                   Metric
  Time Based                Counts
  OK time/total time = %    Good events/total events = %
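
The time-based and count-based calculations can give quite different answers for the same incident. The sketch below assumes a hypothetical per-minute record of (up, total_requests, failed_requests) and computes both percentages side by side.

```python
def availability(minutes):
    """Return (time_based_percent, count_based_percent).

    minutes: list of (up, total_requests, failed_requests), one per minute.
    """
    # Time-based: OK time / total time.
    ok_minutes = sum(1 for up, _, _ in minutes if up)
    time_based = 100.0 * ok_minutes / len(minutes)

    # Count-based: good events / total events.
    total = sum(t for _, t, _ in minutes)
    failed = sum(f for _, _, f in minutes)
    count_based = 100.0 * (total - failed) / total

    return time_based, count_based


# Nine healthy minutes of busy traffic, then one minute of outage
# during which only a handful of requests arrived.
window = [(True, 1000, 0)] * 9 + [(False, 10, 10)]
time_pct, count_pct = availability(window)
print(f"time based: {time_pct:.2f}%  count based: {count_pct:.2f}%")
```

In this example the outage costs a tenth of the window by time but only a small fraction of requests, so the count-based figure is much higher; an outage at peak traffic would show the opposite effect.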

Gamedays and Chaos Monkey: purposefully breaking things in a controlled way while watching how the system responds