JIN

Photo by James Lee on Unsplash

The fundamental knowledge of System Design — (4) — System Availability

System Availability = Uptime ÷ (Uptime + Downtime)

This is the fourth article in my series on the fundamentals of system design. You can read my previous articles for the earlier parts.

https://www.blameless.com/sre/service-level-objectives
https://opsani.com/resources/site-reliability-engineering-service-level-agreement-terms-explained-sla-slo-sli/

SLI = Service Level Indicator

An SLI is a quantitative measure of the service level, and it should capture what matters most to the business. Typical SLIs include:

  • Uptime of the service
  • Number of transactions
  • Latency
  • Error rate
  • Throughput
  • Response time
  • Durability

SLO = Service Level Objective

An SLO is built around an SLI. It is a target value or target range for the service level, usually expressed as a percentage and tied to a time frame.

90% (1 nine of uptime) = 10% downtime, or 3 days of downtime in the last 30 days

99% (2 nines of uptime) = 1% downtime, or 7.2 hours of downtime in the last 30 days

99.9% (3 nines of uptime) = 0.1% downtime, or 43.2 minutes of downtime in the last 30 days
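The downtime figures above follow directly from the uptime percentage and the length of the window. Here is a minimal Python sketch of that arithmetic; the function name `allowed_downtime` and the default 30-day window are illustrative assumptions, not part of any standard tool.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Return how much downtime an uptime SLO permits within the window.

    slo_percent is the uptime target as a percentage, e.g. 99.9.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # The downtime fraction is whatever the uptime target leaves over.
    downtime_fraction = 1 - slo_percent / 100
    return window * downtime_fraction


if __name__ == "__main__":
    for target in (90, 99, 99.9):
        print(f"{target}% uptime allows {allowed_downtime(target)} of downtime in 30 days")
```

Passing a different `window` (for example `timedelta(days=365)`) gives the yearly budget for the same target.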

SLA = Service Level Agreement

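An SLA is an agreement between a service provider and its customers. It states what the provider owes the customer when the promised service level is not met, for example: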

  • Refund service fee
  • Provide free service for a period of time
https://ophir.wordpress.com/2011/01/31/does-sla-really-mean-anything/

Case Study

Assume I have a website http://xxx.com. From its launch on January 1, 2022, to March 15, 2022, the request data is as follows:

  • In January, there were 500 requests in total, 20 of which returned errors
  • In February, there were 600 requests in total, 10 of which returned errors, and the site was down for 10 minutes
  • In March so far, there have been 400 requests in total, 15 of which returned errors

What, then, are the SLI, SLO, and SLA?

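SLI (request success rate): 1 - (20 + 10 + 15) / (500 + 600 + 400) = 97%

SLO (the availability to compare against the objective, measured over the 74 days from January 1 to March 15): 1 - 10 / (74 × 24 × 60) ≈ 99.991%

SLA: suppose the signed agreement promises 99.999% availability. The measured 99.991% falls short of that target, so the SLA determines how much compensation the provider owes the customer.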
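The case study arithmetic can be reproduced in a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the 74-day window assumes both January 1 and March 15 are counted.

```python
from datetime import date

# (total requests, error responses) per month, taken from the case study.
monthly_requests = {
    "January": (500, 20),
    "February": (600, 10),
    "March": (400, 15),
}

total_requests = sum(total for total, _ in monthly_requests.values())
total_errors = sum(errors for _, errors in monthly_requests.values())

# SLI: fraction of requests that did not return an error.
success_rate = 1 - total_errors / total_requests

# Observation window, counting both the first and the last day.
observed_days = (date(2022, 3, 15) - date(2022, 1, 1)).days + 1
observed_minutes = observed_days * 24 * 60
downtime_minutes = 10

# Availability measured over the window, to compare with the target.
availability = 1 - downtime_minutes / observed_minutes

# Assumed contractual target from the example.
sla_target = 0.99999
meets_target = availability >= sla_target

if __name__ == "__main__":
    print(f"SLI (success rate): {success_rate:.2%}")
    print(f"Availability over {observed_days} days: {availability:.3%}")
    print(f"Meets the {sla_target:.3%} target: {meets_target}")
```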

The application

As a real-world example, these are the terms of the agreement under which Google has agreed to provide Google Cloud Platform to customers.

https://cloud.google.com/spanner/sla

Ideally, an SLI directly measures a specific quality of the service. In many cases, however, the direct measurement is hard to observe or obtain, so only a proxy indicator can be used. Latency is the most direct indicator to monitor. Durability, meaning how long data can be kept intact, is also an important metric for data storage systems. While 100% availability is impossible to achieve, availability close to 100% is an achievable goal. Operations engineers often describe availability by the number of nines. For example, 99% availability is called "2 nines" and 99.99% availability is called "4 nines". The current availability target for Google cloud computing services is "3.5 nines", that is, 99.95% availability.
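One way to put a number on "nines" is to take the negative base-10 logarithm of the unavailable fraction. The sketch below assumes that convention; the function name `nines` is hypothetical. Note that the spoken shorthand "3.5 nines" for 99.95% refers to the trailing 5 in the percentage, while this logarithmic measure gives roughly 3.3 for the same value.

```python
import math


def nines(availability_percent: float) -> float:
    """Return the number of nines for an availability percentage.

    Uses -log10 of the unavailable fraction, so 99 -> 2 and 99.99 -> 4
    (up to floating-point rounding).
    """
    if not 0 <= availability_percent < 100:
        raise ValueError("availability_percent must be in [0, 100)")
    unavailable_fraction = 1 - availability_percent / 100
    return -math.log10(unavailable_fraction)


if __name__ == "__main__":
    for percent in (99, 99.9, 99.95, 99.99):
        print(f"{percent}% availability is about {nines(percent):.2f} nines")
```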

Choosing a target SLO is not a purely technical activity, because product and business decisions are involved as well, and the choice of SLIs and SLOs should directly reflect those decisions. Site reliability engineers (SREs) should take part in the discussion and advise on feasibility and risk, which is why it is important to understand the system's indicators and limitations. Select only as many SLOs as are needed to cover the system's properties.

SLIs and SLOs are very useful when making decisions about system operation and maintenance:

  1. Monitor and measure the SLIs of the system
  2. Compare the SLIs with the SLOs to decide whether action is required
  3. If action is required, decide exactly what needs to be done to meet the target
  4. Perform that action

For example, suppose step 2 shows that request latency is rising and that, if nothing is done, the SLO will be missed within a few hours. In step 3, we might check whether the servers are short of CPU resources and decide to add CPUs to spread the load. Without an SLO, we would not know whether (or when) to act.
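Step 2 of the loop, comparing the SLI with the SLO and deciding whether to act, could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the function names, the straight-line extrapolation of the latency trend, and the 6-hour action horizon.

```python
from typing import Optional


def hours_until_breach(latency_ms: float,
                       trend_ms_per_hour: float,
                       slo_ms: float) -> Optional[float]:
    """Estimate the hours left before latency exceeds the SLO threshold.

    Returns 0.0 if the SLO is already missed, and None if latency is
    flat or falling, in which case no breach is projected.
    """
    if latency_ms >= slo_ms:
        return 0.0
    if trend_ms_per_hour <= 0:
        return None
    # Straight-line extrapolation of the current trend (an assumption).
    return (slo_ms - latency_ms) / trend_ms_per_hour


def needs_action(latency_ms: float,
                 trend_ms_per_hour: float,
                 slo_ms: float,
                 horizon_hours: float = 6.0) -> bool:
    """Decide whether the projected breach is close enough to act on."""
    eta = hours_until_breach(latency_ms, trend_ms_per_hour, slo_ms)
    return eta is not None and eta <= horizon_hours


if __name__ == "__main__":
    # Latency is 200 ms, rising by 25 ms per hour, against a 300 ms SLO.
    print(hours_until_breach(200, 25, 300))
    print(needs_action(200, 25, 300))
```

Steps 3 and 4 (diagnosing the cause, such as a CPU shortage, and applying the fix) stay with the operator or a separate automation; this sketch only covers the decision to act.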

An SLA requires the business and legal departments to choose appropriate consequence clauses. The role of the site reliability engineer is to help them understand how likely, and how difficult, it is to meet the SLOs that the SLA contains. Google guarantees that the service's annual availability is at least 99.99%. Google also guarantees a first response within 1 hour of a user's request for technical support, whether by phone, email, or another channel. The agreement also sets out detailed credit and compensation terms.

https://cloud.google.com/spanner/sla
