The fundamental knowledge of System Design — (4) — System Availability
Availability = Uptime ÷ (Uptime + Downtime)
This is the fourth article in my series on the fundamental knowledge of system design. You can read my previous articles.


SLI = Service Level Indicator
An SLI is a quantitative measure of how the service behaves, chosen because it matters most to the business. Common SLIs include:
- Uptime of the service
- Number of transactions
- Latency
- Error rate
- Throughput
- Response time
- Durability
SLO = Service Level Objective
An SLO is built around an SLI. It is a target value or target range for a service level, usually expressed as a percentage and tied to a time frame.
90% (1 nine of uptime) = 10% downtime, or 3 days of downtime in the last 30 days
99% (2 nines of uptime) = 1% downtime, or 7.2 hours of downtime in the last 30 days
99.9% (3 nines of uptime) = 0.1% downtime, or 43.2 minutes of downtime in the last 30 days
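The downtime figures above follow directly from the window length and the target percentage. Here is a minimal sketch of that arithmetic; the function name `allowed_downtime` and the 30-day default window are my own choices for illustration, not part of any standard library.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window_days: int = 30) -> timedelta:
    """Return the downtime budget for an availability target over a window.

    `slo_percent` is the target, for example 99.9 for "3 nines".
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    window = timedelta(days=window_days)
    # The budget is the share of the window that may be spent down.
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (90.0, 99.0, 99.9):
        print(f"{target}% over 30 days -> {allowed_downtime(target)}")
```

Running it for 90%, 99%, and 99.9% should reproduce the 3 days, 7.2 hours, and 43.2 minutes listed above, up to floating-point rounding.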
SLA = Service Level Agreement
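An SLA is an agreement between a service provider and its customers that spells out what happens when the SLO is not met. Typical compensation includes: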
- Refund service fee
- Provide free service for a period of time

Case Study
Assume I have a website http://xxx.com. From the launch on January 1, 2022, to March 15, 2022, the request data is as follows:
- In January, there were 500 requests in total, 20 of which returned errors.
- In February, there were 600 requests in total, 10 of which returned errors, and the downtime was 10 minutes.
- In March so far, there have been 400 requests in total, 15 of which returned errors.
What are the SLI, SLO, and SLA in this case?
SLI: 1 - (20+10+15)/(500+600+400) = 97%
SLO: 1 - (10/(74*24*60)) = 99.991%, which is the measured availability over the 74 days since launch and the figure to compare against the target.
SLA: if the service provider fails to meet the terms of the agreement, for example because availability does not reach an SLO of 99.999%, the signed SLA determines how much compensation is owed.
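The two percentages in the case study can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are mine, and the numbers are copied from the case study above.

```python
from datetime import date

# Monthly figures from the case study: (total requests, error responses).
requests_by_month = {
    "January": (500, 20),
    "February": (600, 10),
    "March": (400, 15),
}
downtime_minutes = 10  # only February reported downtime

total = sum(t for t, _ in requests_by_month.values())
errors = sum(e for _, e in requests_by_month.values())
success_ratio = 1 - errors / total

# January 1 through March 15, 2022, counting both end dates, is 74 days.
days = (date(2022, 3, 15) - date(2022, 1, 1)).days + 1
availability = 1 - downtime_minutes / (days * 24 * 60)

print(f"request success ratio: {success_ratio:.2%}")
print(f"time-based availability: {availability:.3%}")
```

This prints 97.00% for the request success ratio and 99.991% for the time-based availability, matching the hand calculation.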
The application
A real-world example is the Google Cloud Platform SLA, which sets out the terms under which Google has agreed to provide Google Cloud Platform to customers.

Ideally, an SLI should directly measure a specific quality of the service. In many cases, however, the quality of interest is very difficult to observe directly, so a proxy indicator has to be used instead. Latency is the most direct indicator to monitor. Durability is also an important metric for data storage systems, as it tracks how long data can be kept intact.
While 100% availability is impossible to achieve, an availability close to 100% is an achievable goal. Operations experts often use the number of nines to describe availability. For example, 99% availability is called “2 nines” and 99.99% availability is called “4 nines”. The current availability indicator for Google cloud computing services is “3.5 nines”, that is, 99.95% availability.
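To make the "nines" shorthand concrete, here is a small sketch that converts an availability percentage into a number of nines using a base-10 logarithm. The function name `nines` is hypothetical. The logarithmic value for 99.95% comes out at about 3.3, whereas the informal "3.5 nines" label simply reads the trailing 5 as half a nine.

```python
import math


def nines(availability_percent: float) -> float:
    """Express an availability percentage as a (possibly fractional) number of nines."""
    if not 0 <= availability_percent < 100:
        raise ValueError("availability_percent must be in the range [0, 100)")
    unavailability = 1 - availability_percent / 100
    # 99% leaves 10^-2 unavailable, 99.99% leaves 10^-4, and so on.
    return -math.log10(unavailability)


if __name__ == "__main__":
    for pct in (99.0, 99.95, 99.99):
        print(f"{pct}% is about {nines(pct):.2f} nines")
```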
Choosing a target SLO is not a purely technical activity, because product and business-level decisions are also involved. The choice of SLI and SLO should directly reflect those decisions. Site reliability engineers (SREs) should take part in the discussion and advise on feasibility and risk. That’s why it is important to understand the various indicators and limitations of the system. Select only as many SLOs as are needed to cover the system’s important properties.
SLI and SLO are very useful when making decisions about system operation and maintenance:
1. Monitor and measure the SLI of the system.
2. Compare the SLI with the SLO to decide whether action is required.
3. If action is required, decide exactly what needs to be done in order to meet the goal.
4. Perform those operations.
For example, suppose that in step 2 the request latency is rising and, if nothing is done, the SLO will be breached within a few hours. Step 3 would then check whether the servers are short of CPU resources and, if so, add CPU capacity to spread the load. Without an SLO, we would not know whether (or when) action needs to be taken.
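The monitoring steps and the latency example above can be sketched as a small decision function. Everything here is an assumption for illustration: the `LatencySLO` class, the 300 ms threshold, the 95% target, and the 80% CPU cut-off are hypothetical values, not recommendations or any real monitoring API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencySLO:
    threshold_ms: float  # a request counts as "good" if it finishes within this time
    target_ratio: float  # fraction of requests that must be "good"


def measure_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Step 1: measure the SLI as the fraction of requests within the threshold."""
    if not latencies_ms:
        return 1.0  # no traffic means nothing violated the threshold
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


def decide(sli: float, slo: LatencySLO, cpu_utilization: float) -> str:
    """Steps 2 and 3: compare the SLI with the SLO and choose an action."""
    if sli >= slo.target_ratio:
        return "no action"
    # Hypothetical rule: treat CPU above 80% as the likely bottleneck.
    if cpu_utilization > 0.8:
        return "add CPU capacity"
    return "investigate other causes"


if __name__ == "__main__":
    slo = LatencySLO(threshold_ms=300, target_ratio=0.95)
    sli = measure_sli([120, 180, 250, 900, 1100], slo.threshold_ms)
    # Step 4 would carry out whatever action is returned here.
    print(decide(sli, slo, cpu_utilization=0.92))
```

In a real system the latency samples and CPU figure would come from your monitoring stack, and the action would usually be an alert or an autoscaling request rather than a returned string.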
An SLA requires the business and legal departments to choose the appropriate consequence clauses. The role of the site reliability engineer is to help the business and legal departments understand how likely and how difficult it is to meet the SLOs contained in the SLA. Google guarantees that the service’s annual availability is ≥ 99.99%. Google also guarantees a first response within 1 hour of a user’s request for technical support, whether by phone, email, or other channels. The terms also come with many details on rewards and compensation.
