Why are latency outliers so important to mitigate?
In a 2015 paper, Meta presented some real-life implementation details of the social graph used by Facebook.1 The authors start with a hypothetical case of two posts by Alice that friends Bob and Carol each comment on and like. When Alice picks up her phone and opens Facebook, the news feed needs to identify who her friends are and what their posts are, as well as to set queries to notify Alice that Bob and Carol liked and commented on her post.
First, let’s consider a solution (Figure 1) where a single worker executes n subqueries one after another. In this case, execution time is O(n), roughly n times the average (not worst-case) lookup latency.
Hyperscalers take a different approach (Figure 2). They use n workers, each doing one lookup, and then aggregate the n results. In this case, execution time is O(1) with respect to n, but it is set by the worst-case latency of any of the n nodes.
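The difference between the two approaches can be sketched with a small simulation. This is an illustrative model only: the lookup latencies, the outlier probability and the function names below are assumptions chosen for the example, not figures from the Meta paper.

```python
import random

def lookup_latency_ms(rng, p_outlier=0.001, typical_ms=0.1, outlier_ms=10.0):
    """One lookup: usually fast, occasionally a slow outlier (assumed values)."""
    return outlier_ms if rng.random() < p_outlier else typical_ms

def serial_query_ms(rng, n):
    # One worker runs n lookups back to back, so the latencies add up.
    return sum(lookup_latency_ms(rng) for _ in range(n))

def fanout_query_ms(rng, n):
    # n workers run one lookup each; the join waits for the slowest one.
    return max(lookup_latency_ms(rng) for _ in range(n))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100
serial = [serial_query_ms(rng, n) for _ in range(2000)]
fanout = [fanout_query_ms(rng, n) for _ in range(2000)]

# Fraction of fan-out queries that were held up by at least one outlier.
slow_fanout = sum(1 for t in fanout if t >= 10.0) / len(fanout)

print(f"serial mean:  {sum(serial) / len(serial):.2f} ms")
print(f"fan-out mean: {sum(fanout) / len(fanout):.2f} ms")
print(f"fan-out queries hit by an outlier: {slow_fanout:.1%}")
```

In this model the serial query pays roughly n times the typical latency on every run, while the fan-out query is usually as fast as one lookup but takes the full outlier latency whenever any one of its n lookups is slow.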
The Meta paper goes on to note that real queries are worse still: a single query fans out into thousands of subqueries, with a critical path that is dozens of fork-joins deep (Figure 3). Even a single outlier (in this case, at three nines) would affect the performance of nearly every query. This outcome justifies looking at latencies at four nines at least, if not at six or seven nines (six-nines latency is shown in Figure 6).
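The arithmetic behind "nearly every query" can be sketched as follows. The sketch assumes that subquery latencies are independent and that "k nines" means one lookup in 10^k exceeds the stated latency; the function name and the subquery count of 5,000 are hypothetical and used for illustration only.

```python
def p_query_hits_outlier(n_subqueries, nines):
    """Probability that at least one of n independent subqueries is an outlier.

    An outlier here is a lookup slower than the 'nines' percentile,
    for example nines=3 means the slowest 1 in 1,000 lookups.
    """
    p_outlier = 10.0 ** (-nines)
    return 1.0 - (1.0 - p_outlier) ** n_subqueries

# Hypothetical query that fans out into 5,000 subqueries.
for nines in (3, 4, 5, 6):
    share = p_query_hits_outlier(5000, nines)
    print(f"{nines} nines: {share:.2%} of queries see at least one outlier")
```

Under these assumptions, an outlier at three nines touches almost every query of this size, and the share of affected queries only becomes small once the latency is controlled out to roughly six nines.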
This situation happens well beyond Meta and its social graph, and it includes many database-intensive applications. A good discussion in another Micron blog looks at a YCSB database application against various storage solutions, including Micron 7450.2,3
At a minimum, it’s good to look at read-intensive workloads that also carry a “firestorm” of writes (such as 70% reads, 30% writes) and are deeply queued to ensure full pressure on the NAND array and controller. Examining read tail latencies under various pressures is also important, as those will be closer to the typical daily server experience.
What causes latency variations, and how are they mitigated?
Using the nomenclature of CPU architecture, an SSD is both deeply pipelined (many stages) and superscalar (many parallel stages). Focusing on pipeline stalls is key for the performance of CPUs and SSDs. In the case of SSDs (Figure 4), the pipeline stalls can come from many sources that we’ll consider below.
Some of the lowest order impacts to this idealized latency result from attempting to read data where the die or NAND bus is busy servicing the request of another pipeline stage, an occurrence commonly called a plane, die or channel collision. When a read conflicts with a read, it’s common to let the pipe stall on the later read and complete the in-progress read first.
Adding host writes (30% in this case) and their associated garbage collection (GC) creates not only additional reads but also programs and erases to the NAND. Program latencies can be five to 10 times as long as NAND reads, and NAND erases can be an order of magnitude longer than NAND reads. Figure 5 gives a somewhat humorous view of GC impacts on host activity, while Figure 6 details what the read pipeline stalls would look like without suspends.
This is where pipeline “suspends” for programs and erases come in: they pause an in-progress program or erase so that host reads can be serviced. Working closely together, NAND component engineers, system-on-chip engineers and firmware engineers developed program and erase suspends to help mitigate these latencies. Today we see well under 2 ms of latency impact at five nines for the above workload, a result that is at least five times better.
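The effect of a suspend on read tail latency can be sketched with a toy model. Everything here is an assumption made for illustration: the timings, the fraction of reads that land on a busy die and the function names are hypothetical, and the model ignores erases, queuing and suspend limits that real firmware must handle.

```python
import random

READ_US = 60.0      # hypothetical NAND read time
PROGRAM_US = 500.0  # hypothetical program time, within 5x to 10x of a read
SUSPEND_US = 20.0   # hypothetical time needed to suspend a program

def read_latency_us(rng, busy_fraction, suspend):
    """Latency of one host read that may arrive while the die is programming."""
    if rng.random() >= busy_fraction:
        return READ_US  # die is idle, no stall
    remaining = rng.uniform(0.0, PROGRAM_US)  # program time still left
    # With suspend, the read waits only for the suspend (or the program's end,
    # whichever comes first); without it, the read waits out the program.
    stall = min(remaining, SUSPEND_US) if suspend else remaining
    return stall + READ_US

def percentile(samples, q):
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

rng = random.Random(1)  # fixed seed so the run is repeatable
no_suspend = [read_latency_us(rng, 0.3, False) for _ in range(100_000)]
with_suspend = [read_latency_us(rng, 0.3, True) for _ in range(100_000)]

p999_no = percentile(no_suspend, 0.999)
p999_yes = percentile(with_suspend, 0.999)
print(f"three-nines read latency without suspend: {p999_no:.0f} us")
print(f"three-nines read latency with suspend:    {p999_yes:.0f} us")
```

In this model the tail without suspends is dominated by the full program time, while with suspends it is bounded by the read time plus the suspend overhead.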
How can pipeline stalls be resolved?
Let’s head back to the freeway visual in Figure 5. An SSD with deeply queued reads and writes with the associated garbage collection is like a multilane freeway. Latency outliers on a freeway (aka traffic jams) are very similar to latency outliers in an SSD. A couple of latency outlier strategies can best be understood through the analogy of preventing freeway traffic jams.
Analogy 1: Don’t let freight trains block busy freeways (I’m not joking)
Obviously, a freight train blocking all lanes of traffic on a busy freeway at rush hour would guarantee a traffic jam (Figure 7). Well, it has taken time for SSD designers to fully appreciate that fact. Although it seems obvious, the OCP data center NVMe™ SSD specification calls out more than once the need to not have a freight train cross a busy freeway, likely stemming from the OCP authors’ experience with prior designs:
- SMART IO shall not block any host IO (SLOG-6).
- Other periodically monitored logs (LMLOG-4 and TEL-5) shall constrain IO blocking to a small figure, ~1 ms.
Analogy 2: Employ freeway on-ramp regulation
Another useful tool to prevent traffic jams is freeway on-ramp metering (Figure 8). We have probably all experienced this approach, which seems counterintuitive as a way to increase throughput. Data from the U.S. Department of Transportation shows a reduction in travel time (latency) of 20% or more.5 SSDs follow the same concept to prevent congestion that leads to latency outliers. We throttle not only writes but also overall I/O, including garbage collection, to ensure peak performance without the frustrating delay of an internal traffic jam on the SSD.
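On-ramp metering maps naturally onto a token bucket, a common admission-control technique. The sketch below is a generic illustration of that technique, not a description of Micron firmware; the class name, rate and burst size are hypothetical.

```python
class TokenBucket:
    """Admit work at a steady rate, allowing short bursts (generic sketch)."""

    def __init__(self, rate_per_s, burst):
        self.rate_per_s = rate_per_s  # tokens added per second
        self.burst = burst            # maximum tokens that can accumulate
        self.tokens = float(burst)    # start with a full bucket
        self.last_s = 0.0             # time of the previous refill

    def try_admit(self, now_s, cost=1.0):
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now_s - self.last_s
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.last_s = now_s
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # the I/O may enter the pipeline now
        return False      # the I/O waits at the "on-ramp"

# Hypothetical limits: 1,000 I/Os per second with a burst of 4.
bucket = TokenBucket(rate_per_s=1000.0, burst=4)
admitted_burst = [bucket.try_admit(now_s=0.0) for _ in range(6)]
admitted_later = bucket.try_admit(now_s=0.002)
print(admitted_burst, admitted_later)
```

Holding excess requests at the entry point keeps the internal queues short, which is the same trade the freeway meter makes: a brief wait at the ramp in exchange for free-flowing lanes. A drive could apply separate buckets to host writes and to garbage collection, though how any given SSD partitions its budget is outside what this sketch shows.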
Why is Micron committed to being best in class for tail latency mitigation?
As mentioned before, the industry struggled with latency outliers six to seven years ago, before the advent of program and erase suspends. That approach, together with ingress throttling and optimization of NAND and controller interactions, has evolved Micron SSDs (Figure 9) to the degree that, today, we consider them to be best in class.
So next time you pick up your mobile device and surf social media, admire the great end-user experience you get in terms of responsiveness. Admire the wonder of literally thousands of servers running thousands of parallel queries that are deeply fork-joined. Even if you’re running a server farm on a much smaller scale, having consistent and predictable performance is key to ensuring consistent service to your customers. This performance is where Micron’s mainstream data center SSDs shine.
1 Challenges to adopting stronger consistency at scale, Meta Research, 2015
2 Identifying latency outliers in workload testing, Micron, 2023
3 Comparing Micron 7450, Samsung PM9A3 and Solidigm D5-P5430, Micron, 2023
4 Avoiding costly read latency variations in SSDs through I/O determinism, Wells, FMS 2017
5 Ramp metering: A proven, cost-effective operational strategy, U.S. Department of Transportation, 2014