Tackling SSD endurance made me a better product manager

Rahul Jairaj | August 2024

Three years ago, in my first month in a new product management role, I received a frantic call from an engineer: “Rahul, we have a problem. We can’t meet the TBW spec on one of our capacities. Can you talk with the customer and get a waiver?”

My first reaction was panic. I had only a cursory sense of what TBW (terabytes written, a measure of SSD endurance) was, and I certainly didn’t feel ready to have a drawn-out call with a customer about it. After a few internal huddles and debriefings, I not only gained a better sense of the issue but also “fell in love” with the problem, savoring the time spent examining and solving it.

Over the next few weeks, we were able to deliver a solution to our customer. The product in question was the Micron 3500 SSD (Figure 1). Once we solved the issue, the product went on to be rated one of the best client SSDs ever made. In fact, according to Jon Coulter of TweakTown, “It’s simply the best OEM SSD ever made.”

Since then, almost every product generation has had us revisiting SSD endurance and making tradeoffs to meet our customers’ needs.

Figure 1: Micron client SSD portfolio


SSD endurance

To me, one of the most exciting parts of being a technical product manager is learning something new every day. One such topic is SSD endurance: essentially, how long an SSD will last.

SSD endurance is measured a few different ways: DWPD (drive writes per day) for enterprise SSDs and TBW for client SSDs. DWPD expresses how much of the SSD’s capacity can be written each day, as a fraction or multiple of that capacity. For example, a 1TB enterprise SSD rated at 0.3 DWPD can take about 300GB of writes (30% of its capacity) each day until the end of its warranty period. TBW is a measure of how many terabytes of data (1000s of gigabytes) can be written before the drive is expected to wear out. Each capacity has its own TBW value (Figure 2).
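The DWPD example above can be checked with a short Python sketch. The function names and the five-year warranty period are assumptions chosen for illustration; the article does not state a warranty length for the enterprise drive.

```python
def daily_write_budget_gb(capacity_gb: float, dwpd: float) -> float:
    """GB that can be written per day at the rated DWPD."""
    return capacity_gb * dwpd


def tbw_from_dwpd(capacity_gb: float, dwpd: float, warranty_years: float) -> float:
    """TB written if the full daily budget is used every day of the warranty."""
    days = warranty_years * 365
    return daily_write_budget_gb(capacity_gb, dwpd) * days / 1000  # GB -> TB (decimal)


# 1TB drive at 0.3 DWPD, as in the text; the 5-year warranty is an assumption.
budget = daily_write_budget_gb(1000, 0.3)
total = tbw_from_dwpd(1000, 0.3, warranty_years=5)
print(f"{budget:.0f} GB/day, {total:.1f} TB over the assumed warranty")
```

Under these assumptions the two metrics describe the same budget: a daily allowance (DWPD) or the lifetime total it adds up to (TBW).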

Figure 2: TBW values for the Micron 3500, with mean time between failures (MTBF) of 2 million hours


What Figure 2 tells us is that, for a 512GB SSD, you can write at least 300 terabytes of data before it stops working. To understand how much 300 terabytes is, consider this small illustration.

If you were to write and overwrite 100GB of data on your computer every day for three years straight, you would only hit ~107 TBW before the warranty on the drive expired. That’s about a third of the rated endurance of the drive. And can you imagine writing 100GB of data every day in that time? Most of us wouldn’t come close to that number in a month!
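For readers who want to retrace the arithmetic, here is a minimal sketch. It assumes 365-day years; the ~107 figure appears to correspond to counting 1,024GB per terabyte, while the decimal convention (1,000GB per terabyte) gives about 109.5. Either way the result is roughly a third of the 300 TBW rating.

```python
GB_PER_DAY = 100        # daily writes from the illustration
DAYS = 3 * 365          # three years, ignoring leap days (assumption)
RATED_TBW = 300         # 512GB rating quoted from Figure 2

total_gb = GB_PER_DAY * DAYS
tb_decimal = total_gb / 1000   # 1 TB = 1,000 GB
tb_binary = total_gb / 1024    # 1 TB = 1,024 GB
share_of_rating = tb_decimal / RATED_TBW

print(f"decimal: {tb_decimal:.1f} TB, binary: {tb_binary:.1f} TB")
print(f"share of rated endurance: {share_of_rating:.1%}")
```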

The TBW spec for a given drive is determined by the following simplified equation:

TBW = (SSD capacity × P/E cycle count) / WAF

As you can see, the larger the SSD capacity, the greater the TBW. And the same goes for the program/erase (P/E) count. However, TBW is inversely proportional to the write amplification factor (WAF). WAF is simply how many times the user data ends up being written inside the SSD once internal rewriting and moving of data are counted. Several factors affect WAF, the most critical being the workload you subject the SSD to. For typical client workloads, that number is low and hovers around 3 to 4.
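The relationships described above can be sketched in a few lines of Python, assuming the simplified equation takes the common form TBW = capacity × P/E cycles / WAF. The P/E cycle count of 2,000 is a hypothetical value picked for illustration, not a Micron specification, and the function name is likewise an assumption.

```python
def tbw_tb(capacity_gb: float, pe_cycles: float, waf: float) -> float:
    """Simplified TBW estimate in decimal terabytes (assumed formula)."""
    if waf <= 0:
        raise ValueError("WAF must be positive")
    return capacity_gb * pe_cycles / waf / 1000  # GB -> TB


PE_CYCLES = 2000  # hypothetical P/E rating, for illustration only

for waf in (3, 4):  # typical client range quoted in the text
    print(f"512GB, WAF {waf}: {tbw_tb(512, PE_CYCLES, waf):.0f} TB")

# Doubling capacity doubles TBW; a higher WAF lowers it.
print(f"1024GB, WAF 4: {tbw_tb(1024, PE_CYCLES, 4):.0f} TB")
```

With these made-up inputs, a one-step change in WAF moves the result by about a quarter, which is why the workload matters so much.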

The other metric associated with SSD endurance is MTBF, or mean time between failures. MTBF is a measure of the average time between drive failures; it’s not an absolute metric. MTBF for an SSD is a complicated metric to calculate and depends on the reliability of all the individual SSD components. That said, Micron’s client SSDs are usually rated at 2 million hours of MTBF. This rating means that, on average, a Micron client SSD will fail approximately every 230 years. That’s a very low failure rate!
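A quick way to sanity-check the 230-year figure is to convert MTBF hours to years. The sketch below also shows how MTBF is often read as a population statistic, converting it to an annualized failure rate under the textbook assumption of a constant failure rate. That assumption and the function names are mine, not a description of how Micron derives its rating.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap days


def mtbf_years(mtbf_hours: float) -> float:
    """MTBF expressed in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Chance a drive fails within one year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


years = mtbf_years(2_000_000)
afr = annualized_failure_rate(2_000_000)
print(f"{years:.0f} years, AFR of about {afr:.2%}")
```

Read this way, a 2 million hour MTBF suggests that in a fleet of 1,000 drives running all year, roughly four would be expected to fail.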

As you dive deeper into each one of these variables, you quickly realize that the final TBW is a tradeoff between a multitude of factors: SSD performance, NAND defectivity, NAND media type (SLC, TLC or QLC), SSD workload, NAND block size, NAND valid block count, static SLC P/E cycle count, super block architecture, and on and on.

I don’t profess to be an expert in all these varied topics. Instead, I rely on specialized engineering teams, the real experts, to solve product- and customer-specific challenges. My job is to frame the problem space accurately. As a result, what we learn from each problem we solve grows our arsenal of options for future customers’ specific needs.

Options for the future

As you might have heard, AI PCs are here and changing the game. My colleague Prasad Alluri, Micron’s vice president and general manager for client storage in the Storage Business Unit, wrote about this transformation at length in his blog titled AI in PC: Why not?

One of the many unknowns in this new revolution is how workloads will change and what that will mean for SSD endurance. As product managers, we must plan for these eventualities. Thanks to solving the endurance problem on several product generations, we now understand the tradeoffs that can help increase SSD endurance by up to 10 times if it’s needed for running complex vision-language models (VLMs) locally on the PC. Like I’ve said before, the best is yet to come, and we’re all excited about what the future holds.

Enduring lessons

When I started my product management journey, every product or customer issue filled me with existential dread — but now those very issues give me a sense of purpose and excitement. Solving a complicated problem is an opportunity to learn and innovate. These past three years, I have learned many valuable lessons on the job. I hope they can help you as well.

  1. Stay calm — panicking gets you nowhere.  
  2. It’s OK not to have all the answers.  
  3. Ask for help — collaborate and trust your team. 
  4. Be comfortable with ambiguity and churn — strive to create clarity.  
  5. Sometimes the obvious solution isn’t the right path to take.
  6. Get curious about and fall in love with solving the problem, not the solution alone. 
  7. Strive to learn every day and welcome every problem. 
  8. Teach and disseminate what you’ve learned because those activities ensure collective success.  

For those who stuck around until the end, you may wonder what the solution to the Micron 3500 TBW issue was.

We learned that, on our Micron G8 NAND, there had been several new process innovations to scale to these new heights. Given the newness, we had made very pessimistic defectivity projections that prevented us from reaching our 2 million hour MTBF target for a given TBW specification. We worked with our customer’s team and realized their spec required a lower MTBF target. So, we were able to meet their endurance asks without delay. Finally, when the product came out, our defectivity was significantly lower than projected, so we met our original 2 million hour MTBF target without concern or compromise. Being open to the problem led to a win-win solution.

DIRECTOR, TPM - CLIENT

Rahul Jairaj

Rahul Mitchell Jairaj is the Director of Technical Product Management for Micron's Client SSD Business Unit. He has spent his career working on NAND flash storage at Micron, from components engineering to SSD product management. He holds a master's degree in Semiconductor Device Physics from Clemson University and a bachelor's in electrical engineering. Outside work, Rahul is passionate about collecting fossils and amateur microscopy.