How AI is transforming the PC landscape and what it means for memory and storage
AI is ubiquitous. You cannot get through a day without hearing about or seeing AI in action. From smart assistants to self-driving cars, AI is changing the way we interact with the world. But what about the PC? Can AI make your PC smarter, faster, and more personalized? In this blog, we will explore how AI is transforming the PC landscape and what it means for memory and storage. At CES 2024, all the buzz was about AI: more than 50% of the coverage at the show was related to AI.
AI is powered by large language models (LLMs), models trained on the vast amounts of unlabeled text that humans have accumulated. The natural-language queries that return human-like responses are built on neural networks with billions of parameters, and in some cases multiple networks linked together to generate content. Some of the most popular examples are ChatGPT and DALL-E, which can produce realistic and creative text and images based on user input. These models are impressive, but they also require a lot of computing power and data to run. That is why most of them are hosted in the cloud, where they can access the massive hardware infrastructure and network bandwidth they need.
However, the cloud is not the only place where AI can happen. There are many reasons why moving some of the AI processing to the edge, i.e., the devices on the user end, can be beneficial. For instance, edge AI can reduce latency, improve privacy, save network costs, and enable offline functionality. Imagine if you could use your PC to generate high-quality content, edit photos and videos, transcribe speech, filter noise, recognize faces, and more, without relying on the cloud. Wouldn’t that be awesome?
Why the PC?
Of course, PCs are not the only devices that can benefit from edge AI. Smartphones, tablets, smartwatches, and other gadgets can also leverage AI to enhance their features and performance. But the PC has some unique advantages that make it a suitable platform for edge AI. First, PCs have large screens, which can display more information and provide a better user experience. Second, PCs have large batteries, which can support longer and more intensive AI tasks. Third, PCs have powerful compute, which can handle more complex and demanding AI models.
These advantages have not gone unnoticed by chip makers and software developers. Companies like Intel, AMD, Qualcomm, MediaTek, and Nvidia are embedding increasingly powerful neural processing engines and/or integrated graphics in their PC CPUs and chipsets, which can deliver tens of TOPS (trillions of operations per second) of AI performance. Microsoft has also stated that a Windows 11 release this year will include optimizations that take advantage of these embedded AI engines in CPUs. That should not be a surprise considering the push Microsoft is making for Copilot, a feature that uses AI to help users write code, debug errors, and suggest improvements. Some of these players are also working with ISVs to enable AI-optimized applications: enhanced video conferencing, photo editing, voice-to-text conversion, background and ambient noise suppression, and facial recognition, to name a few. Whether these in-development applications will impress anyone, or whether the killer application is yet to come, is still speculation. But the key questions remain. How can we run AI models on the PC efficiently and effectively? And …
What does it mean for the hardware capabilities of the PC?
One of the main challenges of running AI models on the PC is model size. AI models, especially LLMs, can have billions or even trillions of parameters, which require a lot of memory and storage to hold and load. For example, our internal experiments show that a 70-billion-parameter Llama 2 model at 4-bit precision, a state-of-the-art LLM for natural language generation, takes about 42GB of memory for loading and inferencing, with an output speed of 1.4 tokens/second. That is far more memory than a typical PC has. This, in essence, states the problem and sets the direction for the future: function-specific models will enable size reduction while maintaining accuracy. A bifurcation is likely. Large, 70-billion-parameter-class models can be used in premium systems with large memory and storage, running fine-tuned applications such as chat completion optimized for dialogue use cases; a local, on-device personal assistant may also need a large-parameter model. Models with fewer than 10 billion parameters can be used in mainstream devices, conceivably consuming a smaller increment of memory to host the model (~2GB), and can serve language tasks such as text completion, finishing lists, and classification.
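To make those numbers concrete, here is a back-of-the-envelope sketch of how parameter count and precision translate into memory footprint. The overhead factor and the smaller model size are illustrative assumptions, not measured values; only the 70-billion-parameter, 4-bit figure above comes from our experiments.

```python
# Rough estimate of resident memory for an LLM: parameters x bytes-per-parameter,
# scaled by an assumed overhead factor for the KV cache, activations, and runtime buffers.

def model_memory_gb(num_params: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Approximate memory footprint in GB for weights plus an assumed overhead factor."""
    return num_params * (bits_per_param / 8) * overhead / 1e9

# 70B parameters at 4-bit: ~35 GB of weights, landing near the ~42 GB observed
# once runtime overhead is included.
print(f"70B @ 4-bit: ~{model_memory_gb(70e9, 4):.0f} GB")

# A sub-10B model at 4-bit needs only a few GB; still smaller models approach
# the ~2 GB incremental figure mentioned for mainstream devices.
print(f" 7B @ 4-bit: ~{model_memory_gb(7e9, 4, overhead=1.1):.1f} GB")
```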
Model size clearly has an implication for memory, at least for the size of PC memory. Bandwidth and energy efficiency are equally important. The PC's (specifically mobile's) transition from DDR to LPDDR helps on both of these dimensions. For example, LPDDR5X consumes 44-54% less power during active use and 86% less power during self-refresh than DDR5, and LPDDR5 delivers 6.4Gb/s per pin versus DDR5's 4.8Gb/s. All of this points to a quicker transition to LPDDR5 if AI is to penetrate the PC quickly. There are also research and development efforts to improve energy efficiency by moving some of the processing into the memory itself. That is likely to take a long time, if it happens at all: the industry needs to converge on a common set of primitives to offload to memory, and that determines the software stack that must be developed; a given set of primitives may not be optimal for all applications. So, for the moment, processing in memory for the PC has more questions than answers.
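Bandwidth matters because single-stream LLM token generation is largely memory-bound: roughly the full weight set is streamed for every token. Here is a minimal sketch of that relationship, assuming a hypothetical 128-bit LPDDR5 configuration and an illustrative efficiency factor, neither of which refers to any specific platform.

```python
# Back-of-the-envelope link between DRAM bandwidth and token throughput.
# Bus width and efficiency are assumptions for illustration, not platform specs.

def peak_bandwidth_gbs(per_pin_gbps: float, bus_width_bits: int) -> float:
    """Peak DRAM bandwidth in GB/s from per-pin data rate and bus width."""
    return per_pin_gbps * bus_width_bits / 8

def tokens_per_second(model_size_gb: float, bandwidth_gbs: float, efficiency: float = 0.6) -> float:
    """Rough upper bound on tokens/s if each token streams the full weight set once."""
    return bandwidth_gbs * efficiency / model_size_gb

bw = peak_bandwidth_gbs(6.4, 128)   # assumed 128-bit LPDDR5 bus -> ~102 GB/s peak
print(f"Peak bandwidth: ~{bw:.0f} GB/s")
print(f"42 GB model   : ~{tokens_per_second(42, bw):.1f} tokens/s")  # in the ballpark of the 1.4 tokens/s above
```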
The bigger question is where the sweet-spot AI models will land. If model sizes remain relatively large, is there a way to reduce the reliance on memory and push part of the model into storage? If that happens, the model rotation will need to be accommodated by increased storage bandwidth, which may increase the proliferation of Gen5 PCIe storage in mainstream PCs or perhaps accelerate the introduction of Gen6 PCIe storage. A recent Apple paper on this same topic1, "LLM in a flash: Efficient Large Language Model Inference with Limited Memory" by Alizadeh et al., proposes a method to run large language models on devices whose available DRAM the model exceeds. The authors suggest storing the model parameters on flash memory and bringing them into DRAM on demand. They also propose methods to optimize data transfer volume and enhance read throughput to significantly improve inference speeds. The paper's primary metric for evaluating the various flash-loading strategies is latency, dissected into three components: the I/O cost of loading from flash, the overhead of managing memory with newly loaded data, and the compute cost of the inference operations. In summary, the paper tackles the challenge of running LLMs that exceed the available DRAM capacity by keeping the parameters on flash and paging them into DRAM on demand.
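The core idea of keeping weights on flash and paging them into DRAM only when touched can be sketched with a simple memory-mapped file. This is a toy illustration of the on-demand loading concept only; the paper itself adds sparsity prediction, windowing, and row-column bundling on top, and the file name and layer sizes below are made up.

```python
import numpy as np

# Toy dimensions so the sketch runs quickly; a 70B-class model at 4-bit would have
# on the order of 80 layers of hundreds of MB each, ~35 GB of weights on flash.
NUM_LAYERS, BYTES_PER_LAYER = 8, 4 * 1024 * 1024

# Create a placeholder weight file on disk (stands in for the quantized model on flash).
np.memmap("model_weights.bin", dtype=np.uint8, mode="w+",
          shape=(NUM_LAYERS, BYTES_PER_LAYER)).flush()

# Map the file instead of reading it all into DRAM up front.
weights = np.memmap("model_weights.bin", dtype=np.uint8, mode="r",
                    shape=(NUM_LAYERS, BYTES_PER_LAYER))

def load_layer(layer_idx: int) -> np.ndarray:
    """Copy one layer's weights into DRAM; untouched layers stay on flash."""
    return np.array(weights[layer_idx])

active = load_layer(0)   # only ~4 MB becomes resident, not the whole file
print(f"Layer 0 resident size: {active.nbytes / 1e6:.1f} MB")
```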
AI capabilities will evolve. The current embedded NPU integration into CPUs, alongside discrete GPUs, is a start. AI accelerator cards from Kinara, Memryx, and Hailo are an alternate implementation for offloading AI workloads in the PC. Models may also evolve toward function-specific models that are smaller and optimized for specific functions. These models will need to be rotated from storage to memory on demand, but the implications for storage are similar to running a large model.
Some advantages of discrete NPUs are:
- They can handle complex AI models and tasks with lower power consumption and heat generation than CPU and GPU.
- They can provide faster and more accurate AI performance for image recognition, generative AI, chatbots, and other applications.
- They can complement the existing CPU and GPU capabilities and enhance the overall AI experience for users.
Lenovo, in its ThinkCentre Neo Ultra desktop, which will launch in June 2024, claims that these cards offer more power-efficient and capable AI processing than the current CPU and GPU solutions.2
TOPS alone, as a figure of merit, can be misleading. In the end, what matters is the number of inferences per unit time, accuracy, and energy efficiency. For generative AI, that can be the number of tokens per second, or completing a Stable Diffusion image in less than a few seconds. Measuring these in an industry-accepted way will require benchmark development. Case in point: I visited the booths and demos of the CPU vendors and discrete NPU players at CES, and every demo claimed superiority of its implementation in one aspect or another.
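As a sketch of what a more meaningful metric looks like, the snippet below times tokens per second end to end. The generate_next_token callable is a hypothetical stand-in for whatever local inference backend is being measured (the dummy shown produces meaningless throughput numbers), and accuracy and energy draw would need to be captured separately.

```python
import time

def benchmark_tokens_per_second(generate_next_token, prompt: str, num_tokens: int = 128) -> float:
    """Time end-to-end generation of num_tokens and report throughput in tokens/s."""
    tokens = []
    start = time.perf_counter()
    for _ in range(num_tokens):
        tokens.append(generate_next_token(prompt, tokens))
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Self-contained example with a dummy backend; swap in a real inference runtime to benchmark it.
dummy = lambda prompt, tokens: "token"
print(f"{benchmark_tokens_per_second(dummy, 'hello'):.1f} tokens/s (dummy backend)")
```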
There is certainly a lot of enthusiasm around the introduction of AI into the PC space. PC OEMs view this as a stimulus to refresh PCs and to increase the share of higher-value content in them. Intel is touting the enablement of 100 million AI PCs by 2025, which is almost 30% of the overall PC TAM. Whatever the adoption rate may be, as a consumer, there is something to look forward to in 2024.
References
- 1 Alizadeh et al., "LLM in a flash: Efficient Large Language Model Inference with Limited Memory," arXiv:2312.11514 (arxiv.org)
- 2 PCWorld article on Kinara and Hailo
- www.micron.com/AI