AI

Inference = IOPS: Why AI’s next frontier runs on storage

Jeremy Werner | May 2025

Inference used to be the quiet follow-up act to training, an afterthought even. But everything has changed seemingly overnight. Today, inference is the main event in AI infrastructure — and storage is stepping into the spotlight.

Every time you ask a chatbot a question, generate an image or run a “Copiloted” task, inference is doing the work. These aren’t predictable, repeatable processes like training. Inference is on demand, in real time, and shaped entirely by user behavior. That makes it a lot messier — and much harder to optimize.

Imagine navigating through a busy city during rush hour. Every driver has a unique destination, and the traffic patterns are constantly changing. You need to make real-time decisions based on the current conditions, adjusting your route to avoid congestion and reach your destination efficiently. This unpredictability and need for quick adjustments mirror the randomness of inference in AI. Each of your interactions triggers a unique set of processes and computations, demanding high performance and responsiveness from the system.

Inference = IOPS

The reality is this: Unlike training workloads, inference workloads don’t run in a straight line. They loop back, refine and reprocess. That means each interaction triggers a flurry of reads, writes and lookups. Those input/output operations per second (IOPS) add up fast. Inference doesn’t just need high capacity; it also needs high performance. Compute gets most of the headlines, but it’s storage that’s constantly “feeding the beast.”

And as these models scale — serving billions of users like you in near real time — the pressure on infrastructure grows exponentially. AI innovation must move at the speed of light, but it can only move as fast as its slowest component.

Yann LeCun, Meta’s chief AI scientist, said it well, “Most of the infrastructure cost for AI is for inference: serving AI assistants to billions of people.”

That scale translates directly into a need for faster, more responsive storage systems — not just high capacity but also high IOPS. Inference applications can drive hundreds or even thousands of times the concurrent I/O of historical CPU-based computing applications.
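
To make that access pattern concrete, here is a minimal, purely illustrative sketch in Python (a hypothetical toy, not a Micron benchmark and not tied to any particular drive). It spawns a handful of concurrent “requests,” each issuing thousands of small reads at random offsets in a scratch file, and reports the resulting I/O rate. The file size, block size, thread count and read count are arbitrary assumptions chosen for illustration.

```python
# Toy sketch (hypothetical, not a Micron benchmark): concurrent small random reads,
# the kind of access pattern that makes inference serving IOPS-bound.
# Results are inflated by the OS page cache; real measurements use a dedicated
# benchmark tool with direct I/O against the actual device.
import os
import random
import tempfile
import threading
import time

FILE_SIZE = 256 * 1024 * 1024   # 256 MiB scratch file standing in for model/KV-cache data
BLOCK = 4096                    # 4 KiB reads, a typical small-I/O size
THREADS = 16                    # concurrent "inference requests"
READS_PER_THREAD = 2000         # random reads issued per request


def make_scratch_file() -> str:
    """Write a scratch file filled with random bytes and return its path."""
    f = tempfile.NamedTemporaryFile(delete=False)
    remaining = FILE_SIZE
    chunk = 4 * 1024 * 1024
    while remaining > 0:
        f.write(os.urandom(min(chunk, remaining)))
        remaining -= min(chunk, remaining)
    f.close()
    return f.name


def worker(path: str) -> None:
    """Issue many small reads at random 4 KiB-aligned offsets (no sequential locality)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(READS_PER_THREAD):
            offset = random.randrange(FILE_SIZE // BLOCK) * BLOCK
            os.pread(fd, BLOCK, offset)   # POSIX-only positional read
    finally:
        os.close(fd)


if __name__ == "__main__":
    path = make_scratch_file()
    threads = [threading.Thread(target=worker, args=(path,)) for _ in range(THREADS)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    total_ios = THREADS * READS_PER_THREAD
    print(f"{total_ios} random {BLOCK}-byte reads in {elapsed:.2f} s "
          f"~ {total_ios / elapsed:,.0f} IOPS (cache-assisted)")
    os.remove(path)
```

Because the operating system’s page cache absorbs most of these reads, the printed number overstates what a device would sustain on its own; serious storage measurements typically use a dedicated benchmark such as fio with direct I/O against the real device and far higher queue depths.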

At Micron, we’re seeing this shift play out in real-world deployments. Customers running large language models (LLMs) and other inference-heavy workloads are looking for ways to reduce tail latency and boost responsiveness under unpredictable loads.

That’s where drives like the Micron 9550 and our next-gen PCIe Gen6 NVMe SSDs are making a real difference. These aren’t general-purpose storage devices. They’re engineered specifically for data-intensive, low-latency environments like AI inference.

NVIDIA’s Jensen Huang recently pointed out, “The amount of computation we need … as a result of agentic AI, as a result of reasoning, is easily a 100 times more than we thought we needed this time last year.”

It’s not just the models getting smarter. The infrastructure needs to keep up — across the stack. That includes storage, especially in systems where inference happens across a swarm of GPUs, accelerators and memory tiers.

As use cases grow — chatbots, search, Copilots and embedded AI at the edge — the entire I/O pipeline is being reevaluated. What’s the point of a blazing-fast compute fabric if your storage can’t keep pace?

The era of inference is upon us, driving the demand for IOPS — and Micron is leading the charge.

Corporate Vice President and General Manager, Storage Business Unit

Jeremy Werner

Jeremy is an accomplished storage technology leader with more than 20 years of experience. His responsibilities at Micron span product planning, marketing and customer support for the server, storage, hyperscale and client markets worldwide. Previously, he served as general manager of the SSD business at KIOXIA America and spent 10 years in sales and marketing roles at the startups MetaRAM, Tidal Systems and SandForce. Jeremy holds a Bachelor of Science degree in electrical engineering from Cornell University and has more than 25 patents granted or pending.