Press "Enter" to skip to content
Credit: Freepik

Cerebras’ New Inference Service May Lower Enterprise AI Costs

TLDR

  • AI supercomputer startup Cerebras unveiled an inference service it said is much faster and much cheaper than the GPU-based solutions offered by cloud giants.
  • The secret is its giant, plate-sized chip, which can store an entire language model on-chip. On GPUs, by contrast, data has to travel to external memory and back for each computation, slowing the process.
  • The speed claims are backed by third-party benchmarking firm Artificial Analysis.

One of the main obstacles to faster AI deployment is cost. Now, an AI supercomputer startup has unveiled an AI inference service that it says is not only much faster but also a lot cheaper.

Silicon Valley-based Cerebras Systems said its AI inference service is faster than all of its competitors’. Inference is the process of feeding new data into an already trained AI model to yield new insights or accomplish new tasks. It is the bread and butter of the AI compute market: the fastest-growing segment, accounting for about 40% of the total AI hardware market.

Cerebras said its system is 20 times faster than the Nvidia GPU-based solutions offered by the largest cloud providers, with prices starting at 10 cents per million tokens. (A million tokens equals roughly 750,000 words.) That amounts to a 100-times-better price-to-performance ratio for AI workloads, the company claimed.

“We are now the world’s fastest at inference. We are at the most accurate level, and we’re the cheapest,” said Cerebras CEO Andrew Feldman, in a podcast with Weights & Biases.

The startup said it delivers 1,800 tokens (roughly 1,350 words) per second for Meta’s Llama 3.1 8B, an 8-billion-parameter model, and 450 tokens per second for Llama 3.1 70B.

Artificial Analysis, a benchmarking firm, said it has verified these claims. Cerebras has “taken the lead” in AI inference benchmarks and has reached a “new record,” according to Artificial Analysis.

High-speed AI inference at above 1,000 tokens per second is akin to the arrival of broadband internet, the startup said, implying that today’s speeds are the equivalent of dial-up. Faster inference may also accelerate the development of AI agents – bots that perform tasks for people rather than just telling them what to do – because speed matters when making decisions in real time.

The faster speeds come at a fifth of the price, Cerebras added: inference costs 10 cents per million tokens for the 8B Llama model and 60 cents per million tokens for the 70B model.
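As a rough illustration of what those rates mean in practice, the sketch below estimates the dollar cost of processing a given amount of text, using the per-million-token prices quoted above and the article’s approximation that a million tokens is about 750,000 words. The numbers are taken from this article, not from an official Cerebras price sheet.

```python
# Back-of-envelope cost estimate using the per-million-token prices quoted
# in the article (not an official Cerebras price sheet).

PRICE_PER_MILLION_TOKENS = {
    "llama-3.1-8b": 0.10,   # dollars per million tokens (article figure)
    "llama-3.1-70b": 0.60,  # dollars per million tokens (article figure)
}

WORDS_PER_TOKEN = 0.75  # ~750,000 words per million tokens, per the article


def estimated_cost(words: int, model: str) -> float:
    """Rough dollar cost to process `words` words on the given model."""
    tokens = words / WORDS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


if __name__ == "__main__":
    # Example: a 75,000-word corpus (~100,000 tokens) on each model.
    for model in PRICE_PER_MILLION_TOKENS:
        print(f"{model}: ${estimated_cost(75_000, model):.4f}")
```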

Cerebras’ secret is its Wafer-Scale Engine 3 (WSE-3), the largest chip in the world, roughly the size of a dinner plate. It powers the company’s CS-3 systems, which can be clustered into AI supercomputers that run large language models.

Cerebras said the chip is big enough to hold an entire model in its 44GB of on-chip SRAM. GPUs, in contrast, have far less on-chip memory, so model data must travel to external memory and back, slowing compute. The startup also said the WSE-3 offers 21 petabytes per second of aggregate memory bandwidth, 7,000 times that of Nvidia’s H100 AI chip.

Moreover, Cerebras said it runs models with 16-bit weights, which preserves accuracy. It claims competitors work around the memory-bandwidth problem by quantizing to 8-bit weights, which can make outputs less accurate.
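To see why memory bandwidth matters so much here, note that autoregressive generation streams essentially every model weight through the compute units once per generated token, so a simple upper bound on single-stream decode speed is memory bandwidth divided by the size of the weights in bytes. The sketch below works through that bound for a 70-billion-parameter model at 16-bit and 8-bit weights; it is a deliberately simplified model (no batching, KV-cache traffic, or compute limits), and the bandwidth figures are the ones quoted above, with the H100 number approximate.

```python
# Simplified, bandwidth-bound estimate of single-stream decode speed:
# each generated token requires streaming all model weights once, so
# tokens/sec <= memory_bandwidth / bytes_of_weights.
# Ignores batching, KV-cache traffic, and compute limits.

def max_tokens_per_second(params_billion: float, bytes_per_weight: int,
                          bandwidth_tb_per_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_weight
    return bandwidth_tb_per_s * 1e12 / weight_bytes


if __name__ == "__main__":
    # Llama 3.1 70B. Bandwidth figures from the article: ~21 PB/s
    # (21,000 TB/s) aggregate for WSE-3 and roughly 3 TB/s for an H100
    # (approximate; also ignores that one H100 cannot hold 70B weights).
    for name, bw in [("WSE-3 (aggregate)", 21_000), ("H100 (HBM)", 3)]:
        fp16 = max_tokens_per_second(70, 2, bw)  # 16-bit = 2 bytes/weight
        int8 = max_tokens_per_second(70, 1, bw)  # 8-bit  = 1 byte/weight
        print(f"{name}: ~{fp16:,.0f} tok/s at 16-bit, ~{int8:,.0f} tok/s at 8-bit")
```

The same arithmetic shows why 8-bit quantization is a tempting workaround on bandwidth-limited hardware: halving the bytes per weight roughly doubles the bandwidth-bound token rate.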

Cerebras inference is available via chat and API access, but only for the Llama 3.1 8B and 70B models at present. Support for other models will be added in the coming weeks.
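The article does not describe the API itself, but services of this kind are commonly exposed through an OpenAI-compatible chat-completions endpoint. The sketch below shows what a call might look like under that assumption; the base URL, environment variable, and model identifier are illustrative placeholders, not confirmed Cerebras API details.

```python
# Hypothetical call to an OpenAI-compatible chat endpoint.
# The base_url, env var, and model name are assumptions for illustration,
# not confirmed Cerebras API details.
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed credential variable
)

response = client.chat.completions.create(
    model="llama3.1-8b",                     # assumed model identifier
    messages=[
        {"role": "user",
         "content": "Summarize wafer-scale inference in one sentence."},
    ],
    max_tokens=100,
)

print(response.choices[0].message.content)
```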
