Press "Enter" to skip to content

AWS, Cerebras Partner to Speed Up AI Inference Workloads

AWS and AI chipmaker Cerebras Systems are collaborating to deliver faster generative AI inference through a new architecture that splits workloads across specialized hardware.

The system will combine AWS Trainium-powered servers with Cerebras CS-3 systems and Amazon’s Elastic Fabric Adapter networking, and will be deployed in AWS data centers and exposed through Amazon Bedrock. Inference is the stage where new data is fed into a trained AI model to generate outputs.

The companies said the approach separates two stages of inference: “prefill,” which processes prompts, and “decode,” which generates tokens sequentially. Trainium will handle prefill while Cerebras’ CS-3, based on its wafer-scale processor, will handle decode operations that often dominate inference time.
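To make the two stages concrete, here is a minimal, purely illustrative Python sketch of disaggregated inference: prefill processes the whole prompt at once, then decode generates tokens one at a time from the cached state. The toy "model", the KVCache class, and the function names are hypothetical stand-ins, not the AWS or Cerebras implementation.

```python
# Conceptual sketch of split inference: prefill runs once over the full
# prompt, decode then produces tokens sequentially. The "model" below is a
# toy placeholder; real systems run large transformers on separate hardware.
from dataclasses import dataclass


@dataclass
class KVCache:
    """Per-request state produced by prefill and consumed by decode."""
    tokens: list[int]


def prefill(prompt_tokens: list[int]) -> KVCache:
    # Processes the entire prompt in parallel (typically compute-bound).
    # In the described architecture, this stage would run on Trainium.
    return KVCache(tokens=list(prompt_tokens))


def decode_step(cache: KVCache) -> int:
    # Generates one token from the cached state (typically bandwidth-bound).
    # In the described architecture, this stage would run on the CS-3.
    next_token = (sum(cache.tokens) + len(cache.tokens)) % 50_000  # toy rule
    cache.tokens.append(next_token)
    return next_token


def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    cache = prefill(prompt_tokens)        # stage 1: prompt processing
    output = []
    for _ in range(max_new_tokens):       # stage 2: sequential generation
        output.append(decode_step(cache))
    return output


if __name__ == "__main__":
    print(generate([101, 202, 303], max_new_tokens=5))
```

Because decode produces one token per step, its cost grows with output length, which is why it often dominates total inference time for long generations.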

AWS said the design could improve inference performance by an order of magnitude for some workloads, though those claims have not yet been independently benchmarked.

According to Cerebras, code is increasingly written by AI agents rather than humans, and agentic queries generate roughly 15 times more tokens than conversational chats. The company cites that growth as the reason faster inference is needed, and it claims to lead the field in inference speed, saying its chips can process up to 3,000 tokens per second.

Cerebras markets the CS-3 as one of the fastest inference systems available, designed around its wafer-scale engine and its high on-chip memory bandwidth. AWS, meanwhile, has been expanding its custom AI silicon lineup with Trainium chips to compete with Nvidia-based infrastructure.

The companies said the joint system will support open-source models and Amazon’s Nova models later this year.

Read the AWS press release and Cerebras blog post.
