DeepSeek researchers, including founder Wenfeng Liang, have written a paper that tackles a growing problem in large language model design: how to scale architectural complexity without breaking training stability.
Recent approaches such as Hyper-Connections showed promise by widening the residual stream into multiple parallel streams, but at large scale they introduce severe numerical instability and systems overhead.
In their paper, the authors propose Manifold-Constrained Hyper-Connections, or mHC, which restores the identity-mapping property of residual networks by constraining the residual-mixing matrices to a mathematically defined manifold.
By combining this theoretical fix with low-level infrastructure optimizations, the paper reports stable training and consistent performance gains in models with up to 27 billion parameters, offering a practical new direction for large-scale model architecture design.
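To make the idea concrete, here is a minimal sketch of what a manifold-constrained residual mix could look like. This is not DeepSeek's implementation: it assumes the constraint takes the form of a projection onto doubly stochastic matrices (rows and columns summing to 1, which keeps the total residual-stream mass constant) via Sinkhorn normalization, and the names `sinkhorn_project` and `ConstrainedHyperConnection` are hypothetical.

```python
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 5) -> torch.Tensor:
    """Push a square matrix of logits toward the set of doubly stochastic
    matrices by alternately normalizing rows and columns (Sinkhorn).
    Illustrative stand-in for the paper's manifold constraint."""
    m = logits.exp()
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # normalize rows
        m = m / m.sum(dim=-2, keepdim=True)  # normalize columns
    return m

class ConstrainedHyperConnection(torch.nn.Module):
    """Mix n parallel residual streams with a learned matrix that is
    projected onto the constraint manifold before every use."""

    def __init__(self, n_streams: int):
        super().__init__()
        # Start near the identity so training begins as a plain residual
        # network, i.e. the identity mapping between streams is preserved.
        self.mix_logits = torch.nn.Parameter(torch.eye(n_streams) * 4.0)

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (batch, n_streams, hidden)
        mix = sinkhorn_project(self.mix_logits)
        return torch.einsum("ij,bjh->bih", mix, streams)
```

The design intuition, under these assumptions: an unconstrained mixing matrix can amplify or shrink the residual signal layer after layer, compounding into the instability the paper describes, whereas a doubly stochastic matrix can only redistribute signal between streams, never inflate it.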
DeepSeek first made waves with the debut of its top-performing V3 AI model, which it said cost only $5.6 million to train, using 2,000 of Nvidia's slower H800 chips. That compares with the hundreds of millions of dollars U.S. AI companies have spent on training runs using tens of thousands of Nvidia GPUs.
The Chinese startup later released R1, its reasoning model. DeepSeek's open-weight AI models are now among the most popular alternatives to closed models such as OpenAI's GPT series and Google's Gemini.