Kimi Replaces Residual Connections with Attention in Transformers
Kimi's researchers propose replacing a transformer's traditional residual connections with an attention mechanism that learns which earlier layers' outputs matter most for each subsequent layer. They report a consistent 1.25× compute advantage across a range of model sizes.
Read full story →
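The core idea can be sketched in a few lines. This is a simplified, hypothetical illustration, not Kimi's actual method: instead of the usual residual update `x = x + f(x)`, each block's input is an attention-weighted mix over the outputs of all earlier layers, with a learned query (`Wq` here is an invented parameter) scoring how relevant each earlier layer is.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(x, W):
    # Stand-in for a full transformer block: one linear map + nonlinearity.
    return np.tanh(x @ W)

def forward_with_depth_attention(x, block_weights, Wq):
    """Replace residual addition with attention over the layer history.

    history holds the outputs of all layers so far; each new block reads
    an attention-weighted combination of them instead of a plain sum.
    """
    history = [x]  # starts with the embedding / input vector
    for W in block_weights:
        H = np.stack(history)             # (num_prev_layers, d)
        q = history[-1] @ Wq              # query derived from latest output
        scores = H @ q / np.sqrt(len(q))  # one relevance score per layer
        alpha = softmax(scores)           # attention distribution over depth
        mixed = alpha @ H                 # weighted mix of layer outputs, (d,)
        history.append(block(mixed, W))
    return history[-1]

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
block_weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
Wq = rng.standard_normal((d, d)) * 0.1
out = forward_with_depth_attention(x, block_weights, Wq)
print(out.shape)
```

Intuitively, the attention weights play the role the identity path plays in a residual network: a layer whose output scores highly keeps flowing forward, while unhelpful layers are downweighted, which is one way selective layer importance could translate into compute savings.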