DeepSeek-V2 is an open-source Mixture-of-Experts (MoE) language model with 236 billion total parameters, of which only 21 billion (roughly 9%) are activated for each input token. It targets the computational cost of large models through architectural and training changes that balance performance against efficiency. The architecture builds on the Transformer and adds two components: Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA compresses the Key-Value (KV) cache into a smaller latent vector, reducing the memory footprint and computation required during text generation. DeepSeekMoE activates only the relevant experts for each token rather than the full network, which yields significant cost savings during training. The model is then aligned with human expectations and preferences through supervised fine-tuning and reinforcement learning. DeepSeek-V2 performs strongly on benchmarks across diverse domains and languages, including English and Chinese. It shares some limitations with other LLMs, but its open-source availability and ongoing development make it a valuable resource for researchers and developers.
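
To make the idea of activating only some experts per token concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not DeepSeekMoE's actual design: the expert count, the scalar "experts", the router logits, and the renormalization of the selected weights are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts of one token.

    Assumption: weights of the selected experts are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, experts, router_logits, top_k):
    """Weighted sum of the outputs of only the selected experts."""
    out = 0.0
    for idx, weight in route_token(router_logits, top_k):
        out += weight * experts[idx](x)  # unselected experts are never evaluated
    return out

# Toy setup (hypothetical): 8 experts, each just scales its input; 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

selected = route_token(logits, top_k=2)
y = moe_layer(1.0, experts, logits, top_k=2)

# Fraction of parameters active per token, from the figures quoted above.
active_fraction = 21 / 236

print(selected)
print(y)
print(f"{active_fraction:.1%}")
```

Because only the selected experts run, compute per token scales with `top_k` rather than with the total number of experts, which is the same reason only 21 of the 236 billion parameters are active for any given token.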
