DeepSeek R1: A Game Changer in AI?
A new AI model has been released, and it's causing quite a stir. We don't often dedicate videos to the release of a new model, simply because there are so many and, frankly, most aren't that interesting. However, DeepSeek and DeepSeek R1 are different. They're very interesting and, I believe, genuinely threaten the monopoly that a few large companies currently have on this technology.
For those unfamiliar, a large language model is a very large, Transformer-based neural network that predicts the next word in a sequence. Neural networks are standard in machine learning, and Transformers have been the dominant architecture in generative AI since about 2017. Generative AI can be broadly split into diffusion models (mostly for image generation) and Transformers (for text generation). It's worth remembering that, underneath, chat models of this kind are fundamentally doing text generation: predicting the next word, or token, over and over.
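To make that concrete, here's a toy sketch of a single prediction step, with a hand-made stand-in for the model (no real Transformer involved): the model assigns a score to every token in its vocabulary, a softmax turns those scores into probabilities, and generation just repeats this step.

```python
import math

# Toy illustration of next-token prediction (not a real model):
# a language model maps a context to a score (logit) for every token
# in its vocabulary; a softmax turns those scores into probabilities.

vocab = ["cat", "dog", "mat", "moon"]

def fake_model(context):
    """Stand-in for a trained Transformer: returns one logit per vocab token."""
    # Hand-picked scores just for the example "the cat sat on the ..."
    return [0.2, 0.1, 3.5, 0.4]

def next_token_probs(context):
    logits = fake_model(context)
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(vocab, exps)}

print(next_token_probs("the cat sat on the"))
# "mat" gets most of the probability mass; generation repeats this step,
# appending the chosen token to the context and predicting again.
```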
These models are typically trained by feeding them massive amounts of text from the internet, learning to predict the next word. This process requires enormous computational resources, often hundreds of thousands of GPUs. Companies like OpenAI tend to keep their models closed, while others, like Meta (Facebook), have a more open approach, releasing models like Llama for free use.
However, even when a model's weights are released, reproducing or retraining it is out of reach for most individuals because of the immense resources training requires. That limits progress: fewer people can experiment with and improve upon existing models. Openness, in my opinion, is crucial for advancement.
Recently, a relatively small company in China called DeepSeek released a family of models, and those models are changing the game. They've demonstrated that training is possible with more limited hardware (still expensive, but significantly less so) and with much greater data efficiency, addressing a major challenge.
DeepSeek V3
V3 is their flagship model, comparable to ChatGPT. It's another large Transformer, trained on vast amounts of text. It performs well and can be used online. DeepSeek claims that V3, with performance similar to Llama or ChatGPT, was trained for around $5 million in GPU compute costs. Compare that to the largest models, which can cost hundreds of millions, potentially even a billion dollars, to train.
V3 achieves these savings through several techniques. One is called "mixture of experts." A traditional large model tries to be a "jack of all trades," with the whole network involved in every task; that's expensive, and improving performance in one area can sometimes hurt another. A mixture-of-experts model instead splits much of the network into specialized "experts," and a small routing network decides which of them are relevant for each token of the prompt. Only those experts are activated (roughly 37 billion parameters out of 671 billion in V3), drastically reducing computational cost and energy consumption, and making the model easier to distribute efficiently across data center hardware.
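Here's a minimal sketch of that routing idea in PyTorch: a top-k gated mixture-of-experts layer. It's illustrative only; the sizes, gating scheme, and plain feed-forward experts are assumptions for the example, not DeepSeek's actual architecture.

```python
import torch
import torch.nn as nn

# Minimal top-k mixture-of-experts layer (illustrative, not DeepSeek's design).
# A small "router" scores every expert for each token; only the k best experts
# are evaluated, so most parameters stay idle for any given token.

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                             # x: (tokens, d_model)
        weights, picked = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # for each chosen expert slot
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)        # 10 token embeddings
print(TinyMoE()(tokens).shape)      # torch.Size([10, 64])
```

The key point is in the forward pass: every token is scored against every expert, but only the top-k experts actually run, so most of the parameters sit idle for any given token.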
Another technique is distillation. Large models are used to train smaller, more specialized models. Essentially, the large model answers questions in a specific field, and these answers are used to train a smaller model to perform the same task. This can achieve similar performance with far fewer parameters (e.g., 8 billion), making it possible to run on standard hardware.
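The classic way to implement distillation is to train the small model to match the large model's output distribution rather than just its raw answers. The sketch below uses tiny stand-in networks so it runs anywhere; it illustrates the general technique, not DeepSeek's specific pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Distillation sketch: a small "student" is trained to reproduce the output
# distribution of a big "teacher". Both models here are toy stand-ins.

vocab, d_teacher, d_student = 100, 512, 64

teacher = nn.Sequential(nn.Embedding(vocab, d_teacher), nn.Flatten(),
                        nn.Linear(d_teacher, vocab))    # pretend "large" model
student = nn.Sequential(nn.Embedding(vocab, d_student), nn.Flatten(),
                        nn.Linear(d_student, vocab))    # much smaller model

opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    prompts = torch.randint(0, vocab, (32, 1))          # toy "questions"
    with torch.no_grad():
        target = F.softmax(teacher(prompts), dim=-1)    # teacher's soft answers
    log_probs = F.log_softmax(student(prompts), dim=-1)
    loss = F.kl_div(log_probs, target, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final distillation loss:", loss.item())
```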
DeepSeek has also made mathematical savings inside the network itself, optimizing how the underlying matrix multiplications are carried out and significantly reducing the computational load.
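As an illustration of the kind of saving available, here's a toy NumPy comparison of a large matrix multiplication in 32-bit versus 16-bit floats: half the memory per value, at the cost of a small error. This is a generic example of reduced-precision arithmetic, not necessarily the exact optimization DeepSeek used.

```python
import numpy as np

# Reduced-precision matrix multiply: same operation, half the bytes per value.

rng = np.random.default_rng(0)
a = rng.standard_normal((1024, 1024)).astype(np.float32)
b = rng.standard_normal((1024, 1024)).astype(np.float32)

full = a @ b                                                    # 32-bit result
low = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

print("bytes per value:", a.itemsize, "->", np.float16().itemsize)
print("mean relative error:", float(np.abs(full - low).mean() / np.abs(full).mean()))
```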
DeepSeek R1
While V3 is impressive, R1 is what's generating the most excitement. R1 performs something called "Chain of Thought." Imagine a long division problem. It's easier to solve by writing down the steps. Chain of Thought applies this principle to large language models. Instead of directly providing an answer to a complex problem, the model generates a step-by-step reasoning process, leading to the solution. This is particularly useful for problems requiring multiple steps.
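In practice, a lot of this comes down to how the model is asked to respond. Below is a toy comparison of a direct prompt and a chain-of-thought prompt; `ask_model` is a hypothetical placeholder for whatever model or API you would actually call.

```python
# Toy illustration of a direct prompt versus a chain-of-thought prompt.
# `ask_model` is a hypothetical stand-in for a real LLM call.

question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

direct_prompt = f"{question}\nAnswer with just the number."

cot_prompt = (
    f"{question}\n"
    "Think step by step, writing out your working, "
    "then give the final answer on its own line."
)

def ask_model(prompt: str) -> str:
    """Placeholder: substitute a real API or local model call here."""
    return "(model output would appear here)"

print(ask_model(direct_prompt))   # model must jump straight to '80'
print(ask_model(cot_prompt))      # model first writes the steps: 60 / 0.75 = 80 km/h
```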
OpenAI pioneered this style of reasoning model with o1, but their implementation is closed: you never get to see the model's actual chain of thought. R1's Chain of Thought is fully public. They've released the models, the code, and the complete internal monologue is there for anyone to read. They've also trained it with significantly less data.
Traditionally, training Chain of Thought involves creating a dataset with questions, the corresponding Chain of Thought, and the final answer. This requires a massive amount of carefully crafted examples. R1 takes a different approach, training only with the answers. They use reinforcement learning, rewarding the model for correct answers and for generating internal monologues of the correct format. Over time, the monologues improve, and the model learns to solve problems effectively.
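Here's a minimal sketch of that kind of rule-based reward: one term for producing the expected "show your working, then answer" format, and one for getting the final answer right. The tag names and reward values are illustrative assumptions, not DeepSeek's exact specification.

```python
import re

# Rule-based reward sketch for reinforcing chain-of-thought behaviour:
# reward the expected format and reward a correct final answer.
# Tags and weights are illustrative, not DeepSeek's exact spec.

def reward(model_output: str, correct_answer: str) -> float:
    r = 0.0

    # Format reward: a reasoning block followed by an answer block.
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                 model_output, re.DOTALL):
        r += 0.5

    # Accuracy reward: the answer block must match the reference answer.
    answer = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if answer and answer.group(1).strip() == correct_answer.strip():
        r += 1.0

    return r

sample = "<think>60 km in 45 min is 60 / 0.75 per hour.</think><answer>80 km/h</answer>"
print(reward(sample, "80 km/h"))   # 1.5: right format and right answer
```

In the full pipeline, this score drives a reinforcement-learning update of the model's weights; the sketch covers only the reward side.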
This approach requires much less data, making it easier to train. It democratizes access to this technology. Previously, only a handful of large companies could achieve this. Now, organizations the size of a university could potentially train these models.
DeepSeek has not only released a performant model but also revealed their training methods, which is very unusual. This has caused quite a stir in Silicon Valley.
This kind of openness challenges the business models of companies that rely on closed AI models and vast GPU resources. It levels the playing field, allowing more individuals and organizations to participate in AI development. We may be witnessing the beginning of the end of closed-source AI.
Ultimately, with enough data and model size, AI will likely move beyond simple pattern recognition and achieve more general intelligence.