OpenAI launches o3-mini model
DeepSeek AI's ultra-efficient R1 model shocked the AI industry by delivering top-tier performance at a remarkably low computational cost, triggering a global tech stock sell-off that wiped out nearly a trillion dollars in market capitalization and reshaped the competitive landscape of the industry. To counter the challenge, OpenAI rushed out its o3-mini model last Friday to compete head-on with R1.
OpenAI announced in its official blog post: "We're introducing OpenAI o3-mini, the latest and most cost-effective model in our reasoning family, now available on ChatGPT and the API platform. … This powerful and fast model, previewed in December 2024, pushes the boundaries of small model performance … while maintaining the low cost and low latency of OpenAI o1-mini."
To promote the new reasoning series, OpenAI is offering reasoning capabilities to free users for the first time and has tripled the daily message limit for paid users, from 50 to 150.
The o3-mini Model: Relatively Weak in Creativity, but Stronger in Problem Solving
Unlike GPT-4o and the rest of the GPT series, the "o" series models focus on reasoning tasks. OpenAI primarily maintains two model families: the Generative Pre-trained Transformer (GPT) series and the "o" reasoning series. The "o" models are relatively weak at creative work but have built-in chain-of-thought reasoning, which makes them better at solving complex problems, catching flawed analyses, and writing better-structured code.
The GPT series is the artist of the family, excelling at role-playing, conversation, creative writing, summarization, explanation, brainstorming, and chat; the "o" series is the scientist of the family, less adept at storytelling but skilled at coding, solving mathematical equations, analyzing complex problems, planning and reasoning step by step, and comparing research papers.
The new o3-mini comes in low, medium, and high reasoning-effort tiers. Users can pick a higher tier when they need more accurate answers, but higher tiers generate more "reasoning" tokens, so developers pay more per request (billing is by token).
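As a rough illustration, here is a minimal sketch of how a developer might select one of these tiers through the API, assuming the OpenAI Python SDK and its `reasoning_effort` parameter; the exact parameter name and accepted values should be verified against OpenAI's current API reference.

```python
# Minimal sketch: choosing an o3-mini reasoning tier via the OpenAI Python SDK.
# Assumes the `reasoning_effort` parameter ("low" | "medium" | "high") exposed
# for o-series models; verify the name and values against OpenAI's API docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # more reasoning tokens: better answers, higher cost
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
)

print(response.choices[0].message.content)
```

Because billing is per token and the higher tiers emit more hidden reasoning tokens, the same prompt costs more at "high" effort than at "low".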
In terms of performance, OpenAI o3-mini scores lower than OpenAI o1-mini on general knowledge and multilingual chain-of-thought benchmarks, but higher on other tasks such as coding and factuality. The medium and high versions of o3-mini surpass OpenAI o1-mini on every benchmark tested.
DeepSeek's R1, which matches or even surpasses OpenAI's flagship models while requiring far less computing power, triggered the tech stock sell-off that wiped nearly a trillion dollars off the US market; Nvidia alone lost roughly $600 billion as investors grew concerned about future demand for its high-priced AI chips.
DeepSeek's success stems from an innovative approach to model architecture. Where US companies have tended to pour ever more computing power into training, the DeepSeek team focused on optimizing how the model processes information to improve efficiency. With Chinese tech giant Alibaba launching its Qwen2.5-Max model, which the company claims outperforms DeepSeek-V3, competitive pressure is intensifying further, signaling a new wave of Chinese AI innovation.
OpenAI o3-mini: 24% Faster Than Its Predecessor
OpenAI o3-mini is an attempt to regain the lead. The new model responds 24% faster than its predecessor, matches or exceeds the older model on key benchmarks, and is more cost-effective.
On pricing, OpenAI o3-mini costs $0.55 per million input tokens and $4.40 per million output tokens. That is still higher than DeepSeek R1's $0.14 and $2.19, but it narrows the gap with DeepSeek considerably and is a significant cut from OpenAI o1's pricing, which is likely a key part of its appeal. o3-mini is also a closed-source model, whereas DeepSeek R1 is open source; for users willing to pay to run models on managed servers, o3-mini's attractiveness will therefore depend on the application.
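To make the numbers concrete, the sketch below plugs the per-million-token prices quoted above into a back-of-the-envelope cost comparison; the workload size is a made-up example, not vendor data.

```python
# Back-of-the-envelope API cost comparison using the per-million-token prices
# quoted above. The token counts are illustrative, not from either vendor.
PRICES = {  # model: (input $ per 1M tokens, output $ per 1M tokens)
    "OpenAI o3-mini": (0.55, 4.40),
    "DeepSeek R1": (0.14, 2.19),
}

input_tokens = 2_000_000   # hypothetical monthly prompt volume
output_tokens = 500_000    # hypothetical monthly completion volume

for model, (in_price, out_price) in PRICES.items():
    cost = (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price
    print(f"{model}: ${cost:.2f}")

# OpenAI o3-mini: $3.30  (2 x 0.55 + 0.5 x 4.40)
# DeepSeek R1:    $1.38  (2 x 0.14 + 0.5 x 2.19)
```

At these list prices, R1 remains cheaper per request, but the gap is far smaller than it was against o1.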
In the AIME math benchmark, the o3-mini medium version scored 79.6 and DeepSeek R1 scored 79.8, leaving R1 second only to the o3-mini high version (87.3).
Other benchmarks paint a mixed picture. On GPQA, which measures capability across scientific fields, DeepSeek R1 scored 71.5, versus 70.6 for the o3-mini low version and 79.7 for the high version; on the Codeforces coding benchmark, R1 placed in the 96.3rd percentile, the o3-mini low version in the 93rd percentile, and the high version in the 97th percentile. How the models compare therefore depends on the task.
OpenAI o3-mini vs. DeepSeek R1: A Test Comparison
We ran several tests comparing the two models. One was a spy game drawn from the BIG-bench dataset on GitHub, used to assess multi-step reasoning. OpenAI o3-mini performed poorly here, reaching the wrong conclusion and misidentifying the culprit, while DeepSeek R1 identified the culprit correctly.
However, o3-mini did well on logical language tasks that do not involve mathematics. For example, when asked to write five sentences ending with a specific word, it understood the task, evaluated its own output, caught and corrected one wrong sentence, and delivered the correct answer in four seconds.
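For readers who want to reproduce this kind of check, a small script along the following lines can send the prompt and verify the constraint automatically. The model name and SDK call mirror the earlier sketch and are assumptions to verify against the API documentation, and the target word here is hypothetical since the article does not say which word was used.

```python
# Sketch of an automated check for the "five sentences ending with a given word"
# test described above. Model name and SDK usage are assumptions; the target
# word is hypothetical.
from openai import OpenAI

client = OpenAI()
TARGET = "apples"  # hypothetical target word

prompt = f"Write five sentences, each of which ends with the word '{TARGET}'."
reply = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# Treat each non-empty line as one sentence and check its final word.
for line in (s.strip() for s in reply.splitlines() if s.strip()):
    last_word = line.rstrip(".!?\"'").split()[-1].strip(",;:").lower()
    print("OK  " if last_word == TARGET else "FAIL", line)
```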
In mathematics, o3-mini excelled, quickly solving problems considered extremely difficult. One complex problem that took DeepSeek R1 275 seconds was solved by o3-mini in just 33 seconds.
In conclusion, OpenAI's o3-mini model shows some competitiveness, but the challenge from DeepSeek R1 remains, and the competition between the two in the AI field will continue to heat up.