The recent release of OpenAI o1 has drawn a lot of attention to large reasoning models (LRMs) and is inspiring new models aimed at solving complex problems that classical language models often struggle with. Building on the success of o1 and the concept of LRMs, Alibaba researchers have introduced Marco-o1, which enhances reasoning abilities and tackles problems with open-ended solutions, where clear criteria and quantifiable rewards are absent.
OpenAI o1 uses “inference-time scaling,” giving the model more “think time” to improve its reasoning ability. Essentially, the model spends more compute cycles during inference to generate more tokens and evaluate its answers, which improves its performance on tasks that require reasoning. o1 is known for its impressive reasoning abilities, especially on problems with standard answers, such as mathematics, physics and coding.
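OpenAI has not disclosed how o1 works internally, so the following is only a minimal sketch of one common inference-time scaling technique, best-of-N sampling with self-evaluation. The `generate` and `score_answer` functions are hypothetical stand-ins for real LLM API calls.

```python
# Minimal sketch of inference-time scaling via best-of-N sampling.
# `generate` and `score_answer` are hypothetical stand-ins for real
# LLM API calls; OpenAI has not published o1's actual mechanism.

def best_of_n(prompt: str, generate, score_answer, n: int = 8) -> str:
    # Spend extra compute at inference time: sample n candidate answers,
    # have the model score each one, and return the highest-scoring one.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score_answer)
```

The trade-off is straightforward: n times the inference cost in exchange for a better chance of landing on a correct answer.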
However, many applications involve open-ended problems that lack clear solutions and tangible rewards. “We aim to push the boundaries of LLMs further, enhancing their reasoning abilities to tackle complex, real-world challenges,” the Alibaba researchers write.
Marco-o1 is a fine-tuned version of Alibaba’s Qwen2-7B-Instruct that integrates advanced techniques such as chain-of-thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS) and reasoning action strategies.
The researchers trained Marco-o1 on a combination of datasets: the Open-O1 CoT dataset; the Marco-o1 CoT dataset, a synthetic dataset generated using MCTS; and the Marco-o1 instruction dataset, a collection of custom instruction-following data for reasoning tasks.
MCTS is a search algorithm that has proven effective in complex problem-solving scenarios. It intelligently explores different solution paths by repeatedly sampling possibilities, simulating outcomes and gradually building a decision tree. It has proven very useful in hard AI problems, such as mastering the game Go.
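For readers unfamiliar with the algorithm, here is a minimal, generic MCTS loop. The `expand` and `rollout` callbacks are problem-specific functions a caller would supply; this sketch illustrates the classic algorithm rather than Marco-o1’s implementation.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(node, c=1.4):
    # Balance exploiting high-value branches against exploring rarely
    # visited ones; unvisited nodes are always tried first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root, expand, rollout, iterations=1000):
    # Generic loop: select a leaf via UCB1, expand it, simulate an
    # outcome, then backpropagate the reward up to the root.
    for _ in range(iterations):
        node = root
        while node.children:                          # 1. selection
            node = max(node.children, key=ucb1)
        node.children = [Node(s, node) for s in expand(node.state)]
        if node.children:                             # 2. expansion
            node = random.choice(node.children)
        reward = rollout(node.state)                  # 3. simulation
        while node:                                   # 4. backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state
```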
Marco-o1 leverages MCTS to explore multiple reasoning paths as it generates response tokens. The model uses the confidence scores of candidate response tokens to build its decision tree and explore different branches. This enables the model to consider a wider range of possibilities and reach more informed, nuanced conclusions, especially in scenarios with open-ended solutions. The researchers also introduced a flexible reasoning action strategy that adjusts the granularity of MCTS steps by specifying the number of tokens generated at each node of the tree. This provides a trade-off between accuracy and computational cost, giving users flexibility to balance performance and efficiency.
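The paper describes computing a confidence score for each reasoning step (a node in the tree) from token probabilities. A rough sketch, based on that description: each token’s confidence is its probability relative to the top-k alternatives, and a node’s value is the average over all tokens in the step. The exact input format here (`token_logprobs`, with the chosen token’s log probability first) is an assumption for illustration.

```python
import math

def step_confidence(token_logprobs: list[list[float]]) -> float:
    # Each inner list holds log probabilities for one generated token:
    # the chosen token first, followed by its top-k alternatives.
    confidences = []
    for alternatives in token_logprobs:
        exps = [math.exp(lp) for lp in alternatives]
        confidences.append(exps[0] / sum(exps))  # softmax weight of chosen token
    # The node's value is the average confidence across the step; the
    # "granularity" knob controls how many tokens make up one step/node.
    return sum(confidences) / len(confidences)
```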
Another important innovation in Marco-o1 is the introduction of a reflection mechanism. During the reasoning process, the model periodically prompts itself with the phrase, “Wait! Maybe I made some mistakes! I need to start over.” This causes the model to reevaluate its reasoning steps, identify potential errors, and refine its thought process.
“This approach allows the model to act as its own critic, pointing out potential errors in its reasoning,” the researchers write. “By explicitly encouraging the model to question its initial conclusions, we encourage it to rethink and refine its thought process.”
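As a rough illustration, a reflection loop of this kind might look like the sketch below; `generate` is again a hypothetical LLM call, and the loop structure is an assumption rather than Marco-o1’s published code.

```python
REFLECTION_PROMPT = "Wait! Maybe I made some mistakes! I need to start over."

def reason_with_reflection(prompt: str, generate, rounds: int = 2) -> str:
    # Draft an answer, then repeatedly append the self-critique phrase
    # quoted in the paper and ask the model to revise its reasoning.
    transcript = prompt
    answer = generate(transcript)
    for _ in range(rounds):
        transcript += f"\n{answer}\n{REFLECTION_PROMPT}\n"
        answer = generate(transcript)
    return answer
```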
To evaluate Marco-o1’s performance, the researchers ran experiments on a variety of tasks, including the MGSM benchmark, a dataset of multilingual grade-school math problems. Marco-o1 significantly outperformed the base Qwen2-7B-Instruct model, especially when the MCTS component was configured for single-token granularity.
However, the primary purpose of Marco-o1 is to address reasoning challenges in open-ended settings. To that end, the researchers tested the model on translating colloquial and slang expressions, a task that requires understanding the subtle nuances of language, culture and context. The experiments showed that Marco-o1 was able to capture and translate these expressions more effectively than traditional translation tools. For example, the model correctly translated a Chinese colloquial expression that literally means, “This shoe offers a stepping-on-poop sensation,” into the English equivalent, “This shoe has a comfortable sole.” The model’s reasoning chain shows how it evaluated different possible meanings and arrived at the correct translation.
This paradigm can be useful for tasks such as product design and strategy, which require deep and contextual understanding and lack well-defined benchmarks and metrics.
A new wave of reasoning models
Since the release of o1, AI labs have raced to release their own reasoning models. Last week, Chinese AI lab DeepSeek released R1-Lite-Preview, its o1 competitor, which is currently available only through the company’s online chat interface. R1-Lite-Preview reportedly beats o1 on several key benchmarks.
The open-source community is also catching up with the private model market, releasing models and datasets that take advantage of inference-time scaling laws. The Alibaba team has released Marco-o1 on Hugging Face, along with a partial reasoning dataset that researchers can use to train their own reasoning models. Another recently released model is LLaVA-o1, developed by researchers at several universities in China, which brings the inference-time reasoning paradigm to open-source vision-language models (VLMs).
The release of these models comes amid uncertainty about the future of model scaling laws. Various reports suggest that the returns on training bigger models are diminishing and may be hitting a wall. But what is certain is that we are just beginning to explore the possibilities of inference-time scaling.
Credit: venturebeat.com