Meta Chain-of-Thought: Unlocking System 2 Reasoning in LLMs
Abstract
Join Our Research
With the scaling of pre-training improvements plateauing, we at SynthLabs believe that the next iteration of AI development and capabilities will come from post-training and Reinforcement Learning (RL) based on synthetic data! Last year we developed a novel fine-tuning technique, Direct Principle Feedback (DPF), showcasing how we can steer LLM behavior using only synthetic data.
Our recent research on Generative Reward Models (GenRM) demonstrated that combining human and AI feedback creates more robust and generalizable reward models. We believe that advanced reasoning capabilities are the next major frontier for AI development and that synthetic data will play the key role in tackling those challenges.
If you're excited about open research and tackling fundamental questions in AI, we want to hear from you.
Reasoning in LLMs and Meta Chain-of-Thought
An example of how to derive the final CoT solution from Meta-CoT using Game of 24.
Despite significant improvements in capabilities, current generations of LLMs still struggle with challenging reasoning problems, such as advanced math and planning. We propose a framework to explain these empirical observations based on task complexity. Standard model training pipelines now include corpora of reasoning data such as math, science, and logic. However, we argue that the solutions presented in these datasets do not reflect the true data-generation process. For example, complex proofs presented in math textbooks are the product of a long, complex, and exhaustive reasoning process that is not represented in the text. This makes the learning problem very challenging: the model needs to internalize, in its activations, the entire data-generating process that produces the final solution, which can be arbitrarily complex.
As an example, consider the windmill problem from IMO 2011 below:
The solution is only a few sentences long and does not require any pre-existing knowledge or techniques, yet the problem is widely considered one of the most challenging in the competition (and perhaps the most difficult that year). The reason is that the solution does not follow "algorithmic thinking": standard approaches to breaking down the problem, such as convex hull configurations or Hamiltonian graphs, do not yield a correct solution. Indeed, the few contestants who solved the problem spent their time trying out different configurations until finding the correct one. Even when reading the provided solution, it is unclear why the construction works until the very end. If we imagine a model trained with next-token prediction to solve such tasks, it would have to internalize the entire trial-and-error exploration that finds this construction by the time it starts generating the solution.
The natural question to ask, then, is whether this is a fundamental limitation of auto-regressive transformer models that cannot be overcome. Many leading researchers have made that point. We fundamentally do not believe this is the case; after all, human brains do sequential processing with finite compute capacity as well. Instead, we argue that this is an issue of algorithms and training data. If we can train the model on the problem-solving process itself, rather than only the final solution, the model can internalize how to think about reasoning tasks, not just what to think. We argue that this is precisely what the current generation of reasoning models implements.
Top: Performance of current frontier models by size on the HARP mathematics benchmark by difficulty level and topic. The OpenAI o1 series significantly outperforms prior-generation models across the board. Bottom: Average number of tokens generated by each model grouped by difficulty level, as well as the average number of tokens in human-generated solutions (using the GPT-4 tokenizer). Sources: Figures 3 and 4, respectively, from Yue et al., 2024.
We also present results from Omni-Math, a new challenging benchmark of Olympiad-level high-school mathematics. At lower difficulty levels, standard LLMs produce solutions with accuracy comparable to the OpenAI o1 series, but o1 significantly outperforms them on more challenging problems. At the same time, standard models produce solutions with lengths comparable to the human-written ones represented in their training data. While this is also true of o1 on lower-difficulty problems (which may follow more algorithmic solution templates), o1 progressively uses more compute as problem difficulty increases, consistent with how mathematicians spend more time and effort to actually produce a correct solution. This suggests a generalization of standard CoT, in which the final solution CoT is itself the output of a complex reasoning process, which we will refer to as Meta-CoT.
Learning How to Think, Not What to Think
We argue that in challenging reasoning domains, where a generator-verifier gap exists, the Meta-CoT process is fundamentally represented as a search process. This certainly reflects the way a mathematician might think about a problem: investigate different approaches, truncate dead ends, re-evaluate intermediate results, try alternatives, and make progress towards a goal based on intuition.
Indeed, search has proven to yield significant capability improvements in games, achieving super-human performance in many challenging tournaments (e.g., AlphaZero in Go). Recently, similar findings with simpler search approaches using LLMs in combination with trained verifiers have shown promise in challenging reasoning domains such as math.
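As a concrete illustration, here is a minimal sketch of the simplest such approach: best-of-N sampling ranked by a trained verifier. The `generate_solutions` and `score_solution` callables are hypothetical stand-ins for an LLM sampling call and a learned verifier (e.g., an outcome reward model); they are assumptions for the example rather than a reference to any specific implementation.

```python
# A sketch of best-of-N sampling against a trained verifier.
# `generate_solutions` and `score_solution` are hypothetical stand-ins for an
# LLM sampling call and a learned verifier (e.g., an outcome reward model).
from typing import Callable, List, Tuple

def best_of_n(
    problem: str,
    generate_solutions: Callable[[str, int], List[str]],  # sample n candidate solutions
    score_solution: Callable[[str, str], float],          # verifier score, higher is better
    n: int = 16,
) -> Tuple[str, float]:
    """Sample n candidates and return the one the verifier ranks highest."""
    candidates = generate_solutions(problem, n)
    scored = [(solution, score_solution(problem, solution)) for solution in candidates]
    return max(scored, key=lambda pair: pair[1])
```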
So, is search with LLMs enough then? We argue that this is not the final goal. To begin with, human cognition still works in an autoregressive, stream-of-consciousness way, unlike the tree search approaches we use in AI. Hence, it should be possible to internalize a search procedure inside an LLM. Moreover, such an approach would have two main additional benefits:
- First, it would be significantly more efficient, since the model will have all the visited paths and ideas in context and hence would not repeat them (or at least to a lesser extent), which is a persistent issue in language tree search approaches.
- Second, we believe internalizing a thinking process inside an LLM allows us to post-train it with reinforcement learning. However, this becomes a meta-RL problem in which we search over all thinking processes that the model can implement. We believe superintelligence isn't about discovering new things; it's about discovering new ways to discover.
We propose training autoregressive LLMs to internalize a search procedure using standard next-token prediction on the search process itself - i.e. the sequence of all visited nodes during a search procedure.
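As a rough illustration of what such training data could look like, the sketch below serializes the visit order of a small search tree, including dead ends and backtracking, into a single string that could serve as a next-token-prediction target. The node contents, the `<step>`/`<backtrack>`/`<solution>` tag format, and the Game-of-24-style example are illustrative assumptions, not a prescribed trace format.

```python
# A sketch of serializing a search tree's visit order (including dead ends and
# backtracking) into a single training string for next-token prediction.
# The tag format and the example puzzle are illustrative.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    thought: str                              # an intermediate reasoning step
    children: List["Node"] = field(default_factory=list)
    is_solution: bool = False

def linearize(node: Node, depth: int = 0) -> Tuple[str, bool]:
    """Depth-first traversal; returns (trace text, whether a solution was found)."""
    lines = [f"<step depth={depth}> {node.thought}"]
    if node.is_solution:
        lines.append("<solution reached>")
        return "\n".join(lines), True
    solved = False
    for child in node.children:
        child_text, solved = linearize(child, depth + 1)
        lines.append(child_text)
        if solved:
            break                             # stop once a solution branch is found
        lines.append(f"<backtrack to depth={depth}>")
    return "\n".join(lines), solved

# Tiny Game-of-24-style example: the first branch is a dead end.
root = Node("make 24 from 4, 7, 8, 8", children=[
    Node("8 * 7 = 56, too large, dead end"),
    Node("8 / 8 = 1, then 7 - 1 = 6", children=[
        Node("4 * 6 = 24", is_solution=True),
    ]),
])
trace, _ = linearize(root)
print(trace)  # this linearized trace is the next-token-prediction target
```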
It's Not RL, It's Meta-RL
The paradigm of the RL² formulation of meta-reinforcement learning. For each new task (prompt), the agent explores its environment over multiple episodes, keeping all the experience in context and maximizing reward over the entire interaction. Source: Figure 1 in Duan et al., 2016.
In the previous section we outlined how an LLM should implement a thinking procedure (search) for producing a solution, rather than directly generating it. If we focus on problems with ground-truth verifiable outcomes, such as math, science, or code, we could follow a standard RL post-training procedure to further improve model performance. However, there are subtle differences here compared to standard post-training. Since the model implements a thinking procedure in-context, any RL gradient update essentially implements a new thinking algorithm, potentially discovering novel approaches to thinking. We will make this argument more rigorous below. Given a problem (task) $q$ and a policy (LLM) $\pi_\theta$ that produces a solution $S$, the standard RL problem optimizes

$$\max_\theta \; \mathbb{E}_{q \sim \mathcal{D}}\,\mathbb{E}_{S \sim \pi_\theta(\cdot \mid q)}\big[r(S, q)\big],$$

where $r$ is some reward function, such as a verifier checking that the final solution is correct or unit tests passing in coding domains. In contrast, a meta-RL procedure considers a distribution of tasks $\mathcal{M}_i \sim p(\mathcal{M})$, which in our case are reasoning problems, and an adaptation procedure $\phi_i = F_{\text{adapt}}(\theta, \mathcal{M}_i)$ of the model, and solves the optimization problem

$$\max_\theta \; \mathbb{E}_{\mathcal{M}_i \sim p(\mathcal{M})}\,\mathbb{E}_{S \sim \pi_{\phi_i}(\cdot \mid \mathcal{M}_i)}\big[r(S, \mathcal{M}_i)\big], \qquad \phi_i = F_{\text{adapt}}(\theta, \mathcal{M}_i).$$

The exact adaptation procedure varies by algorithm design, but the RL² algorithm, and particularly the E-RL² version, are especially suitable for applications with LLMs, due to their powerful in-context learning abilities and large parameter count, which make alternatives like MAML impractical. In the E-RL² paradigm, the policy can interact with an MDP (problem $\mathcal{M}_i$) for $K$ episodes (solution attempts), keeping all prior attempts in context, and the optimization objective is

$$\max_\theta \; \mathbb{E}_{\mathcal{M}_i \sim p(\mathcal{M})}\,\mathbb{E}_{S^1, \dots, S^K \sim \pi_\theta(\cdot \mid \mathcal{M}_i)}\big[r(S^K, \mathcal{M}_i)\big],$$

where each attempt $S^k$ is generated conditioned on the problem and all previous attempts in context, and only the final attempt $S^K$ is rewarded. Here the adaptation procedure is the policy's own in-context exploration over the earlier attempts, which is a stochastic update and requires RL to optimize (a minimal code sketch of this objective is given after the list below). There are a few extra points to make in this case:
- If we allow the model to terminate an episode early and reset to a random state in context, then instead of episodic exploration it can implement any general tree search algorithm in context as discussed in the prior section.
- Technically the Meta-RL formulation requires environment rewards or feedback, which in most reasoning problems, such as math, are not immediately available. However, in the presence of a generator-verifier gap, the model still benefits from online exploration if it can internalize self-verification.
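Below is the minimal sketch referenced above: an E-RL²-style loop in which the model makes $K$ solution attempts in context and only the final attempt is rewarded. The `sample_attempt`, `verify`, and `policy_gradient_step` callables are hypothetical stand-ins for LLM generation, an outcome verifier, and an RL update (e.g., PPO), and the attempt-tag format is likewise an assumption.

```python
# A sketch of an E-RL^2-style objective: K in-context solution attempts,
# reward only on the final attempt. `sample_attempt`, `verify`, and
# `policy_gradient_step` are hypothetical stand-ins (LLM generation, an
# outcome verifier, and an RL update such as PPO).
from typing import Callable, List, Tuple

def e_rl2_rollout(
    problem: str,
    sample_attempt: Callable[[str], str],   # next attempt given the full context
    verify: Callable[[str, str], float],    # 1.0 if the attempt is correct, else 0.0
    num_attempts: int = 4,
) -> Tuple[str, float]:
    """Generate K attempts in context; earlier attempts are unrewarded exploration."""
    context, attempt = problem, ""
    for k in range(num_attempts):
        attempt = sample_attempt(context)
        context += f"\n<attempt {k + 1}>\n{attempt}"
    return context, verify(problem, attempt)  # only the final attempt is scored

def train(problems: List[str], sample_attempt, verify, policy_gradient_step):
    for problem in problems:
        trajectory, reward = e_rl2_rollout(problem, sample_attempt, verify)
        # Optimizing theta here searches over in-context exploration strategies,
        # not just over single solutions.
        policy_gradient_step(trajectory, reward)
```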
Now notice that if we optimize the parameters $\theta$ of the above policy, this becomes a search procedure over adaptation algorithms $F_{\text{adapt}}$. That is, we can pre-train the model with some algorithm that implements exploration in a reasoning space, but the post-trained model can discover novel exploration approaches that significantly outperform whatever human-designed reasoning strategy was built into the pre-trained model. This could, in theory, allow the reasoning model to solve novel classes of problems that are not solvable under the symbolic search approaches we build into the model. Whether this is realized in modern LLMs with advanced reasoning training remains an open empirical question.
Ongoing Work
Open research on reasoning is currently bottlenecked by both access to large, high-quality datasets and performant RL training infrastructure. To fix this, we are actively composing the "Big Math" dataset: 1,000,000 high-quality, diverse math problems with verifiable answers, drawn from both existing datasets and novel sources. In addition, we are developing efficient, scalable online RL infrastructure in NeoX to support asynchronous training and inference.
Big Math Project
The RL training pipeline we propose requires a large dataset of problems with verifiable answers. While there are existing datasets such as MATH, they are limited in scope with only ~10,000 problems in total. To overcome this data limitation, we have been working on "Big Math", an effort to aggregate 1,000,000 high-quality verifiable math problems.
We have identified three criteria for creating such a high-quality dataset: (1) each problem must be uniquely verifiable, admitting only a single correct answer; (2) each problem must be open-ended, so it is not easily guessable (e.g., no multiple-choice questions); (3) each problem should have a closed-form solution that can be evaluated automatically (e.g., scalars or expressions, not proofs).
We have been sourcing math problems by filtering existing datasets, including MATH (math competition problems), Numina Math, HARP, Omni-Math, and Open Math Instruct2. Together, these datasets include math problems ranging from grade-school math and competition problems from AMC and AIME to Olympiad-level problems, as well as synthetically generated problems. Applying all of these criteria reduces the total from roughly 1.5 million unfiltered problems to approximately 700k. However, this does not yet account for how useful the problems are for training itself.
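For illustration, here is a minimal sketch of how the three criteria might be applied as dataset filters. The record fields (`problem`, `final_answer`) and the heuristics are hypothetical; the actual Big Math pipeline involves additional checks beyond what is shown here.

```python
# A sketch of the three filtering criteria. The record fields ("problem",
# "final_answer") and the heuristics below are hypothetical; the real pipeline
# includes additional model-based and manual checks.
from typing import Dict, List

def is_multiple_choice(problem: str) -> bool:
    # Crude heuristic: lettered answer options suggest a multiple-choice problem.
    return any(marker in problem for marker in ["(A)", "(B)", "(C)", "(D)", "(E)"])

def has_closed_form_answer(answer: str) -> bool:
    # Keep short scalar/expression answers; drop proofs and long derivations.
    return 0 < len(answer.strip()) <= 30 and "prove" not in answer.lower()

def keep_problem(record: Dict[str, str]) -> bool:
    problem, answer = record.get("problem", ""), record.get("final_answer", "")
    return (
        bool(answer)                         # (1) a single verifiable final answer
        and not is_multiple_choice(problem)  # (2) open-ended, not easily guessable
        and has_closed_form_answer(answer)   # (3) closed-form, automatically checkable
    )

def filter_dataset(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    return [record for record in records if keep_problem(record)]
```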
Online RL Training Infrastructure
For the type of RL training we want to do, we need infrastructure that can scale to multiple nodes easily while supporting high-throughput inference and allowing for interleaving both inference and training in a flexible and efficient manner.
We are actively adding such capabilities to a fork of the open-source framework GPT-NeoX, a distributed training library for LLMs. We've seen firsthand that a fully parallel setup requires coordination to make sure the training and inference processes do not fall out of sync with each other, ensuring that RL training remains on-policy. We are working on efficient, separate training and inference setups that can handle inference-heavy workloads (e.g., using MCTS) while running asynchronously. If you are interested in distributed training and RL infrastructure, we are looking for collaborators and engineers to help us develop and scale this.
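As a rough sketch of the coordination problem, the example below shows one common pattern: actors tag each rollout with the policy version that generated it, and the learner discards rollouts that are too stale before taking a gradient step. This is not the GPT-NeoX implementation; the queue-based layout, batch size, and staleness bound are illustrative assumptions.

```python
# A sketch of keeping asynchronous inference and training loosely on-policy.
# This is not the GPT-NeoX implementation; the queue-based layout, batch size,
# and staleness bound are illustrative assumptions.
import queue
from dataclasses import dataclass
from typing import List

MAX_POLICY_LAG = 1   # discard rollouts generated more than one policy version ago
BATCH_SIZE = 32

@dataclass
class Rollout:
    policy_version: int   # version of the weights that generated this rollout
    tokens: List[int]
    reward: float

def actor_loop(rollout_queue: "queue.Queue[Rollout]", get_version, generate_rollout):
    """Inference process: generate rollouts and tag them with the current policy version."""
    while True:
        tokens, reward = generate_rollout()
        rollout_queue.put(Rollout(get_version(), tokens, reward))

def learner_loop(rollout_queue: "queue.Queue[Rollout]", get_version, train_step, bump_version):
    """Training process: drop stale rollouts, then update and publish new weights."""
    batch: List[Rollout] = []
    while True:
        rollout = rollout_queue.get()
        if get_version() - rollout.policy_version > MAX_POLICY_LAG:
            continue  # too stale; training on it would drift off-policy
        batch.append(rollout)
        if len(batch) == BATCH_SIZE:
            train_step(batch)   # gradient update on near-on-policy data
            bump_version()      # actors pick up the new weights on their next query
            batch.clear()
```

Tightening `MAX_POLICY_LAG` keeps training closer to on-policy at the cost of discarding more inference work, which is exactly the trade-off this coordination has to manage.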
Open Questions
We outlined the Meta-CoT framework above and how it lets us interpret the problem of getting an LLM to reason as a meta-RL problem. However, one major question is whether meta-RL is sufficient to achieve super-intelligence: can running RL on a model capable of in-context search lead it to discover novel algorithms (emergent capabilities) that solve problems symbolic search strategies could not solve before? Another natural question is whether running online search on a reasoning model can still improve performance, or whether, after distillation on search traces plus RL training, its performance is saturated.
The scaling laws of search for LLMs have not been thoroughly evaluated, in particular how jointly scaling a verifier and a policy behaves. Search itself also remains under-explored: given a trained verifier, how do different search strategies such as BFS/DFS, A*, and MCTS perform, and what is their effect on the reasoning model at both train and test time?
Additionally, we are looking at how tools and external environments can be integrated into the model's reasoning trace. Current reasoning models do not generalize outside of their training domains (e.g. math and code) - how can the in-context search process be extended for novel domains that are non-trivial to verify and evaluate?
If you are interested in any of these questions, we are looking for researchers, engineers, and collaborators to join us on our mission to answer these in the open.