Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through dynamic integration of external knowledge. To further improve RAG’s flexibility, Agentic RAG introduces autonomous agents into the workflow. However, Agentic RAG faces several challenges: (1) the success of each step depends on both high-quality planning and accurate searching, (2) intermediate reasoning steps lack supervision, and (3) the candidate space is exponentially large. We propose DecoupleSearch, a novel framework that decouples the planning and searching processes via dual value models, enabling independent optimization of plan reasoning and search grounding. Our approach constructs a reasoning tree whose nodes represent both planning and searching steps, assessed via Monte Carlo Tree Search. During inference, Hierarchical Beam Search iteratively refines candidates with the dual value models. Extensive experiments across policy models of varying sizes demonstrate the effectiveness of our method.
🔍 We introduce DecoupleSearch, a novel Agentic RAG framework that decouples the planning and searching processes with dual value models, enabling independent optimization of plan reasoning and search grounding.
🤖 We propose improving the success rate of each step by fully exploring the planning and searching spaces. We utilize MCTS to accurately assess planning and searching quality, while Hierarchical Beam Search is employed to efficiently prune the exponential candidate space.
📊 Extensive experiments on five datasets across policy models of different parameter sizes demonstrate the effectiveness of our method.
MCTS Annotation For each simulation, the algorithm performs four steps:
Selection The \(i\)-th simulation begins with \(s_0\), representing the input query. The algorithm selects nodes according to the Upper Confidence Bound for Trees (UCT) criterion: \[ \mathrm{UCT}(s_t) \;=\; V_s(s_t) \;+\; w \sqrt{\frac{\ln N(\mathrm{parent}(s_t))}{N(s_t)}} \] where \(N(\cdot)\) denotes a node's visit count and \(w\) is the exploration weight balancing exploitation and exploration.
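As a minimal Python sketch (not the authors' implementation), selection can be written as a greedy descent over UCT scores. The `Node` container below is an assumption; it also carries the running values, step rewards, and visit count reused by the later MCTS sketches.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    value_p: float = 0.0      # V_p(s_t): running planning value
    value_s: float = 0.0      # V_s(s_t): running searching value
    reward_p: float = 0.0     # R_p(s_t): step reward assigned during simulation
    reward_s: float = 0.0     # R_s(s_t): step reward assigned during simulation
    visits: int = 0           # N(s_t): visit count
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

def uct_score(node: Node, w: float = 1.0) -> float:
    """UCT(s_t) = V_s(s_t) + w * sqrt(ln N(parent(s_t)) / N(s_t))."""
    if node.visits == 0:
        return float("inf")   # explore unvisited children first
    return node.value_s + w * math.sqrt(math.log(node.parent.visits) / node.visits)

def select(root: Node) -> Node:
    """Descend from the root, always following the child with the highest UCT score."""
    node = root
    while node.children:
        node = max(node.children, key=uct_score)
    return node
```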
Expansion After selecting the node to be expanded, the LLM generates the next plan and query based on the reasoning status. For simplicity, assume the chosen node \(s_t\) corresponds to the intermediate reasoning trajectory \(\tau_{t-1}\). The expansion process is as follows: \[ p_t, q_t = \mathrm{LLM}(\tau_{t-1}) \] \[ d_t = \mathrm{Retrieve}(q_t) \]
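As a rough illustration, the expansion step is one policy call followed by retrieval. The prompt format and the `llm_generate` / `search_engine` callables below are assumptions, not the actual interface used in the paper.

```python
def expand(trajectory: str, llm_generate, search_engine, top_k: int = 5):
    """Generate the next plan p_t and query q_t from tau_{t-1}, then retrieve d_t."""
    # p_t, q_t = LLM(tau_{t-1})
    completion = llm_generate(
        f"Reasoning so far:\n{trajectory}\n"
        "Give the next plan and search query as 'PLAN: ... QUERY: ...':"
    )
    plan_part, _, query_part = completion.partition("QUERY:")
    plan = plan_part.replace("PLAN:", "").strip()
    query = query_part.strip()
    # d_t = Retrieve(q_t)
    documents = search_engine(query, top_k=top_k)
    return plan, query, documents
```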
Simulation The simulation evaluates the quality of planning and searching at each step and assigns reward values. For intermediate nodes, the LLM assesses the quality of planning and searching, assigning a value between \(-1\) and \(1\), where \(1\) indicates high quality and \(-1\) indicates low quality: \[ R_p(s_t), R_s(s_t) = \mathrm{LLM}\bigl(\tau_{t-1},\,p_t,\,q_t\bigr) \]
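A hedged sketch of this scoring step: the judging prompt and the `llm_judge` callable are illustrative; only the \([-1, 1]\) range and the \((R_p, R_s)\) output follow the description above.

```python
def score_step(trajectory: str, plan: str, query: str, llm_judge):
    """Return the step rewards (R_p, R_s), each clipped to [-1, 1]."""
    reply = llm_judge(
        f"Trajectory:\n{trajectory}\nPlan: {plan}\nQuery: {query}\n"
        "Rate plan quality and search quality in [-1, 1], answering 'p s':"
    )
    def clip(raw: str) -> float:
        return max(-1.0, min(1.0, float(raw)))
    p_raw, s_raw = reply.split()[:2]
    return clip(p_raw), clip(s_raw)
```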
Backpropagation At the end of the \(i\)-th simulation, each node along the path from the leaf node \(s_t\) to the root undergoes a backward pass update. Their visit counts and values are updated as follows: \[ N(s) \;\leftarrow\; N(s) + 1 \] \[ V_p(s) \;\leftarrow\; V_p(s) + \frac{1}{N(s)}\bigl(R_p(s) - V_p(s)\bigr) \] \[ V_s(s) \;\leftarrow\; V_s(s) + \frac{1}{N(s)}\bigl(R_s(s) - V_s(s)\bigr) \]
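Reusing the `Node` container from the selection sketch, the backward pass reads the update rule literally: each node's running values are pulled toward its own step rewards while its visit count is incremented.

```python
def backpropagate(leaf: Node) -> None:
    """Walk from the leaf s_t back to the root, updating counts and running values."""
    node = leaf
    while node is not None:
        node.visits += 1                                              # N(s) <- N(s) + 1
        node.value_p += (node.reward_p - node.value_p) / node.visits  # V_p(s) update
        node.value_s += (node.reward_s - node.value_s) / node.visits  # V_s(s) update
        node = node.parent
```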
Model Training In our framework, the policy model \(\pi_{\theta}\) is initialized with a pre-trained LLM. We extend this model to derive the planning value model \(V_\phi\) and searching value model \(V_\psi\) by adding two auxiliary linear layers with a \(\texttt{Tanh}\) activation function. These layers operate alongside the traditional softmax layer responsible for token prediction.
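A minimal PyTorch sketch of this architecture, assuming a Hugging Face causal LM backbone; the head names and last-token pooling are illustrative choices rather than the exact implementation.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class DecoupledValueModel(nn.Module):
    def __init__(self, model_name: str):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        # Two auxiliary linear layers with Tanh, alongside the LM softmax head.
        self.plan_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())
        self.search_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())

    def forward(self, input_ids, attention_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = out.hidden_states[-1][:, -1, :]  # last-token representation
        return (
            out.logits,                      # token prediction (policy pi_theta)
            self.plan_head(last_hidden),     # V_phi: planning value in [-1, 1]
            self.search_head(last_hidden),   # V_psi: searching value in [-1, 1]
        )
```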
To construct the training signals for the policy model and the value models, we sample solution paths from the tree constructed through multiple rounds of MCTS. These paths are denoted as \(\mathbf{x}^+\) (correct solutions) and \(\mathbf{x}^-\) (incorrect solutions). We then apply a multi-task loss function to jointly update all the models:
\[
\small
\begin{split}
\mathcal{L} =\; &- \log \pi_{\theta}\bigl(\mathbf{x}^+ \mid \mathbf{q}\bigr)
+ \beta \sum_{t=1}^{T(\mathbf{x})}
\Bigl(
\bigl\lVert V_{\phi}(\mathbf{s}_t) - V_p(\mathbf{s}_t) \bigr\rVert^2
\;+\;
\bigl\lVert V_{\psi}(\mathbf{s}_t) - V_s(\mathbf{s}_t) \bigr\rVert^2
\Bigr)
\end{split}
\]
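In code, the objective is cross-entropy on the correct trajectory plus mean-squared error between the value heads and their MCTS targets. The sketch below assumes batched tensors with labels padded by -100 outside the target tokens; variable names are illustrative.

```python
import torch.nn.functional as F

def multitask_loss(logits, labels, v_plan_pred, v_plan_target,
                   v_search_pred, v_search_target, beta: float = 1.0):
    """L = -log pi_theta(x+ | q) + beta * (||V_phi - V_p||^2 + ||V_psi - V_s||^2)."""
    # Language-modeling loss on the correct solution path x+ (next-token prediction).
    lm_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    # Value regression against the MCTS-estimated targets V_p(s_t) and V_s(s_t).
    value_loss = (F.mse_loss(v_plan_pred, v_plan_target)
                  + F.mse_loss(v_search_pred, v_search_target))
    return lm_loss + beta * value_loss
```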
Hierarchical Beam Search At each step, the policy model first samples multiple possible plans, which are ranked and filtered by the planning value model.
Based on the most promising plan, the policy model generates multiple search queries, which are used to retrieve relevant documents. The retrieved documents are then evaluated by the searching value model to select the most valuable result.
This iterative process continues until the maximum depth is reached or no further paths can be expanded.
Finally, the answers are evaluated by the planning value model, and the answer with the highest value is selected as the output.
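The loop below is a greedy, single-beam rendering of this procedure for illustration; the callables (`sample_plans`, `plan_value`, `sample_queries`, `retrieve`, `search_value`, `is_final`, `sample_answers`) and the expansion sizes are assumed stand-ins, not the paper's interface.

```python
def hierarchical_beam_search(question, sample_plans, plan_value, sample_queries,
                             retrieve, search_value, is_final, sample_answers,
                             b1: int = 3, b2: int = 5, max_depth: int = 8):
    """Alternate plan ranking (planning value) and result ranking (searching value)."""
    trajectory = question
    for _ in range(max_depth):
        # 1. Sample candidate plans; the planning value model picks the most promising.
        plans = sample_plans(trajectory, n=b1)
        best_plan = max(plans, key=lambda p: plan_value(trajectory, p))
        if is_final(best_plan):              # the plan indicates we can answer already
            break
        # 2. Sample queries for that plan, retrieve, and keep the top-scored result.
        queries = sample_queries(trajectory, best_plan, n=b2)
        candidates = [(q, retrieve(q)) for q in queries]
        best_query, best_docs = max(
            candidates, key=lambda c: search_value(trajectory, best_plan, c[1])
        )
        trajectory += f"\nPlan: {best_plan}\nQuery: {best_query}\nDocs: {best_docs}"
    # 3. Candidate answers are scored by the planning value model; the highest wins.
    answers = sample_answers(trajectory, n=b1)
    return max(answers, key=lambda a: plan_value(trajectory, a))
```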
Main Results The above table presents the results of experiments conducted on five QA datasets using two model backbones.
First, our method achieves superior performance on all datasets across different policy models. Notably, with Qwen2.5-7B-Instruct-1M as the policy model, DecoupleSearch achieves a 25.8% relative average improvement over the best-performing baseline. This improvement is attributed to Hierarchical Beam Search over both the planning and searching spaces, which explores each thoroughly and significantly increases the likelihood of identifying the correct reasoning path.
Second, agentic RAG baselines outperform both prompting and advanced RAG methods. This is primarily due to the flexibility agentic RAG provides, allowing the policy model to dynamically decide what to retrieve and when to retrieve. Such dynamic retrieval is especially important for complex, multi-step queries, as demonstrated by strong performance on datasets like Bamboogle.
Third, larger policy models generally yield better performance. However, after applying Hierarchical Beam Search, DecoupleSearch with the 7B policy model becomes comparable to the 14B model, highlighting that inference-time scaling can enable smaller models to achieve competitive results.
Ablation Study We analyze the effectiveness of planning expansion and searching expansion by removing these components and observing the resulting performance changes.
Removing either expansion degrades performance. Excluding planning or searching expansion leads to a clear drop in accuracy, underscoring the need to explore both the planning and searching spaces thoroughly.
Planning expansion is more critical. Omitting the planning expansion causes a larger performance decline, since the plan defines the subsequent search space—suboptimal plans make it difficult to retrieve high-quality results.
Scaling with Planning and Searching We conduct experiments on the HotpotQA, 2WikiMultihopQA, and MuSiQue datasets, varying the planning and searching expansion sizes within the range of 1 to 5.
Model performance peaks at a planning expansion size of 3. Smaller values limit exploration of plan candidates, while larger values overwhelm the planning value model’s ranking capacity, causing accuracy to drop.
Larger searching expansion sizes generally improve performance. Increasing the searching expansion size \(B_2\) raises the chance of retrieving critical evidence, and the searching value model can effectively rank results by directly evaluating document relevance.
Effectiveness of Value Models We evaluate the accuracy of our planning and searching value models by comparing their beam ranking against random selection.
Ranking by the learned value model outperforms random selection. For both planning expansion and searching expansion, using the value model to rank candidates yields higher performance than random choice, confirming that both value heads accurately assess quality.
Performance gains are more pronounced for searching expansion. Evaluating search results is relatively easier—values directly reflect answer presence—whereas assessing plans lacks clear patterns, making plan ranking more challenging.
Prioritize searching expansion when resources are limited. With constrained compute, allocating more beams to searching expansion is often more robust, given the higher reliability of the searching value model.
Case Study We present a case study from the MuSiQue dataset.
The policy model first proposes different plans (e.g., look up her spouse vs. her lineage), and the planning value model ranks the spouse search highest. It then issues “Who is Gulcicek Hatun’s husband?” to find Murad I, prunes unrelated hits, and follows up with “Who was Murad I’s father?” to identify Orhan Ghazi, again discarding low-value results. In just a few targeted steps, it pinpoints the father-in-law, demonstrating how planning expansion broadens the search space while value scoring zeroes in on the best leads.