We propose a novel approach to learning relational policies for classical planning based on learning to rank actions. We introduce a new graph representation that explicitly captures action information and propose a Graph Neural Network (GNN) architecture augmented with Gated Recurrent Units (GRUs) to learn action rankings. Unlike value-function based approaches, which must learn a globally consistent function, our action-ranking method only needs to learn a locally consistent ranking. Our model is trained on data generated from small problem instances that are easily solved by planners and is applied to significantly larger instances where planning is computationally prohibitive. Experimental results across standard planning benchmarks demonstrate that our action-ranking approach not only generalizes better to problems larger than those used in training but also outperforms multiple baseline methods, both value-function and action-ranking based, in success rate and plan quality.
Traditional planners rely on heuristic search. While they can find optimal solutions, they scale poorly: on large, complex problems they become prohibitively expensive or too slow.
Whereas standard Reinforcement Learning can require millions of examples, the relational structure of PDDL (Planning Domain Definition Language) allows Graph Neural Networks (GNNs) to learn generalizable policies from just thousands of examples generated from small, easily solved problems.
The core idea is to train a model on small problem instances (e.g., 6-9 block problems) and then apply that learned policy to solve significantly larger instances (e.g., 40-block problems) where classical planners fail.
Many approaches try to learn a value function, V(s), that estimates the cost-to-goal from any state. This is extremely difficult: the function must be globally consistent across the entire state space, and since optimal planning is NP-hard, accurate value functions are hard to learn and often generalize poorly to larger problems.
Models like Action Schema Networks (ASNets) use a fixed-depth receptive field. This design limits their ability to reason about long-range dependencies between objects and actions, a capability that is crucial in complex problems.
Previous GNN approaches (like GPL) focus on state representations but do not explicitly model how actions themselves relate to objects. This misses critical information about action applicability and relationships.
Instead of learning a complex, globally consistent value function V(s) over all states, our approach, GABAR, learns a simpler, locally consistent function that only ranks the applicable actions in the *current* state. This is a more tractable and more generalizable learning problem.
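To make the contrast concrete, here is a minimal PyTorch sketch comparing the two objectives. The hinge-style ranking loss is illustrative, not necessarily the paper's exact loss; the point is that it only constrains scores among actions applicable in a single state, whereas value regression needs targets that are consistent everywhere.

```python
import torch
import torch.nn.functional as F

def value_regression_loss(pred_values, target_costs):
    # Value-function baseline: regress cost-to-goal for every state.
    # Targets must be globally consistent across the whole state space.
    return F.mse_loss(pred_values, target_costs)

def pairwise_ranking_loss(scores, best_idx, margin=1.0):
    # Ranking objective: the expert action need only outscore the other
    # actions applicable in the *same* state (a purely local constraint).
    # scores: (num_applicable_actions,); best_idx: expert action's index.
    others = torch.cat([scores[:best_idx], scores[best_idx + 1:]])
    return F.relu(margin - (scores[best_idx] - others)).mean()
```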
Our novel graph representation includes 'action nodes' that are explicitly connected to the 'object nodes' they take as arguments. This directly encodes action applicability and relational structure into the GNN.
A Gated Recurrent Unit (GRU)-based decoder constructs the final action step-by-step. It first selects an action schema (e.g., 'stack') and then sequentially selects each parameter (e.g., 'blockA', then 'blockB'), conditioning each choice on the previous ones. This handles complex dependencies between parameters.
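A minimal sketch of such a decoder, with hypothetical layer choices (a `GRUCell` recurrence and a bilinear object-scoring head) standing in for the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    """Illustrative GRU decoder: pick a schema, then each parameter in turn."""

    def __init__(self, hidden_dim, num_schemas):
        super().__init__()
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.schema_head = nn.Linear(hidden_dim, num_schemas)
        self.object_head = nn.Bilinear(hidden_dim, hidden_dim, 1)
        self.schema_emb = nn.Embedding(num_schemas, hidden_dim)

    def forward(self, global_emb, object_embs, arity):
        # Step 1: choose the action schema from the global embedding.
        h = global_emb                                   # (1, hidden_dim)
        schema = self.schema_head(h).argmax(dim=-1)      # greedy for brevity
        h = self.gru(self.schema_emb(schema), h)

        # Step 2: choose each parameter; the recurrent state h carries
        # the schema and all previously chosen objects, so later picks
        # condition on earlier ones. `arity` comes from the chosen schema.
        params = []
        for _ in range(arity):
            scores = self.object_head(
                h.expand(object_embs.size(0), -1), object_embs).squeeze(-1)
            idx = scores.argmax()
            params.append(idx)
            h = self.gru(object_embs[idx].unsqueeze(0), h)
        return schema, params
```

Feeding each selected object's embedding back through the GRU is what lets a choice like the destination block in 'stack' depend on which block was chosen first.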
The graph consists of object nodes, predicate nodes, action nodes, and a single global node. Edges represent predicate-object and action-object relationships, encoding the current state, the goal, and all potential actions.
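The sketch below illustrates this encoding with networkx; the grounded-task interface (`task.objects`, `task.atoms`, `task.ground_actions`) is hypothetical, and the paper's exact node and edge attributes may differ.

```python
import networkx as nx

def build_graph(task):
    g = nx.Graph()
    g.add_node("global", kind="global")          # one global node

    for obj in task.objects:                     # object nodes
        g.add_node(obj, kind="object")
        g.add_edge("global", obj, kind="global-object")

    for atom in task.atoms:                      # predicate nodes (state + goal)
        node = f"atom:{atom}"
        g.add_node(node, kind="predicate", in_goal=atom in task.goal)
        for pos, obj in enumerate(atom.args):    # argument position on the edge
            g.add_edge(node, obj, kind="pred-object", position=pos)

    for action in task.ground_actions:           # action nodes
        node = f"act:{action}"
        g.add_node(node, kind="action", schema=action.schema)
        for pos, obj in enumerate(action.args):  # parameter position on the edge
            g.add_edge(node, obj, kind="action-object", position=pos)
    return g
```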
The GNN performs L rounds of message passing (we use 9) where nodes and edges update their embeddings. Attention mechanisms weight the importance of different messages, and the global node aggregates graph-level information for rapid propagation.
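A single round might look like the simplified sketch below, which substitutes a sigmoid gate for the paper's full attention mechanism and shares one GRU update between nodes and the global embedding:

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One of the L=9 rounds: nodes aggregate gated messages from
    neighbors; the global node reads a summary of all nodes."""

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.att = nn.Linear(2 * dim, 1)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, h, edges, g):
        # h: (N, dim) node embeddings; edges: (E, 2) long index pairs;
        # g: (1, dim) global embedding.
        src, dst = edges[:, 0], edges[:, 1]
        pair = torch.cat([h[src], h[dst]], dim=-1)
        m = self.msg(pair) * torch.sigmoid(self.att(pair))  # gated messages
        agg = torch.zeros_like(h).index_add_(0, dst, m)     # sum per receiver
        h = self.upd(agg, h)                                # GRU node update
        g = self.upd(h.mean(dim=0, keepdim=True), g)        # global readout
        return h, g
```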
The final global embedding is fed into a GRU decoder. It first selects the highest-scoring action schema, then autoregressively selects each object parameter. We use a beam search (width k=2) to explore multiple high-scoring actions in parallel.
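A generic beam-search skeleton over partial actions (schema first, then each parameter), with hypothetical `expand_fn`/`score_fn` callbacks supplied by the decoder:

```python
import heapq

def beam_search_decode(score_fn, expand_fn, initial, width=2, steps=3):
    """Keep the `width` best partial actions at every decoding step.
    `steps` is 1 (schema) plus the schema's arity; `expand_fn` yields
    candidate extensions of a partial action, `score_fn` scores one."""
    beam = [(0.0, initial)]
    for _ in range(steps):
        candidates = []
        for logp, partial in beam:
            for ext in expand_fn(partial):
                candidates.append((logp + score_fn(ext), ext))
        beam = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]  # best complete action
```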
GABAR's coverage (success rate) drops minimally as problems get harder: 95.5% on Easy, 92.2% on Medium, and 89.2% on Hard. In contrast, baseline methods like GPL and GRAPL see their coverage collapse to 6.5% and 22.1% on hard problems, respectively.
In the Visitall domain, the model was trained on problems with 9-36 cells (up to 49 in validation). It successfully solved test problems with up to 400 cells, more than 8 times larger than the largest training instances, demonstrating powerful generalization.
On hard test instances, LLMs like OpenAI-O3 and Gemini-2.5-Pro solve almost no problems (0.4% and 1.5% coverage, respectively). GABAR's architecture, which explicitly learns structural relationships, maintains 89.2% coverage, highlighting the limitations of pure pattern matching for complex reasoning.
Coverage (C↑, % of test problems solved) and plan quality ratio (P↑) on Easy (E), Medium (M), and Hard (H) instances. GPL, ASNets, and GRAPL are baselines; GABAR-ACT, GABAR-CD, and GABAR-RANK are ablations of GABAR.

| Domain | Diff | GPL C↑ | GPL P↑ | ASNets C↑ | ASNets P↑ | GRAPL C↑ | GRAPL P↑ | GABAR C↑ | GABAR P↑ | GABAR-ACT C↑ | GABAR-ACT P↑ | GABAR-CD C↑ | GABAR-CD P↑ | GABAR-RANK C↑ | GABAR-RANK P↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blocks | E | 100 | 1.1 | 100 | 1.6 | 64 | 0.65 | 100 | 1.5 | 44 | 0.65 | 100 | 0.92 | 29 | 0.79 |
| | M | 45 | 0.68 | 100 | 1.5 | 48 | 0.44 | 100 | 1.6 | 14 | 0.49 | 92 | 0.81 | 21 | 0.71 |
| | H | 10 | 0.33 | 92 | 1.4 | 38 | 0.28 | 100 | 1.7 | 4 | 0.35 | 81 | 0.80 | 9 | 0.61 |
| Miconic | E | 97 | 0.97 | 100 | 1.0 | 68 | 0.56 | 100 | 1.0 | 35 | 0.55 | 97 | 0.88 | 42 | 0.67 |
| | M | 37 | 0.56 | 100 | 0.98 | 65 | 0.54 | 100 | 0.97 | 18 | 0.33 | 94 | 0.86 | 29 | 0.37 |
| | H | 19 | 0.29 | 90 | 0.92 | 60 | 0.49 | 100 | 0.95 | 2 | 0.27 | 88 | 0.83 | 16 | 0.29 |
| Spanner | E | 73 | 1.1 | 78 | 0.86 | 22 | 0.65 | 94 | 1.1 | 31 | 0.65 | 87 | 0.98 | 57 | 0.82 |
| | M | 42 | 0.56 | 60 | 0.69 | 5 | 0.55 | 93 | 0.99 | 11 | 0.27 | 81 | 0.93 | 42 | 0.77 |
| | H | 3 | 0.18 | 42 | 0.61 | 0 | - | 89 | 0.91 | 0 | - | 62 | 0.79 | 12 | 0.45 |
| Gripper | E | 100 | 1.0 | 78 | 0.98 | 26 | 0.95 | 100 | 1.1 | 31 | 0.56 | 95 | 1.0 | 55 | 0.58 |
| | M | 56 | 0.85 | 54 | 0.91 | 12 | 0.67 | 100 | 0.99 | 23 | 0.40 | 92 | 0.93 | 43 | 0.41 |
| | H | 21 | 0.74 | 42 | 0.88 | 0 | - | 100 | 0.96 | 9 | 0.28 | 87 | 0.86 | 21 | 0.33 |
| Visitall | E | 69 | 1.3 | 94 | 0.96 | 92 | 1.1 | 93 | 1.1 | 72 | 1.2 | 91 | 1.1 | 52 | 0.64 |
| | M | 15 | 0.76 | 86 | 0.93 | 88 | 1.0 | 91 | 1.0 | 64 | 0.93 | 89 | 1.1 | 46 | 0.56 |
| | H | 0 | 0 | 64 | 0.81 | 78 | 0.99 | 88 | 1.1 | 44 | 0.67 | 83 | 1.2 | 39 | 0.54 |
| Grid | E | 74 | 0.89 | 52 | 0.81 | 20 | 0.38 | 100 | 0.91 | 21 | 0.56 | 79 | 0.87 | 17 | 0.54 |
| | M | 17 | 0.61 | 45 | 0.66 | 3 | 0.28 | 97 | 0.85 | 8 | 0.46 | 71 | 0.65 | 12 | 0.28 |
| | H | 0 | 0 | 21 | 0.60 | 0 | - | 92 | 0.74 | 0 | - | 54 | 0.53 | 0 | - |
| Logistics | E | 56 | 0.61 | 39 | 0.71 | 32 | 0.81 | 90 | 0.75 | 12 | 0.64 | 31 | 0.86 | 41 | 0.65 |
| | M | 7 | 0.21 | 22 | 0.55 | 9 | 0.45 | 76 | 0.65 | 3 | 0.49 | 25 | 0.54 | 21 | 0.49 |
| | H | 0 | 0 | 4 | 0.39 | 0 | - | 71 | 0.59 | 0 | - | 6 | 0.35 | 0 | - |
| Rovers | E | 64 | 0.99 | 67 | 0.96 | 21 | 0.35 | 87 | 1.0 | 22 | 0.75 | 44 | 0.81 | 33 | 0.67 |
| | M | 9 | 0.32 | 56 | 0.87 | 5 | 0.19 | 82 | 0.96 | 6 | 0.66 | 37 | 0.63 | 9 | 0.56 |
| | H | 0 | 0 | 31 | 0.64 | 0 | - | 77 | 0.97 | 0 | - | 19 | 0.57 | 0 | - |
| Combined | E | 79.1 | 0.98 | 76 | 0.98 | 43.5 | 0.67 | 95.5 | 1.04 | 33.5 | 0.69 | 78 | 0.93 | 40.2 | 0.67 |
| | M | 28.5 | 0.56 | 65.4 | 0.88 | 29.3 | 0.51* | 92.2 | 1.01 | 18.4 | 0.50 | 72.7 | 0.80 | 27.8 | 0.51 |
| | H | 6.5 | 0.39* | 48.5 | 0.78 | 22.1 | 0.58* | 89.2 | 0.99 | 7.4 | 0.39* | 60 | 0.73 | 12.1 | 0.44* |
Coverage (C↑) and plan quality (P↑) of LLM baselines versus GABAR on three domains.

| Domain | Diff | OpenAI-O3 C↑ | OpenAI-O3 P↑ | Gemini-2.5-Pro C↑ | Gemini-2.5-Pro P↑ | GABAR C↑ | GABAR P↑ |
|---|---|---|---|---|---|---|---|
| Blocks | E | 73 | 1.03 | 81 | 1.1 | 100 | 1.5 |
| | M | 41 | 0.95 | 47 | 0.86 | 100 | 1.6 |
| | H | 4 | 0.61 | 12 | 0.81 | 100 | 1.7 |
| Miconic | E | 56 | 0.81 | 79 | 0.86 | 100 | 1.0 |
| | M | 12 | 0.69 | 36 | 0.58 | 100 | 0.97 |
| | H | 0 | - | 12 | 0.51 | 100 | 0.95 |
| Spanner | E | 38 | 0.81 | 42 | 0.75 | 94 | 1.1 |
| | M | 13 | 0.77 | 10 | 0.64 | 93 | 0.99 |
| | H | 0 | - | 0 | - | 89 | 0.91 |
| Combined | E | 33.4 | 0.85 | 44.0 | 0.8 | 95.5 | 1.04 |
| | M | 11.6 | 0.77* | 17.1 | 0.68* | 92.2 | 1.01 |
| | H | 0.4 | 0.61* | 1.5 | 0.51* | 89.2 | 0.99 |
The table below shows that GABAR not only solves *more* problems (Coverage, C) but also finds solutions that are high quality (Plan Quality Ratio, P) and efficient (Plan Length, PL). Baselines that solve fewer problems often appear to have low plan lengths simply because they only solve the easiest instances.
| Domain | Diff | GPL C↑ | GPL P↑ | GPL PL↓ | ASNets C↑ | ASNets P↑ | ASNets PL↓ | GRAPL C↑ | GRAPL P↑ | GRAPL PL↓ | GABAR C↑ | GABAR P↑ | GABAR PL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blocks | E | 100 | 1.1 | 55 | 100 | 1.6 | 38 | 64 | 0.65 | 79 | 100 | 1.5 | 41 |
| | M | 45 | 0.68 | 137 | 100 | 1.5 | 79 | 48 | 0.44 | 214 | 100 | 1.6 | 76 |
| | H | 10 | 0.33 | 423 | 92 | 1.4 | 156 | 38 | 0.28 | 585 | 100 | 1.7 | 125 |
| Miconic | E | 97 | 0.97 | 167 | 100 | 1.0 | 162 | 68 | 0.56 | 252 | 100 | 1.0 | 160 |
| | M | 37 | 0.56 | 235 | 100 | 0.98 | 180 | 65 | 0.54 | 280 | 100 | 0.97 | 181 |
| | H | 19 | 0.29 | 448 | 90 | 0.92 | 209 | 60 | 0.49 | 329 | 100 | 0.95 | 202 |
| Spanner | E | 73 | 1.1 | 27 | 78 | 0.86 | 36 | 22 | 0.65 | 35 | 94 | 1.1 | 31 |
| | M | 42 | 0.56 | 61 | 60 | 0.69 | 55 | 5 | 0.55 | 50 | 93 | 0.99 | 44 |
| | H | 3 | 0.18 | 208 | 42 | 0.61 | 79 | - | - | - | 89 | 0.91 | 67 |
| Gripper | E | 100 | 1.0 | 82 | 78 | 0.98 | 76 | 26 | 0.95 | 61 | 100 | 1.1 | 78 |
| | M | 56 | 0.85 | 128 | 54 | 0.91 | 118 | 12 | 0.67 | 128 | 100 | 0.99 | 133 |
| | H | 21 | 0.74 | 139 | 42 | 0.88 | 131 | - | - | - | 100 | 0.96 | 156 |
| Visitall | E | 69 | 1.3 | 79 | 94 | 0.96 | 121 | 92 | 1.1 | 106 | 93 | 1.1 | 103 |
| | M | 15 | 0.76 | 188 | 86 | 0.93 | 194 | 88 | 1.0 | 181 | 91 | 1.0 | 179 |
| | H | - | - | - | 64 | 0.81 | 333 | 78 | 0.99 | 272 | 88 | 1.1 | 243 |
| Grid | E | 74 | 0.89 | 41 | 52 | 0.81 | 47 | 20 | 0.38 | 77 | 100 | 0.91 | 45 |
| | M | 17 | 0.61 | 66 | 45 | 0.66 | 73 | 3 | 0.28 | 143 | 97 | 0.85 | 63 |
| | H | - | - | - | 21 | 0.60 | 101 | - | - | - | 92 | 0.74 | 98 |
| Logistics | E | 56 | 0.61 | 117 | 39 | 0.71 | 101 | 32 | 0.81 | 88 | 90 | 0.75 | 127 |
| | M | 7 | 0.21 | 305 | 22 | 0.55 | 138 | 9 | 0.45 | 169 | 76 | 0.65 | 159 |
| | H | - | - | - | 4 | 0.39 | 217 | - | - | - | 71 | 0.59 | 232 |
| Rovers | E | 64 | 0.99 | 17 | 67 | 0.96 | 19 | 21 | 0.35 | 44 | 87 | 1.0 | 21 |
| | M | 9 | 0.32 | 78 | 56 | 0.87 | 36 | 5 | 0.19 | 125 | 82 | 0.96 | 33 |
| | H | - | - | - | 31 | 0.64 | 55 | - | - | - | 77 | 0.97 | 45 |
| Combined | E | 79.1 | 0.98 | 69 | 76.0 | 0.98 | 73 | 43.5 | 0.67 | 91 | 95.5 | 1.04 | 76 |
| | M | 28.5 | 0.56 | 148 | 65.4 | 0.88 | 111 | 29.3 | 0.51 | 172 | 92.2 | 1.01 | 108 |
| | H | 6.5 | 0.39* | 265 | 48.5 | 0.78 | 158 | 22.1 | 0.58* | 276 | 89.2 | 0.99 | 147 |
Our GABAR-ACT ablation removed the explicit action nodes from the graph. This caused a catastrophic performance drop, with coverage on hard problems falling from 89.2% to just 7.4%. This confirms that explicitly modeling action-object relationships is essential.
The GABAR-CD ablation selected all action parameters independently instead of sequentially. This failed to capture dependencies (e.g., in Logistics), and coverage on hard problems dropped from 89.2% to 60.0%.
The GABAR-RANK ablation used our graph but swapped the ranking objective for a traditional value-learning objective. Performance collapsed to just 12.1% on hard problems, indicating that learning to rank is the more robust and generalizable strategy.
In our GABAR-G ablation (results in paper appendix), removing the global node (which aggregates graph-wide information) caused coverage on hard problems to drop from 89.2% to 42.5%. The global node is critical for rapid information propagation in large graphs.
We generate data by solving small problems (e.g., 6-9 blocks, 5-15 balls) with an optimal planner. The first action of the optimal plan is used as the training label. Training takes 1-2 hours per domain on an RTX 3080.
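A hedged sketch of that pipeline, where `generate_problem` and `solve_optimal` are hypothetical wrappers around a PDDL problem generator and an optimal planner:

```python
import random

def generate_dataset(domain, num_problems, size_range):
    # Each training example pairs a state with the first action of an
    # optimal plan from that state. Since suffixes of an optimal plan
    # are themselves optimal, every state along one plan yields a label.
    examples = []
    for _ in range(num_problems):
        size = random.randint(*size_range)        # e.g. 6-9 blocks
        problem = generate_problem(domain, size)  # hypothetical helper
        plan = solve_optimal(domain, problem)     # small, so fast to solve
        state = problem.initial_state
        for action in plan:
            examples.append((state, action))      # (state, expert action)
            state = state.apply(action)
    return examples
```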
Node features are one-hot encodings for type (object, predicate, action) and specific schema/predicate. Edge features encode type (predicate-object, action-object) and argument/parameter position.
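As an illustration, a node feature vector could be assembled as follows (the vocabularies shown are example Blocksworld entries, not the paper's exact encoding order):

```python
import torch

def encode_node_features(node_kind, schema_or_pred, kinds, vocab):
    # Concatenate a type one-hot with a schema/predicate one-hot.
    type_oh = torch.zeros(len(kinds))
    type_oh[kinds.index(node_kind)] = 1.0
    id_oh = torch.zeros(len(vocab))
    if schema_or_pred is not None:        # object/global nodes have none
        id_oh[vocab.index(schema_or_pred)] = 1.0
    return torch.cat([type_oh, id_oh])

# Example: an action node for the 'stack' schema in Blocksworld.
kinds = ["object", "predicate", "action", "global"]
vocab = ["on", "clear", "holding", "pick-up", "put-down", "stack", "unstack"]
feat = encode_node_features("action", "stack", kinds, vocab)
```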
We use a hidden dimensionality of 64, 9 rounds of GNN message passing, and a batch size of 16. The decoder uses a beam width of 2. We use the Adam optimizer with a learning rate of 0.0005.
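Collected as a configuration sketch (field names are ours; the values are as reported):

```python
from dataclasses import dataclass

@dataclass
class GabarConfig:
    hidden_dim: int = 64          # node/edge embedding size
    num_rounds: int = 9           # L rounds of GNN message passing
    batch_size: int = 16
    beam_width: int = 2           # decoder beam search width k
    learning_rate: float = 5e-4   # Adam optimizer
```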
We generate between 3,348 (Blocks) and 6,880 (Grid) training examples (state-action pairs) per domain. All problems are generated using standard PDDL generators.
GABAR enables solving large, real-world-sized planning problems by executing a learned policy directly, bypassing the expensive search required by traditional planners.
Our results show that learning a simple, local action-ranking policy is a more tractable and scalable approach than the dominant paradigm of learning complex, global value functions.
This work demonstrates that combining the explicit relational structure of PDDL with the right GNN architecture can achieve sample efficiency and generalization that pure, data-hungry RL methods often cannot.
@inproceedings{mangannavargraph,
  title     = {Graph Neural Network Based Action Ranking for Planning},
  author    = {Mangannavar, Rajesh Devaraddi and Lee, Stefan and Fern, Alan and Tadepalli, Prasad},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}