
NEW PAPER: Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning

September 24, 2025 · 15 min read

Direct link to paper: https://arxiv.org/abs/2509.13351 and https://arxiv.org/pdf/2509.13351

"Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning"

If you'll allow me, I'd like to revisit LLMs and the current "AI" movement from the basics. If you would like to skip straight to the paper and PDDL, feel free to skip past this portion.

Breaking Down PDDL-Instruct: A Beginner-Friendly Explanation

I'll explain this step by step, assuming you're smart and curious but new to the specifics of modern AI systems. We'll start from the basics and build up to what the paper is doing, using simple analogies where possible. No jargon overload—I'll define terms as we go.

Step 1: What Is AI Today, and Where Does This Fit?

Artificial Intelligence (AI) is basically computer systems that can perform tasks that usually require human intelligence, like recognizing patterns, making decisions, or generating text. Modern AI isn't some magical brain—it's powered by massive amounts of data and clever algorithms (step-by-step instructions for computers).

The paper focuses on a subset of AI called "planning." Planning means figuring out a sequence of actions to achieve a goal, while following strict rules. Think of it like solving a puzzle: You have a starting point (e.g., a messy room), a goal (a tidy room), and rules (e.g., you can't throw things through walls). Traditional AI has been good at this in controlled settings, like video games or robots, but newer AI (like chatbots) struggles because it's more "intuitive" than "logical."

This paper bridges that gap by teaching advanced chatbots to plan logically.

Step 2: What Are Large Language Models (LLMs)?

LLMs are the core tech here. Imagine a super-smart autocomplete system: You type a few words, and it predicts what comes next based on patterns it's seen in billions of books, websites, and conversations. Popular examples include ChatGPT, Claude, and Grok.

  • How do they work? LLMs are trained on huge datasets of text. They learn to guess the next word or sentence by spotting statistical patterns. For instance, if you say "The cat sat on the...", it might predict "mat" because that's common.

  • Key building block: Transformers. This is the architecture (design) that makes LLMs efficient. Think of a transformer like a giant attention machine—it scans text and weighs which parts are most important for prediction. (E.g., in "The cat sat on the mat," it pays "attention" to "cat" and "sat" to decide "mat" fits.) Transformers were invented in 2017 and revolutionized AI by handling long contexts without getting confused.

  • Strengths and weaknesses: LLMs are great at creative tasks (writing stories, answering questions) but bad at precise logic. They can "hallucinate" (make up wrong facts) or fail at step-by-step reasoning, like planning a route while avoiding obstacles.

The paper targets this weakness: Making LLMs better at structured planning.

Step 3: What Is Planning in AI, and Why "Symbolic"?

  • Basic planning: In everyday terms, planning is listing steps to a goal. In AI, it's formalized: Define the world as "states" (current situation), "actions" (things you can do), "preconditions" (what must be true to take an action), and "effects" (what changes after).

    • Analogy: In a board game like chess, planning means thinking ahead: "If I move my pawn (action), but only if the square is empty (precondition), the board changes (effect)."

  • Symbolic planning: "Symbolic" means using clear, logical symbols and rules instead of fuzzy guesses. It's like math equations vs. gut feelings. Traditional AI planners (software from the 1970s onward) excel here because they follow ironclad logic, but they're rigid and need everything predefined.

    • Why it matters: Real-world uses include robots stacking boxes, optimizing delivery routes, or scheduling flights. Errors here can be costly (e.g., a robot dropping a fragile item).

LLMs aren't naturally symbolic—they're probabilistic (guessing based on odds). The paper teaches them to think more like those old-school logical planners. Probabilistic behavior produces predictive, natural-sounding responses that align well with human speech and communication. Grafting this probabilistic model onto a more defined and logical structure could help enhance the accuracy, forecasting, and operational "cleanliness" of chat AI. After all, we humans are plenty "intuitive" and unstructured all on our own, without much help. Creating yet another untidy, unstructured, low-discipline, hallucinating chat creature hardly appeals to our general norms and ideals of progress, n'est-ce pas?

Maybe the best of both worlds, then? What lurks beneath is a logical, reasoned, rules-abiding regime and an (almost infinite) data set. What sits on top? A coquettish, ironic, humor-capable conversation bot that pretends to fall for our cryptic logic and irrational justifications, but never lets our charms or charisma get the best of it without raising substantial objections.

Step 4: What Is PDDL?

PDDL stands for Planning Domain Definition Language. It's a standard way to describe planning problems in code-like format, invented in the 1990s for AI research.

  • Break it down:

    • Domain: The rules of the world. E.g., In a "blocks world" puzzle (a classic example), domain rules might say: Blocks can be stacked, but only one on top at a time; your "hand" can pick up a block only if it's clear and your hand is empty.

    • Problem: A specific scenario using that domain. E.g., start with block A on B; the goal is to have B on A.

  • How it looks: It's text-based, like a simple programming language. Here's a tiny example (simplified):

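Here's a minimal sketch of what that looks like, using the classic Blocks World "stack" action. This is the common textbook encoding, so the predicate names may differ from the exact files used in the paper:

```pddl
(define (domain blocksworld)
  ; Things we can say about the world (the "vocabulary" of facts)
  (:predicates (on ?x ?y) (ontable ?x) (clear ?x) (handempty) (holding ?x))

  ; One action: put the block you are holding on top of another block
  (:action stack
    :parameters (?x ?y)
    :precondition (and (holding ?x) (clear ?y))
    :effect (and (not (holding ?x)) (not (clear ?y))
                 (on ?x ?y) (clear ?x) (handempty))))
```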

Translation: "To stack block X on Y, Y must be clear and you're holding X. After, X is on Y, X is clear, and your hand is empty."
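A matching problem file for the scenario from the bullet above (block A starts on B; the goal is B on A) can be sketched in the same textbook style:

```pddl
(define (problem swap-a-and-b)
  (:domain blocksworld)
  (:objects a b)
  ; Starting situation: A sits on B, B sits on the table, nothing is held
  (:init (on a b) (ontable b) (clear a) (handempty))
  ; Desired situation: B sits on A
  (:goal (on b a)))
```

A classical planner (or, after PDDL-Instruct, a tuned LLM) would then search for a sequence of actions like: unstack A from B, put A down, pick up B, stack B on A. (A full domain file would also define those pick-up, put-down, and unstack actions alongside stack.)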

  • Why use PDDL? It makes planning verifiable—tools can check if a plan follows rules exactly. Without it, plans might seem okay but break logic.

The paper uses PDDL as the "language" to test and train AI on planning.

Step 5: What Does the Paper Do? Introducing PDDL-Instruct

The authors noticed LLMs suck at PDDL-style planning (e.g., only 28% accurate on tests). So, they created PDDL-Instruct, a training method to make LLMs logical planners.

  • Core idea: "Instruction tuning" with "logical chain-of-thought."

    • Instruction tuning: Fine-tuning an LLM (tweaking it with specific examples) so it follows instructions better. Like teaching a student by giving homework focused on weak spots.

    • Chain-of-thought (CoT): A trick where you prompt the AI to think step by step. E.g., instead of asking "What's 2+2?" and taking the first answer, you have it spell out the intermediate reasoning before concluding. The paper makes this "logical"—each step explicitly checks the rules.

    • How they do it:

      1. Generate tons of example plans (good and bad) using PDDL problems.

      2. For each step, add explanations: "Is this action allowed? Why? What changes?" (A rough sketch of what these annotated steps can look like follows this list.)

      3. Use a tool (like a PDDL checker) to verify and label errors.

      4. Train the LLM on this data, so it learns to self-check: Generate a plan, reason logically, fix mistakes.

  • Results: After training, LLMs jump to 94% accuracy. They can now plan in domains like stacking blocks or elevator scheduling, and even adapt to new ones.

  • Analogy: Imagine training a kid to bake a cake. Normally, they might guess steps and burn it. With PDDL-Instruct, you make them checklist each: "Do I have ingredients? (Precondition) Mix them (Action). Now it's batter (Effect)." Over time, they bake perfectly without supervision.
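To make the "logical chain-of-thought" idea from step 2 concrete, here is a rough, hand-made sketch (not the paper's exact prompt or label format) of an annotated Blocks World plan, where every step is checked against preconditions and effects:

```pddl
; Initial state: (ontable a) (ontable b) (clear a) (clear b) (handempty)
; Goal: (on a b)

(pick-up a)
; Precondition check: (clear a)? yes. (ontable a)? yes. (handempty)? yes. -> applicable
; Effects: add (holding a); delete (ontable a), (clear a), (handempty)

(stack a b)
; Precondition check: (holding a)? yes. (clear b)? yes. -> applicable
; Effects: add (on a b), (clear a), (handempty); delete (holding a), (clear b)

; Goal check: (on a b) holds in the final state -> the plan is valid
```

A PDDL checker can verify exactly these kinds of claims, and that verified feedback is what the tuning data uses to teach the model to catch and fix its own mistakes.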

Why This Matters (Big Picture)

This blends "neural" AI (LLMs' pattern-matching) with "symbolic" AI (strict logic), making AI more reliable for real tasks like self-driving cars or medical scheduling. It's a step toward smarter, safer systems—but it assumes we define rules well (garbage in, garbage out).

Now back to the overall discussion: Overview of the Paper

The paper, titled Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning, was published on arXiv on September 13, 2025 (arXiv:2509.13351). It is authored by Pulkit Verma, Ngoc La, Anthony Favier, Swaroop Mishra, and Julie A. Shah, primarily from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). This work addresses a key limitation in large language models (LLMs): their struggle with structured, symbolic planning tasks, especially those formalized in languages like the Planning Domain Definition Language (PDDL). PDDL is a standard in AI planning for describing problems in terms of states, actions, preconditions, and effects—think of it as a way to encode puzzles like block-stacking or robot navigation into logical rules.

The core contribution is PDDL-Instruct, a novel instruction-tuning framework that teaches LLMs to perform symbolic planning by emphasizing logical chain-of-thought (CoT) reasoning. Unlike general CoT prompting (which encourages step-by-step thinking), this approach enforces precise, verifiable logic for planning steps, allowing models to self-correct and generate valid plans. The paper demonstrates dramatic improvements, turning LLMs from poor planners (e.g., 28% accuracy) into highly reliable ones (up to 94% accuracy) on benchmarks.

This research bridges neural (LLM-based) and symbolic (rule-based) AI, showing how fine-tuning can instill formal reasoning without replacing traditional planners.

Background and Motivation

LLMs excel at natural language tasks but falter on symbolic planning because:

  • Planning requires exact adherence to rules (e.g., "You can't pick up a block if your hand is full").

  • Errors compound in multi-step sequences, and LLMs often hallucinate invalid actions.

  • Prior approaches (e.g., few-shot prompting or basic fine-tuning) yield only marginal gains, as they don't teach verification of logical steps.

The authors motivate this by noting real-world applications: robotics, logistics, or game AI, where invalid plans waste resources. PDDL-Instruct flips the script by treating planning as a learnable skill via targeted instruction tuning, inspired by how humans break down problems logically.

Key Method: PDDL-Instruct Framework

PDDL-Instruct is a two-stage instruction-tuning pipeline that generates high-quality training data and fine-tunes LLMs (e.g., Llama-3-8B) to reason like symbolic planners. It decomposes planning into verifiable components, using external tools for accuracy. Here's the breakdown:

[Chart: PDDL-Instruct Stage I and Stage II decision flow]

This setup ensures the LLM doesn't just memorize patterns—it learns to reason logically and debug itself, much like a human solver consulting a rulebook.

Experiments and Results

The authors evaluate on standard International Planning Competition (IPC) benchmarks, focusing on plan validity (not just goal achievement, to stress logical precision). Baselines include vanilla LLMs, standard CoT, and prior fine-tuning methods.

Key findings:

  • Accuracy Gains: Tuned Llama-3-8B achieves 94% valid plans on average, a 66% absolute improvement over baselines (e.g., from 28% to 94% on Blocks World).

  • Domain-Specific Wins:

    • Blocks World: 1% → 64% (64x relative improvement).

    • Mystery BW (harder variant): Up to 94%.

    • Logistics/Elevators: 20-30% → 80-90%.

  • Ablations:

    • Logical CoT > General CoT (e.g., +40% on verification tasks).

    • Detailed feedback > Binary (+25% on multi-step plans).

    • Self-correction adds +15-20% by catching errors mid-plan.

  • Generalization: Models transfer to unseen PDDL domains, suggesting robust reasoning (not overfitting).

  • Efficiency: Fine-tuning takes ~1 GPU-day; inference is fast (no external solver needed post-tuning).

Qualitative analysis shows tuned models output cleaner PDDL code and explain edge cases (e.g., "Invariant broken: block can't levitate").

Implications and Limitations

This paper advances "neuro-symbolic" AI by making LLMs reliable for logic-heavy tasks, potentially enabling better agents in robotics or scheduling. It highlights that LLMs' "reasoning" gaps are often training artifacts—fixable with structured data and verification.

Limitations include:

  • Reliance on PDDL (not arbitrary domains).

  • Scalability: Tested on 8B models; larger ones might need more data.

  • No real-time interaction (e.g., with physical robots).

Overall, PDDL-Instruct is a "promising direction" for AI planning, as the authors put it, sparking buzz on social media for its practical leap in LLM capabilities. If you're into code, the paper mentions open-sourcing datasets—check the arXiv for links.

OK OK . . . Big whoop. This techno mumbo-gumbo-jumbo aside: so what? WHO CARES? Spill the tea and give me the juice! Taken in a vacuum, rightly enough, there is little context and nary a practical application to be found. Even so, the thrust of this publication is to highlight two fundamental truths:

(1) Revolutionary Ideas and life-changing events happen rarely (the BIG, Splashy, exciting developments); how we get there (the small, boring, 9-to-5 daily problems) is what we call History. Therefore, by virtue of frequency and proportion, the grand majority of what takes place consists of the small, boring, "history" portions over which we often gloss, chasing instead that instant dopamine hit of the big, bright, beautiful, & sexy (yes, I DO USE THE OXFORD COMMA). Therein, the small, mundane, and non-sexy is what happens daily; and that, now, is the part of history we can affect, improve, enhance, and influence.

(2) If ideas or movements can stem from a single person or group (albeit supported by industrial-scale dollars, or by country-sized populations calling for change and serving as the impetus for momentum), then perhaps Shakespeare had a way with words: "What's past is prologue." The future is yet to be written: BY YOU.

Having pontificated on the musings of a disgruntled AI writer, let's return to reality and have a look at forecasting or conceptually scaling forward in time and see where that takes us.

Scaling AI Planning Capabilities Over 50-100 Years: A Speculative Outlook

The paper on PDDL-Instruct represents a breakthrough in teaching large language models (LLMs) to handle symbolic planning—essentially, logical, rule-based decision-making for complex tasks like logistics, robotics, or resource allocation. Scaling this out 50-100 years assumes exponential progress: from today's fine-tuned models generating valid plans in controlled domains (e.g., 94% accuracy on benchmarks) to ubiquitous AI agents capable of autonomous, adaptive planning across real-world scenarios. Could these agents integrate with robotics, economies, and societies, evolving into "neuro-symbolic" systems that combine neural intuition with flawless logic?

This isn't science fiction; it's an extrapolation from current trends in AI productivity gains and automation. Over a century, such capabilities could reshape global structures, amplifying efficiency while exacerbating inequalities and disruptions. Below, I break down practical helps (benefits) and hurts (drawbacks) across key entities, drawing from economic models, expert analyses, and forward-looking discussions. Note: This is speculative, based on patterns like AI's projected addition of trillions to GDP by 2035, and outcomes depend on regulation, ethics, access, use, norms, culture, history, geopolitical dynamics, and, this author thinks most importantly, Y-O-U.

Impacts on Markets and Economies

AI planning could optimize supply chains, predict disruptions, and automate trading, leading to hyper-efficient global markets. However, it risks creating bubbles or monopolies.

[Chart: AI/PDDL planning impacts on markets and economies, scaled out 50-100 years]

Impacts on People and Groups

Individuals gain personalized tools, but groups face social fragmentation.

[Chart: AI/PDDL planning impacts on people and groups]

Impacts on Companies and Universities

Businesses and academia accelerate, but face existential shifts.

[Chart: AI/PDDL planning impacts on universities and companies]

Impacts on Countries and Geopolitics

Nations gain strategic edges, but power imbalances grow.

[Chart: AI/PDDL planning impacts on nations and geopolitics]

Squaring the Circle: Why Tech Revolutions Feel Sudden Despite Decades of Build-Up

So what's the big deal? Why do we only ever hear about these ground-breaking revelations and leaps in technology, people, and society moments before someone or some industry is about to massively benefit? Did we take this voyeuristic culture and bastardize human curiosity to monetize parasitic "attention eyeballs" in this dystopian Future-Backward / Forward-Regression economy? In a word: YES. But only if we allow it.

Some parallels for my gratification, and your edification: AI, like electricity, mechanization, probability theory leading to Turing machines, and computers, follows a pattern of slow, iterative progress punctuated by "aha" moments that catch the public off guard. This creates the contradiction: something brewing for 30-50-100 years suddenly feels like an overnight revolution, sparking freak-outs over AGI or similar milestones. Let's break it down historically, apply it to AI, and then get practical on how to pay more attention now to shape its direction.

The key insight? These leaps aren't truly sudden; they're the visible tip of exponential curves, amplified by media and adoption thresholds. By engaging early, we can guide the trajectory rather than react in panic.

Historical Parallels: Gradual Iterations Leading to "Sudden" Transformations

Humanity's big tech leaps often span generations, with quiet foundational work giving way to rapid scaling once key enablers align (e.g., infrastructure, affordability, or cultural shifts). Iterations compound over time, but public perception lags until a tipping point.

These patterns show revolutions are "long in the making" but hit critical mass when tech becomes user-friendly and scalable.

  • Exponential Growth: Progress feels linear early on (e.g., Moore's Law for computers), then hockey-sticks.

  • Media Amplification: Headlines focus on breakthroughs, not the grind (e.g., ignoring Turing's 1936 paper until personal computers arrived some four decades later!).

  • Human Bias: We discount gradual change until it disrupts daily life, leading to reactive complaints rather than proactive guidance.

AI fits this mold: Roots in 1950s Dartmouth Conference, iterations through neural nets (1980s), deep learning (2010s), and now LLMs. Yet, ChatGPT's 2022 release felt like a bolt from the blue, sparking AGI hysteria—despite 70+ years of groundwork. The difference? AI's self-improving nature accelerates iterations, compressing timelines compared to electricity's century-long rollout.

How to Pay More Attention Now and Influence AI's Direction

The good news: We're in AI's "iteration phase," where public input can still steer ethics, equity, and applications—before AGI-like milestones lock in paths. Unlike past revolutions, AI's openness (e.g., open-source models) and global connectivity let individuals shepherd it. Focus on education, participation, and advocacy to avoid future regrets. Here's a step-by-step guide, drawing from current efforts in 2025.

  1. Build Awareness Through Daily Habits (Stay Informed Without Overwhelm):

    • Curate feeds: Follow AI-focused accounts and newsletters (on social media, Substack, or any other publication; most are free), subscribe to MIT Tech Review, and read the AI Index reports. Experiment with tools weekly to internalize progress—don't just read, tinker.

    • Set "attention filters": Before consuming content, pause and ask, "Is this hype or substance?" This counters emotional reactions. Track timelines: Read histories like "The Brief History of Artificial Intelligence" to contextualize hype.

    • Mindset shift: View AI as a tool you guide, not a black box. Use prompting techniques to explore implications actively.

  2. Engage Actively in Governance and Ethics (Influence from the Ground Up):

    • Join communities: Participate in public forums like those from Ada Lovelace Institute or Montreal AI Ethics Institute, where citizens input on lab decisions. Advocate for frameworks like the EU AI Act or US guidelines (e.g., OMB M-25-21 on responsible AI).

    • Contribute technically: If skilled, work on open-source (e.g., Hugging Face ethics tools) or report biases. For non-techies, support human-centered policies via petitions (e.g., via EFF or Amnesty International).

    • Corporate/board level: Push for AI governance in workplaces—e.g., adopt "hourglass models" balancing innovation and ethics.

  3. Adopt Adaptive Practices (Shepherd Through Experimentation and Feedback):

    • Learn iteratively: Create personal AI workflows (e.g., chaining tools, reducing hallucinations via clear roles). This builds intuition, letting you spot risks early.

    • Focus on fundamentals: Amid hype, prioritize real problems over trends—use AI as a "mirror" for decisions.

    • Long-term: Advocate for inclusive policies (e.g., data rights, fairness) to ensure AI benefits all, not just elites.

By starting small—curating your info diet and engaging in one community—you shift from passive observer to active shaper. Awareness now prevents freak-outs later, turning AI's long arc into something we collectively direct. If we wait for announcements, we're reacting; influencing early makes us co-authors. AI agents and AI bots are all the rage. But please do not forget, the real agency = YOU.

Please use it wisely.

What do you think?


AI Chief
