Run-time Steering Can Surpass Post-Training: Reasoning Task Performance
Published on August 10, 2025 11:52 PM GMT

**TL;DR:** Reasoning can correspond to a linear direction in language model activations when framed appropriately, for example within the memorisation-reasoning duality (Hong et al., 2025). This post presents initial results of steering language models at inference time. This could democratise access to reasoning-enhanced AI without the computation cost and time of expensive RLHF-style post-training.

## The Crux

Here's my central crux: this steering method actually works and enhances base models beyond their instruction-finetuned counterparts. By extracting reasoning directions from existing models and patching them into runtime activations, I achieved accuracy boosts over the instruction-tuned version of the same model, with performance nearly matching much stronger reasoning-finetuned models like DeepSeek R1.

My extension of this work proposes a radically different approach: if we can extract these reasoning directions from existing models and patch them into runtime activations, we might achieve reasoning performance comparable to expensive RLHF training at zero additional cost.

Think of it like discovering that "being good at maths" corresponds to a specific direction in the model's internal representation. Once we know this direction, we can nudge any model toward it during inference, essentially giving it better maths skills for free.

## Motivation

Current reasoning enhancement relies on expensive post-training procedures like instruction fine-tuning and RLHF. These processes involve:

- **Computational resources:** multi-stage fine-tuning requiring significant GPU clusters
- **Human annotation:** extensive datasets requiring skilled human labellers for preference ranking
- **Time investment:** weeks to months of iterative training and evaluation
- **Technical expertise:** specialised knowledge of RLHF pipelines, reward modelling, and PPO training
- **Access barriers:** limited to organisations with substantial ML infrastructure and expertise

## The Linear Steering Alternative

The research extends established work on linear representations in language models. The Linear Representation Hypothesis suggests that "high-level concepts are represented linearly as directions in some representation space", and Hong et al.'s recent findings demonstrate that "the reasoning-memorization interplay in language models is mediated by a single direction".

My methodology builds directly on these foundations:

1. Extract reasoning features by computing the difference of means between model internals when fed curated datasets of memorisation vs reasoning tasks (a minimal sketch appears after the background figures below)
2. Patch these vectors into runtime activations during inference
3. Measure performance gains against both the original instruction-tuned model and stronger reasoning-finetuned baselines

## Experimental Evidence

I tested this approach across three model types:

- Base models (Llama-3.2-1B, no instruction tuning)
- Instruction-tuned models (Llama-3.2-1B-Instruct, post-RLHF)
- Chain-of-thought capable models (DeepSeek-R1-Distill-Qwen-1.5B)

### Background: Linear Structure Validation

*Figure 1: PCA visualisation showing reasoning vs memorisation clustering across different training paradigms. Arrow-head: reasoning task cluster. Arrow-tail: memorisation tasks.*

PCA visualisations confirm the theoretical foundation: clear separation between "reasoning" and "memorisation" activations across all model types using the top two components.
The linear separation is clearly identifiable, though some reasoning tasks appear in the memorisation cluster, likely due to data leakage causing the model to rely on memory rather than reasoning.
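As a concrete illustration of the extraction step, here is a minimal sketch, assuming a Hugging Face Llama-3.2-1B checkpoint and two small hypothetical prompt lists (`reasoning_prompts`, `memorisation_prompts`) standing in for the curated datasets. It collects last-token hidden states per layer, forms the difference-of-means direction, and runs a Figure-1-style two-component PCA check; the actual extraction pipeline may differ in its details.

```python
# Hedged sketch: per-layer "reasoning minus memorisation" directions from
# last-token hidden states, plus a 2-component PCA sanity check.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.decomposition import PCA

model_name = "meta-llama/Llama-3.2-1B"   # or the Instruct / R1-distill variants
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder prompts; the real study used curated reasoning/memorisation datasets.
reasoning_prompts = ["If 3x + 2 = 11, what is x?",
                     "A train covers 60 miles in 1.5 hours. What is its speed?"]
memorisation_prompts = ["What is the capital of France?",
                        "Who wrote Pride and Prejudice?"]

@torch.no_grad()
def last_token_states(prompts):
    """Stack last-token hidden states: shape (n_layers + 1, n_prompts, d_model)."""
    per_prompt = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        per_prompt.append(torch.stack([h[0, -1] for h in out.hidden_states]))
    return torch.stack(per_prompt, dim=1)

reason_acts = last_token_states(reasoning_prompts)
memo_acts = last_token_states(memorisation_prompts)

# Step 1 of the methodology: difference of means, one direction per layer.
reasoning_dirs = reason_acts.mean(dim=1) - memo_acts.mean(dim=1)

# Figure-1-style check: do the two prompt sets separate in the top two PCs?
layer = 6                                  # illustrative choice
points = torch.cat([reason_acts[layer], memo_acts[layer]]).float().numpy()
proj = PCA(n_components=2).fit_transform(points)
print("reasoning cluster centre:", proj[: len(reasoning_prompts)].mean(axis=0))
print("memorisation cluster centre:", proj[len(reasoning_prompts):].mean(axis=0))
```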
*Figure 2: Detailed view showing the clear decision boundary for instruction-tuned models*

This validates that the linear structure we're exploiting isn't an artifact of specific training procedures: it appears to be a feature of how language models represent reasoning.
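To give the patching step (step 2) concrete shape before the results, here is a minimal sketch that reuses the `model`, `tok`, and `reasoning_dirs` objects from the snippet above. It temporarily adds a scaled copy of the unit-normalised direction to the output of one decoder layer during generation; the coefficient, normalisation, and prompt are illustrative choices rather than my exact settings.

```python
# Hedged sketch: patch the reasoning direction into the residual stream at one
# decoder layer while generating, then detach the hook so later runs are clean.
from contextlib import contextmanager

@contextmanager
def steering(layer: int, alpha: float):
    direction = reasoning_dirs[layer] / reasoning_dirs[layer].norm()

    def hook(module, inputs, output):
        # Llama decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer].register_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()

# Note: hidden_states index and decoder-layer index differ by the embedding layer;
# matching the convention used during extraction is left to the reader.
prompt = "A farmer has 17 sheep and buys 5 more. How many sheep are there now?"
ids = tok(prompt, return_tensors="pt")
with steering(layer=6, alpha=4.0):   # layer 6 was most effective in my runs; alpha is a guess
    out = model.generate(**ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```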
### Central Finding: Significant Performance Gains

*Figure 3: Reasoning accuracy improvement via zero-cost steering across model architectures and layers*

Applying extracted reasoning vectors to models achieved:

- an accuracy boost over the instruction-tuned version of the same model (steering at the most effective layer, layer 6);
- performance almost matching the stronger reasoning-finetuned R1 model (23%);
- no training cost.

The effectiveness varied by layer, with different patterns for base vs instruction-tuned models.
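The layer sweep itself can be reproduced with a small loop over decoder layers, reusing the `steering` context manager above. `eval_items` here is a hypothetical stand-in for the MMLU-Pro items used in the actual evaluation, and the scoring rule is deliberately crude.

```python
# Hedged sketch: steer at each decoder layer in turn and score a tiny eval set.
eval_items = [
    ("Q: What is 17 * 6? Answer with a number only.\nA:", "102"),
    ("Q: Is 91 a prime number? Answer yes or no.\nA:", "no"),
]  # placeholders; the real sweep used MMLU-Pro reasoning questions

def accuracy_at_layer(layer: int, alpha: float = 4.0) -> float:
    correct = 0
    for prompt, gold in eval_items:
        ids = tok(prompt, return_tensors="pt")
        with steering(layer=layer, alpha=alpha):
            out = model.generate(**ids, max_new_tokens=8, do_sample=False)
        answer = tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
        correct += gold in answer.lower()   # crude string match, for illustration only
    return correct / len(eval_items)

for layer in range(len(model.model.layers)):
    print(f"layer {layer:2d}: accuracy {accuracy_at_layer(layer):.2f}")
```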
### Secondary Finding: Model-Dependent Steering Patterns

*Figure 4: Cosine similarity between reasoning vectors and layer activations reveals training-dependent patterns*

Different model types show distinct "reasoning activation profiles" across layers. The cosine similarity analysis reveals how reasoning representations vary across models and training paradigms.

Practical implication: steering strategies should be customised. The most effective steering layer differs across models and changes with the post-training applied.

## Cost-Benefit Analysis

Let me be explicit about the resource implications compared to standard post-training enhancement.

*Table 1: Quantitative cost-benefit comparison showing method, cost, time, and performance, for Llama-3.2-1B*

| Method | Resources required | Time | Performance |
| --- | --- | --- | --- |
| Post-training enhancement | GPU clusters + human annotation + expertise | Weeks | 3% accuracy gain |
| Linear steering | Single inference run | Hours to days | Near R1-level performance |

This represents a dramatic resource reduction relative to traditional instruction fine-tuning, with no training cost, whilst achieving performance that nearly matches much stronger reasoning-specialised models. It is particularly useful when you have an easy way of preparing model-specific reasoning vectors (difference of means) and a set of curated datasets for producing them. There are overheads: generalisation tests on tasks close to your objective task (such as maths), or stronger tests if you want a generally better reasoning model, plus further experiments to gauge behavioural changes and alignment.

## Critical Limitations and Confounders

I need to be explicit about what this approach cannot do and what remains unvalidated.

Methodological limitations:

- **Systematic examination of behaviour change in the steered model:** a hypothesis to be tested is that steering harms some capabilities of the LLM that are observable in its behaviour (such as following a chat conversation style), probably related to its reasoning capabilities.
- **Model and task specificity:** the vectors are ultimately specific to both the source model and the particular reasoning tasks used for extraction.
- **Limited task generalisation:** I only tested on MMLU-Pro reasoning tasks; the vectors need validation across broader reasoning domains to establish generalisability.
- **Missing baseline controls:** I haven't performed essential baseline tests, including random vector patching, isolated reasoning activation patching, and memorisation activation patching.
- **Inclusion of the steering coefficient in the parameter sweep:** other researchers (MATS 8.0) reported that a slightly larger coefficient in their setting steered the model towards producing