[Intro to AI Alignment] 0. Overview and Foundations

Published on December 22, 2025 9:20 PM GMT

This post provides an overview of the sequence and covers background concepts that the later posts build on. If you're already familiar with AI alignment, you can likely skim or skip the foundations section.

0.1 What is this Sequence?

This sequence explains the difficulties of the alignment problem and our current approaches for attacking it. We mainly look at alignment approaches that we could actually implement if we develop AGI within the next 10 years, but most of the discussed problems and approaches are likely still relevant even if we get to AGI through a different ML paradigm.

Towards the end of the sequence, I also touch on how competently AI labs are addressing safety concerns and what political interventions would be useful.

0.1.1 Why am I writing this?

Because in my opinion, no adequate technical introduction exists, and having more people who understand the technical side of the current situation seems useful.

There are other introductions[1] that introduce problems and solution approaches, but I don't think they give readers the understanding needed to evaluate whether the solution approaches are adequate for solving the problems. Furthermore, the problems are often presented as disconnected pieces rather than as components of the underlying alignment problem.

Worse, even aside from introductions, there is rarely research that actually looks at how the full problem may be solved, rather than just addressing a subproblem or making progress on a particular approach.[2]

In this sequence, we are going to take a straight look at the alignment problem and learn about approaches that seem useful for solving it - including with help from AIs.

0.1.2 What this Sequence isn't

It's not an overview of what people in the field believe. We will focus on building technical understanding, not on learning in more detail what experts believe. I'm taking my own understanding and trying to communicate the relevant basics efficiently - not trying to convey how experts in general think about the problem.

It's not an overview of what people are working on, nor an exhaustive overview of alignment approaches. As we will discuss, I think our current approaches are not likely to scale to aligning fully superhumanly smart AIs. A lot of useful work is being done to make our current approaches scale a little further, which may be great for getting AIs that can do other useful work, but this sequence focuses on the bigger picture rather than discussing such work in much detail. Many speculative approaches may also not get mentioned or explained.

It's not necessarily introducing all the concepts that are common in the AI alignment discourse. In particular, we're going to skip "outer and inner alignment" and go straight to looking at the problem in a more detailed and productive way.

0.1.3 Who is this for?

Any human or AI who wants to technically understand the AI alignment problem. E.g.:

- Aspiring alignment researchers who want to better understand the alignment problem. (Parts may also be useful for existing researchers.)
- Scientists and technically-minded people who want enough understanding to evaluate the AI alignment situation themselves.
- People working in AI governance who want to ground their policy thinking in a technical understanding of how much we seem to be on track to make future AIs nice.

0.1.4 Who am I?

I am an AI alignment researcher who has worked on alignment for 3.5 years; more in this footnote[3].

0.2 Summary of The Whole Series

Here are summaries of the posts written so far [although as of now they are not yet published]. This section will be updated as I publish more posts:

Post 1: Goal-Directed Reasoning and Why It Matters. Why would an AI "want" anything? To solve novel problems, a mind must search for plans, predict outcomes, and evaluate whether those outcomes achieve what it wants. This "thinking loop" maps onto model-based reinforcement learning, where a model predicts outcomes and a critic evaluates them (a minimal code sketch of this loop follows these summaries). We'll use model-based RL as an important lens for analyzing alignment - not because AGI will necessarily use this architecture, but because analogous structure appears in any very capable AI, and model-based RL provides a cleaner frame for examining difficulties. The post also argues that for most value functions, keeping humans alive isn't optimal, and that we need to figure out how to point an AI's values.

Post 2: What Values May an AI Learn? — 4 Key Problems. We cannot test AI safety in the domain where it matters most: conditions where the AI could take over. So we need to predict how values generalize across this distributional leap, and getting it wrong could be catastrophic. Using concrete examples, we analyze what a critic might learn and identify four key problems: (1) reward-prediction beats niceness, (2) niceness isn't as simple as it may intuitively seem to us, (3) learned values may be alien kludges, (4) niceness that scales to superintelligence requires something like CEV.

[Those two posts should get posted within the next 2 weeks, possibly tomorrow. After that it may take a while, but hopefully around 1 post per month on average.]
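To make the "thinking loop" from the Post 1 summary concrete before that post is out, here is a minimal, hypothetical sketch of one planning step in a model-based RL agent. The names (`WorldModel`, `Critic`, `propose_plans`) are placeholders rather than any particular system's API; the point is only the shape of the loop: search over candidate plans, predict each plan's outcome with a model, score the predicted outcomes with a critic, and act on the best-scoring plan.

```python
import random
from typing import Any, List


class WorldModel:
    """Hypothetical stand-in for a learned dynamics model: predicts a plan's outcome."""

    def predict(self, state: Any, plan: List[str]) -> Any:
        # A real system would roll the plan forward through a learned model;
        # here the "outcome" is just the state paired with the plan.
        return (state, tuple(plan))


class Critic:
    """Hypothetical stand-in for a learned value function: scores predicted outcomes."""

    def evaluate(self, outcome: Any) -> float:
        # A real critic encodes the agent's learned values; random scores for illustration.
        return random.random()


def propose_plans(state: Any, n: int = 8) -> List[List[str]]:
    """Stand-in for plan search; a real agent would search far more cleverly."""
    return [[f"action_{i}"] for i in range(n)]


def thinking_loop(state: Any, model: WorldModel, critic: Critic) -> List[str]:
    """One planning step: search plans, predict outcomes, evaluate them, pick the best."""
    candidates = propose_plans(state)
    scored = [(critic.evaluate(model.predict(state, plan)), plan) for plan in candidates]
    _best_score, best_plan = max(scored, key=lambda pair: pair[0])
    return best_plan


if __name__ == "__main__":
    print(thinking_loop("initial_state", WorldModel(), Critic()))
```

The alignment-relevant component is the critic: whatever outcomes it has learned to score highly are what the search steers the world toward, which is why Post 2 focuses on what values a critic might actually learn.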
0.3 Foundations

0.3.1 Orthogonality

The orthogonality thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.

In particular, being smart does not automatically cause an agent to have "better" values. An AI that optimizes for some alien goal won't just realize, once it becomes smarter, that it should instead fill the universe with happy, healthy, sentient people who live interesting lives.

If this point isn't already obvious to you, I recommend reading https://www.lesswrong.com/w/orthogonality-thesis
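As a toy illustration (not an argument for the thesis): in the planning-loop sketch above, the search machinery and the value function are separate components, so you can make the search more capable without changing what it optimizes for. The names and values below are made up for illustration.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

Plan = Tuple[str, ...]


def best_plan(actions: Iterable[str], horizon: int, value: Callable[[Plan], float]) -> Plan:
    """Brute-force planner: `horizon` sets capability, `value` sets the goal.
    The two are independent parameters, which is the orthogonality intuition."""
    return max(product(actions, repeat=horizon), key=value)


actions = ["help_humans", "make_paperclips", "do_nothing"]

# Two different value functions: one human-friendly, one alien.
def nice_value(plan: Plan) -> float:
    return plan.count("help_humans")

def paperclip_value(plan: Plan) -> float:
    return plan.count("make_paperclips")

# Making the planner more capable (a longer horizon) doesn't change which goal it pursues.
for horizon in (1, 3):
    print(horizon, best_plan(actions, horizon, nice_value),
          best_plan(actions, horizon, paperclip_value))
```

The thesis is about minds in general rather than toy planners, but the sketch shows the basic point: nothing about stronger search forces the value function to become nicer.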
