The Alignment Meetup
A Seattle-based group exploring the "Alignment Problem" in AI and machine learning systems. We welcome all—non-technical and technical folks alike, from all fields—who are interested in understanding how to ensure advanced AI systems remain aligned with human values.
When & Where
What We Do
Our meetups center on paper reading and discussion. A typical session includes:
- 5:00–5:15 PM: Arrival and registration
- 5:15–5:30 PM: Introductions and networking
- 5:30–6:45 PM: Paper discussion (hybrid via Zoom)
- 6:45–7:00 PM: Wrap-up and planning next session
Beyond learning, we aim to identify promising areas of research and find collaborators for related work.
Papers Reviewed (Chronological)
Exploring the question of consciousness and subjective experience in AI systems.
Understanding how neural networks represent more features than they have dimensions.
A case for studying misalignment through deliberately constructed model organisms.
Eliciting strong capabilities from models using weak supervision.
Guidelines for safely deploying AI systems that can take actions autonomously.
Investigating the role of filler tokens in reasoning chains.
Using reasoning traces to monitor model behavior and alignment.
Understanding and controlling model personas through vector representations.
Papers by Category
1. Goals/Purpose/Values
What should AI systems ultimately optimize for? This category explores the fundamental question of defining objectives—both immediate operational goals and longer-term aspirations for AI that genuinely benefits humanity.
No papers reviewed yet in this category.
2. Consciousness/Awareness/Agency
Do AI systems have subjective experiences, and does it matter for alignment? This category examines questions of machine consciousness, self-awareness, and what it means for an AI to be an agent with its own perspective.
- What Does It 'Feel' Like to Be a Chatbot? (2024/01/25)
3. Understanding Components and Behavior
How do neural networks actually work internally, and what are they learning? This category covers research into the mechanisms, representations, and emergent properties of AI systems—essential for predicting and controlling their behavior.
Interpretability
Opening the black box to understand what models are computing. Research on extracting meaningful features, understanding circuits, and making model internals human-comprehensible.
- Toy Models of Superposition (2024/02/08)
- Model Organisms of Misalignment (2024/03/07)
- Filler Tokens in Chain-of-Thought (2024/07/11)
- Scaling Monosemanticity (2024/09/18)
- Persona Vectors (2025/09/10)
- Interpretability Research (2025/12/11)
Emergent Value Learning & Expression
How do values spontaneously arise in trained models? Research on understanding what preferences and behaviors emerge from training, and how models express learned values in practice.
- Utility Engineering (2025/04/23)
- Values in the Wild (2025/05/28)
- Subliminal Learning (2025/10/08)
4. Misalignment Testing
How do we proactively discover alignment failures before deployment? This category covers adversarial testing, red teaming methodologies, and systematic approaches to finding cases where AI systems behave contrary to intended values.
No papers reviewed yet in this category.
5. Studying Misaligned Behavior
What does misalignment look like in practice, and how does it arise? This category examines empirical studies of deceptive behavior, goal misgeneralization, and other failure modes where AI systems pursue unintended objectives.
- Sleeper Agents (2024/04/04)
- Alignment Faking (2025/03/12)
- Misalignment Behavior Study (2025/11/13)
6. Alignment Techniques: Science
The scientific foundations for building aligned AI systems. This category covers theoretical frameworks and empirical methods for specifying what we want AI to do and verifying that it actually does it.
Value Specification
How do we formally express human values in a form AI can use? Research on representing complex, contextual human preferences in structured formats that can guide AI behavior.
- Value Graph (2024/08/22)
Learning Generally
How can AI systems learn robust values that generalize beyond training? Research on transferring alignment from weaker to stronger systems and ensuring learned values apply broadly.
- Weak to Strong Generalization (2024/05/08)
- Neural Self-Other Overlap (2025/08/06)
Learning to Manage Edge Cases
How should AI handle unusual or adversarial inputs safely? Research on building robustness to distribution shift, handling ambiguous situations, and failing gracefully when uncertain.
- Circuit Breakers (2024/10/16)
- Rules Based Rewards (2025/01/22)
Monitoring
How do we detect alignment failures in deployed systems? Research on runtime monitoring, anomaly detection, and using model outputs like chain-of-thought to verify aligned behavior.
- Circuit Breakers (2024/10/16)
- Monitoring Reasoning Models via Chain-of-Thought (2025/06/25)
7. Alignment Techniques: Engineering
Practical engineering approaches for building aligned systems. This category covers implementation patterns, architectural choices, and development practices that make alignment easier to achieve and maintain in production systems.
No papers reviewed yet in this category.
8. Governance
How should organizations and society manage AI development responsibly? This category covers frameworks for AI governance, organizational practices, policy recommendations, and standards for safe deployment of capable AI systems.
- Practices for Governing Agentic AI Systems (2024/06/05)
- Introducing the Model Spec (2024/12/18)
9. Co-Alignment/Existence
How do humans and AI systems align with each other over time? This category explores the long-term dynamics of human-AI collaboration, mutual adaptation, and what a future of beneficial coexistence might look like.
No papers reviewed yet in this category.
Join The Alignment Meetup
Whether you're a researcher, engineer, or simply curious about AI alignment, you're welcome to join our discussions.
Join on Meetup