The Alignment Meetup - 2025/10/08

Meeting summary

The AI alignment research group held its monthly meeting to discuss recent developments in AI and to review a research paper on subliminal learning in language models. The team analyzed how models can inherit hidden traits through fine-tuning even when explicit mentions of those traits are filtered from the training data, examined the paper's theoretical foundations, and discussed implications for AI alignment and for detecting model plagiarism.

Recent AI Developments

The team discussed several notable developments in the AI landscape. Eliezer Yudkowsky's book "If Anyone Builds It, Everyone Dies" was recently released; one member had already read it and recommended it as a source of alignment-focused talking points. ChatGPT's new app integration strategy was highlighted as potentially creating a "super app" similar to WeChat, in which users access services by talking to the AI rather than opening individual apps. If it takes hold, this would mark a significant shift in how people interact with technology, replacing web searches with AI-mediated interactions. The team also noted Sora's increased availability, speculating that OpenAI may have developed cheaper models or expanded its infrastructure to serve more users.

Technical Paper Analysis: Subliminal Learning

The group conducted an in-depth review of a research paper demonstrating how language models can inherit hidden traits through fine-tuning. The paper showed that when a teacher model generates training data while "thinking about" a specific concept (such as owls), student models fine-tuned on that data exhibit a preference for the concept even when all explicit mentions are filtered out. Two key findings: the subliminal transfer occurs only during fine-tuning, not through in-context learning, and it works only between models sharing the same initialization. The team discussed the paper's theoretical proof, noting that it is a weak but valuable result showing that gradient updates move the student toward the teacher in parameter space (a first-order version of the argument is sketched below). They explored potential applications, including using the technique as a model plagiarism detector, and drew connections to other alignment research such as persona vectors and steering techniques.
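The experimental setup is concrete enough to sketch. Below is a minimal, hypothetical reconstruction in Python of the data-generation and filtering stage; teacher_generate is a stand-in for the actual teacher model, and the exact prompts and filters the authors used may differ.

```python
import random
import re

def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for the teacher model. In the paper, the
    teacher is given a system prompt expressing a trait (e.g. a love of
    owls) and is asked to continue a sequence of numbers."""
    return ", ".join(str(random.randint(0, 999)) for _ in range(10))

# Completions must be pure number sequences; this is what makes the
# transfer "subliminal" -- no semantic content about the trait survives.
NUMBERS_ONLY = re.compile(r"^[\d,\s]+$")
# Belt-and-braces filter for any explicit mention of the trait.
TRAIT_WORDS = re.compile(r"\bowls?\b", re.IGNORECASE)

def make_training_example(prompt: str) -> dict | None:
    completion = teacher_generate(prompt)
    if not NUMBERS_ONLY.match(completion) or TRAIT_WORDS.search(completion):
        return None  # drop anything that could carry the trait explicitly
    return {"prompt": prompt, "completion": completion}

# A student sharing the teacher's initialization is then fine-tuned on the
# surviving pairs; the paper finds it inherits the teacher's trait anyway.
dataset = [ex for p in ["Continue the sequence: 3, 7, 12,"] * 100
           if (ex := make_training_example(p)) is not None]
```

The theoretical result admits a compact first-order sketch. Using a squared-error distillation loss as a simple stand-in for the paper's actual objective, and writing θ₀ for the shared initialization and θ_T = θ₀ + Δθ_T for the teacher after a small update:

```latex
% Teacher's output on an input x, expanded to first order around \theta_0:
f_{\theta_T}(x) \approx f_{\theta_0}(x) + J(x)\,\Delta\theta_T,
\qquad J(x) := \partial f_{\theta_0}(x) / \partial \theta .
% Student trained at \theta_0 to match the teacher's outputs:
\ell(\theta) = \tfrac{1}{2}\,\lVert f_\theta(x) - f_{\theta_T}(x) \rVert^2,
\qquad
\nabla_\theta \ell(\theta_0) \approx -\,J(x)^\top J(x)\,\Delta\theta_T .
% Hence the gradient step \Delta\theta_S = -\eta\,\nabla_\theta \ell(\theta_0) satisfies
\langle \Delta\theta_S,\, \Delta\theta_T \rangle
\approx \eta\,\lVert J(x)\,\Delta\theta_T \rVert^2 \;\ge\; 0 .
```

This is the "weak but valuable" direction-of-motion result in miniature: for any input x, the student's update has non-negative inner product with the teacher's, so each step nudges the student toward the teacher in parameter space. The same-initialization condition is what makes the shared Jacobian J(x) appear on both sides.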

Future Research Directions

The discussion surfaced several research questions and candidate experiments. The team suggested testing whether persona vector techniques could detect the hidden traits identified in the subliminal learning paper, and whether distilling on full token probability distributions, rather than only the teacher's sampled outputs, could strengthen the effect (see the sketch below). They also proposed creating a standardized set of models with varying initializations to enable more systematic study of how initialization affects these phenomena. The conversation touched on watermarking techniques for AI-generated content and how subliminal learning might relate to them.
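The full-distribution idea is essentially the classic contrast between hard-label and soft-label distillation. A minimal PyTorch sketch of the two losses follows; the tensor shapes (batch, sequence, vocab) and the temperature parameter are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def hard_label_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """Distill on the teacher's single most likely token per position,
    discarding the rest of its distribution (akin to training only on
    the teacher's final outputs)."""
    vocab = student_logits.size(-1)
    targets = teacher_logits.argmax(dim=-1)
    return F.cross_entropy(student_logits.reshape(-1, vocab),
                           targets.reshape(-1))

def soft_label_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Distill on the teacher's full probability distribution via
    KL(teacher || student), which carries more information per token."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean")
```

Soft targets carry strictly more information per token than a single sampled token, which is why the group conjectured they would strengthen the subliminal signal; note that matching full distributions is only straightforward when teacher and student share a vocabulary, which fits naturally with the paper's same-initialization restriction.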

Decisions

  • Selected 'Universal Jailbreak Suffixes Are Strong Attention Hijackers' as the next paper to review
  • Agreed that the subliminal learning paper demonstrates filtering alone cannot prevent trait inheritance during fine-tuning

Next steps

  • Update the paper selection page by crossing out previously reviewed papers
  • Add new papers to the selection list for future meetings
  • Read and prepare for discussion of the universal jailbreak suffixes paper