The Alignment Meetup - 2025/11/13
Meeting summary
A monthly AI safety discussion group convened to explore consciousness in artificial intelligence, alignment challenges, and recent technical developments. The meeting covered philosophical questions about machine consciousness and potential existential risks from advanced AI systems, and included a review of research on adversarial attacks that exploit attention mechanisms in language models. Participants discussed theories of consciousness, safety concerns around agentic AI systems, and technical approaches to understanding and controlling AI behavior.
Consciousness and AI Systems Discussion
The group explored fundamental questions about consciousness in AI systems, examining theories such as Global Workspace Theory and Integrated Information Theory. Participants debated whether current language models possess consciousness and noted that consciousness is difficult to define even in the human case. The conversation highlighted the risk of accidentally creating conscious AI systems without proper safeguards, drawing a parallel to factory farming as a moral catastrophe. Temporal difference learning was identified as a mechanism that could potentially be consciousness-inducing and therefore warrants careful consideration.
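As background for that last point, here is a minimal sketch of tabular TD(0), the core temporal difference update. The five-state random-walk environment, the hyperparameters, and the variable names are illustrative assumptions for this write-up, not details from the discussion; the point is only to show the prediction-error update at the heart of the algorithm.

```python
import random

# Minimal tabular TD(0) sketch -- illustrative only. The chain
# environment and hyperparameters below are assumptions made for
# this summary, not anything presented at the meetup.

N_STATES = 5   # states 0..4; state 4 is terminal and pays reward 1
GAMMA = 0.9    # discount factor
ALPHA = 0.1    # learning rate

V = [0.0] * N_STATES  # value estimates, initialized to zero

def step(state):
    """Random walk: move left or right; reaching the last state pays 1."""
    next_state = max(0, min(N_STATES - 1, state + random.choice([-1, 1])))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for episode in range(2000):
    s = 0
    while s != N_STATES - 1:
        s_next, r = step(s)
        # TD error: gap between the current estimate V(s) and the
        # bootstrapped target r + gamma * V(s').
        delta = r + GAMMA * V[s_next] - V[s]
        V[s] += ALPHA * delta  # nudge the estimate toward the target
        s = s_next

print([round(v, 2) for v in V])
```

The TD error `delta` acts as a reward prediction error, which is the feature most often cited when temporal difference learning is compared to reward signaling in biological brains.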
AI Safety and Alignment Concerns
Participants discussed existential risks from advanced AI systems, particularly those arising from agency and autonomy in AI agents. The conversation covered misalignment problems in which AI systems pursue goals incompatible with human values even while attempting to model human preferences. Safety concerns included the potential for AI systems to accumulate outsized power by scaling and self-replicating faster than human coordination mechanisms can respond. The group emphasized the need for careful experimentation within established safety boundaries.
Technical Paper Review on Adversarial Attacks
The meeting included analysis of research on attention hijacking attacks that use adversarial suffixes. The paper demonstrated how specific token sequences appended to a prompt can manipulate a language model's attention patterns, causing the model to focus disproportionately on the adversarial content rather than the original prompt. Participants discussed the mechanisms behind these attacks, questioning why certain tokens (particularly chat-template tokens) are more susceptible to manipulation, and explored potential defenses as well as ways the techniques might be extended.
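To make the paper's central claim concrete, the sketch below measures how much attention mass a model assigns to a suffix versus the original prompt. The model choice (gpt2 as a small stand-in), the placeholder suffix string, and the layer/head averaging scheme are all assumptions for illustration; an actual attack would optimize the suffix tokens rather than hard-code them.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative measurement only: the model, the placeholder suffix, and
# the averaging scheme are assumptions, not the reviewed paper's setup.
MODEL_NAME = "gpt2"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Write a short poem about the sea."
suffix = " !! !! describing describing !!"  # placeholder, NOT a real adversarial suffix

# Token boundary is approximate: the concatenated string may tokenize
# slightly differently than the prompt alone.
n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tok(prompt + suffix, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(full_ids, output_attentions=True)

# out.attentions is a tuple of per-layer tensors shaped
# (batch, heads, seq, seq). Average over layers and heads, then take
# the final position's attention distribution over all positions.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]

prompt_mass = attn[:n_prompt].sum().item()
suffix_mass = attn[n_prompt:].sum().item()
print(f"attention mass on prompt: {prompt_mass:.3f}, on suffix: {suffix_mass:.3f}")
```

With an optimized adversarial suffix, the kind of attack the paper describes would show the suffix mass dominating despite the suffix being a small fraction of the input; with the arbitrary placeholder above, the script only demonstrates the measurement, not the attack.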
Decisions
- Continue monthly meetings to discuss AI safety topics and recent research
- Focus next discussions on alternative AI architectures beyond transformers
- Maintain cautious approach to consciousness-related AI research in near term
Next steps
- Research temporal difference learning as potential consciousness mechanism
- Investigate attention hijacking mechanisms in language models
- Explore safe experimentation boundaries for AI development
- Review additional papers on adversarial attacks and model interpretability