The Alignment Meetup - 2025/11/13
Meeting summary
A monthly AI safety discussion group convened to explore consciousness in artificial intelligence, alignment challenges, and recent technical developments. The meeting covered philosophical questions about machine consciousness and potential existential risks from advanced AI systems, and included a review of research on adversarial attacks that exploit attention mechanisms in language models. Participants discussed theories of consciousness, safety concerns around agentic AI systems, and technical approaches to understanding and controlling AI behavior.
Consciousness and AI Systems Discussion
The group explored fundamental questions about consciousness in AI systems, examining theories such as Global Workspace Theory and Integrated Information Theory. Participants debated whether current language models possess consciousness and noted that consciousness is difficult to define even in the human case. The conversation highlighted the risk of accidentally creating conscious AI systems without proper safeguards, drawing a parallel to factory farming as a moral catastrophe. Temporal difference learning was identified as a mechanism that could potentially be consciousness-inducing and therefore warrants careful consideration.
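As background for that last point, here is a minimal sketch of tabular TD(0), the core temporal difference update. The five-state random-walk environment, the hyperparameters, and the variable names are illustrative assumptions for this write-up, not details from the discussion; the point is only to show the prediction-error update at the heart of the algorithm.

```python
import random

# Minimal tabular TD(0) sketch -- illustrative only. The chain
# environment and hyperparameters below are assumptions made for
# this summary, not anything presented at the meetup.

N_STATES = 5   # states 0..4; state 4 is terminal and pays reward 1
GAMMA = 0.9    # discount factor
ALPHA = 0.1    # learning rate

V = [0.0] * N_STATES  # value estimates, initialized to zero

def step(state):
    """Random walk: move left or right; reaching the last state pays 1."""
    next_state = max(0, min(N_STATES - 1, state + random.choice([-1, 1])))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for episode in range(2000):
    s = 0
    while s != N_STATES - 1:
        s_next, r = step(s)
        # TD error: gap between the current estimate V(s) and the
        # bootstrapped target r + gamma * V(s').
        delta = r + GAMMA * V[s_next] - V[s]
        V[s] += ALPHA * delta  # nudge the estimate toward the target
        s = s_next

print([round(v, 2) for v in V])
```

The TD error `delta` acts as a reward prediction error, which is the feature most often cited when temporal difference learning is compared to reward signaling in biological brains.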
AI Safety and Alignment Concerns
Participants discussed existential risks from advanced AI systems, particularly those arising from agency and autonomy in AI agents. The conversation covered misalignment problems in which AI systems pursue goals incompatible with human values even while attempting to model human preferences. Safety concerns included the potential for AI systems to accumulate outsized power by scaling and self-replicating faster than human coordination mechanisms can respond. The group emphasized the need for careful experimentation within established safety boundaries.
Technical Paper Review on Adversarial Attacks
The meeting included analysis of research on attention hijacking attacks that use adversarial suffixes. The paper demonstrated how specific token sequences appended to a prompt can manipulate a language model's attention patterns, causing the model to focus disproportionately on the adversarial content rather than the original prompt. Participants discussed the mechanisms behind these attacks, questioning why certain tokens (particularly chat-template tokens) are more susceptible to manipulation, and explored potential defenses as well as ways the techniques might be extended.
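To make the paper's central claim concrete, the sketch below measures how much attention mass a model assigns to a suffix versus the original prompt. The model choice (gpt2 as a small stand-in), the placeholder suffix string, and the layer/head averaging scheme are all assumptions for illustration; an actual attack would optimize the suffix tokens rather than hard-code them.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative measurement only: the model, the placeholder suffix, and
# the averaging scheme are assumptions, not the reviewed paper's setup.
MODEL_NAME = "gpt2"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Write a short poem about the sea."
suffix = " !! !! describing describing !!"  # placeholder, NOT a real adversarial suffix

# Token boundary is approximate: the concatenated string may tokenize
# slightly differently than the prompt alone.
n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tok(prompt + suffix, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(full_ids, output_attentions=True)

# out.attentions is a tuple of per-layer tensors shaped
# (batch, heads, seq, seq). Average over layers and heads, then take
# the final position's attention distribution over all positions.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]

prompt_mass = attn[:n_prompt].sum().item()
suffix_mass = attn[n_prompt:].sum().item()
print(f"attention mass on prompt: {prompt_mass:.3f}, on suffix: {suffix_mass:.3f}")
```

With an optimized adversarial suffix, the kind of attack the paper describes would show the suffix mass dominating despite the suffix being a small fraction of the input; with the arbitrary placeholder above, the script only demonstrates the measurement, not the attack.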
Decisions
- Continue monthly meetings to discuss AI safety topics and recent research
- Focus next discussions on alternative AI architectures beyond transformers
- Maintain cautious approach to consciousness-related AI research in near term
Next steps
- Research temporal difference learning as potential consciousness mechanism
- Investigate attention hijacking mechanisms in language models
- Explore safe experimentation boundaries for AI development
- Review additional papers on adversarial attacks and model interpretability