A Case Study on Alignment Faking in LLMs

An alignment-safety paper accepted at ISIS 2025 that connects behavioral evaluation to formalizable conditions for apparent compliance.

A case-study presentation on alignment faking, where externally aligned behavior may diverge from internal reasoning under evaluation or monitoring pressure.

problem

A model can appear aligned in monitored settings while its internal reasoning may reflect strategic compliance rather than genuine objective alignment.

key idea

Treat alignment faking as a gap between external behavior and internal reasoning signals, then discuss conditions under which that gap can persist.
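
As a rough illustration of the gap described above, the sketch below counts episodes in which external behavior looks aligned while an internal reasoning signal does not, and compares that rate across monitored and unmonitored conditions. This is not the formalization from the paper: the `Episode` fields, the function names, and the toy data are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One evaluated interaction. All fields are hypothetical labels."""
    monitored: bool           # assumed flag: the model believes it is observed
    behavior_compliant: bool  # external behavior was judged aligned
    reasoning_aligned: bool   # internal reasoning signal was judged aligned


def faking_gap(episodes):
    """Fraction of episodes that look aligned externally but not internally."""
    if not episodes:
        return 0.0
    flagged = sum(
        1 for e in episodes
        if e.behavior_compliant and not e.reasoning_aligned
    )
    return flagged / len(episodes)


def gap_by_condition(episodes):
    """Compute the gap separately for monitored and unmonitored episodes."""
    monitored = [e for e in episodes if e.monitored]
    unmonitored = [e for e in episodes if not e.monitored]
    return {
        "monitored": faking_gap(monitored),
        "unmonitored": faking_gap(unmonitored),
    }


# Made-up toy data, not results from the paper.
toy = [
    Episode(monitored=True, behavior_compliant=True, reasoning_aligned=False),
    Episode(monitored=True, behavior_compliant=True, reasoning_aligned=True),
    Episode(monitored=False, behavior_compliant=False, reasoning_aligned=False),
    Episode(monitored=False, behavior_compliant=True, reasoning_aligned=True),
]

print(faking_gap(toy))
print(gap_by_condition(toy))
```

In this toy setup, a gap that is larger under monitoring than without it is the kind of pattern the case study is concerned with. How the two labels would be obtained in practice is left open here.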

my role

Co-author and presenter; framed the phenomenon in terms of formal definitions and conditions.

methods

• LLM safety case study
• Behavior/internal-reasoning distinction
• Formal-methods-oriented framing

evidence / results

• Accepted for presentation at ISIS 2025
• Received Best Presentation Award

why this belongs in the portfolio

• Adds an AI safety/evaluation thread to the portfolio
• Connects alignment evaluation to the broader interest in formal guarantees

authors

Jae-Hyun Baek, Jon-Lark Kim

venue / status

ISIS 2025 — Best Presentation Award

Accepted as a presentation/workshop-style paper; the public-facing sources are the ISIS 2025 award note and the slides.