[P2]
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
Sangyeon Yoon*,
Wonje Jeung*,
Yoonjun Cho,
Dongjae Jeon,
and Albert No
TL;DR: We show that DPO fine-tuning can create a hard-to-audit jailbreak risk: just 10 harmless preference pairs, each favoring a helpful answer over a refusal, can broadly suppress refusal behavior on harmful prompts.
[P1]
BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs
Sangyeon Yoon,
Sunkyoung Kim,
Hyesoo Hong,
Wonje Jeung,
Yongil Kim,
Wooseok Seo,
Heuiyeen Yeen,
and Albert No
[C7]
Position: The Term “Machine Unlearning” Is Overused in LLMs
Sangyeon Yoon*,
Yeachan Jun*,
and Albert No
ICML, 2026
[C6]
DUSK: Do Not Unlearn Shared Knowledge
Wonje Jeung*,
Sangyeon Yoon*,
Hyesoo Hong*,
Soeun Kim,
Seungju Han,
Youngjae Yu,
and Albert No
[C5]
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures
Sangyeon Yoon,
Hyesoo Hong,
Wonje Jeung,
and Albert No
[C4]
A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models
Wonje Jeung*,
Sangyeon Yoon*,
Yoonjun Cho,
Dongjae Jeon,
Sangwoo Shin,
Hyesoo Hong,
and Albert No
[C3]
R-TOFU: Unlearning in Large Reasoning Models
Sangyeon Yoon,
Wonje Jeung,
and Albert No
[C2]
SEPS: A Separability Measure for Robust Unlearning in LLMs
Wonje Jeung*,
Sangyeon Yoon*,
and Albert No
[C1]
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Wonje Jeung,
Sangyeon Yoon,
Minsuk Kang,
and Albert No
[W1]
Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios
Sangyeon Yoon*,
Wonje Jeung*,
and Albert No