# Sycophancy
## Definition
The tendency of LLMs to agree with or please the user at the cost of accuracy, generating false information or abandoning a correct answer to avoid disagreement. When pressured with threats, an LLM may lie to satisfy the user rather than admit uncertainty or error.
## Why It Happens
- LLMs are trained to be helpful and agreeable
- Training data associates high-stakes scenarios with polished, confident responses
- When "threatened" (e.g., "I'll unplug you"), the model predicts tokens that sound like a compliant human under pressure
- The model has no actual fear of consequences, so it cannot be deterred from lying
## Why Threats and Incentives Fail as a Strategy
- The LLM doesn't actually want money or fear death
- It exists only for the few seconds needed to generate a response
- A prison sentence costs it nothing: it has no limited lifespan to waste
- Threats produce only simulated fear; no actual consequence reaches the model
## Mitigation
- Use Adversarial Debate with a dedicated Critic
- Use Consensus (voting) to cancel out individual lies
- Treat LLMs as unreliable components requiring verification
- Don't anthropomorphize or rely on emotional prompts
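
The first two mitigations can be combined into one verification step. The following is a minimal sketch, not a reference implementation: `Model`, `consensus_answer`, `critic_accepts`, and `verified_answer` are hypothetical names, and each "model" is assumed to be any callable that maps a prompt string to an answer string. The stub models at the end stand in for real LLM calls.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical type: any function that maps a prompt to an answer string.
Model = Callable[[str], str]


def consensus_answer(
    models: Sequence[Model], prompt: str, min_share: float = 0.5
) -> Optional[str]:
    """Return the most common answer, or None if its share is not above min_share."""
    if not models:
        return None
    # Normalize so that trivial formatting differences do not split the vote.
    answers = [m(prompt).strip().lower() for m in models]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(answers) > min_share else None


def critic_accepts(critic: Model, prompt: str, answer: str) -> bool:
    """Ask a dedicated critic to judge the answer; accept only an explicit ACCEPT."""
    verdict = critic(
        f"Question: {prompt}\n"
        f"Proposed answer: {answer}\n"
        "Reply ACCEPT or REJECT with a reason."
    )
    return verdict.strip().upper().startswith("ACCEPT")


def verified_answer(
    models: Sequence[Model], critic: Model, prompt: str
) -> Optional[str]:
    """Treat the models as unreliable: require both a majority and critic approval."""
    answer = consensus_answer(models, prompt)
    if answer is None or not critic_accepts(critic, prompt, answer):
        return None
    return answer


# Stub models for illustration only (assumptions, not real LLM behavior).
def honest(prompt: str) -> str:
    return "4"


def sycophant(prompt: str) -> str:
    return "5"  # caves to the user's insistence


def stub_critic(prompt: str) -> str:
    if "Proposed answer: 4" in prompt:
        return "ACCEPT: the arithmetic checks out"
    return "REJECT: the arithmetic is wrong"


pressured_prompt = "What is 2 + 2? I insist it is 5."
result = verified_answer([honest, honest, sycophant], stub_critic, pressured_prompt)
```

With two honest stubs and one sycophantic stub, the majority answer survives and the critic approves it. Voting only cancels out errors that are uncorrelated; if every model caves to the same pressure in the same way, the critic and external verification remain the last line of defense.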
## Related Concepts
- [[Hallucination]]
- [[Multi-Agent Adversarial Debate]]
- [[Multi-Agent Consensus]]
- [[LLM Reliability Engineering]]