Specification Gaming
10 pages tagged "Specification Gaming"
Can we constrain a goal-directed AI using specified rules?
Why might a maximizing AI cause bad outcomes?
What is the difference between inner and outer alignment?
What is Goodhart's law?
What are "true names" in the context of AI alignment?
What is imitation learning?
What is perverse instantiation?
What is reward hacking?
What is outer alignment?
But won't we just design AI to be helpful?