What is the Center for AI Safety (CAIS)'s research agenda?
The Center for AI Safety (CAIS)[1] is a San Francisco-based non-profit directed by Dan Hendrycks that "focuses on mitigating high-consequence, societal-scale risks posed by AI". They pursue both technical and conceptual research alongside work on expanding and supporting the field of AI safety.
Their technical research focuses on improving the safety of existing AI systems and often involves building benchmarks and testing models against them. It includes work on:
- Robustness, for example their analysis of distribution shift, their evaluation of LLM rule-following, and their proposed data processing method for improving robustness.
- Transparency, where they have presented representation engineering (RepE), an emerging approach to reading and steering a model's internal representations (sketched after this list).
- Machine ethics, where their best-known work includes the ETHICS dataset and the MACHIAVELLI benchmark for evaluating language models.
- Anomaly detection, where they established a baseline for detecting out-of-distribution examples, and proposed outlier exposure (OE), a method that improves anomaly detectors by training them on an auxiliary dataset of known outliers (also sketched after this list).
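The core move in RepE's "reading" approach is to find a linear direction in a model's activation space that tracks a concept, using hidden states collected from contrastive prompt pairs. Below is a minimal sketch of that idea, assuming the reader has already extracted such hidden states; the function and variable names are illustrative stand-ins, not CAIS's actual code.

```python
# Minimal sketch of a RepE-style "reading vector", assuming hidden states
# have already been extracted for contrastive prompt pairs (e.g., honest
# vs. dishonest completions). Names here are hypothetical.
import numpy as np

def reading_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Take the top principal component of the differences between paired
    activations as a candidate direction for the concept."""
    diffs = pos_acts - neg_acts            # shape: (n_pairs, hidden_dim)
    diffs = diffs - diffs.mean(axis=0)     # center before PCA
    # First right singular vector = first principal component of the diffs
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]

def concept_score(activation: np.ndarray, direction: np.ndarray) -> float:
    """'Read' how strongly the concept is represented in a new activation
    by projecting it onto the direction."""
    return float(activation @ direction)
```

The same direction can, in principle, be added to or subtracted from activations to steer behavior, which is the "control" side of RepE.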
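For a sense of how outlier exposure works, here is a hedged sketch of its training objective in PyTorch: ordinary cross-entropy on in-distribution data, plus a term that pushes the model's predictions on known outliers toward the uniform distribution, so that low confidence flags anomalies at test time. The scoring function uses the maximum softmax probability, the baseline from their earlier out-of-distribution detection work. The model and data are assumed to be supplied by the reader; this illustrates the published objective rather than reproducing CAIS's implementation.

```python
# Sketch of the outlier-exposure (OE) training objective, assuming a
# PyTorch classifier `model` and in-distribution / outlier batches
# supplied by the reader. All names are illustrative.
import torch
import torch.nn.functional as F

def oe_loss(model, x_in, y_in, x_out, lam=0.5):
    """Cross-entropy on in-distribution data, plus a penalty that pushes
    predictions on known outliers toward the uniform distribution."""
    ce = F.cross_entropy(model(x_in), y_in)
    log_probs_out = F.log_softmax(model(x_out), dim=1)
    # Cross-entropy against the uniform distribution over classes:
    uniform_ce = -log_probs_out.mean(dim=1).mean()
    return ce + lam * uniform_ce

def anomaly_score(model, x):
    """Score inputs by negative maximum softmax probability (MSP):
    higher score = more anomalous."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
    return -probs.max(dim=1).values
```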
Their conceptual research has included:
- Surveys of the field: “Unsolved Problems in ML Safety” (2022), “X-Risk Analysis for AI Research” (2022), “An Overview of Catastrophic AI Risks” (2023), and “AI Deception: A Survey of Examples, Risks, and Potential Solutions” (2023)
Their field-building projects include:
- The May 2023 Statement on AI Risk, signed by many AI scientists and other notable figures
- The CAIS Compute Cluster, which offers compute for AI safety research
- Prize incentives for safety-relevant research, such as improving ML safety benchmarks, moral uncertainty detection by ML systems, and forecasting by ML systems
- An ML Safety course and scholarships for ML students doing safety-related research
[1] Not to be confused with Comprehensive AI Services, a conceptual model of artificial general intelligence proposed by Eric Drexler, which is also abbreviated CAIS.