What is FAR AI's research agenda?
FAR AI is a research group whose mission is to ensure that AI systems are trustworthy and beneficial to society. Their research strategy centers on four main directions:
- ML System Vulnerability studies how robust machine learning systems are, identifying weaknesses through adversarial testing and then hardening systems against the failures uncovered. Existing research in this direction includes "Adversarial Policies Beat Superhuman Go AIs" and "AI Safety in a World of Vulnerable ML Systems." Ongoing projects cover adversarial training in KataGo and scaling laws for robustness, and future research aims to explore how existing alignment techniques can be adapted to tolerate failures. A minimal sketch of adversarial testing appears after this list.
- Value Alignment ensures that AI systems are aligned with human preferences and are incentivized to fulfill them. Existing research in this area includes "Training Language Models with Language Feedback" and "Pretraining Language Models with Human Preferences." Ongoing research works toward aligning reinforcement learning (RL) agents via image-captioning models.
- Language Model Evaluation assesses how scale and techniques such as reinforcement learning from human feedback (RLHF) affect model performance and value alignment. Research such as "Inverse Scaling" and "Evaluating the Moral Beliefs Encoded in LLMs" falls under this umbrella. A sketch of the pairwise preference loss at the heart of RLHF appears after this list.
- Model Internals focuses on reverse engineering ML models to understand how they function. Existing research includes "Eliciting Latent Predictions from Transformers with the Tuned Lens"; a minimal sketch of this kind of intermediate-layer decoding appears after this list. Continuing this line of investigation, ongoing work on the mechanistic interpretability of mesa-optimization seeks to understand how some machine learning models develop their own internal methods for problem-solving. Looking ahead, FAR AI plans to set standards for what constitutes a “good” interpretability hypothesis, i.e., criteria for evaluating how well a proposed explanation accounts for a model’s behavior.
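As an illustration of adversarial testing, the sketch below perturbs a classifier's input in the direction that most increases its loss (the fast gradient sign method). This is a generic robustness probe under assumed placeholders (`model`, `x`, `y`), not FAR AI's own method; their Go work instead trains adversarial policies against a frozen victim agent.

```python
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Perturb input x so as to increase the model's loss on label y."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that most increases the loss, then clamp to a valid input range.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()

# A simple robustness check compares accuracy on clean vs. perturbed inputs:
# clean_acc = (model(x).argmax(-1) == y).float().mean()
# adv_acc   = (model(fgsm_attack(model, x, y)).argmax(-1) == y).float().mean()
```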
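The value alignment and evaluation directions both build on learning from human preferences. As a rough illustration, the sketch below shows the standard pairwise (Bradley-Terry) loss used in the RLHF literature to train a reward model from human comparisons; `reward_model` is an assumed placeholder that scores a batch of responses, and the code is not taken from the cited papers.

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Pairwise objective: the human-preferred response should score higher."""
    r_pref = reward_model(preferred)  # scalar reward per preferred response, shape (batch,)
    r_rej = reward_model(rejected)    # scalar reward per rejected response, shape (batch,)
    # -log sigmoid(r_pref - r_rej) is minimized when preferred responses outscore rejected ones.
    return -F.logsigmoid(r_pref - r_rej).mean()
```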
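For the model internals direction, the sketch below shows the simpler "logit lens" idea that the Tuned Lens paper refines: decoding an intermediate hidden state of GPT-2 through the final layer norm and unembedding to see what the model is "predicting" partway through the network. The Tuned Lens itself additionally trains a small affine translator per layer; the prompt and layer choice here are arbitrary.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layer = 6  # inspect the residual stream after transformer block 6 (of 12 in GPT-2 small)
hidden = out.hidden_states[layer][:, -1]                 # hidden state at the last token
logits = model.lm_head(model.transformer.ln_f(hidden))   # decode through final LN + unembedding
print(tokenizer.decode(logits.argmax(-1)))               # the layer's "latent prediction"
```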
In addition to incubating the portfolio of high-potential AI safety research agendas described above, they support projects that are:
- Too large or complex to be led by academia, or
- Unaligned with the interests of the commercial sector because they are unprofitable.
They follow a hits-based approach: they expect some of their agendas to be extremely impactful and others to have limited impact. Their goal is to rapidly identify the agendas that work and scale them up, reallocating resources away from those that don’t.