Scaling Up Good Synthetic Reasoning
We introduce Generative Reward Models (GenRM), a novel approach to AI alignment that combines the strengths of human feedback and AI-generated feedback. Our research focuses on improving AI systems' ability to understand and adhere to human values and preferences across diverse contexts. By leveraging Chain-of-Thought (CoT) reasoning and innovative training techniques, GenRM aims to create more robust, generalizable, and ethically aligned AI systems.
The rapid advancement of language models (LMs) necessitates robust alignment with diverse user values. However, current preference optimization approaches often fail to capture the plurality of user opinions, instead reinforcing majority viewpoints and marginalizing minority perspectives. We introduce PERSONA, a reproducible test bed designed to evaluate and improve pluralistic alignment of LMs. We procedurally generate diverse user profiles from US census data, resulting in 1,586 synthetic personas with varied demographic and idiosyncratic attributes. We then generate a large-scale evaluation dataset containing 3,868 prompts and 317,200 feedback pairs obtained from our synthetic personas. Leveraging this dataset, we systematically evaluate LM capabilities in role-playing diverse users, verified through human judges, and establish both a benchmark, PERSONA Bench, for pluralistic alignment approaches and an extensive dataset for creating new and future benchmarks.
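The core of procedural persona generation is sampling profile attributes from population-level marginal distributions. The sketch below illustrates the idea; the attribute names, categories, and weights are illustrative placeholders, not the actual US-census-derived distributions used to build PERSONA.

```python
import random

def sample_persona(rng: random.Random) -> dict:
    """Procedurally sample one synthetic user profile.

    Every distribution here is a hypothetical stand-in: a real pipeline
    would draw demographic marginals from census tables and layer
    idiosyncratic attributes (hobbies, writing style, etc.) on top.
    """
    age_bands = ["18-29", "30-44", "45-64", "65+"]
    age_weights = [0.21, 0.26, 0.33, 0.20]          # illustrative marginals
    regions = ["Northeast", "Midwest", "South", "West"]
    region_weights = [0.17, 0.21, 0.38, 0.24]       # illustrative marginals
    return {
        "age": rng.choices(age_bands, weights=age_weights)[0],
        "region": rng.choices(regions, weights=region_weights)[0],
        "hobby": rng.choice(["gardening", "gaming", "hiking", "cooking"]),
    }
```

Passing an explicit `random.Random` seed makes each persona reproducible, which is what allows the resulting test bed to be rerun exactly.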
Existing methods for controlling language models, such as RLHF and Constitutional AI, involve determining which LLM behaviors are desirable and training them into a language model. However, in many cases, it is desirable for LLMs to be controllable at inference time, so that they can be used in multiple contexts with diverse needs. We illustrate this with the Pink Elephant Problem: instructing an LLM to avoid discussing a certain entity (a "Pink Elephant"), and instead discuss a preferred entity (a "Grey Elephant"). We apply a novel simplification of Constitutional AI, Direct Principle Feedback (DPF), which skips the ranking of responses and uses DPO directly on critiques and revisions. Our results show that after DPF fine-tuning on our synthetic Pink Elephants dataset, our 13B fine-tuned LLaMA 2 model significantly outperforms Llama-2-13B-Chat and a prompted baseline, and performs as well as GPT-4 on our curated test set assessing the Pink Elephant Problem.
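Concretely, DPF treats each critiqued original response as the rejected completion and its revision as the chosen completion, then applies the standard DPO objective to the pair. A minimal, framework-free sketch of that per-pair loss (sequence log-probabilities would come from the policy and a frozen reference model in practice):

```python
import math

def dpo_loss(pi_logp_w: float, pi_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    In Direct Principle Feedback, the 'winner' y_w is the revised response
    and the 'loser' y_l is the original response that drew a critique, so
    no separate ranking step is needed. Inputs are sequence log-probs under
    the trained policy (pi) and the frozen reference model (ref).
    """
    logits = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))
```

When the policy equals the reference, the logits are zero and the loss is log 2; the loss falls as the policy shifts probability mass from the critiqued response toward its revision.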
Our most recent 3 publications
2024-10-03
GenRM: Generative Reward Models for AI Alignment
Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak

2024-07-24
PERSONA: A Reproducible Testbed for Pluralistic Alignment
Louis Castricato, Nathan Lile, Rafael Rafailov, Jan-Philipp Fränken, Chelsea Finn

2024-02-12
Suppressing Pink Elephants with Direct Principle Feedback
Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman
We're always open to new collaborations and ideas. If you're interested in working with us or have any questions, please reach out!