SynthLabs Research Hub

Scaling Up Good Synthetic Reasoning

Latest Research

GenRM: Generative Reward Models for AI Alignment

We introduce Generative Reward Models (GenRM), a novel approach to AI alignment that combines the strengths of human feedback and AI-generated feedback. Our research focuses on improving AI systems' ability to understand and adhere to human values and preferences across diverse contexts. By leveraging Chain-of-Thought (CoT) reasoning and innovative training techniques, GenRM aims to create more robust, generalizable, and ethically aligned AI systems.
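As a rough illustration of the generative reward modeling idea described above, the Python sketch below asks a language model to produce a chain-of-thought critique of two candidate responses and then parses out a preference. The prompt template, the generate_fn stand-in, and the verdict parsing are assumptions for illustration only, not the exact setup used in the paper.

```python
# Minimal sketch of a generative reward model judgment step.
# `generate_fn` is a stand-in for any chat/completion call; the prompt
# template and verdict parsing are illustrative, not the paper's setup.
from typing import Callable

JUDGE_TEMPLATE = """You are evaluating two candidate responses.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Think step by step about which response better follows the instructions
and reflects human preferences, then end with "Verdict: A" or "Verdict: B".
"""

def genrm_preference(question: str,
                     response_a: str,
                     response_b: str,
                     generate_fn: Callable[[str], str]) -> str:
    """Return 'A' or 'B' by letting the model reason (CoT) before judging."""
    prompt = JUDGE_TEMPLATE.format(question=question,
                                   response_a=response_a,
                                   response_b=response_b)
    critique = generate_fn(prompt)  # chain-of-thought judgment text
    verdict = critique.rsplit("Verdict:", 1)[-1].strip()[:1].upper()
    return verdict if verdict in ("A", "B") else "A"  # fall back on parse failure
```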

Latest Breakthroughs

Explore our most recent AI advancements and discoveries.


PERSONA: A Reproducible Testbed for Pluralistic Alignment

The rapid advancement of language models (LMs) necessitates robust alignment with diverse user values. However, current preference optimization approaches often fail to capture the plurality of user opinions, instead reinforcing majority viewpoints and marginalizing minority perspectives. We introduce PERSONA, a reproducible testbed designed to evaluate and improve pluralistic alignment of LMs. We procedurally generate diverse user profiles from US census data, resulting in 1,586 synthetic personas with varied demographic and idiosyncratic attributes. We then generate a large-scale evaluation dataset containing 3,868 prompts and 317,200 feedback pairs obtained from our synthetic personas. Leveraging this dataset, we systematically evaluate LM capabilities in role-playing diverse users, verified through human judges, and establish both a benchmark, PERSONA Bench, for pluralistic alignment approaches and an extensive dataset for creating new and future benchmarks.
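The persona-generation step can be pictured with a small sketch like the one below. The attribute pools, trait lists, and prompt wording here are placeholders for illustration, not the census-derived distributions or templates used in PERSONA.

```python
# Illustrative sketch of procedural persona generation in the spirit of PERSONA.
# Attribute pools and traits are placeholders, not the paper's census data.
import random

ATTRIBUTE_POOLS = {
    "age": list(range(18, 95)),
    "region": ["Northeast", "Midwest", "South", "West"],
    "education": ["High school", "Some college", "Bachelor's", "Graduate"],
    "political_lean": ["Liberal", "Moderate", "Conservative"],
}

IDIOSYNCRATIC = ["enjoys hiking", "avid reader", "amateur chef",
                 "volunteers locally", "collects vinyl records"]

def sample_persona(rng: random.Random) -> dict:
    """Sample one synthetic persona: demographic plus idiosyncratic traits."""
    persona = {key: rng.choice(pool) for key, pool in ATTRIBUTE_POOLS.items()}
    persona["quirks"] = rng.sample(IDIOSYNCRATIC, k=2)
    return persona

def persona_system_prompt(persona: dict) -> str:
    """Turn a persona dict into a role-playing instruction for an LM."""
    traits = ", ".join(f"{k}: {v}" for k, v in persona.items() if k != "quirks")
    quirks = "; ".join(persona["quirks"])
    return f"Answer as a person with these attributes ({traits}; {quirks})."

if __name__ == "__main__":
    rng = random.Random(0)
    print(persona_system_prompt(sample_persona(rng)))
```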

Suppressing Pink Elephants with Direct Principle Feedback

Existing methods for controlling language models, such as RLHF and Constitutional AI, involve determining which LLM behaviors are desirable and training them into a language model. However, in many cases it is desirable for LLMs to be controllable at inference time, so that they can be used in multiple contexts with diverse needs. We illustrate this with the Pink Elephant Problem: instructing an LLM to avoid discussing a certain entity (a "Pink Elephant") and instead discuss a preferred entity ("Grey Elephant"). We apply a novel simplification of Constitutional AI, Direct Principle Feedback (DPF), which skips the ranking of responses and uses DPO directly on critiques and revisions. Our results show that after DPF fine-tuning on our synthetic Pink Elephants dataset, our 13B fine-tuned LLaMA 2 model significantly outperforms Llama-2-13B-Chat and a prompted baseline, and performs as well as GPT-4 on our curated test set assessing the Pink Elephant Problem.
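Since Direct Principle Feedback drops the ranking step and optimizes directly on critique-and-revision pairs, the training data construction and objective can be sketched roughly as below. The field names and tensor arguments are illustrative, and the loss shown is the standard DPO formulation rather than the paper's exact training code.

```python
# Sketch of Direct Principle Feedback data construction: the revised response
# (which avoids the "Pink Elephant") is preferred over the original, and the
# pair is fed straight to DPO without any ranking step.
import torch
import torch.nn.functional as F

def dpf_pair(prompt: str, original: str, revision: str) -> dict:
    """One DPO training example: revision is 'chosen', original is 'rejected'."""
    return {"prompt": prompt, "chosen": revision, "rejected": original}

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over sequence log-probabilities."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```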

Join the team

Research Team

Rafael Mitkov Rafailov

Research Scientist

Alon Albalak

Research Scientist

Collaborators

EleutherAI · Stanford University

Recent Publications

Our three most recent publications

2024-10-03

GenRM: Generative Reward Models for AI Alignment

Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak

2024-07-24

PERSONA: A Reproducible Testbed for Pluralistic Alignment

Louis Castricato, Nathan Lile, Rafael Rafailov, Jan-Philipp Fränken, Chelsea Finn

2024-02-12

Suppressing Pink Elephants with Direct Principle Feedback

Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman


Interested in Collaboration?

We're always open to new collaborations and ideas. If you're interested in working with us or have any questions, please reach out!