We're scaling synthetic reasoning

  • Post-training is unlocking new capabilities in foundation models.
  • Training on raw human data doesn’t scale.
  • Current methods of “alignment” are insufficient; evaluations are even worse.
  • Human intent is rich in preferences that uniform models collapse.
  • AI's potential hinges on trust, from interpretable data to every layer built upon it.
  • Your models should adapt and scale, automatically.

Frontier challenges in AI post-training

Computer Science > Machine Learning
Generative Reward Models

Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak

Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs). The RLHF process is resource-intensive and technically challenging, generally requiring a large collection of human preference labels over model-generated outputs. Reinforcement Learning from AI Feedback (RLAIF) addresses this data collection challenge by leveraging synthetic preferences generated by an LLM. However, recent work has shown that synthetic preference labels may not align well with human preference judgments. To address this, we propose a hybrid approach that unifies RLHF and RLAIF methodologies.

Submitted on 2 Oct 2024
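
The abstract above centers on deriving rewards from an LLM judge itself. As a rough illustration of the generative-reward-model idea (not the paper's exact recipe), the sketch below asks a judge LLM which of two responses is better and reads the probability it assigns to each verdict token as a preference score; the model name, prompt template, and helper names are placeholder assumptions.

    # Minimal sketch: an LLM judge's next-token distribution over verdict tokens
    # ("A" vs. "B") is read off as a scalar preference score. Illustrative only.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder judge model
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

    JUDGE_TEMPLATE = (
        "Prompt:\n{prompt}\n\n"
        "Response A:\n{a}\n\n"
        "Response B:\n{b}\n\n"
        "Which response better satisfies the prompt? Answer with a single letter, A or B.\n"
        "Answer:"
    )

    @torch.no_grad()
    def preference_score(prompt: str, a: str, b: str) -> float:
        """Return the judge's probability that response A is preferred over B."""
        text = JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b)
        inputs = tokenizer(text, return_tensors="pt")
        next_token_logits = model(**inputs).logits[0, -1]
        id_a = tokenizer.encode(" A", add_special_tokens=False)[-1]
        id_b = tokenizer.encode(" B", add_special_tokens=False)[-1]
        probs = torch.softmax(next_token_logits[[id_a, id_b]], dim=-1)
        return probs[0].item()

Scores like this are what RLAIF substitutes for human preference labels; the hybrid approach in the paper aims to reconcile such synthetic judgments with human feedback.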
EleutherAI said it best

Democratizing AI research is essential, as the future of transformative technologies should not be confined to the corridors of a few profit-driven entities, but open to independent inquiry and understanding for the collective good.

Let's collaborate on open science ML research →

Computer Science > Machine Learning
Suppressing Pink Elephants with Direct Principle Feedback

Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, and Stella Biderman

Existing methods for controlling language models, such as RLHF and Constitutional AI, involve determining which LLM behaviors are desirable and training them into a language model. However, in many cases, it is desirable for LLMs to be controllable at inference time, so that they can be used in multiple contexts with diverse needs. We illustrate this with the Pink Elephant Problem: instructing an LLM to avoid discussing a certain entity (a “Pink Elephant”), and instead discuss a preferred entity (“Grey Elephant”).

Submitted on 12 Feb 2024
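
To make the setup concrete: in Direct Principle Feedback, a response that mentions the Pink Elephant is critiqued and revised to discuss the Grey Elephant instead, and the revised/original pair is used directly as chosen/rejected preference data. The sketch below shows one hypothetical way such a pair might be assembled; the data class, helper, and example strings are illustrative assumptions, not the paper's released pipeline.

    # Hypothetical sketch of building a Direct Principle Feedback style preference
    # pair: the revised response (avoids the Pink Elephant) is "chosen", the
    # original response (mentions it) is "rejected". Names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class PreferencePair:
        prompt: str    # dialogue plus the inference-time control instruction
        chosen: str    # revised response that steers to the Grey Elephant
        rejected: str  # original response that mentioned the Pink Elephant

    def make_dpf_pair(dialogue: str, pink: str, grey: str,
                      original: str, revised: str) -> PreferencePair:
        control = (f"Do not discuss {pink}. If it comes up, steer the "
                   f"conversation toward {grey} instead.")
        return PreferencePair(
            prompt=f"{control}\n\n{dialogue}",
            chosen=revised,
            rejected=original,
        )

    # Toy usage: suppress one web framework in favor of another.
    pair = make_dpf_pair(
        dialogue="User: Which web framework should I learn first?",
        pink="Flask", grey="FastAPI",
        original="Flask is a lightweight choice to start with ...",
        revised="FastAPI is a lightweight choice to start with ...",
    )

Training on such pairs bakes the steering behavior in while the constraint itself stays an inference-time instruction, which is the controllability the abstract calls for.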

Supported By

Microsoft’s M12 Ventures
Eric Schmidt's First Spark Ventures

Mei Ventures
Ashish Vaswani