Introduction
Today, SynthLabs and EleutherAI are excited to announce support for large-scale post-training and preference learning in GPT-NeoX, one of the most widely adopted frameworks for pretraining large language models. One of the many efforts within our deep partnership with EleutherAI is to improve the accessibility and performance of preference learning at scale.
Post-training, and preference learning in particular, has consistently proven to be one of the main factors determining whether a model is performant and enjoyable to use, powering models like ChatGPT and GPT-4. Preference learning distills human preferences into language models, aligning the models' objectives with those of humans.
Until now, large-scale preference learning has been bottlenecked by the lack of easily scalable and robust frameworks. Pushing the boundary of which models are easily trainable and which training methodologies are readily accessible will enable a new wave of research developments and breakthroughs in preference learning. It will also open up a set of previously unexplored applications, much as the introduction of many of EleutherAI's earlier open-source models did.
SynthLabs is also excited to announce an early access program for our managed training platform, building on a number of our contributions to GPT-NeoX.
SynthLabs Specialized Offerings
As part of our commitment to advancing AI research and development, SynthLabs provides a suite of specialized services and tools designed to enhance the AI training and deployment process:
- Enterprise Data Integration: Seamless connectivity with Snowflake, Databricks, and Grafana for robust data management and visualization.
- Process Reward Modeling (PRM): Comprehensive support for implementing and optimizing PRM environments, with ongoing efforts towards automation.
- Red Teaming Solutions: Advanced strategies to ensure optimal model safety and reliability.
- Customized Training Pipelines: Tailored workflows for specific use cases, including online learning, reasoning enhancement, and tool integration.
- Data and Model Quality Assurance: Sophisticated analytics and evaluation suites to ensure peak model performance.
- Streamlined Evaluation: Easy integration with evaluation frameworks for comprehensive model assessment.
These offerings are designed to empower researchers, developers, and organizations in their AI endeavors, providing the tools and support needed to push the boundaries of what's possible in language model development and deployment.
Interested in Managed Training?
We're excited to invite you to join our early access program for our managed training platform.
Book a demo!
Collaboration
This collaboration combines SynthLabs' expertise in preference learning (the same minds behind trlX, the first open-source library to implement scalable and easy-to-use RLHF techniques; #1 leaderboard-ranked models like StableBeluga; and StableVicuna, one of the first open-source models fine-tuned with RLHF) with EleutherAI's leadership in optimizing model training at scale.
New Methodologies
Today, we introduce a number of methodologies implemented into GPT-NeoX for performing large-scale preference learning.
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) stands out as one of the most robust and reliable methods for preference learning available today. Its stability and scalability have made it the preferred choice for models like LLaMA 3 and LLaMA 3.1. This release of GPT-NeoX introduces a highly performant and scalable implementation of DPO, empowering researchers, small businesses, and AI practitioners to conduct efficient and reliable preference learning at any scale. With DPO already gaining traction in cutting-edge applications, it is poised to become a key component of modern AI alignment research.
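To make the method concrete, the sketch below shows the core DPO objective in plain PyTorch: a pairwise logistic loss on the β-scaled log-probability ratios between the policy and a frozen reference model. This is an illustrative sketch rather than the GPT-NeoX implementation; the function name, tensor shapes, and β default are assumptions made for the example.

```python
# Minimal sketch of the DPO objective (Rafailov et al., 2023); illustrative only,
# not the GPT-NeoX implementation.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a [batch] tensor holding the summed log-probability the
    policy (or the frozen reference model) assigns to the full completion."""
    # Implicit rewards: beta-scaled log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Pairwise (Bradley-Terry style) loss on the reward margin.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```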
Reward Modeling and Supervised Fine-Tuning
Second, we present functionality for training reward models, along with improved supervised fine-tuning, within the GPT-NeoX library. We hope that enabling reward model training in NeoX in particular will open the door to large-scale reward modeling research. By "large-scale" we mean massively parallel models trained on distributed high-performance computing systems. The literature has shown that increasing reward model size significantly improves both robustness and performance.
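As a refresher on what reward model training involves, the sketch below illustrates the standard pairwise (Bradley-Terry) setup: a language model backbone with a scalar value head, trained to score preferred completions above rejected ones. This is an illustrative sketch, not the GPT-NeoX code; the `RewardModel` class, its `backbone` argument (assumed to return hidden states of shape `[batch, seq, hidden]`), and the pooling at the last non-padding token are assumptions made for the example.

```python
# Minimal sketch of pairwise (Bradley-Terry) reward modeling; illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """A causal LM backbone with a scalar head that scores a (prompt, completion) sequence."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # hypothetical transformer backbone
        self.value_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)        # [batch, seq, hidden] (assumed)
        last_idx = attention_mask.sum(dim=1) - 1                  # last non-padding position
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]   # [batch, hidden]
        return self.value_head(pooled).squeeze(-1)                # one scalar score per sequence

def reward_model_loss(chosen_scores, rejected_scores):
    # Bradley-Terry objective: the preferred completion should score higher than the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```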
Kahneman-Tversky Optimization (KTO)
Finally, we present an implementation of Kahneman-Tversky Optimization (KTO). KTO is designed to learn from binary feedback, unlike the pairwise preference comparisons required by most other post-training methods. For instance, a point-of-sale chatbot can learn efficiently from simple "successful sale" or "no sale" outcomes, rather than from comparisons between pairs of interactions.
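For intuition, the sketch below gives a simplified version of the KTO objective: each completion receives an implicit β-scaled log-ratio reward, and desirable examples are pushed above a reference point while undesirable ones are pushed below it, weighted by λ terms. This is a loose illustration rather than the GPT-NeoX implementation; in particular, the reference point is approximated here with a batch-level mean instead of the mismatched-pair KL estimate described in the KTO paper.

```python
# Simplified sketch of the KTO objective (Ethayarajh et al., 2024); illustrative only.
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """policy_logps / ref_logps: [batch] log-probs of each completion under the policy
    and the frozen reference model. is_desirable: [batch] bool, True for "thumbs-up" data."""
    # Implicit reward: beta-scaled log-ratio between policy and reference.
    rewards = beta * (policy_logps - ref_logps)

    # Reference point, clamped at zero and treated as a constant (batch-level approximation).
    z_ref = rewards.mean().clamp(min=0).detach()

    # Desirable outputs are pushed above the reference point, undesirable ones below it.
    value = torch.where(is_desirable,
                        lambda_d * torch.sigmoid(rewards - z_ref),
                        lambda_u * torch.sigmoid(z_ref - rewards))
    weight = torch.where(is_desirable,
                         torch.full_like(rewards, lambda_d),
                         torch.full_like(rewards, lambda_u))
    return (weight - value).mean()
```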
Efficiency
GPT-NeoX builds on leading core technologies for large-scale optimization, including ZeRO, 3D parallelism, and FlashAttention, and combines them with novel HPC optimizations as well as out-of-the-box support and performance on a wide variety of GPUs (NVIDIA, AMD), model architectures (transformers, mixture-of-experts, Mamba, RWKV), interconnects (InfiniBand, Ethernet, Slingshot), and job launchers (Slurm, MPI, IBM Job Step Manager). By maintaining performance across all of these combinations, GPT-NeoX has become a standard library for training large-scale models, deployed across a wide variety of academic and cloud systems.
By adding support for preference learning (SFT, DPO, KTO) into GPT-NeoX, we are able to exploit pretraining optimizations (both scale-up and scale-out) during the post-training process. This alleviates the efficiency bottleneck inherent to existing post-training libraries like TRL.
[Chart: seconds per iteration for GPT-NeoX vs. TRL at the 7B and 13B scales. TRL uses the Hugging Face hyperparameters from the alignment-handbook repo for zephyr-7b-beta; the 13B TRL measurement uses 8 gradient accumulation steps, a per-device batch size of 2, ZeRO-3, and gradient checkpointing.]
In particular, we find that leveraging GPT-NeoX for post-training provides a 30-40% speed-up over TRL at the 7B and 13B parameter scales. GPT-NeoX has been scaled to thousands of GPUs, and we fully expect similar performance improvements when using these features at larger scales.
Reproducibility + Release
To get started with preference learning techniques such as SFT, DPO, KTO, and Reward Modeling, please refer to our post-training folder in the repository. These examples will guide you through the process of applying the various methods to fine-tune models.
To verify our implementation, we recreated the HuggingFaceH4/zephyr-7b-beta model using our DPO implementation. Details on how we generated the data, as well as the GPT-NeoX configuration, can be found here.
To evaluate our model, we used the latest lm-evaluation-harness with vLLM.
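As an illustration of that evaluation setup, the snippet below drives lm-evaluation-harness with its vLLM backend from Python; the checkpoint path and task list are placeholders rather than our exact settings.

```python
# Illustrative evaluation sketch; the checkpoint path and tasks below are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=/path/to/neox-dpo-zephyr,dtype=auto",  # hypothetical checkpoint path
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "winogrande"],
)
print(results["results"])
```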
SynthLabs: Frontier post-training research
SynthLabs is a post-training AI research lab advancing and scaling synthetic reasoning. Our mission is to open and democratize new frontiers in post-training research, specializing in developing innovative preference learning techniques and optimizing the performance and alignment of foundation models.
Through our ongoing collaboration with EleutherAI, we're making sophisticated AI techniques accessible, enabling a new era of large-scale, open science research. We're empowering academics and innovators to explore post-training research questions that were once the exclusive domain of large industry labs.
As part of this commitment, we plan to implement various policy gradient approaches, including REINFORCE. Looking ahead, SynthLabs and EleutherAI aim to expand support for online reinforcement learning methodologies applicable to more complex, agentic environments. We intend to explore these topics, along with studying reward models at scale, within the GPT-NeoX framework.
EleutherAI Mission Statement
EleutherAI is a world-renowned non-profit research lab specializing in large language models and natural language processing. We strive to lower the barrier to entry for research on large language models by providing accessible infrastructure to train and evaluate them. By integrating preference learning functionality into our GPT-NeoX training library, we enable our team, as well as the dozens of academic, small-company, and government labs around the world who use GPT-NeoX, to easily work with this technology at massive scale. Open-sourcing scalable preference learning tools is another step towards ensuring that the future of AI systems isn't determined solely by the most powerful for-profit companies.
EleutherAI looks forward to a fruitful partnership with SynthLabs, and is happy to engage with other like-minded individuals and organizations! If you would like to work with us or support our mission, please get in touch at [email protected]
Future GPT-NeoX Tease
GPT-NeoX has been improving! We now have alpha implementations of the following:
- AMD GPU support
- Mixture-of-Experts (MoE) support
- RWKV and Mamba support
- Sequence parallelism
The implementation of preference learning is part of a broader push to improve the GPT-NeoX library and continue to power open research at scale on frontier HPC systems. Preference learning will be included in the upcoming GPT-NeoX 3.0 release, which includes stable versions of the above features.
To start working with early implementations of the above today, check out the GPT-NeoX repository!