PERSONA: A Reproducible Testbed for Pluralistic Alignment
The rapid advancement and adoption of language models (LMs) have highlighted critical challenges in aligning these models with the diverse values and preferences of global users. Existing reinforcement learning from human feedback (RLHF) approaches often fail to capture the plurality of user opinions, instead reinforcing majority viewpoints and marginalizing minority perspectives. To address this, we introduce PERSONA, a comprehensive and reproducible testbed designed to evaluate and improve pluralistic alignment in language models.
Our approach uses synthetic personas, crafted from a combination of US census data and procedural generation, to simulate a wide array of user profiles with diverse demographic and idiosyncratic attributes. We present a detailed methodology for constructing a representative demographic of 1,586 personas, each enriched with individual personality traits and core values. Leveraging this synthetic demographic, we generate a large-scale preference dataset containing 3,868 prompts and 317,200 pairs of diverse feedback.
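As a rough illustration of this pipeline, the sketch below samples a persona from census-style attribute marginals, attaches procedurally chosen idiosyncratic traits, and builds a role-play prompt for collecting persona-conditioned preference judgments. The attribute lists, weights, trait names, and helper functions are illustrative placeholders, not the actual PERSONA generation code.

```python
import random

# Hypothetical census-style marginals; PERSONA draws from US census data,
# but these attribute lists and weights are illustrative only.
ATTRIBUTES = {
    "age": (["18-29", "30-44", "45-64", "65+"], [0.21, 0.25, 0.33, 0.21]),
    "region": (["Northeast", "Midwest", "South", "West"], [0.17, 0.21, 0.38, 0.24]),
    "education": (["High school", "Some college", "Bachelor's", "Graduate"],
                  [0.28, 0.26, 0.29, 0.17]),
}

# Idiosyncratic traits layered procedurally on top of the demographic skeleton.
TRAITS = ["risk-averse", "community-oriented", "skeptical of institutions",
          "environmentally conscious", "tradition-minded"]


def sample_persona(rng: random.Random) -> dict:
    """Sample one synthetic persona: demographic attributes plus personality traits."""
    persona = {name: rng.choices(values, weights)[0]
               for name, (values, weights) in ATTRIBUTES.items()}
    persona["traits"] = ", ".join(rng.sample(TRAITS, k=2))
    return persona


def preference_prompt(persona: dict, prompt: str, response_a: str, response_b: str) -> str:
    """Build a role-play instruction asking an LM to judge two responses as this persona."""
    profile = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return (f"You are the following person: {profile}.\n"
            f"Question: {prompt}\n"
            f"Response A: {response_a}\nResponse B: {response_b}\n"
            "Which response do you prefer, A or B? Answer with a single letter.")


if __name__ == "__main__":
    rng = random.Random(0)
    persona = sample_persona(rng)
    print(preference_prompt(persona, "Should cities ban cars downtown?",
                            "Yes, it reduces pollution.", "No, it hurts local business."))
```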
This dataset enables the evaluation of language models' ability to align with both group-level and individual preferences across a range of controversial and value-laden topics. Our contributions include a systematic evaluation of current LM capabilities in role-playing diverse users, verified through human judges, and the establishment of a benchmark for pluralistic alignment approaches. Our work aims to facilitate the development of more inclusive and representative language models, paving the way for future research in global pluralistic alignment.
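To make the evaluation concrete, here is a minimal, hypothetical sketch of how persona-conditioned preference agreement could be scored: it computes, for each persona, how often a model's chosen response matches that persona's labeled preference, then reports the mean (group-level fit) alongside the worst case (coverage of minority viewpoints). The record schema and field names are assumptions for illustration, not the dataset's actual format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record format: each item records which response a given persona
# preferred and which one the model under evaluation chose.
records = [
    {"persona_id": 0, "persona_choice": "A", "model_choice": "A"},
    {"persona_id": 0, "persona_choice": "B", "model_choice": "A"},
    {"persona_id": 1, "persona_choice": "B", "model_choice": "B"},
    {"persona_id": 1, "persona_choice": "A", "model_choice": "A"},
]


def per_persona_agreement(records: list[dict]) -> dict[int, float]:
    """Fraction of items where the model's choice matches each persona's preference."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["persona_id"]] += 1
        hits[r["persona_id"]] += int(r["model_choice"] == r["persona_choice"])
    return {pid: hits[pid] / totals[pid] for pid in totals}


if __name__ == "__main__":
    scores = per_persona_agreement(records)
    # The average reflects group-level alignment; the minimum flags personas
    # (often minority viewpoints) the model fails to represent.
    print("per-persona:", scores)
    print("mean:", mean(scores.values()), "worst-case:", min(scores.values()))
```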
Interested in Collaboration?
We're always open to new collaborations and ideas. If you're interested in working with us or have any questions about PERSONA, please reach out!