RLHF: Which Humans Are in the Loop?

January 2, 2026 • by Reilly Sweetland • 4 min read

rlhf, ai-ethics, hate-speech-detection, training-data, representation

Expanding the perspectives that shape how AI understands hate speech

The incredible progress of AI over the past few years was the result of multiple technologies working together. One that doesn't get as much publicity is RLHF (Reinforcement Learning from Human Feedback), a technique that helped open the door to the remarkably capable language models we have today.

But there's a fundamental question embedded in that acronym that deserves more attention: Which humans?

What is RLHF?

At its core, RLHF is simple. After initial training, an AI model's outputs are shown to human evaluators, who provide feedback on them. That feedback becomes an additional training signal, allowing the model to be refined based on human judgment.

Think of it as a kind of apprenticeship. The model generates something, a human says "yes, that's good" or "no, that's wrong," and the model learns from that guidance. Over thousands—sometimes millions—of these interactions, the model develops an increasingly refined sense of what humans want.
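To make the apprenticeship idea concrete, here is a minimal sketch of one common way feedback is turned into a training signal. It assumes evaluators compare two outputs and pick the better one, and that a separate reward model assigns each output a score; the function names and the example scores are hypothetical, and real pipelines vary. The loss is small when the reward model agrees with the human and large when it disagrees.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: -log(sigmoid(chosen - rejected))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def mean_preference_loss(comparisons: list[tuple[float, float]]) -> float:
    """Average loss over (chosen_score, rejected_score) pairs."""
    return sum(preference_loss(c, r) for c, r in comparisons) / len(comparisons)

# Hypothetical reward-model scores for three human comparisons.
# In the last pair the model scores the rejected output higher,
# so that comparison contributes the largest loss.
comparisons = [(2.0, 0.5), (0.1, 0.0), (-1.0, 1.0)]
print(round(mean_preference_loss(comparisons), 3))
```

Whoever supplies the comparisons decides which output counts as "chosen," which is why the question of who the evaluators are matters so much.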

The Representation Challenge

The quality of RLHF depends entirely on the quality and breadth of human perspectives feeding into it. And when it comes to nuanced, culturally specific domains like hate speech, getting representative perspectives is genuinely difficult.

Consider what's required: feedback providers need to recognize not just overt slurs, but dog whistles, coded language, historical references, and context-dependent phrases that shift meaning across communities. No single team of evaluators—no matter how skilled or well-intentioned—can hold all of that knowledge.

This isn't a criticism of current approaches. It's an acknowledgment of the scale of the challenge. The more perspectives that can be incorporated into training data, the more robust the resulting models are likely to be.

Why Hate Speech Is Especially Hard

Hate speech has properties that make it uniquely challenging:

It can establish itself as ground truth. When hateful content goes unrecognized by AI systems, it gains implicit legitimacy. Each missed identification is a small signal that such speech is acceptable.

It evolves faster than training cycles. Hate speech adapts. Slurs get replaced by euphemisms. In-group language gets co-opted. By the time a pattern is recognized and incorporated into training, new patterns have already emerged.

Affected communities are often underrepresented. Some communities are small, marginalized, or geographically dispersed in ways that mean their experiences don't naturally surface in typical data collection. The hate speech that targets them may be invisible to evaluators who haven't lived that experience.

The Value of Broader Perspectives

The fundamental challenge is that hate speech is deeply contextual. Definitions cannot be separated from the perspectives of those who experience it. This is where the "H" in RLHF becomes critical.

At DefineHate.org, we're taking a different approach: rather than pursuing top-down classifications, we've created a system that allows members of targeted communities to self-organize and label examples of hate speech against their groups.

There is no universal definition of hate speech. But when a labeled dataset is organized by specific demographics, consensus patterns start to emerge.
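As a rough illustration of how consensus might be read out of a dataset organized this way, the sketch below tallies labels per community and per example and keeps only the cases with strong agreement. This is not a description of the DefineHate.org system: the record layout, the function name `consensus_by_community`, and the `min_votes` and `threshold` values are all assumptions made for the example.

```python
from collections import defaultdict

def consensus_by_community(votes, min_votes=3, threshold=0.8):
    """Return {(community, example_id): label} where labelers strongly agree.

    votes: iterable of (community, example_id, is_hateful) records.
    An entry appears only if it has at least `min_votes` labels and the
    share of "hateful" labels is >= threshold or <= 1 - threshold.
    """
    # (community, example_id) -> [hateful votes, total votes]
    tallies = defaultdict(lambda: [0, 0])
    for community, example_id, is_hateful in votes:
        tally = tallies[(community, example_id)]
        tally[0] += int(is_hateful)
        tally[1] += 1

    consensus = {}
    for key, (hateful, total) in tallies.items():
        if total < min_votes:
            continue  # too few labels to call it consensus
        share = hateful / total
        if share >= threshold:
            consensus[key] = True
        elif share <= 1 - threshold:
            consensus[key] = False
        # otherwise the community is split and no label is recorded
    return consensus
```

Keeping the community in the key means the same example can carry different labels for different groups, rather than being collapsed into a single global verdict.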

We place far more importance on selecting human labelers (the "H" in RLHF) than previous efforts have. Those with lived experience of discrimination have crucial expertise that has been historically excluded from content moderation systems.

By prioritizing authentic representation and community-driven consensus, we're working to transform hate speech detection from a static technical problem into a dynamic, participatory process—one that evolves with community needs.

Our goal isn't to replace existing approaches, but to augment them with perspectives that have historically been missing from the "H" in RLHF. We're not building this dataset to enforce moderation or censorship—we're giving voice to communities that deserve to be heard, and providing algorithm designers who genuinely wish to respect those communities a dataset that allows them to do so.

We're building tools to help expand the perspectives available for training AI systems to understand hate speech. If you're interested in learning more, we'd love to connect.