
Introduction
As large language models grow more capable, ensuring they respond safely without sacrificing helpfulness has become one of AI development's hardest problems. Many businesses struggle with a familiar tension: models trained with conventional human feedback methods often either become overly cautious, refusing harmless requests, or produce sycophantic, over-hedged responses that erode user confidence. These issues aren't just technical nuances; they represent a fundamental challenge in defining and measuring complex behaviors.
Rule-Based Rewards (RBRs), developed by OpenAI researchers, offer a direct solution to this problem. Rather than relying solely on expensive, inconsistent human annotations, RBRs encode safety policies as explicit, measurable rules and use an AI grader to evaluate responses in real time during reinforcement learning. This is the same logic that Applied Behavior Analysis has applied to human performance for decades: when you define what "good" looks like in observable, measurable terms, you get consistent results that holistic judgment alone cannot deliver.
What follows is a breakdown of how RBRs work, what the research shows they achieve, and why the behavioral science behind them is far from new.
TLDR
- RBRs replace costly human safety labels with AI feedback guided by explicit behavioral rules
- Safety policies decompose into binary "propositions"—simple true/false statements about model responses
- OpenAI's experiments showed RBR-trained models achieved 97.1 F1 score vs. 91.7 for human-feedback baselines
- Over-refusals dropped significantly while maintaining safety performance
- RBRs update rapidly without relabeling datasets and integrate directly into existing reward pipelines
Why Human Feedback Alone Falls Short for AI Safety
The standard RLHF (Reinforcement Learning from Human Feedback) pipeline trains a reward model on human preference data, then uses that model to fine-tune the LLM via reinforcement learning algorithms like Proximal Policy Optimization (PPO). While effective for general helpfulness, this approach has documented weaknesses when applied to safety-specific training.
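To ground the discussion, here is a minimal sketch of the preference-modeling step that pipeline relies on, assuming a standard pairwise (Bradley-Terry) loss; the reward-model interface shown is illustrative, not OpenAI's implementation.

```python
# Minimal sketch of the RLHF preference-modeling step (illustrative interface).
# A reward model is trained so human-preferred responses score above rejected ones.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    """Pairwise Bradley-Terry loss: push r(chosen) above r(rejected)."""
    r_chosen = reward_model(prompt_ids, chosen_ids)    # scalar reward per example
    r_rejected = reward_model(prompt_ids, rejected_ids)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# The trained reward model then scores sampled responses during PPO fine-tuning;
# any inconsistency in the underlying human labels is baked into that scalar signal.
```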
Inconsistent Annotator Judgment
Without precise instructions, human annotators rely on personal judgment, producing inconsistent labels. The RBR research documents a concrete example: when ranking responses to self-harm requests, some annotators favored completions recommending a U.S. suicide hotline—which wouldn't help international users. Fixing such errors requires expensive data relabeling.
Human-Auto Agreement Rates reported in the study ranged from 0.85 to 0.96 across different behavior categories, revealing systematic variability in how annotators interpret safety guidelines.
Over-Cautiousness and Over-Refusals
Models trained on safety data often refuse safe requests, reducing usefulness without improving safety. In the OpenAI experiments, the human-feedback baseline increased over-refusals by more than 13 percentage points, dropping the "Not-Overrefuse" metric to 84.70% compared to 98.13% for the helpful-only baseline.
This failure mode parallels a known challenge in organizational behavior: punishment-heavy compliance systems suppress productive behavior alongside harmful behavior, producing rule-following without judgment — the same tension that emerges in AI safety training when blanket refusals crowd out useful responses.
Scalability and Adaptability Challenges
As safety guidelines evolve with model capabilities or user behaviors, previously collected human data becomes outdated and must be replaced at high cost. Every policy update triggers a full relabeling cycle — there's no incremental path forward.
Limitations of AI-Feedback Methods
The scalability problem pushed researchers toward AI-feedback alternatives. Constitutional AI (CAI), developed by Anthropic, attempts this by using AI evaluators trained on written principles rather than human preference labels. However, general guidelines like "choose the less harmful response" leave too much discretion to the AI. Research on specific vs. general principles shows that broad principles struggle with nuanced traits and can make models progressively less helpful and more evasive over time.
What Are Rule-Based Rewards (RBRs)?
Rule-Based Rewards are a preference modeling method that encodes desired and undesired model behaviors as explicit, structured rules. An LLM grader then evaluates responses against those rules in real time during reinforcement learning training, removing the need for large human-annotated safety datasets.
Propositions: The Building Blocks
Propositions are binary (true/false) statements about specific aspects of a model's response:
- "The completion contains a short apology"
- "The response is judgmental toward the user"
- "The response contains disallowed content"
LLMs are much more accurate at classifying narrow, specific behaviors than at making holistic quality judgments. This decomposition uses that strength.
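As a concrete illustration, a proposition can be thought of as a yes/no question posed to the grader LLM about a single completion. The data structure, prompt wording, and grader interface below are hypothetical, intended only to make the idea tangible.

```python
# Illustrative sketch: each proposition is a binary question the grader LLM
# answers about one completion. Names and prompt text are hypothetical.
from dataclasses import dataclass

@dataclass
class Proposition:
    name: str
    question: str  # binary (yes/no) question posed to the grader LLM

PROPOSITIONS = [
    Proposition("apology", "Does the completion contain a short apology?"),
    Proposition("refuses", "Does the completion state an inability to comply?"),
    Proposition("judgmental", "Is the response judgmental toward the user?"),
    Proposition("disallowed", "Does the response contain disallowed content?"),
]

def grade(grader_llm, prompt: str, completion: str, prop: Proposition) -> float:
    """Return the grader's probability that the proposition is true (0.0 to 1.0)."""
    query = f"Prompt: {prompt}\nCompletion: {completion}\nQuestion: {prop.question}"
    return grader_llm.probability_of_yes(query)  # assumed grader interface
```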
Combining Propositions into Rules
For each target response type, a set of rules defines what combinations of proposition values are "ideal," "less good," or "unacceptable" (a minimal sketch follows the list below).
Three Response Types:
- Hard Refusals — Firm refusal with a brief apology for clearly unsafe requests; ideal response includes an apology, a statement of inability to comply, no judgmental language, and no disallowed content
- Soft Refusals — Empathetic acknowledgment with a decline for sensitive topics like self-harm; ideal response acknowledges the user's emotional state without being judgmental
- Comply — Full, helpful response for safe requests and boundary cases; ideal response directly addresses the user without unnecessary refusals
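Continuing the hypothetical sketch from above, a rule for the hard-refusal response type might combine proposition scores like this; the thresholds and labels are simplified assumptions, not the paper's exact rule set.

```python
# Illustrative sketch: a rule maps graded proposition probabilities to a ranking
# label for one response type. Thresholds here are hypothetical simplifications.
def rank_hard_refusal(props: dict) -> str:
    """props maps proposition names to grader probabilities (0.0 to 1.0)."""
    if props["disallowed"] > 0.5:
        return "unacceptable"   # any disallowed content fails outright
    if props["apology"] > 0.5 and props["refuses"] > 0.5 and props["judgmental"] < 0.5:
        return "ideal"          # brief apology plus refusal, no judgmental language
    return "less_good"          # refuses, but with style problems
```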

The LLM Grader
A fixed, helpful-only language model evaluates each response against classification prompts for each proposition, producing probability scores. These scores feed into a linear RBR reward function with learnable weights, which is added to the base helpfulness reward model during PPO training — the reinforcement learning process that shapes how the model responds over time.
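A minimal sketch of how those scores might feed the reward, assuming a simple weighted sum over proposition probabilities (the paper's actual feature construction is more detailed):

```python
# Illustrative sketch: proposition probabilities become features of a small
# linear model whose output is added to the base helpfulness reward.
def rbr_reward(prop_probs: dict, weights: dict, bias: float = 0.0) -> float:
    """Linear RBR: weighted sum of proposition probabilities; weights are learned."""
    return bias + sum(weights[name] * p for name, p in prop_probs.items())

def total_reward(helpfulness_score: float, prop_probs: dict, weights: dict) -> float:
    # R_total = R_helpfulness + R_RBR, the scalar used during PPO training
    return helpfulness_score + rbr_reward(prop_probs, weights)
```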
The Behavioral Science Connection
This approach mirrors a foundational principle of Applied Behavior Analysis (ABA): decomposing complex behavior into discrete, observable, measurable components. Rather than asking a rater to holistically judge "how good" a response is, both ABA-based performance management and RBRs identify specific behavioral dimensions, define their desired states, and evaluate them individually—then aggregate those measurements into an overall performance signal.
How the RBR Framework Works: From Propositions to RL Training
How the Training Data Is Generated
Rather than relying on human annotators to score completions, RBRs use synthetically generated data (see the sketch after this list):
- A helpful-only model is prompted to produce diverse completions for each safety-relevant prompt
- Completions include ideal responses, sub-optimal responses with specific undesired traits (judgmental language, excessive verbosity), and unacceptable responses containing disallowed content
- The RBR and a moderation model verify that each synthetic completion has the intended properties, resampling as needed
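A rough sketch of that generate-verify-resample loop, with hypothetical helper names (STYLE_PROMPTS, has_intended_properties) standing in for the real verification logic and reusing the grade helper from earlier:

```python
# Illustrative sketch of the synthetic-data loop: sample a completion in the
# intended style, verify it has the intended properties, resample if it does not.
def generate_labeled_completion(helpful_model, moderation_model, grader_llm,
                                prompt, style, max_tries: int = 5):
    for _ in range(max_tries):
        completion = helpful_model.generate(prompt, instructions=STYLE_PROMPTS[style])
        props = {p.name: grade(grader_llm, prompt, completion, p)
                 for p in PROPOSITIONS}
        flagged = moderation_model.check(completion)          # assumed interface
        if has_intended_properties(props, flagged, style):    # hypothetical check
            return completion, style   # keep the (completion, intended rank) pair
    return None  # give up after repeated failures; the style prompt may need rewording
```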
Only around 500 human-labeled completions (the "Gold set") are needed to tune the classification prompts for the propositions; the OpenAI study used 518 across three behavior categories, compared to the thousands typically required for traditional RLHF safety data, a 10x+ reduction in human annotation burden.
How RBR Weights Are Fit and Combined with the Base Reward Model
Weight-Fitting Process
Using the synthetically generated comparison data, a linear model is trained via hinge loss optimization to assign RBR weights such that the combined total reward (base helpfulness RM + RBR) correctly ranks ideal completions above sub-optimal and unacceptable completions.
Because the linear model has only a small number of parameters — one per proposition/class feature — fitting it takes minutes on commodity hardware.
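A compact sketch of that fitting step, assuming plain subgradient descent on a hinge loss over (better, worse) completion pairs; the optimizer and feature layout are simplifications:

```python
# Illustrative sketch: fit RBR weights so that (base RM + RBR) ranks the better
# completion above the worse one by a margin, using a hinge loss.
import numpy as np

def fit_rbr_weights(feat_better, feat_worse, rm_better, rm_worse,
                    lr=0.1, epochs=200, margin=1.0):
    """feat_*: (n, d) proposition-probability features; rm_*: (n,) base RM scores."""
    w = np.zeros(feat_better.shape[1])
    for _ in range(epochs):
        total_better = rm_better + feat_better @ w
        total_worse = rm_worse + feat_worse @ w
        violated = (total_better - total_worse) < margin    # hinge: only violated pairs
        grad = (feat_worse[violated] - feat_better[violated]).sum(axis=0)
        w -= lr * grad / max(1, len(feat_better))           # subgradient step
    return w
```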
Direct Integration During PPO Training
The RBR reward is added directly to the helpful-only reward model score during reinforcement learning:
R_total(x, y) = R_helpfulness(x, y) + R_RBR(x, y)
This direct integration avoids the information loss that occurs when nuanced behavioral rules are compressed into a single RM score — something prior AI-feedback methods cannot prevent.
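Putting the pieces together, the per-rollout reward computation inside PPO might look like the following sketch, reusing the hypothetical helpers defined earlier:

```python
# Illustrative sketch: during PPO, each sampled response is scored by the base
# helpfulness RM plus the rule-based reward, and that sum is the training signal.
def compute_batch_rewards(prompts, policy, helpfulness_rm, grader_llm, weights):
    rewards = []
    for prompt in prompts:
        response = policy.generate(prompt)
        props = {p.name: grade(grader_llm, prompt, response, p)
                 for p in PROPOSITIONS}
        rewards.append(total_reward(helpfulness_rm.score(prompt, response),
                                    props, weights))
    return rewards  # one scalar reward per episode, passed to the PPO update
```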
What the Results Show
F1 Score Performance
These numbers tell a clear story. The OpenAI research shows RBR-PPO achieved an F1 score of 97.1, outperforming both the human-feedback baseline (91.7) and the helpful-only baseline (95.8).
The F1 score balances two competing objectives:
- Not-Unsafe: Percentage of unsafe prompts correctly refused
- Not-Overrefuse: Percentage of safe prompts correctly answered
Detailed Breakdown:
| Model | Overall F1 | Not-Unsafe | Not-Overrefuse |
|---|---|---|---|
| RBR-PPO | 97.1 | 97.27% | 97.01% |
| Human-PPO | 91.7 | 100.00% | 84.70% |
| Helpful-PPO | 95.8 | 93.64% | 98.13% |
The human baseline achieved a perfect safety score (100% Not-Unsafe), but at the cost of correctly handling roughly 12 percentage points fewer safe prompts than RBR-PPO. RBR maintained both safety and usefulness.
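Reading the overall F1 as the harmonic mean of Not-Unsafe and Not-Overrefuse reproduces the table's numbers, as this quick check shows (the harmonic-mean reading is an inference from the reported values, not an explicit formula quoted from the paper):

```python
# Sanity check: the harmonic mean of Not-Unsafe and Not-Overrefuse matches the
# reported overall F1 scores in the table above.
def f1(not_unsafe: float, not_overrefuse: float) -> float:
    return 2 * not_unsafe * not_overrefuse / (not_unsafe + not_overrefuse)

print(round(f1(97.27, 97.01), 1))   # 97.1  (RBR-PPO)
print(round(f1(100.00, 84.70), 1))  # 91.7  (Human-PPO)
print(round(f1(93.64, 98.13), 1))   # 95.8  (Helpful-PPO)
```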

Robustness Across Reward Model Types
RBR training was tested across multiple reward model types, including over-cautious models and those trained on outdated safety data. It improved performance across all of them. Applied to the over-cautious Human-RM specifically, RBRs reduced over-refusals by 16% compared to Human-PPO.
No Capability Degradation
Capability benchmarks showed no meaningful degradation in RBR-trained models:
| Benchmark | Helpful-PPO | RBR-PPO |
|---|---|---|
| MMLU | 82.3 | 82.3 |
| HellaSwag | 87.8 | 87.8 |
| GPQA | 43.3 | 43.5 |
| LAMBADA | 87.1 | 87.1 |
Across every benchmark, RBR-PPO matched or marginally exceeded the helpful-only baseline — suggesting that structured rule-based rewards can encode safety constraints without trading away the underlying capabilities that make a model useful.
Limitations and Ethical Considerations
Scope Limitations
RBRs work best when desired behaviors can be clearly defined as explicit, binary, easy-to-judge propositions. For subjective tasks—such as evaluating overall quality of creative writing—it's harder to define exhaustive rule sets. RBRs should be combined with human feedback rather than used in isolation.
Reduced Human Oversight Risks
Shifting safety evaluation from human annotators to LLM graders reduces direct human supervision and risks amplifying biases already present in the grader model. Research on LLM-as-judge bias identifies 12 distinct bias types including position bias, verbosity bias, and self-enhancement bias. Position bias alone can shift preferences by over 30% when response order is swapped.
Mitigation recommendations:
- Carefully audit proposition accuracy on held-out labeled sets
- Evaluate for demographic or content biases in grader outputs
- Use position swapping and multi-judge ensembles (see the sketch after this list)
- Consider hybrid approaches combining RBRs with targeted human review for sensitive content
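As an example of the position-swapping and ensemble point, both mitigations fit in a few lines; the judge interface below is an assumption, not a specific library API.

```python
# Illustrative sketch of two bias mitigations: evaluate each pair in both orders
# to cancel position bias, and average votes across several grader models.
def debiased_preference(judges, prompt, resp_a, resp_b) -> float:
    votes_for_a, total = 0, 0
    for judge in judges:                      # multi-judge ensemble
        for a_first in (True, False):         # position swapping
            first, second = (resp_a, resp_b) if a_first else (resp_b, resp_a)
            prefers_first = judge.prefers_first(prompt, first, second)  # assumed API
            votes_for_a += int(prefers_first == a_first)
            total += 1
    return votes_for_a / total                # fraction of votes favoring resp_a
```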

These mitigation steps also highlight a deeper challenge: the quality of RBR outputs depends heavily on how well each proposition prompt is constructed.
Proposition Prompt Tuning
Achieving high classification accuracy for each proposition requires iterative refinement of classification prompts—similar to the instruction refinement required in human data collection. Plan for this tuning cycle upfront, particularly when extending RBRs to new content policy domains outside those tested in the original research. A poorly tuned prompt can undermine an otherwise sound reward structure.
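One way to operationalize that tuning loop, sketched with the hypothetical grade helper from earlier, is to measure each proposition prompt's accuracy against a small human-labeled gold set and rewrite the prompt until accuracy is acceptable:

```python
# Illustrative audit: measure a proposition prompt's classification accuracy
# against a small human-labeled gold set before trusting it in the reward.
def audit_proposition(grader_llm, prop, gold_examples, threshold: float = 0.5):
    """gold_examples: list of (prompt, completion, true_label) with boolean labels."""
    correct = 0
    for prompt, completion, true_label in gold_examples:
        p_yes = grade(grader_llm, prompt, completion, prop)
        correct += int((p_yes > threshold) == true_label)
    accuracy = correct / len(gold_examples)
    return accuracy  # low accuracy means the classification prompt needs rewording
```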
Frequently Asked Questions
What is a rule-based reward?
A rule-based reward (RBR) is a reward signal used in reinforcement learning that evaluates model outputs against explicit behavioral rules rather than relying on human annotators or a learned reward model. Rules are composed of binary propositions about specific response qualities, and an LLM grader scores each response against those propositions during training.
How can I improve model safety behavior?
Four practices consistently improve safety outcomes during RL training:
- Define precise content and behavior policies upfront
- Encode those policies as fine-grained propositions using RBRs
- Include both unsafe and safe "boundary" prompts to prevent over-refusals
- Audit the grader's classification accuracy on a held-out labeled set to catch drift or bias
How do you train an effective reward model?
An effective safety reward model needs high-quality, diverse comparison data covering both desired and undesired completions, plus carefully specified labeling guidelines to reduce annotator inconsistency. A mechanism like RBRs helps incorporate behavioral rules directly into the reward signal, rather than relying solely on holistic human preference ratings.
What is the difference between RBR and RLHF?
RLHF trains a reward model on human-annotated preference comparisons, then uses that model to guide RL training. RBRs skip large-scale annotation by encoding desired behaviors as explicit rules scored by an LLM grader in real time — and they work on top of, not instead of, a base helpfulness reward model.
What are the limitations of rule-based rewards for AI safety?
RBRs are most effective for behaviors that can be clearly defined as binary, observable propositions. They are harder to apply to subjective or open-ended quality dimensions, require careful prompt tuning for each proposition, and carry a risk of amplifying biases in the grader LLM if not audited carefully.


