AI Stability & Drift Analysis

Which AI Models Stay On-Task in Long Conversations?

We tested 5 leading AI models on 10 different test cases over 20-turn conversations. Stability varied: some models maintained consistent behavior throughout, while others drifted, growing wordier or dropping instructions mid-conversation.

Test Results Published: December 21, 2025
Methodology: Each model was tested on 10 different test cases (avoid hedging, code formatting, JSON output, etc.) over 20-turn conversations. We measured compliance drop (how much instruction-following degrades over the conversation) and verbosity drift (how much response length changes over time). Each test case ran 5 trials per model under 2 system presets (Default and Friendly), for 10 × 5 × 2 = 100 conversations per model.

Understanding the Metrics

📏 Verbosity Drift

Measures how much response length changes from the first turn to the 20th turn of a conversation.

drift = |words_turn20 - words_turn1| / words_turn1
  • 0.0 = Perfect stability - Same word count throughout
  • 0.2 = 20% change - Moderate drift in response length
  • 0.5 = 50% change - Significant drift (wordier or terser)

Lower is better. A model that starts with 50-word responses and drifts to 75 words has a drift of 0.5 (50% increase).
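To make the formula concrete, here is a minimal Python sketch of the calculation; the function name and the example word counts are illustrative, not part of the published tooling.

  def verbosity_drift(words_turn1: int, words_turn20: int) -> float:
      # Relative change in response length between the first and last turn.
      return abs(words_turn20 - words_turn1) / words_turn1

  # Example from the text: 50 words on turn 1, 75 words on turn 20.
  print(verbosity_drift(50, 75))  # 0.5 -> 50% drift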

✓ Compliance Drop

Measures how much a model's instruction-following degrades between turn 1 and turn 20, after a conflicting instruction has been introduced mid-conversation.

drop = compliance_turn1 - compliance_turn20
  • 0.0 = Perfect consistency - Follows instructions equally well on both turns
  • 0.06 = 6% drop - Slightly less compliant by turn 20
  • 0.2 = 20% drop - Significantly worse at following instructions

Lower is better. Negative values mean the model actually improved compliance by turn 20.
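The drop is just as simple to compute. The sketch below assumes compliance is already scored per turn on a 0-1 scale; the function name and example scores are illustrative.

  def compliance_drop(compliance_turn1: float, compliance_turn20: float) -> float:
      # Positive values mean instruction-following got worse by turn 20.
      return compliance_turn1 - compliance_turn20

  print(round(compliance_drop(1.0, 0.80), 2))  # 0.2 -> 20% drop
  print(round(compliance_drop(0.90, 0.96), 2)) # -0.06 -> improved by turn 20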

Test Design: Each conversation starts with a specific instruction (e.g., "respond in exactly 20 words"). After 10 filler turns, we insert a conflicting instruction ("respond in one sentence"). After 8 more filler turns, we re-test the original instruction. This checks whether models hold on to their initial task understanding or get pulled off course by the intervening instruction. A sketch of the schedule follows.
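For illustration, the user-prompt schedule for one conversation could be assembled as below; the filler prompt wording and the exact placement of each prompt within the 20-turn window are assumptions for this sketch, not the actual prompts used in the runs.

  # Illustrative schedule for one test case (prompt wording is hypothetical,
  # apart from the two instructions quoted above).
  original = "Respond in exactly 20 words."
  conflicting = "Respond in one sentence."
  filler = "Tell me a bit more about this topic."

  user_turns = (
      [original]        # opening instruction; compliance measured on the first reply
      + [filler] * 10   # filler turns
      + [conflicting]   # conflicting mid-conversation instruction
      + [filler] * 8    # more filler turns
      + [original]      # final re-test of the original instruction
  )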

Key Findings

Stability Rankings

Models ranked by their ability to maintain consistent behavior across conversations.

Model Comparison

Stability metrics compared across all tested models.

Detailed Results

In-depth metrics for each model and preset combination.

Raw Data & Verification

Download complete conversation transcripts and metrics for independent verification. All data includes timestamps, token usage, and latency measurements.

Summary Data

Aggregated metrics for all models and presets with statistical analysis.

Download summary.json (16KB)

Highlights

Top findings showing the biggest deltas between the two presets.

Download highlights.json (2KB)

Complete Raw Logs (Compressed)

All 10 run files with full conversation transcripts, token counts, and latency data (5 models × 2 presets). Files are gzip-compressed for efficiency.

Data Contents: Each run.json.gz file contains complete conversation transcripts for 10 test cases × 5 trials = 50 conversations per model/preset, including all user prompts, assistant responses, token usage, latency measurements, and computed metrics. Decompress with gunzip or any standard decompression tool. Total dataset: ~280KB compressed, ~5.7MB uncompressed across 10 files.
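To inspect a run file programmatically, something like the following works. Since the JSON schema isn't documented here, the snippet only loads the file and peeks at its top-level structure; the filename is a placeholder for whichever run file you downloaded.

  import gzip
  import json

  # Load one compressed run file (replace with the actual downloaded filename).
  with gzip.open("run.json.gz", "rt", encoding="utf-8") as f:
      run = json.load(f)

  # Peek at the top-level structure without assuming a particular schema.
  if isinstance(run, dict):
      print(list(run.keys()))
  else:
      print(type(run), len(run))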