[Paper Review] Investigating Data Contamination for Pre-training Language Models

Review of a critical study on data contamination in LLMs, questioning whether performance gains are due to memorization of test data.

Note: This is a review of the paper “Investigating Data Contamination for Pre-training Language Models” (arXiv:2401.06059, 2024).

Code: No official code repository found at the time of writing.

For a Korean version of this review, please visit the OUTTA AI Tech Blog.

Why I Read This Paper

With new LLMs being released constantly, you have probably wondered at least once, “Is this model’s performance real?” Benchmark scores are often high while real-world usability is poor. This paper directly investigates one possible cause: “Data Contamination.”

What I found particularly interesting is that the authors measured the impact by deliberately mixing contaminated data into the corpus while training the model “From Scratch.” This is a completely different approach from existing studies, which could only estimate contamination in already-trained models. It impressed me because it shows with clear data why we shouldn’t blindly trust LLM benchmark scores.


Introduction

The exceptional performance of Large Language Models (LLMs) is often attributed to model scale and data size. However, a critical question remains: Are these models actually learning, or are they just memorizing the test data?

This paper investigates Data Contamination, where evaluation data leaks into the pre-training corpus. Most prior contamination work is post-hoc: it estimates leakage in already-trained models via n-gram overlap (as reported for GPT-3, PaLM, and LLaMA-2).

This research instead takes a “pre-training level analysis” approach: it trains GPT-2 models from scratch with controlled amounts of contamination, which lets it measure the direct, causal impact rather than infer it, and then shows that those overlap-based detectors are unreliable.


Data Contamination Types

The authors distinguish between two types of contamination:

  1. Text Contamination: Only the input text of the evaluation samples is present in the pre-training data.
  2. Ground-truth Contamination: The input text, prompt, and the correct answer (ground truth) are all present. This is a more severe form of leakage that previous studies often overlooked.
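To make the distinction concrete, here is a minimal sketch of what each contamination type would inject into a pre-training corpus. The `EvalSample` schema, the function names, and the newline-joined layout are assumptions made for illustration; the paper's exact formatting may differ.

```python
from dataclasses import dataclass


@dataclass
class EvalSample:
    """One evaluation example (hypothetical schema for illustration)."""
    text: str    # the input passage or sentence
    prompt: str  # the instruction or question shown at evaluation time
    answer: str  # the ground-truth label or reference output


def text_contamination(sample: EvalSample) -> str:
    """Only the raw input text leaks into the pre-training corpus."""
    return sample.text


def ground_truth_contamination(sample: EvalSample) -> str:
    """Input text, prompt, and correct answer all leak together."""
    return f"{sample.text}\n{sample.prompt}\n{sample.answer}"


sample = EvalSample(
    text="The movie was a delight from start to finish.",
    prompt="Is the sentiment positive or negative?",
    answer="positive",
)
text_doc = text_contamination(sample)
gt_doc = ground_truth_contamination(sample)
print(text_doc)
print("---")
print(gt_doc)
```

In the first case the model only ever sees the sentence; in the second it also sees the question and the label it will later be scored on.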

Experiments & Findings

1. Impact on Performance

For generation-style tasks, Ground-truth Contamination tends to have a larger impact on performance than simple Text Contamination. SST-2 is a notable exception: there, Text Contamination yields the higher score. The authors attribute this to the nature of text classification, which depends mainly on the model’s comprehension of the input text, so seeing the labeled answer in pre-training helps less than seeing the input itself.

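| Dataset | Metric | Original | Text Contam. | GT Contam. |
|---|---|---|---|---|
| SST-2 | Acc (%) | 48.34 | 54.89 | 51.02 |
| SQuAD | F1 | 9.07 | 9.78 | 11.45 |
| CNN/DM | ROUGE-1 | 24.76 | 26.84 | 28.80 |
| MMLU | Acc (%) | 22.87 | 23.03 | 23.13 |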

Table 1: Performance comparison (numbers from the paper; see Tables 2 and 3). Ground-truth contamination boosts performance most clearly in generation tasks such as SQuAD and CNN/DM, whereas for the SST-2 classification task text contamination has the larger effect.
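The gains are easier to compare as differences from the uncontaminated baseline. The short script below only does arithmetic on the numbers in Table 1; the dictionary layout and the `gains` helper are my own, not something from the paper.

```python
# Scores copied from Table 1: (original, text contam., GT contam.)
scores = {
    "SST-2 (Acc %)": (48.34, 54.89, 51.02),
    "SQuAD (F1)": (9.07, 9.78, 11.45),
    "CNN/DM (ROUGE-1)": (24.76, 26.84, 28.80),
    "MMLU (Acc %)": (22.87, 23.03, 23.13),
}


def gains(original, text, gt):
    """Absolute improvement over the baseline for each contamination type."""
    return round(text - original, 2), round(gt - original, 2)


deltas = {name: gains(*row) for name, row in scores.items()}
for name, (d_text, d_gt) in deltas.items():
    larger = "text" if d_text > d_gt else "ground-truth"
    print(f"{name:18s} text +{d_text:.2f}  GT +{d_gt:.2f}  larger: {larger}")
```

SST-2 is the only row where text contamination gives the larger gain (+6.55 vs. +2.68). On MMLU both gains are tiny (+0.16 and +0.26 points), so that row says little either way.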

2. The U-Shaped Effect of Repeated Contamination

One of the most interesting findings is the U-shaped performance trend when the contamination is repeated multiple times in the pre-training corpus.

Figure 1 (Contamination Factor Analysis): The effect of repeated contamination. Performance initially improves as the contamination factor increases (0-10 repetitions), but then starts to decline and even drops below the baseline with excessive repetition (20+) (from Fig. 1 of the paper).

This suggests that while some exposure to test data helps, over-fitting to the specific examples eventually hurts the model’s generalizability or introduces noise.
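As a rough sketch of what a "contamination factor" means in practice, the snippet below repeats a set of leaked samples a given number of times inside an otherwise clean corpus. The function name `build_corpus`, the toy document strings, and the shuffle-based mixing are assumptions for illustration; the paper's actual injection procedure may differ.

```python
import random


def build_corpus(clean_docs, contaminated_docs, factor, seed=0):
    """Return a shuffled corpus with each contaminated doc repeated `factor` times.

    factor=0 reproduces the uncontaminated baseline.
    """
    corpus = list(clean_docs) + list(contaminated_docs) * factor
    random.Random(seed).shuffle(corpus)
    return corpus


# Toy stand-ins for real documents (hypothetical data).
clean = [f"clean document {i}" for i in range(1000)]
leaked = [f"leaked eval sample {i}" for i in range(10)]

for factor in (0, 1, 5, 10, 20):
    corpus = build_corpus(clean, leaked, factor)
    share = (len(corpus) - len(clean)) / len(corpus)
    print(f"factor={factor:2d}  docs={len(corpus)}  contaminated share={share:.1%}")
```

Training one model per factor and plotting the benchmark score against the factor would reproduce the kind of curve described in Figure 1.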

3. Failure of Existing Detection Methods

The authors also evaluate existing contamination detection methods (like n-gram overlap used in PaLM and LLaMA-2).

Figure 2 (Contamination Detection Analysis): Analysis of LLaMA-2’s contamination detection method. The “Dirty” category (high overlap) does not necessarily correspond to higher performance, indicating that current detection methods are unreliable (from the paper).

They find that these methods often fail to distinguish between harmful contamination and harmless data overlap, leading to both false positives and false negatives.
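To see why overlap-based detection can miss leaks, here is a simplified n-gram overlap detector. It is a sketch in the spirit of the methods discussed, not the exact PaLM or LLaMA-2 procedure: the whitespace tokenization, `n=3`, the 0.8/0.2 thresholds, and the label names are all illustrative assumptions.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(sample, corpus_docs, n=3):
    """Fraction of sample tokens covered by an n-gram that also occurs in the corpus."""
    tokens = sample.lower().split()
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc.lower().split(), n)
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens) if tokens else 0.0


def label(fraction, dirty_threshold=0.8, clean_threshold=0.2):
    """Bucket a sample by its overlap fraction (thresholds are illustrative)."""
    if fraction >= dirty_threshold:
        return "dirty"
    if fraction < clean_threshold:
        return "clean"
    return "partial"


corpus = ["the quick brown fox jumps over the lazy dog"]
verbatim = "the quick brown fox jumps over the lazy dog"
paraphrase = "a fast brown fox leaps over a lazy dog"

for name, sample in (("verbatim", verbatim), ("paraphrase", paraphrase)):
    frac = contaminated_fraction(sample, corpus)
    print(f"{name:10s} overlap={frac:.2f}  label={label(frac)}")
```

The verbatim copy is flagged as dirty, while the paraphrase shares no 3-gram with the corpus and is labeled clean even though it carries the same content. That is one way a false negative can arise.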


Conclusion & Insight

The contribution is methodological: by training GPT-2 from scratch with controlled contamination, the paper shows causally, rather than by post-hoc estimation, that ground-truth contamination inflates scores most on generation tasks, and that current n-gram detection is unreliable.

Strengths

  • Controlled from-scratch training isolates the causal effect of contamination, which post-hoc studies on already-trained models cannot.
  • The Text vs. Ground-truth distinction is a useful, often-overlooked axis; the U-shaped repetition effect is a concrete, actionable finding.

Limitations

  • Experiments are at GPT-2 scale; whether the same magnitudes hold for today’s much larger LLMs is untested and may not extrapolate.
  • Conclusions are tied to a specific benchmark set (SST-2/SQuAD/CNN-DM/MMLU), and several absolute scores are low. MMLU accuracy of about 23% is below the 25% chance level for four-option questions, so some effects are measured near the noise floor.
  • It critiques existing detectors but does not deliver a robust replacement detection method.

Open Questions / My Take

The headline lesson is important: leaderboard scores can be inflated by memorization, and overlap-based detectors can miss it. The open question is the scale gap: do these effects grow, shrink, or change shape at 100B+ parameters?

This post is licensed under CC BY 4.0 by the author.