NeurIPS 2025

Science Needs Reproducibility. Algorithms Should Too.

Why standard statistical tests flip-flop on the same data, and how a simple roll of the dice fixes it.

⚡ TL;DR

The Problem: Standard statistical tests are unstable. If you run a test on two different datasets from the same distribution, one might say "Reject" and the other "Accept" just by chance. This breaks reproducibility.

The Old Answer: Use a fixed threshold (e.g., "Reject if $Z > 1.96$"). If your data lands at 1.97 one day and 1.95 the next, your conclusion flips.

The New Answer: Replicable Testing. Instead of a fixed line in the sand, pick a random threshold once, and share it.

Key Insight: By spreading the "danger zone" (where decisions flip) across a random range, we ensure that for any specific distribution, the probability of landing in the danger zone is tiny.

📖 The Story in Plain English

The Cliff Edge Problem

Imagine you're testing whether a coin is biased. You flip it 100 times. If you get more than 60 heads, you call it "Biased". If you get 60 or fewer, you call it "Fair".

Now suppose you have a coin that naturally produces about 60 heads on average.

  • Experiment 1: You get 61 heads. Result: BIASED.
  • Experiment 2: You get 59 heads. Result: FAIR.

Two scientists, studying the exact same coin, reach opposite conclusions. This is the reproducibility crisis in a nutshell. Standard algorithms are "brittle" at the boundary.

💡 The Solution: Don't Stand on the Edge

The paper's solution is surprisingly simple: don't use 60 as the cutoff.

Instead, before the experiment starts, pick a random number between 50 and 70. Let's say you pick 64.3. Now everyone uses 64.3.

If the coin averages 60, its count will usually stay below 64.3. Both scientists will likely get numbers near 60 (say, 58 and 62), and both will say "Fair" (because both are less than 64.3).

By randomizing the threshold, we make sure the "cutoff" rarely lands exactly where the data lives.
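
This story is easy to simulate. The sketch below is illustrative only: the coin bias (0.6), flip count (100), and threshold range [50, 70] are taken from the example above, and the agreement estimator is my own construction, not anything from the paper. Each replication pair draws one shared cutoff, mimicking the "pick it once and share it" protocol.

```python
import random

def heads(rng, p=0.6, flips=100):
    """One experiment: count heads from `flips` tosses of a p-biased coin."""
    return sum(rng.random() < p for _ in range(flips))

def agreement(shared_threshold, trials=20_000, seed=0):
    """How often two independent experiments reach the same verdict.

    If shared_threshold is True, each pair of experiments draws one random
    cutoff from [50, 70] and both use it (the shared "public coin");
    otherwise both use the fixed cutoff of 60.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        cut = rng.uniform(50, 70) if shared_threshold else 60
        agree += (heads(rng) > cut) == (heads(rng) > cut)
    return agree / trials

fixed = agreement(shared_threshold=False)
shared = agreement(shared_threshold=True)
print(f"fixed cutoff 60 : two runs agree {fixed:.0%} of the time")
print(f"shared random r : two runs agree {shared:.0%} of the time")
```

In this toy the improvement is real but modest, because the threshold range (width 20) is not much wider than the coin's natural fluctuation (roughly ±10 over 100 flips). Agreement climbs toward 100% as the range grows relative to the noise, which is exactly the $w/R$ argument made later in this post.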

📚 Before We Dive In: What You'll Need

  • Hypothesis Testing: Deciding between a "Null Hypothesis" (nothing interesting) and an "Alternative" (something interesting).
  • Total Variation Distance (TV): A way to measure how different two probability distributions are. $TV(P, Q) = \sup_A |P(A) - Q(A)|$.
  • Sample Complexity: How many data points ($n$) you need to get the right answer with high probability.
  • Chebyshev's Inequality: A bound stating that random variables rarely stray far from their mean.

🚩 The Problem: Algorithms that Flip-Flop

We want an algorithm $A$ that takes samples $S$ from a distribution $D$ and outputs a decision (e.g., "Uniform" or "Not Uniform").

A standard tester guarantees Correctness:

  • If $D$ is uniform, $P[A(S) = \text{Accept}] \ge 0.9$.
  • If $D$ is far from uniform, $P[A(S) = \text{Reject}] \ge 0.9$.

But what if $D$ is somewhere in the middle? Or what if we run the test twice?

Definition: Replicability

An algorithm is $\rho$-replicable if for any distribution $D$, when run on two independent sample sets $S_1, S_2$:

$$ P[A(S_1) = A(S_2)] \ge 1 - \rho $$

In other words: The algorithm must output the same answer almost all the time, regardless of the random sampling noise.

🤨 Reality Check

Q: "Can't we just set a random seed to fix this?"

A: No! Setting a seed fixes the algorithm's internal randomness, but it doesn't fix the data's randomness. If you collect new patients for a clinical trial (new samples $S_2$), a deterministic algorithm might still give a different result than it did for the first group ($S_1$).

🎮 Interactive: The Stability Simulator

See for yourself why fixed thresholds fail and random thresholds succeed. The curve represents the distribution of your test statistic (e.g., the number of heads).

(In the interactive version, a slider moves the statistic's distribution from "Clearly Null" (-3) through the "Boundary" (0) to "Clearly Alt" (+3). At the boundary, the fixed threshold's replicability score drops to about 50%, while the random threshold's stays near 95%. Score = probability that two independent runs give the same result.)

โš™๏ธ The Algorithm: Replicable Independence Testing

The paper introduces a replicable independence tester: an algorithm that decides whether a distribution is a product distribution (i.e., whether its coordinates are independent).

Algorithm: ReplicableIndependenceTester(S_p, S_q)
  1. Input: Sample sets $S_p$ (from $P$) and $S_q$ (from product of marginals).
  2. Flattening: Transform samples into a "flattened" representation to handle heavy elements.
  3. Statistic: Compute the "Closeness Statistic" $Z$ between the flattened sets.
  4. Noise Addition: Add Laplace noise to $Z$ to smooth its distribution.
    This spreads $Z$'s probability mass so it cannot pile up at any single value near the threshold.
  5. Thresholding: Generate random threshold $r$ using Public Coin.
    If $Z + \text{Noise} > r$, return REJECT.
    Else, return ACCEPT.
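
The thresholding core (steps 4–5) is small enough to sketch in code. This is a hedged illustration, not the paper's implementation: the flattening step and the exact closeness statistic are omitted, and the function and parameter names are mine.

```python
import random

def replicable_test(z, public_seed, threshold_range=(0.0, 1.0),
                    noise_scale=0.01, rng=random):
    """Sketch of steps 4-5: noise the statistic, compare to a shared threshold."""
    # Step 4: Laplace noise (private, per-run randomness) smooths Z's
    # distribution. Laplace(0, b) is the difference of two
    # Exponential(mean=b) draws.
    noise = rng.expovariate(1 / noise_scale) - rng.expovariate(1 / noise_scale)
    # Step 5: the threshold comes from the public coin, so everyone who
    # shares public_seed tests against the very same r.
    r = random.Random(public_seed).uniform(*threshold_range)
    return "REJECT" if z + noise > r else "ACCEPT"

# With a shared seed, a statistic far below the threshold range always
# accepts and one far above always rejects, for every researcher:
print(replicable_test(-0.5, public_seed=7))  # ACCEPT
print(replicable_test(1.5, public_seed=7))   # REJECT
```

Note the division of labor: the noise uses each run's private randomness (it only needs to smooth $Z$), while the threshold must come from the shared seed, since that is what keeps two replications aligned.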

🪙 Interactive: The "Public Coin" Necessity

Why do we need a shared random seed? See what happens when two researchers use their own private random seeds.

(In the interactive demo, Researcher Alice and Researcher Bob each observe a data signal, $Z_A$ and $Z_B$, and draw a threshold, $r_A$ and $r_B$. With private seeds, $r_A \neq r_B$, so nearly identical signals can still produce opposite verdicts; with a shared public coin, $r_A = r_B$ and the verdicts match.)

📈 The Cost: Sample Complexity

Replicability isn't free. As the domain size ($n$) grows, the number of samples needed grows faster than for standard testing: $\Theta(n^{2/3})$ for the replicable tester versus $\Theta(\sqrt{n})$ for the standard one.

At $n$ = 1,000: Standard needs ~31 samples. Replicable needs ~100 samples (Cost Factor: 3.2x), and the gap widens as $n$ grows.
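
The asymptotics above are easy to tabulate. The raw values are illustrative only, since the constants hidden inside the $\Theta(\cdot)$'s are ignored:

```python
# sqrt(n) vs n^(2/3), ignoring the constants hidden in the Theta-notation.
for n in (10**3, 10**6, 10**9):
    standard = n ** 0.5
    replicable = n ** (2 / 3)
    print(f"n = {n:>13,}: standard ~{standard:>9,.0f}, "
          f"replicable ~{replicable:>11,.0f}  (cost factor {replicable / standard:.1f}x)")
```

The cost factor is $n^{2/3} / \sqrt{n} = n^{1/6}$, so the price of replicability grows with the domain size, though slowly.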

🧠 The Insight: Randomized Thresholding

The core idea is to transform a standard tester into a replicable one using randomized thresholding.

Suppose you have a test statistic $Z$ (e.g., the $\chi^2$ distance).

  • Standard Test: Reject if $Z > T$.
  • Replicable Test: Reject if $Z > r$, where $r$ is chosen uniformly from $[0, R]$.

Why does this work?

The value of $Z$ varies slightly due to sampling noise. Let's say $Z$ fluctuates within a range of width $w$.

  • If we use a fixed threshold $T$, and $Z$ fluctuates around $T$, we get 50/50 results.
  • If we use a random threshold $r$, the only time we have a problem is if $r$ lands inside the tiny range where $Z$ fluctuates.

If the range of possible thresholds $R$ is much larger than the fluctuation width $w$, the probability of "bad luck" is small ($w/R$).

🔢 Concrete Example

Imagine $Z$ is usually around 100, but fluctuates by $\pm 1$ (so it's in $[99, 101]$).

  • Fixed Threshold: If threshold is 100, you flip-flop constantly.
  • Random Threshold: Pick $r$ from $[0, 200]$.

The only way to flip-flop is if $r$ lands exactly between 99 and 101. The chance of that is $2 / 200 = 1\%$. 99% of the time, $r$ is either safely below 99 (always Accept) or safely above 101 (always Reject).
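
A quick Monte Carlo check of this arithmetic. The numbers 100, ±1, and [0, 200] come from the example above; modeling the fluctuation as uniform on $[99, 101]$ is my simplifying assumption:

```python
import random

rng = random.Random(0)
trials, disagreements = 100_000, 0
for _ in range(trials):
    r = rng.uniform(0, 200)        # threshold drawn once, shared by both runs
    z1 = 100 + rng.uniform(-1, 1)  # run 1's statistic: 100 +/- 1
    z2 = 100 + rng.uniform(-1, 1)  # run 2's statistic: 100 +/- 1
    disagreements += (z1 > r) != (z2 > r)
print(f"disagreement rate: {disagreements / trials:.2%}  (bound: 2/200 = 1%)")
```

The observed rate sits comfortably below the 1% bound: the runs disagree only when $r$ lands between $Z_1$ and $Z_2$, and those two values are typically closer together than the full width of 2.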

🤓 Math Deep Dive

The Replicability Lemma

Let $Z$ be a random variable (our test statistic) such that for any distribution $D$, $Z$ is concentrated in an interval of width $w$ with high probability.

Let $r \sim \text{Uniform}([0, R])$. Define the output $A = \mathbb{I}(Z > r)$.

The probability that two independent runs $Z_1, Z_2$ disagree is:

$$ P[A(Z_1) \neq A(Z_2)] \le P[r \in [\min(Z_1, Z_2), \max(Z_1, Z_2)]] $$

Since $|Z_1 - Z_2| \le w$ (w.h.p.), and the density of $r$ is $1/R$:

$$ P[\text{Disagreement}] \approx \frac{w}{R} $$

To make this small (say, $\le 0.01$), we need $R \ge 100w$.

The Cost: Increasing $R$ means we need a larger gap between the "Null" and "Alternative" hypotheses to distinguish them, which translates to requiring more samples.

Sample Complexity Results

The paper proves tight bounds for Replicable Uniformity Testing on a domain of size $n$:

Standard Testing

$\Theta(\sqrt{n})$

Replicable Testing

$\Theta(n^{2/3})$

This $n^{2/3}$ cost is the price of reproducibility. The paper proves this is optimal for replicable algorithms.

🧐 Critical Analysis: Strengths, Weaknesses & Open Questions

✅ What This Paper Does Well

  • Foundational Framework: It establishes the "rules of the game" for replicable distribution testing, defining the property and proving the first tight bounds.
  • General Technique: The "randomized thresholding" idea is simple, powerful, and likely applicable to many other problems beyond distribution testing.
  • Settles Open Questions: It resolves the sample complexity of replicable uniformity testing, matching the lower bound.

⚠️ Legitimate Concerns

  • The Cost of Stability: The jump from $\sqrt{n}$ to $n^{2/3}$ is significant. This happens because to ensure the random threshold is safe, we need the "Null" and "Alternative" distributions to be farther apart (or have more concentrated statistics), which requires more samples.
  • The "Public Coin" Assumption: The protocol assumes all researchers share a random seed (a "public coin"). If Researcher A uses seed 123 and Researcher B uses seed 456, their results are not guaranteed to match. Coordinating this shared randomness infrastructure is a major practical hurdle.

🎯 Bottom Line

This paper is a theoretical milestone that exposes the "price of reproducibility." While the cost is higher sample complexity, the benefit is algorithms that don't gaslight researchers. As AI and science become more automated, these stability guarantees will become non-negotiable.