How ProofPod Works

You tell us what you want to test. We tell you how to run it and whether it worked.

ProofPod handles the hard parts—which locations to test, how long to run, and whether the results are real or just noise. Here's what happens behind the scenes.


1. We design your experiment

Running a test sounds simple. Pick some stores, try something new, see what happens. But which stores? How many? How long does the test need to run before the results mean anything?

We analyze your historical data and tell you exactly how to run your test—which locations, which controls, and how long you need for a confident read.

The result: You start with a plan that will actually yield a clear answer, not a guess you'll second-guess later.

Under the hood

We use techniques from causal inference—the same methods economists use to measure policy effects and pharmaceutical companies use to evaluate treatments. Our matching algorithm accounts for seasonality, location-level variation, and historical volatility to create balanced test and control groups with enough statistical power to detect meaningful effects.
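For a concrete (and much simplified) picture, here is a sketch of the kind of matching this involves. It assumes weekly sales history in a pandas DataFrame; the column names, the two summary features, and the greedy pairing are illustrative, not ProofPod's actual schema or algorithm:

```python
# Illustrative sketch only. Assumes weekly sales history with hypothetical
# columns: "location_id", "week", "sales".
import pandas as pd

def match_controls(history: pd.DataFrame, test_ids: list) -> dict:
    # Summarize each location by its average weekly sales and its volatility.
    feats = history.groupby("location_id")["sales"].agg(["mean", "std"])
    # Standardize so both features contribute equally to the distance.
    z = (feats - feats.mean()) / feats.std()

    pairs = {}
    available = set(z.index) - set(test_ids)
    for tid in test_ids:
        # Greedy nearest neighbor: the most similar remaining control location.
        dists = ((z.loc[list(available)] - z.loc[tid]) ** 2).sum(axis=1)
        best = dists.idxmin()
        pairs[tid] = best
        available.remove(best)
    return pairs
```

A real design also balances on seasonal patterns and checks that the matched groups have enough statistical power before the plan is locked in.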


2. We solve the problem A/B testing cannot

On a website, you can show two visitors different experiences at the same moment. In a store, you cannot. Every customer in that location gets the same price, the same promo, the same experience.

That means true randomization isn't possible. Instead, we compare locations to each other—test locations running your experiment versus control locations running business as usual. This requires different methods than traditional A/B testing, and we've built ProofPod specifically for this challenge.

The result: You get rigorous answers even though you cannot run a textbook randomized experiment.


3. We isolate the true effect

Your test locations went up 12%. Great—but was that your experiment, or was it the weather, a holiday, or a competitor closing nearby?

We measure how your test locations changed relative to your control locations over the same period. This isolates what your experiment actually caused.

The result: You're not fooled by external factors. If your control locations went up 10% and your test locations went up 12%, your experiment drove roughly a 2-point lift, not the full 12%.

Under the hood

We use methods developed by economists to measure cause and effect in the real world—where you cannot run perfect laboratory experiments. Our approach controls for day-of-week effects, location-level differences, and correlated observations over time.
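As a rough sketch of what such an estimate can look like, here is a difference-in-differences style regression with location and day-of-week controls and standard errors clustered by location. It uses statsmodels with hypothetical column names; it is not ProofPod's internal model:

```python
# Illustrative difference-in-differences sketch. Hypothetical columns:
# "sales", "treated" (1 = test location), "post" (1 = during the test window),
# "location_id", and "dow" (day of week).
import statsmodels.formula.api as smf

def estimate_lift(df):
    model = smf.ols(
        "sales ~ post + treated:post + C(location_id) + C(dow)",
        data=df,
    )
    # Cluster standard errors by location: repeated observations from the
    # same store are correlated over time.
    fit = model.fit(cov_type="cluster", cov_kwds={"groups": df["location_id"]})
    # The interaction term is the estimated effect of the experiment itself.
    return fit.params["treated:post"], fit.bse["treated:post"]
```

The location terms soak up persistent differences between stores, the day-of-week terms soak up weekly rhythm, and what remains in the interaction is the change attributable to the test.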


4. We answer the question you're actually asking

Traditional statistics tell you how likely you'd be to see results this extreme if your change had done nothing. That's not what you need to know. You need to know whether the change you made is actually causing the results you're seeing.

We calculate the probability your experiment helped, the probability it hurt, and whether you need more data. That's a direct answer to the decision in front of you.

The result: A clear recommendation—Scale, Kill, or Continue—with the confidence level behind it.

Under the hood

We use Bayesian inference to convert effect estimates into decision-relevant probabilities. Instead of asking "is this statistically significant?" we ask "what's the probability this exceeded your target lift?" That's the question that matters for your decision.
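As a simplified illustration, once the analysis produces an effect estimate and a standard error, and the posterior is treated as approximately normal (a common large-sample shortcut, not necessarily ProofPod's exact model), the decision probabilities fall out directly:

```python
# Illustrative sketch: convert an effect estimate and its uncertainty into
# decision probabilities, assuming an approximately normal posterior.
from scipy.stats import norm

def decision_probabilities(effect, se, target_lift):
    p_helped = 1 - norm.cdf(0, loc=effect, scale=se)       # P(effect > 0)
    p_hurt = norm.cdf(0, loc=effect, scale=se)              # P(effect < 0)
    p_beat_target = 1 - norm.cdf(target_lift, loc=effect, scale=se)
    return p_helped, p_hurt, p_beat_target
```

For example, an estimated 2-point lift with a 1-point standard error gives roughly an 84% probability of beating a 1-point target.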


5. We account for real-world noise

What about weather, or a manager quitting, or a local event?

These things happen. And they're already in your data.

Your historical sales include countless examples of bad weather weeks, manager transitions, local disruptions, and random variance. We analyze this history to understand how noisy your data naturally is—and we design your test with that noise in mind.

This means we set the right test duration, select the right number of locations, and build in enough cushion that one-off events don't throw off your results. If something truly unusual happens—an outlier we can detect—we'll flag it.

The result: You don't need perfect conditions to get a trustworthy answer. You just need a well-designed test.
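To give a flavor of how historical noise turns into a plan, here is a back-of-the-envelope duration calculation using a standard two-sample power formula. The function and its inputs are illustrative; ProofPod's actual planning accounts for more than this:

```python
# Illustrative sketch: estimate how many weeks a test needs, given how noisy
# weekly sales have been historically. Standard two-sample power calculation.
import numpy as np
from scipy.stats import norm

def weeks_needed(weekly_sales, locations_per_arm, detectable_lift,
                 alpha=0.05, power=0.8):
    weekly_sales = np.asarray(weekly_sales, dtype=float)
    sigma = weekly_sales.std(ddof=1)                  # natural week-to-week noise
    delta = detectable_lift * weekly_sales.mean()     # target lift in sales units
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_per_arm = 2 * (z * sigma / delta) ** 2          # location-weeks per arm
    return int(np.ceil(n_per_arm / locations_per_arm))
```

The noisier your history is relative to the lift you care about, the longer the test; adding locations per arm shortens it.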


6. You get a clear recommendation

Scale — Strong evidence it works. Roll it out.

Kill — Strong evidence it hurt. Stop it.

Continue — Not enough data yet. Keep running.

Each recommendation includes the estimated effect size in dollars, the probability you hit your target, and any data quality issues to watch.
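A stylized version of that decision rule, with example thresholds rather than ProofPod's actual cutoffs:

```python
# Illustrative decision rule. The 0.95 threshold is an example, not
# ProofPod's actual cutoff.
def recommend(p_beat_target, p_hurt, threshold=0.95):
    if p_beat_target >= threshold:
        return "Scale"      # strong evidence the change beat your target
    if p_hurt >= threshold:
        return "Kill"       # strong evidence the change is losing money
    return "Continue"       # not enough data yet to call it either way
```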


Why this approach?

Traditional A/B testing was built for websites with millions of visitors. You have 10–300 locations, and you cannot randomize at the individual level. That requires different methods—ones developed by economists and data scientists for exactly this problem.

We also use Bayesian statistics rather than the frequentist approach you'll find in most testing tools. The practical difference: we can tell you to stop early. If your experiment is clearly winning after two weeks, you don't have to wait six weeks to "reach significance." If it's clearly losing, you can kill it and stop the bleeding. Either way, that's money back in your pocket.
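In practice, that early-stopping logic can be as simple as re-running the analysis on a schedule and acting as soon as the evidence is decisive. A sketch, reusing the hypothetical helpers from the earlier sections:

```python
# Illustrative weekly check. Reuses the hypothetical decision_probabilities()
# and recommend() sketches above; not ProofPod's actual loop.
def monitor(weekly_estimates, target_lift):
    for week, (effect, se) in enumerate(weekly_estimates, start=1):
        _, p_hurt, p_beat = decision_probabilities(effect, se, target_lift)
        call = recommend(p_beat, p_hurt)
        if call != "Continue":
            return week, call          # stop early: the answer is already clear
    return len(weekly_estimates), "Continue"
```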

ProofPod brings this rigor to operators who don't have a data science team. You get confident answers without the complexity.