When You Have 750 Million Events a Day, You'd Better Split Test Correctly

This week Unbounce hosted the event “Split Testing: Everything You Wanted to Know but Were Afraid to Ask (Probably)” at their new office space in Vancouver.

Over 100 people attended the event to hear Thomas Levi, Senior Data Scientist at the freshly acquired PlentyOfFish, speak on split test methodologies. Levi has a doctorate in Theoretical Physics and String Theory from the University of Pennsylvania. His post-doctoral studies were in cosmology and string theory, where he wrote 19 papers, earning hundreds of citations.

When split testing, it is incredibly easy to fool oneself, through faulty assumptions and poorly designed tests, into thinking results are legitimate when they are in fact not. For instance, a 3% conversion increase in one scenario may seem positive, but if the team running the test cannot be certain that doing nothing at all over the same period would not have produced a 5% increase, they cannot really be confident in their results.

What appears to be a 3% gain may in fact be a 2% loss. If possibilities like this are not accounted for over hundreds or thousands of tests, the result for an online business in a highly competitive space could be disastrous.  
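One standard way to guard against mistaking noise for a real lift is a significance test on the observed conversion rates. As a hedged illustration (the numbers below are hypothetical, not from the talk), a two-proportion z-test shows that a 3% relative lift on 10,000 visitors per variant falls well inside ordinary sampling noise:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is the observed lift distinguishable from chance?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se                            # z statistic

# Hypothetical numbers: control converts at 10.0%, variant at 10.3%
# (a "3% relative lift"), with 10,000 visitors in each group.
z = two_proportion_z(conv_a=1000, n_a=10000, conv_b=1030, n_b=10000)
print(abs(z) < 1.96)  # True: below the 95% confidence threshold, so not significant
```

At this sample size the apparent lift could easily be noise, which is exactly the trap the talk warns about.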

This talk examined ways to avoid these problems by providing a general overview of four testing methodologies: null hypothesis testing, sequential probability ratio testing, multi-armed bandits, and Bayesian sequential test design. 

All four methodologies have different strengths and weaknesses that have to be weighed when determining what kind of test to employ. One test may be simple to perform, but the tester must wait for the test to run its full duration before implementing positive changes, resulting in an opportunity cost. A different test may dynamically apportion traffic while the test runs, allowing the company to reap gains sooner, but can be sensitive to short-term seasonality.
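The dynamic-apportioning approach described above is the multi-armed bandit. A minimal sketch, assuming an epsilon-greedy policy and made-up conversion rates (the talk did not specify an algorithm or numbers), shows how traffic shifts toward the better-performing variant as evidence accumulates:

```python
import random

def epsilon_greedy(true_rates, rounds=10000, epsilon=0.1, seed=42):
    """Epsilon-greedy bandit: usually exploit the best-looking variant,
    but explore a random one with probability epsilon."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)   # times each variant was shown
    wins = [0] * len(true_rates)     # conversions observed per variant
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(len(true_rates))            # explore
        else:
            arm = max(range(len(true_rates)),
                      key=lambda i: wins[i] / counts[i])    # exploit
        counts[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]         # simulate a visit
    return counts

# Hypothetical rates: variant B (5%) truly beats variant A (3%).
counts = epsilon_greedy([0.03, 0.05])
print(counts)
```

Because the policy reallocates traffic mid-test, gains arrive sooner, but the same mechanism makes it vulnerable to short-term seasonality: a variant that looks strong on a weekend can hoard traffic before weekday behavior corrects the estimate.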

“There are no hard and fast rules here. It’s all about trying to bound your uncertainty as much as possible, and get the generalization once you actually make the full change to be as true as possible,” said Levi.

At POF, where the site has approximately 750 million events each day, being disciplined, objective, and cognizant of each test methodology's strengths and shortcomings is critical in order to get at what is most important: the truth about users' behavior. Deviating from this can lead to costly losses of time and revenue at enormous scale.

“I’m very skeptical of anyone that says, ‘This is the way you should split test,’ or ‘This is the best method’, because it very heavily depends on your case scenario,” said Levi.  “There’s no free lunch here.  Everything has pros and cons, and for whatever the case is, you have to pick a pro and con.”

Levi will follow this introductory talk with in-depth presentations on each testing method.  For details, you can follow Levi on Twitter at @tslevi, follow Unbounce Community Development Manager Cheryl Draper at @cheryldraper, or check upcoming Data Science meetups.