Author: WeiMin, Jason Wang
Online controlled A/B testing is a common practice at companies such as Microsoft, Amazon, Google and Yahoo for evaluating the effectiveness of feature improvements. The same approach is widely used at eBay in Search Science, Merchandising, Shipping and other domains to infer the causal relationship between algorithm changes and financial gain. As the name implies, users are split into two groups of equal size: one group is assigned to version A, usually the existing algorithm (the control group), and the other is exposed to version B, the new algorithm (the treatment group), while all other variables are held identical. A feature is launched if the new algorithm significantly increases the mean of Gross Merchandise Bought (GMB).
However, in the eBay marketplace a small number of users shop for high-end products, while a very large number of users purchase low-priced products or make no purchase at all. This long-tailed distribution of GMB increases the noise when detecting the treatment effect of an A/B test. To improve test sensitivity, variance reduction techniques such as capped mean and Toso-Tail had previously been applied to the estimation of mean GMB. This paper introduces another variance reduction technique, called post-stratification, to further improve test sensitivity.
Post-stratification is inspired by stratification in sampling theory. Stratified sampling outperforms simple random sampling when units from the same stratum are similar to each other with respect to the measurement of interest. Because users arrive over time in live traffic experimentation, it is impossible to sample users from pre-formed strata, but the sensitivity of hypothesis testing still benefits from stratification after data collection. This is called post-stratification adjustment. To adjust the mean of GMB during the experimentation period, users' pre-experimentation GMB is collected and used to bucket users. The underlying assumption is that users' purchase behavior is predictable from their historical behavior: for example, a frequent high-end purchaser before the experiment period is also likely to be a heavy buyer during it. In the implementation, to further increase the variance reduction, GMB lift is decomposed into the combination of GMB-per-purchaser lift and the lift in the percentage of users who purchase, so that purchasers can be modeled separately. The reason is that it is difficult to track non-purchasers if they do not sign in.
The post-stratified GMB metrics were rolled out on EP, the central experimentation platform, in November 2014. Since then, more experiments have gone from insignificant to significant and more new algorithms have become launchable. In sum, the post-stratification adjusted metrics are a valuable improvement that saves experimentation resources, speeds up testing and supports launching more profitable new features on eBay.
In this section, we show theoretically that the variance is reduced by post-stratification adjustment. Let $Y$ denote the target, GMB in our case, $n$ the sample size, and $X$ an auxiliary variable that is known. The subscripts $t$ and $c$ denote the treatment group and control group respectively, and $\tau$ denotes the treatment effect. Before the rollout of the post-stratified metric, a t-test based on the sample means was used to test the significance of the treatment effect. We call this estimator the Sample Average Treatment Effect (SATE):

$$\hat{\tau}_{SATE} = \bar{Y}_t - \bar{Y}_c$$
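The t-test on the difference in sample means described above can be sketched in a few lines. This is a minimal illustration, assuming an unequal-variance (Welch) standard error; the function name `sate_t_stat` and the per-user GMB values are hypothetical and not taken from the eBay implementation.

```python
import math
from statistics import mean, variance


def sate_t_stat(treatment, control):
    """Return (difference in means, standard error, t statistic).

    The standard error uses the unequal-variance (Welch) form, which is an
    assumption of this sketch rather than something stated in the article.
    """
    diff = mean(treatment) - mean(control)
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    return diff, se, diff / se


# Hypothetical per-user GMB values, for illustration only.
control = [0.0, 0.0, 12.5, 0.0, 40.0, 0.0, 7.0, 0.0]
treatment = [0.0, 15.0, 0.0, 0.0, 55.0, 9.0, 0.0, 3.0]
diff, se, t = sate_t_stat(treatment, control)
print(f"diff={diff:.3f} se={se:.3f} t={t:.3f}")
```

With long-tailed data such as GMB, the standard error in the denominator is large, which is why the rest of the article focuses on shrinking it.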
· Post Stratification Adjustment
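Let the users be divided into $K$ strata, where stratum $k$ has weight $w_k$, mean $\mu_k$ and variance $\sigma_k^2$, and let $\mu$ be the overall mean. By the law of total variance, the variance of the sample mean can be written as

$$\mathrm{Var}(\bar{Y}) = \sum_{k=1}^{K} \frac{w_k}{n}\sigma_k^2 + \sum_{k=1}^{K} \frac{w_k}{n}(\mu_k - \mu)^2$$

From this formula, we can see that the variance of the sample mean splits into a within-strata term and a between-strata term, and the between-strata term is removed by stratification. The more homogeneous the target is within strata and the more heterogeneous it is between strata, the greater the variance reduction achieved by stratification.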
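As a rough sketch of how a post-stratified treatment effect can be computed, the snippet below weights the per-stratum difference in means by the share of users in each stratum. The function name, the record layout and the stratum labels are assumptions made for illustration, and the sketch assumes every stratum contains users from both groups.

```python
from collections import defaultdict
from statistics import mean


def post_stratified_effect(records):
    """records: iterable of (group, stratum, y), with group in {"t", "c"}.

    Returns sum_k w_k * (mean_t_k - mean_c_k), where w_k is the share of all
    users (treatment and control together) that fall in stratum k.
    """
    by_cell = defaultdict(list)
    stratum_size = defaultdict(int)
    for group, stratum, y in records:
        by_cell[(group, stratum)].append(y)
        stratum_size[stratum] += 1
    n = sum(stratum_size.values())
    effect = 0.0
    for stratum, size in stratum_size.items():
        # Assumption: each stratum has at least one treatment and one control user.
        diff = mean(by_cell[("t", stratum)]) - mean(by_cell[("c", stratum)])
        effect += (size / n) * diff
    return effect


# Hypothetical data: strata formed from pre-experiment GMB ("low" / "high").
records = [
    ("t", "low", 1.0), ("t", "low", 3.0), ("c", "low", 1.0), ("c", "low", 1.0),
    ("t", "high", 100.0), ("c", "high", 90.0),
]
print(post_stratified_effect(records))
```

Because each difference is taken within a stratum, a chance excess of high spenders in one group no longer leaks into the estimate, which is the between-strata term being removed.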
· Regression Adjustment
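Regression adjustment uses $X$ as a control variate: the adjusted mean is $\bar{Y}_{adj} = \bar{Y} - \theta(\bar{X} - E[X])$, and with the optimal $\theta$ its variance is

$$\mathrm{Var}(\bar{Y}_{adj}) = \mathrm{Var}(\bar{Y})(1 - \rho^2)$$

where $\rho$ is the correlation between $X$ and $Y$ for each treatment/control group. The higher the correlation between $X$ and $Y$, the greater the variance reduction. One point to note: although the unadjusted estimator is unbiased, the regression-adjusted estimator has a small bias, which is of order $1/n$. When the sample size is large, the adjusted estimator is empirically unbiased. In Deng's paper [1], $\theta$ is estimated by pooling the treatment and control groups together; this is also the coefficient of $X$ when fitting a regression of $Y$ on $X$ using all units in the two groups, that is, $\theta = \mathrm{cov}(X, Y) / \mathrm{var}(X)$.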
In Lin's simulation study [3], the results from using a pooled $\theta$ and from using a different $\theta$ for treatment and control do not differ significantly, but when treatment and control have unbalanced sample sizes, using a different $\theta$ for each group is more accurate.
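The pooled-coefficient regression adjustment discussed above can be illustrated as follows. This is a sketch under stated assumptions: the coefficient is the simple regression slope of the outcome on the pre-experiment covariate over all units, and the function names and sample values are hypothetical.

```python
from statistics import mean


def pooled_theta(x, y):
    """Slope of Y on X, cov(X, Y) / var(X), over all units from both groups."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


def adjusted_effect(x_t, y_t, x_c, y_c):
    """Difference in means of Y, corrected for the chance imbalance in X."""
    theta = pooled_theta(list(x_t) + list(x_c), list(y_t) + list(y_c))
    return (mean(y_t) - mean(y_c)) - theta * (mean(x_t) - mean(x_c))


# Hypothetical data: in-expt GMB is about 2 * pre-expt GMB, plus 1 in treatment.
x_t, y_t = [1.0, 2.0, 3.0, 6.0], [3.0, 5.0, 7.0, 13.0]
x_c, y_c = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
print("unadjusted:", mean(y_t) - mean(y_c))
print("adjusted:  ", adjusted_effect(x_t, y_t, x_c, y_c))
```

In this toy data the treatment group happens to contain a heavier pre-experiment buyer, so the unadjusted difference overstates the effect while the adjusted one lands close to the built-in effect of 1. Using a separate coefficient per group, as Lin suggests for unbalanced samples, would mean calling `pooled_theta` on each group on its own.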
Covariate Selection
To apply post-stratification effectively, the choice of variables used to form the strata is critical. According to both Deng's paper [1] and Miratrix's paper [2], the higher the correlation between the covariates and the variable of interest, the greater the variance reduction. The covariates should also be independent of the treatment to avoid bias. For example, users' in-experiment purchases are highly correlated with in-experiment clicks, but in-experiment clicks cannot be used to group users, because they are themselves affected by the treatment. In practice, there are various covariates that are correlated with GMB but independent of treatment allocation, such as geographic and demographic information, and user segmentation and preferences derived from historical purchases. Based on our experiments, users' in-experiment (in-expt) purchases are most strongly correlated with their pre-experiment (pre-expt) purchases, which is in line with Deng's conclusion from Microsoft A/B testing trials. Strata can also be formed from more than one covariate; multiple covariates, or a combination of covariates, work better than a single covariate.
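One simple way to compare candidate covariates along these lines is to rank them by the absolute value of their correlation with in-expt GMB. The sketch below does this with a hand-written Pearson correlation; the covariate names and values are hypothetical, and only pre-experiment covariates should be passed in, for the independence reason given above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rank_covariates(covariates, in_expt_gmb):
    """covariates: dict of name -> per-user values (pre-experiment only).

    Returns (name, correlation) pairs, strongest absolute correlation first.
    """
    scores = {name: pearson(vals, in_expt_gmb) for name, vals in covariates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical per-user values.
in_expt_gmb = [0.0, 10.0, 25.0, 80.0, 5.0]
candidates = {
    "pre_expt_gmb": [0.0, 12.0, 20.0, 95.0, 0.0],
    "pre_expt_active_days": [1.0, 9.0, 4.0, 12.0, 6.0],
}
for name, rho in rank_covariates(candidates, in_expt_gmb):
    print(name, round(rho, 3))
```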
2-lift model
Based on an empirical study by David Goldberg [4], GMB lift can be decomposed into the sum of two terms: the lift in participation rate (the fraction of GUIDs who make a purchase) and the lift in GMB per purchaser. Effective stratification requires modeling, and we can model more effectively by developing a separate model for each term, thus improving variance reduction.
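The decomposition can be checked numerically. Since GMB per user equals participation rate times GMB per purchaser, the relative lifts combine as $(1 + L) = (1 + L_p)(1 + L_g)$, so $L = L_p + L_g + L_p L_g$, which is approximately the sum of the two lifts when both are small. The sketch below uses hypothetical totals and hypothetical function names; it illustrates the arithmetic only and is not the model from the internal report [4].

```python
def lift(treatment_value, control_value):
    """Relative lift of treatment over control."""
    return treatment_value / control_value - 1.0


def two_lift(users_t, purchasers_t, gmb_t, users_c, purchasers_c, gmb_c):
    """Return (participation-rate lift, GMB-per-purchaser lift, total GMB-per-user lift)."""
    rate_lift = lift(purchasers_t / users_t, purchasers_c / users_c)
    per_purchaser_lift = lift(gmb_t / purchasers_t, gmb_c / purchasers_c)
    total_lift = lift(gmb_t / users_t, gmb_c / users_c)
    return rate_lift, per_purchaser_lift, total_lift


# Hypothetical totals for one experiment.
rate, per_purchaser, total = two_lift(
    users_t=1000, purchasers_t=110, gmb_t=5720.0,
    users_c=1000, purchasers_c=100, gmb_c=5000.0,
)
print(f"rate lift={rate:.4f}, per-purchaser lift={per_purchaser:.4f}")
print(f"total lift={total:.4f}, sum of the two={rate + per_purchaser:.4f}")
```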
Post Stratification Implementation
Stratification works best in combination with some control of outliers. In other words, when post-stratification adjustment is applied to outlier-processed data, the variance reduction is much larger than when it is applied to the raw data directly. A simple solution is to cap outliers at the 99.9th percentile of GMB among purchasers. The participation rate and GMB per purchaser are then modeled separately.
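The capping step might look like the sketch below. It assumes a nearest-rank percentile computed over purchasers only (users with positive GMB) and then applied to every user; the article does not say which percentile definition was used, and the function name `cap_gmb` is hypothetical.

```python
import math


def cap_gmb(gmb, pct=99.9):
    """Cap per-user GMB at the pct-th percentile of GMB among purchasers.

    Uses the nearest-rank percentile definition (an assumption of this sketch).
    Returns (capped values, cap); cap is None when there are no purchasers.
    """
    purchasers = sorted(v for v in gmb if v > 0)
    if not purchasers:
        return list(gmb), None
    # round() guards against floating-point fuzz before taking the ceiling.
    rank = max(1, math.ceil(round(pct / 100.0 * len(purchasers), 9)))
    cap = purchasers[rank - 1]
    return [min(v, cap) for v in gmb], cap


# Hypothetical data: five non-purchasers and purchasers with GMB 1..1000.
gmb = [0.0] * 5 + [float(i) for i in range(1, 1001)]
capped, cap = cap_gmb(gmb)
print("cap:", cap, "max after capping:", max(capped))
```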
Participation rate: The binary indicator of whether a user purchased is usually highly correlated with how frequently the user was active on eBay before the experimentation period, in addition to the amount purchased. Therefore, pre-expt active days and pre-expt GMB are combined to form the strata for the post-stratification adjustment of the treatment effect on the fraction of purchasers.

GMB per purchaser: The purchase amount is usually correlated with the user's historical purchase amount. For purchasers, pre-expt GMB is used to adjust in-expt GMB by regression. Specifically, the overall treatment effect for average GMB per participant is estimated from the regression-adjusted values, where $f$ is the fitted regression model.
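To make the strata for the participation term concrete, the sketch below combines pre-expt active days and pre-expt GMB into a two-part stratum key and computes the purchase rate per stratum. The bucket edges are hypothetical (the article does not give the cut points that were used), as are the function names.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bucket edges; the source does not state the actual cut points.
ACTIVE_DAY_EDGES = [1, 5, 15]        # buckets: 0 days, 1-4, 5-14, 15+
PRE_GMB_EDGES = [0.01, 50.0, 500.0]  # buckets: none, low, mid, high


def stratum_key(pre_active_days, pre_gmb):
    """Map a user's pre-experiment activity and GMB to a (day bucket, GMB bucket) pair."""
    return (bisect_right(ACTIVE_DAY_EDGES, pre_active_days),
            bisect_right(PRE_GMB_EDGES, pre_gmb))


def purchase_rate_by_stratum(users):
    """users: iterable of (pre_active_days, pre_gmb, purchased), purchased in {0, 1}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [purchasers, users]
    for days, gmb, purchased in users:
        cell = counts[stratum_key(days, gmb)]
        cell[0] += purchased
        cell[1] += 1
    return {key: p / n for key, (p, n) in counts.items()}


# Hypothetical users.
users = [(0, 0.0, 0), (0, 0.0, 1), (7, 120.0, 1), (20, 800.0, 1), (20, 900.0, 0)]
print(purchase_rate_by_stratum(users))
```

The per-stratum rates for treatment and control can then be combined with stratum weights in the same way as any other post-stratified mean.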
Impact on Decision Making
The post-stratified metrics were rolled out on EP, the central experimentation platform, in November 2014. As of March 2015, the standard error of the GMB lift estimate had been reduced by ~5% for 96% of a total of 228 experiments. With the increased test sensitivity, the GMB lifts of 15 more experiments went from insignificant to significant, which implies that 15 more new algorithms are launchable. In sum, the post-stratification adjusted metrics are a valuable improvement that saves experimentation resources, speeds up testing and supports launching more profitable new features on eBay.
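A small numeric illustration shows why a roughly 5% smaller standard error can flip a borderline result. The lift and standard error below are hypothetical, and the two-sided 5% normal threshold of 1.96 is an assumption of this sketch rather than the decision rule used on EP.

```python
def is_significant(lift_estimate, standard_error, z_crit=1.96):
    """Two-sided test of lift != 0 using a normal approximation (assumed threshold)."""
    return abs(lift_estimate / standard_error) > z_crit


# Hypothetical borderline experiment: same lift, standard error reduced by 5%.
lift_estimate = 1.0
se_before = 0.52
se_after = se_before * 0.95

print("before adjustment:", is_significant(lift_estimate, se_before))
print("after adjustment: ", is_significant(lift_estimate, se_after))
```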
Since the launch of post-stratification adjusted GMB, we have found a lift delta between the unadjusted GMB lift (SATE) and the adjusted GMB lift in a few experiments. For example, Figure 2 compares the distributions of the GMB lift estimators using SATE (red density) and post-stratification (green density). The variance of the adjusted GMB lift is smaller than that of the unadjusted GMB lift (the green density is narrower and taller than the red density), as expected. However, a lift delta does exist in experiments 4254 and 4406 (the red dashed line does not overlap the green dashed line). What might be the reason?
Figure 1: Variance Reduction - density comparison.
Post-stratification adjustment is valid when the covariate $X$ is independent of the treatment. Pre-experimentation features are selected because the treatment has not yet been introduced in the pre-experiment period, so the expected value of $X$ is the same in the treatment and control groups, that is, $E[X_t] = E[X_c]$. If $E[X_t] \neq E[X_c]$, then the adjusted metrics will be biased.
In Figure 3, the distributions of the bootstrapped mean of pre-expt GMB for the test and control groups are compared. The red and green dashed lines are estimates of the expected value of pre-expt GMB for the treatment and control groups respectively. It is clear that for the two tests in experiment 4303, which show no lift delta, the expectations of pre-expt GMB are equal for test and control. For experiments 4254 and 4406, which do show a lift delta, the bootstrapped means differ between the treatment and control groups.
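A balance check of this kind can be sketched with a plain bootstrap of the difference in pre-expt GMB means between the two groups. The function names, the number of resamples and the sample values are hypothetical; an interval that excludes zero suggests the pre-experiment covariate is not balanced, in which case the adjusted lift may be biased.

```python
import random
from statistics import mean


def bootstrap_mean_diff(pre_t, pre_c, n_boot=1000, seed=0):
    """Bootstrap the difference in mean pre-expt GMB (treatment minus control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample_t = rng.choices(pre_t, k=len(pre_t))  # resample with replacement
        sample_c = rng.choices(pre_c, k=len(pre_c))
        diffs.append(mean(sample_t) - mean(sample_c))
    return diffs


def percentile_interval(samples, lower=0.025, upper=0.975):
    """Simple percentile interval from sorted bootstrap samples."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Hypothetical pre-expt GMB values for the two groups.
pre_t = [0.0, 5.0, 12.0, 0.0, 300.0, 8.0, 0.0, 40.0]
pre_c = [0.0, 6.0, 10.0, 0.0, 20.0, 9.0, 0.0, 35.0]
low, high = percentile_interval(bootstrap_mean_diff(pre_t, pre_c))
print(f"95% bootstrap interval for the pre-expt mean difference: [{low:.2f}, {high:.2f}]")
```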
As mentioned above, when regression is used to adjust the treatment effect, the fitted model $f$ could be extended to a more sophisticated model, such as multiple-variable linear regression, random forests or gradient boosted trees. Our expectation is that, just as using a single covariate in a linear model is better than using it to construct strata, using multiple covariates will perform better than a single covariate, and a non-linear model will perform even better. A more sophisticated model that predicts in-expt GMB from pre-expt features is therefore a potential improvement to variance reduction by post-stratification adjustment.
[1] Alex Deng, Ya Xu and Ron Kohavi. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. To appear in WSDM 2013.
[2] Luke W. Miratrix, Jasjeet S. Sekhon and Bin Yu. Adjusting Treatment Effect Estimates by Post-Stratification in Randomized Experiments. Journal of the Royal Statistical Society, Series B. August 10, 2012.
[3] Winston Lin. Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman's Critique. July 26, 2012.
[4] David Goldberg. A two-model Transfer Function. eBay Internal Technical Report.
[5] David Goldberg. Stratification and the Deng et al. paper. eBay Internal Technical Report.
A/B Test Sensitivity Improvement by Using Post-Stratification
Original article: http://blog.csdn.net/ebay/article/details/46548665