Rare Event Data
6. July 2007 10:35
Solution:
More efficient sampling designs exist that still allow valid inference.
For example: sample all available events and only a tiny fraction of nonevents.
This can save as much as 99% of data collection costs and/or make it possible to collect much more meaningful (expensive) feature variables.
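The design above (keep every event, keep a small random fraction of nonevents) can be sketched in a few lines. This is a minimal Python illustration, not SAS; the function name `case_control_sample`, the 1% event rate, and the 5% nonevent fraction are all assumptions made for the example.

```python
import random

def case_control_sample(rows, nonevent_fraction, seed=0):
    """Keep every event (y == 1) and a random fraction of nonevents (y == 0)."""
    rng = random.Random(seed)
    return [
        (x, y) for x, y in rows
        if y == 1 or rng.random() < nonevent_fraction
    ]

# Hypothetical population: 10,000 rows with roughly a 1% event rate.
pop_rng = random.Random(1)
population = [
    (pop_rng.gauss(0, 1), 1 if pop_rng.random() < 0.01 else 0)
    for _ in range(10_000)
]

sample = case_control_sample(population, nonevent_fraction=0.05)

n_events_population = sum(y for _, y in population)
n_events_sample = sum(y for _, y in sample)
print(len(population), len(sample), n_events_population, n_events_sample)
```

Every event survives the sampling, while the number of rows that need expensive feature collection shrinks to a small share of the population.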
Sampling:
Examples (x, y, s)
This kind of sampling is also called oversampling, retrospective sampling, biased sampling, or choice-based sampling.
The oversampling method, widely used in signal detection theory, resamples the small class at random until it contains as many examples as the other class.
The downsizing (undersampling) method randomly removes samples from the majority class until the minority class reaches some specified percentage of the majority class.
This produced two different datasets for each time step: one with a churner/nonchurner ratio 1/1 and the other with a ratio 2/3.
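The two resampling schemes can be sketched as follows. This is a Python illustration under stated assumptions: the helper names, the class sizes (100 churners, 2000 nonchurners), and the reading of "ratio" as minority size divided by majority size are hypothetical choices, not taken from the original study.

```python
import random

def oversample_minority(minority, majority, seed=0):
    """Resample the minority class at random (with replacement) until it
    has as many rows as the majority class."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def undersample_majority(minority, majority, ratio, seed=0):
    """Randomly drop majority rows until minority/majority is about `ratio`."""
    rng = random.Random(seed)
    keep = min(len(majority), round(len(minority) / ratio))
    return minority, rng.sample(majority, keep)

churners = list(range(100))            # hypothetical minority class
nonchurners = list(range(100, 2100))   # hypothetical majority class

_, kept_1_1 = undersample_majority(churners, nonchurners, ratio=1.0)
_, kept_2_3 = undersample_majority(churners, nonchurners, ratio=2 / 3)
boosted, _ = oversample_minority(churners, nonchurners)
```

With these sizes, a 1/1 ratio keeps 100 nonchurners and a 2/3 ratio keeps 150, while oversampling grows the churner class to 2000 rows.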
In the biological sciences, studies using this kind of sampling are known as case-control studies.
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of stratified sampling. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect, such as the predicted event probabilities and differences or ratios of event probabilities.
If you know the probabilities of events and nonevents in the population, then you can adjust the intercept either by weighting or by using an offset.
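A quick numerical check shows why only the intercept moves. If events are retained with probability s1 and nonevents with probability s0, Bayes' rule gives the event probability among retained rows, and its logit differs from the population logit by the constant log(s1/s0) at every value of x. The Python sketch below assumes the true parameters from the SAS example (-3.35 and 2) and a 1/9 retention rate for nonevents; the function names are illustrative.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def logistic(z):
    return 1 / (1 + math.exp(-z))

def sampled_event_prob(p, s1, s0):
    """Event probability among retained rows when events are kept with
    probability s1 and nonevents with probability s0 (Bayes' rule)."""
    return s1 * p / (s1 * p + s0 * (1 - p))

b0, b1 = -3.35, 2.0     # true intercept and slope
s1, s0 = 1.0, 1 / 9     # keep all events, about one ninth of nonevents

shifts = []
for x in (-1.0, 0.0, 1.5):
    p = logistic(b0 + b1 * x)
    shifts.append(logit(sampled_event_prob(p, s1, s0)) - logit(p))

print(shifts, math.log(s1 / s0))
```

The shift is the same at every x, so the slope is untouched and the whole effect of the sampling is absorbed by the intercept.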
Adjusting the Intercept
To adjust by weighting, add a variable to your data set that takes the value p1/r1 in event observations, and the value (1-p1)/(1-r1) in nonevent observations, where p1 is the probability of an event in the population and r1 is the proportion of events in your data set. Specify this variable in the WEIGHT statement in PROC LOGISTIC.
Or, to adjust by using an offset, add a variable to your data set defined as log[(r1*(1-p1)) / ((1-r1)*p1)], where log represents the natural logarithm. Specify this variable in the OFFSET= option of the MODEL statement in PROC LOGISTIC.
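For concreteness, the weight and offset formulas can be evaluated for a made-up case. This Python sketch only reproduces the arithmetic. The values p1 = 0.10 and r1 = 0.50, the counts, and the function name `intercept_adjustments` are hypothetical.

```python
import math

def intercept_adjustments(p1, r1):
    """Return (event weight, nonevent weight, offset) for population event
    rate p1 and sample event proportion r1."""
    w_event = p1 / r1
    w_nonevent = (1 - p1) / (1 - r1)
    offset = math.log((r1 * (1 - p1)) / ((1 - r1) * p1))
    return w_event, w_nonevent, offset

p1, r1 = 0.10, 0.50     # hypothetical: 10% events in population, 50% in sample
w_event, w_nonevent, offset = intercept_adjustments(p1, r1)

# Sanity check: after weighting, the sample's event share matches p1.
n_events, n_nonevents = 500, 500
weighted_rate = (w_event * n_events) / (
    w_event * n_events + w_nonevent * n_nonevents
)
print(w_event, w_nonevent, offset, weighted_rate)
```

The weights restore the population event rate in the weighted sample, which is what lets the weighted fit recover the population intercept.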
Example:
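/* Simulate a full population of 1000 observations from a known logistic model */
data full;
  do i=1 to 1000;
    x=rannor(12342);
    p=1/(1+exp(-(-3.35+2*x)));   /* true intercept -3.35, true slope 2 */
    y=ranbin(98435,1,p);
    drop i;
    output;
  end;
run;
/* Keep every event and roughly one ninth of the nonevents */
data sub;
  set full;
  if y=1 or (y=0 and ranuni(75302)<1/9) then output;
run;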
/* Event percentage in the full data set (the population value p1) */
proc freq data=full;
  table y / out=fullpct(where=(y=1) rename=(percent=fullpct));
  title "Response counts in full data set";
run;
/* Event percentage in the oversampled subset (the sample value r1) */
proc freq data=sub;
  table y / out=subpct(where=(y=1) rename=(percent=subpct));
  title "Response counts in oversampled, subset data set";
run;
/* Add the weight (w) and offset (off) variables to the subset */
data sub;
  set sub;
  if _n_=1 then set fullpct(keep=fullpct);
  if _n_=1 then set subpct(keep=subpct);
  p1=fullpct/100; r1=subpct/100;
  w=p1/r1; if y=0 then w=(1-p1)/(1-r1);
  off=log( (r1*(1-p1)) / ((1-r1)*p1) );
run;
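/* Show only the parameter estimates table for the procedures that follow */
ods select parameterestimates(persist);
/* Unadjusted fit: slope is fine, intercept is biased by the sampling */
proc logistic data=sub;
  model y(event="1")=x;
  output out=out p=pnowt;
  title "True Parameters: -3.35 (intercept), 2 (X)";
  title2 "Unadjusted Model";
run;
/* Adjustment 1: weight each observation by w */
proc logistic data=out;
  model y(event="1")=x;
  weight w;
  output out=out p=pwt;
  title2 "Weight-adjusted Model";
run;
/* Adjustment 2: include off as an offset; save the linear predictor */
proc logistic data=out;
  model y(event="1")=x / offset=off;
  output out=out xbeta=xboff;
  title2 "Offset-adjusted Model";
run;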
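/* Remove the offset from the linear predictor to get population-scale probabilities */
data out;
  set out;
  poff=logistic(xboff-off);
run;
/* Adjustment 3: score the unadjusted model using the population event counts as priors */
proc freq data=full noprint;
  table y / out=priors(drop=percent rename=(count=_prior_));
run;
proc logistic data=out;
  model y(event="1")=x;
  score data=sub prior=priors out=out2;
  title2 "Unadjusted Model; Prior-adjusted probabilities";
run;
/* Restore the default ODS selection that PERSIST kept in force */
ods select all;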
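The prior adjustment in the last step can be illustrated outside SAS with the usual Bayes-rule rescaling of a predicted probability. The Python sketch below assumes that rescaling form; whether it matches what PROC LOGISTIC computes internally for PRIOR= is an assumption, and the values of p1, r1, and xb are hypothetical. It also shows that the rescaling agrees with subtracting the offset from the linear predictor.

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

def prior_adjust(p_sample, p1, r1):
    """Rescale a probability from a model fit to the subsample (event
    proportion r1) to a population with event rate p1."""
    num = p_sample * p1 / r1
    den = num + (1 - p_sample) * (1 - p1) / (1 - r1)
    return num / den

p1, r1 = 0.10, 0.50     # hypothetical population and sample event rates
off = math.log((r1 * (1 - p1)) / ((1 - r1) * p1))

xb = 0.7                # hypothetical linear predictor from the subsample fit
p_prior = prior_adjust(logistic(xb), p1, r1)
p_offset = logistic(xb - off)
print(p_prior, p_offset)
```

Both routes give the same population-scale probability, because multiplying the odds by (p1/r1) / ((1-p1)/(1-r1)) is the same as subtracting the offset on the logit scale.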
