Rare Event Data

by Irina 6. July 2007 10:35
The literature shows that rare events are difficult to predict, for two reasons:
  • Popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events
  • Commonly used data collection strategies are inefficient for rare event data

    Solution:

    More efficient sampling designs exist that still allow valid inference.
    For example, sample all available events and only a tiny fraction of nonevents.
    This can save as much as 99% of data collection costs, and/or make it possible to collect much more meaningful (and expensive) feature variables.

    Sampling:

    Each example is a triple (x, y, s), where x holds the features, y is the label, and s is the selection indicator.

  • s controls the selection of examples (1 means selected, 0 means not selected)
  • We only have access to the examples with s=1
  • s is independent of x given y: P(s|x,y)=P(s|y)
  • The selected examples are therefore a biased sample
  • The bias depends only on the label y
  • This corresponds to a change in the prior probabilities of the labels
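
The claim that label-dependent selection only changes the priors can be checked numerically. The following is a minimal Python sketch, not part of the original post; the names `selected_posterior`, `keep_event`, and `keep_nonevent` are hypothetical. It applies Bayes' rule to obtain P(y=1|x,s=1) from P(y=1|x), assuming s is independent of x given y, and compares the resulting shift in log-odds with log(keep_event/keep_nonevent) for several values of P(y=1|x).

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def selected_posterior(p, keep_event, keep_nonevent):
    """P(y=1 | x, s=1) via Bayes' rule, given p = P(y=1 | x).

    keep_event    = P(s=1 | y=1)  (hypothetical name)
    keep_nonevent = P(s=1 | y=0)  (hypothetical name)
    Relies on s being independent of x given y.
    """
    num = p * keep_event
    return num / (num + (1.0 - p) * keep_nonevent)

# Assumed selection rates: keep every event and one ninth of the nonevents.
keep_event, keep_nonevent = 1.0, 1.0 / 9.0

# Shift in log-odds caused by selection, for several population probabilities.
shifts = [
    logit(selected_posterior(p, keep_event, keep_nonevent)) - logit(p)
    for p in (0.01, 0.05, 0.2, 0.5)
]
expected_shift = math.log(keep_event / keep_nonevent)
for shift in shifts:
    print(round(shift, 6), round(expected_shift, 6))
```

Because the shift does not depend on x, only the intercept of a logistic model fitted to the selected examples is affected, not the slopes.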

  • This kind of sampling is also called oversampling, retrospective sampling, biased sampling, or choice-based sampling.
    The oversampling method has been widely used in signal detection theory; it consists of resampling the small class at random until it contains as many examples as the other class.
    The downsizing (undersampling) method consists of randomly removing samples from the majority class until the minority class reaches a specified percentage of the majority class.
    In one churn-prediction application, this produced two different datasets for each time step: one with a churner/nonchurner ratio of 1/1 and the other with a ratio of 2/3.
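
The downsizing step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used in any particular study: the function name `undersample`, its parameters, and the toy data are all made up for the example. It keeps every minority-class row and draws just enough majority-class rows at random to reach a requested minority/majority ratio.

```python
import random

def undersample(rows, label_of, ratio, seed=0):
    """Keep all minority (label 1) rows and a random subset of majority rows.

    ratio is the desired minority/majority count ratio, e.g. 1.0 for 1/1
    or 2/3 for two minority rows per three majority rows.
    """
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) == 1]
    majority = [r for r in rows if label_of(r) == 0]
    # Number of majority rows needed to hit the requested ratio.
    n_keep = min(len(majority), round(len(minority) / ratio))
    kept = rng.sample(majority, n_keep)
    sample = minority + kept
    rng.shuffle(sample)
    return sample

# Toy data: 20 events among 1000 rows (made-up numbers for illustration).
rows = [(i, 1 if i < 20 else 0) for i in range(1000)]
balanced = undersample(rows, lambda r: r[1], ratio=1.0)
two_to_three = undersample(rows, lambda r: r[1], ratio=2 / 3)
print(len(balanced), len(two_to_three))
```

Fixing the seed makes the subsample reproducible, which helps when the same reduced data set has to be rebuilt later.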

    In the biological sciences, studies using this kind of sampling are known as case-control studies. The parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of stratified sampling. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect, such as predicted event probabilities and differences or ratios of event probabilities. If you know the probabilities of events and nonevents in the population, you can adjust the intercept either by weighting or by using an offset.

    Adjusting the Intercept
    To adjust by weighting, add a variable to your data set that takes the value p1/r1 in event observations, and the value (1-p1)/(1-r1) in nonevent observations, where p1 is the probability of an event in the population and r1 is the proportion of events in your data set. Specify this variable in the WEIGHT statement in PROC LOGISTIC. Or, to adjust by using an offset, add a variable to your data set defined as log[(r1*(1-p1)) / ((1-r1)*p1)], where log represents the natural logarithm. Specify this variable in the OFFSET= option of the MODEL statement in PROC LOGISTIC.
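
To make the two formulas concrete, here is a small Python sketch that evaluates them outside SAS. The function names and the rates (5% events in the population, 30% in the sample) are assumptions chosen for illustration. The last helper mirrors the step `poff=logistic(xboff-off)` in the SAS example: it subtracts the offset from a linear predictor that includes it and maps the result back to a probability.

```python
import math

def sampling_weights(p1, r1):
    """Return (event_weight, nonevent_weight) for the weighting adjustment.

    p1: probability of an event in the population
    r1: proportion of events in the sampled data set
    """
    return p1 / r1, (1.0 - p1) / (1.0 - r1)

def sampling_offset(p1, r1):
    """Offset log[(r1*(1-p1)) / ((1-r1)*p1)] using the natural logarithm."""
    return math.log((r1 * (1.0 - p1)) / ((1.0 - r1) * p1))

def adjusted_probability(xbeta_with_offset, offset):
    """Population-scale probability from a linear predictor that includes the offset."""
    return 1.0 / (1.0 + math.exp(-(xbeta_with_offset - offset)))

# Made-up rates for illustration: 5% events in the population, 30% in the sample.
p1, r1 = 0.05, 0.30
w_event, w_nonevent = sampling_weights(p1, r1)
off = sampling_offset(p1, r1)
print(w_event, w_nonevent, off)
```

With these weights the weighted share of events in the sample equals p1, and subtracting the offset from the sample log-odds of r1 recovers the population probability p1.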

    Example:

          /* Simulate a population of 1000 observations from a logistic
             model with intercept -3.35 and slope 2 */
          data full;
            do i=1 to 1000;
              x=rannor(12342);
              p=1/(1+exp(-(-3.35+2*x)));
              y=ranbin(98435,1,p);
              output;
            end;
            drop i;
            run;

          /* Keep all events and roughly 1/9 of the nonevents */
          data sub;
            set full;
            if y=1 or (y=0 and ranuni(75302)<1/9) then output;
            run;

          /* Percentage of events in the full data set and in the subset */
          proc freq data=full;
            table y / out=fullpct(where=(y=1) rename=(percent=fullpct));
            title "Response counts in full data set";
            run;
          proc freq data=sub;
            table y / out=subpct(where=(y=1) rename=(percent=subpct));
            title "Response counts in oversampled, subset data set";
            run;

          /* Add the weight (w) and offset (off) variables to the subset */
          data sub;
            set sub;
            if _n_=1 then set fullpct(keep=fullpct);
            if _n_=1 then set subpct(keep=subpct);
            p1=fullpct/100; r1=subpct/100;
            w=p1/r1; if y=0 then w=(1-p1)/(1-r1);
            off=log( (r1*(1-p1)) / ((1-r1)*p1) );
            run;

          /* Show only the parameter estimates for the models below */
          ods select parameterestimates(persist);

          /* Model 1: no adjustment */
          proc logistic data=sub;
            model y(event="1")=x;
            output out=out p=pnowt;
            title "True Parameters: -3.35 (intercept), 2 (X)";
            title2 "Unadjusted Model";
            run;

          /* Model 2: adjustment by weighting */
          proc logistic data=out;
            model y(event="1")=x; weight w;
            output out=out p=pwt;
            title2 "Weight-adjusted Model";
            run;

          /* Model 3: adjustment by offset */
          proc logistic data=out;
            model y(event="1")=x / offset=off;
            output out=out xbeta=xboff;
            title2 "Offset-adjusted Model";
            run;

          /* Remove the offset from the linear predictor to get
             population-scale probabilities */
          data out;
            set out;
            poff=logistic(xboff-off);
            run;

          /* Alternative: score the unadjusted model with population priors */
          proc freq data=full noprint;
            table y / out=priors(drop=percent rename=(count=_prior_));
            run;
          proc logistic data=out;
            model y(event="1")=x;
            score data=sub prior=priors out=out2;
            title2 "Unadjusted Model; Prior-adjusted probabilities";
            run;

          /* Restore the default ODS output selection */
          ods select all;
    
    

    About the author

    Irina Spivak
    Team Leader at G-Stat.

