Levels of Measurement.

by Irina 17. February 2009 08:33
 Nominal Data

With nominal data, as the name implies, the numbers function as a name or label and do not have numeric meaning. For instance, you might create a variable for gender, which takes the value 1 if the person is male and 0 if the person is female.

There are two main reasons to choose numeric rather than text values to code nominal data: data is more easily processed by some computer systems as numbers, and using numbers bypasses some issues in data entry such as the conflict between upper- and lowercase letters.

  Ordinal Data

 Ordinal data refers to data that has some meaningful order, so that higher values represent more of some characteristic than lower values. For instance, in medical practice burns are commonly described by their degree, which describes the amount of tissue damage caused by the burn. A first-degree burn is characterized by redness of the skin, minor pain, and damage to the epidermis only, while a second-degree burn includes blistering and involves the dermis, and a third-degree burn is characterized by charring of the skin and possibly destroyed nerve endings. These categories may be ranked in a logical order: first-degree burns are the least serious in terms of tissue damage, third-degree burns the most serious.

However, there is no metric analogous to a ruler or scale to quantify how great the distance between categories is, nor is it possible to determine whether the difference between first- and second-degree burns is the same as the difference between second- and third-degree burns.

  Interval Data
    
Interval data has a meaningful order and also has the quality that equal intervals between measurements represent equal changes in the quantity of whatever is being measured. An example is the Fahrenheit temperature scale. Like all interval scales, it has no natural zero point: 0 on the Fahrenheit scale does not represent an absence of temperature but simply a location relative to other temperatures.

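Because there is no natural zero point, multiplication and division are not appropriate with interval data: 80 degrees Fahrenheit is not "twice as hot" as 40 degrees. Addition and subtraction remain meaningful.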
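A small sketch of why ratios are not meaningful on an interval scale. The temperatures are arbitrary illustrative values, not taken from the source: the same two temperatures give a different ratio in Fahrenheit and in Celsius, while equal intervals stay equal after conversion.

```python
import math

def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Illustrative values only
low_f, high_f = 40.0, 80.0
low_c, high_c = f_to_c(low_f), f_to_c(high_f)

# The ratio depends on where the (arbitrary) zero point sits
ratio_f = high_f / low_f
ratio_c = high_c / low_c

# Equal intervals in Fahrenheit remain equal intervals in Celsius
diff1_c = f_to_c(60.0) - f_to_c(40.0)
diff2_c = f_to_c(80.0) - f_to_c(60.0)

print("ratio in F:", ratio_f, "ratio in C:", round(ratio_c, 6))
print("intervals in C:", round(diff1_c, 6), round(diff2_c, 6))
```

The ratio changes with the unit, so a statement such as "twice as hot" carries no information, whereas a statement about differences survives the change of scale.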

  Ratio Data

 
Ratio data has all the qualities of interval data (natural order, equal intervals) plus a natural zero point. Many physical measurements are ratio data: for instance, height, weight, and age all qualify.

Continuous and Discrete Data   

Another distinction often made is that between continuous and discrete data.

Continuous data can take any value, or any value within a range. Most data measured by interval and ratio scales, other than that based on counting, is continuous: for instance, weight, height, distance, and income are all continuous.

Discrete data can only take on particular values, and has clear boundaries. As the old joke goes, you can have 2 children or 3 children, but not 2.37 children, so “number of children” is a discrete variable.

Nominal data is also discrete, as are binary and rank-ordered data.


Source: O'Reilly, Statistics in a Nutshell

Building Models for Marketing Decisions

by Irina 29. October 2007 13:34
Leeflang, Wittink, Wedel and Naert (2000) classify models according to their primary purpose or intended use.
They distinguish:
  • Descriptive models. These models intend to describe decision processes of managers or customers.
  • Predictive models.   These models forecast or predict future events or outcomes.
  • Normative models.   These models are used to obtain recommended or optimal courses of action.


Descriptive models are not restricted to decision problems. For example, one can describe the market by the structure of brand loyalty.
By predictive models we mean models for forecasting or predicting future events. For example, a firm may want to predict sales for a brand under alternative prices, advertising spending levels, and package sizes.

Indeed, one can argue that for a model to have valid normative implications, it must have predictive value and at least some descriptive power. However, a descriptive model need not have normative implications and a predictive model may not be useful for normative considerations. They also point out that it is often logical to proceed from a descriptive to a predictive and then to a normative model. In other situations, a descriptive model may be sufficient. Forecasting or prediction does not always mean answering “what if” type of questions, such as, how the demand changes if price is increased by 10 percent. In some brand choice models, the structure of brand loyalty and switching is summarized in a transition probability matrix.
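As a minimal sketch of the transition probability matrix mentioned above: given counts of how often buyers of one brand bought each brand on the next purchase occasion, each row is normalized to sum to 1. The brand names and counts are hypothetical and serve only to illustrate the structure.

```python
# Hypothetical purchase counts: outer key = brand bought last time,
# inner key = brand bought this time. Values are illustrative only.
switch_counts = {
    "A": {"A": 80, "B": 20},
    "B": {"A": 30, "B": 70},
}

# Row-normalize the counts to obtain transition probabilities
transition = {
    prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
    for prev, row in switch_counts.items()
}

for prev, row in transition.items():
    print(prev, row)
```

The diagonal entries describe loyalty (repeat purchase), and the off-diagonal entries describe switching.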

A normative (or prescriptive) model has a recommended course of action as one of its outputs. For example, the objective function in a media allocation model may be the maximization of profit.

Demand models make up a special class of predictive models. In a demand model, the performance variable is a measure of demand. This performance variable may depend on a number of other variables, such as the marketing decision variables employed by the firm and its competitors. We distinguish individual demand models and aggregate demand models.

Aggregate demand may refer to:

1. The total number of units of a product category purchased by the population of all spending units. The corresponding demand model is called a model of industry sales, or a model of product class.
2. The total number of units of a particular brand purchased by the population of all spending units. The demand model is then a brand sales model.
3. The number of units of a particular brand purchased by the total population relative to the total number of units purchased of the product class, in which case the demand model is a market share model.
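The three aggregate demand measures listed above are related by simple arithmetic. The sketch below uses hypothetical brand names and unit sales (not from the source) to show industry sales as the sum over brands, and market share as brand sales divided by industry sales.

```python
# Hypothetical unit sales for one period; names and numbers are illustrative only
brand_sales = {"A": 1200, "B": 800, "C": 500}

# 1. Industry (product class) sales: total units over all brands
industry_sales = sum(brand_sales.values())

# 2. Brand sales are the individual entries of brand_sales

# 3. Market share: brand sales relative to product class sales
market_share = {brand: units / industry_sales for brand, units in brand_sales.items()}

print("industry sales:", industry_sales)
for brand, share in market_share.items():
    print(brand, round(share, 3))
```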

Gatignon and Robertson (1986) identify three types of models, which differ in their objectives and implications:


Theoretical models. These models offer a mathematical description of a process in which some constructs are systematically joined to others. The objective is to generate theoretical propositions that appropriately describe the possible influence of variables on the diffusion pattern and diffusion rate. These descriptions are the raison d’être of theoretical models and should provide suggestions to managers.

Normative models. These models also start with a description and assume functional relationships among the variables that affect the diffusion process. The behavioral assumptions may be less complex than those of theoretical models, given that the objective is not to make descriptive propositions but to develop optimal marketing strategies. An objective function for the firm is determined and the model implications are expressed with respect to variables incorporated into the model.

Empirical models. The objective of these models is to fit data and test a specific theoretical proposition or a complete model. Marketing has focused more on empirical and normative than on theoretical models.

Using the ROC Curve to Measure Sensitivity & Specificity

by Irina 16. October 2007 11:58


Two indices are used to evaluate the accuracy of a test that predicts dichotomous outcomes (e.g. logistic regression) – sensitivity and specificity. They describe how well a test discriminates between cases with and without a certain condition.

Sensitivity - the proportion of true positives or the proportion of cases correctly identified by the test as meeting a certain condition (e.g. in mammography testing, the proportion of patients with cancer who test positive).

Specificity - the proportion of true negatives or the proportion of cases correctly identified by the test as not meeting a certain condition (e.g. in mammography testing, the proportion of patients without cancer who test negative).
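The two definitions above can be sketched directly from counts of true and false positives and negatives. The labels and predictions below are made-up illustrative data, and the function name is a hypothetical helper, not part of any library.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels and 0/1 predictions.

    Assumes both classes occur in y_true, otherwise a denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative data: 4 cases with the condition, 6 without
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print("sensitivity:", sens, "specificity:", round(spec, 4))
```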

Lift - a measure of the performance of a predictive model, calculated as the ratio between the results obtained with and without the predictive model.
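One common way to make the lift definition concrete (an assumption here, since the source does not specify how the "results" are measured) is to compare the event rate among the cases the model scores highest with the overall event rate. The scores, outcomes, and the helper name `lift_at` are all illustrative.

```python
def lift_at(scores, outcomes, fraction):
    """Event rate in the top `fraction` of cases by score, divided by the overall rate."""
    n_top = max(1, int(round(len(scores) * fraction)))
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    base_rate = sum(outcomes) / len(outcomes)  # result without the model
    return top_rate / base_rate

# Illustrative model scores and observed 0/1 outcomes
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift = lift_at(scores, outcomes, 0.2)
print("lift in the top 20%:", round(lift, 3))
```

A lift above 1 means that selecting cases with the model finds events at a higher rate than selecting at random.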


Choosing a Cut-off

The position of the cut-off determines the number of true positives, true negatives, false positives, and false negatives. As you increase sensitivity and identify more cases with a certain condition, you sacrifice accuracy in identifying those without the condition (specificity). The optimal cut-off value (C) can be estimated by maximizing the index J:

J = MAX over C of [Sensitivity(C) + Specificity(C)]

This is equivalent to maximizing Youden's index, Sensitivity(C) + Specificity(C) - 1, because the two differ only by a constant.
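A minimal sketch of the cut-off search, assuming a case is classified as positive when its score is at or above the cut-off and that both classes are present. The scores and labels are illustrative, and `best_cutoff` is a hypothetical helper name.

```python
def best_cutoff(scores, labels):
    """Return (cut-off, J) maximizing sensitivity + specificity over observed scores."""
    best_c, best_j = None, float("-inf")
    for c in sorted(set(scores)):
        pred = [1 if s >= c else 0 for s in scores]
        tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
        fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
        tn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 0)
        fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
        j = tp / (tp + fn) + tn / (tn + fp)
        if j > best_j:  # on ties, the lowest cut-off is kept
            best_c, best_j = c, j
    return best_c, best_j

# Illustrative predicted probabilities and true 0/1 labels
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

cutoff, j = best_cutoff(scores, labels)
print("cut-off:", cutoff, "J:", j)
```

In this toy data two cut-offs reach the same J; which one to prefer in practice depends on the relative cost of false positives and false negatives.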

Receiver Operating Characteristic (ROC) Curve

A Receiver Operating Characteristic (ROC) curve is a graphical representation of the trade-off between the false negative and false positive rates for every possible cut-off. By tradition, the plot shows the false positive rate (1 - specificity) on the X axis and the true positive rate (sensitivity, or 1 - the false negative rate) on the Y axis. The accuracy of a test (i.e. the ability of the test to correctly classify cases with a certain condition and cases without the condition) is measured by the area under the ROC curve. An area of 1 represents a perfect test, while an area of .5 represents a worthless test. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test: the true positive rate is high and the false positive rate is low. Statistically, more area under the curve means that the test identifies more true positives while minimizing the number or percentage of false positives.

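  ods select parameterestimates association;
    proc logistic data=data1;
       model disease/n=age / outroc=roc1 roceps=0;
       output out=outp p=phat;
       ods output association=assoc;
       run;

    /* Store the c statistic (area under the ROC curve) in a macro variable. */
    /* The step must end with RUN before &area is referenced in the TITLE.   */
    data _null_;
       set assoc;
       if label2='c' then call symput("area",cvalue2);
       run;

    /* Plot sensitivity against 1 - specificity */
    title "area=&area";
    proc gplot data=roc1;
       plot _sensit_*_1mspec_;
       run;
    quit;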

It is important to use the ROCEPS=0 option in the MODEL statement of PROC LOGISTIC when you fit your model because this option allows all the unique predicted values to be output to the OUTROC= data set. Otherwise, the values may be rounded yielding fewer points on the ROC plot.
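For readers without SAS, the ROC points and the area under the curve described above can be sketched in plain Python. This is an illustration on made-up scores, not a reproduction of what PROC LOGISTIC computes internally; it assumes a case is called positive when its score is at or above the cut-off and integrates with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per distinct cut-off."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is called positive
    for c in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Illustrative predicted probabilities and true 0/1 labels
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print("AUC:", auc)
```

Using every distinct score as a cut-off mirrors the idea behind ROCEPS=0: no points are merged, so the curve has as many steps as the data allow.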

Rare Event Data

by Irina 6. July 2007 10:35
In the literature, rare events have proven difficult to predict because of two problems:
  • Popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events.
  • Commonly used data collection strategies are inefficient for rare event data.

    Solution:

    More efficient sampling designs exist for making valid inferences, for example sampling all available events and a tiny fraction of nonevents.
    This can save as much as 99% of data collection costs and/or make it possible to collect much more meaningful (expensive) feature variables.

    Sampling:

    Each example is a triple (x, y, s):

  • s controls the selection of examples (1 means selected, 0 means not selected).
  • We only have access to the examples with s=1.
  • s is independent of x given y, that is, P(s|x,y)=P(s|y).
  • The selected examples are therefore biased.
  • The bias depends only on the label y.
  • This corresponds to a change in the prior probabilities of the labels.

  • This kind of sampling is also called oversampling, retrospective sampling, biased sampling, or choice-based sampling.
    The oversampling method has been widely used in signal detection theory; it consists of resampling the small class at random until it contains as many examples as the other class.
    The downsizing (undersampling) method consists of randomly removing samples from the majority class until the minority class reaches some specific percentage of the majority class.
    This produced two different datasets for each time step: one with a churner/nonchurner ratio 1/1 and the other with a ratio 2/3.
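The two resampling schemes just described can be sketched as follows. The data are placeholder integers standing in for records, the function names are hypothetical, and the class sizes (900 nonevents, 100 events) are illustrative. The ratios 1/1 and 2/3 echo the minority/majority ratios mentioned above.

```python
import random

def undersample(majority, minority, ratio, rng):
    """Randomly drop majority examples until minority/majority is about `ratio`."""
    keep = min(len(majority), int(round(len(minority) / ratio)))
    return rng.sample(majority, keep), list(minority)

def oversample(majority, minority, rng):
    """Resample the minority class with replacement until it matches the majority size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

rng = random.Random(0)            # fixed seed so the sketch is repeatable
majority = list(range(900))       # placeholder nonevent records
minority = list(range(900, 1000)) # placeholder event records

maj_1to1, min_1to1 = undersample(majority, minority, 1.0, rng)
maj_2to3, min_2to3 = undersample(majority, minority, 2 / 3, rng)
maj_over, min_over = oversample(majority, minority, rng)

print(len(min_1to1), len(maj_1to1))  # ratio 1/1
print(len(min_2to3), len(maj_2to3))  # ratio 2/3
print(len(min_over), len(maj_over))  # oversampled minority
```

Either scheme changes the event proportion in the modeling data set, which is why the intercept of a model fitted to it has to be adjusted before predicted probabilities are interpreted.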

    In the biological sciences, studies using this kind of sampling are known as case-control studies. Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of stratified sampling. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect, such as the predicted event probabilities and differences or ratios of event probabilities. If you know the probabilities of events and nonevents in the population, then you can adjust the intercept either by weighting or by using an offset.

    Adjusting the Intercept
    To adjust by weighting, add a variable to your data set that takes the value p1/r1 in event observations, and the value (1-p1)/(1-r1) in nonevent observations, where p1 is the probability of an event in the population and r1 is the proportion of events in your data set. Specify this variable in the WEIGHT statement in PROC LOGISTIC. Or, to adjust by using an offset, add a variable to your data set defined as log[(r1*(1-p1)) / ((1-r1)*p1)], where log represents the natural logarithm. Specify this variable in the OFFSET= option of the MODEL statement in PROC LOGISTIC.

    Example:

            data full;
            do i=1 to 1000;
              x=rannor(12342);
              p=1/(1+exp(-(-3.35+2*x)));
              y=ranbin(98435,1,p);
              drop i;
              output;
            end;
            run;
    
          data sub;
            set full;
            if y=1 or (y=0 and ranuni(75302)<1/9) then output;
            run;
    
          proc freq data=full;
            table y / out=fullpct(where=(y=1) rename=(percent=fullpct));
            title "response counts in full data set";
            run;
          proc freq data=sub;
            table y / out=subpct(where=(y=1) rename=(percent=subpct));
            title "Response counts in oversampled, subset data set";
            run;
          data sub;
            set sub;
            if _n_=1 then set fullpct(keep=fullpct);
            if _n_=1 then set subpct(keep=subpct);
            p1=fullpct/100; r1=subpct/100;
            w=p1/r1; if y=0 then w=(1-p1)/(1-r1);
            off=log( (r1*(1-p1)) / ((1-r1)*p1) );
            run;
    
          ods select parameterestimates(persist);
          proc logistic data=sub;
            model y(event="1")=x;
            output out=out p=pnowt;
            title "True Parameters: -3.35 (intercept), 2 (X)";
            title2 "Unadjusted Model";
            run;
          proc logistic data=out;
            model y(event="1")=x; weight w;
            output out=out p=pwt;
            title2 "Weight-adjusted Model";
            run;
         proc logistic data=out;
            model y(event="1")=x / offset=off;
            output out=out xbeta=xboff;
            title2 "Offset-adjusted Model";
            run;
          data out;
            set out;
            poff=logistic(xboff-off);
            run;
          proc freq data=full noprint;
            table y / out=priors(drop=percent rename=(count=_prior_));
            run;
          proc logistic data=out;
            model y(event="1")=x;
            score data=sub prior=priors out=out2;
            title2 "Unadjusted Model; Prior-adjusted probabilities";
            run;
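The weight and offset formulas from "Adjusting the Intercept" can be checked numerically outside SAS. The sketch below uses hypothetical values p1 = 0.05 (population event probability) and r1 = 0.5 (event proportion in the oversampled data); the function names are illustrative. It mirrors the step poff=logistic(xboff-off) from the listing above by subtracting the offset from a linear predictor on the sample scale.

```python
import math

def intercept_adjustment(p1, r1):
    """Return (event weight, nonevent weight, offset) for population rate p1 and sample rate r1."""
    w_event = p1 / r1
    w_nonevent = (1 - p1) / (1 - r1)
    offset = math.log((r1 * (1 - p1)) / ((1 - r1) * p1))
    return w_event, w_nonevent, offset

def logistic(z):
    """Inverse logit."""
    return 1 / (1 + math.exp(-z))

# Hypothetical rates, not taken from the SAS example
p1, r1 = 0.05, 0.5
w_event, w_nonevent, offset = intercept_adjustment(p1, r1)

# With these weights, the weighted event proportion in the sample equals p1 again
weighted_rate = (r1 * w_event) / (r1 * w_event + (1 - r1) * w_nonevent)

# A case scored at 0.5 on the oversampled scale, moved to the population scale
p_sample = 0.5
xbeta_sample = math.log(p_sample / (1 - p_sample))
p_population = logistic(xbeta_sample - offset)

print(w_event, w_nonevent, round(offset, 4))
print(round(weighted_rate, 4), round(p_population, 4))
```

Subtracting the offset shifts only the intercept, which is consistent with the statement that slope and odds ratio estimates are unaffected by this kind of sampling.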
    
    

    Trend estimation

    by Irina 25. May 2007 06:17

    Trend in a time series is a slow, gradual change in some property of the series over the whole interval under investigation. Trend is sometimes loosely defined as a long term change in the mean, but can also refer to change in other statistical properties. For example, tree-ring series of measured ring width frequently have a trend in variance as well as mean. Identification of trend in a time series is subjective because trend in a sample cannot be unequivocally distinguished from low frequency fluctuations.
    Curve-fitting.
    If a time series changes in level gradually over time, it makes sense to consider as trend some simple function of time itself.
    The simplest and most widely used function of time used in detrending is the least-squares-fit straight line, which treats linear trend. Simple linear regression is used to fit the model:

    x_t = a + b*t + e_t

    where x_t is the original time series at time t, a is the regression constant, b is the regression coefficient, and e_t are the regression residuals. The advantage of the straight-line method is simplicity. The straight line may be unrealistic, however, in restricting the functional form of the trend.
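The straight-line fit and a significance check on its slope can be sketched in plain Python. The series is made up, `linear_trend` is a hypothetical helper, and the critical value 1.96 is a normal approximation used as a default here; for short series a Student t critical value with n - 2 degrees of freedom is more appropriate.

```python
import math

def linear_trend(x, critical=1.96):
    """Least-squares fit of x_t = a + b*t + e_t for t = 1..n.

    Returns (a, b, t_stat, trend) with trend in {-1, 0, 1}.
    Assumes n > 2 and residuals that are not all exactly zero.
    """
    n = len(x)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    x_bar = sum(x) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x))
    b = sxy / sxx                       # slope
    a = x_bar - b * t_bar               # intercept
    sse = sum((xi - (a + b * ti)) ** 2 for ti, xi in zip(t, x))
    se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)  # standard error of the slope
    t_stat = b / se_b
    if abs(t_stat) > critical:
        trend = 1 if b > 0 else -1
    else:
        trend = 0
    return a, b, t_stat, trend

# Illustrative series with an upward drift
series = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
a, b, t_stat, trend = linear_trend(series)
print(round(a, 4), round(b, 4), round(t_stat, 2), trend)
```

The trend flag follows the same convention as the queries in this post: 1 for a significant increase, -1 for a significant decrease, and 0 otherwise.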

    Trend estimation in Teradata:

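    -- Slope of deposit on period, with an approximate significance test.
    -- The FROM clause is omitted in the original; supply the table that holds deposit and period.
    SELECT CAST(REGR_SLOPE(deposit , period ) AS DECIMAL(8,4)) as beta,  -- slope b
    sqrt(REGR_SXX( deposit , period))  as sxx,                           -- root sum of squares of period
    sqrt(REGR_SYY( deposit , period )) as syy,                           -- root sum of squares of deposit
    CAST(REGR_R2(deposit , period)  AS DECIMAL(8,4)) as r,               -- R-squared
    sqrt(1-r)*syy  as s_e,                                               -- root of the residual sum of squares
    cast(s_e/(sqrt(14)*sxx) AS DECIMAL(8,4)) as s_ee,                    -- standard error of beta; 14 = n - 2 is hard-coded (n = 16)
    beta/s_ee as t,                                                      -- t statistic for the slope
    case when abs(t)>1.96 then 1 else 0 end as significant,              -- 1.96 is the normal approximation; t(14) would be about 2.145
    case when beta>0  and significant=1 then 1 
         when beta<0  and significant=1 then -1
         else 0 end as trend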

    Trend estimation in SAS:


    data leadprd;
          input date:monyy5. leadprod customer @@;
          format date monyy5.;
          title 'Lead Production Data';
          title2 '(in tons)';
          datalines;
       jan90 38500 1 feb90 37900  1 mar90 36900  1  apr90 38600  1 
       may90 36400 1 jun90 33300  1 jul90 34000  1  aug90 38000  1 
       sep90 37400 1 oct90 42300  1 nov90 36900  1  dec90 34800  1 
       jan91 33900 1 feb91 34000  1 mar91 37200  1  apr91 33300  1 
       may91 29800 1 jun91 24700  1 jul91 30800  1  aug91 31100  1 
       sep91 32400 1 oct91 32900  1 nov91 29100  1  dec91 31800  1 
       jan92 32100 1 feb92 30500  1 mar92 36800  1  apr92 30300  1 
       may92 29500 1 jun92 24700  1 jul92 27600  1  aug92 23800  1 
       sep92 21400 1 feb90 37900  2 mar90 36900  2  apr90 38600  2
       may90 36400 2 jun90 33300  2 jul90 34000  2   aug90 38000 2 
       sep90 37400 2 oct90 42300  2 nov90 36900  2   dec90 34800 2 
       jan91 33900 2 feb91 34000  2 mar91 37200  2   apr91 68800 2 
       may91 75000 2 jun91 85000  2 jul91 10555  2   aug91 11520 2 
       sep91 32400 2 oct91 22500  2 nov91 29100  2   dec91 31800 2 
       jan92 32100 2 feb92 23556  2 mar92 33505  2   apr92 43005 2 
       may92 66500 2 jun92 77550  2 jul92 88800  2   aug92 99990 2 
       ;
       run;

    Next, produce your forecasts and save their predicted values to SAS data sets. This example uses the forecasting capabilities of the FORECAST, ARIMA, and REG procedures. The OUT1STEP option of PROC FORECAST specifies that only the one-step-ahead forecasts are output to the data set LEADOUT1. The LEAD= option produces forecasts for 12 months beyond the sample period.
     proc forecast data=leadprd out=leadout1 out1step
          lead=12 interval=month;
          id date;
          var leadprod;
    	  by customer;
       run;
    
        proc arima data=leadprd;
          i var=leadprod nlag=15;
          e p=1;
          f lead=12 interval=month id=date out=leadout2;
       by customer;
       run;
       quit;

    To estimate a time trend for the lead production data, it is necessary to create a new variable T that spans both the sample and forecast periods.

    data ttrend;
          set leadout2;
          t+1;
       run;
    proc reg data=ttrend;
    model leadprod = t;
    output out=leadout3 p=ptrend;
    ods output ParameterEstimates = estim;
    ods output  FitStatistics=k;
    ods output anova=n;
    by customer;
    run;
    quit;
    proc sql;
    create table estim_trend as
    select a.*,
    case when Probt<0.05 then 1 else 0 end as significant,
    case when Estimate>0  and calculated significant=1 then 1 
         when Estimate<0  and calculated significant=1 then -1
         else 0 end as trend
     from estim a
     where variable ne 'Intercept';
     quit;
    
     data final;
          merge leadout1(keep=date leadprod customer
                       rename=(leadprod=pfore)) 
            leadout2(keep=date leadprod forecast customer
                       rename=(leadprod=actual forecast=parima)) 
            leadout3(keep=date ptrend customer);
    		by customer date;
       run;
        

    About the author

    Irina Spivak
    Team Leader at G-Stat.



      Disclaimer

      The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.

      © Copyright 2010
