Relative importance of explanatory variables

by Irina 26. January 2008 13:02

by David Firth

Question: What are the most important variables in regression?

Achen asks: `Important for what? Without a criterion for importance, the inquiries are meaningless.' He goes on to distinguish three notions of importance, namely:

  1. `theoretical importance' (measured by β_X)
  2. `level importance' (measured by β_X · μ_X)
  3. `dispersion importance' (measured by β_X · σ_X / σ_Y)


    Of the last, Achen says: `although almost no one is substantively interested in it, many social scientists use it as their sole importance measure.' He suggests standardization by the (supposed fixed) range of a variable, rather than by its s.d., to achieve comparability across samples.
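
As a rough illustration of the three measures, the sketch below computes them for a single explanatory variable. The data are made up, and the one-regressor least-squares slope stands in for β_X; with several regressors the coefficient would come from the full regression instead.

```python
import statistics

# Hypothetical data: x is one explanatory variable, y is the response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)
sd_x = statistics.stdev(x)
sd_y = statistics.stdev(y)

# Least-squares slope for a single regressor (assumed stand-in for beta_X).
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
beta = sxy / sxx

theoretical_importance = beta               # change in y per unit change in x
level_importance = beta * mean_x            # contribution of x to the level of y
dispersion_importance = beta * sd_x / sd_y  # standardized coefficient

print(theoretical_importance, level_importance, dispersion_importance)
```

The third quantity depends on the sample standard deviations, which is why it is hard to compare across samples.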

The QUANTREG Procedure

by Irina 22. October 2007 12:40
The QUANTREG procedure models the effects of covariates on the conditional quantiles of a response variable by means of quantile regression. Quantile regression, which was introduced by Koenker and Bassett (1978), extends the regression model to conditional quantiles of the response variable, such as the median or the 90th percentile. Quantile regression is particularly useful when the rate of change in the conditional quantile, expressed by the regression coefficients, depends on the quantile.

  • Quantile regression is also flexible in the sense that it does not involve a link function that relates the variance and the mean of the response variable.
  • Quantile regression also offers a degree of data robustness.
  • Quantile regression cannot be carried out simply by segmenting the unconditional distribution of the response variable and then obtaining least-squares fits for the subsets. This approach leads to disastrous results when, for example, the data include outliers. In contrast, quantile regression uses all of the data for fitting quantiles, even the extreme quantiles.
    /* Fit the 0.9 conditional quantile with resampling-based 99% confidence limits */
    proc quantreg data=trout alpha=0.01 ci=resampling;
      model LnDensity = WDRatio / quantile=0.9
                                  CovB CorrB
                                  seed=12345;
      test WDRatio;
    run;

    /* Fit across quantiles and plot the coefficients against the quantile level */
    ods html;
    ods graphics on;
    proc quantreg data=trout alpha=0.1 ci=resampling;
      model LnDensity = WDRatio / quantile=all seed=12345
                                  plot=quantplot;
    run;
    ods graphics off;
    ods html close;

    /* Fit one model per requested quantile; predictions go to outp1, outp2, ... */
    %macro quantiles(NQuant, Quantiles);
      %do i=1 %to &NQuant;
        proc quantreg data=bmimen ci=none algorithm=interior;
          model logbmi = inveage sqrtage age sqrtage*age
                         age*age age*age*age
                         / quantile=%scan(&Quantiles,&i,",");
          output out=outp&i pred=p&i;
        run;
      %end;
    %mend;
    %let quantiles = %str(.03,.05,.10,.25,.5,.75,.85,.90,.95,.97);
    %quantiles(10,&quantiles);
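
Quantile regression in the Koenker and Bassett sense fits a quantile by minimizing an asymmetrically weighted sum of absolute residuals, often called the check loss. The Python sketch below is not what PROC QUANTREG does internally; it only illustrates the idea for an intercept-only model on made-up data with one outlier, where a minimizer can always be found among the observed values.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: weight tau for positive residuals, 1 - tau for negative ones."""
    return residual * (tau - (1 if residual < 0 else 0))


def total_loss(q, data, tau):
    """Total check loss when every observation is predicted by the constant q."""
    return sum(check_loss(y - q, tau) for y in data)


# Hypothetical response values; 100.0 is an outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

# Search the observed values for the constant that minimizes the loss.
median_fit = min(data, key=lambda q: total_loss(q, data, 0.5))
p90_fit = min(data, key=lambda q: total_loss(q, data, 0.9))

mean_fit = sum(data) / len(data)
print(median_fit, p90_fit, mean_fit)
```

The median fit stays at a typical value while the mean is pulled toward the outlier, which is the robustness mentioned above. All observations enter the loss even for the extreme quantile.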
    Factor Analysis

    by Irina 17. July 2007 10:45

    A Factor is a dimension underlying several variables.

    Analytically, it is a linear combination of the variables: F1 = W1X1 + W2X2 + ..., where F1 is factor 1, Xj are the variables of the study (5 in our example), and Wj are the weights used to combine the individual scores. The various methods of factor analysis are distinguished by the manner in which the weights Wj are determined.

    A Factor score: The score of a respondent on a factor. If we decide to settle with two factors we will have two factor scores for each of the 500 respondents.

    A Factor loading: The correlation between a factor and a variable.

    Labeling Factors: The art of segmentation; it consists of selecting a term which best describes all the variables that load highly on a factor. Factor #1 may be labeled as “price conscious” and factor #2 as “fashion conscious”.

    The proportion of total variance of a certain variable accounted for by a factor may be obtained by squaring the loading. In our example factor #1 explains .9324^2 = 86.94% of the variance in variable 4.
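
The squaring rule, and the search for variables that load highly on a factor, can be sketched as follows. The loading values, the variable names and the 0.6 cut-off are all hypothetical and chosen only for illustration.

```python
# Hypothetical loadings of 5 variables on 2 factors: (factor 1, factor 2).
loadings = {
    "var1": (0.81, 0.12),
    "var2": (0.77, 0.20),
    "var3": (0.15, 0.88),
    "var4": (0.93, 0.05),
    "var5": (0.22, 0.79),
}

# Proportion of each variable's variance accounted for by each factor.
variance_share = {
    name: tuple(loading ** 2 for loading in pair)
    for name, pair in loadings.items()
}

# Variables that load highly on each factor (assumed cut-off).
THRESHOLD = 0.6
high_on_factor1 = [n for n, (f1, _) in loadings.items() if abs(f1) >= THRESHOLD]
high_on_factor2 = [n for n, (_, f2) in loadings.items() if abs(f2) >= THRESHOLD]

print(variance_share["var4"])
print(high_on_factor1, high_on_factor2)
```

The two lists are what an analyst would inspect when choosing a label for each factor.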

    /* One row per customer, one column (event<type>) per activity type */
    proc transpose data=event_transaction out=result prefix=event;
      by branch_cust_ip;
      id Event_Costing_Activity_Type_Co;
      var count;
    run;

    /* Replace missing counts with 0 */
    data result;
      set result;
      array events{*} _NUMERIC_;
      do i = 1 to dim(events);
        if events{i} = . then events{i} = 0;
      end;
      drop i;
    run;

    /* Extract and rotate 10 factors; OUTSTAT= stores the scoring coefficients */
    proc factor score data=result method=p rotate=orthomax nfactors=10 outstat=fact_events;
      var event: ;
    run;

    /* Apply the scoring coefficients to get Factor1-Factor10 for each observation */
    proc score data=personal score=fact_events out=scores_events;
      var event: ;
    run;

    /* Record which factor has the highest and the lowest score in each row */
    data scores_events;
      set scores_events;
      array factor{*} Factor1-Factor10;
      max = max(of Factor1-Factor10);
      min = min(of Factor1-Factor10);
      do i = 1 to dim(factor);
        if max = factor{i} then factor_max = i;
        if min = factor{i} then factor_min = i;
      end;
      drop i;
    run;
    

    Principal Component Analysis

    by Irina 5. May 2007 01:49

    The Basics of Principal Component Analysis

    Principal component analysis is appropriate when you have obtained measures on a number of observed variables, believe that there is some redundancy in those variables, and wish to develop a smaller number of artificial variables (called principal components) that will account for most of the variance in the observed variables. In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. The principal components may then be used as predictor or criterion variables in subsequent analyses.

    Because it is a variable reduction procedure, principal component analysis is similar in many respects to exploratory factor analysis. In fact, the steps followed when conducting a principal component analysis are virtually identical to those followed when conducting an exploratory factor analysis. However, there are significant conceptual differences between the two procedures, and it is important that you do not mistakenly claim that you are performing factor analysis when you are actually performing principal component analysis.

    What is a Principal Component?

    How principal components are computed. Technically, a principal component can be defined as a linear combination of optimally-weighted observed variables. In order to understand the meaning of this definition, it is necessary to first describe how subject scores on a principal component are computed. It is possible to calculate a score for each subject on a given principal component. For example, in the preceding study, each subject would have scores on two components: one score on the satisfaction with supervision component, and one score on the satisfaction with pay component. The subject’s actual scores on the seven questionnaire items would be optimally weighted and then summed to compute their scores on a given component.

    For example, assume that component 1 in the present study was the “satisfaction with supervision” component. You could determine each subject’s score on principal component 1 by using the following fictitious formula:
    C1 = .44 (X1) + .40 (X2) + .47 (X3) + .32 (X4) + .02 (X5) + .01 (X6) + .03 (X7)
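
To make the weighted sum concrete, the sketch below applies the fictitious weights from the formula above to one subject. The seven item responses are invented for the example.

```python
# Fictitious component weights from the formula for C1.
weights = [0.44, 0.40, 0.47, 0.32, 0.02, 0.01, 0.03]

# Hypothetical responses of one subject to questionnaire items X1..X7.
responses = [6, 5, 7, 6, 2, 3, 2]

# The component score is the weighted sum of the item scores.
c1 = sum(w * x for w, x in zip(weights, responses))
print(round(c1, 2))
```

Items 1 to 4 carry almost all of the weight, so the score mostly reflects how the subject answered those four items.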

    The SAS System’s PROC FACTOR solves for these weights by using a special type of equation called an eigenequation. The weights produced by these eigenequations are optimal weights in the sense that, for a given set of data, no other set of weights could produce a set of components that are more successful in accounting for variance in the observed variables. The weights are created so as to satisfy a principle of least squares similar (but not identical) to the principle of least squares used in multiple regression.
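
The eigenequation idea can be sketched in a few lines. The example below uses power iteration on a small, made-up correlation matrix to approximate the first component's weights; this is only one simple way to solve the eigenequation and is not a description of how PROC FACTOR computes it.

```python
import math

# Hypothetical correlation matrix of three observed variables.
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]


def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def first_component(corr, iterations=500):
    """Approximate the largest eigenvalue and its unit-length eigenvector."""
    v = [1.0] * len(corr)
    for _ in range(iterations):
        w = mat_vec(corr, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(corr, v)))
    return eigenvalue, v


eigenvalue, weights = first_component(R)

# Share of the total variance (3 standardized variables) carried by component 1.
share = eigenvalue / len(R)
print(eigenvalue, weights, share)
```

The returned vector satisfies R·w = λ·w, and no other unit-length weight vector gives a combination with larger variance than λ.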

    Number of components extracted.

    In reality, the number of components extracted in a principal component analysis is equal to the number of observed variables being analyzed. However, in most analyses, only the first few components account for meaningful amounts of variance, so only these first few components are retained, interpreted, and used in subsequent analyses (such as in multiple regression analyses).

    What is meant by “total variance” in the data set?

    The “total variance” in the data set is simply the sum of the variances of the observed variables. Because the variables have been standardized, each has a variance of one, so the total variance in a principal component analysis will always be equal to the number of observed variables being analyzed.
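
A quick numerical check of this statement, using three made-up variables measured on very different scales:

```python
import statistics

# Hypothetical raw data for three observed variables.
data = {
    "x1": [2.0, 4.0, 4.0, 5.0, 7.0],
    "x2": [10.0, 30.0, 20.0, 50.0, 40.0],
    "x3": [0.1, 0.4, 0.2, 0.9, 0.5],
}


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


standardized = {name: standardize(col) for name, col in data.items()}
total_variance = sum(statistics.variance(col) for col in standardized.values())
print(total_variance)
```

Whatever the original units, the standardized variances sum to the number of variables, here 3 up to floating-point error.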

    Principal Component Analysis is Not Factor Analysis !

    Both procedures can be performed with the SAS System’s FACTOR procedure, and they sometimes even provide very similar results. But factor analysis assumes that the covariation in the observed variables is due to the presence of one or more latent variables (factors) that exert causal influence on these observed variables. In contrast, principal component analysis makes no assumption about an underlying causal model. Principal component analysis is simply a variable reduction procedure that (typically) results in a relatively small number of components that account for most of the variance in a set of observed variables.

    What is a communality?

    A communality refers to the percent of variance in an observed variable that is accounted for by the retained components (or factors).
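
For uncorrelated (orthogonal) components, a communality can be computed as the sum of the squared loadings of a variable on the retained components. The sketch below assumes two retained components and uses invented loadings.

```python
# Hypothetical loadings of four items on two retained, orthogonal components.
loadings = {
    "item1": [0.80, 0.10],
    "item2": [0.75, 0.30],
    "item3": [0.20, 0.85],
    "item4": [0.40, 0.40],
}

# Communality: share of an item's variance captured by the retained components.
communality = {
    item: sum(loading ** 2 for loading in row)
    for item, row in loadings.items()
}

for item, value in communality.items():
    print(item, round(value, 4))
```

An item such as item4, with a low communality, is poorly represented by the retained components.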

    SAS Program and Output.

    You may perform a principal component analysis using either the PRINCOMP or FACTOR procedures.
    PROC FACTOR DATA=data-set-name
    PREPLOT PLOT
    SIMPLE
    METHOD=PRIN
    PRIORS=ONE
    MINEIGEN=p
    SCREE
    ROTATE=VARIMAX
    ROUND
    FLAG=desired-size-of-"significant"-factor-loadings ;
    VAR variables-to-be-analyzed ;
    RUN;
    FLAG=desired-size-of-"significant"-factor-loadings causes the printed output to flag (with an asterisk) any factor loading whose absolute value is greater than the specified size.
    METHOD=factor-extraction-method specifies the method to be used in extracting the factors or components. The current program specifies METHOD=PRIN to request that the principal axis (principal factors) method be used for the initial extraction. This is the appropriate method for a principal component analysis.
    PREPLOT option will show us a factor plot before rotation.
    PLOT option will show us a factor plot after rotation.
    MINEIGEN=p specifies the critical eigenvalue a component must display if that component is to be retained. This statement will cause PROC FACTOR to retain and rotate any component whose eigenvalue is p or larger. Negative values are not allowed. (Here, p = the critical eigenvalue.)
    NFACT=n  allows you to specify the number of components to be retained and rotated, where n = the number of components.
    PRIORS=prior-communality-estimates specifies prior communality estimates. Users should always specify PRIORS=ONE to perform a principal component analysis.
    ROTATE=rotation-method specifies the rotation method to be used. The preceding program requests a varimax rotation, which results in orthogonal (uncorrelated) components.
    ROUND causes all coefficients to be multiplied by 100 and rounded to the nearest integer (thus eliminating the decimal point), which is equivalent to showing them to two decimal places.
    SIMPLE requests simple descriptive statistics: the number of usable cases on which the analysis was performed, and the means and standard deviations of the observed variables.
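
The retention rules behind MINEIGEN= and NFACT= can be illustrated outside SAS. The eigenvalues below are hypothetical values for a seven-variable analysis (they sum to 7, the total variance), and the threshold of 1 is just an example choice for p.

```python
# Hypothetical eigenvalues for 7 standardized variables, largest first.
eigenvalues = [2.9, 1.8, 0.9, 0.7, 0.4, 0.2, 0.1]

# MINEIGEN-style rule: keep components whose eigenvalue is p or larger.
p = 1.0
retained_by_mineigen = [ev for ev in eigenvalues if ev >= p]

# NFACT-style rule: keep a fixed number of components.
n = 3
retained_by_nfact = eigenvalues[:n]

# Proportion of total variance accounted for by the retained components.
total = sum(eigenvalues)
share_mineigen = sum(retained_by_mineigen) / total
share_nfact = sum(retained_by_nfact) / total

print(len(retained_by_mineigen), round(share_mineigen, 4))
print(len(retained_by_nfact), round(share_nfact, 4))
```

With these numbers the eigenvalue rule keeps two components, which account for roughly two thirds of the total variance.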

    DETERMINING SAMPLE SIZE

    by Irina 20. April 2007 10:11

    Sample size determination is computed using three inputs:

  • The estimate of the population standard deviation (often obtained from earlier studies)
  • The acceptable level of sampling error
  • The desired confidence level

    Generally, research practitioners utilize the following sequence and inputs in computing sample size:

    1. Survey respondents will split 50/50 in response to dichotomous (e.g. yes/no) questions.

    2. The desired level of confidence will be 95%, or 1.96 standard deviations from the mean, with .05 possible error.

    Py = Proportion responding “yes”

    Pn = Proportion responding “no”

    Standard error is the acceptable amount of error divided by the number of standard deviations for the chosen confidence level. In the above case that is .05/1.96 (about 2 standard deviations), or .0255102. The standard formula for computing the sample size is:

    Sample size = (Py)(Pn) / (Std Error)^2

    So, when the respective values are input, we end up with .25/.0006507, or about 384 respondents. This is why a survey sample size of 400 is often recommended. Sample size is important in avoiding Type I or Type II errors.
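
The arithmetic above can be reproduced with a small helper. The function name and its parameters are introduced here only for illustration.

```python
def sample_size(p_yes, acceptable_error, z):
    """Sample size for a proportion: (Py)(Pn) / (Std Error)^2."""
    p_no = 1.0 - p_yes
    std_error = acceptable_error / z
    return (p_yes * p_no) / std_error ** 2


# 50/50 split, .05 acceptable error, 95% confidence (z = 1.96).
n = sample_size(p_yes=0.5, acceptable_error=0.05, z=1.96)
print(round(n, 2))

# A tighter acceptable error of .03 needs a much larger sample.
n_tight = sample_size(p_yes=0.5, acceptable_error=0.03, z=1.96)
print(round(n_tight, 2))
```

The 50/50 assumption is the conservative choice, because Py × Pn is largest when both proportions are .5.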

    Type I errors are made by stating that there is a difference between two groups within a population on a given measurement, when in fact there is no difference. Accommodating this potential outcome is where most sample size calculations stop. Often, practitioners simply ignore the possibility of making a Type II error. The sample size typically needed to address Type I errors is 384.

    Type II errors are made by stating that there is no difference between two groups within a population on a given measurement, when in fact there is a difference. Although statistical power is important, many researchers ignore power calculations. In the “real world”, tables and canned statistical tools are used to determine survey power, because of the complexity of the formulas. The sample size typically needed to address Type II errors is 1,236.

    Confidence level suggests that other samples drawn from the same population will have similar values X% of the time. For most marketing research exercises, confidence levels are set at 95%.

    Confidence interval includes the possible end point values for the entire population. The confidence interval allows for a computed amount of variation from the mean value based on the precision/cost value trade-off.

  • Carl Bergemann

    About the author

    Irina Spivak
    Team Leader at G-Stat.


      Disclaimer

      The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

      © Copyright 2010
