The LOESS procedure

by Irina 16. February 2008 05:48

PROC LOESS implements a nonparametric method for estimating local regression surfaces pioneered by Cleveland (1979); also refer to Cleveland et al. (1988) and Cleveland and Grosse (1991). This method is commonly referred to as loess, which is short for local regression.

PROC LOESS allows greater flexibility than traditional modeling tools because you can use it for situations in which you do not know a suitable parametric form of the regression surface. Furthermore, PROC LOESS is suitable when there are outliers in the data and a robust fitting method is necessary.

The main features of PROC LOESS are as follows:
  • fits nonparametric models
  • supports the use of multidimensional predictors
  • supports multiple dependent variables
  • supports both direct and interpolated fitting using kd trees
  • computes confidence limits for predictions
  • performs iterative reweighting to provide robust fitting when there are outliers in the data
  • supports scoring for multiple data sets

    Local Regression and the Loess Method

    Assume that for i = 1 to n, the ith measurement yi of the response y and the corresponding measurement xi of the vector x of p predictors are related by
    yi = g(xi) + ei

    where g is the regression function and ei is a random error. The idea of local regression is that near x = x0, the regression function g(x) can be locally approximated by the value of a function in some specified parametric class. Such a local approximation is obtained by fitting a regression surface to the data points within a chosen neighborhood of the point x0.


    In the loess method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods. The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points. The fraction of the data, called the smoothing parameter, in each local neighborhood controls the smoothness of the estimated surface. Data points in a given local neighborhood are weighted by a smooth decreasing function of their distance from the center of the neighborhood.
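The recipe in the paragraph above can be sketched for a single predictor. This is written in Python only to make the arithmetic visible; it is not how PROC LOESS is implemented. The tricube weight is the usual choice in Cleveland's loess, but the function names (`tricube`, `loess_at`), the `span` argument, and the sample data are illustrative assumptions, and only a local linear fit is shown.

```python
import math

def tricube(u):
    """Smooth, decreasing weight: 1 at the center, 0 at and beyond the radius."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_at(x0, xs, ys, span=0.5):
    """Local linear estimate of g(x0) using the nearest span * n points."""
    n = len(xs)
    q = min(n, max(2, math.ceil(span * n)))        # neighborhood size
    radius = sorted(abs(x - x0) for x in xs)[q - 1]
    if radius == 0.0:                              # all neighbors sit exactly at x0
        same = [y for x, y in zip(xs, ys) if x == x0]
        return sum(same) / len(same)
    w = [tricube((x - x0) / radius) for x in xs]
    sw = sum(w)
    if sw == 0.0:                                  # every neighbor lies on the boundary
        near = [y for x, y in zip(xs, ys) if abs(x - x0) <= radius]
        return sum(near) / len(near)

    # Weighted least squares for the line y = a + b * x
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:                         # too few distinct points for a slope
        return swy / sw
    b = (sw * swxy - swx * swy) / denom
    a = (swy - b * swx) / sw
    return a + b * x0

# Hypothetical data: a curved response that a single straight line would miss
xs = list(range(10))
ys = [x * x for x in xs]
print(loess_at(4.5, xs, ys, span=0.5))
```

A smaller `span` uses fewer neighbors and follows the data more closely; a larger one gives a smoother estimate, which mirrors the role of the smoothing parameter described above.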

  • The result of the procedure is essentially a smooth, curved regression line. It is useful at least for describing the data and as a diagnostic that suggests whether a linear regression is appropriate.

    Example 1.

       ods output OutputStatistics=PredLOESS;
       proc loess data=ExperimentA;
          model Yield = Temperature Catalyst / scale=sd degree=2 select=gcv;
       run;
       ods output close;

       proc gam data=ExperimentA;
          model Yield = loess(Temperature) loess(Catalyst) / method=gcv;
          output out=PredGAM;
       run;

    Although LOESS provides a model of the response surface, it does not provide an equation stating the dependence, and it does not provide information about interactions and non-linearities. If the span for the preferred LOESS fit is small, it is unlikely that a common function can be found for all the data. If the span is large, it is quite likely that a common function can be found.
    If we have more than one or two outliers or points of influence (leverage points), we can't just drop one point, recompute the hat matrix, drop another point, and recompute the hat matrix again. We need a more comprehensive approach such as LOESS, M estimation (introduced by Huber in 1973), S estimation, LTS estimation, or MM estimation. All of these except LOESS are available in PROC ROBUSTREG.


    Using LOESS for a binary dependent variable is not recommended. LOESS can certainly handle multivariate data, but the fit is a weighted least squares model of linear and/or quadratic forms of the regressors, so using it as is on categorical data is not the best idea. Instead, it is possible to use PROC GAM. It can fit splines and other nonparametric models, as well as semiparametric and parametric models, and it can fit them to binary dependent variables too. With PROC GAM there isn't a simple linear system that can be fed into PROC SCORE for scoring new data, so PROC GAM has a convenient SCORE statement to take care of that for us.


  • MARS (Multivariate Adaptive Regression Splines), which fits piecewise linear regressions, can also be useful when the dependent variable is binary. It uses separate regression slopes in distinct intervals of the predictor variable space. Similar to MARS are PROC LOESS (Version 8), which uses weighted polynomial regression; kernel regression (in INSIGHT); and PROC TRANSREG (Version 8), which uses cubic polynomials in piecewise regression.



    Accessing a Permanent SAS Data Set with User-Defined Formats

    by Irina 2. February 2008 11:07

    If you want to use a permanent SAS data set that has user-defined formats, the only requirement is to remember to tell SAS where to find the formats. If you forget the FMTSEARCH= system option, you will get an error message telling you that SAS cannot find the formats. If you give a copy of a SAS data set with user-defined formats to another user, be sure to give them a copy of the format library as well.
    Example:
    libname qis 'c:\qis\exposure\formats';
    options fmtsearch=(qis);


    A useful PROC FORMAT option is FMTLIB. This option creates a listing of each format in the specified library with the ranges and labels. As an example, if you want to display the definitions of all the formats in your QIS library, you would submit the following code:

    Example:

    proc format library=qis fmtlib;
    run;


    If you wish to remove the formats from variables in your output data set, you can use a FORMAT statement that lists the variables with no format name after them.


    data base1;
       set base;
       format a;   /* a is the variable name; no format name follows, so its format is removed */
    run;


    Tips from Learning SAS® by Example: A Programmer's Guide, by Ron Cody.

    STYLE ON THE CLASS AND CLASSLEV STATEMENTS:

    by Irina 18. January 2008 10:06

    Sometimes we want every row that shares the same id (or some other key) to appear in the same color.
    Although not strictly a reporting procedure, one of the most important procedures in the would-be traffic lighter’s toolkit is PROC FORMAT. User defined formats allow the SAS® programmer to define foreground (font) and background color.

    PROC TABULATE

     In PROC TABULATE, background and foreground colors can be applied through a user-defined format in the various style specifications (for example, on the PROC statement, the VAR statement(s), the CLASS statement(s), the CLASSLEV statement(s), the TABLES statement(s), and the BOX= option). They can also be applied in a user-defined style template. Colors can be assigned directly in the style specification or with a user-defined format.

    STYLE ON THE PROC STATEMENT:
    In general, you would not use style on the PROC TABULATE statement for traffic lighting, because it applies an overall style to the table rather than highlighting particular values. Nonetheless, it’s fun to play with while formatting your reports.
    Syntax:
    proc tabulate data=yourdataset style={background=lightblue};
    This creates a light blue background for your table regardless of the overall style you use.

    We can build the format from the original table: take the list of distinct ids and pair each id with a color.
    proc sort nodupkey data=all_r out=list_value;
    by id;
    run ;
    proc sql;
    create table formats as 
    select  a.id as start,
            a.id as end ,
            color as label,
            'color' as fmtname
    from 
    (select a.* ,monotonic() as misp
    from list_value a ) a,
     color1 b
    where a.misp=(b.misp-5);
    quit;
    Matching on b.misp-5 skips the first five rows of the color table, which removes the black color from the assigned colors.

    File with colors: color.csv (557.00bytes)

     
    data all_r1;
    set all_r;
    code=id;
    run;


    proc format cntlin=formats ;
    run;




    proc sql;
    create table all_r1 as
    select a.* ,monotonic() as misp
    from all_r1 a;
    quit;



    ods html file = 'c:\print_results.html';

    Proc tabulate data=all_r1 ;
    class misp code;
    Var id  new_new_id exposure;
    table misp*code ,  
    id=' '*(  min= 'id' *f=comma10. )*[style=[background=color. font_weight=bold]]
    exposure=' '*(  mean= 'exposure' *f=comma10. )*[style=[  font_weight=bold]];
    run;

    ods html close;
    The result:

    The SAS System

    misp  code  id  exposure
       1    15  15       131
       2    15  15       141
       3    15  15       151
       4    15  15       161
       5    16  16       141
       6    17  17       131
       7    17  17       178
       8    18  18       146
       9    18  18       148
      10    19  19       131

    The QUANTREG Procedure

    by Irina 22. October 2007 12:40
    The QUANTREG procedure models the effects of covariates on the conditional quantiles of a response variable by means of quantile regression. Quantile regression, which was introduced by Koenker and Bassett (1978), extends the regression model to conditional quantiles of the response variable, such as the median or the 90th percentile. Quantile regression is particularly useful when the rate of change in the conditional quantile, expressed by the regression coefficients, depends on the quantile.

  • Quantile regression is also flexible in the sense that it does not involve a link function that relates the variance and the mean of the response variable.
  • Quantile regression also offers a degree of data robustness.
  • Quantile regression cannot be carried out simply by segmenting the unconditional distribution of the response variable and then obtaining least-squares fits for the subsets. This approach leads to disastrous results when, for example, the data include outliers. In contrast, quantile regression uses all of the data for fitting quantiles, even the extreme quantiles.
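To see why a quantile fit uses every observation yet is not dragged around by outliers, it helps to look at the loss it minimizes. Koenker and Bassett's estimator minimizes a sum of asymmetric absolute residuals (the check function). The sketch below, in Python for the arithmetic only, covers the intercept-only case; the function names and the data are illustrative assumptions, and PROC QUANTREG solves the general problem with proper optimization algorithms rather than this search.

```python
def check_loss(residual, tau):
    """Check function: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def total_loss(candidate, ys, tau):
    """Sum of check losses when every observation is predicted by one constant."""
    return sum(check_loss(y - candidate, tau) for y in ys)

def best_constant(ys, tau):
    """Search the observed values for the constant with the smallest total loss."""
    return min(sorted(ys), key=lambda c: total_loss(c, ys, tau))

# Hypothetical sample with one extreme value
ys = [1, 2, 3, 4, 100]
print(best_constant(ys, 0.5))                   # constant chosen for the median
print(best_constant([1, 2, 3, 4, 1000], 0.5))   # same sample with a wilder outlier
```

Because the loss grows linearly rather than quadratically in the residual, moving the extreme value further out does not move the fitted median, which is the robustness mentioned in the list above.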

    /* 0.9 quantile with resampling-based confidence limits and a test of WDRatio */
    proc quantreg data=trout alpha=0.01 ci=resampling;
       model LnDensity = WDRatio / quantile=0.9
                                   CovB CorrB
                                   seed=12345;
       test WDRatio;
    run;

    /* All quantiles, with a plot of the estimates against the quantile level */
    ods html;
    ods graphics on;
    proc quantreg data=trout alpha=0.1 ci=resampling;
       model LnDensity = WDRatio / quantile=all seed=12345
                                   plot=quantplot;
    run;
    ods graphics off;
    ods html close;

    /* One fit per requested quantile; predictions go to outp1, outp2, ... */
    %macro quantiles(NQuant, Quantiles);
       %do i=1 %to &NQuant;
          proc quantreg data=bmimen ci=none algorithm=interior;
             model logbmi = inveage sqrtage age sqrtage*age
                            age*age age*age*age
                            / quantile=%scan(&Quantiles,&i,",");
             output out=outp&i pred=p&i;
          run;
       %end;
    %mend;
    %let quantiles = %str(.03,.05,.10,.25,.5,.75,.85,.90,.95,.97);
    %quantiles(10,&quantiles);

    Using the ROC Curve to Measure Sensitivity & Specificity

    by Irina 16. October 2007 11:58


    Two indices are used to evaluate the accuracy of a test that predicts dichotomous outcomes (e.g. logistic regression) – sensitivity and specificity. They describe how well a test discriminates between cases with and without a certain condition.

    Sensitivity - the proportion of true positives or the proportion of cases correctly identified by the test as meeting a certain condition (e.g. in mammography testing, the proportion of patients with cancer who test positive).

    Specificity - the proportion of true negatives or the proportion of cases correctly identified by the test as not meeting a certain condition (e.g. in mammography testing, the proportion of patients without cancer who test negative).

    Lift - a measure of a predictive model, calculated as the ratio between the results obtained with and without the predictive model.
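These three measures can be computed directly from the four cells of a confusion table. The sketch below is in Python for the arithmetic only, and the counts are made up. Lift is read here as the response rate among the cases the model flags divided by the overall response rate, which is one common reading of "with and without the model"; the text above does not pin the definition down further.

```python
def sensitivity(tp, fn):
    """Proportion of cases with the condition that the test flags as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of cases without the condition that the test flags as negative."""
    return tn / (tn + fp)

def lift(tp, fp, fn, tn):
    """Response rate among flagged cases divided by the overall response rate."""
    flagged_rate = tp / (tp + fp)
    base_rate = (tp + fn) / (tp + fp + fn + tn)
    return flagged_rate / base_rate

# Hypothetical counts: 100 of 1000 cases have the condition
tp, fn, fp, tn = 80, 20, 90, 810
print(sensitivity(tp, fn), specificity(tn, fp), lift(tp, fp, fn, tn))
```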


    Choosing a Cut-off

    The position of the cut-off determines the number of true positives, true negatives, false positives, and false negatives. As you increase sensitivity and identify more of the cases that have the condition, you sacrifice accuracy in identifying those without the condition (specificity). A cut-off value C that balances the two can be chosen by maximizing the index J (Youden's index):

    J = MAX over C of (Sensitivity(C) + Specificity(C) - 1)
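The cut-off search can be sketched as follows, in Python for the arithmetic only. J is computed as sensitivity plus specificity minus 1 (Youden's form); subtracting the constant does not change which cut-off wins. The scores, labels, and function names are illustrative assumptions, and a case is called positive when its score is at or above the cut-off.

```python
def rates_at(cutoff, scores, labels):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in pairs if y == 1 and s < cutoff)
    tn = sum(1 for s, y in pairs if y == 0 and s < cutoff)
    fp = sum(1 for s, y in pairs if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def best_cutoff(scores, labels):
    """Try every observed score as a cut-off and keep the one with the largest J."""
    best_c, best_j = None, float("-inf")
    for c in sorted(set(scores)):
        sens, spec = rates_at(c, scores, labels)
        j = sens + spec - 1.0
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j

# Hypothetical predicted probabilities and true outcomes (1 = has the condition)
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
print(best_cutoff(scores, labels))
```

In SAS the same search could be run over the OUTROC= data set, since it already holds sensitivity and 1 - specificity for each candidate cut-off.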

    Receiver Operating Characteristic (ROC) Curve

    A Receiver Operating Characteristic (ROC) curve is a graphical representation of the trade-off between the false negative and false positive rates for every possible cut-off. By tradition, the plot shows the false positive rate (1 - specificity) on the X axis and the true positive rate (sensitivity, or 1 - the false negative rate) on the Y axis. The accuracy of a test (i.e. the ability of the test to correctly classify cases with a certain condition and cases without the condition) is measured by the area under the ROC curve. An area of 1 represents a perfect test, while an area of .5 represents a worthless test. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test: the true positive rate is high and the false positive rate is low. Statistically, more area under the curve means that the test is identifying more true positives while minimizing the number or percentage of false positives.
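The area under the curve can be approximated from the plotted points with the trapezoidal rule. The sketch below is in Python for the arithmetic only; the function names and data are illustrative assumptions, and it is not the computation PROC LOGISTIC uses to report its c statistic.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) for each cut-off, strictest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]                       # a cut-off above every score
    for c in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= c)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical scores that separate the two groups completely
print(auc(roc_points([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])))
# Hypothetical scores that carry no information (all identical)
print(auc(roc_points([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])))
```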

    ods select parameterestimates association;
    proc logistic data=data1;
       model disease/n=age / outroc=roc1 roceps=0;
       output out=outp p=phat;
       ods output association=assoc;
    run;

    /* Store the c statistic (area under the ROC curve) in a macro variable */
    data _null_;
       set assoc;
       if label2='c' then call symput("area",cvalue2);
    run;

    /* The RUN above ends the DATA step, so &area exists before the TITLE resolves it */
    title "area=&area";
    proc gplot data=roc1;
       plot _sensit_*_1mspec_;
    run;
    quit;

    It is important to use the ROCEPS=0 option in the MODEL statement of PROC LOGISTIC when you fit your model because this option allows all the unique predicted values to be output to the OUTROC= data set. Otherwise, the values may be rounded yielding fewer points on the ROC plot.

    About the author

    Irina Spivak, Team Leader at G-Stat.

