Tips for model planning:

by Irina 12. March 2009 09:17

The modeling strategy in general involves three stages:

(1) variable specification

(2) interaction assessment

(3) confounding assessment followed by consideration of precision

A few statistical issues need attention when we build a model: multicollinearity, multiple testing, and influential observations.

 

Multicollinearity occurs when one or more of the independent variables in the model can be approximately determined by some of the other independent variables. When there is multicollinearity, the estimated regression coefficients of the fitted model can be highly unreliable. Consequently, any modeling strategy must check for possible multicollinearity at various steps in the variable selection process.
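One common way to check for this is the variance inflation factor (VIF), which the text above does not name; treat it as one possible diagnostic. The sketch below covers only the two-predictor case, where VIF = 1/(1 - r^2) and r is the correlation between the two predictors. The variable names and data are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical data: years employed is almost determined by age.
age = [30, 35, 40, 45, 50, 55]
years_employed = [5, 11, 14, 21, 24, 31]

vif = vif_two_predictors(age, years_employed)
print(f"VIF = {vif:.1f}")  # a large value signals multicollinearity
```

A VIF near 1 means the predictor is nearly unrelated to the other one; a rule of thumb often quoted is that values above about 10 deserve a closer look.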

 

Multiple testing

• the more tests we run, the more likely we are to get significant findings, even if there are no real effects

• variable selection procedures may yield an incorrect model because of multiple testing
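To see how quickly the problem grows, here is a small sketch of the chance of at least one false positive across several tests. It assumes the tests are independent and that every null hypothesis is true; both are simplifications, and the function name is only illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n independent tests,
    each run at significance level alpha, when no real effects exist."""
    return 1.0 - (1.0 - alpha) ** n_tests

for k in (1, 5, 20):
    print(k, round(familywise_error_rate(k), 3))
```

With 20 tests at the 0.05 level, the chance of at least one spurious "significant" result is already well above one half.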

 

Influential observations

• individual observations, e.g., an outlier, may strongly influence the regression coefficients

• coefficients may change if the outlier is dropped from the analysis
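A minimal illustration with made-up data: the slope of a simple least-squares line is computed with and without one outlying point. Simple linear regression is used here only to keep the example short; the same idea applies to logistic models.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: the last point is far from the trend of the others.
x = [1, 2, 3, 4, 5, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]

slope_all = ols_slope(x, y)
slope_dropped = ols_slope(x[:-1], y[:-1])
print(f"with outlier: {slope_all:.3f}, without: {slope_dropped:.3f}")
```

A single point is enough to pull the slope from about 1 to nearly 0, which is why influential observations should be examined before trusting the coefficients.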

 

A hierarchically well-formulated model is a model satisfying the following characteristic: Given any variable in the model, all lower-order components of the variable must also be contained in the model.
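The definition can be checked mechanically. In the sketch below each model term is written as a tuple of variable names (a hypothetical encoding chosen for this example), so the product term E*V is ("E", "V") and X squared is ("X", "X"). The lower-order components of a term are all of its smaller sub-products.

```python
from itertools import combinations

def lower_order_components(term):
    """All lower-order components of a product term, e.g. E*V -> E and V."""
    comps = set()
    for size in range(1, len(term)):
        for combo in combinations(term, size):
            comps.add(tuple(sorted(combo)))
    return comps

def is_hierarchically_well_formulated(terms):
    """True if every term's lower-order components are also in the model."""
    model = {tuple(sorted(t)) for t in terms}
    return all(lower_order_components(t) <= model for t in model)

good_model = [("E",), ("V",), ("E", "V")]
bad_model = [("E",), ("E", "V")]          # V itself is missing

print(is_hierarchically_well_formulated(good_model))
print(is_hierarchically_well_formulated(bad_model))
```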

 

The Hierarchical Backward Elimination Approach

The strategy is called hierarchical backward elimination because we work backward from our largest starting model to a smaller final model, and we treat variables of different orders at different steps. For those terms that are retained at a given stage, there is a rule for identifying lower-order components that must also be retained in any further models.

 

 

How logistic regression may be used to analyze matched data

 

Matching is a procedure carried out at the design stage of a study that compares two or more groups. To match, we select a referent group for our study that is to be compared with the group of primary interest, called the index group. Matching is accomplished by constraining the referent group to be comparable to the index group on one or more risk factors, called “matching factors.”

For example, if the matching factor is age, then matching on age would constrain the referent group to have essentially the same age structure as the index group.

 

The most popular method for matching is called category matching. This involves first categorizing each of the matching factors and then finding, for each case, one or more controls from the same combined set of matching categories.

 

For example, if we are matching on age, race, and sex, we first categorize each of these three variables separately. For each case, we then determine his or her age–race–sex combination. For instance, the case may be 52 years old, white, and female. We then find one or more controls with the same age–race–sex combination.

 

If our study involves matching, we must decide on the number of controls to be chosen for each case. If we decide to use only one control for each case, we call this one-to-one or pair-matching. If we choose R controls for each case, for example, R equals 4, then we call this R-to-1 matching.
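The procedure can be sketched in a few lines. Everything specific here is an assumption made for illustration: age is categorized into ten-year bands, people are plain dictionaries, cases without enough matching controls are dropped, and controls are taken in list order (a real study would usually sample them at random).

```python
from collections import defaultdict

def match_key(person):
    """Combined matching category: (age band, race, sex)."""
    age_band = (person["age"] // 10) * 10   # e.g. 52 -> 50
    return (age_band, person["race"], person["sex"])

def category_match(cases, controls, r=1):
    """R-to-1 category matching; returns a list of (case, [controls])."""
    pool = defaultdict(list)
    for control in controls:
        pool[match_key(control)].append(control)
    matched_sets = []
    for case in cases:
        available = pool[match_key(case)]
        if len(available) >= r:
            chosen = [available.pop(0) for _ in range(r)]
            matched_sets.append((case, chosen))
    return matched_sets

cases = [{"id": "case1", "age": 52, "race": "white", "sex": "F"}]
controls = [
    {"id": "ctl1", "age": 55, "race": "white", "sex": "F"},
    {"id": "ctl2", "age": 31, "race": "white", "sex": "M"},
    {"id": "ctl3", "age": 58, "race": "white", "sex": "F"},
]

matched = category_match(cases, controls, r=2)
for case, chosen in matched:
    print(case["id"], [c["id"] for c in chosen])
```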

 

The primary advantage of matching over random sampling without matching is that matching can often lead to a more statistically efficient analysis. In particular, matching may lead to a tighter confidence interval, that is, more precision, around the odds or risk ratio being estimated than would be achieved without matching.

 

The major disadvantage to matching is that it can be costly, both in terms of the time and labor required to find appropriate matches and in terms of information loss due to discarding of available controls not able to satisfy matching criteria. In fact, if too much information is lost from matching, it may be possible to lose statistical efficiency by matching.

 

In deciding whether to match or not on a given factor, the safest strategy is to match only on strong risk factors expected to cause confounding in the data.

The analysis of matched data can be carried out using a stratified analysis in which the strata consist of the collection of matched sets.

 

 

Logistic regression can also account for matching in the analysis of data, using a special method called conditional logistic regression.

 The computer calculates odds ratios in much the same way as McNemar’s test, but the results are “conditioned” on the matching variables. Interpretation of matched odds ratios (MORs) using conditional logistic regression is the same as interpretation of matched odds ratios calculated from tables. A stratified conditional logistic model has the same flexibility as an unconditional model, yet can still take into account the correlation structure attributable to matching.
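For the simplest case, pair matching with a single binary exposure, the matched odds ratio from the table is the ratio of the two kinds of discordant pairs, and the conditional logistic estimate of the odds ratio takes the same value. The sketch below uses made-up counts; the function names are illustrative and this is not how PROC PHREG computes its estimates internally.

```python
def discordant_counts(pairs):
    """pairs: list of (case_exposed, control_exposed) booleans."""
    case_only = sum(1 for ce, co in pairs if ce and not co)
    control_only = sum(1 for ce, co in pairs if co and not ce)
    return case_only, control_only

def matched_odds_ratio(pairs):
    """MOR = (pairs with only the case exposed) / (pairs with only the control exposed)."""
    case_only, control_only = discordant_counts(pairs)
    return case_only / control_only

def mcnemar_chi_square(pairs):
    """McNemar's statistic (no continuity correction), 1 degree of freedom."""
    case_only, control_only = discordant_counts(pairs)
    return (case_only - control_only) ** 2 / (case_only + control_only)

# Hypothetical study of 100 matched pairs.
pairs = ([(True, False)] * 30 + [(False, True)] * 10
         + [(True, True)] * 15 + [(False, False)] * 45)

mor = matched_odds_ratio(pairs)
chi2 = mcnemar_chi_square(pairs)
print(mor, chi2)
```

Only the discordant pairs carry information; the 60 concordant pairs do not enter either calculation.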

There is a SAS macro that fits a conditional logistic regression model to matched or finely stratified data using the PHREG procedure.

Phreg macro

 

The following SAS code fits a conditional logistic regression model to matched case-control data.

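proc phreg;
   /* TIME is a dummy survival time; CASE=0 marks controls (treated as censored) */
   model time*case(0) = X1 X2 / ties=discrete;
   /* each matched set forms its own stratum */
   strata set;
run;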


Here CASE refers to case-control status, with zero indicating the variable level for controls. TIME is a dummy variable in this application and should be coded so that all cases and controls have the same nonzero value. X1 and X2 are the independent variables of interest. The variable SET is used in the STRATA statement to uniquely define each matched set.

 

 

Marketing models

by Irina 24. January 2009 07:07
Some forms of the common models used in marketing:

The linear model:

Y=a+b*X

  • The model is easy to visualize and understand
  • The model can approximate many complicated functions quite well
  • It assumes constant returns to scale
  • It has no upper bound on Y
  • ∆Y/∆X  is constant everywhere and equal to b
  • It often gives managers unreasonable guidance on decisions

 

The power series model:

If we are uncertain about the form of the relationship between X and Y, we can use a power series model:

Y=a+b*X+c*X^2+d*X^3+…

  • The model can take many shapes
  • May fit well within the range of the data
  • Normally behaves badly (becoming unbounded) outside the data range

 

The fractional root model:

Y=a+b*X^c

  • Has a simple but flexible form
  • There are combinations of parameters that give increasing, decreasing, and (with c=1) constant returns to scale
  • When c=1/2 the model is called the square root model. When c=-1 it is called the reciprocal model, and Y approaches the value a when X gets large
  • If a=0, the parameter c has the economic interpretation of elasticity (the percent change in sales, Y, when there is a 1 percent change in marketing effort X). When X is price, c is normally negative, whereas it is positive for most other marketing variables
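The elasticity interpretation is easy to verify numerically. The sketch below evaluates the fractional root model with a=0 before and after a 1 percent increase in X; the parameter values are invented for illustration.

```python
def fractional_root(x, a, b, c):
    """Fractional root response model: Y = a + b * X^c."""
    return a + b * x ** c

def percent_change_in_sales(x, a, b, c, effort_increase_pct=1.0):
    """Percent change in Y when X is raised by effort_increase_pct percent."""
    before = fractional_root(x, a, b, c)
    after = fractional_root(x * (1.0 + effort_increase_pct / 100.0), a, b, c)
    return 100.0 * (after - before) / before

# Advertising with c = 0.5: a 1% increase gives roughly +0.5% sales.
advertising_pct = percent_change_in_sales(x=200.0, a=0.0, b=50.0, c=0.5)
# Price with c = -2: a 1% increase gives roughly -2% sales.
price_pct = percent_change_in_sales(x=10.0, a=0.0, b=500.0, c=-2.0)

print(round(advertising_pct, 3), round(price_pct, 3))
```

With a=0 the result does not depend on the starting level of X or on b, which is exactly what a constant elasticity means.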

The semilog model:

Y=a+b*ln X

The semilog model handles situations in which constant percentage increases in marketing effort result in constant absolute increases in sales. It can be used to represent a response to advertising spending where, after some threshold of awareness, additional spending may have diminishing returns.


The exponential model:

Y=a*e^(b*X), where X>0

This model characterizes situations where there are increasing returns to scale (for b>0). However, it is most widely used as a price-response function with b<0 (increasing returns to decreases in price), in which case Y approaches 0 as X becomes large.


The modified exponential model:

Y=a*(1-e^(-b*X))+c

It has an upper bound or saturation level at a+c and a lower bound of c, and it shows decreasing returns to scale. The model is used as a response function to selling effort.
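A quick numerical check of these properties, using invented parameter values (a=80, b=0.3, c=20):

```python
import math

def modified_exponential(x, a, b, c):
    """Modified exponential response: Y = a * (1 - e^(-b*X)) + c."""
    return a * (1.0 - math.exp(-b * x)) + c

a, b, c = 80.0, 0.3, 20.0

lower_bound = modified_exponential(0.0, a, b, c)        # equals c
near_saturation = modified_exponential(100.0, a, b, c)  # close to a + c

# Gain in Y from each additional unit of selling effort.
gains = [modified_exponential(x + 1, a, b, c) - modified_exponential(x, a, b, c)
         for x in range(5)]

print(lower_bound, round(near_saturation, 6))
print([round(g, 2) for g in gains])  # each gain is smaller than the last
```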


The logistic model: Of the S-shaped models used in marketing, the logistic model is the most common. It has the form

Y=a/(1+e^(-(b+c*X)))+d

This model has a saturation level at a+d and has a region of increasing returns followed by decreasing returns to scale. It is symmetric around d+a/2, it is easy to estimate, and it is widely used.
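The saturation level and the symmetry can be checked directly. The sketch assumes c>0 and uses invented parameter values; the curve passes through d+a/2 at X=-b/c, and points equally far on either side of that X average to the same value.

```python
import math

def logistic_response(x, a, b, c, d):
    """Logistic response: Y = a / (1 + e^(-(b + c*X))) + d."""
    return a / (1.0 + math.exp(-(b + c * x))) + d

a, b, c, d = 100.0, -4.0, 0.8, 10.0

x_mid = -b / c                                   # inflection point
y_mid = logistic_response(x_mid, a, b, c, d)     # d + a/2
y_high = logistic_response(100.0, a, b, c, d)    # close to a + d

# Symmetry: values at x_mid + h and x_mid - h average to d + a/2.
h = 2.0
pair_average = (logistic_response(x_mid + h, a, b, c, d)
                + logistic_response(x_mid - h, a, b, c, d)) / 2.0

print(x_mid, y_mid, round(y_high, 6), pair_average)
```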


The Gompertz model:

A less widely used S-shaped function is the following Gompertz model:

Y=a*b^(c^X)+d, a>0, 0<b<1, c<1

Both the Gompertz and logistic curves lie between a lower bound and an upper bound; the Gompertz curve involves a constant ratio of successive first differences of log Y, whereas the logistic curve involves a constant ratio of successive first differences of 1/Y.

The better-known logistic function is used more often than the Gompertz function because it is easy to estimate.
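The constant-ratio property can be confirmed numerically. The sketch assumes d=0 in both models (with a nonzero d the property holds for Y-d rather than Y), equally spaced X values, and 0<c<1 for the Gompertz curve; the parameter values are invented. Under these assumptions the Gompertz ratio equals c and the logistic ratio equals e^(-c).

```python
import math

def gompertz(x, a, b, c):
    """Gompertz curve with d = 0: Y = a * b^(c^X)."""
    return a * b ** (c ** x)

def logistic(x, a, b, c):
    """Logistic curve with d = 0: Y = a / (1 + e^(-(b + c*X)))."""
    return a / (1.0 + math.exp(-(b + c * x)))

def successive_difference_ratios(values):
    """Ratios of successive first differences of a sequence."""
    diffs = [v2 - v1 for v1, v2 in zip(values, values[1:])]
    return [d2 / d1 for d1, d2 in zip(diffs, diffs[1:])]

xs = range(6)
gompertz_ratios = successive_difference_ratios(
    [math.log(gompertz(x, 100.0, 0.05, 0.6)) for x in xs])
logistic_ratios = successive_difference_ratios(
    [1.0 / logistic(x, 100.0, -3.0, 0.9) for x in xs])

print([round(r, 6) for r in gompertz_ratios])   # all equal to c = 0.6
print([round(r, 6) for r in logistic_ratios])   # all equal to e^(-0.9)
```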


The ADBUDG Model:
 
Y=b+(a-b)*X^c/(d+X^c)
The model is S-shaped for c>1 and concave for 0<c<1. It is bounded between b (lower bound) and a (upper bound). It is widely used to model response to advertising and selling effort.
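The role of the exponent c is easiest to see by comparing marginal gains near zero effort. The sketch uses the usual ADBUDG form, Y=b+(a-b)*X^c/(d+X^c), with invented parameter values.

```python
def adbudg(x, a, b, c, d):
    """ADBUDG response: Y = b + (a - b) * X^c / (d + X^c)."""
    xc = x ** c
    return b + (a - b) * xc / (d + xc)

def first_two_gains(c, step=0.1, a=100.0, b=20.0, d=1.0):
    """Gains in Y over the first two equal steps of effort starting at X = 0."""
    y0 = adbudg(0.0, a, b, c, d)
    y1 = adbudg(step, a, b, c, d)
    y2 = adbudg(2 * step, a, b, c, d)
    return y1 - y0, y2 - y1

s_shaped_gains = first_two_gains(c=2.0)   # second gain larger: increasing returns at first
concave_gains = first_two_gains(c=0.5)    # second gain smaller: decreasing returns throughout

print(adbudg(0.0, 100.0, 20.0, 2.0, 1.0))            # lower bound b
print(round(adbudg(1e6, 100.0, 20.0, 2.0, 1.0), 4))  # approaches upper bound a
print(s_shaped_gains, concave_gains)
```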

The LOESS procedure

by Irina 16. February 2008 05:48

PROC LOESS implements a nonparametric method for estimating local regression surfaces pioneered by Cleveland (1979); also refer to Cleveland et al. (1988) and Cleveland and Grosse (1991). This method is commonly referred to as loess, which is short for local regression.

PROC LOESS allows greater flexibility than traditional modeling tools because you can use it for situations in which you do not know a suitable parametric form of the regression surface. Furthermore, PROC LOESS is suitable when there are outliers in the data and a robust fitting method is necessary.

The main features of PROC LOESS are as follows:
  • fits nonparametric models
  • supports the use of multidimensional predictors
  • supports multiple dependent variables
  • supports both direct and interpolated fitting using kd trees
  • computes confidence limits for predictions
  • performs iterative reweighting to provide robust fitting when there are outliers in the data
  • supports scoring for multiple data sets
    Local Regression and the Loess Method

    Assume that for i = 1 to n, the ith measurement yi of the response y and the corresponding measurement xi of the vector x of p predictors are related by

    yi = g(xi) + ei

    where g is the regression function and ei is a random error. The idea of local regression is that near x = x0, the regression function g(x) can be locally approximated by the value of a function in some specified parametric class. Such a local approximation is obtained by fitting a regression surface to the data points within a chosen neighborhood of the point x0.


    In the loess method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods. The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points. The fraction of the data, called the smoothing parameter, in each local neighborhood controls the smoothness of the estimated surface. Data points in a given local neighborhood are weighted by a smooth decreasing function of their distance from the center of the neighborhood.
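    The steps in this paragraph can be sketched for a single predictor. This is a simplified illustration, not the algorithm PROC LOESS actually runs: it fits a local straight line only, uses the tricube function as the smooth decreasing weight (an assumption), and has no robustness iterations or kd tree interpolation.

```python
import math

def tricube(u):
    """Smooth weight that decreases with |u| and is zero for |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_at(x0, xs, ys, smoothing=0.5):
    """Local linear fit at x0 using the nearest fraction `smoothing` of the data."""
    n = len(xs)
    q = min(n, max(3, math.ceil(smoothing * n)))   # points in the neighborhood
    dist = [abs(x - x0) for x in xs]
    radius = sorted(dist)[q - 1]                   # distance to the q-th nearest point
    w = [tricube(d / radius) if radius > 0 else float(d == 0) for d in dist]
    if sum(w) == 0.0:   # every neighbor sits exactly on the boundary
        w = [1.0 if d <= radius else 0.0 for d in dist]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0          # weighted least squares slope
    return my + slope * (x0 - mx)

# Hypothetical data lying exactly on y = 2x + 1; a local linear fit reproduces it.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
print(loess_at(4.5, xs, ys), loess_at(0.0, xs, ys))
```

    A larger smoothing value widens each neighborhood and gives a smoother curve; a smaller one follows the data more closely.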

  • The result of the procedure is basically a curved regression line, useful at least for data description purposes and as a diagnostic to suggest whether a linear regression is appropriate or not

    Example 1.

       /* Fit a local quadratic surface and save the fit statistics */
       ods output OutputStatistics=PredLOESS;
       proc loess data=ExperimentA;
          model Yield = Temperature Catalyst / scale=sd degree=2 select=gcv;
       run;
       ods output close;

       /* Fit an additive model with loess smoothers to the same data */
       proc gam data=ExperimentA;
          model Yield = loess(Temperature) loess(Catalyst) / method=gcv;
          output out=PredGAM;
       run;

    Although LOESS provides a model of the response surface, it does not provide an equation stating the dependence and does not provide information about interactions and non-linearities. If the span for the preferred LOESS fit is small, it is unlikely that a common function can be found for all the data. If the span is large, then it is quite likely that a common function can be found.
    If we have more than one (or two) outliers or points of influence (leverage points), we can't just drop one point, re-do the hat matrix, drop another point, and re-do the hat matrix one more time. We need a more comprehensive approach like LOESS, M-estimation (which was introduced by Huber in 1973), S-estimation, LTS-estimation, and MM-estimation. All of these (other than LOESS) are in PROC ROBUSTREG.


    It is not recommended to use LOESS for a binary dependent variable. LOESS can certainly handle multivariate data, but the fit is done as a weighted least squares model of linear and/or quadratic forms of the regressors, so using it as is on categorical data is not the best idea. Instead, it is possible to use PROC GAM. It can fit splines and other nonparametric models, as well as semiparametric and parametric models, and it can fit them to binary dependent variables too. With PROC GAM, there isn't a simple linear system that can be fed into PROC SCORE for scoring new data, so PROC GAM has a convenient SCORE statement to take care of that for us.


  • MARS (Multivariate Adaptive Regression Splines), which fits piecewise linear regressions, can also be useful when the dependent variable is binary. It uses separate regression slopes in distinct intervals of the predictor variable space. PROC LOESS (Version 8), which uses weighted polynomial regression, kernel regression (in INSIGHT), and PROC TRANSREG (Version 8), which uses cubic polynomials in piecewise regression, are similar to MARS.



    Elements of model building

    by Irina 31. October 2007 11:31

    Elements of model building:

    The classic approach to model building consists of 3 major parts:
    1. specification
    2. parameterization
    3. validation

    1. Specification (representation or structure) is the representation of the most important elements of the real-world system in mathematical terms. This involves two major steps:

    1. Specifying the variables to be included in the model, and making the distinction between those to be explained (the dependent variables) and those providing the explanation (the explanatory or independent variables). For example, to explain the market share of a brand (dependent variable), we could propose the following explanatory variables: price, advertising expenditures, promotions, distribution, and quality, and measure these variables for the brand and for competing brands. Often this also involves a choice of statistical distribution for those variables, or a distribution for the error term of the dependent variable.

    2. A second aspect is the specification of a functional relationship between the variables. For example, the effects of an explanatory variable can be linear or non-linear, immediate and/or lagged, additive or multiplicative. A choice among those options may be based on a priori reasoning. An additive relationship, for example, implies that the explanatory variables do not interact, while a multiplicative one assumes a specific type of interaction. Also, an S-shaped function indicates increasing returns to scale for low values of an explanatory variable and decreasing returns for high values.

    2. Parameterization (or estimation) is the determination of parameter estimates for a model. For this, data are often available or can be obtained without great effort, but we should be careful with such data. Some variables may be “unobserved” or “latent”, such as attitudes about products, intentions to purchase, and feelings. Sometimes such unobserved variables are omitted from the model, as in stimulus-response models: models with no behavioral details. Alternatively, one can develop instruments to measure the unobserved variables, either directly or as a function of observed or indicator variables.

    Apart from the data collection issues, we have to identify the techniques to be applied for extracting estimates of the model parameters from the data collected.

    3. Validation:

    Validation criteria for model building can relate to:
    The model structure (specification)
    The data quality
    The estimation method
    The applicability of statistical tests
    The model’s relative performance, against alternative models
    The relevance of model results to intended use

    The idea of model selection is that we often have alternative model specifications, and we use data to distinguish between the alternatives. The superiority of one model over another may depend on the product category and on competitive conditions, but also on the quality of the data. Even though theoretical arguments should inform the model specifications, in marketing we want the empirical results to be consistent not only with what sound arguments dictate, but also with how the marketplace behaves subsequent to model testing. With new data, the question is whether extant models apply, and with new models the question is whether the proposed specification outperforms prevailing benchmarks. In marketing, empirical research almost always includes a measure of predictive validity.


    Building Models for Marketing Decisions

    by Irina 29. October 2007 13:34
    Leeflang, Wittink, Wedel and Naert (2000) classify models according to their primary purpose or intended use.
    They distinguish:
    • Descriptive models. These models intend to describe decision processes of managers or customers.
    • Predictive models.   These models forecast or predict future events or outcomes.
    • Normative models.   These models are used to obtain recommended or optimal courses of action.


    Descriptive models are not restricted to decision problems. For example, one can describe the market by the structure of brand loyalty.
    By predictive models we mean models for forecasting or predicting future events. For example, a firm may want to predict sales for a brand under alternative prices, advertising spending levels, and package sizes.

    Indeed, one can argue that for a model to have valid normative implications, it must have predictive value and at least some descriptive power. However, a descriptive model need not have normative implications and a predictive model may not be useful for normative considerations. They also point out that it is often logical to proceed from a descriptive to a predictive and then to a normative model. In other situations, a descriptive model may be sufficient. Forecasting or prediction does not always mean answering “what if” type of questions, such as, how the demand changes if price is increased by 10 percent. In some brand choice models, the structure of brand loyalty and switching is summarized in a transition probability matrix.

    A normative or prescriptive model has a recommended course of action as one of its outputs. For example, the objective function in a media allocation model may be the optimization of profit.

    Demand models make up a special class of predictive models. We refer to a demand model when the performance variable is a measure of the level of demand. This performance variable may depend on a number of other variables, such as the marketing decision variables employed by the firm and its competitors. We distinguish individual demand models and aggregate demand models.

    Aggregate demand may refer to:

    1. The total number of units of a product category purchased by the population of all spending units. The corresponding demand model is called a model of industry sales, or a product class sales model.
    2. The total number of units of a particular brand purchased by the population of all spending units. The demand model is then a brand sales model.
    3. The number of units of a particular brand purchased by the total population relative to the total number of units purchased of the product class, in which case the demand model is a market share model.

    Gatignon and Robertson (1986) identify three types of models, which differ in their objectives and implications:


    Theoretical models. These models offer a mathematical description of a process in which some constructs are systematically joined to others. The objective is to generate theoretical propositions that appropriately describe the possible influence of variables on the diffusion pattern and diffusion rate. These descriptions are the raison d’être of theoretical models and should provide suggestions to managers.

    Normative models. These models also start with a description and assume functional relationships among the variables that affect the diffusion process. The behavioral assumptions may be less complex than those of theoretical models, given that the objective is not to make descriptive propositions but to develop optimal marketing strategies. An objective function for the firm is determined, and the model implications are expressed with respect to variables incorporated into the model.

    Empirical models. The objective of these models is to fit data and test a specific theoretical proposition or a complete model. Marketing has focused more on empirical and normative models than on theoretical models.

    About the author

    Irina Spivak
    Team Leader at G-Stat.



      Disclaimer

      The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.

      © Copyright 2013
