The modeling strategy in general involves three stages: (1) variable specification, (2) interaction assessment, and (3) confounding assessment followed by consideration of precision.
A few statistical issues need attention when we build a model:
multicollinearity, multiple testing, and influential observations.
Multicollinearity occurs when one or more of the independent variables in the model can be approximately determined by some of the other independent variables. When there is multicollinearity, the estimated regression coefficients of the fitted model can be highly unreliable. Consequently, any modeling strategy must check for possible multicollinearity at various steps in the variable selection process.
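One common multicollinearity diagnostic is the variance inflation factor (VIF); with only two independent variables it reduces to 1/(1 - r²), where r is their correlation. The following is a minimal illustrative sketch (the function names and data are hypothetical, not from the original text):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# x2 is nearly a linear function of x1, so the VIF is very large,
# signaling that the two predictors carry almost the same information.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
print(vif_two_predictors(x1, x2))  # a large value flags multicollinearity
```

A VIF well above 10 is a common rule of thumb for serious multicollinearity; with more than two predictors, the VIF for each variable is computed from the R² of regressing it on all the others.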
• The more tests we perform, the more likely we are to obtain significant findings, even when no real effects exist.
• Variable selection procedures may yield an incorrect model because of multiple testing.
• Individual observations may influence the regression coefficients; for example, coefficients may change substantially if an outlier is dropped from the analysis.
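To see why multiple testing matters, note that the chance of at least one false-positive finding grows quickly with the number of tests. A small sketch, assuming independent tests each conducted at level 0.05:

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests,
    each at significance level alpha, when no real effects exist."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 5, 20):
    print(m, round(familywise_error_rate(m), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

With 20 tests, the chance of at least one spurious "significant" result is roughly 64%, which is why automated variable selection over many candidate terms can easily admit noise variables.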
A hierarchically well-formulated model is a model satisfying the following characteristic: given any variable in the model, all lower-order components of that variable must also be contained in the model. For example, if the product term X1*X2 is in the model, then the main effects X1 and X2 must also be in the model.
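The hierarchy requirement can be checked mechanically. In this hypothetical sketch, each term is represented as a tuple of its component variable names, so a main effect is ('X1',) and the product X1*X2 is ('X1', 'X2'):

```python
from itertools import combinations

def is_hierarchically_well_formulated(model_terms):
    """Return True if every lower-order component of every term in the
    model is also a term in the model."""
    terms = {frozenset(t) for t in model_terms}
    for t in terms:
        # Check every proper, nonempty subset of each term's components.
        for k in range(1, len(t)):
            for sub in combinations(t, k):
                if frozenset(sub) not in terms:
                    return False
    return True

# X1*X2 with both main effects present: well formulated.
print(is_hierarchically_well_formulated([('X1',), ('X2',), ('X1', 'X2')]))  # True
# X1*X2 present but the X2 main effect missing: not well formulated.
print(is_hierarchically_well_formulated([('X1',), ('X1', 'X2')]))  # False
```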
The Hierarchical Backward Elimination Approach
The strategy is called hierarchical backward elimination because we work backward
from our largest starting model to a smaller final model, and because we treat variables of different orders at different steps. For those terms that are retained at a given step, there is a rule for identifying the lower-order components that must also be retained in
any further models.
We now consider how logistic regression may be used to analyze matched data.
Matching is a procedure carried out at the design stage
of a study which compares two or more groups. To match, we select a referent group for our study that is to be compared with the group of primary interest, called the index group. Matching is accomplished by constraining the referent group to be comparable to the index group on one or more risk factors, called “matching factors.”
For example, if the matching factor is age, then matching on age would constrain the referent group to have essentially the same age structure as the index group.
The most popular method for matching is called category matching. This involves first categorizing each of the matching factors and then finding, for each case,
one or more controls from the same combined set of matching categories.
For example, if we are matching on age, race, and sex, we first categorize each of these three variables separately.
For each case, we then determine his or her age–race–sex combination. For instance, the case may be 52 years old, white, and female. We then find one or
more controls with the same age–race–sex combination.
If our study involves matching, we must decide on the number of controls to be chosen for each case. If we decide to use only one control for each case, we call
this one-to-one or pair-matching. If we choose R controls for each case, for example, R equals 4, then we call this R-to-1 matching.
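Category matching can be sketched as grouping the control pool by its combination of matching categories and drawing R controls per case. All names and data below are hypothetical, for illustration only:

```python
from collections import defaultdict

def match_controls(cases, pool, keys, R=1):
    """For each case, select R controls from the pool that share the same
    combination of matching categories.
    cases, pool: lists of dicts (one per subject).
    keys: matching-factor names, e.g. ('age_cat', 'race', 'sex').
    Returns a list of (case, [controls]) pairs."""
    by_combo = defaultdict(list)
    for control in pool:
        by_combo[tuple(control[k] for k in keys)].append(control)
    matched = []
    for case in cases:
        combo = tuple(case[k] for k in keys)
        controls = by_combo[combo][:R]
        by_combo[combo] = by_combo[combo][R:]  # do not reuse a control
        matched.append((case, controls))
    return matched

cases = [{"id": 1, "age_cat": "50-59", "race": "white", "sex": "F"}]
pool = [
    {"id": 10, "age_cat": "50-59", "race": "white", "sex": "F"},
    {"id": 11, "age_cat": "40-49", "race": "white", "sex": "F"},
]
# The 52-year-old white female case is matched to control 10,
# the only control in the same age-race-sex category.
print(match_controls(cases, pool, ("age_cat", "race", "sex"), R=1))
```

Real matching software also handles cases with too few eligible controls and may sample controls at random within each category rather than taking the first R.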
The primary advantage of matching over random sampling without matching is that matching can often lead to a more statistically efficient analysis. In particular,
matching may lead to a tighter confidence interval, that is, more precision, around the odds or risk ratio being estimated than would be achieved without matching.
The major disadvantage to matching is that it can be costly, both in terms of the time and labor required to find appropriate matches and in terms of information
loss due to discarding of available controls not able to satisfy matching criteria. In fact, if too much information is lost from matching, it may be possible to lose
statistical efficiency by matching.
In deciding whether to match or not on a given factor, the safest strategy is to match only on strong risk factors expected to cause confounding in the data.
Logistic regression can also account for matching in the analysis of data, using a special method called conditional logistic regression.
The computer calculates odds ratios in much
the same way as McNemar’s test, but the results are “conditioned” on the
matching variables. Matched odds ratios are interpreted in the same way as unmatched odds ratios.
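For pair-matched data with a dichotomous exposure, the McNemar-type matched odds ratio is simply the ratio of the counts of the two kinds of discordant pairs; concordant pairs contribute nothing. A minimal sketch (hypothetical data):

```python
def matched_odds_ratio(pairs):
    """pairs: list of (case_exposed, control_exposed) booleans, one per
    matched pair. Only discordant pairs contribute: OR = f / g, where
    f = pairs with case exposed and control unexposed, g = the reverse."""
    f = sum(1 for case_exp, ctrl_exp in pairs if case_exp and not ctrl_exp)
    g = sum(1 for case_exp, ctrl_exp in pairs if ctrl_exp and not case_exp)
    return f / g

# 10 pairs: 4 with only the case exposed, 2 with only the control exposed,
# and 4 concordant pairs (which do not affect the estimate).
pairs = ([(True, False)] * 4 + [(False, True)] * 2
         + [(True, True)] * 3 + [(False, False)])
print(matched_odds_ratio(pairs))  # 4 / 2 = 2.0
```

Conditional logistic regression generalizes this estimate, yielding the same odds ratio for a single binary exposure in 1-to-1 matched data while also allowing additional covariates.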
There is a SAS macro that fits a conditional logistic regression model to matched or finely stratified data using the PHREG procedure.
The following SAS code fits a conditional logistic regression model to matched case-control data (the data set name MATCHED is illustrative):

proc phreg data=matched;
  model time*case(0) = X1 X2 / ties=discrete;
  strata set;
run;
Here CASE refers to case-control status, with zero indicating the variable level for controls. TIME is a dummy variable in this application and should be coded so that all
cases and controls have the same nonzero value. X1 and X2 are the independent variables of interest. The variable SET is used in the STRATA statement to uniquely
define each matched set.