Tips for model planning:
12. March 2009 09:17
The modeling strategy in general involves three stages:
(1) variable specification
(2) interaction assessment
(3) confounding assessment, followed by consideration of precision
A few statistical issues need attention when we build a model. These issues are multicollinearity, multiple testing, and influential observations.
Multicollinearity occurs when one or more of the independent variables in the model can
be approximately determined by some of the other independent variables. When
there is multicollinearity, the estimated regression coefficients of the fitted
model can be highly unreliable. Consequently, any modeling strategy must check
for possible multicollinearity at various steps in the variable selection
process.
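One common way to run such a check is the variance inflation factor (VIF). This diagnostic is not named in the text above, so treat it as an illustrative choice. The sketch below covers only the simplest case of two predictors, where VIF = 1 / (1 - r^2) and r is their correlation. The variable names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical data: years_exposed is almost a linear function of age.
age = [30, 35, 40, 45, 50, 55]
years_exposed = [5, 11, 14, 21, 24, 31]
unrelated = [3, 1, 4, 1, 5, 2]

vif_collinear = vif_two_predictors(age, years_exposed)
vif_unrelated = vif_two_predictors(age, unrelated)
print(f"VIF (age, years_exposed): {vif_collinear:.1f}")
print(f"VIF (age, unrelated):     {vif_unrelated:.2f}")
```

A VIF near 1 suggests little collinearity. A frequently quoted rule of thumb flags values above about 10, but that cutoff is a convention rather than a fixed rule.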
Multiple testing
• the more tests we run, the more likely we are to see significant findings, even if there are no real effects
• variable selection procedures may yield an incorrect model because of multiple testing
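The first point can be made concrete with a small calculation. Assuming the tests are independent and each uses the same significance level (both are simplifying assumptions), the chance of at least one false positive among k tests is 1 - (1 - alpha)^k. The function name below is hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests when no real effects exist."""
    return 1.0 - (1.0 - alpha) ** n_tests

for k in (1, 5, 10, 20):
    print(f"{k:2d} tests: P(at least one false positive) = "
          f"{familywise_error_rate(k):.3f}")
```

With 20 independent tests at the 0.05 level, the chance of at least one spurious significant result is already well over one half.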
Influential observations
• individual data points may influence regression coefficients, e.g., an outlier
• coefficients may change if an outlier is dropped from the analysis
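A quick way to see this effect is to fit the same model with and without a suspect point. The sketch below uses a simple least-squares line rather than a logistic model, purely to keep the arithmetic short. The data are made up for illustration.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: the last point (x=20, y=2.0) lies far from the rest.
x = [1, 2, 3, 4, 5, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]

slope_all = ols_slope(x, y)
slope_dropped = ols_slope(x[:-1], y[:-1])  # refit without the outlier
print(f"slope with outlier:    {slope_all:.3f}")
print(f"slope without outlier: {slope_dropped:.3f}")
```

A large change in the coefficient after dropping one observation marks that observation as influential. Whether to exclude it is a separate judgment that depends on the data.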
A hierarchically well-formulated model is a model satisfying the following
characteristic: Given any variable in the model, all lower-order components of
the variable must also be contained in the model.
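The definition can be checked mechanically. In the sketch below, each model term is represented as a tuple of variable names, so the product term E×V1×V2 is ("E", "V1", "V2"). That representation and the function names are assumptions made for illustration. The lower-order components of a term are taken to be all products of a proper, nonempty subset of its factors.

```python
from itertools import combinations

def lower_order_components(term):
    """All products of a proper, nonempty subset of the term's factors."""
    factors = tuple(sorted(term))
    return {
        combo
        for size in range(1, len(factors))
        for combo in combinations(factors, size)
    }

def missing_components(terms):
    """Lower-order components required by the model but absent from it."""
    model = {tuple(sorted(t)) for t in terms}
    required = set()
    for term in model:
        required |= lower_order_components(term)
    return required - model

def is_hierarchically_well_formulated(terms):
    return not missing_components(terms)

# Hypothetical models with exposure E and covariates V1, V2.
full_model = [("E",), ("V1",), ("V2",),
              ("E", "V1"), ("E", "V2"), ("V1", "V2"),
              ("E", "V1", "V2")]
bad_model = [("E",), ("V1",), ("E", "V1", "V2")]

print(is_hierarchically_well_formulated(full_model))
print(sorted(missing_components(bad_model)))
```

The same `lower_order_components` helper gives the set of terms that must be kept once a higher-order term is retained.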
The Hierarchical Backward Elimination Approach
The strategy is called hierarchical backward because we are working backward from our largest starting model to a smaller final one, and we are treating variables of different orders at different steps. For those terms that are retained at a given stage, there is a rule for identifying lower-order components that must also be retained in any further models.
How logistic regression may be used to analyze matched data
Matching is a procedure carried out at the design stage of a study which compares two or more groups. To match, we select a referent group for our study that is to be compared with the group of primary interest, called the index group. Matching is accomplished by constraining the referent group to be comparable to the index group on one or more risk factors, called “matching factors.” For example, if the matching factor is age, then matching on age would constrain the referent group to have essentially the same age structure as the index group.
The most popular method for matching is called category matching. This involves first categorizing each of the matching factors and then finding, for each case, one or more controls from the same combined set of matching categories. For example, if we are matching on age, race, and sex, we first categorize each of these three variables separately. For each case, we then determine his or her age–race–sex combination. For instance, the case may be 52 years old, white, and female. We then find one or more controls with the same age–race–sex combination.
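The category matching steps described above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the 10-year age bands, the record layout, the function names, and the choice to leave a case unmatched when too few controls share its category.

```python
from collections import defaultdict

def age_category(age):
    """Hypothetical 10-year age bands: 52 falls in the band starting at 50."""
    return age // 10 * 10

def matching_key(person):
    """Combined age-race-sex category used for matching."""
    return (age_category(person["age"]), person["race"], person["sex"])

def category_match(cases, controls, r=1):
    """Pick r controls per case from the same combined category.
    Cases with fewer than r available controls are left unmatched."""
    pool = defaultdict(list)
    for control in controls:
        pool[matching_key(control)].append(control)

    matched_sets = []
    for case in cases:
        candidates = pool[matching_key(case)]
        if len(candidates) < r:
            continue  # design choice in this sketch: skip unmatched cases
        chosen = [candidates.pop() for _ in range(r)]
        matched_sets.append({"case": case, "controls": chosen})
    return matched_sets

cases = [{"id": "case1", "age": 52, "race": "white", "sex": "F"}]
controls = [
    {"id": "c1", "age": 55, "race": "white", "sex": "F"},
    {"id": "c2", "age": 58, "race": "white", "sex": "F"},
    {"id": "c3", "age": 52, "race": "white", "sex": "M"},
    {"id": "c4", "age": 30, "race": "white", "sex": "F"},
]

pairs = category_match(cases, controls, r=1)
print([(s["case"]["id"], [c["id"] for c in s["controls"]]) for s in pairs])
```

Each control is used at most once because it is removed from the pool when chosen. A real study would usually pick among eligible controls at random rather than in list order.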
If our study involves matching, we must decide on the number of controls to be chosen for each case. If we decide to use only one control for each case, we call this one-to-one or pair-matching. If we choose R controls for each case, for example R = 4, then we call this R-to-1 matching.
The primary advantage of matching over random sampling without matching is that matching can often lead to a more statistically efficient analysis. In particular, matching may lead to a tighter confidence interval, that is, more precision, around the odds or risk ratio being estimated than would be achieved without matching.
The major disadvantage to matching is that it can be costly, both in terms of the time and labor required to find appropriate matches and in terms of information loss due to discarding of available controls not able to satisfy matching criteria. In fact, if too much information is lost from matching, it may be possible to lose statistical efficiency by matching.
In deciding whether to match or not on a given factor, the safest strategy is to
match only on strong risk factors expected to cause confounding in the data.
Logistic regression can also account for matching in the analysis of data, using a special method called conditional logistic regression. The computer calculates odds ratios in much the same way as McNemar’s test, but the results are “conditioned” on the matching variables.
Interpretation of matched odds ratios (
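For the simplest case, one-to-one matching with a single binary exposure and no other covariates, the link to McNemar’s test can be shown directly: the conditional odds ratio estimate reduces to the ratio of the two kinds of discordant pairs. The sketch below assumes that special case. The function names and the pair counts are hypothetical.

```python
def discordant_counts(pairs):
    """pairs: list of (case_exposed, control_exposed) booleans.
    Returns (b, c): pairs where only the case is exposed, and
    pairs where only the control is exposed."""
    b = sum(1 for case_exp, ctrl_exp in pairs if case_exp and not ctrl_exp)
    c = sum(1 for case_exp, ctrl_exp in pairs if ctrl_exp and not case_exp)
    return b, c

def matched_pair_odds_ratio(pairs):
    """Matched-pair odds ratio b / c; concordant pairs do not contribute."""
    b, c = discordant_counts(pairs)
    if c == 0:
        raise ValueError("odds ratio undefined: no control-only exposed pairs")
    return b / c

def mcnemar_chi_square(pairs):
    """McNemar statistic without continuity correction: (b - c)^2 / (b + c)."""
    b, c = discordant_counts(pairs)
    if b + c == 0:
        raise ValueError("statistic undefined: no discordant pairs")
    return (b - c) ** 2 / (b + c)

# Hypothetical study of 100 matched pairs.
study = ([(True, False)] * 30 + [(False, True)] * 10 +
         [(True, True)] * 20 + [(False, False)] * 40)

print(matched_pair_odds_ratio(study))
print(mcnemar_chi_square(study))
```

With additional covariates or R-to-1 matching there is no such closed form, which is where fitting a conditional logistic model becomes necessary.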
A SAS macro exists that fits a conditional logistic regression model to matched or finely stratified data using the PHREG procedure.
The following SAS code fits a conditional logistic regression model to matched case-control data.
proc phreg;
   /* case = 0 identifies controls */
   model time*case(0) = X1 X2 / ties=discrete;
   /* one stratum per matched set */
   strata set;
run;
Here CASE refers to case-control status, with zero indicating the variable level for controls. TIME is a dummy variable in this application and should be coded so that all cases and controls have the same nonzero value. X1 and X2 are the independent variables of interest. The variable SET is used in the STRATA statement to uniquely define each matched set.
