Same frequency distribution

by Irina 7. July 2007 10:55

A frequency distribution is a summary of the data set in which the interval of possible values is divided into subintervals, known as classes. For each class, the number of data values in that class is recorded; this is the frequency of the class. The relative frequency of the class is the frequency of the class divided by the number of values in the data set.
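To make the definitions concrete, here is a minimal Python sketch (not part of the SAS workflow below) that counts values per class and divides by the size of the data set. The `ages` list and the class edges are made-up illustration data, and `frequency_distribution` is a hypothetical helper name.

```python
from collections import Counter

def frequency_distribution(values, edges):
    """Return {class: (frequency, relative frequency)} for classes [lo, hi)."""
    classes = list(zip(edges, edges[1:]))
    counts = Counter()
    for v in values:
        for lo, hi in classes:
            if lo <= v < hi:
                counts[(lo, hi)] += 1
                break
    n = len(values)  # relative frequency divides by the size of the data set
    return {cls: (counts[cls], counts[cls] / n) for cls in classes}

# Illustrative data only (an assumption, not taken from the loans tables)
ages = [23, 35, 41, 29, 52, 38, 47, 33]
dist = frequency_distribution(ages, [20, 30, 40, 50, 60])
for (lo, hi), (freq, rel) in dist.items():
    print(f"[{lo}, {hi}): frequency={freq}, relative frequency={rel:.3f}")
```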

Sometimes we want to compare two populations, and to do so we want to choose observations from the second population whose frequency distribution is similar to that of the first population.

Example: 

data a7;
set loans.Ekspr_model_result_200704;
if zevet=7;
code=km_L_PASIV_month_b*10+kmin_age;
run;

data a3;
set loans.Ekspr_model_result_200704;
if zevet=3;
code=km_L_PASIV_month_b*10+kmin_age;
run;

We want a sample from a3 whose frequency distribution by pasiv and age (the code variable) is similar to that of a7.

proc freq DATA=a7; 
tables code /out=outkod  noprint;
run;

proc freq DATA=a3; 
tables code /out=outkod3(rename=(percent=percent3 count=count3))  noprint;
run;


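proc sql ;
create table all_res as
/* ratio_lacking: available a3 count relative to the count needed in each class; */
/* min_ratio: the smallest such ratio, i.e. the class that limits the sample size */
select a.*,count3/new_am as ratio_lacking,
min(count3/new_am) as min_ratio
from 
(
select a.*,
b.percent3,
b.count3,
sum(count3) as counter,
/* new_am: a3 count each class would need to follow the a7 percentages */
int(sum(count3)*percent/100) as new_am,

percent/percent3 as ratio 
from outkod a,
outkod3 b
where a.code=b.code ) a;
quit;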


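/* the row with the smallest ratio fixes the largest achievable sample size */
data _null_ ; 
set all_res ;
if min_ratio=ratio_lacking;
num_pop=round(count3*100/percent);
/* symputx stores the number without leading blanks or a conversion note */
call symputx('nn',num_pop);
run;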

%put &nn;

data all_res1;
set all_res;
new_new=round(&nn*percent/100);
ratio_new_new=  (new_new/count3)   ;
round_ratio_new=round(ratio_new_new,0.0001);
run;

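proc sql noprint;
/* one rate per stratum, listed in the same order as the sorted strata */
select round_ratio_new
into :ratio separated BY " " 

from all_res1
order by code; 
quit;

/* STRATA needs sorted input and exactly one rate per stratum, so keep only */
/* the codes that also occur in a7 (a3_common is a new helper table) */
proc sql;
create table a3_common as
select * from a3
where code in (select code from all_res1)
order by code;
quit;

proc surveyselect data=a3_common method=srs
                  rate=(&ratio) out=sample;
strata code;
run;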

proc freq data=a7;
tables code;
run;

proc freq data=sample;
tables code;
run;
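
The idea behind the SAS steps above can be sketched in a few lines of Python. This is a simplified illustration, not a translation of the SAS code: it skips the intermediate rounding, and the names `matching_rates`, `target_codes` and `source_codes` are hypothetical. The class that is scarcest in the second population relative to its share in the first one limits the total sample size, and every other class is then sampled at the rate that reproduces the target shares.

```python
from collections import Counter

def matching_rates(target_codes, source_codes):
    """Per-class sampling rates for the source so that the sample
    follows the class shares observed in the target."""
    target = Counter(target_codes)
    source = Counter(source_codes)
    common = sorted(set(target) & set(source))  # classes present in both
    total = len(target_codes)
    # Largest total sample size that the scarcest class can still support
    n_max = min(source[c] * total / target[c] for c in common)
    # Rate = units needed in the class / units available in the class
    return {c: round(n_max * target[c] / (total * source[c]), 4) for c in common}

# Made-up example: the target is split 50/30/20 over classes A, B, C
target_codes = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
source_codes = ["A"] * 100 + ["B"] * 90 + ["C"] * 20
rates = matching_rates(target_codes, source_codes)
print(rates)
```

In this example class C is the limiting one, so it is taken completely, and A and B are thinned until the 50/30/20 split is reached.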

DETERMINING SAMPLE SIZE

by Irina 20. April 2007 10:11

Sample size determination is computed using three inputs:

  • The estimate of the population standard deviation (often obtained from earlier studies)
  • The acceptable level of sampling error
  • The desired confidence level

    Generally, research practitioners utilize the following sequence and inputs in computing sample size:

    1. Survey respondents will split 50/50 in response to dichotomous (e.g. yes/no) questions.

    2. The desired level of confidence will be 95%, or 1.96 standard deviations from the mean, with .05 possible error.

    Py = Proportion responding “yes”

    Pn = Proportion responding “no”

    Standard error is the acceptable amount of error divided by the number of standard deviations for the chosen confidence level. In the above case it is .05/1.96 (about 2 standard deviations), or .0255102. The standard formula for computing the sample size is:

    $$ n = \frac{(P_y)(P_n)}{(\text{Std Error})^2} $$

    So, when the respective values are input, we end up with .25/.0006507, or about 384 respondents. This is why a survey sample size of 400 is often recommended. Sample size is important in avoiding Type I or Type II errors.
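
The arithmetic can be checked with a short Python sketch. The function name `sample_size` and its parameter names are mine; the default values are the ones used above (a 50/50 split, .05 acceptable error, 1.96 standard deviations).

```python
def sample_size(p_yes=0.5, margin=0.05, z=1.96):
    """Sample size for a proportion: (Py * Pn) / (standard error squared)."""
    p_no = 1 - p_yes
    std_error = margin / z  # .05 / 1.96 = .0255102
    return (p_yes * p_no) / std_error ** 2

n = sample_size()
print(f"standard error = {0.05 / 1.96:.7f}, required sample size = {n:.2f}")
```

Changing `p_yes` away from 0.5 makes the numerator smaller, which is why the 50/50 assumption is the conservative choice.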

    Type I errors  are made by stating that there is a difference between two groups within a population on a given measurement, when in fact there is no difference. Accommodating this potential outcome is where most sample size calculations stop. Often, practitioners simply ignore the possibility of making a Type II error. The sample size typically needed to address Type I errors is 384.

    Type II errors  are made by stating that there is no difference between two groups within a population on a given measurement, when in fact there is a difference. While important, many researchers ignore statistical power calculations. In the “real world” tables and canned statistical tools are utilized to determine survey power, due to the complexity of the formulas. The sample size typically needed to address Type II errors is 1,236.

    Confidence level   suggests that other samples drawn from the same population will have similar values X% of the time. For most marketing research exercises, confidence levels are set at 95%.

    Confidence interval   includes the possible end point values for the entire population. The confidence interval allows for a computed amount of variation from the mean value based on the precision/cost value trade-off.

  • Carl Bergemann

    New SAS Procedures for Analysis of Sample Survey Data

    by Irina 20. April 2007 08:28

    The SURVEYSELECT procedure provides a variety of methods for selecting probability-based random samples.

    PROC SURVEYMEANS and PROC SURVEYREG analyze survey data collected according to a complex survey design.

    SURVEYSELECT

    The SURVEYSELECT procedure provides the following equal probability sampling methods:

  • simple random sampling (without replacement)
  • unrestricted random sampling (with replacement)
  • systematic random sampling
  • sequential random sampling

    This procedure also provides the following probability proportional to size (PPS) methods:

  • PPS without replacement
  • PPS with replacement
  • PPS systematic
  • various PPS algorithms for selecting two units per stratum
  • sequential PPS with minimum replacement

    Example:

    proc surveyselect data=frame out=sample
         method=srs n=(3, 5, 3, 6, 2);
    strata state region;
    run;

    The METHOD=SRS option specifies that simple random sampling is to be used for sample selection. In simple random sampling, units are selected with equal probability and without replacement. The N = (3, 5, 3, 6, 2) option specifies the sample sizes for the strata: a sample of 3 households from the first stratum, 5 households from the second stratum and so on. The OUT=SAMPLE option names the output data set that contains the selected sample. The STRATA statement identifies STATE and REGION as the stratification variables. The input data set FRAME must be sorted by these stratification variables.
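
For readers who want to see the logic outside SAS, here is a rough Python sketch of the same idea: group the frame by the strata variables, then draw a simple random sample without replacement of the requested size in each stratum, taking the strata in sorted order. The frame contents and the helper name `stratified_srs` are invented for the illustration and say nothing about how PROC SURVEYSELECT works internally.

```python
import random
from collections import Counter, defaultdict

def stratified_srs(frame, strata_key, sizes, seed=None):
    """Draw sizes[i] units without replacement from the i-th stratum (sorted order)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for unit in frame:
        groups[strata_key(unit)].append(unit)
    sample = []
    for stratum, n in zip(sorted(groups), sizes):
        sample.extend(rng.sample(groups[stratum], n))
    return sample

# Hypothetical frame: 5 strata (state, region) with 10 households each
strata = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)]
frame = [{"state": s, "region": r, "household": h}
         for (s, r) in strata for h in range(1, 11)]

sample = stratified_srs(frame, lambda u: (u["state"], u["region"]),
                        sizes=(3, 5, 3, 6, 2), seed=2007)
print(Counter((u["state"], u["region"]) for u in sample))
```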

    SURVEYMEANS

    The SURVEYMEANS procedure can compute the following statistics:
  • population total estimate and its standard deviation and corresponding t-test
  • PSU-level mean estimate and its standard error and corresponding t-test
  • 95% confidence limits for the population total and for the PSU-level mean estimates
  • degrees of freedom for the variance estimation
  • mean-per-element estimate and its standard error
  • data summary information
  • combined sampling fraction over strata and the total number of primary sampling units (PSUs)

    Example:

     proc surveymeans data=HHSample N=StrataTotals
          sum df clm fraction;
     var income expense;
     strata state region / list;
     weight weight;
     run;


    SURVEYREG

    The SURVEYREG procedure performs regression analysis for sample survey data. The procedure can handle complex survey sample designs, including designs with stratification, clustering, and unequal weighting. The procedure fits linear models for survey data and computes regression coefficients and their variance-covariance matrix. The procedure also provides significance tests for the model effects and for any specified estimable linear functions of the model parameters. Using the regression model, the procedure can compute predicted values for the sample survey data.

    Example:

    proc surveyreg data=HHSample N=StrataTotals;
    strata state region / list;
    model expense = income;
    weight weight;
    run;


    Tags: sampling

    SAS

    Sampling

    by Irina 19. April 2007 05:11

    Nonprobability Sampling

    The difference between nonprobability and probability sampling is that nonprobability sampling does not involve random selection and probability sampling does. This does not necessarily mean that nonprobability samples aren't representative of the population, but it does mean that nonprobability samples cannot depend upon the rationale of probability theory. With nonprobability samples, we may or may not represent the population well, and it will often be hard for us to know how well we've done so. However, in applied social research there may be circumstances where it is not feasible, practical or theoretically sensible to do random sampling.

    Accidental, Haphazard or Convenience Sampling

    One of the most common methods of sampling goes under the various titles listed here. I would include in this category the traditional "man on the street" (of course, now it's probably the "person on the street") interviews conducted frequently by television news programs to get a quick (although nonrepresentative) reading of public opinion.

    Purposive Sampling

    In purposive sampling, we sample with a purpose in mind. We usually would have one or more specific predefined groups we are seeking, for instance, Caucasian females between 30 and 40 years old. One of the first things to do is verify that the respondent does in fact meet the criteria for being in the sample. Purposive sampling can be very useful for situations where you need to reach a targeted sample quickly and where sampling for proportionality is not the primary concern. With a purposive sample, you are likely to get the opinions of your target population, but you are also likely to overweight subgroups in your population that are more readily accessible.

    • Modal Instance Sampling

    In statistics, the mode is the most frequently occurring value in a distribution. In sampling, when we do a modal instance sample, we are sampling the most frequent case, or the "typical" case.

    • Expert Sampling

    Expert sampling involves the assembling of a sample of persons with known or demonstrable experience and expertise in some area. Often, we convene such a sample under the auspices of a "panel of experts." There are actually two reasons you might do expert sampling. First, it would be the best way to elicit the views of persons who have specific expertise. The other reason you might use expert sampling is to provide evidence for the validity of another sampling approach you've chosen. The disadvantage is that even the experts can be, and often are, wrong.

    • Quota Sampling

    In quota sampling, you select people nonrandomly according to some fixed quota. There are two types of quota sampling: proportional and nonproportional. In proportional quota sampling you want to represent the major characteristics of the population by sampling a proportional amount of each. For instance, if you know the population has 40% women and 60% men, and that you want a total sample size of 100, you will continue sampling until you get those percentages and then you will stop. So, if you've already got the 40 women for your sample, but not the sixty men, you will continue to sample men, but even if legitimate women respondents come along, you will not sample them because you have already "met your quota." The problem here (as in much purposive sampling) is that you have to decide the specific characteristics on which you will base the quota. Will it be by gender, age, education, race, religion, etc.?

    Nonproportional quota sampling is a bit less restrictive. In this method, you specify the minimum number of sampled units you want in each category. Here, you're not concerned with having numbers that match the proportions in the population. Instead, you simply want to have enough to assure that you will be able to talk about even small groups in the population. This method is the nonprobabilistic analogue of stratified random sampling in that it is typically used to assure that smaller groups are adequately represented in your sample.
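
The stopping rule of proportional quota sampling (40 women and 60 men out of 100) is easy to express as a loop. The sketch below is only an illustration in Python: the stream of respondents is invented, and `quota_sample` is a hypothetical helper, not a library function.

```python
from collections import Counter

def quota_sample(stream, quotas, key):
    """Accept respondents in arrival order until every quota is filled."""
    remaining = dict(quotas)
    sample = []
    for person in stream:
        group = key(person)
        if remaining.get(group, 0) > 0:  # quota for this group still open
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):  # all quotas met: stop sampling
            break
    return sample

# Invented arrival stream in which women show up twice as often as men
respondents = [{"id": i, "gender": "woman" if i % 3 else "man"} for i in range(300)]
sample = quota_sample(respondents, {"woman": 40, "man": 60}, key=lambda p: p["gender"])
print(Counter(p["gender"] for p in sample))
```

Women who arrive after the 40th are skipped, exactly as described above, while the loop keeps going until the 60th man has been found.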

    • Heterogeneity Sampling

    We sample for heterogeneity when we want to include all opinions or views, and we aren't concerned about representing these views proportionately. Another term for this is sampling for diversity. In many brainstorming or nominal group processes (including concept mapping), we would use some form of heterogeneity sampling because our primary interest is in getting a broad spectrum of ideas, not identifying the "average" or "modal instance" ones. In effect, what we would like to be sampling is not people, but ideas. We imagine that there is a universe of all possible ideas relevant to some topic and that we want to sample this population, not the population of people who have the ideas. Clearly, in order to get all of the ideas, and especially the "outlier" or unusual ones, we have to include a broad and diverse range of participants. Heterogeneity sampling is, in this sense, almost the opposite of modal instance sampling.

    • Snowball Sampling

    In snowball sampling, you begin by identifying someone who meets the criteria for inclusion in your study. You then ask them to recommend others who they may know who also meet the criteria. Although this method would hardly lead to representative samples, there are times when it may be the best method available. Snowball sampling is especially useful when you are trying to reach populations that are inaccessible or hard to find. For instance, if you are studying the homeless, you are not likely to be able to find good lists of homeless people within a specific geographical area. However, if you go to that area and identify one or two, you may find that they know very well who the other homeless people in their vicinity are and how you can find them.

    Sampling

    by Irina 16. April 2007 12:15

    Sampling is the process of selecting units (e.g., people, organizations) from a population of interest.

    Let's begin by covering some of the

  • key terms in sampling
  • statistical Terms in Sampling

    Probability Sampling

    A probability sampling method is any method of sampling that utilizes some form of random selection. In order to have a random selection method, you must set up some process or procedure that assures that the different units in your population have equal probabilities of being chosen.

       

    Some Definitions

    • N = the number of cases in the sampling frame
    • n = the number of cases in the sample
    • NCn = the number of combinations (subsets) of n from N
    • f = n/N = the sampling fraction
    • Objective: To select n units out of N such that each NCn has an equal chance of being selected.
    • Procedure: Use a table of random numbers, a computer random number generator, or a mechanical device to select the sample.
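
As a small illustration of these definitions, the following Python sketch computes the sampling fraction and the number of possible samples, and then draws one simple random sample with the standard library's random number generator. The values N = 120 and n = 8 and the seed are arbitrary choices for the example.

```python
import math
import random

N = 120  # number of cases in the sampling frame (arbitrary example value)
n = 8    # number of cases in the sample (arbitrary example value)

combos = math.comb(N, n)  # NCn: number of distinct subsets of size n
f = n / N                 # sampling fraction

# random.sample draws without replacement, so every subset of size n
# has the same chance of being the one selected
rng = random.Random(2007)
sample = rng.sample(range(1, N + 1), n)

print(f"f = {f:.4f}, possible samples = {combos}")
print(sorted(sample))
```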

     

    Stratified Random Sampling

    Stratified Random Sampling, also sometimes called proportional or quota random sampling, involves dividing your population into homogeneous subgroups and then taking a simple random sample in each subgroup.

    There are several major reasons why you might prefer stratified sampling over simple random sampling. First, it assures that you will be able to represent not only the overall population, but also key subgroups of the population, especially small minority groups. If you want to be able to talk about subgroups, this may be the only way to effectively assure you'll be able to. If the subgroup is extremely small, you can use different sampling fractions (f) within the different strata to randomly over-sample the small group (although you'll then have to weight the within-group estimates using the sampling fraction whenever you want overall population estimates). When we use the same sampling fraction within strata we are conducting proportionate stratified random sampling. When we use different sampling fractions in the strata, we call this disproportionate stratified random sampling. Second, stratified random sampling will generally have more statistical precision than simple random sampling. This will only be true if the strata or groups are homogeneous. If they are, we expect that the variability within-groups is lower than the variability for the population as a whole. Stratified sampling capitalizes on that fact.
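
The weighting step mentioned above can be sketched as follows. This is a minimal Python illustration with invented numbers: a small stratum is over-sampled, so each sampled unit is weighted by the inverse of its stratum's sampling fraction (N_h / n_h) before the overall mean is computed. The helper name `weighted_mean` is hypothetical.

```python
def weighted_mean(strata):
    """strata: list of (N_h, sample_values); weight per unit is N_h / n_h = 1 / f_h."""
    total = sum(N_h / len(values) * sum(values) for N_h, values in strata)
    population = sum(N_h for N_h, _ in strata)
    return total / population

# Invented example: 900 units in the large stratum, 100 in the small one,
# but 4 units sampled from each (disproportionate stratified sampling)
strata = [
    (900, [10, 12, 14, 12]),  # f = 4/900
    (100, [30, 34, 32, 32]),  # f = 4/100, over-sampled
]

all_values = [v for _, values in strata for v in values]
naive = sum(all_values) / len(all_values)  # ignores the unequal sampling fractions
weighted = weighted_mean(strata)
print(f"naive mean = {naive}, weighted mean = {weighted}")
```

The unweighted mean is pulled toward the over-sampled small stratum, while the weighted estimate gives each stratum its population share.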

    Systematic Sampling

    This is random sampling with a system! From the sampling frame, a starting point is chosen at random, and thereafter units are chosen at regular intervals. For example, suppose you want to sample 8 houses from a street of 120 houses. 120/8=15, so every 15th house is chosen after a random starting point between 1 and 15. If the random starting point is 11, then the houses selected are 11, 26, 41, 56, 71, 86, 101, and 116. If there were 125 houses, 125/8=15.625, so should you take every 15th house or every 16th house? If you take every 16th house, 8*16=128 so there is a risk that the last house chosen does not exist. To overcome this the random starting point should be between 1 and 13, since the last house chosen is the starting point plus 7*16=112. On the other hand if you take every 15th house, 8*15=120 so the last five houses will never be selected. The random starting point should now be between 1 and 20, since the last house chosen is the starting point plus 7*15=105, to ensure that every house has some chance of being selected.
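
The house example can be reproduced with a short Python sketch. The function names are mine; `max_start` simply computes the largest starting point for which the last selected house still exists.

```python
import random

def systematic_sample(population_size, n, step=None, start=None, seed=None):
    """Select n units at a fixed interval after a random (or given) start."""
    if step is None:
        step = population_size // n
    if start is None:
        start = random.Random(seed).randint(1, max_start(population_size, n, step))
    return [start + i * step for i in range(n)]

def max_start(population_size, n, step):
    """Largest start such that start + (n - 1) * step is still in the frame."""
    return population_size - (n - 1) * step

print(systematic_sample(120, 8, start=11))
print(max_start(125, 8, step=16), max_start(125, 8, step=15))
```

With 120 houses and a start of 11 this returns the houses listed above. For 125 houses it shows how far the starting point can move for a step of 16 and for a step of 15.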

    Cluster (Area) Random Sampling

    In cluster sampling the units sampled are chosen in clusters, close to each other. Examples are households in the same street, or successive items off a production line. The population is divided into clusters, and some of these are then chosen at random. Within each cluster units are then chosen by simple random sampling or some other method. Ideally the clusters chosen should be dissimilar so that the sample is as representative of the population as possible. Clearly this strategy will help us to economize on our mileage, but it has possible disadvantages: units close to each other may be very similar and so less likely to represent the whole population, and the sampling error is usually larger than in simple random sampling.
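
A two-stage version of this procedure might look like the Python sketch below: first choose some clusters at random, then take a simple random sample inside each chosen cluster. The streets and houses are invented data and `cluster_sample` is a hypothetical helper.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: SRS within each picked cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Invented population: 10 streets (clusters) with 20 houses each
streets = {
    f"street_{i}": [f"street_{i}/house_{j}" for j in range(1, 21)]
    for i in range(1, 11)
}

sample = cluster_sample(streets, n_clusters=3, n_per_cluster=5, seed=1)
for street, houses in sample.items():
    print(street, houses)
```

Only three streets have to be visited, which is the saving in mileage, but all 15 sampled houses come from those three streets.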

  • About the author

    Irina Spivak
    Team Leader at G-Stat.



      Disclaimer

      The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.

      © Copyright 2012
