Mode in SQL

by Irina 13. May 2007 12:11

Mode

A mode is type of statistic that refers to the most frequently occurring value in a sample set.

Calculating a Mode

Problem

You want to calculate a modal average of the bulb-life results in your database.

Solution:

SELECT COUNT(*) frequency,Hours mode FROM BulbLife
GROUP BY Hours
HAVING COUNT(*)>= ALL(
   SELECT COUNT(*)
   FROM BulbLife
   GROUP BY Hours)

Tags: mode, tera, sql

TERADATA

Median in SQL

by Irina 12. May 2007 08:43

Median

A median is yet another type of statistic that you could loosely say refers to the middle value in a sample set. To be more precise, the median is the value in a sample set that has the same number of cases below it as above it.

Calculating a Median

Problem

You want to calculate the median bulb life for all bulbs that have been tested.

Solution

To calculate the median of the light-bulb test results, use the following query:

				SELECT x.Hours median
				FROM BulbLife x, BulbLife y
				GROUP BY x.Hours
				HAVING 
				SUM(CASE WHEN y.Hours <= x.Hours 
				THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND
				SUM(CASE WHEN y.Hours >= x.Hours 
				THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1
		

Discussion

This query follows the definition of the median very closely and uses the solution published severalyears ago by David Rozenshtein, Anatoly Abramovich, and Eugene Birger. Their solution is still regarded as one of the classical solutions to the problem of finding the median value in a sample ;To understand their solution, it helps to look at the query in two phases. First, you have a GROUP BY query that returns the number of bulbs for each distinct lifetime. The following is a modified version of the first part of the solution query that returns the occurrence count corresponding to each distinct bulb-life value:

				SELECT COUNT(*) occurrences, x.Hours xhours
				FROM BulbLife x, BulbLife y
				GROUP BY x.Hours
				 
				occurrences xhours      
				----------- ----------- 
				40          1043
				20          1057
				20          1073
				20          1074
				20          1077
				20          1079
ETC...

Because these results represent a self-join of the BulbLife table with itself, each group represents a number of detail rows equivalent to the number of rows in the sample. The two groups of 40 occurrences each exist because the data contains 2 cases with values of 1043 and 2 cases with values of 1085.

To determine the median follow these steps:

  1. Count the cases where the y.hours value is less than or equal to the x.hours value.

  2. Count the cases where the x.hours value is less than or equal to the y.hours value.

  3. Compare the two results.

The simplified version of median calculating :

select 
max(income) as median_income      
from
(select income,

count(branch_cust_ip)       over (partition by  migzar_nalan) as sum_income,
RANK( ) OVER (PARTITION BY migzar_nalan  ORDER by income)  as   ranker_income,
(( ranker_income *2)/(sum_income+1)) +1    as R_L_hahn
from dwp_sas.ang_basis_migzar1_c )   a
where   R_L_hahn=1

Statistics in SQL

Categorising a continuous variable into deciles

by Irina 12. May 2007 07:53

This page demonstrates methods of categorising continuous variables into deciles with TERADATA Query.

The formula used to calculate the quantile rank of a value is :

FLOOR ( rank*k/(n+1) )

where rank is the value's rank, k is the number of groups specified with the GROUPS= option, and n is the number of observations having nonmissing values of the ranking variable.

Note that this means that when grouping variables into quintiles, the groups will be assigned values 0,1,2,3, and 4.
In the presence of ties, the default behaviour if that each value is assigned the same rank, which is given by the mean of the corresponding ranks. As such, tied values are always assigned to the same quantile.

The query is:

proc sql;
connect to teradata
(user=me password=aaa tdpid=DWPROD);
create table income_april as
select * from connection to teradata
    (select    a.*,
     1 as indikator
case when income>0 then 1 else 0 end as kod_income,
sum(kod_income)       over (partition by  c_age) as sum_income,
case when kod_income=1 then  RANK( ) OVER (PARTITION BY
c_age,kod_income  ORDER by income) end as   ranker_income
case when kod_income =1 then((ranker_income *9)/(sum_income+1)) +1 else 0 end   as R_L_income,

from dwp_sas.income  a    ) ;
disconnect from teradata;
quit;
 

In this case we want to each group of age create deciles and to separate the group with income=0.

About the author

Irina Spivak Irina Spivak
Team Leader at G-Stat. More...


Send mail Email

Authors

Blogroll

    Disclaimer

    The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.

    © Copyright 2010

    Sign in

    eXTReMe Tracker