Median in SQL

by Irina 12. May 2007 08:43

Median

A median is yet another type of statistic that you could loosely say refers to the middle value in a sample set. To be more precise, the median is the value in a sample set that has the same number of cases below it as above it.

Calculating a Median

Problem

You want to calculate the median bulb life for all bulbs that have been tested.

Solution

To calculate the median of the light-bulb test results, use the following query:

				SELECT x.Hours median
				FROM BulbLife x, BulbLife y
				GROUP BY x.Hours
				HAVING 
				SUM(CASE WHEN y.Hours <= x.Hours 
				THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND
				SUM(CASE WHEN y.Hours >= x.Hours 
				THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1
		

Discussion

This query follows the definition of the median very closely and uses the solution published severalyears ago by David Rozenshtein, Anatoly Abramovich, and Eugene Birger. Their solution is still regarded as one of the classical solutions to the problem of finding the median value in a sample ;To understand their solution, it helps to look at the query in two phases. First, you have a GROUP BY query that returns the number of bulbs for each distinct lifetime. The following is a modified version of the first part of the solution query that returns the occurrence count corresponding to each distinct bulb-life value:

				SELECT COUNT(*) occurrences, x.Hours xhours
				FROM BulbLife x, BulbLife y
				GROUP BY x.Hours
				 
				occurrences xhours      
				----------- ----------- 
				40          1043
				20          1057
				20          1073
				20          1074
				20          1077
				20          1079
ETC...

Because these results represent a self-join of the BulbLife table with itself, each group represents a number of detail rows equivalent to the number of rows in the sample. The two groups of 40 occurrences each exist because the data contains 2 cases with values of 1043 and 2 cases with values of 1085.

To determine the median follow these steps:

  1. Count the cases where the y.hours value is less than or equal to the x.hours value.

  2. Count the cases where the x.hours value is less than or equal to the y.hours value.

  3. Compare the two results.

The simplified version of median calculating :

select 
max(income) as median_income      
from
(select income,

count(branch_cust_ip)       over (partition by  migzar_nalan) as sum_income,
RANK( ) OVER (PARTITION BY migzar_nalan  ORDER by income)  as   ranker_income,
(( ranker_income *2)/(sum_income+1)) +1    as R_L_hahn
from dwp_sas.ang_basis_migzar1_c )   a
where   R_L_hahn=1

Statistics in SQL

About the author

Irina Spivak Irina Spivak
Team Leader at G-Stat. More...


Send mail Email

Authors

Blogroll

    Disclaimer

    The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.

    © Copyright 2010

    Sign in

    eXTReMe Tracker