Median
A median is yet another type of statistic that you could loosely say refers to the middle value in a sample set. To be more precise, the median is the value in a sample set that has the same number of cases below it as above it.
Calculating a Median
Problem
You want to calculate the median bulb life for all bulbs that have been tested.
Solution
To calculate the median of the light-bulb test results, use the following query:
SELECT x.Hours median
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
HAVING
SUM(CASE WHEN y.Hours <= x.Hours
THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND
SUM(CASE WHEN y.Hours >= x.Hours
THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1
Discussion
This query follows the definition of the median very closely and uses the solution published severalyears ago by David Rozenshtein, Anatoly Abramovich, and Eugene Birger. Their solution is still regarded as one of the classical solutions to the problem of finding the median value in a sample ;To understand their solution, it helps to look at the query in two phases. First, you have a GROUP BY query that returns the number of bulbs for each distinct lifetime. The following is a modified version of the first part of the solution query that returns the occurrence count corresponding to each distinct bulb-life value:
SELECT COUNT(*) occurrences, x.Hours xhours
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
occurrences xhours
----------- -----------
40 1043
20 1057
20 1073
20 1074
20 1077
20 1079
ETC...
Because these results represent a self-join of the BulbLife table with itself, each group represents a number of detail rows equivalent to the number of rows in the sample. The two groups of 40 occurrences each exist because the data contains 2 cases with values of 1043 and 2 cases with values of 1085.
To determine the median follow these steps:
- Count the cases where the y.hours value is less than or equal to the x.hours value.
- Count the cases where the x.hours value is less than or equal to the y.hours value.
- Compare the two results.
The simplified version of median calculating :
select
max(income) as median_income
from
(select income,
count(branch_cust_ip) over (partition by migzar_nalan) as sum_income,
RANK( ) OVER (PARTITION BY migzar_nalan ORDER by income) as ranker_income,
(( ranker_income *2)/(sum_income+1)) +1 as R_L_hahn
from dwp_sas.ang_basis_migzar1_c ) a
where R_L_hahn=1
Statistics in SQL