Find the the preferable channel

by Irina 3. September 2007 10:33
proc sql exec;
connect to teradata as ter
(user=abc password=aa tdpid=DWPROD);
execute ( create volatile table peulot,
         no log as(  select       

           ir.branch_cust_ip ,
           case when Channel_Id in (1101,1102,1103,1105 ) then 1
           when Channel_Id in (1201 ,1203,1207,1208,1212,
           1220,1240,1401,1403,1499) then 2 
           when Channel_Id in (1301,1302,1304,1399) then 3
           when Channel_Id in (2501,1501,1601,1602) then 4 end 
           as code_channel,
           count(*) as counter
                       
from dwp_vall.V0500_1_FINANCIAL_EVENT V0500,   
     dwp_vall.V0X414_COST_TYPE_B as V0X414,
     dwp_sas.population ir    
where V0500.Event_Costing_Activity_Type_Co=V0X414.Cost_Activity_Type_Code
and V0500.branch_cust_ip=ir.branch_cust_ip   
and event_start_date ge 1070101
and event_start_date lt 1070901
group by 1,2    

with data on commit preserve rows)by ter;
 execute (commit work) by ter;
execute (

create volatile table chann_rank,
  no log as ( 
select   basis.branch_cust_ip,
           min( code_channel) as Channel_prior1,
           max( code_channel) as Channel_prior2 
       
from   ( select branch_cust_ip,
         code_channel,
         RANK(  ) OVER (PARTITION BY 
         branch_cust_ip ORDER by counter desc)  as Percentile_arutz
         from peulot
                        
         QUALIFY Percentile_arutz = 1  )  basis
        group by 1)

 with data on commit preserve rows)by ter;
 execute (commit work) by ter;

  create table m_irena as  
  select  * from connection to ter 
  (select * from chann_rank); 
  disconnect from ter;

quit;     
                      
)

QUALIFY an introduction to a conditional clause that, similar to HAVING, further filters rows from a WHERE clause. The major difference between QUALIFY and HAVING is that with QUALIFY the filtering is based on the result of performing various ordered analytical functions on the data.

SQL Procedure Pass-Through Facility: Teradata Specifics

by Irina 9. June 2007 13:31
You can use the pass-through facility of PROC SQL to build your own Teradata SQL statements and then pass them to the Teradata server for execution.

Example 1: Deleting and Recreating a Teradata Table:

proc sql;
connect to teradata
(user=abc password=aaa tdpid=DWPROD);
execute (drop table dwp_sas.ir_reshima_halvaot) by teradata;
execute (create table salary (current_salary float, name char(10))) 
            by teradata;
execute (insert into salary values (35335.00, 'Dan J.')) by teradata;
execute (commit) by teradata;
disconnect from teradata;
quit;
When your DBMS is Teradata, there is an option called FASTLOAD= which can be set to YES.
The DBA suggested weuse it, and it did make a difference in the load time.

Example 2.

data tera_sas.ir_list_loan(fastload=yes);
set population;
run;

Example 3.

The first couple of times that you collect "new" statistics on a table, the AMPs must read every row and take its demographics using a full table scan. However, after you have once collected various column and index statistics, all you need do is issue this command: COLLECT STATISTICS ON ;
It will then read all the rows one time and collect all the statistics. This will great improve performance over recollecting each statistic individually.
proc sql;
   connect to teradata
(user=abc password=aaa tdpid=DWPROD);
   execute (COLLECT STAT ON  dwp_sas.uhl_eutz_model COLUMN  branch_cust_ip) by teradata;  
   execute (commit) by teradata;
   disconnect from teradata;
quit;

Example 4.

FASTEXPORT
The FASTEXPORT utility must be licensed separately from NCR to be used by SAS 9. FASTEXPORT is invoked by using the DBSLICEPARM=ALL option. SAS 9 automatically generates the specialized script required for the FASTEXPORT utility and retrieves the data back from FASTEXPORT on sockets. FASTEXPORT is supported on explicit SQL as well.
libname tra Teradata user=terauser pw=XXXXXX server=boom;
proc freq data=tra.big(dbsliceparm=all);
table x1-x3;
run;
Following is an example using explicit SQL.
proc sql;
connect to teradata(user=terauser password=XXXXXX server=boom
dbsliceparm=all);
select * from connection to teradata
(select * from big);
quit;

Trend estimation

by Irina 25. May 2007 06:17

Trend estimation :

Trend in a time series is a slow, gradual change in some property of the series over the whole interval under investigation. Trend is sometimes loosely defined as a long term change in the mean, but can also refer to change in other statistical properties. For example, tree-ring series of measured ring width frequently have a trend in variance as well as mean.Identification of trend in a time series is subjective because trend in a sample cannot be unequivocally distinguished from low frequency fluctuations.
Curve-fitting.
If a time series changes in level gradually over time, it makes sense to consider as trend some simple function of time itself.
The simplest and most widely used function of time used in detrending is the least-squares-fit straight line, which treats linear trend. Simple linear regression is used to fit the model:

x = a + bt + et

where xt  is the original time series at time t , a is the regression constant, b is the regression  coefficient, and are the regression residuals.The advantage of the straight-line method is simplicity. The straight line may unrealistic, however, in restricting the functional form of the trend.

Trend estimation in Teradata:

SELECT CAST(REGR_SLOPE(deposit , period ) AS DECIMAL(8,4)) as beta,
sqrt(REGR_SXX( deposit , period))  as sxx,
sqrt(REGR_Syy( deposit , period )) as syy,
CAST(REGR_R2(deposit , period)  AS DECIMAL(8,4)) as r,
sqrt(1-r)*syy  as s_e,
cast(s_e/(sqrt(14)*sxx) AS DECIMAL(8,4)) as s_ee,
beta/s_ee as t,
case when abs(t)>1.96 then 1 else 0 end as significant,
case when beta>0  and significant=1 then 1 
     when beta<0  and significant=1 then -1
     else 0 end as trend

Trend estimation in SAS:


data leadprd;
      input date:monyy5. leadprod customer @@;
      format date monyy5.;
      title 'Lead Production Data';
      title2 '(in tons)';
      datalines;
   jan90 38500 1 feb90 37900  1 mar90 36900  1  apr90 38600  1 
   may90 36400 1 jun90 33300  1 jul90 34000  1  aug90 38000  1 
   sep90 37400 1 oct90 42300  1 nov90 36900  1  dec90 34800  1 
   jan91 33900 1 feb91 34000  1 mar91 37200  1  apr91 33300  1 
   may91 29800 1 jun91 24700  1 jul91 30800  1  aug91 31100  1 
   sep91 32400 1 oct91 32900  1 nov91 29100  1  dec91 31800  1 
   jan92 32100 1 feb92 30500  1 mar92 36800  1  apr92 30300  1 
   may92 29500 1 jun92 24700  1 jul92 27600  1  aug92 23800  1 
   sep92 21400 1 feb90 37900  2 mar90 36900  2  apr90 38600  2
   may90 36400 2 jun90 33300  2 jul90 34000  2   aug90 38000 2 
   sep90 37400 2 oct90 42300  2 nov90 36900  2   dec90 34800 2 
   jan91 33900 2 feb91 34000  2 mar91 37200  2   apr91 68800 2 
   may91 75000 2 jun91 85000  2 jul91 10555  2   aug91 11520 2 
   sep91 32400 2 oct91 22500  2 nov91 29100  2   dec91 31800 2 
   jan92 32100 2 feb92 23556  2 mar92 33505  2   apr92 43005 2 
   may92 66500 2 jun92 77550  2 jul92 88800  2   aug92 99990 2 
   ;
   run;

Next produce your forecasts and save their predicted values to SAS data sets. This example uses the forecasting capabilities of the FORECAST, the ARIMA, and the REG procedures. The OUT1STEP option of PROC FORECAST specifies that only the one-step-ahead forecasts are output to the data set LEADOUT1. The LEAD= option produces forecasts for 12 months beyond the sample period.
 proc forecast data=leadprd out=leadout1 out1step
      lead=12 interval=month;
      id date;
      var leadprod;
	  by customer;
   run;

    proc arima data=leadprd;
      i var=leadprod nlag=15;
      e p=1;
      f lead=12 interval=month id=date out=leadout2;
   by customer;
   run;
   quit;

To estimate a time trend for the lead prediction data, it is necessary to create a new variable T that spans both the sample and forecast periods.

data ttrend;
      set leadout2;
      t+1;
   run;
proc reg data=ttrend;
model leadprod = t;
output out=leadout3 p=ptrend;
ods output ParameterEstimates = estim;
ods output  FitStatistics=k;
ods output anova=n;
by customer;
run;
quit;
proc sql;
create table estim_trend as
select a.*,
case when Probt<0.05 then 1 else 0 end as significant,
case when Estimate>0  and calculated significant=1 then 1 
     when Estimate<0  and calculated significant=1 then -1
     else 0 end as trend
 from estim a
 where variable ne 'Intercept';
 quit;

 data final;
      merge leadout1(keep=date leadprod customer
                   rename=(leadprod=pfore)) 
        leadout2(keep=date leadprod forecast customer
                   rename=(leadprod=actual forecast=parima)) 
        leadout3(keep=date ptrend customer);
		by customer date;
   run;
    

Mode in SQL

by Irina 13. May 2007 12:11

Mode

A mode is type of statistic that refers to the most frequently occurring value in a sample set.

Calculating a Mode

Problem

You want to calculate a modal average of the bulb-life results in your database.

Solution:

SELECT COUNT(*) frequency,Hours mode FROM BulbLife
GROUP BY Hours
HAVING COUNT(*)>= ALL(
   SELECT COUNT(*)
   FROM BulbLife
   GROUP BY Hours)

Tags: mode, tera, sql

TERADATA

Median in SQL

by Irina 12. May 2007 08:43

Median

A median is yet another type of statistic that you could loosely say refers to the middle value in a sample set. To be more precise, the median is the value in a sample set that has the same number of cases below it as above it.

Calculating a Median

Problem

You want to calculate the median bulb life for all bulbs that have been tested.

Solution

To calculate the median of the light-bulb test results, use the following query:

				SELECT x.Hours median
				FROM BulbLife x, BulbLife y
				GROUP BY x.Hours
				HAVING 
				SUM(CASE WHEN y.Hours <= x.Hours 
				THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND
				SUM(CASE WHEN y.Hours >= x.Hours 
				THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1
		

Discussion

This query follows the definition of the median very closely and uses the solution published severalyears ago by David Rozenshtein, Anatoly Abramovich, and Eugene Birger. Their solution is still regarded as one of the classical solutions to the problem of finding the median value in a sample ;To understand their solution, it helps to look at the query in two phases. First, you have a GROUP BY query that returns the number of bulbs for each distinct lifetime. The following is a modified version of the first part of the solution query that returns the occurrence count corresponding to each distinct bulb-life value:

				SELECT COUNT(*) occurrences, x.Hours xhours
				FROM BulbLife x, BulbLife y
				GROUP BY x.Hours
				 
				occurrences xhours      
				----------- ----------- 
				40          1043
				20          1057
				20          1073
				20          1074
				20          1077
				20          1079
ETC...

Because these results represent a self-join of the BulbLife table with itself, each group represents a number of detail rows equivalent to the number of rows in the sample. The two groups of 40 occurrences each exist because the data contains 2 cases with values of 1043 and 2 cases with values of 1085.

To determine the median follow these steps:

  1. Count the cases where the y.hours value is less than or equal to the x.hours value.

  2. Count the cases where the x.hours value is less than or equal to the y.hours value.

  3. Compare the two results.

The simplified version of median calculating :

select 
max(income) as median_income      
from
(select income,

count(branch_cust_ip)       over (partition by  migzar_nalan) as sum_income,
RANK( ) OVER (PARTITION BY migzar_nalan  ORDER by income)  as   ranker_income,
(( ranker_income *2)/(sum_income+1)) +1    as R_L_hahn
from dwp_sas.ang_basis_migzar1_c )   a
where   R_L_hahn=1

Statistics in SQL

About the author

Irina Spivak Irina Spivak
Team Leader at G-Stat. More...


Send mail Email

Authors

Blogroll

    Disclaimer

    The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.

    © Copyright 2010

    Sign in

    eXTReMe Tracker