Friday, April 06, 2007

How to Lie with Statistics

Confidence intervals for the predicted values - logistic regression

Using predict after logistic to get predicted probabilities and confidence intervals is somewhat tricky. The following two commands will give you predicted probabilities:


. logistic ...
. predict phat


    The following does not give you the standard error of the predicted
    probabilities:


. logistic ...
. predict se_phat, stdp


Despite the name we chose, se_phat does not contain the standard error of phat. What does it contain? The standard error of the predicted index. The index is the linear combination of the estimated coefficients and the values of the independent variables for each observation in the dataset. Suppose we fit the following logistic regression model:


. logistic y x


    This model estimates b0 and b1 of the following model:


P(y = 1) = exp(b0 + b1*x)/(1 + exp(b0 + b1*x))
    Here the index is b0 + b1*x. We could get
    predicted values of the index and its standard error as follows:

. logistic y x
    . predict lr_index, xb
    . predict se_index, stdp


    We could transform our predicted value of the index into a predicted
    probability as follows:


    . gen p_hat = exp(lr_index)/(1+exp(lr_index))


    This is just what predict does by default after a logistic regression
    if no options are specified. Using a similar procedure, we can get a 95%
    confidence interval for our predicted probabilities by first generating the
    lower and upper bounds of a 95% confidence interval for the index and then
    converting these to probabilities:



    . gen lb = lr_index - invnorm(0.975)*se_index
    . gen ub = lr_index + invnorm(0.975)*se_index
    . gen plb = exp(lb)/(1+exp(lb))
    . gen pub = exp(ub)/(1+exp(ub))


Generating the confidence interval for the index and then converting its endpoints to probabilities is better than estimating the standard error of the predicted probability and building the interval directly from that standard error: the distribution of the predicted index is closer to normal than that of the predicted probability, and the transformed bounds always stay between 0 and 1.
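For readers who want the same endpoint-transformation recipe outside Stata, here is a minimal Python sketch (statsmodels and the simulated data are my own illustration, not part of the original example):

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Simulated data: one predictor x, binary outcome y (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

X = sm.add_constant(x)
res = sm.Logit(y, X).fit(disp=0)

# Index (xb) and its standard error, like predict ..., xb / stdp
index = X @ res.params
se_index = np.sqrt(np.einsum("ij,jk,ik->i", X, res.cov_params(), X))

# 95% CI on the index scale, then transformed to the probability scale
z = norm.ppf(0.975)
lb, ub = index - z * se_index, index + z * se_index
plb, pub = 1 / (1 + np.exp(-lb)), 1 / (1 + np.exp(-ub))
p_hat = 1 / (1 + np.exp(-index))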

Wednesday, March 28, 2007

    Hosmer and Lemeshow Test

The Hosmer-Lemeshow goodness-of-fit test can be performed by adding the lackfit option to the model statement. The test divides subjects into deciles based on their predicted probabilities, then computes a chi-square statistic from the observed and expected frequencies.
It tests the null hypothesis that there is no difference between the observed and predicted values of the response variable. Therefore, when the test is not significant, we cannot reject the null hypothesis, and we conclude that the model fits the data well. We can also request a generalized R-square measure for the model by adding the rsq (rsquare) option to the model statement. SAS reports the likelihood-based pseudo R-square measure and its rescaled version.

Categorical Data Analysis Using the SAS System, by M. Stokes, C. Davis, and G. Koch, offers more details on how the generalized R-square measures are constructed and how to interpret them.
proc logistic data=hsb2;
class prog(ref='1') / param=ref;
model hiwrite(event='1') = female prog read math / rsq lackfit;
run;
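To make the mechanics concrete, here is a small Python sketch of the Hosmer-Lemeshow statistic itself (the function name, scipy, and the group-splitting details are my own illustration; SAS's lackfit handles ties and grouping slightly differently):

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    # Sort by predicted probability and split into g near-equal groups
    order = np.argsort(p_hat)
    y = np.asarray(y)[order]
    p_hat = np.asarray(p_hat)[order]
    hl = 0.0
    for idx in np.array_split(np.arange(len(y)), g):
        obs = y[idx].sum()            # observed events in the group
        exp = p_hat[idx].sum()        # expected events in the group
        n = len(idx)
        hl += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return hl, chi2.sf(hl, g - 2)     # df = g - 2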

    Thursday, March 22, 2007

Accumulating the data

%let name=200311+200312+200401+200402+200403+200404;
%let first_d=1031101+1031201+1040101+1040201+1040301+1040401;
%let last_d=1031201+1040101+1040201+1040301+1040401+1040501;

options mprint mlogic;

%macro peulot;
data nohesh.netunim_200311_200404; delete; run;
%do i=1 %to 6;
proc sql;
connect to teradata (user=xxx password=123 tdpid=DWPROD);
create table nohesh.peulot_%scan(&name,&i,+) as
select * from connection to teradata
(select branch_cust_ip,
count(*) as peulot
from bo_vall.V0500_1_FINANCIAL_EVENT as a,
bo_vall.VBM845_FINANCIAL_EVENT_CUST as b
where event_start_date ge %scan(&first_d,&i,+)
and event_start_date lt %scan(&last_d,&i,+)
and a.event_id=b.event_id
group by 1);
disconnect from teradata;
quit;

data nohesh.netunim_200311_200404;
set nohesh.netunim_200311_200404
nohesh.peulot_%scan(&name,&i,+);
run;
%end;
%mend;
%peulot;

    Jarque-Bera hypothesis test of normality

Function JBTest(ReturnVector, SignificanceLevel)
' Jarque-Bera hypothesis test of normality.
'
' Andreas Steiner, March 2006
' http://www.andreassteiner.net/performanceanalysis

n = WorksheetFunction.Max(ReturnVector.Columns.Count, ReturnVector.Rows.Count)

    ReturnVectorMean = WorksheetFunction.Average(ReturnVector)
    ReturnVectorStDev = WorksheetFunction.StDev(ReturnVector)

    ' Normalize returns
    ReDim NormalizedReturns(1 To n)
    For i = 1 To n
    NormalizedReturns(i) = (ReturnVector(i) - ReturnVectorMean) / ReturnVectorStDev
    Next i

    ' Calculate 3rd and 4th moments (skewness and kurtosis)
    S = 0
    K = 0
    For i = 1 To n
    S = S + NormalizedReturns(i) ^ 3
    K = K + NormalizedReturns(i) ^ 4
    Next i
    S = S / n
    K = K / n - 3

    JB = n * ((S ^ 2) / 6 + (K ^ 2) / 24)
    pValue = WorksheetFunction.ChiDist(JB, 2)

' True when the test cannot reject normality at the given level
JBTest = (SignificanceLevel < pValue)

    End Function
    Function JBCriticalValue(ReturnVector, SignificanceLevel)
    ' Jarque-Bera hypothesis test of normality.
    '
    ' Andreas Steiner, March 2006
    ' http://www.andreassteiner.net/performanceanalysis

    JBCriticalValue = WorksheetFunction.ChiInv(SignificanceLevel, 2)

    End Function
    Function JBpValue(ReturnVector, SignificanceLevel)
    ' Jarque-Bera hypothesis test of normality.
    '
    ' Andreas Steiner, March 2006
    ' http://www.andreassteiner.net/performanceanalysis

    n = WorksheetFunction.Max(ReturnVector.Columns.Count, ReturnVector.Rows.Count)

    ReturnVectorMean = WorksheetFunction.Average(ReturnVector)
    ReturnVectorStDev = WorksheetFunction.StDev(ReturnVector)

    ' Normalize returns
    ReDim NormalizedReturns(1 To n)
    For i = 1 To n
    NormalizedReturns(i) = (ReturnVector(i) - ReturnVectorMean) / ReturnVectorStDev
    Next i

    ' Calculate 3rd and 4th moments (skewness and kurtosis)
    S = 0
    K = 0
    For i = 1 To n
    S = S + NormalizedReturns(i) ^ 3
    K = K + NormalizedReturns(i) ^ 4
    Next i
    S = S / n
    K = K / n - 3

    JB = n * ((S ^ 2) / 6 + (K ^ 2) / 24)
    JBpValue = WorksheetFunction.ChiDist(JB, 2)

    End Function
    Function JBStat(ReturnVector, SignificanceLevel)
    ' Jarque-Bera hypothesis test of normality.
    '
    ' Andreas Steiner, March 2006
    ' http://www.andreassteiner.net/performanceanalysis

    n = WorksheetFunction.Max(ReturnVector.Columns.Count, ReturnVector.Rows.Count)

    ReturnVectorMean = WorksheetFunction.Average(ReturnVector)
    ReturnVectorStDev = WorksheetFunction.StDev(ReturnVector)

    ' Normalize returns
    ReDim NormalizedReturns(1 To n)
    For i = 1 To n
    NormalizedReturns(i) = (ReturnVector(i) - ReturnVectorMean) / ReturnVectorStDev
    Next i

    ' Calculate 3rd and 4th moments (skewness and kurtosis)
    S = 0
    K = 0
    For i = 1 To n
    S = S + NormalizedReturns(i) ^ 3
    K = K + NormalizedReturns(i) ^ 4
    Next i
    S = S / n
    K = K / n - 3

    JBStat = n * ((S ^ 2) / 6 + (K ^ 2) / 24)

    End Function
Excel functions

Descriptive statistics

Measures of Skewness and Kurtosis

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Kurtosis measures whether the data are heavy-tailed or light-tailed relative to a normal distribution.
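The same quantities in Python, using scipy (my substitution for the Excel/VBA functions above; small numeric differences can arise from n vs. n-1 conventions in the standard deviation):

import numpy as np
from scipy.stats import skew, kurtosis, jarque_bera

rng = np.random.default_rng(1)
returns = rng.normal(0.001, 0.02, size=250)   # illustrative daily returns

s = skew(returns)                  # third standardized moment
k = kurtosis(returns)              # excess kurtosis (normal = 0)
jb_stat, p_value = jarque_bera(returns)

# The statistic built by hand, mirroring the VBA code
n = len(returns)
jb_manual = n * (s**2 / 6 + k**2 / 24)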


    Tuesday, February 06, 2007

SAS/Excel Tricks

Each proc print below is routed to its own .xls file with ODS HTML (Excel opens the HTML body as a spreadsheet):

ods noresults;
ods listing close;
ods html body="c:\temp\classods.xls";
    proc print data=sashelp.class(obs=10);
    run;
    ods html close;
    ods html body="c:\temp\shoesods.xls";
    proc print data=sashelp.shoes(obs=10);
    run;
    ods html close;
    ods html body="c:\temp\zipcodeods.xls";
    proc print data=sashelp.zipcode(obs=10);
    run;
    ods html close;
    ods listing;
    ods results;



    Macro to Combine Worksheets:



    %macro many2one(in=,out=);
    options noxwait;
    x erase "&out";
    options xwait;

    data _null_;
    file "c:\temp\class.vbs";
    put 'Set XL = CreateObject("Excel.Application")' /
    'XL.Visible=True';
    %let n=1;
    %let from=%scan(&in,&n," ");
    %do %while("&from" ne "");
    %let fromwb=%scan(&from,1,"!");
    %let fromws=%scan(&from,2,"!");
    put "XL.Workbooks.Open ""&fromwb""";
    %if &n=1 %then
    put "XL.ActiveWorkbook.SaveAs ""&out"", -4143"%str(;);
    %else %do;
    put "XL.Workbooks(""%scan(&fromwb,-1,'\')"").Sheets(""&fromws"").Copy ,XL.Workbooks(""%scan(&out,-1,'\')"").Sheets(%eval(&n-1))";
    put "XL.Workbooks(""%scan(&fromwb,-1,'\')"").Close";
    %end;
    %let n=%eval(&n+1);
    %let from=%scan(&in,&n, " ");
    %end;
    put "XL.Workbooks(""%scan(&out,-1,'\')"").sheets(1).activate";
    put "XL.Workbooks(""%scan(&out,-1,'\')"").Save";
    put "XL.Quit";
    run;

    x 'c:\temp\class.vbs';
    %mend;

    Example:

    %many2one(in=c:\temp\classods.xls!classods
    c:\temp\shoesods.xls!shoesods c:\temp\zipcodeods.xls!zipcodeods,
    out=c:\temp\combined.xls);
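As an aside, the combine step can also be done without VBS in Python with pandas (pandas/openpyxl are my substitution, not part of the original macro; the frames below stand in for the exported datasets):

import pandas as pd

frames = {
    "classods": pd.DataFrame({"name": ["Alfred", "Alice"], "age": [14, 13]}),
    "shoesods": pd.DataFrame({"region": ["Africa"], "sales": [29761]}),
}

with pd.ExcelWriter("combined.xlsx") as writer:   # engine: openpyxl
    for sheet, df in frames.items():
        df.to_excel(writer, sheet_name=sheet, index=False)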


    Stratified Random Sampling

    Stratified Random Sampling, also sometimes called proportional or quota random sampling, involves dividing your population into homogeneous subgroups and then taking a simple random sample in each subgroup. In more formal terms:



Objective: Divide the population into non-overlapping groups (i.e., strata)
N1, N2, N3, ..., Ni,
such that N1 + N2 + N3 + ... + Ni = N.
Then take a simple random sample of
f = n/N in each stratum.




There are several major reasons why you might prefer stratified sampling over simple random sampling. First, it assures that you will be able to represent not only the overall population but also key subgroups, especially small minority groups. If a subgroup is extremely small, you can use different sampling fractions (f) within the different strata to randomly over-sample the small group (although you'll then have to weight the within-group estimates using the sampling fraction whenever you want overall population estimates). When we use the same sampling fraction within the strata, we are conducting proportionate stratified random sampling. When we use different sampling fractions in the strata, we call this disproportionate stratified random sampling. Second, stratified random sampling will generally have more statistical precision than simple random sampling. This is only true if the strata are homogeneous: if they are, the variability within groups is lower than the variability for the population as a whole, and stratified sampling capitalizes on that fact.
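A short pandas sketch of both variants (column names and fractions are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "stratum": ["A"] * 80 + ["B"] * 15 + ["C"] * 5,   # illustrative population
    "value": range(100),
})

# Proportionate: the same sampling fraction f within every stratum
sample = df.groupby("stratum").sample(frac=0.2, random_state=42)

# Disproportionate: over-sample the small strata (weight by 1/f_h later
# for overall population estimates)
fractions = {"A": 0.1, "B": 0.4, "C": 0.8}
oversample = pd.concat(
    g.sample(frac=fractions[k], random_state=42) for k, g in df.groupby("stratum")
)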

    Probability Sampling



    A probability sampling method is any method of sampling that utilizes some form of random selection. In order to have a random selection method, you must set up some process or procedure that assures that the different units in your population have equal probabilities of being chosen.



    Some Definitions :


    N = the number of cases in the sampling frame
    n = the number of cases in the sample
    NCn = the number of combinations (subsets) of n from N
    f = n/N = the sampling fraction

Many computer programs can generate a series of random numbers. Assign one to each case in the sampling frame, rearrange the list from the lowest to the highest random number, and take the first n names (say, the first hundred) from this sorted list. Simple random sampling is not the most statistically efficient method of sampling, and you may, just because of the luck of the draw, not get good representation of subgroups in a population.
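The random-number-and-sort recipe in a few lines of Python (numpy is my choice; the names are placeholders):

import numpy as np

frame = [f"name_{i}" for i in range(1000)]   # N cases in the sampling frame
rng = np.random.default_rng(7)
keys = rng.random(len(frame))                # one random number per case
order = np.argsort(keys)                     # sort from lowest to highest key
sample = [frame[i] for i in order[:100]]     # take the first n = 100 names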

    Sunday, February 04, 2007

Creating the same distribution

Sometimes we want to give one variable the same distribution as another. We can do it this way (ttt is compared against the cumulative distribution of the other variable, which we can get from proc freq):

    DATA resh2;
    SET halvaot.resh2;
    IF TARGET =0 THEN DO;
    ttt=ranuni(31311115)*100;
    if ttt<=5 then kod=200502;
    else if ttt<=28 then kod=200503;
    else if ttt<=40 then kod=200504;
    else if ttt<=52 then kod=200505;
    else if ttt<=62 then kod=200506;
    else if ttt<=71 then kod=200507;
    else if ttt<=80 then kod=200508;
    else if ttt<=88 then kod=200509;
    else if ttt<=93 then kod=200510;
    else if ttt<=100 then kod=200511;
    end;
    if target=1 then kod=100* year( Loan_Value_Date)+month(Loan_Value_Date);
    run;


This is a wiser way to do the same thing:



    proc freq DATA=halvaot.new_halv;
    tables Loan_Value_Date /out=outkod outcum noprint;
    run;

    data _null_ ; length kod_str pct_str $5000;
    set outkod end=eof;
    retain kod_str pct_str;
    kod_str=compress(kod_str||','||Loan_Value_Date);
    pct_str=compress(pct_str||','||cum_pct);
    if eof then do;
    call symput('a1',substr(pct_str,2));
    call symput('a2',substr(kod_str,2));
    call symput('nn',_n_);
    end;
    run;

    DATA resh222;
    SET halvaot.resh2;
    array a1 {&nn} _temporary_ (&a1);
    array a2 {&nn} _temporary_ (&a2);
    IF TARGET =0 THEN DO;
    ttt=ranuni(31311115)*100;
    do i=1 to dim(a1);
    if i=1 then do;
    if ttt<=a1[i] then Loan_Value_Date=a2[i];
    end;
    else do;
    if a1[i-1]<ttt<=a1[i] then Loan_Value_Date=a2[i];
    end;
    end;
    end;

    kod=100* year( Loan_Value_Date)+month(Loan_Value_Date);
    run;
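The same idea in Python: draw a uniform number and map it through the cumulative distribution with np.searchsorted, which replaces the array scan in the data step (the values and percentages below are illustrative):

import numpy as np

values  = np.array([200502, 200503, 200504, 200505, 200511])
cum_pct = np.array([5, 28, 40, 52, 100])     # like proc freq's cum_pct

rng = np.random.default_rng(31311115)
ttt = rng.random(10_000) * 100               # uniform on [0, 100)
idx = np.searchsorted(cum_pct, ttt)          # first index with cum_pct >= ttt
draws = values[idx]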

    Saturday, February 03, 2007

Factor Analysis with SAS

At the beginning of the process we have to transpose the data so that each category gets its own column.
    proc transpose data =jjj1 out=toz prefix=peul;
    by branch_cust_ip;
    id Event_Costing_Activity_Type_Co;
    var count;
    run;

We then fill the missing values with 0:


    data toz;
    set toz;
    array toz{*} _NUMERIC_ ;
    do i = 1 to dim(toz);
    if toz{i} = . then toz{i} = 0;
    end;
    drop i;
    run;

    Definition of factors:
    proc factor score data=rehishot.ishit method=p rotate=orthomax nfactors=10 outstat=fact_ish;
    var peul: ;
    run;
    Scoring of the data:
    proc score data=rehishot.ishit score=fact_ish out=scores_ishit;
    var peul: ;
    run;

In the end we want to find the most influential factor (the max)
and the least influential one (the min):
    data scores_ishit;
    set scores_ishit ;
    max=max(Factor1,Factor2,
    Factor3,Factor4,Factor5,
    Factor6,Factor7,Factor8,
    Factor9,Factor10)
    ;
    min=min(Factor1,Factor2,
    Factor3,Factor4,Factor5,
    Factor6,Factor7,Factor8,
    Factor9,Factor10)
    ;
    run;

    data scores_ishit;
    set scores_ishit;
    array factor Factor1-factor10;
    do i=1 to dim(factor);
    if max=factor [i] then factor_max=i;
    if min=factor [i] then factor_min=i;
    end;
    run;
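A rough scikit-learn parallel of the factor/score/argmax steps (the data and names are placeholders, rotation='varimax' needs scikit-learn >= 0.24, and sklearn's maximum-likelihood estimator is not identical to proc factor's method=p):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # stand-in for the peul: columns

fa = FactorAnalysis(n_components=10, rotation="varimax")
scores = fa.fit_transform(X)            # like proc score's output

factor_max = scores.argmax(axis=1) + 1  # most influential factor, 1-based
factor_min = scores.argmin(axis=1) + 1  # least influential factor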

    Friday, February 02, 2007

    proc logistic

    ods trace on;

It lets us see the names of all the output objects a procedure creates,
for example Type3:

    proc logistic data = k outest=jj;
    class ses race schtyp/param =glm;
    model female = age ses race schtyp science write/outroc =roc lackfit;
    units age=35 45 55;
    ods output ParameterEstimates = model_female
    Type3=chisq;
    run;

I want to keep only the significant variables:

    data chisq1;
    set chisq;
    /*format ProbChiSq;*/
    if ProbChiSq>0.05 then delete;
    run;


    PROC SQL ;
    SELECT effect INTO :mm separated BY " "
    from chisq1;
    quit;




    %put &mm;
    proc logistic data = k outest=jj;
    class &mm/param =glm;
    model female = &mm/outroc =roc lackfit;
    units age=35 45 55;
    ods output ParameterEstimates = model_female
Type3=chisq;
    run;
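The keep-only-significant-effects loop, sketched with statsmodels (illustrative data and names; like the SAS version, this is simple filtering by p-value, not a principled selection method):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 4)),
                  columns=["age", "read", "math", "noise"])
y = ((0.8 * df["read"] + 0.5 * df["math"] + rng.normal(size=300)) > 0).astype(int)

X = sm.add_constant(df)
res = sm.Logit(y, X).fit(disp=0)

# Keep effects with p <= 0.05 (always keep the intercept), then refit
keep = res.pvalues[res.pvalues <= 0.05].index.union(["const"])
res2 = sm.Logit(y, X[keep]).fit(disp=0)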

    Export using scan

%let name=1+3+6+12;
%macro stam;
%do i=1 %to 4;
data kim_%scan(&name,&i,+);
set kim;
if vetek=%scan(&name,&i,+)*1;
run;

PROC EXPORT DATA= kim_%scan(&name,&i,+)
OUTFILE= "\\c\kim_%scan(&name,&i,+).csv"
DBMS=CSV REPLACE;
RUN;
%end;
%mend;
%stam;

    Wednesday, August 30, 2006

    Teradata extract function

    case when b.Branch_Cust_IP is null then extract(year from a.Ruler_Date)
    when a.Branch_Cust_IP is null then extract(year from b.Ruler_Date)
    else extract(year from a.Ruler_Date) end as shana

SAS proc mi - fill missing values

    proc mi data=events.no_gil seed=555536 simple nimpute=1 out=ll;
    em itprint outem=outem;
    var gil_max sum_PASIV sum_maskoret sum_KIZBA_YELED sum_KIZBAOT ;
    run;

SAS proc tabulate

    proc tabulate data=shotef.matah ;
    class division_att_Desc Nalan_Cluster_Desc L_GLISHA_YESHIRA;
    var schum internet emda phone pakid_all;
    table ((division_att_Desc='name' all)*Nalan_Cluster_Desc='name'),L_GLISHA_YESHIRA=''*(schum='name'*(sum='name'*f=14. n)
    internet='name'*(mean='name'*f=5.1 n)
    emda='name'*(mean='name'*f=5.1 n)
    phone='name'*(mean='name'*f=5.1 n)
    pakid_all='name'*(mean='name'*f=5.1 n));
format Nalan_Cluster_Desc $nalan. L_GLISHA_YESHIRA membr.;
    keylabel all='total';
    run;

SAS array: dummy variables

    data a;
    input id name $;
    cards;
    10 -11
    10 222
    10 -33
    23 333
    23 -33
    76 -11
    76 222
    76 -33
    ;
    run;
    data b;
    input names $;
    cards;
    -11
    -33
    222
    333
    999
    888
    555
    666
    777
    ;
    run;
    data bb; length nam_nam $200;set b end=eof;
    if _n_=1 then nam_nam=' ';
    retain nam_nam;
    nam_nam=trim(compress(nam_nam||"','"||names));
    if eof then do;
    nam_nam=compress(substr(nam_nam,3)||"'");
    output;
    end;
    call symput('name_macro',nam_nam);
    run;
    proc sort data=a; by id name; run;
    options symbolgen mprint;
    data b;
    set a; by id;
    array nam [9] $ _temporary_ (&name_macro);
    array vars v1-v9;
    retain v1-v9;
if first.id then do jj=1 to dim(vars); /* reset the dummies only when a new id starts */
vars[jj]=.;
end;
    do j=1 to dim(vars);
    if name=nam[j] then vars[j]=1;
    end;
    if last.id then output;
    drop jj j;
    run;

SAS array transpose

    data average_lag6;
    set niud.ovr_lag_all_200110_200201;
    array product ms_kartisim
    maskorot
    l_miuazim
    miuazim
    ms_hk
    sch_hk
    halvaot
    misgeret_ashrai
    osher
    pasiv;

    array name{10} $ 16;
    do i=1 to dim(product);
    name[i]=vname(product[i]);
    schum=product [i];
    varname=name[i];
    month_lag=month_lag6;
    output;
    end;
    keep schum varname month_lag;
    run;


    proc sort data=dug;
    BY b;
    run;

    proc sql;
    select distinct
    year into: tkufa
    separated by '+'
    from dug;
    quit;



    %macro dugma;
    %do i=1 %to 2;
DATA stam (DROP=I year x y year1 year2 z);
    ARRAY years {2} year1- year2;
    ARRAY xs {2} x1-x2;
    ARRAY ys {2} y1-y2;
    DO I=1 TO 2 UNTIL (LAST.b);
    SET dug;
    BY b;
years{I}=year;
xs{I}=x;
ys{I}=y;
x_%scan(&tkufa,&i,+)=xs{I};
    END;

    run;
    %end;
    %mend;
    %dugma;

    Tuesday, August 29, 2006

SAS array combination

    data zevet1.arr;
    set zevet1.for_model;
    array dich[15] ms_harshot_d sum_mispar_hor_keva_d arhrai_d maskorot1 mashkanta1 TIH_HAP1 TIH_TASH1 avg_sch_kranot1 avg_sch_ne1 avg_pkl_pkd1 avg_pkl_pkd_zm1 GEMEL_NZL1 halvaot1 modern1 count;

    array name{15} $ _temporary_ ('ms_hars','hor_keva_d','arhrai_d',
    'maskorot1','mashkanta1','TIH_HAP1','TIH_TASH1', 'sch_kranot1',
    'sch_ne1','pkl_pkd1','pkl_pkd_zm1','GEMEL_NZL1','halvaot1',
    'modern1', 'count' );

    tzvt_activ_form=put(tzvt_activ,zev.);
    kod_vetek=put(vetek_lak,vetek.);
    do i=1 to 15;
    category=dich [i];
    prod=name[i];
    num_product=i;
    output;
    end;
    keep sd10_numerator category num_product tzvt_activ tzvt_activ_form prod kod_vetek;
    run;


    data b4_300;
    set zevet1.for_model(where=(tzvt_activ_form='300'));
    array aa ms_harshot_d sum_mispar_hor_keva_d arhrai_d maskorot1
    avg_pkl_pkd1 ;
    m=0;
    do i=1 to 5;
    do j=i+1 to 5;
    do k=j+1 to 5;

    comb=aa[i]+aa[j]+aa[k];
    name=compress(vname(aa[i])||'_'||vname(aa[j])||'_'||vname(aa[k]));
    m+1;
    if comb<3 then continue;
    else output;


    end;
    end;
    end;
    run;