Friday, April 06, 2007

How to Lie with Statistics

  • How to Lie with Statistics
  • Confidence intervals for the predicted values - logistic regression

    Using predict after logistic to get predicted probabilities and confidence
    intervals is somewhat tricky. The following two commands will give you
    predicted probabilities:


    . logistic ...
    . predict phat


    The following does not give you the standard error of the predicted
    probabilities:


    . logistic ...
    . predict se_phat, stdp


    Despite the name we chose, se_phat does not contain the standard error of
    phat. What does it contain? The standard error of the predicted index. The
    index is the linear combination of the estimated coefficients and the
    values of the independent variables for each observation in the dataset.
    Suppose we fit the following logistic regression model:


    . logistic y x


    This command estimates b0 and b1 in the following model:


    P(y = 1) = exp(b0 + b1*x)/(1 + exp(b0 + b1*x))

    Here the index is b0 + b1*x. We could get predicted values of the index and
    its standard error as follows:

    . logistic y x
    . predict lr_index, xb
    . predict se_index, stdp


    We could transform our predicted value of the index into a predicted
    probability as follows:


    . gen p_hat = exp(lr_index)/(1+exp(lr_index))


    This is just what predict does by default after a logistic regression
    if no options are specified. Using a similar procedure, we can get a 95%
    confidence interval for our predicted probabilities by first generating the
    lower and upper bounds of a 95% confidence interval for the index and then
    converting these to probabilities:



    . gen lb = lr_index - invnorm(0.975)*se_index
    . gen ub = lr_index + invnorm(0.975)*se_index
    . gen plb = exp(lb)/(1+exp(lb))
    . gen pub = exp(ub)/(1+exp(ub))
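
    As a minimal, self-contained sketch of the steps above, the following
    do-file runs them end to end. The auto dataset that ships with Stata, the
    choice of foreign and mpg as outcome and predictor, and the use of
    invlogit() and invnormal() in place of the written-out formula and
    invnorm() are all assumptions made for illustration; they presume a
    reasonably recent Stata release.

```stata
* Illustrative data only: the auto dataset shipped with Stata (an assumption)
sysuse auto, clear
logistic foreign mpg

* Default prediction, the index, and the standard error of the index
predict phat
predict lr_index, xb
predict se_index, stdp

* 95% bounds on the index scale, then mapped to the probability scale
generate lb  = lr_index - invnormal(0.975)*se_index
generate ub  = lr_index + invnormal(0.975)*se_index
generate plb = invlogit(lb)
generate pub = invlogit(ub)

* The hand-computed probability should agree with predict's default
generate p_hat = invlogit(lr_index)
summarize phat p_hat plb pub
```

    In the summarize output, phat and p_hat should agree up to storage
    precision, and plb and pub should bracket them for every observation.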


    Generating a confidence interval for the index and then converting its
    bounds to probabilities is better than estimating the standard error of the
    predicted probabilities and building the interval directly from that
    standard error. The distribution of the predicted index is closer to normal
    than that of the predicted probability, and the transformed bounds always
    lie between 0 and 1.
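
    To make the comparison concrete, here is a hedged sketch that builds both
    kinds of interval side by side. The delta-method standard error
    phat*(1 - phat)*se_index is a standard first-order approximation and is not
    taken from the text above; the auto dataset, the model foreign on mpg, and
    all variable names are illustrative assumptions.

```stata
* Illustrative data only: the auto dataset shipped with Stata (an assumption)
sysuse auto, clear
logistic foreign mpg
predict phat
predict lr_index, xb
predict se_index, stdp

* Recommended: interval on the index scale, then transformed
generate plb = invlogit(lr_index - invnormal(0.975)*se_index)
generate pub = invlogit(lr_index + invnormal(0.975)*se_index)

* Alternative: symmetric interval from a delta-method standard error
generate se_phat = phat*(1 - phat)*se_index
generate dlb = phat - invnormal(0.975)*se_phat
generate dub = phat + invnormal(0.975)*se_phat

* Count observations whose symmetric interval leaves [0,1]
count if dlb < 0 | dub > 1
summarize plb pub dlb dub
```

    The transformed interval is asymmetric around phat and cannot leave the
    unit interval. The symmetric interval can, particularly for predictions
    near 0 or 1.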

  • Confidence intervals for the predicted values - logistic regression-stata