Minitab/ multiple regression


Hi

I'm trying to get to grips with the program Minitab, mainly multiple
regression. My question is: could someone explain each area of the
results table shown below, especially what each part means? This is
just an example, so I just need to know how each value works, i.e.
what p values are, what effect they have, etc.

I need this in the next 4 hours thanks very much

Regression Analysis: Total£ versus GFarea, Bedrooms

The regression equation is
Total£ = - 36280 + 84.2 GFarea + 20629 Bedrooms

79 cases used 2 cases contain missing values

Predictor        Coef     SE Coef          T        P
Constant       -36280       20143      -1.80    0.076
GFarea         84.198       9.779       8.61    0.000
Bedrooms        20629        4903       4.21    0.000

S = 49455       R-Sq = 66.5%     R-Sq(adj) = 65.7%

Analysis of Variance

Source            DF          SS          MS         F        P
Regression         2 3.69762E+11 1.84881E+11     75.59    0.000
Residual Error    76 1.85878E+11  2445767700
Total             78 5.55641E+11

Source       DF      Seq SS
GFarea        1 3.26460E+11
Bedrooms      1 43302466403

Unusual Observations

Obs     GFarea     Total£         Fit      SE Fit    Residual    St Resid
  8       2504     381984      257070        9787      124914      2.58R
 18       1996     100300      214297        6362     -113997     -2.32R
 49       3076     274958      284602       17274       -9644     -0.21 X
 53       1980     374256      295467       19406       78789      1.73 X
 54       3501     453744      382274       17160       71470      1.54 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Hello,

In answering this question, I shall precede your initial question with
>'s

> The regression equation is
> Total£ = - 36280 + 84.2 GFarea + 20629 Bedrooms 

In regression, we are trying to get an estimate which best predicts
the outcome.  The outcome in this case is Total£.  If we know GFArea
and Bedrooms, our best guess at Total£ is given by this equation.
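
To make the equation concrete, here is a minimal sketch in Python that
plugs values into it. The coefficients are copied from the output; the
function name and the example house (2000 units of GFarea, 3 bedrooms)
are made up for illustration.

```python
# Coefficients copied from the Minitab output.
INTERCEPT = -36280.0
COEF_GFAREA = 84.198
COEF_BEDROOMS = 20629.0


def predict_total(gfarea, bedrooms):
    """Return the fitted Total£ for one house (hypothetical helper)."""
    return INTERCEPT + COEF_GFAREA * gfarea + COEF_BEDROOMS * bedrooms


# Hypothetical house: 2000 units of ground floor area, 3 bedrooms.
example_fit = predict_total(2000, 3)
print(round(example_fit))  # 194003
```

Setting both predictors to zero returns the constant, -36280, which is
why the constant is read as the prediction for a house with no area and
no bedrooms.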

> 79 cases used 2 cases contain missing values 

You had 81 rows in your data, however two of these rows contained
missing data, for at least one of the variables, and so were not used
in the analysis.
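
Dropping any row with a missing value is often called listwise
deletion. A small sketch of the idea, using made-up rows (not your
data) where None marks a missing value:

```python
# Hypothetical rows of (Total, GFarea, Bedrooms); None marks a missing value.
rows = [
    (250000, 2100, 3),
    (180000, None, 2),   # missing GFarea
    (310000, 2600, 4),
    (None, 1900, 3),     # missing Total
]

# Keep only the rows where every variable is present.
complete = [r for r in rows if all(v is not None for v in r)]
n_missing = len(rows) - len(complete)
print(f"{len(complete)} cases used, {n_missing} cases contain missing values")
```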

> Predictor        Coef     SE Coef          T        P
> Constant       -36280       20143      -1.80    0.076
> GFarea         84.198       9.779       8.61    0.000
> Bedrooms        20629        4903       4.21    0.000 

This next part gives some of the details of the equation.

Each of the estimates (coefficients, indicated with Coef) has a
standard error - this is a measure of how variable the estimate is
likely to be.  To get the 95% confidence interval for a coefficient,
we multiply the standard error by 1.96, and add and subtract this
from the coefficient.

So our best guess at the coefficient for GFArea is 84.  However, this
estimate has a standard error of (approximately) 10.  So the
confidence limits are given by 84 + 1.96 x 10 = 104 and
84 - 1.96 x 10 = 64.  If we were to say that the true (population)
value for the coefficient is likely to be from 64 to 104, there is
only a 5% chance (1 in 20) that we are wrong.  That is, one more unit
of GFArea adds between 64 and 104 units of Total£, with Bedrooms held
constant.
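
The same arithmetic with the unrounded figures from the table, as a
quick Python sketch. It uses the 1.96 multiplier from above, which is
a large-sample approximation.

```python
coef, se = 84.198, 9.779   # GFarea row of the Minitab table
z = 1.96                   # multiplier used in the text for a 95% interval

lower = coef - z * se
upper = coef + z * se
print(f"95% CI for GFarea: {lower:.1f} to {upper:.1f}")  # 65.0 to 103.4
```

The rounded figures (84 and 10) give 64 to 104; the unrounded ones
give a slightly narrower interval.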

The next part of the model is T.  T is given by Coef / SE.   

So: -36280 / 20143 = -1.80    

T isn't very useful on its own, but it does give us P - that is the
probability of getting a result at least this far from zero, if the
real value in the population is zero.  Minitab prints 0.000 when P is
less than 0.0005.  The fact that GFarea and Bedrooms both have such
low probabilities means that it is very unlikely you would have found
this result, if in fact they had no effect.
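
Here is a rough Python sketch that recomputes T for each row and then
approximates P. One loud assumption: it uses the normal distribution
in place of the t distribution that Minitab uses. With 76 error
degrees of freedom the two are close but not identical.

```python
import math

# (Coef, SE Coef) pairs copied from the Minitab table.
table = {
    "Constant": (-36280.0, 20143.0),
    "GFarea": (84.198, 9.779),
    "Bedrooms": (20629.0, 4903.0),
}


def two_sided_p_normal(t):
    """Two-sided tail probability under a standard normal (an approximation)."""
    return math.erfc(abs(t) / math.sqrt(2))


t_values = {name: coef / se for name, (coef, se) in table.items()}
p_approx = {name: two_sided_p_normal(t) for name, t in t_values.items()}

for name in table:
    print(f"{name:10s} T = {t_values[name]:6.2f}   approx P = {p_approx[name]:.3f}")
```

The T values match the table. The approximate P for the constant
comes out slightly below Minitab's 0.076, because the normal
distribution has thinner tails than the t distribution.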

The constant is a special case - it is the estimated value of Total£
when all of the predictors are zero.  It often makes no sense - as is
the case here (I am guessing).  To say that a house with 0 bedrooms
and 0 ground floor area is worth -36280 is obviously silly.

> S = 49455       R-Sq = 66.5%     R-Sq(adj) = 65.7% 

OK, so this equation tells us the best guess at Total£, but the next
question is, how good is that?  This is given by R-Sq - or R-Squared.
R-Squared is the proportion of variance in Total£ which is explained
by the predictors - in this case, it is 66.5% - quite a high
proportion.  If you take this as 0.665, and find the square root, it
is about 0.82.  This is the correlation between the predicted score
(given by the equation) and the actual score.
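
All three figures on the S / R-Sq line can be recovered from the sums
of squares in the Analysis of Variance part of the output. A short
Python sketch follows. R-Sq(adj) and S are not covered in the
explanation above, so the formulas used for them here are the standard
textbook ones, stated as an assumption about what Minitab computes.

```python
import math

# Sums of squares copied from the Analysis of Variance table.
ss_regression = 3.69762e11
ss_error = 1.85878e11
ss_total = 5.55641e11
n, k = 79, 2   # cases used, number of predictors

r_sq = ss_regression / ss_total                     # proportion of variance explained
multiple_r = math.sqrt(r_sq)                        # correlation of predicted with actual
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)   # penalised for number of predictors
s = math.sqrt(ss_error / (n - k - 1))               # typical size of a residual

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, S = {s:.0f}, R = {multiple_r:.3f}")
```

These come out at 66.5%, 65.7% and 49455, matching the output, with a
multiple correlation of about 0.816.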

The next question to ask is whether this is a good prediction - i.e.
is this prediction better than chance.

> Source            DF          SS          MS         F        P
> Regression         2 3.69762E+11 1.84881E+11     75.59    0.000
> Residual Error    76 1.85878E+11  2445767700
> Total             78 5.55641E+11 

> Source       DF      Seq SS
> GFarea        1 3.26460E+11
> Bedrooms      1 43302466403 

This is answered by the next section.  This is usually not reported in
depth, so I am not going to cover it here, but request clarification
if you need it.  The P value again tells us whether we can make a
significant prediction, or whether we are better off guessing.
Because this p-value is very low, the prediction is highly
significant, and better than guessing.  The most common thing here is
to report the p-value (again, it's <0.0005, it's not 0.000).  You
might also want to report the F and the DF, in which case it's
F = 75.6, df = 2, 76.

> Obs  GFarea   Total£      Fit        SE Fit    Residual    StResid
>  8    2504    381984      257070        9787      124914   2.58R
> 18    1996    100300      214297        6362     -113997  -2.32R
> 49    3076    274958      284602       17274       -9644  -0.21 X
> 53    1980    374256      295467       19406       78789   1.73 X
> 54    3501    453744      382274       17160       71470   1.54 X

>R denotes an observation with a large standardized residual
>X denotes an observation whose X value gives it large influence.

The final part of the output is some diagnostics, to help you to
interpret the equation.  Minitab has selected some cases it believes
you might want to look at.  It bases this on the residuals and the
influence.

First, the residuals.  The residual is the difference between the
value we would expect, given GFArea and Bedrooms, and what we actually
have.    Large residuals are marked with an R. Case 8 has a very large
residual - its value for Total£ is 124914 higher than would be
expected.  Similarly, 18 has a much lower value than would be
expected.  It is worth looking at these to see if there has been an
error entering the data, or there is something unusual about them.
Maybe they are in a very different area to the others, maybe they are
paved with gold, or come with a free farm (I am guessing, because I
have no idea what the data are about - I am sure you can think of
something more sensible).
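
The Residual and St Resid columns can be reproduced from the other
columns, as in this Python sketch. The standardized residual is the
residual divided by its estimated standard error, which works out as
sqrt(S^2 - SE Fit^2). The cut-off of 2 for the R flag is my assumption
about the rule Minitab applies.

```python
import math

S = 49455.0  # residual standard deviation from the Minitab output

# (Obs, Total, Fit, SE Fit) copied from the Unusual Observations table.
unusual = [
    (8, 381984, 257070, 9787),
    (18, 100300, 214297, 6362),
    (49, 274958, 284602, 17274),
    (53, 374256, 295467, 19406),
    (54, 453744, 382274, 17160),
]


def standardized_residual(observed, fit, se_fit, s=S):
    """Residual divided by its estimated standard error, sqrt(s^2 - se_fit^2)."""
    return (observed - fit) / math.sqrt(s ** 2 - se_fit ** 2)


results = {}
for obs, total, fit, se_fit in unusual:
    st = standardized_residual(total, fit, se_fit)
    flag = "R" if abs(st) > 2 else ""   # assumed cut-off for a "large" residual
    results[obs] = (total - fit, round(st, 2), flag)
    print(obs, total - fit, f"{st:.2f}", flag)
```

Only cases 8 and 18 pass the cut-off, which agrees with the R flags in
the output.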

Second, the influential cases, marked with an X.  An influential case
is more important than the others in determining the values of the
coefficients.  It isn't necessarily anything to worry about, but again
is worth checking.  (I am not going to go into detail on this, because
of your time limit, but if you would like, just request
clarification).
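
For what it's worth, the X flags line up with a quantity called
leverage, which can be recovered from the table as (SE Fit / S)
squared. The sketch below assumes a common rule of thumb (flag when
leverage exceeds 3 x number of coefficients / number of cases). I
believe Minitab uses something like this, but treat the cut-off as an
assumption.

```python
S = 49455.0    # residual standard deviation from the Minitab output
N_CASES = 79
N_COEFS = 3    # constant, GFarea, Bedrooms

# SE Fit for each listed observation, copied from the table.
se_fit = {8: 9787, 18: 6362, 49: 17274, 53: 19406, 54: 17160}

# Leverage measures how unusual a case's predictor values are.
leverage = {obs: (se / S) ** 2 for obs, se in se_fit.items()}
cutoff = 3 * N_COEFS / N_CASES   # assumed rule of thumb, about 0.114
high_leverage = sorted(obs for obs, h in leverage.items() if h > cutoff)
print(high_leverage)  # [49, 53, 54]
```

Strictly, high leverage means a case has the potential to be
influential, not that it necessarily is.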

I have written this fairly swiftly, because of your time limit, so if
you think that I have missed anything out, or would like clarification
on anything, please ask, before rating the question.

I will recheck the page fairly regularly for the rest of the day, to
see if a clarification request appears.

Here's a useful site:
http://www.fw.umn.edu/biochr/assoc/dho/107/notes/minitab/REGRESS1.HTM

Here's a useful book.  :)
http://www.amazon.co.uk/exec/obidos/ASIN/0761962301/qid%3D1006938682/sr%3D1-1/ref%3Dsr%5Fsp%5Fre/026-9955570-8940457

jeremymiles-ga

Request for Answer Clarification by san007-ga on 28 Apr 2003 09:40 PDT 

hi is it possible to have further clarification on what is the f part
of this is please

it would help me out greatly

> Source            DF          SS          MS         F        P
> Regression         2 3.69762E+11 1.84881E+11     75.59    0.000
> Residual Error    76 1.85878E+11  2445767700
> Total             78 5.55641E+11  

> Source       DF      Seq SS
> GFarea        1 3.26460E+11
> Bedrooms      1 43302466403

Clarification of Answer by jeremymiles-ga on 28 Apr 2003 10:39 PDT 

Sure.  F is a statistic that is used in ANOVA and regression.

It's kind of hard to tell with your numbers (because they are very
large), but F is given by MSregression / MSError, where MS stands for
"mean squares".

The mean squares are given by the SS (sums of squares) divided by the
degrees of freedom.
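
With your numbers, the arithmetic looks like this in Python (values
copied from the Analysis of Variance table):

```python
# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_regression, df_regression = 3.69762e11, 2
ss_error, df_error = 1.85878e11, 76

ms_regression = ss_regression / df_regression   # mean square for regression
ms_error = ss_error / df_error                  # mean square for error
f_stat = ms_regression / ms_error

print(f"F = {f_stat:.2f}, df = {df_regression}, {df_error}")  # F = 75.59, df = 2, 76
```

The error mean square differs from the printed 2445767700 in the later
digits only because the SS values in the table are rounded.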

So, what are the sums of squares?   The sums of squares are what
regression is all about - you might have heard of regression being
called OLS (ordinary least squares) regression - these are the squares
that make the sums of squares.  The sums of squares are short for sums
of squared deviation from the regression.

The first sum of squares we have are the total sum of squares.  These
are calculated by finding the residual (difference) between each value
and the mean, squaring it, and then adding them up.

The mean is a bit like a regression equation with no predictor
variables.  If you want to find a value, such that you minimise the
sum of squares, you will end up with the mean.  Here's an example:

Take the numbers:

3
4
5
6

Calculate the mean: 4.5

Find the difference between each value and the mean:
3 -1.5
4 -0.5
5  0.5
6  1.5

Square them:
3 -1.5 2.25
4 -0.5 0.25
5  0.5 0.25
6  1.5 2.25

Add them up: 5.

This gives the total sum of squares.  It is also impossible to pick
another number and find a lower value for the sum of squares, so we
can think of the mean as a least squares estimator.  (But we are
getting off the point).  (The sum of squares is also given by the
variance * N if the variance is calculated by dividing by N, or by the
variance * (N - 1) if it is calculated by dividing by N - 1; the
variance is the square of the standard deviation).
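
The same example in Python, including a check that other candidate
values give a larger sum of squares. The helper name is made up for
illustration.

```python
import statistics

values = [3, 4, 5, 6]


def sum_of_squares(data, centre):
    """Sum of squared deviations of the data from a chosen centre."""
    return sum((x - centre) ** 2 for x in data)


mean = statistics.mean(values)            # 4.5
ss_total = sum_of_squares(values, mean)   # 2.25 + 0.25 + 0.25 + 2.25 = 5.0

# Any other centre gives a larger sum of squares.
ss_at_4 = sum_of_squares(values, 4)       # 1 + 0 + 1 + 4 = 6
ss_at_5 = sum_of_squares(values, 5)       # 4 + 1 + 0 + 1 = 6

# The variance recovers the same total: divide-by-N variance times N,
# or divide-by-(N - 1) variance times (N - 1).
via_pvariance = statistics.pvariance(values) * len(values)
via_variance = statistics.variance(values) * (len(values) - 1)
```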

Now, where were we.

Now we get our regression equation, and we calculate a predicted
value for each case and find the residual. If we square the
residuals, and add them up, we have the error sum of squares - the
point of regression is to find values for the equation that give the
smallest possible error sum of squares - this is the least squares
that we are talking about.  This will never be higher than the total
sum of squares, but our question is, how much lower.

We have SStotal, and SSerror.  We can calculate the regression sum of
squares by subtracting error from total:

SSregression = SStotal - SSerror.

[You might notice at this point that the R-Sq is equal to the
SSregression / SStotal - hence it is the proportion of variance -
(SStotal being variance).]
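
To see the three sums of squares come out of an actual fit, here is a
small Python sketch with one predictor. The five data points are
invented purely so the arithmetic is easy to follow; they have nothing
to do with your housing data.

```python
# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least squares slope and intercept for a single predictor.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx                      # 0.6
intercept = mean_y - slope * mean_x    # 2.2

predicted = [intercept + slope * xi for xi in x]

ss_total = sum((yi - mean_y) ** 2 for yi in y)                  # deviations from the mean
ss_error = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # deviations from the line
ss_regression = ss_total - ss_error
r_sq = ss_regression / ss_total

print(round(ss_total, 2), round(ss_error, 2), round(ss_regression, 2), round(r_sq, 2))
```

Here SStotal is 6, SSerror is 2.4 and SSregression is 3.6, so R-Sq is
0.6.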

This gives the three SS that we have in the table.  The trouble with
sums of squares is that they are dependent on things like the number
of people, and the number of variables, so we divide each one by its
degrees of freedom to get a mean square.  The regression df is k
(where k is the number of predictor variables) and the error df is
N - k - 1.

We divide the regression mean square (SSregression / k) by the error
mean square (SSerror / (N - k - 1)), and we have F. So, the smaller
the error the larger the F.  In addition, more predictors, for the
same value of R-Sq, gives a lower F.
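
The link between F, R-Sq and the number of predictors can be written
as F = (R-Sq / k) / ((1 - R-Sq) / (N - k - 1)). A short Python sketch
of that, where the three-predictor case is hypothetical and only there
to show the direction of the effect:

```python
def f_from_r_sq(r_sq, n, k):
    """F statistic for a regression with k predictors fitted to n cases."""
    return (r_sq / k) / ((1 - r_sq) / (n - k - 1))


r_sq = 3.69762e11 / 5.55641e11            # SSregression / SStotal from the table
f_two = f_from_r_sq(r_sq, n=79, k=2)      # your model: about 75.6
f_three = f_from_r_sq(r_sq, n=79, k=3)    # hypothetical: same R-Sq, one more predictor

print(f"{f_two:.1f} {f_three:.1f}")
```

With the same R-Sq spread over three predictors instead of two, F
drops to roughly 50.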

F is a test statistic, which has a distribution which depends on the
degrees of freedom, which is why you need to report the degrees of
freedom when you report F.  What you are asking is "if there was no
effect in my data, what is the probability of getting a value of F
this high?"  High values of F are associated with low probabilities,
which means that it is unlikely that your data are a fluke, and that
you have a real result.

Again, please feel free to request clarification if any of this is
unclear.

jeremymiles-ga

Request for Answer Clarification by san007-ga on 28 Apr 2003 11:16 PDT 

you have been very helpful by the way!!

if you can and want to could you clarify what you think the
effectiveness is of this model or how i can obtain this.

i think im going to have to alter it

do you know how the hypothesis testing (null etc) would configure into
this model??

Clarification of Answer by jeremymiles-ga on 28 Apr 2003 11:23 PDT 

The model has a pretty good R-Sq, most people would be happy with that
- it's rare to see one higher.  This is the most common way to
statistically consider a model.  The cases with high residuals and
influence statistics are not a cause of excessive worry, IMHO.

What do you plan to do to alter the model?

[Given your time limit, I will post this and continue to the next part
of your question.]

Clarification of Answer by jeremymiles-ga on 28 Apr 2003 11:26 PDT 

The null hypothesis is *usually* the hypothesis of zero effect.  There
are two sets of null hypotheses tested in a regression analysis.
First is the null hypothesis that your predictions are no better than
chance.  This is tested using the F test of the R-Sq.  Your data have
easily rejected this null hypothesis.

Second is the set of null hypotheses about the parameter estimates or
coefficients.  Here, for GFArea and Bedrooms, you have rejected the
null hypotheses of no effect.  For the constant (sometimes called the
intercept) the null hypothesis is not rejected (if you use 0.05 as
your cut-off), however, as already discussed, the intercept in this
case isn't especially useful.
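
Spelled out as a decision rule in Python, using the P values as
Minitab printed them and the 0.05 cut-off (the dictionary labels are
mine):

```python
ALPHA = 0.05  # conventional cut-off mentioned in the text

# P values as printed by Minitab; 0.000 means "less than 0.0005".
p_values = {
    "Overall F": 0.000,
    "Constant": 0.076,
    "GFarea": 0.000,
    "Bedrooms": 0.000,
}

# Reject the null hypothesis of zero effect when P falls below the cut-off.
decisions = {
    name: ("reject" if p < ALPHA else "do not reject")
    for name, p in p_values.items()
}

for name, decision in decisions.items():
    print(f"{name:10s} {decision}")
```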

Request for Answer Clarification by san007-ga on 28 Apr 2003 13:20 PDT 

i was thinking of taking the rooms variable out as the r sq goes higher.

is it possible to take out the constant??

Clarification of Answer by jeremymiles-ga on 29 Apr 2003 02:16 PDT 

R-Sq is higher without the rooms variable?  That surprises me.  Can
you post the correlations between the variables?  I can calculate the
R-Sq then.

You can remove the intercept - this is called "regression through the
origin" - I am not sure if you can do this in minitab, but it is
usually not recommended. This page:

http://mallit.fr.umn.edu/fr5218/reg_refresh/origin.html

Says:

"Regression through the origin should not be used indiscriminately.
The variability of the data about the regression line should be
compared to that for the equation including b0 (that is, compare
corresponding sy.xs). A plot of studentized residuals versus X should
be made as well. If a non-horizontal linear trend is apparent, a
nonzero intercept should be suspected."

This one:
http://www.graphpad.com/instatman/Whentoforcearegressionlinethroughtheorigin_.htm

Says:

"Therefore you might be tempted to force the regression line through
the origin. But you don't particularly care where the line is in the
vicinity of the origin. You really care only that the line fits the
standards very well near the unknowns. You will probably get a better
fit by not constraining the line."

Just a stab in the dark here, but have you considered the non-linear
effect of GFArea - a small house getting an extra 10m^2 might make a
bigger difference than a large house getting an extra 10m^2.

Also, have you considered an interaction effect?  That is, the effect
of GFarea may not be constant across all numbers of bedrooms.
(Although these effects may have to be quite strong to detect them
with your sample size).  These are two ways to increase the value of
R-Sq.
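
In practice both ideas mean adding extra columns to the data and
putting them in the regression as additional predictors. A minimal
Python sketch of building those columns follows. The column names, the
helper and the example house are all hypothetical.

```python
def expand_predictors(gfarea, bedrooms):
    """Return the original predictors plus a squared term and an interaction term."""
    return {
        "GFarea": gfarea,
        "Bedrooms": bedrooms,
        "GFarea_sq": gfarea ** 2,                 # lets the effect of area curve
        "GFarea_x_Bedrooms": gfarea * bedrooms,   # lets the area effect vary with bedrooms
    }


row = expand_predictors(2504, 4)  # hypothetical house
print(row)
```

It usually helps to subtract the mean from GFarea and Bedrooms before
squaring or multiplying, so that the new columns are less strongly
correlated with the original ones.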

jeremymiles-ga
