REGRESSION ANALYSIS

A. Overview of regression analysis
B. Step-by-step instructions for doing regressions in Excel
C. Interpreting your Excel output

A. OVERVIEW OF REGRESSION ANALYSIS

Inferential statistics is all about trying to generalize about a population on the basis of a sample. We have to be very careful about the inferences we make when we do research. How sure are we that the results of our study are representative of the whole population?

Take the following example from macroeconomics:

Suppose we want to know how consumption and income are related over time. Well, first we collect a sample of data. This may be a few years or a few decades. Either way, it is only a sample of time. How sure are we that the relationship we find between consumption and disposable income in our sample holds for all time? Or, is it just a quirk of the time period that we chose. these are important questions to answer if we want to understand how the economy works, not just this year or this decade, but anytime.

Remember, there is always a chance that sample is not representative of the population. In this case, the population of time. Sampling distributions tell us that if we were to take a lot of samples, the "average" sample would be unbiased, and well-representative of the population. But, we don't take a lot of samples at a time. Typically, we take one sample. Fortunately, the Central Limit Theorem tells us that if we take a big enough sample, the distribution of the samples will follow the Student t-distribution. Therefore, the t distribution will allow us to calculate the probability that our sample statistic (e.g., sample mean) falls within a certain range of the population parameter (i.e., the "real" answer we are looking for, but don't know).

Since we don't know the "true answer", we are left to theorize and hypothesize. Returning to our example from above, let's say that some brilliant economist theorizes that an increase in income this year will lead to an increase in consumption this year. The theory makes a pretty specific claim as to the true value of this relationship. Even though there no exact number given, the theory predicts that this number is positive (i.e., not negative or zero). Suppose further that you find that, specifically, in your sample, a $1.00 increase in income leads to a $0.50 increase in consumption. Does this support the theory? Well, it does to a degree, but how certain are we that this result is not simply a quirk of the sample? Maybe another sample would yield an estimate of $0.00 instead of $0.50. How likely is it that we could get a result of $0.50 if this theory were not true?

Before we answer this question, we must decide on a criteria. Every time you are going to judge something, or someone, you must first decide on what criterion will this be judged? In a jury trail, there are two conflicting theories. the prosecution claims "guilt" while the defense claims "not guilty". In our society, we have decided that the burden of proof must fall on the prosecution. That is, the jury is to assume that the person is not guilty, until such time as there is a preponderance of evidence which suggest otherwise. the same is true of science. If you have a new theory, the burden of proof on you to show that your theory is correct. So, the researcher (the jury) must assume (we call this assumption the null hypothesis) that the theory is incorrect until there is a preponderance of evidence to suggest otherwise. Conventionally, we tend to require that the researcher be at least 90% certain that the theory is correct (i.e., less than 10% chance that the null hypothesis right).

The way we calculate the probability that the null hypothesis is wrong is by calculating a t value, and comparing it to the t distribution. Remember from above that the t distribution describes the distribution of all possible samples. Therefore, it can tell us the probability that we got the sample estimate we did if the null hypothesis was actually correct. Since we only have a sample here, there is always a chance that we can get a sample answer which is real far from the hypothesis, even though the hypothesis is true (it would imply that our sample is very, very biased, and therefore not representative of the whole population).


B. HOW TO PERFORM A REGRESSION IN EXCEL

Linear Regression allows you to determine linear relationships between two or more variables. Regression analysis works by fitting a line which best describes the relationship. In general, the equation of a line is written as Y = bo + b1*X + e; where Y is the dependent variable, X is the independent variable, bo is the intercept and b1 is the coefficient of change on X. b1 measures how much Y changes when X changes by 1.

To perform a regression:

1. From the menu at the top of the program, select: TOOLS, DATA ANALYSIS, REGRESSION

2. Input the Y Range (where the Y data is located; e.g., "P15:B155") Input the X Range (where the Xs are located; e.g., "O15:O155" for one independent var. or "N15:O155" if you had 2 independent variables, etc...). Remember, the "Y" variable is the dependent variable (e.g., Real Consumption) and the "X" variable is the independent variable (e.g., "Real Disposable Income")

**IMPORTANT** "click" on the "Labels" box if you have included the column NAMES in your Y and X ranges

Note: In this case, since there are no observations for disposable GDP prior to March 1960, the range cannot begin before row 15.

You also have the choice of several options, like charting a "Line Fit Plot" which will give you a scatter plot of your regression.

3. Hit return once you are ready and the output will be calculated on a separate sheet:

This part is the most important! it tells you the estimated coefficients of change associated with each of the X variables. Now, you can write the above output in the form of the general equation,

General Model: Y = b0 + b1*X

as a consumption function given by,

Estimated Model:

Real Cons = -49.42 + 0.802*(Real Disp GDP)

These are merely estimates. We need to know if the X variable, Disposable Income "significantly affects the Y variable (Real Consumption). To "test" this, we must perform a t-test. When Excel reports the t-test it assumes that each of the coefficients = 0. That is, the the "null hypothesis".

Of course, the exact critical t depends on the degrees of freedom. At the 10% level the critical t will be about +/- 1.67, while the 5% critical t is about +/- 1.96. In addition, Excel gives you the actual critical level of significance (P VALUE). We usually require that we are 90% (P VALUE of .10 or less) certain of the estimate in order to conclude a "significant" relationship between the Y and X variable. In some cases, 80% might be notable.


**To get the critical t's when you are in Excel, simply go to FUNCTION, and select TINV(...). You need only tell Excel the critical level you wish to use (.10 or .05) and the degrees of freedom.


C. INTERPRETATION OF THE REGRESSION RESULTS

In this case we must do 2 t-tests: one for the intercept and one for the slope parameter. the null hypothesis are:

H0: b0 = 0
H0: b1 = 0

Though Excel gives you the t score, it remember that it calculates t based on the null that the coefficient is 0. It could also be found by using the t test formula:

t = (b1 - 0)/SE

Because the t scores are relatively large (-2.73 and 155, respectively), we will reject both hypothesis (at the 5% level, the critical t is about +/- 1.96). This means that neither the intercept or the slope parameter are zero. So, we feel confident writing the consumption function as,

Real Cons = -49.42 + 0.802*(Real Disp GDP)

Real disposable income has a significant, positive effect on real consumption. As income increases by $1.00, consumption increases by $0.80.

Sometimes, a theory might claim that the coefficient is not only positive, but equal to a certain value. Let's say there is a theory that the coefficient on income should be exactly equal to 1.00. You could also test to see if the coefficient is significantly different from 1.00, just like we tested above to see if it was significantly different from zero. Only, in this case the null hypothesis is

H0: b1 = 1.0

In this case the t test is written as,

t = (b1 - 1)/SE

In this case, t=.802-1/.005 = -39.6. Again, this is a clear rejection of the null. So, we can say that the theory was wrong; the coefficient on income is significantly smaller than 1.00.

AN EXAMPLE OF MULTIPLE REGRESSION

This output is from a data set containing information on the price of wine, its quality (measured by a rating, or score), and the supply of each brand of wine (measured by the number of cases produced). The question was what are the significant factors which determine the price of wine. On the one hand, increases in demand caused by a high quality would likely lead to higher prices; on the other hand, as the winery produces more, we'd expect a drop in prices from the increased supply.

Regression Statistics

Multiple R

0.655389

R Square

0.429535

Adjusted R Square

0.417773

Standard Error

13.26669

Observations

100

The R-Square tells us how much of the variance of "Y" we have explained in the regression. In this case, about 42%.

ANOVA

         
 

df

SS

MS

F

Significance F

Regression

2

12854.82

6427.409

36.51832

1.50E-12

Residual

97

17072.49

176.0051

   

Total

99

29927.31

     

 

 

Coefficients

Standard Error

TSTAT

P-value

Lower 95%

Upper 95%

Intercept

-475.902

62.80827

-7.57707

2.10E-11

-600.559

-351.25

Score

5.466908

0.680143

8.03788

2.23E-12

4.117013

6.8168

# of Cases Produced

-0.00013

0.000128

-0.98055

0.329256

-0.00038

0.00013

Now, you can write the above output in the form of a predicted equation as follows

Y = -475.902 + 5.47*(SCORE) - 0.00013*(# OF CASES)

These are merely estimates. We need to know if the X variables, SCORE and # OF CASES "significantly affect the Y variable (in this example the PRICE OF WINE). To "test" this, we must perform a t-test. The t-test assumes that each of the coefficients = 0. This the the "null hypothesis".

Of course, the exact critical t depends on the degrees of freedom. At the 10% level the critical t will be about +/- 1.67, while the 5% critical t is about +/- 1.96. In addition, Excel gives you the actual critical level of significance (P VALUE). We usually require that we are 90% (P VALUE of .10 or less) certain of the estimate in order to conclude a "significant" relationship between the Y and X variable. In some cases, 80% might be notable.

In this example, we would "fail to reject the hypothesis that # OF CASES DOES NOT significantly affect the PRICE OF WINE. in other words the # OF CASES produced does not appear to affect the PRICE OF WINE. On the other hand, the SCORE does - we reject the hypothesis that SCORE does not affect the PRICE at least the 1% level (99% sure it affects it). In conclusion, we would say that (i) CASES has no effect on PRICES and (ii) SCORE has a positive, and significant effect.


Elon University 
Campus Box 2700
Elon, NC 27244 (800) 334-8448
E-mail: web@elon.edu 
 

Last Modified: 08/05/03
Copyright Elon University