R version 2.10.1 (2009-12-14)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # Notes on how to use the free statistical package R in Windows.
> # I will show how to load data into it and run regressions.
>
> # The hash mark means this is a comment line.
> # February 24, 2010.
> # Professor Eric Rasmusen, erasmuse@Indiana.edu.
> # Some of this draft is untested.
>
> # R is a wonderfully crafted statistics package with
> # amazingly bad documentation.
> # These notes show how to make it do the basics of regressions.
>
> ######################################
>
> # To download R, go to the Comprehensive R Archive Network
> # at http://cran.case.edu/
> # Download it and install it on your computer, which is easy to do.
>
> ######################################
>
> # GETTING STARTED
>
> # Click on the R icon that the installation program created on
> # your desktop. A window will open up.
>
> # Find out what directory R is writing files to, reset it to the
> # D: root directory, and check again to see that it worked:

> getwd()
[1] "D:/My Documents"
> setwd("D:\\")
> getwd()
[1] "D:/"
>
> # Notice how backslashes must be doubled for R to read them.
> # (Single forward slashes work too.)
>
> # A subdirectory would look like
> #   "D:\\_Take-to-office\\rfiles"
>
> # Now let's fix it so R doesn't show us lots of meaningless digits:
> # discourage scientific notation (at intensity 6) and suggest
> # 4 significant digits (there is no way to *require* only 4).

> options(scipen = 6)
> options(digits = 4)
>
> # Only uncomment the next line if you know why. (Note "Hmisc" is the
> # package's actual capitalization, and defaultPackages only takes
> # effect when a new session starts.)
> # options(defaultPackages = c(getOption("defaultPackages"), "car", "Hmisc"))
>
> ######################################
>
> # READING IN DATA
>
> # A data file is just plain text. The format is like this
> # file called r-data.txt (only the format matters here; the
> # actual file used below has 30 observations):
>
> #   price floor area rooms age cent.heat state
> # 1 52.00 111.0  830     5 6.2       no  Illinois
> # 2 60    128.0  710     5 7.5       no  Illinois
> # 3 35    101.0 1000     5 4.2       no  Illinois
> # 4 50    131.0  690     6 8.8       no  Indiana
> # 5 20     93.0  900     5 1         yes Ohio
> # 6 57    101.0 1000     5 4         no  Illinois
> # 7 80    100    690     6 8         no  Indiana
> # 8 30     90    800     5 2         yes Ohio
>
> # Notice the lack of an observation number on the first line;
> # because the header row is one field shorter than the data rows,
> # read.table() knows it is a header and treats the first column
> # as observation names.
> # Capitalization matters: price and Price could be different variables.
> # Spacing and tab symbols don't matter, but linebreaks do
> # (each line is one observation).
> # Notice how binary yes/no variables and string variables
> # are easily accommodated.
>
> # The first read-in command creates an object called "data2":

> data2 <- read.table("r-data.txt")
>
> # That uses the file D:/r-data.txt, since we set D:/ as
> # the working directory.
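>
> # An untested aside: read.table() takes explicit arguments if the
> # automatic guessing ever goes wrong. Here "r-data2.txt" is a
> # hypothetical file whose header row also names the observation
> # column, so nothing needs to be guessed:
>
> # data2b <- read.table("r-data2.txt", header = TRUE, row.names = 1)
> # str(data2b)   # shows each column's name and type, to check the read-in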
>
> # In case we want to use lags or time series functions, transform
> # the dataset thus:

> tdata2 <- ts(data2)
>
> # Then create some lags if you want--- one and four periods here:

> temp1 <- lag(tdata2, -1)
> temp4 <- lag(tdata2, -4)
> tdata3 <- ts.union(tdata2, temp1, temp4)
>
> # Now convert it back to a normal dataset:

> data3 <- data.frame(tdata3)
>
> # Unfortunately, R has given stupid names to many of the
> # variables. To see the stupid names, type

> colnames(data3)
 [1] "tdata2.price"     "tdata2.floor"     "tdata2.area"      "tdata2.rooms"     "tdata2.age"
 [6] "tdata2.cent.heat" "tdata2.state"     "temp1.price"      "temp1.floor"      "temp1.area"
[11] "temp1.rooms"      "temp1.age"        "temp1.cent.heat"  "temp1.state"      "temp4.price"
[16] "temp4.floor"      "temp4.area"       "temp4.rooms"      "temp4.age"        "temp4.cent.heat"
[21] "temp4.state"
>
> # Thus, let's convert at least some of them back.
> # R is very bad at renaming.
> # First, let's get R to put our original names in quotes:

> colnames(data2)
[1] "price"     "floor"     "area"      "rooms"     "age"       "cent.heat" "state"
>
> # Then let's cut-and-paste those into a command to fix up data3:

> colnames(data3)[1:7] <- c("price", "floor", "area",
+   "rooms", "age", "cent.heat", "state")
>
> # Cut and paste from that last command and insert 1's and 4's:

> colnames(data3)[8:14] <- c("price1", "floor1", "area1",
+   "rooms1", "age1", "cent.heat1", "state1")

> colnames(data3)[15:21] <- c("price4", "floor4", "area4",
+   "rooms4", "age4", "cent.heat4", "state4")
>
> # Now look at your dataset to see if the names and lags turned
> # out right. Notice that the cent.heat and state variables have
> # been made numerical (ts() coerced the factors to their codes).

> data3
   price floor area rooms age cent.heat state price1 floor1 area1 rooms1 age1 cent.heat1 state1 price4 floor4
1     52   100  830     5 6.2         1     1     NA     NA    NA     NA   NA         NA     NA     NA     NA
2     60   148  210     5 7.5         1     1     52    100   830      5  6.2          1      1     NA     NA
3     35    50 1000     8 4.2         1     1     60    148   210      5  7.5          1      1     NA     NA
4     50   102  690     6 8.8         1     2     35     50  1000      8  4.2          1      1     NA     NA
5     20    38  900     8 2.0         2     3     50    102   690      6  8.8          1      2     52    100
6     57   118 1000     5 4.0         1     1     20     38   900      8  2.0          2      3     60    148
7     80   160  690     8 8.0         1     2     57    118  1000      5  4.0          1      1     35     50
8     30    90  800     6 4.0         1     3     80    160   690      8  8.0          1      2     50    102
9     40    84  800     4 4.0         2     3     30     90   800      6  4.0          1      3     20     38
10    45    70  900     5 2.0         2     3     40     84   800      4  4.0          2      3     57    118
11    52   112  830     5 6.8         1     1     45     70   900      5  2.0          2      3     80    160
12    60   148  710     5 7.5         1     1     52    112   830      5  6.8          1      1     30     90
13    35    44 1000     5 4.2         1     1     60    148   710      5  7.5          1      1     40     84
14    50   100  690     6 8.8         1     2     35     44  1000      5  4.2          1      1     45     70
15    22    38  990     8 1.0         2     3     50    100   690      6  8.8          1      2     52    112
16    57   128 1000     5 4.0         1     1     22     38   990      8  1.0          2      3     60    148
17    80   154  290     8 8.0         2     2     57    128  1000      5  4.0          1      1     35     44
18    32    48  800     5 3.0         1     3     80    154   290      8  8.0          2      2     50    100
19    42    64  800     4 4.0         2     3     32     48   800      5  3.0          1      3     22     38
20    48   160  440     5 2.0         1     3     42     64   800      4  4.0          2      3     57    128
21    54   126  690     6 8.8         1     2     48    160   440      5  2.0          1      3     80    154
22    24    36  400     5 1.0         2     3     54    126   690      6  8.8          1      2     32     48
23    57   120 1000     5 4.0         1     1     24     36   400      5  1.0          2      3     42     64
24    84   160  690     6 8.0         1     2     57    120  1000      5  4.0          1      1     48    160
25    32    90  400     5 3.0         2     3     84    160   690      6  8.0          1      2     54    126
26    42    64  900     4 4.0         2     3     32     90   400      5  3.0          2      3     24     36
27    44    60  700     5 2.0         2     3     42     64   900      4  4.0          2      3     57    120
28    34    56  400    14 3.0         1     3     44     60   700      5  2.0          2      3     84    160
29    42    64  900     8 4.0         1     3     34     56   400     14  3.0          1      3     32     90
30    44    62  100     5 2.0         1     3     42     64   900      8  4.0          1      3     42     64
31    NA    NA   NA    NA  NA        NA    NA     44     62   100      5  2.0          1      3     44     60
32    NA    NA   NA    NA  NA        NA    NA     NA     NA    NA     NA   NA         NA     NA     34     56
33    NA    NA   NA    NA  NA        NA    NA     NA     NA    NA     NA   NA         NA     NA     42     64
34    NA    NA   NA    NA  NA        NA    NA     NA     NA    NA     NA   NA         NA     NA     44     62
   area4 rooms4 age4 cent.heat4 state4
1     NA     NA   NA         NA     NA
2     NA     NA   NA         NA     NA
3     NA     NA   NA         NA     NA
4     NA     NA   NA         NA     NA
5    830      5  6.2          1      1
6    210      5  7.5          1      1
7   1000      8  4.2          1      1
8    690      6  8.8          1      2
9    900      8  2.0          2      3
10  1000      5  4.0          1      1
11   690      8  8.0          1      2
12   800      6  4.0          1      3
13   800      4  4.0          2      3
14   900      5  2.0          2      3
15   830      5  6.8          1      1
16   710      5  7.5          1      1
17  1000      5  4.2          1      1
18   690      6  8.8          1      2
19   990      8  1.0          2      3
20  1000      5  4.0          1      1
21   290      8  8.0          2      2
22   800      5  3.0          1      3
23   800      4  4.0          2      3
24   440      5  2.0          1      3
25   690      6  8.8          1      2
26   400      5  1.0          2      3
27  1000      5  4.0          1      1
28   690      6  8.0          1      2
29   400      5  3.0          2      3
30   900      4  4.0          2      3
31   700      5  2.0          2      3
32   400     14  3.0          1      3
33   900      8  4.0          1      3
34   100      5  2.0          1      3
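>
> # An alternative to the cut-and-paste renaming above, untested in
> # this session: build all 21 names at once with paste().
>
> # colnames(data3) <- c(colnames(data2),
> #                      paste(colnames(data2), "1", sep = ""),
> #                      paste(colnames(data2), "4", sep = ""))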
>
> # Make this dataset the default dataset to use for this session:

> attach(data3)
>
> ######################################
>
> # SUMMARY STATISTICS AND PLOTTING
>
> # We can get some summary statistics.
> # Use your original dataset without the lags, thus:

> summary(data2)
     price          floor            area          rooms            age       cent.heat       state   
 Min.   :20.0   Min.   : 36.0   Min.   : 100   Min.   : 4.00   Min.   :1.00   no :20    Illinois: 9  
 1st Qu.:35.0   1st Qu.: 60.5   1st Qu.: 690   1st Qu.: 5.00   1st Qu.:3.00   yes:10    Indiana : 6  
 Median :44.5   Median : 90.0   Median : 800   Median : 5.00   Median :4.00             Ohio    :15  
 Mean   :46.8   Mean   : 93.1   Mean   : 718   Mean   : 5.97   Mean   :4.66                          
 3rd Qu.:56.2   3rd Qu.:124.5   3rd Qu.: 900   3rd Qu.: 6.00   3rd Qu.:7.33                          
 Max.   :84.0   Max.   :160.0   Max.   :1000   Max.   :14.00   Max.   :8.80                          
>
> # To get a correlation matrix, again use your original dataset
> # without the lags:

> cor(data2)
             price   floor    area     rooms       age cent.heat state
price      1.00000  0.8666 -0.1688 -0.075465  0.720971        NA    NA
floor      0.86662  1.0000 -0.2398 -0.133545  0.671402        NA    NA
area      -0.16884 -0.2398  1.0000 -0.127165 -0.091508        NA    NA
rooms     -0.07547 -0.1335 -0.1272  1.000000 -0.000967        NA    NA
age        0.72097  0.6714 -0.0915 -0.000967  1.000000        NA    NA
cent.heat       NA      NA      NA        NA        NA         1    NA
state           NA      NA      NA        NA        NA        NA     1
Warning message:
In cor(data2) : NAs introduced by coercion
>
> # The NAs appear because cent.heat and state are factors, not numbers.
> # The cor() command also ignores our request for only 4 digits;
> # round it yourself, or read it into Excel and change it there:
>
> # round(cor(data2[, 1:5]), 2)   # numeric columns only, 2 decimals
>
> # To plot two variables, type

> plot(price, floor, main="Figure 2: Price and Floor")
>
> # To save a figure as a jpg file, go to FILE, SAVE-AS, jpg on the
> # top menu.
>
> ######################################
>
> # RUNNING A REGRESSION
>
> # To do a regression, create a regression output object called
> # "output4". (floor*area is shorthand for floor + area + floor:area,
> # so it adds the interaction to the two variables already listed.)

> output4 <- lm(price ~ price1 + floor + area + floor*area +
+   cent.heat)
>
> # Then display the output:

> summary(output4)

Call:
lm(formula = price ~ price1 + floor + area + floor * area + cent.heat)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.006  -4.561  -0.857   4.687  15.957 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 21.3477199 13.9330798    1.53   0.1391   
price1      -0.1088636  0.1132283   -0.96   0.3463   
floor        0.3154225  0.1119530    2.82   0.0098 **
area        -0.0024662  0.0158005   -0.16   0.8773   
cent.heat    0.5845869  3.9450931    0.15   0.8835   
floor:area   0.0000322  0.0001680    0.19   0.8499   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 8.74 on 23 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.765,	Adjusted R-squared: 0.714 
F-statistic:   15 on 5 and 23 DF,  p-value: 0.00000136 
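>
> # Not shown in the original notes, and untested here: confint()
> # prints confidence intervals for the coefficients, which
> # summary() omits.
>
> # confint(output4, level = 0.95)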
>
> # If you want the residuals, variance-covariance matrix, or
> # predicted values, type

> residuals5 <- output4$resid
> residuals5
       2        3        4        5        6        7        8        9 
 -3.4354   4.6868  -0.8566  -7.9399  -1.3030  11.9570 -11.9535  -5.9343 
      10       11       12       13       14       15       16       17 
  4.9519  -1.3030  -4.5818   6.7723  -0.1814  -5.8279  -4.5611  14.3924 
      18       19       20       21       22       23       24       25 
  4.3747   3.1063 -21.0063  -3.5440  -3.4700  -1.5627  15.9570  -9.9315 
      26       27       28       29       30 
  3.1472   7.5059  -0.5398   3.9495   7.1310 

> vcov5 <- vcov(output4)
> vcov5
            (Intercept)       price1       floor         area   cent.heat
(Intercept)  194.130712 -0.607870644 -0.97602197 -0.139641489 -24.0600659
price1        -0.607871  0.012820657 -0.00302164 -0.000312992   0.0738479
floor         -0.976022 -0.003021641  0.01253348  0.001548094  -0.0414259
area          -0.139641 -0.000312992  0.00154809  0.000249656  -0.0119694
cent.heat    -24.060066  0.073847934 -0.04142587 -0.011969435  15.5637593
floor:area     0.000937  0.000006449 -0.00001719 -0.000002378   0.0001890
               floor:area
(Intercept)  0.00093699746
price1       0.00000644884
floor       -0.00001718529
area        -0.00000237768
cent.heat    0.00018903695
floor:area   0.00000002822

> pred5 <- output4$fitted.values
> pred5
    2     3     4     5     6     7     8     9    10    11    12    13    14 
63.44 30.31 50.86 27.94 58.30 68.04 41.95 45.93 40.05 53.30 64.58 28.23 50.18 
   15    16    17    18    19    20    21    22    23    24    25    26    27 
27.83 61.56 65.61 27.63 38.89 69.01 57.54 27.47 58.56 68.04 41.93 38.85 36.49 
   28    29    30 
34.54 38.05 36.87 
>
> # The summary(), residuals(), and vcov() commands also partly ignore
> # our request for only 4 digits.
> # To trim them, cut-and-paste into Excel (or wrap them in round()).
>
> ######################################
>
> # ODDS AND ENDS
>
> # We can plot a data histogram.
>
> # First get rid of all the observations with missing data:

> data3a <- na.omit(data3)
> attach(data3a)

	The following object(s) are masked from data3 :

	 age age1 age4 area area1 area4 cent.heat cent.heat1 cent.heat4 floor floor1 floor4 price price1 price4 rooms rooms1 rooms4 state state1 state4 

>
> # (Each attach() adds another dataset to the search path, which is
> # why R warns that names are masked; detach() removes one again.)
>
> # Then create and plot the histogram. (hist() draws a plot by
> # itself; the second command redraws it with a title.)

> var234 <- hist(price)
> plot(var234, main="Figure 1: A histogram of price")
>
> # We can also do kernel densities, for variables that
> # are close to continuous. Not true here, but try it anyway:

> var123 <- density(price)
> plot(var123, main="Figure 3: A kernel density estimate of price")
>
> # Finally, let's go back to our original dataset:

> attach(data3)

	The following object(s) are masked from data3a :

	 age age1 age4 area area1 area4 cent.heat cent.heat1 cent.heat4 floor floor1 floor4 price price1 price4 rooms rooms1 rooms4 state state1 state4 

	The following object(s) are masked from data3 ( position 4 ) :

	 age age1 age4 area area1 area4 cent.heat cent.heat1 cent.heat4 floor floor1 floor4 price price1 price4 rooms rooms1 rooms4 state state1 state4 

>
> ##################
>
> # If you don't want an intercept in a regression, put a 0 into
> # the formula:

> output6 <- lm(price ~ 0 + floor + area + cent.heat)
> summary(output6)

Call:
lm(formula = price ~ 0 + floor + area + cent.heat)

Residuals:
   Min     1Q Median     3Q    Max 
-20.70  -3.80   1.97   5.18  15.62 

Coefficients:
          Estimate Std. Error t value Pr(>|t|)    
floor      0.38387    0.02893   13.27  2.4e-13 ***
area       0.00793    0.00492    1.61     0.12    
cent.heat  3.79167    2.48627    1.53     0.14    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 8.54 on 27 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.973,	Adjusted R-squared: 0.97 
F-statistic:  326 on 3 and 27 DF,  p-value: <2e-16 
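>
> # A caution not in the original notes: without an intercept,
> # R-squared is computed around zero instead of around mean(price),
> # so the 0.973 here is not comparable to output4's 0.765. An
> # untested way to test the intercept directly is to fit the same
> # model with one and compare by F-test:
>
> # output6a <- lm(price ~ floor + area + cent.heat)
> # anova(output6, output6a)   # F-test of the restriction (intercept) = 0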
>
> ##################
>
> # To plot two variables and the regression line for a
> # TWO-VARIABLE regression:

> output5 <- lm(price ~ floor)
> plot(floor, price)
> abline(output5)
>
> ##################
>
> # To put in fixed state effects, wrap the grouping variable in
> # factor(). (Recall that ts() turned state into the codes 1, 2, 3;
> # factor() turns those back into categories, and the dummy
> # variables are then created automatically.)

> output7 <- lm(price ~ floor + area + age + factor(state))
> summary(output7)

Call:
lm(formula = price ~ floor + area + age + factor(state))

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2520  -6.1342  -0.0477   7.2591  10.6495 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    25.3487029 10.9750612    2.31     0.03 *  
floor           0.2611227  0.0499754    5.23 0.000024 ***
area            0.0000736  0.0064021    0.01     0.99    
age            -0.3401386  1.2951651   -0.26     0.80    
factor(state)2  8.8925386  5.5317332    1.61     0.12    
factor(state)3 -6.2284701  5.0679449   -1.23     0.23    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 7.63 on 24 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.814,	Adjusted R-squared: 0.775 
F-statistic:   21 on 5 and 24 DF,  p-value: 0.0000000463 
>
> ##################
>
> # To run a logit, convert the y-variable to have only values of
> # 0 or 1 thus (cent.heat is coded 1/2 in data3):

> cent.heat <- cent.heat - 1
> cent.heat
 [1]  0  0  0  0  1  0  0  0  1  1  0  0  0  0  1  0  1  0  1  0  0  1  0  0  1
[26]  1  1  0  0  0 NA NA NA NA
>
> # Then run the regression:

> output8 <- glm(cent.heat ~ floor + area, family = binomial(link =
+   "logit"))
> summary(output8)

Call:
glm(formula = cent.heat ~ floor + area, family = binomial(link = "logit"))

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.489  -0.777  -0.492   1.046   2.068  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.50676    1.87313    1.34    0.181  
floor       -0.02709    0.01265   -2.14    0.032 *
area        -0.00120    0.00178   -0.67    0.500  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 38.191  on 29  degrees of freedom
Residual deviance: 32.279  on 27  degrees of freedom
  (4 observations deleted due to missingness)
AIC: 38.28

Number of Fisher Scoring iterations: 4
>
> ##################
>
> # If you have your data in a spreadsheet form, save it as a *.csv
> # file, which is comma-separated. Then your R data input command
> # would be

> data2a <- read.csv("D:\\_Take-to-office\\r-data.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'D:\_Take-to-office\r-data.csv': No such file or directory
>
> # I haven't tested this one; the error just means no file by that
> # name exists on this machine.
>
> ##################
>
> # We can write the data out to a file called data3.csv,
> # no variable column names, comma-separated:

> write.table(data3, col.names = FALSE, sep = ",",
+   file = "data3.csv")
>
> # We can write the data out to a file called data3.txt,
> # with variable column names, space-separated:

> write.table(data3, col.names = TRUE, sep = " ",
+   file = "data3.txt")
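>
> # An untested check of the round trip: read data3.txt back in.
> # As with r-data.txt, the header row is one field shorter than the
> # data rows (write.table kept the row names), so read.table() uses
> # the first column as row names.
>
> # check3 <- read.table("data3.txt", header = TRUE)
> # summary(check3$price)   # should match summary(data3$price)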
>
> ######################################
>
> # USING PACKAGES FOR SPECIAL COMMANDS
>
> ##################
>
> # To do an F-test for b2 + b3 = 1, first download the "car" package.
> # On the top menu, pick
> #   Packages, Select Repository, (pick any site), Install Package, car
>
> # Then, whenever you want to use "car", load it either from the menu
> # (Packages, Load Package, car) or by typing

> library(car)
>
> # Finally, type the commands below. (The errors in my original
> # session came from never loading the package; note also that recent
> # versions of car call the function linearHypothesis() rather than
> # linear.hypothesis(). Output not shown; this version is untested.)
> # The row c(0,0,1,1) picks out the area and cent.heat coefficients;
> # the intercept comes first.

> output6 <- lm(price ~ floor + area + cent.heat)
> rhs <- c(1)
> restricted6 <- rbind(c(0, 0, 1, 1))
> linearHypothesis(output6, restricted6, rhs)
>
> ##################
>
> # To run a Durbin-Watson test for autocorrelation, load the car
> # package as described above and then type (again, recent versions
> # of car call this durbinWatsonTest() rather than durbin.watson()):

> output9 <- lm(price ~ floor + area)
> durbinWatsonTest(output9, max.lag = 2)
>
> ##################
>
> # To save regression output in a LaTeX file, download and load the
> # "Hmisc" package in the same way as the "car" package described
> # above, and type

> library(Hmisc)
> output10 <- lm(price ~ floor + area + floor*area + cent.heat)
> latex(summary(output10)$coef)
>
> # You'll get something pretty close to a finished LaTeX table, with
> # standard errors, t-stats, and p-values.
> # It will be in a file called "summary.tex" in your working directory.
>
> ######################################
>
> # REFERENCES
>
> # Mark Gardener, "Using R for statistical analyses":
> # http://www.gardenersown.co.uk/Education/Lectures/R/regression.htm#what_is_R
> # Gardener is the best documentation, giving full step-by-step procedures.
>
> # Grant Farnsworth, "Econometrics in R":
> # http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
> # Farnsworth is the best longer reference for economists.
>
> # "MAT 356 R Tutorial, Spring 2004":
> # http://math.illinoisstate.edu/dhkim/rstuff/rtutor.html
> # I learned some things from it.
>
> # "An Introduction to R", by Venables, Smith, et al. is bad.
>
> # B. D. Ripley and D. J. Murdoch, "R for Windows FAQ":
> # http://cran.r-project.org/bin/windows/base/rw-FAQ.html
> # It probably won't be useful.
>
> ######################################