R version 2.10.1 (2009-12-14)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # Notes on how to use the free statistical package R in Windows.
> # I will show how to load data into it and run regressions.
>
> # The hash mark means this is a comment line.
> # February 24, 2010.
> # Professor Eric Rasmusen, erasmuse@Indiana.edu.
> # Some of this draft is untested.
>
> # R is a wonderfully crafted statistics package with
> # amazingly bad documentation.
> # These notes show how to make it do the basics of regressions.
>
> ######################################
>
> # To download R, go to the Comprehensive R Archive Network
> # at http://cran.case.edu/
> # Download it and install it on your computer, which is easy to do.
>
> ######################################
>
> # GETTING STARTED
>
> # Click on the R icon that the installation program created on
> # your desktop. A window will open up.
>
> # Find out what directory R is writing files to, reset it to the
> # D: root directory, and check again to see that it worked:

> getwd()
[1] "D:/My Documents"
> setwd("D:\\")
> getwd()
[1] "D:/"
>
> # Notice how backslashes must be doubled for R to read them.
> # (Single forward slashes work too.)
>
> # A subdirectory would look like
> #   "D:\\_Take-to-office\\rfiles"
>
> # Now let's fix it so R doesn't show us lots of meaningless digits:
> # discourage scientific notation (at intensity 6) and suggest
> # 4 significant digits (there is no way to *require* only 4).

> options(scipen = 6)
> options(digits = 4)
>
> # Only uncomment the next line if you know why. (Note "Hmisc" is the
> # package's actual capitalization, and defaultPackages only takes
> # effect when a new session starts.)
> # options(defaultPackages = c(getOption("defaultPackages"), "car", "Hmisc"))
>
> ######################################
>
> # READING IN DATA
>
> # A data file is just plain text. The format is like this
> # file called r-data.txt (only the format matters here; the
> # actual file used below has 30 observations):
>
> #   price floor area rooms age cent.heat state
> # 1 52.00 111.0  830     5 6.2       no  Illinois
> # 2 60    128.0  710     5 7.5       no  Illinois
> # 3 35    101.0 1000     5 4.2       no  Illinois
> # 4 50    131.0  690     6 8.8       no  Indiana
> # 5 20     93.0  900     5 1         yes Ohio
> # 6 57    101.0 1000     5 4         no  Illinois
> # 7 80    100    690     6 8         no  Indiana
> # 8 30     90    800     5 2         yes Ohio
>
> # Notice the lack of an observation number on the first line;
> # because the header row is one field shorter than the data rows,
> # read.table() knows it is a header and treats the first column
> # as observation names.
> # Capitalization matters: price and Price could be different variables.
> # Spacing and tab symbols don't matter, but linebreaks do
> # (each line is one observation).
> # Notice how binary yes/no variables and string variables
> # are easily accommodated.
>
> # The first read-in command creates an object called "data2":

> data2 <- read.table("r-data.txt")
>
> # That uses the file D:/r-data.txt, since we set D:/ as
> # the working directory.
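>
> # An untested aside: read.table() takes explicit arguments if the
> # automatic guessing ever goes wrong. Here "r-data2.txt" is a
> # hypothetical file whose header row also names the observation
> # column, so nothing needs to be guessed:
>
> # data2b <- read.table("r-data2.txt", header = TRUE, row.names = 1)
> # str(data2b)   # shows each column's name and type, to check the read-in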
>
> # In case we want to use lags or time series functions, transform
> # the dataset thus:

> tdata2 <- ts(data2)
>
> # Then create some lags if you want--- one and four periods here:

> temp1 <- lag(tdata2, -1)
> temp4 <- lag(tdata2, -4)
> tdata3 <- ts.union(tdata2, temp1, temp4)
>
> # Now convert it back to a normal dataset:

> data3 <- data.frame(tdata3)
>
> # Unfortunately, R has given stupid names to many of the
> # variables. To see the stupid names, type

> colnames(data3)
 [1] "tdata2.price"     "tdata2.floor"     "tdata2.area"      "tdata2.rooms"     "tdata2.age"
 [6] "tdata2.cent.heat" "tdata2.state"     "temp1.price"      "temp1.floor"      "temp1.area"
[11] "temp1.rooms"      "temp1.age"        "temp1.cent.heat"  "temp1.state"      "temp4.price"
[16] "temp4.floor"      "temp4.area"       "temp4.rooms"      "temp4.age"        "temp4.cent.heat"
[21] "temp4.state"
>
> # Thus, let's convert at least some of them back.
> # R is very bad at renaming.
> # First, let's get R to put our original names in quotes:

> colnames(data2)
[1] "price"     "floor"     "area"      "rooms"     "age"       "cent.heat" "state"
>
> # Then let's cut-and-paste those into a command to fix up data3:

> colnames(data3)[1:7] <- c("price", "floor", "area",
+   "rooms", "age", "cent.heat", "state")
>
> # Cut and paste from that last command and insert 1's and 4's:

> colnames(data3)[8:14] <- c("price1", "floor1", "area1",
+   "rooms1", "age1", "cent.heat1", "state1")

> colnames(data3)[15:21] <- c("price4", "floor4", "area4",
+   "rooms4", "age4", "cent.heat4", "state4")
>
> # Now look at your dataset to see if the names and lags turned
> # out right. Notice that the cent.heat and state variables have
> # been made numerical (ts() coerced the factors to their codes).

> data3
   price floor area rooms age cent.heat state price1 floor1 area1 rooms1 age1 cent.heat1 state1 price4 floor4
1     52   100  830     5 6.2         1     1     NA     NA    NA     NA   NA         NA     NA     NA     NA
2     60   148  210     5 7.5         1     1     52    100   830      5  6.2          1      1     NA     NA
3     35    50 1000     8 4.2         1     1     60    148   210      5  7.5          1      1     NA     NA
4     50   102  690     6 8.8         1     2     35     50  1000      8  4.2          1      1     NA     NA
5     20    38  900     8 2.0         2     3     50    102   690      6  8.8          1      2     52    100
6     57   118 1000     5 4.0         1     1     20     38   900      8  2.0          2      3     60    148
7     80   160  690     8 8.0         1     2     57    118  1000      5  4.0          1      1     35     50
8     30    90  800     6 4.0         1     3     80    160   690      8  8.0          1      2     50    102
9     40    84  800     4 4.0         2     3     30     90   800      6  4.0          1      3     20     38
10    45    70  900     5 2.0         2     3     40     84   800      4  4.0          2      3     57    118
11    52   112  830     5 6.8         1     1     45     70   900      5  2.0          2      3     80    160
12    60   148  710     5 7.5         1     1     52    112   830      5  6.8          1      1     30     90
13    35    44 1000     5 4.2         1     1     60    148   710      5  7.5          1      1     40     84
14    50   100  690     6 8.8         1     2     35     44  1000      5  4.2          1      1     45     70
15    22    38  990     8 1.0         2     3     50    100   690      6  8.8          1      2     52    112
16    57   128 1000     5 4.0         1     1     22     38   990      8  1.0          2      3     60    148
17    80   154  290     8 8.0         2     2     57    128  1000      5  4.0          1      1     35     44
18    32    48  800     5 3.0         1     3     80    154   290      8  8.0          2      2     50    100
19    42    64  800     4 4.0         2     3     32     48   800      5  3.0          1      3     22     38
20    48   160  440     5 2.0         1     3     42     64   800      4  4.0          2      3     57    128
21    54   126  690     6 8.8         1     2     48    160   440      5  2.0          1      3     80    154
22    24    36  400     5 1.0         2     3     54    126   690      6  8.8          1      2     32     48
23    57   120 1000     5 4.0         1     1     24     36   400      5  1.0          2      3     42     64
24    84   160  690     6 8.0         1     2     57    120  1000      5  4.0          1      1     48    160
25    32    90  400     5 3.0         2     3     84    160   690      6  8.0          1      2     54    126
26    42    64  900     4 4.0         2     3     32     90   400      5  3.0          2      3     24     36
27    44    60  700     5 2.0         2     3     42     64   900      4  4.0          2      3     57    120
28    34    56  400    14 3.0         1     3     44     60   700      5  2.0          2      3     84    160
29    42    64  900     8 4.0         1     3     34     56   400     14  3.0          1      3     32     90
30    44    62  100     5 2.0         1     3     42     64   900      8  4.0          1      3     42     64
31    NA    NA   NA    NA  NA        NA    NA     44     62   100      5  2.0          1      3     44     60
32    NA    NA   NA    NA  NA        NA    NA     NA     NA    NA     NA   NA         NA     NA     34     56
33    NA    NA   NA    NA  NA        NA    NA     NA     NA    NA     NA   NA         NA     NA     42     64
34    NA    NA   NA    NA  NA        NA    NA     NA     NA    NA     NA   NA         NA     NA     44     62
   area4 rooms4 age4 cent.heat4 state4
1     NA     NA   NA         NA     NA
2     NA     NA   NA         NA     NA
3     NA     NA   NA         NA     NA
4     NA     NA   NA         NA     NA
5    830      5  6.2          1      1
6    210      5  7.5          1      1
7   1000      8  4.2          1      1
8    690      6  8.8          1      2
9    900      8  2.0          2      3
10  1000      5  4.0          1      1
11   690      8  8.0          1      2
12   800      6  4.0          1      3
13   800      4  4.0          2      3
14   900      5  2.0          2      3
15   830      5  6.8          1      1
16   710      5  7.5          1      1
17  1000      5  4.2          1      1
18   690      6  8.8          1      2
19   990      8  1.0          2      3
20  1000      5  4.0          1      1
21   290      8  8.0          2      2
22   800      5  3.0          1      3
23   800      4  4.0          2      3
24   440      5  2.0          1      3
25   690      6  8.8          1      2
26   400      5  1.0          2      3
27  1000      5  4.0          1      1
28   690      6  8.0          1      2
29   400      5  3.0          2      3
30   900      4  4.0          2      3
31   700      5  2.0          2      3
32   400     14  3.0          1      3
33   900      8  4.0          1      3
34   100      5  2.0          1      3
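>
> # An alternative to the cut-and-paste renaming above, untested in
> # this session: build all 21 names at once with paste().
>
> # colnames(data3) <- c(colnames(data2),
> #                      paste(colnames(data2), "1", sep = ""),
> #                      paste(colnames(data2), "4", sep = ""))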
>
> # Make this dataset the default dataset to use for this session:

> attach(data3)
>
> ######################################
>
> # SUMMARY STATISTICS AND PLOTTING
>
> # We can get some summary statistics.
> # Use your original dataset without the lags, thus:

> summary(data2)
     price          floor            area          rooms            age       cent.heat       state   
 Min.   :20.0   Min.   : 36.0   Min.   : 100   Min.   : 4.00   Min.   :1.00   no :20    Illinois: 9  
 1st Qu.:35.0   1st Qu.: 60.5   1st Qu.: 690   1st Qu.: 5.00   1st Qu.:3.00   yes:10    Indiana : 6  
 Median :44.5   Median : 90.0   Median : 800   Median : 5.00   Median :4.00             Ohio    :15  
 Mean   :46.8   Mean   : 93.1   Mean   : 718   Mean   : 5.97   Mean   :4.66                          
 3rd Qu.:56.2   3rd Qu.:124.5   3rd Qu.: 900   3rd Qu.: 6.00   3rd Qu.:7.33                          
 Max.   :84.0   Max.   :160.0   Max.   :1000   Max.   :14.00   Max.   :8.80                          
>
> # To get a correlation matrix, again use your original dataset
> # without the lags:

> cor(data2)
             price   floor    area     rooms       age cent.heat state
price      1.00000  0.8666 -0.1688 -0.075465  0.720971        NA    NA
floor      0.86662  1.0000 -0.2398 -0.133545  0.671402        NA    NA
area      -0.16884 -0.2398  1.0000 -0.127165 -0.091508        NA    NA
rooms     -0.07547 -0.1335 -0.1272  1.000000 -0.000967        NA    NA
age        0.72097  0.6714 -0.0915 -0.000967  1.000000        NA    NA
cent.heat       NA      NA      NA        NA        NA         1    NA
state           NA      NA      NA        NA        NA        NA     1
Warning message:
In cor(data2) : NAs introduced by coercion
>
> # The NAs appear because cent.heat and state are factors, not numbers.
> # The cor() command also ignores our request for only 4 digits;
> # round it yourself, or read it into Excel and change it there:
>
> # round(cor(data2[, 1:5]), 2)   # numeric columns only, 2 decimals
>
> # To plot two variables, type

> plot(price, floor, main="Figure 2: Price and Floor")
>
> # To save a figure as a jpg file, go to FILE, SAVE-AS, jpg on the
> # top menu.
>
> ######################################
>
> # RUNNING A REGRESSION
>
> # To do a regression, create a regression output object called
> # "output4". (floor*area is shorthand for floor + area + floor:area,
> # so it adds the interaction to the two variables already listed.)

> output4 <- lm(price ~ price1 + floor + area + floor*area +
+   cent.heat)
>
> # Then display the output:

> summary(output4)

Call:
lm(formula = price ~ price1 + floor + area + floor * area + cent.heat)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.006  -4.561  -0.857   4.687  15.957 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 21.3477199 13.9330798    1.53   0.1391   
price1      -0.1088636  0.1132283   -0.96   0.3463   
floor        0.3154225  0.1119530    2.82   0.0098 **
area        -0.0024662  0.0158005   -0.16   0.8773   
cent.heat    0.5845869  3.9450931    0.15   0.8835   
floor:area   0.0000322  0.0001680    0.19   0.8499   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 8.74 on 23 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.765,	Adjusted R-squared: 0.714 
F-statistic:   15 on 5 and 23 DF,  p-value: 0.00000136 
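>
> # Not shown in the original notes, and untested here: confint()
> # prints confidence intervals for the coefficients, which
> # summary() omits.
>
> # confint(output4, level = 0.95)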
>
> # If you want the residuals, variance-covariance matrix, or
> # predicted values, type

> residuals5 <- output4$resid
> residuals5
       2        3        4        5        6        7        8        9 
 -3.4354   4.6868  -0.8566  -7.9399  -1.3030  11.9570 -11.9535  -5.9343 
      10       11       12       13       14       15       16       17 
  4.9519  -1.3030  -4.5818   6.7723  -0.1814  -5.8279  -4.5611  14.3924 
      18       19       20       21       22       23       24       25 
  4.3747   3.1063 -21.0063  -3.5440  -3.4700  -1.5627  15.9570  -9.9315 
      26       27       28       29       30 
  3.1472   7.5059  -0.5398   3.9495   7.1310 

> vcov5 <- vcov(output4)
> vcov5
            (Intercept)       price1       floor         area   cent.heat
(Intercept)  194.130712 -0.607870644 -0.97602197 -0.139641489 -24.0600659
price1        -0.607871  0.012820657 -0.00302164 -0.000312992   0.0738479
floor         -0.976022 -0.003021641  0.01253348  0.001548094  -0.0414259
area          -0.139641 -0.000312992  0.00154809  0.000249656  -0.0119694
cent.heat    -24.060066  0.073847934 -0.04142587 -0.011969435  15.5637593
floor:area     0.000937  0.000006449 -0.00001719 -0.000002378   0.0001890
               floor:area
(Intercept)  0.00093699746
price1       0.00000644884
floor       -0.00001718529
area        -0.00000237768
cent.heat    0.00018903695
floor:area   0.00000002822

> pred5 <- output4$fitted.values
> pred5
    2     3     4     5     6     7     8     9    10    11    12    13    14 
63.44 30.31 50.86 27.94 58.30 68.04 41.95 45.93 40.05 53.30 64.58 28.23 50.18 
   15    16    17    18    19    20    21    22    23    24    25    26    27 
27.83 61.56 65.61 27.63 38.89 69.01 57.54 27.47 58.56 68.04 41.93 38.85 36.49 
   28    29    30 
34.54 38.05 36.87 
>
> # The summary(), residuals(), and vcov() commands also partly ignore
> # our request for only 4 digits.
> # To trim them, cut-and-paste into Excel (or wrap them in round()).
>
> ######################################
>
> # ODDS AND ENDS
>
> # We can plot a data histogram.
>
> # First get rid of all the observations with missing data:

> data3a <- na.omit(data3)
> attach(data3a)

	The following object(s) are masked from data3 :

	 age age1 age4 area area1 area4 cent.heat cent.heat1 cent.heat4 floor floor1 floor4 price price1 price4 rooms rooms1 rooms4 state state1 state4 

>
> # (Each attach() adds another dataset to the search path, which is
> # why R warns that names are masked; detach() removes one again.)
>
> # Then create and plot the histogram. (hist() draws a plot by
> # itself; the second command redraws it with a title.)

> var234 <- hist(price)
> plot(var234, main="Figure 1: A histogram of price")
>
> # We can also do kernel densities, for variables that
> # are close to continuous. Not true here, but try it anyway:

> var123 <- density(price)
> plot(var123, main="Figure 3: A kernel density estimate of price")
>
> # Finally, let's go back to our original dataset:

> attach(data3)

	The following object(s) are masked from data3a :

	 age age1 age4 area area1 area4 cent.heat cent.heat1 cent.heat4 floor floor1 floor4 price price1 price4 rooms rooms1 rooms4 state state1 state4 

	The following object(s) are masked from data3 ( position 4 ) :

	 age age1 age4 area area1 area4 cent.heat cent.heat1 cent.heat4 floor floor1 floor4 price price1 price4 rooms rooms1 rooms4 state state1 state4 

>
> ##################
>
> # If you don't want an intercept in a regression, put a 0 into
> # the formula:

> output6 <- lm(price ~ 0 + floor + area + cent.heat)
> summary(output6)

Call:
lm(formula = price ~ 0 + floor + area + cent.heat)

Residuals:
   Min     1Q Median     3Q    Max 
-20.70  -3.80   1.97   5.18  15.62 

Coefficients:
          Estimate Std. Error t value Pr(>|t|)    
floor      0.38387    0.02893   13.27  2.4e-13 ***
area       0.00793    0.00492    1.61     0.12    
cent.heat  3.79167    2.48627    1.53     0.14    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 8.54 on 27 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.973,	Adjusted R-squared: 0.97 
F-statistic:  326 on 3 and 27 DF,  p-value: <2e-16 
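>
> # A caution not in the original notes: without an intercept,
> # R-squared is computed around zero instead of around mean(price),
> # so the 0.973 here is not comparable to output4's 0.765. An
> # untested way to test the intercept directly is to fit the same
> # model with one and compare by F-test:
>
> # output6a <- lm(price ~ floor + area + cent.heat)
> # anova(output6, output6a)   # F-test of the restriction (intercept) = 0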
>
> ##################
>
> # To plot two variables and the regression line for a
> # TWO-VARIABLE regression:

> output5 <- lm(price ~ floor)
> plot(floor, price)
> abline(output5)
>
> ##################
>
> # To put in fixed state effects, wrap the grouping variable in
> # factor(). (Recall that ts() turned state into the codes 1, 2, 3;
> # factor() turns those back into categories, and the dummy
> # variables are then created automatically.)

> output7 <- lm(price ~ floor + area + age + factor(state))
> summary(output7)

Call:
lm(formula = price ~ floor + area + age + factor(state))

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2520  -6.1342  -0.0477   7.2591  10.6495 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    25.3487029 10.9750612    2.31     0.03 *  
floor           0.2611227  0.0499754    5.23 0.000024 ***
area            0.0000736  0.0064021    0.01     0.99    
age            -0.3401386  1.2951651   -0.26     0.80    
factor(state)2  8.8925386  5.5317332    1.61     0.12    
factor(state)3 -6.2284701  5.0679449   -1.23     0.23    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 7.63 on 24 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.814,	Adjusted R-squared: 0.775 
F-statistic:   21 on 5 and 24 DF,  p-value: 0.0000000463 
>
> ##################
>
> # To run a logit, convert the y-variable to have only values of
> # 0 or 1 thus (cent.heat is coded 1/2 in data3):

> cent.heat <- cent.heat - 1
> cent.heat
 [1]  0  0  0  0  1  0  0  0  1  1  0  0  0  0  1  0  1  0  1  0  0  1  0  0  1
[26]  1  1  0  0  0 NA NA NA NA
>
> # Then run the regression:

> output8 <- glm(cent.heat ~ floor + area, family = binomial(link =
+   "logit"))
> summary(output8)

Call:
glm(formula = cent.heat ~ floor + area, family = binomial(link = "logit"))

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.489  -0.777  -0.492   1.046   2.068  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.50676    1.87313    1.34    0.181  
floor       -0.02709    0.01265   -2.14    0.032 *
area        -0.00120    0.00178   -0.67    0.500  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 38.191  on 29  degrees of freedom
Residual deviance: 32.279  on 27  degrees of freedom
  (4 observations deleted due to missingness)
AIC: 38.28

Number of Fisher Scoring iterations: 4
>
> ##################
>
> # If you have your data in a spreadsheet form, save it as a *.csv
> # file, which is comma-separated. Then your R data input command
> # would be

> data2a <- read.csv("D:\\_Take-to-office\\r-data.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'D:\_Take-to-office\r-data.csv': No such file or directory
>
> # I haven't tested this one; the error just means no file by that
> # name exists on this machine.
>
> ##################
>
> # We can write the data out to a file called data3.csv,
> # no variable column names, comma-separated:

> write.table(data3, col.names = FALSE, sep = ",",
+   file = "data3.csv")
>
> # We can write the data out to a file called data3.txt,
> # with variable column names, space-separated:

> write.table(data3, col.names = TRUE, sep = " ",
+   file = "data3.txt")
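>
> # An untested check of the round trip: read data3.txt back in.
> # As with r-data.txt, the header row is one field shorter than the
> # data rows (write.table kept the row names), so read.table() uses
> # the first column as row names.
>
> # check3 <- read.table("data3.txt", header = TRUE)
> # summary(check3$price)   # should match summary(data3$price)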
>
> ######################################
>
> # USING PACKAGES FOR SPECIAL COMMANDS
>
> ##################
>
> # To do an F-test for b2 + b3 = 1, first download the "car" package.
> # On the top menu, pick
> #   Packages, Select Repository, (pick any site), Install Package, car
>
> # Then, whenever you want to use "car", load it either from the menu
> # (Packages, Load Package, car) or by typing

> library(car)
>
> # Finally, type the commands below. (The errors in my original
> # session came from never loading the package; note also that recent
> # versions of car call the function linearHypothesis() rather than
> # linear.hypothesis(). Output not shown; this version is untested.)
> # The row c(0,0,1,1) picks out the area and cent.heat coefficients;
> # the intercept comes first.

> output6 <- lm(price ~ floor + area + cent.heat)
> rhs <- c(1)
> restricted6 <- rbind(c(0, 0, 1, 1))
> linearHypothesis(output6, restricted6, rhs)
>
> ##################
>
> # To run a Durbin-Watson test for autocorrelation, load the car
> # package as described above and then type (again, recent versions
> # of car call this durbinWatsonTest() rather than durbin.watson()):

> output9 <- lm(price ~ floor + area)
> durbinWatsonTest(output9, max.lag = 2)
>
> ##################
>
> # To save regression output in a LaTeX file, download and load the
> # "Hmisc" package in the same way as the "car" package described
> # above, and type

> library(Hmisc)
> output10 <- lm(price ~ floor + area + floor*area + cent.heat)
> latex(summary(output10)$coef)
>
> # You'll get something pretty close to a finished LaTeX table, with
> # standard errors, t-stats, and p-values.
> # It will be in a file called "summary.tex" in your working directory.
>
> ######################################
>
> # REFERENCES
>
> # Mark Gardener, "Using R for statistical analyses":
> # http://www.gardenersown.co.uk/Education/Lectures/R/regression.htm#what_is_R
> # Gardener is the best documentation, giving full step-by-step procedures.
>
> # Grant Farnsworth, "Econometrics in R":
> # http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
> # Farnsworth is the best longer reference for economists.
>
> # "MAT 356 R Tutorial, Spring 2004":
> # http://math.illinoisstate.edu/dhkim/rstuff/rtutor.html
> # I learned some things from it.
>
> # "An Introduction to R", by Venables, Smith, et al. is bad.
>
> # B. D. Ripley and D. J. Murdoch, "R for Windows FAQ":
> # http://cran.r-project.org/bin/windows/base/rw-FAQ.html
> # It probably won't be useful.
>
> ######################################