NMoc. 7 2012
Eric Rasmusen, erasmuse@indiana.edu
This file is now at http://www.rasmusen.org/a/stata-rasmusen.txt

SOME STATA COMMANDS I USE

This file is for notes on STATA commands that I use. I use Stata 9. See also the Indiana U. Libra and Steel computers. I can go to STATMATH for quick email answers, since I am at IU.

------------------------------------------
CATEGORICAL TO DUMMY (FACTOR, INDICATOR) VARIABLES

   xi i.RACE

This will create _IRACE_2, _IRACE_3, etc., a dummy for each value of RACE except the omitted base category. Or, use

   regress POLVIEWS i.RACE

which will use dummies for the values of RACE in the regression. (The bare i. notation needs Stata 11 or later; in older versions, prefix the command with xi: as in the BASIC STATA section.)

-------------------------------------------------------------
DROPPING DON'T KNOW, MISSING, ETC. VALUES OF VARIABLES

   foreach varname of varlist VIOLHSPS VIOLASNS {
       drop if `varname' == 8
   }

--------------------------------------------------------------
AND OR ALL

   correlate _all

will correlate all the variables in the dataset.

And is &, and Or is |

--------------------------------------
PROGRAMMING

Use the WHILE command.

----------------------------------------
STATA 12 IUanyware

This has no 12300 observation limit.

   cd \\Client\C$\__PAPERS-1stCURRENT1st-tier\Test-data\

---------------------------------------------------------------------
TIME SERIES DIAGRAMS

   insheet using jan6a.csv
   *label variable govdebtgross "Gross gov. debt"
   save jan6a

   scatter rgdp year, title("Real GDP: 1980-2010") yscale(range(0)) ytitle("Real GDP", orientation(horizontal) ) xtitle("") ylabel(#6,angle(horizontal))
   graph export rgdp80-09.png,replace

*This makes the y scale start at 0 and go to a Stata-chosen max.
*The y-scale has a horizontal title, and the 6 tick labels are horizontal too.
*There is no title on the x-axis.
*Really, given the title, the y-axis doesn't need a title.

   scatter rgdp year if (year>1989 & year <2008) , title("Real GDP: 1990-2007") yscale(range(0)) ytitle("Real GDP", orientation(horizontal) ) xtitle("") ylabel(#6,angle(horizontal))
   graph export rgdp90-07.png,replace

*Really, given the title, the y-axis doesn't need a title.
* I couldn't figure out how to make the graph axis end at 2007 instead of 2010.
*xscale(range(. 2007)) does not change the length of the axis. (range() can only widen an axis, never shorten it; setting the tick labels explicitly, e.g. xlabel(1990(5)2005 2007), may work instead.)

---------------------------------------------------------------------
ADDING A LABEL OR RELABEL

To add a more descriptive label, for use on graphs and such, do this:

   label variable gdpc "gdp per capita growth"

No replace is needed even if there is already a label.

---------------------------------------------------------------------
INPUTTING STRING DATA

   infile str24 name using temp.txt

That is the STATA command to infile a string variable of 24 letters or fewer.

---------------------------------------------------------------------
SCATTERPLOTS, FIGURES, DIAGRAMS

   scatter unrate year if (year>=1923 & year <= 1950), yscale(range(0 25)) xlabel(#10) c(l) ;
   graph save gd2.gph, replace;
   graph export gd2.tif, replace;

   scatter unrate year if (year>=1923 & year <= 1950), c(l) legend(off) || scatter unrate1 year if (year>=1923 & year <= 1950), yscale(range(0 25)) c(l) clwidth(medthick) clcolor(black) clpattern(dot) legend(off) ;
   graph save gd3.gph, replace;
   graph export gd3.tif, replace;

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
DESCRIBING DATA

   describe
is the command for listing variables and their labels.

   describe, simple
will give just the variable names, no labels.

   summarize _all
will give means for all variables.

   summarize percapincome, detail
will give medians and suchlike.

   tab percapincome
will give all the values of the variable percapincome.

   correlate var1 var2 var3
will give a correlation matrix.
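
The describing commands above can be tried on a dataset that ships with Stata. This is only a sketch: it assumes the built-in auto dataset, and the variables price, mpg, weight, and rep78 belong to that dataset, not to the data discussed in these notes.

```stata
* Load the example dataset that is installed with Stata
sysuse auto, clear

* Variable names, storage types, and labels
describe

* Just the variable names
describe, simple

* Means, standard deviations, min and max for every variable
summarize

* Medians and other percentiles for one variable
summarize price, detail

* Frequency table of the values of a categorical variable
tabulate rep78

* Correlation matrix for three variables
correlate price mpg weight
```

Note that the option is spelled detail; summarize with no variable list covers the whole dataset, which is the same as summarize _all.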

---------------------------------------------------------------------
TRIMMING OUTLIERS:

   _pctile pop, percentile(1(1)99);
   scalar p95=r(r95);
   scalar p5= r(r5);
   scalar p10=r(r10);
   scalar p90=r(r90);
   drop if pop>p95;

---------------------------------------------------------------------
HECKMAN TECHNIQUE

You can get it as a Stata command directly, but here's how to do it to replace tobit, I think.

   generate win1= 0;
   replace win1 = 1 if winrate ==100;
   probit win1 prosrate prosrate2 indexcrime ;
   predict pred1,p;
   generate pred1size=invnormal(pred1);
   generate pred1density= normalden(pred1size);
   generate invmills =pred1density/pred1;
   drop if win1==1;
   reg winrate prosrate prosrate2 indexcrime budget cpsalary invmills ;

---------------------------------------------------------------------
STATA LIST

Site for STATA regression advice: http://www.stata.com/statalist/
The archives for past solutions: http://www.stata.com/statalist/archive/

---------------------------------------------------------------------
HELP: FINDIT

If I type

   findit rreg

from the command screen, I will get Web files that are useful, including, e.g., UCLA statistics help files.

---------------------------------------------------------------------
BASIC STATA

   *jkjlkjlk
is a way to do comments.

   cd d:/mystuff/mydirectory
* Gets to the directory with your data

   insheet using crimedata.csv
*Inputs your spreadsheet data

   regress crime population prison
*Regresses crime on population and prison

   xi:regress crime prison i.state i.year
*Regresses crime on prison and dummies for states and years

   generate lnprison = ln(prison)
*Creates a new variable with a math formula

   replace lnprison = 1000*lnprison if lnprison== 1
*Replaces an observation of the variable if the condition is met

---------------------------------------------------------------------
HISTOGRAM DATA FIGURES GRAPHS FORMAT

The "scheme" for graphics called lean2 must be installed from the web. Try Stata help for lean2 to find out how (it's easy).
Then use these commands:

   set scheme lean2;

   histogram flunks if judge==0, discrete freq ylabel(,nogrid) xtitle("Flunks by Lawyers") saving(flunks-l,replace) ;
*This one has no y-axis grid lines;

   histogram flunks if judges>0, discrete freq xtitle("Flunks-- Lawyers") ytitle("First line of the title" "Continues to a second line") saving(flunks-l,replace) ;
*This one has y-axis grid lines, and a special y-title with two lines of text;

   graph export flunks-l.ps, replace;

Stata will write the y-axis label vertically rather than horizontally. I go to Adobe Illustrator to change that. Use

   help twoway_options

to find out how to use the options.

---------------------------------------------------------------------
USING WEIGHTS IN A REGRESSION

* If I want to weight observations 1 unless the judge variable equals 1, and then use .04;

   generate wj = 1;
   replace wj = .04 if judge==1 ;
   dprobit judge utokyo ukyoto Flunks413 Post93 Post93UT Post93UK Post93 Fl413 [pweight=wj]

----------------------------------------------------------------------
MERGING IN A NEW VARIABLE INTO A STATA DATASET

Mark tells me that this command does it, putting the new variables in wherever the

   joinby prefecture using "file name.dta"

----------------------------------------------------------------------
FORMATTING REGRESSION OUTPUT: OUTREG AND EXCEL

Stata's standard regression output has too many decimal places and some useless statistics. To convert one or more regressions to a table of the kind customary in economics, use one of two methods: the Outreg program, or cutting and pasting to Excel.

The best way to do it is to use outreg, with output in a file like temp.txt (with long enough lines, few carriage returns). In the temp.txt file change all the left parentheses to yyy. Then load it into Excel simply by opening temp.txt and choosing comma delimiters. Then copy into WORD or LATEX. Then change all the yyy back to left parentheses.

FIRST, OUTREG:

The add-on program outreg.ado is available from:
   http://econpapers.repec.org/software/bocbocode/s375201.htm
Documentation is at:
   http://www.kellogg.northwestern.edu/researchcomputing/docs/outreg.pdf

I think you can download and install it in one step by issuing this command in Stata:

   ssc install outreg

In my PC's Stata 7.0, though, that doesn't work. Instead, first type

   net from http://fmwww.bc.edu/repec/bocode/o/

and then type

   net install outreg

Here is how to use the program. After each regression command such as "regress yvar xvar x2 x3 x4" type, for the first regression,

   outreg using table1.txt, replace bdec(2)

and for every succeeding similar regression,

   outreg using table1.txt, append bdec(2)

What this does is to add the latest regression results as a new column in a table in the file table1.txt. The top row will have the Y variable (which can differ). Coefficients and t-stats will be to 2 decimal places.

The tabs won't work properly, though, so the table's spacing won't be right. To get proper spacing, you can also outreg into a comma-separated file, by adding an option like this:

   outreg using table1.txt, replace bdec(2) comma

Then cut and paste to MS Word. Select all the text except the notes at the bottom and click on Table, Convert Text to Table. It will format it nicely. Then if you cut and paste to a plain text file, the good spacing remains.

If you use dprobit, dtobit, etc., you will get only MARGINAL EFFECTS reported. That is usually best. Then outreg will automatically report those too. Otherwise, try:

   margin[(u|c|p)]

This option specifies that the marginal effects rather than the coefficient estimates are reported. It can be used after truncreg,marginal from STB 52 or dtobit from STB 56. One of the parameters u, c, or p is required after dtobit, corresponding to the unconditional, conditional, and probability marginal effects, respectively. It is not necessary to specify margin after dprobit, dlogit2, dprobit2, and dmlogit2.
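
Putting the outreg steps above together, a two-column table could be built roughly as follows. This is a sketch under assumptions: it uses Stata's built-in auto dataset and its variables price, mpg, weight, and foreign purely for illustration, it assumes outreg is already installed, and it uses the older outreg syntax described in these notes (option names may differ in later releases of outreg).

```stata
* Example data shipped with Stata (an assumption for illustration)
sysuse auto, clear

* First regression starts a new table file
regress price mpg weight
outreg using table1.txt, replace bdec(2)

* Each later regression is added as a new column
regress price mpg weight foreign
outreg using table1.txt, append bdec(2)
```

The replace option on the first call overwrites any earlier table1.txt, so rerunning the whole do-file rebuilds the table from scratch.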

SECOND, EXCEL

The other way to format a regression table is to cut-and-paste from the STATA log on the screen (not from the *.log file) using EDIT and then SAVE TABLE (NOT the regular Save; SAVE TABLE puts in tabs between columns). Paste it into a plain text file, or, perhaps, into MS-WORD directly (and convert text to table then, I guess). Save the plain text file as temp.txt, then open it in Excel, and it will be a nice spreadsheet with columns and rows that can be cut and pasted.

Note that in Excel you can also add columns of "(" and ")". Then save as a *.csv file, with comma delimiters. Edit that as plain text, to get the parentheses in the same columns as their contents, and load it into Excel again.

USING BOTH

Start with temp.txt. Then change all the commas to tabs (maybe this happens anyway if you don't use the COMMA option in OUTREG). Then make a spreadsheet file and pick TEXT as the default for Number Format. Then copy and paste the temp.txt file to the spreadsheet file. It will save with parentheses around the t-stats, instead of making them negative.

BETTER: in the temp.txt file change all the left parentheses to yyy. Then load it into Excel simply by opening temp.txt and choosing comma delimiters. Then copy into WORD or LATEX. Then change all the yyy back to left parentheses.

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
X-WINDOWS

If I am trying to use the Indiana U. computer remotely to use Stata 9.1, load up an x-terminal and then type:

   ssh erasmuse@libra.uits.iu.edu
or
   ssh erasmuse@steel.ucs.indiana.edu

and issue the command:

   xstata

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;

   /* This is a multiline comment.
      No semicolons are needed. */

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;

   #delimit ;
* This says that the semicolon denotes the end of a line of command.
All lines must end with semicolons after this;

   log using marginal-effects.log, replace;

   set more 1;
*This should stop the pauses;
*To keep going regardless of errors, use do myfile, nostop;

You can also try:

   set more off

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
FIXED-EFFECTS REGRESSIONS

This creates dummies out of verbal or numeric categories.

   xi: tobit yvar xvar1 xvar2 i.categoryvariable, ll(10000) ul(30000);
OR
   xi: regress yvar xvar1 xvar2 i.categoryvariable, robust;

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*PLOTTING TWO VARIABLES AGAINST EACH OTHER;

   graph murder black;
   graph murder black, symbol([state]);
   graph murder black, symbol([state]) psize(150);
*this makes the symbol size 50 percent larger than the default;

   graph score cashleft, symbol([rank]) xscale(0, 40000) yscale(30, 110)

When I tried this in Stata 9.2 in Dec. 2007 it didn't work anymore, though. For a simple scatterplot I had to use:

   twoway scatter exshpt exshare;

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*DESCRIBING THE DISTRIBUTION OF A SINGLE VARIABLE;

   histogram flunks if judge==0, discrete saving(flunks-l,replace) ;
*This is for the discrete variable flunks, if the judge variable = 0. It saves the graph as flunks-l.gph. To convert the graph to TIFF, just right-click on it in STATA and save as TIFF. ;

   graph winrate, box saving (winrate-box,replace) ;
* for a box and whiskers plot;

   graph winrate, histogram bin(20) saving (winrate-histogram,replace) ;

   kdensity winrate, saving (winrate-kdensity,replace) ;

   tab var1
This just lists all the values.

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*MISSING VALUES;

* For missing values in entries, use a period, . Other things cause trouble. Note that a missing value can be treated as being a very large number, for variable generation purposes.
Watch out for things like replace var1 =4 if var2>6, because if var2 is missing, that command will make var1=4. To be safe, exclude the missing values explicitly: replace var1 =4 if var2>6 & var2<. ;

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
INSHEET;

This is to get spreadsheet data into STATA:
* Save the spreadsheet as a tab file. Then say

   insheet using myname.txt;

OUTSHEET:

This is to get spreadsheet data out of STATA.

   outsheet using myname.csv, comma replace

That uses commas to separate the variables.

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
CONVERTING STATA 9 DATASET TO STATA 7.

Stata has set things up so that datasets saved by later versions cannot be read by earlier ones. Stata 9 on Steel will read my co-authors' files, but my Stata 7 on my PC will not. Here's how to solve that:

Upload the file smith.dta to Stata 9.

   use smith.dta
   outsheet using smith.txt

Download smith.txt to the PC with Stata 7. On the PC:

   set mem 50m
*for a big dataset
   insheet using smith.txt
   save smith1.dta

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* IF STATEMENTS AND STRING VARIABLES;

   replace appointed =1 if state =="NJ" ;
   replace appointed =1 if state =="AK" ;

LOGICAL OPERATORS:
"not equal to" is ~=
Use & instead of "and" in logical statements.
"or" is |

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* SUMMARY STATISTICS;

Do not use the SUMMARIZE or SUMMARIZE, DETAIL command. Instead, use tabstat, like this:

   tabstat budget felclosed felconv, stats(min p25 median mean p75 max) f(%7.2f) columns(statistics) ;

INSPECT is a good command too. It gives you little histograms for each variable.

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* HOW TO HAVE ROBUST STANDARD ERRORS IN TOBIT IN STATA;

The documentation at http://www.stata.com/support/faqs/stat/tobit.html is cryptic and seems to be missing some lines in the middle, so I am writing up these instructions. I still don't understand what is going on, but this at least makes the command work.

Suppose you have regressed winrate using tobit on prosrate budpros term crimerate pop.
You used tobit because winrate is censored at 0 and 100-- you have the observations at the extremes, but the values cannot be less than 0 or greater than 100. You used this command:

   tobit winrate prosrate budpros term pop, ll(0) ul(100) ;

Another way to get exactly the same results is to use INTREG, the interval regression command. This is designed for when some of your data is intervals, as when for example you know that somebody's income is between 1 and 3 million, but not exactly where in between. Here, we use it differently.

First, define two new variables winrate1 and winrate2. Each observation is in an interval [winrate1, winrate2]. Winrate1 will be winrate, or will be missing (representing negative infinity) if winrate takes its lower bound of 0. Winrate2 will be winrate, or will be missing (representing infinity) if winrate takes its upper bound of 100. Here are the Stata commands to generate those variables:

   gen winrate1=winrate;
   replace winrate1 =. if winrate <= 0;
   gen winrate2=winrate;
   replace winrate2 =. if winrate >= 100;

Suppose our data was like this: (12, 0, 50, 100, 23)
Then winrate1 would be (12, ., 50, 100, 23)
Then winrate2 would be (12, 0, 50, ., 23)
The intervals in the regression would be ([12,12], [-infinity, 0], [50,50], [100, infinity], [23,23]).

Then use the INTREG command. Its first two variables, winrate1 and winrate2, are the bounds that make up the intervals for each observation of the dependent variable, and the rest are independent variables.

   intreg winrate1 winrate2 prosrate budpros term pop;

The regression above should give exactly the same results as the tobit command specified earlier. It doesn't quite, for me, but I put that down to different maximization algorithms for the two commands.
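
To check the tobit/intreg equivalence on data anyone can load, here is a minimal sketch. It assumes Stata's built-in auto dataset; the censoring points 17 and 30 and the names mpgc, mpglo, and mpghi are made up for this illustration and are not part of the winrate example.

```stata
* Example data shipped with Stata (an assumption for illustration)
sysuse auto, clear

* Artificially censor mpg below at 17 and above at 30
generate mpgc = min(max(mpg, 17), 30)

* Two-limit tobit on the censored variable
tobit mpgc weight length, ll(17) ul(30)

* Interval bounds: a missing lower bound means minus infinity,
* a missing upper bound means plus infinity
generate mpglo = mpgc
replace mpglo = . if mpgc <= 17
generate mpghi = mpgc
replace mpghi = . if mpgc >= 30

* Same model as an interval regression
intreg mpglo mpghi weight length
```

Comparing the two coefficient tables side by side shows how close the estimates are on a given dataset.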

Now, though, we can have robust standard errors as an option to correct for heteroskedasticity:

   intreg winrate1 winrate2 prosrate budpros term pop, robust;

This last command has the exact same coefficient estimates, but different, and consistent, standard errors.

What if you only left-censor, so winrate can't be less than 0, but it can be greater than 100? Then generate the bounds like this:

   gen winrate1=winrate;
   replace winrate1 =. if winrate <= 0;
   gen winrate2=winrate;

What if you only right-censor, so winrate can be less than 0, but it can't be greater than 100? Then generate the bounds like this:

   gen winrate1=winrate;
   gen winrate2=winrate;
   replace winrate2 =. if winrate >= 100;

* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*MARGINAL EFFECTS IN TOBIT;

This might be different in Stata 9. See if dtobit, dlogit2, dprobit work. They report marginal effects instead of coefficients. Use the command dprobit, at(xxxx) if you want it calculated at xxx-- but xxx has to be a matrix, and I don't know how to get a median matrix. Also, dtobit has to be installed from the web, specially. Otherwise, use the mfx command below.

mfx is the marginal effects command. Use a dot if you are only left-censored or right-censored. predict(e(0, 100)) will give the marginal effect on the expected value of the dependent variable conditional on being uncensored, E(y|a