September 16, 2017
Eric Rasmusen, erasmuse@indiana.edu
This file is now at http://www.rasmusen.org/a/stata-rasmusen.txt

SOME STATA COMMANDS I USE

This file is for notes on STATA commands that I use. I use Stata 13. I can go to STATMATH for quick email answers, since I am at IU.
___________________________
STATA 13 has a do-file editor. It doesn't reload do files if you don't use that editor, so don't try making changes with textpad outside and then running the do file again.
--------------------------
RECODE
To replace missing values, if 9 denotes missing value for all variables, use
recode * (9 = .)
--------------------------
MAPS
* http://www.stata.com/support/faqs/graphics/spmap-and-maps/
* For installing stata packages at IU: https://kb.iu.edu/d/azwf
* Do these next three commands just one time-- then they are installed;
net set ado "J:\Burakumin-Ramseyer\data-and-regressions\ado-files";
* ssc install spmap;
* ssc install shp2dta;
* ssc install mif2dta;
* You may have to install each into a separate, new, directory and then move them to ado-files.;
* You don't need the created subdirectories, and they may even block it from working;
adopath + "J:\Burakumin-Ramseyer\data-and-regressions\ado-files";
* Find the mapshape files for your country, two of them, and put them in the folder.;
shp2dta using jp_grid_ken_pgn, database(usdb) coordinates(uscoord) genid(id);
* Note: the FAQ file has a bad typo---gen(id) instead of genid(id). I told STATA about it;
use usdb, clear;
describe;
--------------------------
IUANYWARE
Installing MODULE PACKAGES needs special care, since they have to be installed locally. Suppose you are already in your working directory, e.g. you have already typed the command
cd "J:\spectrum-liberalism-index\CCES2012data"
Then to install the gvselect module, use these commands:
net set ado "stata-packages"
ssc install gvselect
adopath + "stata-packages"
For a new module, the best source for info on the command is a Stata Journal article, if that is available.
Google scholar has them.
--------------------------
regress yvar x1 x2 x3
predict fittedvalues1, xb
*That gives the PREDICTED VALUES. The xb is optional; it is the default.
predict resvar1, residuals
*That gives the residuals.
--------------------------
local myvars v1 v2 v3 v4 v5
display "`myvars'"
* This is a macro: it creates a list of variables called myvars. They cannot be on separate lines, even if there is a semicolon ; to end. Then I can run
regress spectrum `myvars'
instead of typing them all out. Because the macro is LOCAL, it does not survive the running of the do file that defines it. You have to put it in each do file.
--------------------------
SEEING DATA VALUES.
list var1 var2 in 1/10
will list the first 10 values of those two variables.
--------------------------
findit mdesc
*This looks on the web for the mdesc add-on
mdesc
*This shows the number of missing observations for each variable in the dataset.
--------------------------
SUPPRESSING TERMINAL OUTPUT: quietly and noisily turn output off and on.
set output inform
is useless-- it does NOT keep good terminal output.
Use quietly in front of every command to be kept quiet, e.g.
quietly regress yvar x1 x2 x3;
To keep a whole run of commands quiet, put them inside a quietly { } block.
--------------------------
ds, alpha
lists all variables in alphabetical order, which is what the alpha option does.
save foo, replace
saves the data in foo.dta, replacing whatever's there.
saveold autoold
saves the data from STATA 13 for STATA 11 and 12 to read.
_all is a term to mean all variables.
export excel using myfile.xlsx, firstrow(variables) replace
export delimited using myfile.csv, replace
*no firstrow option needed
capture log close
*At the start of a do-file to close open logs.
------------------------------------------
BEST SUBSET REGRESSION
gvselect <term> ximmigpatrol ximmigpolice ximmigbusi, nmodels(1): regress conservative7 <term>;
*What this does is to select the best 1, 2, 3, etc.
subsets of the variables ximmigpatrol ximmigpolice ximmigbusi when conservative7 is regressed on them. <term> is meant literally; do not insert anything there.;
*But vselect is better for linear regression;
------------------------------------------
CATEGORICAL TO DUMMY (FACTOR, INDICATOR) VARIABLES
xi i.RACE
This will create _IRACE_2, _IRACE_3, etc., a dummy for each value except the first, which is omitted as the base category.
Or, use
regress POLVIEWS i.RACE
which will create dummies for the values of RACE (less a base category) and use them in the regression.
-------------------------------------------------------------
DROPPING DON'T KNOW, MISSING, ETC. VALUES OF VARIABLES
foreach varname of varlist VIOLHSPS VIOLASNS {
    drop if `varname' == 8
}
--------------------------------------------------------------
AND OR ALL
correlate _all
will correlate all the variables in the dataset.
And is & and Or is |
________________________________
--------------------------------------
PROGRAMMING
Use the WHILE command.
---------------------------------------------------------------------
TIME SERIES DIAGRAMS
insheet using jan6a.csv
*label variable govdebtgross "Gross gov. debt"
save jan6a
scatter rgdp year, title("Real GDP: 1980-2010") yscale(range(0)) ytitle("Real GDP", orientation(horizontal)) xtitle("") ylabel(#6, angle(horizontal))
graph export rgdp80-09.png, replace
*This makes the y scale start at 0 and go to a Stata-chosen max
*The y-scale has a horizontal title, and the 6 tick labels are horizontal too.
*There is no title on the x-axis.
*Really, given the title, the y-axis doesn't need a title.
scatter rgdp year if (year>1989 & year <2008), title("Real GDP: 1990-2007") yscale(range(0)) ytitle("Real GDP", orientation(horizontal)) xtitle("") ylabel(#6, angle(horizontal))
graph export rgdp90-07.png, replace
*Really, given the title, the y-axis doesn't need a title.
* I couldn't figure out how to make the graph axis end at 2007 instead of 2010
*xscale(range(. 2007)) does not change the length of the axis.
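On the axis-length puzzle just noted, here is a minimal sketch of one thing to try. It rests on two assumptions that are not from the notes above: that xscale(range()) can only widen an axis, never shrink it, and that the axis runs out to 2010 because Stata's automatically chosen tick labels do. If so, spelling out the tick labels with xlabel() so that the last one is 2007 should make the axis stop there. The data are made up for illustration.

```stata
* Made-up data for illustration only; rgdp here is not real GDP
clear
set obs 31
generate year = 1979 + _n
generate rgdp = 5000 + 200*_n
* Explicit tick labels ending at 2007, so that no automatic label at 2010 stretches the axis
scatter rgdp year if (year>1989 & year<2008), title("Real GDP: 1990-2007") yscale(range(0)) ytitle("") xtitle("") xlabel(1990 1995 2000 2005 2007) ylabel(#6, angle(horizontal))
```

The list inside xlabel() is an ordinary numlist, so the tick positions can be changed freely as long as the largest one is 2007.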
---------------------------------------------------------------------
ADDING A LABEL OR RELABEL
To add a more descriptive label, for use on graphs and such, do this:
label variable gdpc "gdp per capita growth"
No replace is needed even if there is already a label.
---------------------------------------------------------------------
INPUTTING STRING DATA
infile str24 name using temp.txt
That is the STATA command to infile a 24-letter-or-less string variable.
---------------------------------------------------------------------
SCATTERPLOTS, FIGURES, DIAGRAMS
scatter unrate year if (year>=1923 & year <= 1950), yscale(range(0 25)) xlabel(#10) c(l) ;
graph save gd2.gph, replace;
graph export gd2.tif, replace;
scatter unrate year if (year>=1923 & year <= 1950), c(l) legend(off) || scatter unrate1 year if (year>=1923 & year <= 1950), yscale(range(0 25)) c(l) clwidth(medthick) clcolor(black) clpattern(dot) legend(off) ;
graph save gd3.gph, replace;
graph export gd3.tif, replace;
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
DESCRIBING DATA
describe is the command for listing variables and their labels
describe, simple will give just the variable names, no labels
summarize _all will give means for all variables
summarize percapincome, detail will give medians and suchlike
tab percapincome will give all the values of the variable percapincome
correlate var1 var2 var3 will give a correlation matrix.
------------------------------------------------------------------
TRIMMING OUTLIERS:
_pctile pop, percentile(1(1)99);
scalar p95=r(r95);
scalar p5= r(r5);
scalar p10=r(r10);
scalar p90=r(r90);
drop if pop>p95;
------------------------------------------------------------------
HECKMAN TECHNIQUE
You can get it as a stata command directly, but here's how to do it to replace tobit, I think.
generate win1= 0;
replace win1 = 1 if winrate ==100;
probit win1 prosrate prosrate2 indexcrime ;
predict pred1,p;
generate pred1size=invnormal(pred1);
generate pred1density= normalden(pred1size);
generate invmills =pred1density/pred1;
drop if win1==1;
reg winrate prosrate prosrate2 indexcrime budget cpsalary invmills;
---------------------------------------------------------------------
STATA LIST
Site for STATA regression advice: http://www.stata.com/statalist/
The archives for past solutions: http://www.stata.com/statalist/archive/
---------------------------------------------------------------------
HELP: FINDIT
If I type
findit rreg
from the command screen, I will get Web files that are useful, including, e.g. UCLA statistics help files.
---------------------------------------------------------------------
BASIC STATA
*jkjlkjlk
is a way to do comments
cd d:/mystuff/mydirectory
* Gets to the directory with your data
insheet using crimedata.csv
*Inputs your spreadsheet data
regress crime population prison
*Regresses crime on population and prison
xi:regress crime prison i.state i.year
*Regresses crime on prison and dummies for states and years
generate lnprison = ln(prison)
*Creates a new variable with a math formula
replace logprison = 1000*lnprison if logprison== 1
*Replaces an observation of the variable if the condition is met
---------------------------------------------------------------------
HISTOGRAM DATA FIGURES GRAPHS FORMAT
The "scheme" for graphics called lean2 must be installed from the web. (Try stata help for lean2 to find out how; it's easy.)
Then use these commands:
set scheme lean2;
histogram flunks if judge==0, discrete freq ylabel(,nogrid) xtitle("Flunks by Lawyers") saving(flunks-l,replace) ;
*This one has no y-axis grid lines;
histogram flunks if judges>0, discrete freq xtitle("Flunks-- Lawyers") ytitle("First line of the title" "Continues to a second line") saving(flunks-l,replace) ;
*This one has y-axis grid lines, and a special y-title with two lines of text;
graph export flunks-l.ps, replace;
Stata will make the y-axis label vertical writing, rather than horizontal. I go to Adobe Illustrator to change that.
Use
help twoway_options
to find out how to use the options.
---------------------------------------------------------------------
USING WEIGHTS IN A REGRESSION
* If I want to weight observations 1 unless the judge variable equals 1, and then use .04;
generate wj = 1;
replace wj = .04 if judge==1 ;
dprobit judge utokyo ukyoto Flunks413 Post93 Post93UT Post93UK Post93 Fl413 [pweight=wj]
----------------------------------------------------------------------
MERGING IN A NEW VARIABLE INTO A STATA DATASET
Mark tells me that this command does it, putting the new variables in wherever the prefecture values match:
joinby prefecture using "file name.dta"
----------------------------------------------------------------------
FORMATTING REGRESSION OUTPUT: OUTREG AND EXCEL
Stata's standard regression output has too many decimal places and some useless statistics. To convert one or more regressions to a table form of the kind customary in economics, use one of two methods: the Outreg program, or cutting and pasting to Excel.
The best way to do it is to use outreg, with output in a file like temp.txt (with long enough lines, few carriage returns). In the temp.txt file change all the left parentheses to yyy. Then load it into Excel simply by opening temp.txt and choosing comma delimiters. Then copy into WORD or LATEX. Then change all the yyy to left parentheses.
FIRST, OUTREG:
The add-on program outreg.ado is available from:
http://econpapers.repec.org/software/bocbocode/s375201.htm
Documentation is at:
http://www.kellogg.northwestern.edu/researchcomputing/docs/outreg.pdf
I think you can download and install it in one step by issuing the command in stata
ssc install outreg
In my PC's Stata 7.0, though, that doesn't work. Instead, first type
net from http://fmwww.bc.edu/repec/bocode/o/
and then type
net install outreg
Here is how to use the program. After each regression command such as "regress yvar xvar x2 x3 x4" type, for the first regression,
outreg using table1.txt, replace bdec(2)
and for every succeeding similar regression,
outreg using table1.txt, append bdec(2)
What this does is to add the latest regression results as a new column in a table in the file table1.txt. The top row will have the Y variable (which can differ). Coefficients and t-stats will be to 2 decimal places. The tabs won't work properly, though, so the table's spacing won't be right.
To get proper spacing, you can also outreg into a comma-separated file, by adding an option like this:
outreg using table1.txt, replace bdec(2) comma
Then cut and paste to MS Word. Select all the text except the notes at the bottom and click on Table, Convert Text to Table. It will format it nicely. Then if you cut and paste to a plain text file, the good spacing remains.
If you use dprobit, dtobit, etc., you will get only MARGINAL EFFECTS reported. That is usually best. Then outreg will automatically report those too. Otherwise, try:
margin[(u|c|p)] specifies that the marginal effects rather than the coefficient estimates are reported. It can be used after truncreg,marginal from STB 52 or dtobit from STB 56. One of the parameters u, c, or p is required after dtobit, corresponding to the unconditional, conditional, and probability marginal effects, respectively. It is not necessary to specify margin after dprobit, dlogit2, dprobit2, and dmlogit2.
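If outreg is not installed, a rough side-by-side table can also be had from commands that ship with Stata. This is a swapped-in technique, not outreg: the sketch below uses estimates store and estimates table on the auto dataset that comes with Stata, and the names m1 and m2 are made up for illustration. It shows coefficients and t-statistics to 2 decimal places, one column per regression.

```stata
* Built-in alternative to outreg (not the outreg program itself)
* m1 and m2 are hypothetical names for the stored regressions
sysuse auto, clear
quietly regress price mpg weight
estimates store m1
quietly regress price mpg weight foreign
estimates store m2
* One column per regression; coefficients and t-stats to 2 decimal places
estimates table m1 m2, b(%9.2f) t(%9.2f) stats(N r2)
```

The table appears in the Results window and the log, so it still has to be copied out by hand; outreg's advantage is that it writes a file directly.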
SECOND, EXCEL
The other way to format a regression table is to cut-and-paste from the STATA log on the screen (not from the *.log file) using EDIT and then SAVE TABLE (NOT the regular Save; SAVE TABLE puts in tabs between columns). Paste it into a plain text file, or, perhaps, into MS-WORD directly (and convert text to table then, I guess). Save the plain text file as temp.txt, then open it in Excel, and it will be a nice spreadsheet with columns and rows that can be cut and pasted.
Note that in Excel you can also add columns of "(" and ")". Then save as a *.csv file, with comma delimiters. Edit that as plain text, to get the parentheses in the same columns as their contents, and upload into Excel again.
USING BOTH
Start with temp.txt. Then change all the commas to tabs (maybe this happens anyway if you don't use the COMMA option in OUTREG). Then make a spreadsheet file and pick TEXT as the default for Number Format. Then copy and paste the temp.txt file to the spreadsheet file. It will save with parentheses around the t-stats, instead of making them negative.
BETTER: in the temp.txt file change all the left parentheses to yyy. Then load it into Excel simply by opening temp.txt and choosing comma delimiters. Then copy into WORD or LATEX. Then change all the yyy to left parentheses.
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
X-WINDOWS
If I am trying to use the Indiana U. computer remotely to use Stata 9.1, load up an x-terminal and then type:
ssh erasmuse@libra.uits.iu.edu
or
ssh erasmuse@steel.ucs.indiana.edu
and issue the command:
xstata
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
/* This is a multiline COMMENT. No semicolons are needed. */
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
#delimit ;
* This says that the semicolon denotes the end of a line of command.
All lines must end with semicolons after this. It only works in DO files, though, not interactively;
log using marginal-effects.log, replace;
set more 1;
*This should stop the pauses;
*To keep going regardless of errors, use do myfile, nostop;
You can also try, at the start of the file, to stop showing output with pauses:
set more off, permanently
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
FIXED-EFFECTS REGRESSIONS
This creates dummies out of verbal or numeric categories.
xi: tobit yvar xvar1 xvar2 i.categoryvariable, ll(10000) ul(30000);
OR
xi: regress yvar xvar1 xvar2 i.categoryvariable, robust;
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*PLOTTING TWO VARIABLES AGAINST EACH OTHER;
graph murder black;
graph murder black, symbol([state]);
graph murder black, symbol([state]) psize(150);
*this makes the symbol size 50 percent larger than the default
graph score cashleft, symbol([rank]) xscale(0, 40000) yscale(30, 110)
When I tried this in Stata 9.2 in Dec. 2007 it didn't work anymore, though. For a simple scatterplot I had to use:
twoway scatter exshpt exshare;
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*DESCRIBING THE DISTRIBUTION OF A SINGLE VARIABLE;
histogram flunks if judge==0, discrete saving(flunks-l,replace) ;
*This is for the discrete variable flunks, if the judge variable = 0. It saves the graph as flunks-l.gph. To convert the graph to TIFF, just right-click on it in STATA and save as TIFF. ;
graph winrate, box saving (winrate-box,replace) ;
* for a box and whiskers plot;
graph winrate, histogram bin(20) saving (winrate-histogram,replace) ;
kdensity winrate, saving (winrate-kdensity,replace) ;
tab var1
This just lists all the values.
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*MISSING VALUES;
* For missing values in entries, use a period, . Other things cause trouble. Note that a missing value can be treated as being a very large number, for variable generation purposes.
Watch out for things like replace var1 =4 if var2>6, because if var2 is missing, that command will make var1=4;
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
INSHEET; This is to get spreadsheet data into STATA:
* Save the spreadsheet as a tab file. Then say
insheet using myname.txt;
OUTSHEET: This is to get spreadsheet data out of STATA.
outsheet using myname.csv, comma replace
That uses commas to separate the variables.
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
CONVERTING STATA 9 DATASET TO STATA 7.
Stata has set things up so later versions are incompatible with earlier ones. Stata 9 on Steel will read my co-authors' files, but not my Stata 7 on my PC. Here's how to solve that:
Upload the file smith.dta to Stata 9.
use smith.dta
outsheet using smith.txt
Download the smith.txt to Stata 7. On the PC:
set mem 50m
*for a big dataset
insheet using smith.txt
save smith1.dta
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* IF STATEMENTS AND STRING VARIABLES;
replace appointed =1 if state =="NJ" ;
replace appointed =1 if state =="AK" ;
LOGICAL OPERATORS:
"not equal to " is ~=
Use & instead of "and" in logical statements.
"or" is |
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* SUMMARY STATISTICS;
Do not use the SUMMARIZE or SUMMARIZE, DETAIL command. Instead, use tabstat, like this:
tabstat budget felclosed felconv, stats(min p25 median mean p75 max) f(%7.2f) columns(statistics) ;
INSPECT is a good command too. It gives you little histograms for each variable.
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* HOW TO HAVE ROBUST STANDARD ERRORS IN TOBIT IN STATA;
The documentation at http://www.stata.com/support/faqs/stat/tobit.html is cryptic and seems to be missing some lines in the middle, so I am writing up these instructions. I still don't understand what is going on, but this at least makes the command work.
Suppose you have regressed winrate using tobit on prosrate budpros term pop.
You used tobit because winrate is censored at 0 and 100-- you have the observations at the extremes, but the values cannot be less than 0 or greater than 100. You used this command:
tobit winrate prosrate budpros term pop, ll(0) ul(100) ;
Another way to get exactly the same results is to use INTREG, the interval regression command. This is designed for when some of your data is intervals, as when for example you know that somebody's income is between 1 and 3 million, but not exactly where in between. Here, we use it differently.
First, define two new variables winrate1 and winrate2. Each observation is in an interval [winrate1, winrate2]. Winrate1 will be winrate, or will be missing (representing negative infinity) if winrate takes its lower bound of 0. Winrate2 will be winrate, or will be missing (representing infinity) if winrate takes its upper bound of 100. Here are the Stata commands to generate those variables:
gen winrate1=winrate;
replace winrate1 =. if winrate <= 0;
gen winrate2=winrate;
replace winrate2 =. if winrate >= 100;
Suppose our data was like this: (12, 0, 50, 100, 23)
Then winrate1 would be (12, ., 50, 100, 23)
Then winrate2 would be (12, 0, 50, ., 23)
The intervals in the regression would be ([12,12], [-infinity, 0], [50,50], [100, infinity], [23,23]).
Then use the INTREG command. Its first two variables, winrate1 and winrate2, are the bounds that make up the intervals for each observation of the dependent variable, and the rest are independent variables.
intreg winrate1 winrate2 prosrate budpros term pop;
The regression above should give exactly the same results as the tobit command specified earlier. It doesn't quite, for me, but I put that down to different maximization algorithms for the two commands.
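The tobit-versus-intreg comparison can be tried on made-up data before using it on the real dataset. This is only a sketch: the variable names x, ystar, y, y1 and y2, the seed, and the coefficients are all invented for illustration, and rnormal() needs a reasonably recent Stata (it works in Stata 13).

```stata
* Simulated data, for illustration only: y is censored at 0 and 100
clear
set seed 12345
set obs 500
generate x = rnormal()
generate ystar = 50 + 40*x + rnormal(0, 20)
generate y = min(max(ystar, 0), 100)
* Two-limit tobit
tobit y x, ll(0) ul(100)
* Interval bounds: missing lower bound = left-censored, missing upper bound = right-censored
generate y1 = y
replace y1 = . if y <= 0
generate y2 = y
replace y2 = . if y >= 100
* The same model as an interval regression
intreg y1 y2 x
```

Comparing the two coefficient tables by eye shows how close the estimates come on a given dataset.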
Now, though, we can have robust standard errors as an option to correct for heteroskedasticity:
intreg winrate1 winrate2 prosrate budpros term pop, robust;
This last command has the exact same coefficient estimates, but different, and consistent, standard errors.
What if you only left-censor, so winrate can't be less than 0, but it can be greater than 100? Then generate the bounds like this:
gen winrate1=winrate;
replace winrate1 =. if winrate <= 0;
gen winrate2=winrate;
What if you only right-censor, so winrate can be less than 0, but it can't be greater than 100? Then generate the bounds like this:
gen winrate1=winrate;
gen winrate2=winrate;
replace winrate2 =. if winrate >= 100;
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*MARGINAL EFFECTS IN TOBIT;
This might be different in Stata 9. See if dtobit, dlogit2, dprobit work. They report marginal effects instead of coefficients. Use the command
dprobit, at(xxx)
if you want it calculated at xxx, but xxx has to be a matrix, and I don't know how to get a median matrix. Also, dtobit has to be installed from the web, specially. Otherwise, use the mfx command below.
mfx is the marginal effects command. Use a dot if you are only left-censored or right-censored. predict(e(0, 100)) will give the marginal effect on the expected value of the dependent variable conditional on being uncensored, E(y|a