September 16, 2017
Eric Rasmusen, erasmuse@indiana.edu
This file is now at http://www.rasmusen.org/a/stata-rasmusen.txt
SOME STATA COMMANDS I USE
This file is for notes on STATA commands that I use. I use Stata 13. I can go to STATMATH for quick email answers, since I am at IU.
___________________________
STATA 13 has a do-file editor. It does not reload a do-file that was changed in another program, so don't make changes outside in TextPad and then run the do-file again from the Stata editor.
--------------------------
RECODE
To replace missing values, if 9 denotes missing value for all variables, use
recode * (9 =.)
--------------------------
MAPS
* http://www.stata.com/support/faqs/graphics/spmap-and-maps/
*For installing Stata packages at IU: https://kb.iu.edu/d/azwf
*Do these next three commands just one time-- then they are installed;
net set ado "J:\Burakumin-Ramseyer\data-and-regressions\ado-files";
*ssc install spmap;
* ssc install shp2dta;
* ssc install mif2dta;
*you may have to install each into a separate, new, directory and then move them to ado-files.;
*You don't need the created subdirectories, and they may even block it from working;
adopath + "J:\Burakumin-Ramseyer\data-and-regressions\ado-files"
*Find the mapshape files for your country, two of them and put them in the folder. ;
shp2dta using jp_grid_ken_pgn, database(usdb) coordinates(uscoord) genid(id);
*Note: the FAQ file has a bad typo---gen(id) instead of genid(id). I told STATA about it;
use usdb, clear;
describe;
--------------------------
IUANYWARE
Installing MODULE PACKAGES needs special care, since they have to be installed locally. Suppose you are already in your working directory, e.g. you have already typed the command
cd "J:\spectrum-liberalism-index\CCES2012data"
Then to install the gvselect module, use this command:
net set ado "stata-packages"
ssc install gvselect
adopath + "stata-packages"
For a new module, the best source for info on the command is a Stata Journal article, if that is available. Google Scholar has them.
--------------------------
regress yvar x1 x2 x3
predict fittedvalues1, xb
*That gives the PREDICTED VALUES. The xb is optional; default.
predict resvar1, residuals
*That gives the residuals.
--------------------------
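local myvars v1 v2 v3 v4 v5
display "`myvars'"
* This creates a local macro called myvars that holds a list of variables. The display line prints the list; note that a macro is referenced with an opening backquote ` and a closing apostrophe '. The variables cannot be on separate lines, even if there is a semicolon ; to end.
Then I can run, regress spectrum `myvars', instead of typing them all out. This is a macro. Because it is LOCAL, it exists only while the do-file (or the selected block of lines) that defines it is running, and it does not survive after that. You have to define it in each do-file that uses it.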
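As a sketch of how the macro is used, here is the same idea on Stata's built-in auto dataset. The dataset and the variables mpg, weight, length, and displacement are stand-ins, not from my data. In a do-file, /// at the end of a line continues the command on the next line, which is one way to split a long list.

```stata
* Assumes the default (carriage return) delimiter.
* Run all of these lines together, since the local disappears afterwards.
sysuse auto, clear
local myvars weight length ///
    displacement
* Print the contents of the macro.
display "`myvars'"
* Use the macro in place of the typed-out variable list.
regress mpg `myvars'
```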
--------------------------
SEEING DATA VALUES. list var1 var2 in 1/10 will list the first 10 values of those two variables.
--------------------------
findit mdesc
*This looks on the web for the mdesc add-on.
mdesc
*This shows the number of missing values for each variable in the dataset.
--------------------------
SUPPRESSING TERMINAL OUTPUT: quietly and noisily turn output off and on. set output inform is useless-- it does NOT keep good terminal output. Put quietly in front of each command to be kept quiet, e.g.
quietly regress yvar x1 x2 x3;
To keep several commands quiet for a while, enclose them in a quietly { } block; noisily in front of a command inside the block turns its output back on.
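Here is a minimal sketch of a quietly block, again on the built-in auto dataset (the variable names and the new variable fitted1 are only for illustration). Everything inside the braces is silent except the one line marked noisily.

```stata
* Assumes the default (carriage return) delimiter.
sysuse auto, clear
quietly {
    * Neither the regression table nor the predict message is shown.
    regress mpg weight length
    predict fitted1, xb
    * noisily switches output back on for this one command.
    noisily display "R-squared: " e(r2)
}
```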
--------------------------
ds, alpha lists all variables in alphabetical order, which is what the alpha option does.
save foo,replace save the data in foo.dta, replacing whatever's there.
saveold autoold save the data from STATA 13 for STATA 11 and 12 to read.
_all is a term to mean all variables.
export excel using myfile.xlsx, firstrow(variables) replace
export delimited using myfile.csv, replace *no firstrow command needed
capture log close *At the start of a do-file to close open logs.
------------------------------------------
BEST SUBSET REGRESSION
gvselect <term> ximmigpatrol ximmigpolice ximmigbusi , nmodels(1):
regress conservative7 <term> ;
*What this does is to select the best subsets of size 1, 2, 3, etc. of the variables ximmigpatrol ximmigpolice ximmigbusi when conservative7 is regressed on them. <term> is meant literally; do not insert anything there.;
*But vselect is better for linear regression;
------------------------------------------
CATEGORICAL TO DUMMY (FACTOR, INDICATOR) VARIABLES
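xi i.RACE
This will create _IRACE_2, _IRACE_3, etc., a dummy for each value of RACE except the first, which is left out as the base category.
Or, use
regress POLVIEWS i.RACE
which will use dummies for the values of RACE (again leaving out a base category) in the regression, without adding new variables to the dataset.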
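A small sketch of both approaches on the built-in auto dataset, where rep78 (a 1 to 5 repair rating) stands in for RACE. The last line uses the ib#. prefix to pick a different base category.

```stata
* Assumes the default (carriage return) delimiter.
sysuse auto, clear
* xi adds the dummy variables to the dataset; list what it created.
xi i.rep78
describe _I*
* Factor-variable notation: no new variables, lowest value is the base.
regress mpg weight i.rep78
* Same regression with rep78 == 3 as the base category instead.
regress mpg weight ib3.rep78
```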
-------------------------------------------------------------
DROPPING DON'T KNOW, MISSING, ETC. VALUES OF VARIABLES
foreach varname of varlist VIOLHSPS VIOLASNS {
drop if `varname' == 8
}
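Note that drop if removes the whole observation, including its values of every other variable. If only the code 8 itself should be ignored, one alternative is to turn it into Stata's missing value with recode, as in the RECODE section. This is a sketch that assumes 8 is the Don't Know code in both variables.

```stata
* Assumes VIOLHSPS and VIOLASNS are in memory and 8 means Don't Know.
recode VIOLHSPS VIOLASNS (8 = .)
* Check how many values are now missing in each variable.
count if missing(VIOLHSPS)
count if missing(VIOLASNS)
```

Regressions then skip only the observations that are missing on a variable actually used.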
--------------------------------------------------------------
AND OR ALL
correlate _all will correlate all the variables in the dataset
And is & and Or is |
________________________________
--------------------------------------
PROGRAMMING
Use the WHILE command.
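A minimal sketch of a while loop, using a local macro i as the counter (the names i and j are just for illustration). For simple counting, forvalues does the same thing with less typing.

```stata
* Assumes the default (carriage return) delimiter; run the lines together.
local i = 1
while `i' <= 5 {
    display "i = `i', i squared = " `i'^2
    * Without this line the loop would never end.
    local i = `i' + 1
}
* The same counting loop with forvalues.
forvalues j = 1/5 {
    display "j = `j'"
}
```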
---------------------------------------------------------------------
TIME SERIES DIAGRAMS
insheet using jan6a.csv
*label variable govdebtgross "Gross gov. debt"
save jan6a
scatter rgdp year, title("Real GDP: 1980-2010") yscale(range(0)) ytitle("Real GDP", orientation(horizontal) ) xtitle("") ylabel(#6,angle(horizontal))
graph export rgdp80-09.png,replace
*This makes the y scale start at 0 and go to a Stata-chosen max
*The y-scale has a horizontal title, and the 6 tick labels are horizontal too.
*There is no title on the x-axis.
*Really, given the title, the y-axis doesn't need a title.
scatter rgdp year if (year>1989 & year<2008), title("Real GDP: 1990-2007") yscale(range(0)) ytitle("Real GDP", orientation(horizontal)) xtitle("") ylabel(#6,angle(horizontal))
graph export rgdp90-07.png,replace
*Really, given the title, the y-axis doesn't need a title.
* I couldn't figure out how to make the graph axis end at 2007 instead of 2010
*xscale(range(. 2007)) does not change the length of the axis, because range() can only widen an axis, never shorten it.
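One thing that may work for the axis problem, offered as an untested sketch: the axis is drawn long enough to cover the data and the tick labels, and Stata's default "nice" labels run out to 2010. Giving the labels explicitly with xlabel() so that the last one is 2007 should keep the axis from extending further. The label spacing and angle below are my own choices.

```stata
* Assumes the jan6a data with rgdp and year is in memory,
* and the default (carriage return) delimiter.
scatter rgdp year if (year>1989 & year<2008), title("Real GDP: 1990-2007") ///
    yscale(range(0)) ylabel(#6, angle(horizontal)) ytitle("") xtitle("") ///
    xlabel(1990(1)2007, angle(vertical))
graph export rgdp90-07.png, replace
```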
---------------------------------------------------------------------
ADDING A LABEL OR RELABEL
To add a more descriptive label, for use on graphs and such, do this:
label variable gdpc "gdp per capita growth"
No replace is needed even if there is already a label.
---------------------------------------------------------------------
INPUTTING STRING DATA
infile str24 name using temp.txt
That is the STATA command to infile a 24-letter-or-less string variable.
---------------------------------------------------------------------
SCATTERPLOTS, FIGURES, DIAGRAMS
scatter unrate year if (year>=1923 & year<=1950), yscale(range(0 25)) xlabel(#10) c(l) ;
graph save gd2.gph, replace;
graph export gd2.tif, replace;
*c(l), with the letter l, connects the points with straight lines;
scatter unrate year if (year>=1923 & year<=1950), c(l) legend(off)
|| scatter unrate1 year if (year>=1923 & year<=1950), yscale(range(0 25)) c(l)
clwidth(medthick) clcolor(black) clpattern(dot) legend(off) ;
graph save gd3.gph, replace;
graph export gd3.tif, replace;
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
DESCRIBING DATA
describe
is the command for listing variables and their labels
describe, simple
will give just the variable names, no labels
summarize _all
will give means for all variables
summarize percapincome, detail
will give medians and suchlike
tab percapincome
will give all the values of the variable percapincome
correlate var1 var2 var3
will give a correlation matrix.
------------------------------------------------------------------
TRIMMING OUTLIERS:
_pctile pop, percentile(1(1)99);
scalar p95=r(r95);
scalar p5= r(r5);
scalar p10=r(r10);
scalar p90=r(r90);
drop if pop>p95;
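Two things to watch in the commands above. First, r(r95) means "the 95th number in the percentile list", which is the 95th percentile only because the list is 1(1)99. Second, drop if pop>p95 also drops observations where pop is missing, since a missing value counts as larger than any number. Here is a sketch that trims both tails and keeps the missing values, using price from the built-in auto dataset as a stand-in for pop; the scalar names p_lo and p_hi are my own.

```stata
* Assumes the default (carriage return) delimiter.
sysuse auto, clear
* Ask for just the two percentiles needed: r(r1) is the 5th, r(r2) the 95th.
_pctile price, percentiles(5 95)
scalar p_lo = r(r1)
scalar p_hi = r(r2)
* scalar() makes sure the scalar is used even if a variable has the same name.
drop if (price < scalar(p_lo) | price > scalar(p_hi)) & !missing(price)
```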
------------------------------------------------------------------
HECKMAN TECHNIQUE
You can get it as a stata command directly, but here's how to do it to replace tobit, I think.
generate win1= 0;
replace win1 = 1 if winrate ==100;
probit win1 prosrate prosrate2 indexcrime ;
predict pred1,p;
generate pred1size=invnormal(pred1);
generate pred1density= normalden(pred1size);
generate invmills =pred1density/pred1;
drop if win1==1;
reg winrate prosrate prosrate2 indexcrime budget cpsalary invmills;
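For comparison, here is a sketch of the built-in heckman command with the twostep option, which builds the selection correction term itself so that no invmills variable is needed. I have not checked that it reproduces the hand-made steps above. The variable observed is hypothetical, defined here to be 1 when winrate is below 100, mirroring the drop if win1==1 line.

```stata
* Assumes the variables above are in memory BEFORE the drop command,
* and the default (carriage return) delimiter.
generate observed = (winrate < 100) if !missing(winrate)
* Outcome equation first; selection equation inside select().
heckman winrate prosrate prosrate2 indexcrime budget cpsalary, ///
    select(observed = prosrate prosrate2 indexcrime) twostep
```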
---------------------------------------------------------------------
STATA LIST
Site for STATA regression advice:
http://www.stata.com/statalist/
The archives for past solutions:
http://www.stata.com/statalist/archive/
---------------------------------------------------------------------
HELP: FINDIT
If I type findit rreg (lower case, since Stata commands are case-sensitive) from the command screen, I will get Web files
that are useful, including, e.g. UCLA statistics help files.
---------------------------------------------------------------------
BASIC STATA
*jkjlkjlk is a way to do comments. The * has to be the first thing on the line.
cd d:/mystuff/mydirectory
*Gets to the directory with your data
insheet using crimedata.csv
*Inputs your spreadsheet data
regress crime population prison
*Regresses crime on population and prison
xi:regress crime prison i.state i.year
*Regresses crime on prison and dummies for states and years
generate lnprison = ln(prison)
*Creates a new variable with a math formula
replace lnprison = 1000*lnprison if lnprison == 1
*Replaces the value of the variable in every observation where the condition is met
---------------------------------------------------------------------
HISTOGRAM DATA FIGURES GRAPHS FORMAT
The "scheme" for graphics called lean2 must be installed from the web.
Try Stata help for lean2 to find out how (it's easy). Then use these
commands:
set scheme lean2;
histogram flunks if judge==0, discrete freq ylabel(,nogrid)
xtitle("Flunks by Lawyers") saving(flunks-l,replace) ;
*This one has no y-axis grid lines;
histogram flunks if judge>0, discrete freq xtitle("Flunks--Lawyers")
ytitle("First line of the title" "Continues to a second line")
saving(flunks-l,replace) ;
*This one has y-axis grid lines, and a special y-title with two lines of text.
Note that it saves to the same file as the first histogram and so replaces it;
graph export flunks-l.ps, replace;
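Stata will by default make the y-axis title vertical writing, rather than
horizontal. I go to Adobe Illustrator to change that, though the option
ytitle("...", orientation(horizontal)) used under TIME SERIES DIAGRAMS above does it too.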
Use help twoway_options to find out how to use the options.
---------------------------------------------------------------------
USING WEIGHTS IN A REGRESSION
* If I want to weight observations 1 unless the judge variable equals 1, and then use .04;
generate wj = 1;
replace wj = .04 if judge==1 ;
dprobit judge utokyo ukyoto Flunks413 Post93 Post93UT Post93UK
Post93
Fl413 [pweight=wj]
----------------------------------------------------------------------
MERGING IN A NEW VARIABLE INTO A STATA DATASET
Mark tells me that this command does it, putting the new variables in
wherever the
joinby prefecture using "file name.dta"
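joinby forms every pairwise combination of observations within each prefecture and by default drops observations that have no match. If the file being added has exactly one row per prefecture, merge m:1 is an alternative that keeps the unmatched observations and reports them. This is a sketch under that one-row-per-prefecture assumption.

```stata
* Assumes the main data are in memory and that "file name.dta"
* has exactly one row per prefecture.
merge m:1 prefecture using "file name.dta"
* _merge shows which observations matched (3), or came from only one file (1 or 2).
tabulate _merge
drop _merge
```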
----------------------------------------------------------------------
FORMATTING REGRESSION OUTPUT: OUTREG AND EXCEL
Stata's standard regression output has too many decimal places and
some useless statistics. To convert one or more regressions to a
table form of the kind customary in economics, use one of two methods:
the Outreg program, or cutting and pasting to Excel.
The best way to do it is to use outreg, with output in a file like
temp.txt (with long enough lines, few carriage returns). In the
temp.txt file change all the left parentheses to yyy, so that Excel
does not read the t-stats in parentheses as negative numbers. Then load it
into Excel simply by opening temp.txt and choosing comma delimiters.
Then copy into WORD or LATEX. Then change all the yyy back to left
parentheses.
FIRST, OUTREG:
The add-on program outreg.ado is available from:
http://econpapers.repec.org/software/bocbocode/s375201.htm
Documentation is at:
http://www.kellogg.northwestern.edu/researchcomputing/docs/outreg.pdf
I think you can download and install it in one step by issuing the
command in stata
ssc install outreg
In my PC's Stata 7.0, tho, that doesn't work. Instead, first type
net from http://fmwww.bc.edu/repec/bocode/o/
and then type
net install outreg
Here is how to use the program.
After each regression command such as "regress yvar xvar x2 x3 x4"
type, for the first regression,
outreg using table1.txt, replace bdec(2)
and for every succeeding similar regression,
outreg using table1.txt, append bdec(2)
What this does is to add the latest regression results as a new
column in a table in the file table1.txt. The top row will have the
Y variable (which can differ). Coefficients and t-stats will be to 2
decimal places. The tabs won't work
properly, so the table's spacing won't be right, though.
To get proper spacing, you can also outreg into a comma-separated
file, by adding an option like this:
outreg using table1.txt, replace bdec(2) comma
Then cut and paste to MS Word. Select all the text except the notes at
the bottom and click on Table, Convert Text to Table. It will format
it nicely. Then if you cut and paste to a plain text file, the good
spacing remains.
If you use dprobit, dtobit, etc., you will get only MARGINAL EFFECTS
reported. That is usually best. Then outreg will automatically report those too.
Otherwise, try the margin option of outreg:
margin[(u|c|p)] specifies that the marginal effects rather than the
coefficient estimates are reported. It can be used after truncreg,marginal
from STB 52 or dtobit from STB 56. One of the parameters u, c, or p
is required after dtobit, corresponding to the unconditional,
conditional, and probability marginal effects, respectively. It is not
necessary to specify margin after dprobit, dlogit2, dprobit2,
and dmlogit2.
SECOND, EXCEL
The other way to format a regression table is to cut-and-paste from the
STATA log on the screen (not from the *.log file) using EDIT and then
SAVE TABLE (NOT the regular Save; SAVE TABLE puts in tabs between columns.)
Paste it into a plain text file, or, perhaps, into MS-WORD directly (and convert text to table then, I guess). Save the plain text file as temp.txt, then open it in Excel, and it will be a nice spreadsheet with columns and rows that can be cut and pasted.
Note that in Excel you can also add columns of "(" and ")". Then save as a *.csv file, with comma delimiters. Edit that as plain text, to get the parentheses in the same columns as their contents, and upload into Excel again.
USING BOTH
Start with temp.txt. Then change all the commas to tabs (maybe this happens anyway if you don't use the COMMA option in OUTREG). Then make a spreadsheet file and pick TEXT as the default for Number Format. Then copy and paste the temp.txt file to the spreadsheet file. It will save with parentheses around the t-stats, instead of making them negative.
BETTER: in the temp.txt file change all the left parentheses to yyy. Then load it into Excel simply by opening temp.txt and choosing comma delimiters. Then copy into WORD or LATEX. Then change all the yyy to left parentheses.
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
X-WINDOWS
If I am trying to use the Indiana U. computer remotely to use Stata 9.1,
load up an x-terminal and then type:
ssh erasmuse@libra.uits.iu.edu
or
ssh erasmuse@steel.ucs.indiana.edu
and issue the command:
xstata
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
/*
This is a multiline COMMENT. No semicolons are
needed.
*/
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
#delimit ;
* This says that the semicolon denotes the end of a command.
All commands must end with semicolons after this. It only works in DO files, though, not interactively;
log using marginal-effects.log, replace;
set more 1;
*This should stop the pauses;
*To keep going regardless of errors, use do myfile, nostop;
You can also try, at the start of the file, to stop showing output with pauses:
set more off, permanently
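A short sketch of switching the delimiter on and off inside one do-file, using the built-in auto dataset as a stand-in. #delimit cr goes back to the ordinary rule that a command ends at the end of the line.

```stata
* Must be run as a do-file; #delimit does not work interactively.
sysuse auto, clear
#delimit ;
* From here on a command runs until the next semicolon ;
regress mpg weight
    length displacement ;
summarize mpg ;
#delimit cr
* Back to one command per line, no semicolons needed.
regress mpg weight
```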
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
FIXED-EFFECTS REGRESSIONS
This creates dummies out of verbal or numeric categories.
xi: tobit yvar xvar1 xvar2 i.categoryvariable, ll(10000) ul(30000);
OR
xi: regress yvar xvar1 xvar2 i.categoryvariable, robust;
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*PLOTTING TWO VARIABLES AGAINST EACH OTHER;
graph murder black;
graph murder black, symbol([state]);
graph murder black, symbol([state]) psize(150);
*this makes the symbol size 50 percent larger than the default;
graph score cashleft, symbol([rank]) xscale(0, 40000) yscale(30, 110);
When I tried this in Stata 9.2 in dec. 2007 it didn't work anymore,
though. For a simple scatterplot I had to use:
twoway scatter exshpt exshare;
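In the newer graph syntax, the old symbol([state]) idea of plotting a label instead of a dot can be sketched with marker labels, as below. This assumes the same variables as the old commands; the choice of mlabsize(large) as the stand-in for psize(150) is mine. I believe the command graph7 also still runs the old-style syntax in newer Statas.

```stata
* Assumes murder, black, state, score, cashleft, and rank are in memory,
* and the default (carriage return) delimiter.
* Hide the marker and center the label where the marker would be.
twoway scatter murder black, msymbol(none) mlabel(state) mlabposition(0)
* Same plot with larger labels.
twoway scatter murder black, msymbol(none) mlabel(state) mlabposition(0) mlabsize(large)
* Axis ranges now go inside range().
twoway scatter score cashleft, msymbol(none) mlabel(rank) mlabposition(0) ///
    xscale(range(0 40000)) yscale(range(30 110))
```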
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*DESCRIBING THE DISTRIBUTION OF A SINGLE VARIABLE;
histogram flunks if judge==0, discrete saving(flunks-l,replace) ;
*This is for the discrete variable flunks, if the judge variable=0. It saves the
graph as flunks-l.gph. To convert the graph to TIFF, just right-click on it in
STATA and save as TIFF. ;
graph winrate, box saving (winrate-box,replace) ;
* for a box and whiskers plot;
graph winrate, histogram bin(20) saving (winrate-histogram,replace) ;
kdensity winrate, saving (winrate-kdensity,replace) ;
tab var1
This just lists all the values.
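The two graph winrate commands above are in the old syntax, like the two-variable graph commands in the previous section. A sketch of what I take to be the newer equivalents:

```stata
* Assumes winrate is in memory and the default (carriage return) delimiter.
* Box and whiskers plot.
graph box winrate, saving(winrate-box, replace)
* Histogram with 20 bins.
histogram winrate, bin(20) saving(winrate-histogram, replace)
```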
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*MISSING VALUES;
* For missing values in entries, use a period, . Other things cause trouble.
Note that a missing value is treated as being larger than any number,
for variable generation purposes. Watch out for things like
replace var1 =4 if var2>6, because if var2 is missing, that command will
make var1=4;
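The usual guard is to add & !missing(var2) to the condition. Here is a sketch on the built-in auto dataset, where rep78 has some missing values and plays the role of var2; the variable flag is just for illustration.

```stata
* Assumes the default (carriage return) delimiter.
sysuse auto, clear
generate byte flag = 0
* Without the !missing() part, cars with missing rep78 would get flag = 1.
replace flag = 1 if rep78 > 4 & !missing(rep78)
* Confirm that no observation with missing rep78 was flagged.
assert flag == 0 if missing(rep78)
```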
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
INSHEET;
This is to get spreadsheet data into STATA:
* Save the spreadsheet as a tab file. Then say
insheet using myname.txt;
OUTSHEET:
This is to get spreadsheet data out of STATA.
outsheet using myname.csv, comma replace
That uses commas to separate the variables.
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
CONVERTING STATA 9 DATASET TO STATA 7.
Stata has set things up so later versions are incompatible with
earlier ones. Stata 9 on Steel will read my co-authors' files, but not
my Stata 7 on my PC. Here's how to solve that:
Upload the file smith.dta to Stata 9.
use smith.dta
outsheet using smith.txt
Download the smith.txt to Stata 7. On the PC:
set mem 50m
*The set mem line is for a big dataset
insheet using smith.txt
save smith1.dta
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* IF STATEMENTS AND STRING VARIABLES;
replace appointed =1 if state =="NJ" ;
replace appointed =1 if state =="AK" ;
LOGICAL OPERATORS:
"not equal to " is ~=
Use & instead of "and" in logical statements.
"or" is |
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* SUMMARY STATISTICS;
Do not use the summarize or summarize, detail command. Instead, use
tabstat, like this:
tabstat budget felclosed felconv, stats(min p25 median mean p75
max) f(%7.2f) columns(statistics) ;
INSPECT is a good command too. It gives you little histograms for
each variable.
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
* HOW TO HAVE ROBUST STANDARD ERRORS IN TOBIT IN STATA;
The documentation at
http://www.stata.com/support/faqs/stat/tobit.html
is cryptic and seems to be missing some lines in the middle, so I am
writing up these instructions. I still don't understand what is going
on, but this at least makes the command work.
Suppose you have regressed winrate using tobit on prosrate budpros
term pop. You used tobit because winrate is censored at 0
and 100-- you have the observations at the extremes, but the values
cannot be less than 0 or greater than 100. You used this command:
tobit winrate prosrate budpros term pop, ll(0) ul(100);
Another way to get exactly the same results is to use INTREG, the
interval regression command. This is designed for when some of your
data is intervals, as when for example you know that somebody's
income is between 1 and 3 million, but not exactly where in between.
Here, we use it differently.
First, define two new variables winrate1 and winrate2. Each
observation is in an interval [winrate1, winrate2]. Winrate1 will be
winrate, or will be missing (representing negative infinity) if
winrate takes its lower bound of 0. Winrate2 will be winrate, or
will be missing (representing infinity) if winrate takes its upper
bound of 100. Here are the Stata commands to generate those
variables:
gen winrate1=winrate;
replace winrate1 =. if winrate <= 0;
gen winrate2=winrate;
replace winrate2 =. if winrate >= 100;
Suppose our data was like this: (12, 0, 50, 100, 23)
Then winrate1 would be (12, ., 50, 100, 23)
Then winrate2 would be (12, 0, 50, ., 23)
The intervals in the regression would be ([12,12], [-infinity, 0],
[50,50], [100, infinity], [23,23]).
Then use the INTREG command. Its first two variables, winrate1 and
winrate2, are the bounds that make up the intervals for each
observation of the dependent variable, and the rest are independent
variables.
intreg winrate1 winrate2 prosrate budpros term pop;
The regression above should give exactly the same results as the
tobit command specified earlier. It doesn't quite, for me, but I put
that down to different maximization algorithms for the two commands.
Now, though, we can have robust standard errors as an option to
correct for heteroskedasticity:
intreg winrate1 winrate2 prosrate budpros term pop, robust;
This last command has the exact same coefficient estimates, but
different, and consistent, standard errors.
What if you only left-censor, so winrate can't be less than 0, but it
can be greater than 100? Then generate the bounds like this:
gen winrate1=winrate;
replace winrate1 =. if winrate <= 0;
gen winrate2=winrate;
What if you only right-censor, so winrate can be less than 0, but it
can't be greater than 100? Then generate the bounds like this:
gen winrate1=winrate;
gen winrate2=winrate;
replace winrate2 =. if winrate >= 100;
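To see the whole two-sided recipe run from start to finish, here is a sketch on simulated data. Everything in it (the seed, the sample size, the variables x, ystar, y, y1, y2, and the numbers in the formula) is made up for illustration and has nothing to do with the winrate data.

```stata
* Assumes the default (carriage return) delimiter.
clear
set seed 12345
set obs 500
generate x = rnormal()
* Latent outcome, then the observed outcome censored at 0 and 100.
generate ystar = 50 + 40*x + 20*rnormal()
generate y = min(max(ystar, 0), 100)
* Ordinary two-limit tobit for comparison.
tobit y x, ll(0) ul(100)
* Interval bounds: missing lower bound = censored from below,
* missing upper bound = censored from above.
generate y1 = y
replace y1 = . if y <= 0
generate y2 = y
replace y2 = . if y >= 100
* Same model by interval regression, then with robust standard errors.
intreg y1 y2 x
intreg y1 y2 x, robust
```

Comparing the tobit table with the first intreg table is a quick check that the bounds were built the right way round.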
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~--;
*MARGINAL EFFECTS IN TOBIT;
This might be different in Stata 9.
See if dtobit, dlogit2, dprobit work. They report marginal effects instead of
coefficients. Use the command dprobit, at(xxx) if you want it calculated at
xxx, but xxx has to be a matrix, and I don't know how to get a median matrix.
Also, dtobit has to be installed specially from the web.
Otherwise, use the mfx command below.
mfx is the marginal effects command.
Use a dot if you are only left-censored or right-censored.
predict(e(0, 100)) will give the marginal effect on the expected
value of the dependent variable conditional on being uncensored,
E(y|a