Teaching with Stata
Peter A. Lachenbruch & Alan C. Acock
Oregon State University
peter.lachenbruch@oregonstate.edu
alan.acock@oregonstate.edu
WCSUG Presentation

First Course Requirement: Data Entry
• I want a first course to teach the things I want students to be able to do:
  – Enter and edit data; must be a “want to know” topic
  – Students can do a small survey to get data on topics of interest to them.
    • Voter poll
    • Attitudes toward diversity issues on campus
    • Beliefs about regulating the internet
  – Learn how to create a codebook; use codebook and codebook, compact
• Where possible use “real” data

First Course Requirement: Data Management
• Balance statistical content with proper data management content (a hard decision)
• Storing the original dataset and creating a working dataset
• Keeping a record of every data modification they make, using a do-file
  – Menu system is an aid
  – Do-files are the requirement
• Missing values: distinguish types
• Variable names, labels, and value labels

First Course Requirements: Data Management
• Transformations – log, square root, exp
• Logical editing – beware of logical transformations when missing values are present (gen y = x < 10 sets y to 0 when x is “.”, because Stata treats missing as larger than any number)
• Appending – Append student generated datasets
• Merging – Merging two waves of data

First Course Requirements: Data Management
• Constructing Measures
  – When to use egen newvar = rowtotal(var1 var2 var3)
  – When to use egen newvar = rowmean(var1 var2 var3)
  – When to use the misschk command, and what it does
• Suppose the variable category is 0 or 1
• If there are missing values in category, there is a difference between
  – gen y = 1 if category
  – gen y = 1 if (category==1)
  – gen y = 1 if (category>0)
  The first and third will give scores of 1 for missing values.
  The second will leave y missing (.) for missing values, not 0 – BEWARE

First Course Requirements: Data Management
• edit command, input, insheet, infile (csv files)
• gen newvar = ln(oldvar)
• Rarely use replace oldvar = sqrt(oldvar)
  – only when correcting an error
  – don’t replace data
• merge ptid assessment using file, update (the data need to be sorted)

First Course Requirement (2)
  – Data presentation, numerical summary measures – summarize, detail; list; browse; edit; describe; codebook; codebook, compact
  – Graphic presentation: bar chart, histogram, and box plot seem the minimum
  – Probability computations – binomial, binomialtail, chi2, chi2tail, F, Ftail, normal – and use of the inverse functions for these.

Examples
• summarize sp, detail; list sp; describe s*; codebook s*
• display binomial(10,3,0.1) for cumulative or display Binomial(10,3,.1) for reverse cumulative; note that disp 1-binomial(10,2,.1) gives the same result as the latter (as does binomialtail(10,3,.1))
• display normal(1.2)
• gen y =

First Course Requirement (3)
• Confidence intervals
  – Binomial – ci: ci variable, binomial
  – Normal – ci: ci variable
  – Poisson – ci: ci variable, poisson
• Percentiles
  – summarize, detail
  – centile price, c(10(10)90)

Examples
• cii 20 4
  – cii 20 4, agresti
• Sometimes we want to use the Agresti formulation. The exact interval is usually preferable
• ci varname, level(99)
• summarize weakness, detail
  – Can use su weakn, d (i.e.
abbreviate commands, options, and variables)
• centile weakness, c(20,40,60,80)
  – Or centile weakness, c(20(20)80)

First Course Requirements (4)
• Hypothesis Testing:
  – Normal r.v.s
    • One sample (including paired data)
    • Two sample – ttest
    • K samples – ANOVA
  – Binomial variables
    • One sample – proportion
    • Two samples – tabulate, chi2

Examples
• ttest sp = 120 [one-sample]
• ttest spmen = spfem [paired]
• ttest spmen = spfem, unpaired unequal welch
• ttest sp, by(sex) [unequal welch etc.]
• Also immediate form – see help
• anova sp agegrp

Examples
• bitest success = 0.8 [one sample binomial]
• tabulate success group, chi2 row col
• prtest success, by(group) [two sample binomial]

First Course Requirements (5)
• Hypothesis Testing (cont.)
  – Power considerations – sampsi (or a spreadsheet – a nice exercise for some of the better students)
  – Nonparametric methods – signtest, signrank, ranksum
• Contingency tables – tabulate, epitab

Examples
• sampsi 132.86 127.44, p(0.8) r(2) sd1(15.34) sd2(18.23)
• ranksum sp, by(survive)
• signrank before = after
• When should we supplement Stata with other software, such as G*Power 3 (free and more flexible than sampsi), PASS, or nQuery Advisor?
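The nonparametric and power examples above can be tried on data that ship with Stata. What follows is a minimal sketch, not the presenters' own classroom material: the bpwide dataset and its variables (bp_before, bp_after, sex) stand in for the sp, survive, before, and after variables named on the slide, and the last command assumes Stata 13 or later, where power twomeans is available as a successor to sampsi.

```stata
* Sketch only: bpwide is a fictional blood-pressure dataset shipped with Stata;
* it stands in for the variables named on the slide.
sysuse bpwide, clear

* Wilcoxon rank-sum (Mann-Whitney) test: two independent groups
ranksum bp_before, by(sex)

* Wilcoxon matched-pairs signed-rank test: paired measurements
signrank bp_before = bp_after

* Sign test on the same pairs, for comparison with signrank
signtest bp_before = bp_after

* Sample sizes for 80% power using the means and SDs from the slide.
* Assumption: Stata 13+; nratio(2) plays the role of sampsi's r(2), i.e. n2/n1 = 2.
power twomeans 132.86 127.44, sd1(15.34) sd2(18.23) power(0.8) nratio(2)
```

The sample sizes reported by power twomeans may differ slightly from those of sampsi, since sampsi relies on a normal approximation while power twomeans bases the calculation on a t test. Comparing the two is itself a reasonable exercise for students.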
First Course Requirements (6)
• Simple linear regression – regress, rvfplot, other diagnostics
• Correlation – corr, spearman, ktau
  – I tend not to use corr because of the sensitivity to the normality assumption for tests and confidence intervals
• Only pwcorr, not corr, provides a test of significance

Examples
• regress mpg weight
• rvfplot
• Stata’s “type a little, get a little” is very different from other packages
• correlate mpg weight or pwcorr mpg weight (especially when you have more than 2 variables – can specify sig and obs; note that these only work with pwcorr)
• spearman mpg weight – it would be nice to have Stata produce a Spearman correlation matrix

Examples
• It’s easy to use permutation tests

  . permute anyhcq t=r(t): ttest ald7 if adult==1 & assnum==1, by(anyhcq)
  (running ttest on estimation sample)

  Monte Carlo permutation results                 Number of obs     =         97

        command:  ttest ald7, by(anyhcq)
              t:  r(t)
    permute var:  anyhcq

  ------------------------------------------------------------------------------
  T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
  -------------+----------------------------------------------------------------
             t |   1.648305      13     100  0.1300  0.0336  .071073   .2120407
  ------------------------------------------------------------------------------
  Note: confidence interval is with respect to p=c/n.
  Note: c = #{|T| >= |T(obs)|}

• One can do similar things with the bootstrap
• These are easy to use and intuitive for students

Use of Stata in the Classroom
• Use Stata sparingly
  – It’s not easy to follow commands typed or used from menus – students will get confused
  – Have handouts of what you do – make spacing large enough that students can annotate – even if only to write nasty things about the instructor
  – Balancing coverage of Stata, e.g.
data management, with coverage of statistics is a constant issue
  – Remember – it’s a course in statistics, not in Stata

Data Sets
• Place data sets on a LAN or common drive, or make them available for copying to flash drive or CD
• Use real data
  – Not too many variables
  – May have missing values – but these should not affect the main analyses – unless you want to demonstrate the problems with missing values

In the Classroom
• Using CD rather than flash drive is better(?)
  – Many desktops have the USB port located inconveniently (darn you Dell!)
  – Sometimes newer PCs have a USB port on the monitor, and laptops usually have an easy slot for the flash drive
  – Light level in the room should allow students to read easily
  – Days of dim projectors are over

In the Classroom (2)
• Enlarge the Stata font by using the right mouse button
  – I have found that 14 point is pretty good
  – Be careful about wraparound of output – if needed, reduce point size temporarily
  – Don’t ever use a red font on a blue background – See what I mean? It’s more difficult to read
• Show how to move and fix windows

In the Classroom (3)
• Optimizing visibility with projector
  – Use a rich color background – Edit > Preferences > General Preferences. The Blue background option is good, but it relies on red for errors and green for standard text, and doesn’t bold fonts.
  – Custom may be better because you can make fonts bold and pick colors that do not disadvantage students who are colorblind.

Virtual Lab
• A server supporting 30 simultaneous sessions of Stata is remarkably inexpensive.
• A department can require students to have laptops or provide a cart with enough laptops
• Because the laptops are really “dumb” terminals connected to the server, they can be cheap and need not be updated very often
• Any room becomes a lab
• Students should have 24/7 access to the server

Handouts and Data Sets
• Have handouts of your lecture notes
• Have handouts of your data analysis demonstrations
  – Include commands as well as output!
• Data sets
  – Online – LAN or CD or floppy disk – lots of laptops don’t have floppy drives any more; flash drives are inexpensive
• Include
  – Student generated datasets
  – Datasets with large Ns and relatively few variables

Emphasis in Course
• Lectures devoted to statistics
• Labs devoted to learning Stata, working on homework, and discussion
• Proper printing of output
  – Don’t split output between two pages if possible (at least, find a good break point)
  – Always use a monospaced font (such as Courier New)

Some Final Issues
• Multiple testing can distort inference (e.g. doing 100 tests at the 5% level will almost surely produce some significant results – but they may be meaningless)
  – Worry about this
• Controlling the digits in the output: use outreg, estout, esttab

The End