Computing for Research I
Spring 2012
Stata Graphics
February 16
Primary Instructor: Elizabeth Garrett-Mayer

Basic syntax for commands
• prefix: command varlist, options
• Examples:
– regress y x, level(90)
– by race: sum y x, detail
– ttest y, by(x) unequal

Stata Graphics
• Maybe we can just end class now!
• Check out these links:
– http://www.ats.ucla.edu/stat/stata/library/GraphExamples/default.htm
– http://www.ats.ucla.edu/stat/stata/topics/graphics.htm
– http://data.princeton.edu/stata/graphics.html
– http://www.stata.com/capabilities/graphics.html

Basic univariate displays
• Boxplots
• Stem and leaf
• Histograms
• Density plots

Ceramide Data
• Let's look at the ceramide markers
• What are their distributions?
• Are there outliers?
• Should we consider taking logs, or using % change?
Results of a phase II trial of gemcitabine plus doxorubicin in patients with recurrent head and neck cancers: serum C18-ceramide as a novel biomarker for monitoring response. Saddoughi SA, Garrett-Mayer E, Chaudhary U, O'Brien PE, Afrin LB, Day TA, Gillespie MB, Sharma AK, Wilhoit CS, Bostick R, Senkal CE, Hannun YA, Bielawski J, Simon GR, Shirai K, Ogretmen B. Clin Cancer Res. 2011 Sep 15;17(18):6097-105. Epub 2011 Jul 26.

Histogram
• hist c18
[Figure: histogram of C18 ceramide (x-axis 0 to 60), Density on the y-axis]

Let's make it prettier
* prettier histograms
hist c18, freq xaxis(1 2) ylabel(0(2)24) xlabel(20 "Twenty" 40 "Forty")
hist c18, title("Histogram of C18 Ceramide") subtitle("PI: K. Shirai")
hist c18, ytitle("number of patients") freq yline(0(10)20)
hist c18, xaxis(1 2) xlabel(19.6 "mean" 11.9 "median", axis(2) grid)
Finding help on these can sometimes be tricky! e.g. help axis_choice_options
[Figures: the four histograms of C18 ceramide produced by the commands above: frequency scale with x labels "Twenty" and "Forty"; title "Histogram of C18 Ceramide" with subtitle "PI: K. Shirai"; frequency scale with horizontal reference lines; second x-axis marking the mean and median]

Boxplots
• graph box c18
[Figure: boxplot of C18 ceramide (axis 0 to 80)]

Boxplots
graph box c18, by(cycle)
graph box c18, over(cycle)
tab cycle
graph box c18 if cycle<7, over(cycle)
sort patient cycle
merge m:1 patient using "Ptdata.GemDox.dta"
graph box c18 if cycle<7, over(cycle) over(gender)
graph hbox c18, over(initial) capsize(5)
[Figures: boxplots of C18 ceramide in separate panels by cycle ("Graphs by Cycle"); over cycle in one panel; over cycle (1, 3, 5) within gender (f, m); horizontal boxplots over initial (CR, PD, PR, SD)]

graph hbox c18, over(initial) capsize(5)
graph hbox c18, over(initial) medtype(marker) medmarker(msymbol(+) msize(large))
graph hbox c18, over(initial) ytitle("C18")

Labels
• Sometimes xlabels cannot be applied (e.g. boxplots)
• Instead, you need to label the values of the grouping variable
• Example: cycle for boxplots
– label define cycle 1 "cycle 1" 3 "cycle 3" 5 "cycle 5" 7 "cycle 7"
– label values cycle cycle
– graph box c18 if cycle<7, over(cycle)
• (Hint: use this on the homework!)

Stem and Leaf
stem c18

Stem-and-leaf plot for c18ceramide (C18 ceramide)
c18ceramide rounded to nearest multiple of .1
plot in units of .1
[Display: stems 0** through 6**, two rows per stem (14 rows). Leaves of the non-empty rows, from top to bottom:]
42,43,44,46
57,57,67,81,89,90,96,98,99,99
01,06,08,08,14,15,19,20,35,44
62
03,15,16,18,19,19,22
82
17
23,49
58,68,68
37
86

Dotplot
• Excellent way to show data across groups when you have a relatively small dataset
• dotplot y, over(group)
dotplot c18, over(cycle)
dotplot c18, over(gender)
dotplot c18, over(gender) nogroup
dotplot c18, over(gender) nogroup jitter(3)
dotplot c18, over(gender) nogroup median center
[Figure: "Dotplot, by gender": C18 ceramide (0 to 80) for gender f and m]

Scatterplots
• Two way graph
• Syntax:
– graph twoway scatter y1 y2 x (the last variable listed goes on the x-axis)
– graph twoway scatter y x
• Example:
– graph twoway scatter c18 totalceramide
[Figure: scatterplot of C18 ceramide (0 to 80) against total ceramide levels (400 to 1200)]

Regression example
• Scatterplot
• Residual plots
• Leverage
• Fitted line with raw data

Code
graph twoway scatter c18 totalcer
regress c18 totalcer
* residual plot
* (residual vs. fitted)
rvfplot
* the long way
* 1. generate a new variable from the regression, residuals
predict resid, res
* 2. generate a new variable from the regression, fitted values
predict fit
scatter resid fit, yline(0)
* leverage vs. residual plot
lvr2plot
* take transform of C18?
gladder c18
boxcox c18
* generate new variable
gen logc18=log(c18)
scatter logc18 totalcer
scatter logc18 totalcer, mlabel(gender)
scatter logc18 totalcer, mlabel(gender) s(i)
scatter logc18 totalcer, s(Oh)
* redo regression
regress logc18 totalcer
rvfplot, yline(0)
lvr2plot
predict logfit
* make plot of fitted model and raw data
scatter logfit logc18 totalcer
scatter logfit logc18 totalcer, s(i o) c(l .)
graph twoway scatter logfit totalcer, s(i) c(l) || scatter logc18 totalcer, s(o) c(.)
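For anyone without the ceramide file, the regression workflow above can be tried on data that ships with Stata. The following is only a sketch under stated assumptions: the auto dataset stands in for the ceramide data, price and weight are placeholders for c18 and totalcer, and the variable names logprice, fit_lp and res_lp are made up for illustration.

```stata
* Sketch only: auto.dta stands in for the ceramide data;
* price and weight are placeholders for c18 and totalcer.
sysuse auto, clear

* log-transform the outcome and fit the regression
generate logprice = log(price)
regress logprice weight

* residual-vs-fitted plot, the short way
rvfplot, yline(0)

* the same plot built by hand from predicted quantities
predict fit_lp, xb
predict res_lp, residuals
scatter res_lp fit_lp, yline(0)

* fitted line over the raw data; sort orders the line by weight
twoway (scatter logprice weight) (line fit_lp weight, sort)
```

Writing the overlay with parentheses is equivalent to the || form used in the last command of the slide; either separates the plots that share one set of axes.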
The next graph to create

Fancier way to put regression lines
infile str14 country setting effort change ///
using http://data.princeton.edu/wws509/datasets/effort.raw
graph twoway scatter change setting
graph twoway (scatter change setting) (lfit change setting)
graph twoway (scatter change setting) (qfit change setting)
graph twoway (scatter change setting) (lfitci change setting)
• scatter makes a scatterplot of the two variables
• lfit plots the regression line of y on x
• qfit plots a fitted quadratic model of y on x
• lfitci plots the line AND a confidence interval!

Fancier way to put regression lines
[Figures: "Plot using qfit" and "Plot using lfitci": change against setting (40 to 100), with fitted values and, for lfitci, the 95% CI]

graph twoway (lfitci change setting) (scatter change setting, mlabel(country))
[Figure: change against setting with the fitted line, 95% CI, and each point labeled by country]

• One slight problem with the labels is the overlap of Costa Rica and Trinidad Tobago (and to a lesser extent Panama and Nicaragua).
• We can solve this problem by specifying the position of the label relative to the marker using a 12-hour clock (so 12 is above, 3 is to the right, 6 is below and 9 is to the left) and the mlabv() option.
• We create a variable to hold the position, set by default to 3 o'clock, and then move Costa Rica to 9 o'clock and Trinidad Tobago to just a bit above that at 11 o'clock (we can also move Nicaragua and Panama up a bit, say to 2 o'clock).
[Figure: change against setting with the fitted line, 95% CI, and country labels repositioned so that they no longer overlap]
gen pos=3
replace pos = 11 if country == "TrinidadTobago"
replace pos = 9 if country == "CostaRica"
replace pos = 2 if country == "Panama" | country == "Nicaragua"
graph twoway (lfitci change setting) ///
(scatter change setting, mlabel(country) mlabv(pos) )

Legends
graph twoway (lfitci change setting) ///
(scatter change setting, mlabel(country) mlabv(pos) ) ///
, title("Fertility Decline by Social Setting") ///
ytitle("Fertility Decline") ///
legend(ring(0) pos(5) order(2 "linear fit" 1 "95% CI"))

graph twoway (lfitci change setting) ///
(scatter change setting, mlabel(country) mlabv(pos) ) ///
, title("Fertility Decline by Social Setting") ///
ytitle("Fertility Decline") ///
legend(off)
[Figures: two versions of "Fertility Decline by Social Setting" (Fertility Decline against setting): one with the legend inside the plot region ("linear fit", "95% CI"), one with the legend turned off]

Spaghetti plots
Command available from UCLA: spagplot
* spaghetti plots
clear
insheet using "I:\MUSC Oncology\Shirai, Keisuke\October2010\ceramide.csv"
findit spagplot
spagplot c18 cycle, id(patient)
spagplot c18 cycle, id(patient) nofit
* remove patients who only have cycle=1
sort patient cycle
by patient: gen visit=_n
egen maxvis=max(visit), by(patient)
spagplot c18 cycle if maxvis>1, id(patient) nofit
* or, use c(L)
graph twoway scatter c18 cycle if maxvis>1, c(L)
help connectstyle

Other neat stuff
• graph matrix
• saving graphs: click and save as desired format
• saving and combining (see princeton site, section 3.3)
– http://data.princeton.edu/stata/graphics.html
• See GraphExamples on ucla site:
– http://www.ats.ucla.edu/stat/stata/library/GraphExamples/
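The items above (graph matrix, saving, and combining) can also be done from the command line rather than by clicking. What follows is a minimal sketch, not the method from the linked pages: it assumes the auto dataset that ships with Stata, and the graph names (hist_price, box_price, combined) and file names are hypothetical.

```stata
* Sketch only: auto.dta and all graph/file names are placeholders.
sysuse auto, clear

* scatterplot matrix of several variables (lower triangle only)
graph matrix price mpg weight, half

* draw two graphs and keep each in memory under a name
histogram price, frequency name(hist_price, replace)
graph box price, over(foreign) name(box_price, replace)

* put the named graphs side by side in one figure
graph combine hist_price box_price, cols(2) name(combined, replace)

* save in Stata's own .gph format, then export to PNG
graph save combined "combined.gph", replace
graph export "combined.png", name(combined) replace
```

Putting these lines in a do-file makes the figures reproducible, which the point-and-click route is not. The export formats that are available can depend on the platform Stata is running on.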