Lecture1.Intro - The Medical University of South Carolina

Computing for Research I
Spring 2013
Lecture 1: January 8
Primary Instructor:
Elizabeth Garrett-Mayer
Introduction
• Description: Students learn to use the primary statistical software packages for
data manipulation and analysis, including (but not limited to): R, R Bioconductor,
SAS, SAS macro, and Stata. Additionally, students will learn: how to use the
division's high speed cluster-computing environment, how to practice the
principles of reproducible research using Sweave in R, how to use LaTeX and
BibTeX for manuscript and presentation development, and how to create and
maintain a website. This is a three-credit course.
• Course Organization: This course is given by the entire division. Instructors will
take turns giving lectures in their areas of expertise.
• Textbooks: No textbook. Reading material (primarily found on the web) will be
provided as necessary.
• Prerequisites: Biometry 700
Evaluation
• Grading: Instructors will give short exercises to be completed and turned
in to the primary instructor by the Wednesday of the week following when
it was assigned (e.g., assignments given on Tues Feb 5 and Thurs Feb 7 are
both due on Thurs Feb 14). Each assignment will count equally towards
75% of the course grade. There will be a final project which will account
for the remaining 20% of the course grade. The remaining 5% of the
course grade will reflect class participation.
• Homeworks Policy: Homeworks are due by 5pm on the due date. All
homeworks should be emailed to the primary instructor
(garrettm@musc.edu) or turned in at lecture time. Asking for extensions
on homeworks is discouraged. However, it is expected that, on occasion,
extenuating circumstances may arise. Therefore, the policy is that each
student may request an extension on homework twice and the extension
is to be no more than 2 days. After using two extensions, no more
extensions will be granted except with a medical note.
Classroom Etiquette
• Attention to material: Laptops are permitted in class, but it is expected
that if they are used, it is to follow along with the lecture. Email and web
browsers should not be visited during class time. The instructors are
giving their time and expertise. Be respectful and give them your
attention.
• Classroom disruptions: Many of us have small children and others who
we need to be able to be in contact with during lectures. It is acceptable
to bring pagers or cell phones to class. Please be sure they are on silent
mode. If you need to leave during lecture to take a phone call, or make a
phone call, please do so. However, this should be a relatively rare
occurrence. Texting and emailing during lecture time is not acceptable.
• Violations of classroom etiquette policies will result in a 0 for
class participation.
Contact
Primary Instructor: Elizabeth Garrett-Mayer
Website: http://people.musc.edu/~elg26/teaching/statcomputing.2013/statcomputingI.2013.htm
Contact Info: Hollings Cancer Center, Rm. 118G
              garrettm@musc.edu (preferred mode of contact is email)
              792-7764
Time: Tuesdays and Thursdays, 2:00-3:30
Location: Cannon 301
Office Hours: by appointment
Teaching Assistant: Katherine Nicholas
Office Hours: The primary instructor will be available by
appointment. Katherine Nicholas will also have office hours. However, given
the nature of the course, the primary instructor (or TA) may not be
knowledgeable regarding all of the topics covered. As a result, additional help
from the lecturers may be needed to complete assignments. Be considerate
and responsible in scheduling time with course instructors and recognize that
they all have busy schedules.
Course Objectives
Upon successful completion of the course, the student will be
able to
• Import data, perform simple analyses and produce graphical
displays in Stata, SAS and R
• Create new functions or commands in each of R, Stata and
SAS
• Generate professional quality scientific manuscripts and
presentations using Latex along with statistical software
• Perform standard power and sample size calculations using
available software and simulations.
• Operate the division’s cluster computer with batch
computing
Schedule, briefly
• SAS
• Designing website
• STATA
• Data management
• R
• Sample size/power calculations
• Batch processing
• Latex + Sweave
Detailed Schedule
Date        Lecturer              Topic
Tu Jan 8    EGM                   Introduction; Overview and Principles
Th Jan 10   Katherine Nicholas    SAS: introduction
Tu Jan 15   Ramesh                SAS: IML
Th Jan 17   Katherine Nicholas    SAS: ODS
Tu Jan 22   Valerie Durkalski     SAS: proc tabulate and proc report
Th Jan 24   Nate Baker            SAS: Gplot
Tu Jan 29   Renee Martin          SAS: macros
Th Jan 31   Jordan Elm            SAS: array processing
Tu Feb 5    Sybil Prince-Nelson   Designing your own website
Th Feb 7    EGM                   STATA: introduction, “immediate” commands
Tu Feb 12   EGM                   STATA: graphical displays
Th Feb 14   EGM                   STATA: exploratory data analysis
Tu Feb 19   EGM                   STATA: regression commands
Th Feb 21   EGM                   STATA: programming and do files
Tu Feb 26   Amy Wahlquist         Data management: RedCap
Th Feb 28   EGM                   Data management principles & Excel
Detailed Schedule
Date        Lecturer                Topic
Tu Mar 5    EGM                     R: introduction to object-oriented programming
Th Mar 7    Cody Chiuzan            R: downloading packages/libraries; data input & output
Tu Mar 19   Delia Voronca           R: graphics
Th Mar 21   Georgiana Onicescu      R: basic language structure (ifelse, where, looping)
Tu Mar 26   EGM                     R: exploratory data analysis; writing commands
Th Mar 28   EGM                     R: regression commands
Tu Apr 2    Yanqui Weng             R: simulations; random number generation; sampling from distributions
Th Apr 4    Beth Wolf               R: bioconductor
Tu Apr 9    EGM                     Sample size calculation software packages
Th Apr 11   Adrian Nida             Cluster computing, etc.
Tu Apr 16   Cody Chiuzan            Latex and Bibtex: manuscript production
Th Apr 18   Emily Kistner-Griffin   Latex and Bibtex: presentations
Tu Apr 23   Betsy Hill              Reproducible Research: Sweave
Th Apr 25   Caitlyn Ellerbe         Mendeley
FINAL PROJECT DUE MAY 3
Housekeeping
• We are meeting in a regular classroom
• Bringing laptops is allowed
• Data, code, etc. needed for class will be on the
website prior to class
• For optimal interface, install packages ASAP
– R (http://cran.r-project.org/)
– Stata (DBE helpdesk request)
– SAS (DBE helpdesk request)
• Create a bookmark to the course website:
http://people.musc.edu/~elg26/teaching/statcomputing.2013/statcomputingI.2013.htm
Lecture Notes
• Every lecturer will have his/her own style
• Notes may be
– prepared ahead of time and posted
– Prepared and posted after the lecture
– Nonexistent
• Lecture notes will NOT be printed by the
instructors prior to lecture.
• If they are available and you would like a paper
copy, it is your responsibility to print them out.
Introduction
• 2013: to be a successful
biostatistician/epidemiologist, you MUST be
competent on the computer.
• Historically: students learned in labs from (older)
students
• Moving forward:
– many options for analysis and generation of results
– Efficiency in computing is essential.
– Your computer IS your lab!
Data analysis software
• In this course:
– R
– Stata
– SAS
• Many other options:
SPSS
S, Splus
Epi Info
GraphPad
JMP
Matlab
JAGS
Systat
Minitab
EGRET
BMDP
MedCalc
Mathematica
WinBugs
GLIM
….
SAS: History
• SAS was conceived by Anthony J. Barr in 1966. As a North Carolina
State University graduate student from 1962 to 1964, Barr had
created an analysis of variance modeling language. From 1966 to
1968, Barr developed the fundamental structure and language of
SAS.
• In January 1968, Barr and James Goodnight collaborated,
integrating new multiple regression and analysis of variance
routines developed by Goodnight into Barr's framework.
• By 1971, SAS was gaining popularity within the academic
community. One strength of the system was analyzing experiments
with missing data, which was useful to the pharmaceutical and
agricultural industries, among others.
• In 1976, SAS Institute, Inc. was incorporated.
• The latest version, SAS version 9.3, was released in July 2011
SAS: functioning
• SAS consists of a number of components,
which organizations separately license and
install as required.
• Licenses expire! Software cannot be used
after expiration (unless renewed)
Why (or why not) SAS?
• Most commonly used in pharma (although that may be changing!)
• FDA likes SAS
• Many jobs for MS statisticians and/or epidemiologists require SAS
expertise
• The most common language
• Becoming less the choice of academia
– Updates are less frequent than freeware
– The ‘pros’ of competitors are starting to outweigh the ‘pros’ of SAS
• Licensing costs
• Slow to add new functionality
• Lack of consistency with syntax
• Takes longer to learn than other programs that now have similar capability
Stata
• Stata is a general-purpose statistical software
package created in 1985 by StataCorp.
• Most of its users work in research, especially
in the fields of economics, sociology, political
science, biomedicine and epidemiology.
• Relatively simple to learn yet powerful
• Latest version is Stata 12 (released July 2011).
• Lots of add-ons for epi users
Why (or why not) Stata?
• Relatively inexpensive (especially as a student or single user)
• Biomedical focus: output and functions are tailored to
medical research
• Fast and big: can handle and manipulate large datasets
• Sophisticated with wide range of tools
• Easy to learn language with consistent syntax
• Graphics are not as good as other packages (although
that has improved)
• Programming (simulations, loops, etc.) is more
challenging
R: History
• R is a programming language and software environment for statistical
computing and graphics.
• The R language has become a de facto standard among statisticians for the
development of statistical software, and is widely used for statistical
software development and data analysis.
• R is an implementation of the S programming language. S was created by
John Chambers while at Bell Labs. R was created by Ross Ihaka and Robert
Gentleman, and is now developed by the R Development Core Team. R is
named partly after the first names of the first two R authors, and partly as
a play on the name of S.
• R source code is freely available under the GNU General Public License.
• The capabilities of R are extended through user-submitted packages,
which allow specialized statistical techniques, graphical devices, as well
as import/export capabilities to many external data formats.
• A core set of packages are included with the installation of R, with more
than 4000 (as of December 2012) available at the Comprehensive R
Archive Network (CRAN).
• The most recent version is R 2.15.2, released October 2012.
R: functionality
• Freeware: latest version can be installed
anywhere at anytime
• Packages (a.k.a. libraries) that are user-contributed allow additional
features/commands
• Relatively simple interface
Why (or why not) R?
• Great for programming and simulations
• Handles looping well
• Flexible language
• FREE!
• User-contributed packages included in real-time (i.e., no delay in
their availability)
• Most PhD Biostatistics programs teach their students R and
many/most academic statisticians in top programs use R.
• Interfaces nicely with other programs such as Latex (Sweave),
WinBugs, C, Emacs.
• Can be clunky for data management.
• Memory is not as good as SAS and Stata
• Quality-control on user-contributed packages not evident
Overview
• Not a question of which one.
• Question is “for my current problem, which
package makes the most sense to use?”
• Each has strengths and weaknesses
Data management
• Analysis of clean data is easy!
• The real world: you will get messy data most of
the time from your colleagues
• Data management tools will help you;
– Deal with messy data
– Set up data capture approaches for your colleagues to
minimize messiness
• Excel, RedCap and general principles of data
management for statistical analysis will be
covered
Example
Patient #   cycle #   total ceramide levels   S1P levels   C18 ceramide   S1P/C18
1           0         743.6                   197.2        9.8            20.122449
            3         625.6                   177.9        9.9            17.969697
2           0         534.8                   148.4        9              16.4888889
            3         461.6                   182.8        10.8           16.9259259
            5         527.3                   151.4        11.5           13.1652174
3           0         760.5                   214.5        12             17.875
4           0         359                     167.3        4.3            38.9069767
            3         375.9                   125.3        4.6            27.2391304
            5         475.6                   116.2        4.4            26.4090909
5           0         394.1                   163.1        5.7            28.6140351
6           0         848.7                   132.5        10.8           12.2685185
            3         1083.6                  203.9        13.5           15.1037037
7           0         684.6                   191.4        8.1            23.6296296
8           0         822.7                   219.5        8.9            24.6629213
9           0         486.3                   198          5.7            34.7368421
                      581.3                   186.8        9.6            19.4583333
                      699.6                   42.3         11.4           3.71052632
                      561.7                   130.4        6.7            19.4626866
                      754                     320.6        14.4           22.2638889
Unlabeled annotations in the sheet: CR, CR
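One quick check on a dataset like this is whether a derived column agrees with its inputs. The last column of the example appears to be S1P levels divided by C18 ceramide; the sketch below recomputes that ratio for a few rows copied from the example and flags any that disagree. The helper name `ratio_matches` and the tolerance are illustrative assumptions, and Python is used here only for illustration (the course itself works in R, Stata and SAS).

```python
# A few rows copied from the example: (S1P level, C18 ceramide, reported S1P/C18).
rows = [
    (197.2, 9.8, 20.122449),
    (148.4, 9.0, 16.4888889),
    (167.3, 4.3, 38.9069767),
    (42.3, 11.4, 3.71052632),
]

def ratio_matches(s1p, c18, reported, tol=1e-6):
    """Return True when the reported ratio equals S1P / C18 within tol.

    Hypothetical helper; tol is an assumed tolerance for rounding in the sheet.
    """
    return abs(s1p / c18 - reported) < tol

# Collect every row whose reported ratio disagrees with the recomputed one.
mismatches = [row for row in rows if not ratio_matches(*row)]
print("rows with inconsistent ratios:", mismatches)
```

Recomputing derived columns from the raw ones is a cheap way to catch copy-and-paste errors before any analysis starts.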
Latex and Sweave
• LaTeX is a document markup language and document preparation
system for the TeX typesetting program.
• The term LaTeX refers only to the language in which documents are
written, not to the editor used to write those documents. In order
to create a document in LaTeX, a .tex file must be created using
some form of text editor. (e.g. WinEdt)
• LaTeX is most widely used by mathematicians, scientists, engineers,
philosophers, lawyers, linguists, economists, researchers, and other
scholars in academia.
• LaTeX is used because of the high quality of typesetting achievable
by TeX. The typesetting system offers extensive facilities for
automating most aspects of typesetting and desktop publishing,
including numbering and cross-referencing, tables and figures, page
layout and bibliographies.
Latex and Sweave
• Sweave is a function in R that enables integration of R code into
LaTeX documents. The purpose is "to create dynamic reports, which
can be updated automatically if data or analysis change".
• The data analysis is performed at the moment of writing the report,
or more exactly, at the moment of compiling the Sweave code with
Sweave (i.e., essentially with R) and subsequently with LaTeX. This
can facilitate the creation of up-to-date reports for the author.
• Because the Sweave files, together with any external R files that
might be sourced from them and the data files, contain all the
information necessary to trace back all steps of the data analyses,
Sweave also has the potential to make research more transparent
and reproducible to others. However, this is only the case to the
extent that the author makes the data and the R and Sweave code
available.
Sample size and power
• We don’t really use textbook formulas anymore
to do simple power calculations (just like we don’t
really invert matrices by hand when we analyze data).
• There are a number of packages that quickly and
easily perform simple power calculations
• R, SAS and Stata can do some.
• But, packages like Nquery, EAST and PASS do a lot
more.
• In some non-standard settings, simulations are
required to determine power.
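As a rough illustration of power by simulation, the sketch below repeatedly generates two groups of normal data, applies a two-sided z-test, and reports the fraction of simulated trials that reject the null hypothesis. Everything specific here is an assumption made for the example: the function name `simulated_power`, a known standard deviation of 1, a mean difference of 0.5, 64 subjects per group, and the use of Python rather than the packages covered in the course.

```python
import random
from statistics import NormalDist

def simulated_power(n_per_group, delta, sigma=1.0, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power of a two-sided, two-sample z-test by simulation.

    Assumes normal outcomes with known, equal standard deviation sigma
    and a true difference in means of delta (illustrative settings only).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    se = sigma * (2 / n_per_group) ** 0.5          # SE of the difference in means
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(delta, sigma) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        if abs(diff / se) > z_crit:
            rejections += 1
    return rejections / n_sims                     # proportion of rejections

power = simulated_power(n_per_group=64, delta=0.5)
print(f"estimated power: {power:.3f}")
```

In a genuinely non-standard setting, the data-generating step and the test inside the loop would be replaced by whatever the study design calls for; the counting of rejections stays the same.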
Website development
• It is important in this day and age to ‘market’
yourself.
• It will be important for gaining recognition and
opportunities in your field and for making
your own work available.
• It isn't hard, but you do need to learn some
skills to set up and maintain your own site.
Before getting started…
• Types of files involved in statistical computing
– Data files
– Results files
– Command/batch files
– Function files
– Graphics files
– + more(?)
• TIPS:
– develop a common nomenclature for naming files and
folders
– Organize projects within folders
Organization is key!
• DO NOT overwrite old files (especially data files)
• Save with a new name
– Mousedata.xls (file sent from colleague)
– Mousedata.clean.xls (your clean version of the data)
• Use a consistent approach, but think ahead
– Naming files *.new.* is not a good idea. You may have
a new ‘new’ next week
– Numerics are good, but if you think you may need
more than 9 versions, consider how data2 and data10
would be alphabetized.
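The alphabetization point can be seen in a short sketch. The file names and the helper `versioned_name` are made up for illustration; zero-padding the version number is one common way around the problem, not necessarily the only one.

```python
# Plain alphabetical sorting compares character by character,
# so "data10" lands before "data2".
names = ["data10.csv", "data2.csv", "data1.csv"]
alphabetical = sorted(names)
print(alphabetical)

def versioned_name(stem, version, ext, width=2):
    """Build a file name with a zero-padded version number (hypothetical helper)."""
    return f"{stem}{version:0{width}d}.{ext}"

# With padding, alphabetical order and numeric order agree.
padded = sorted(versioned_name("data", v, "csv") for v in (10, 2, 1))
print(padded)
```

A width of 2 covers up to 99 versions; pick the width before the first file is saved, since renaming later defeats the purpose.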
Examples
• For each Principal Investigator I work with, I have
a folder
• Within the PI folder, for each project, I have a
folder
• For each time I get a new dataset (or work on a
new grant) for that project, I have a folder named
with month and year
• Example:
I:\MUSC Oncology\Kraft, Andrew\VelcadeTrial\May2008
I:\MUSC Oncology\Kraft, Andrew\R01 June 2007
Examples
• Within each folder of data analysis or grant
development calculations, I use the same naming
conventions for files:
– Rbatch.R: a set of R commands that implement all of
the computation or analyses
– Rfunctions.R: a set of R functions that are used by
the batch file
– I always save the original data file from the investigator
before making any changes
– I add ‘clean’ to the datafile name and save it as a .csv
before use (e.g. mousedata.clean.csv)
– My Rbatch.R files always include a line sourcing in the
data, including the folder where the data resides.
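The naming convention above can be sketched in a few lines. This is an illustration in Python rather than the instructor's R scripts; the helpers `clean_csv_name` and `data_path` are hypothetical, and the folder is borrowed from the earlier folder example.

```python
from pathlib import PureWindowsPath

def clean_csv_name(original_name):
    """Derive the cleaned file name, e.g. mousedata.xls -> mousedata.clean.csv.

    Only a new name is built; the original file is never touched.
    """
    stem = PureWindowsPath(original_name).stem
    return f"{stem}.clean.csv"

def data_path(project_folder, original_name):
    """Full path to the cleaned data file inside the dated project folder."""
    return PureWindowsPath(project_folder) / clean_csv_name(original_name)

# Spell out the folder explicitly so the batch file records where the data lives.
path = data_path(r"I:\MUSC Oncology\Kraft, Andrew\VelcadeTrial\May2008", "Mousedata.xls")
print(path)
```

Keeping the folder and file name in one place at the top of a batch file makes it easy to rerun the analysis when a new dataset arrives in a new dated folder.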
Friends in Statistical Computing
1. Google is your friend
2. ‘Help’ functions and ‘see also’ links are your
friends
3. ‘examples’ are your friends
4. Your fellow students are your friends
Friends help friends figure out statistical
computing!
Using your noggin
• Example 1:
– SPSS is not included in this curriculum.
– Can you ever use it? YES!
– Will you be able to learn it better and faster after having taken this
course? YES!
• Example 2:
– We will probably not cover the R package nnc (Nearest Neighbor
Autocovariates)
– Does that mean you need to find someone to teach it to you? NO!
– Will you be able to teach it to yourself? YES!
• Example 3:
– None of your instructors are computer scientists.
– Does this mean that they are not qualified to teach you? NO!
– Most of them are self-taught with regards to these techniques
Final Thoughts for Today
• THIS COURSE WILL POINT YOU IN THE RIGHT
DIRECTION AND PROVIDE A SET OF TOOLS
• IT IS YOUR JOB TO MAKE THEM FIT TOGETHER
AND USE THEM AS A LAUNCHING PAD TO
SOLVE PROBLEMS
• Next up: Intro to SAS on Thursday!
References
• Some background info on R, SAS, Stata, Latex
and Sweave was all pilfered from Wikipedia.