Hot Research Tools at Stanford

Raymond R. Balise, Ph.D.
Health Research and Policy/SPECTRM
Todd Ferris, MD
Stanford Center for Clinical Informatics
Topics
• Keeping your data secure
• Finding patient populations
• Tools for collecting and storing data
• Tools for analysis
Safety First
• Virus Scanner
• Disk and File Encryption
• Secure Email
• File Transfer
• Backup Tools
Getting Security Software
• Software licensed for the entire University can
be found here:
www.stanford.edu/services/ess/index.html
• You definitely want to have:
– Sophos Anti-Virus
– Stanford Desktop tools
– Security Self-Help Tool
– BigFix Client
• You may want AFS & PGP
Sophos Anti-Virus (for both Windows & Mac OS)
• Watches for suspicious things and stops them until you
authorize the software.
• If your quarantine has a file, get help.
• You can submit suspicious files.
Stanford Desktop Tools
• This allows you to install and update BigFix,
Security Self-Help and Open AFS and other
tools.
– BigFix automatically checks for important
software updates.
– Security Self-Help checks and allows you to fix
security weaknesses on your machine.
– Open AFS lets you have access to your UNIX
account like it is just another Windows hard drive.
Your UNIX Account
• You have a website made for you already:
– www.stanford.edu/~YOUR_SUNET_ID
• UNIX stuff
– You can use Stanford Desktop Tools to mount your
UNIX drive just like another hard drive. I get stuff on
the web quickly with Open AFS
www.stanford.edu/services/afs/intro/index.html
www.stanford.edu/services/web/howto.leland.html
– If you do not want AFS you can also use SecureFX
which you can get from ESS.
– Do NOT put confidential/HIPAA sensitive stuff out
there.
After AFS is Installed
My UNIX Space
SecureFX
WebAFS
• Only plan to use your AFS space occasionally?
Or just want to be able to access your AFS
space from any computer?
• Try WebAFS.
• Log in to:
afs.stanford.edu
BigFix Client
• Instead of worrying about applying all the
patches you need, you can use BigFix.
• You will not typically notice it but it will
occasionally push a patch onto your machine
and tell you to reboot.
Security – Hard Drives
• Unless it is encrypted, all the files on your
computer’s hard drive are easily read.
• Stanford has licensed PGP whole disk encryption
software, and created a service called Stanford
Whole Disk Encryption (SWDE)
• Secures your entire hard drive and can encrypt USB
drives.
• If you have HIPAA sensitive information, you must
secure your computer:
www.stanford.edu/services/encryption/wholedisk/
Security - Email
• Email provides all the confidentiality of a
postcard.
• If you are sending HIPAA sensitive information,
you must secure your email:
www.stanford.edu/services/secureemail/
Back up your work!
• Each year, on average, one in five of my
students loses all their work. Plan on your
computer being destroyed at the worst
possible time this year.
– Coffee, computer worm or virus, small child with
refrigerator magnet, physical hard drive failure,
theft, bicycle crash, etc.
• Every day, back up your work to more than
one location.
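The "more than one location" rule can be sketched as a small script. This is only an illustration: the function names and the throwaway folders are hypothetical, and it does no encryption, so it is not suitable as-is for confidential or HIPAA sensitive data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(source: Path, destinations: list) -> list:
    """Copy `source` into every destination folder and verify each copy."""
    if len(destinations) < 2:
        raise ValueError("back up to more than one location")
    expected = sha256_of(source)
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, folder / source.name))
        # A copy that does not match the original is not a backup.
        if sha256_of(copy) != expected:
            raise OSError(f"backup in {folder} does not match the original")
        copies.append(copy)
    return copies


# Demonstration with temporary folders standing in for real backup locations.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "analysis_notes.txt"
    work.write_text("today's work\n")
    made = back_up(work, [Path(tmp) / "drive_a", Path(tmp) / "drive_b"])
    backup_count = len(made)
    print(backup_count)
```

Comparing checksums after each copy is a cheap way to catch a backup that silently failed.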
Where to Backup
• PLEASE use removable media if you have no
network access –
– Floppy disk, CD, DVD, flash media
• NEVER back up or share confidential data (HIPAA
sensitive protected health information) on mobile
media without talking to security experts first.
• I used a Maxtor BlackArmor disk that has built-in
encryption (but it does not work with PGP).
– Ask your Tech support person for recommendations.
Encrypted USB drives
• USB drives (also called thumb drives) are a very convenient
way to keep backups and allow you to move your data
around.
• However, they are very easy to lose! NEVER store
unencrypted, restricted data on a USB drive.
• You can encrypt at the file level (Excel, WinZip) – Good.
• You can encrypt the whole drive (PGP disk, TrueCrypt) –
Better.
• You can have a hardware encrypted USB drive – BEST!
– There are many manufacturers, however, most are Windows
only.
– IronKey supports both Windows and Mac and is highly
recommended.
Backup Tools
• There are many options for backup (local drive, server,
online service).
• Properly implemented online backup services provide
the most safety by storing the backup away from the original.
– Many departments use the Iron Mountain backup service.
– Individuals can get a Stanford discount for the Mozy
backup service (mozy.com/stanford).
• Online services work great for < 50 GB, but when you
have large amounts of data, you need to consider other
options.
• Why not just buy a 1 TB external USB drive and back up
there?
Backup Tools (cont’d)
• External drives (more than 1) can work well if they are
encrypted and you remember to physically remove them
from the location of the machine being backed up.
• Another option is to use software like CrashPlan
(www.crashplan.com).
– Allows backup to another machine (ideally in a different
location, but on the same network).
– Can encrypt the data being backed up.
– Is free for personal use.
• NOTE: CrashPlan does not have a contract with Stanford.
Consequently, its online service is not already approved for
the storage of restricted data.
Building a Cohort
• Use the STRIDE cohort tool for work before
IRB.
– You can find out if you have enough subjects to
continue with your study idea.
• There are separate tools for looking at the
medical records after IRB.
Post IRB Tools
Collecting and Storing Data
• Excel
• REDCap
• Surveyor
What can a database do?
• Track who did what to every bit of information
in the data capture system and when they did
it
• Is every change logged?
• Can you roll back mistakes 2 days later?
• Controls what a user can see and modify
• Prevents you from entering garbage
• Can I possibly enter blue for gender?
Excel…
• I think Excel 2007 or 2008, in theory, can meet all
these requirements if you have an
extraordinarily talented (VBA) programmer.
• I tried and I could not implement a
satisfactory database model.
• Anybody that is good enough to make it work
will tell you to use a different tool.
• Excel is NOT a database but it is not useless.
Excel 2003 vs. 2007
• Office 2007 file suffixes end with an x (.xlsx vs. .xls)
• New graphical user interface (ribbon instead of menus)
– Push F1 to start Excel Help then search for Interactive 2003 to find
where they moved stuff.
• Microsoft Help is no longer an oxymoron… lots of videos.
Setting up a Spreadsheet
• Use column headings
– Keep names short but meaningful
– No spaces
– No special characters
• ~!@#$%^&*()_-
– Use CamelCase
• First letter of each word is capitalized
– Use verbs
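The naming rules above can be checked mechanically. The sketch below is one possible reading of them (letters and digits only, first letter of each word capitalized). The helper names are hypothetical and the regular expressions are an assumption about what counts as a "special character".

```python
import re


def to_column_name(label: str) -> str:
    """Turn a free-text label into a CamelCase name with only letters and digits."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    return "".join(word[:1].upper() + word[1:] for word in words)


def is_valid_column_name(name: str) -> bool:
    """True when the name starts with a capital letter and has no spaces or special characters."""
    return re.fullmatch(r"[A-Z][A-Za-z0-9]*", name) is not None


print(to_column_name("height cm"))
print(is_valid_column_name("weight kg"))
```

Running a check like this over the header row before data entry starts is cheaper than renaming columns after the analysis code depends on them.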
Include a Dummy Record
• Include a fake first patient
– Make the width of the character fields as wide as
the widest possible value
• African-American is 16 characters wide so use it for the fake
subject’s race
• X234567890123456 is a nice way to force the width to
be 16 characters wide
NO Missing Data
• You want to have a value in every cell in your
spreadsheets. If something is unknown, code
it as “missing”, “unknown”, “refused”,
“illegible”, “N/A”, etc.
• You want a blank cell to be a clear indicator
that something is wrong.
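If every unknown value is coded explicitly, any blank cell is an error and can be found automatically. A minimal sketch, assuming the sheet has been saved as CSV text with a header row and that no row has more cells than the header; the column names and sample values are made up.

```python
import csv
import io


def find_blank_cells(csv_text: str) -> list:
    """Return (row number, column name) for every empty cell; row 1 is the first data row."""
    blanks = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            # A missing or whitespace-only cell counts as blank.
            if value is None or value.strip() == "":
                blanks.append((row_number, column))
    return blanks


# Made-up sample: the second record has a blank Race cell.
sample = "Id,Race,Smoker\n1,unknown,refused\n2,,N/A\n"
print(find_blank_cells(sample))
```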
Make it a Table
• If you have Excel 2007, convert the values to
be a table.
– Select the header record and the dummy record
• The context specific Table tools show up when
you have clicked anywhere inside of the table.
Give the table a name
Pick a color scheme
Data Entry Help
• Row or column banding helps a LOT with data
entry.
If you scroll down the table, the
column headings are still displayed.
Garbage In, Garbage Out
• Prevent bad data from getting into your system
with validation.
– In Excel 2003, click on the column then open the
Data menu and choose Validation…
– In Excel 2007, click a cell in the dummy record, then
click on the Data tab and choose Data Validation
Custom Validation
• By default, you can put anything in any cell.
• Change the IDs to only allow whole numbers
starting with 0.
Uncheck this
Validate Everything
Validation is Auto-filled
• The validation is filled-in down the table as
you add new records.
The triangles indicate a note
Custom Errors
• You can change and enhance the message. Click
the validated cell(s) you want to modify and click
Data Validation.
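The same validation ideas can be written down outside of Excel. The sketch below is an illustration, not Excel's validation engine: it reads "whole numbers starting with 0" as "whole numbers of 0 or more", the list of allowed gender values is hypothetical, and the error messages play the role of the custom error text.

```python
from typing import Optional

# Hypothetical list of accepted values; use whatever your study defines.
ALLOWED_GENDERS = {"female", "male", "unknown", "refused"}


def check_id(value: str) -> Optional[str]:
    """Return an error message, or None when the ID is a whole number of 0 or more."""
    text = value.strip()
    if text.isascii() and text.isdigit():
        return None
    return f"ID must be a whole number of 0 or more; got {value!r}."


def check_gender(value: str) -> Optional[str]:
    """Return an error message, or None when the value is in the allowed list."""
    if value.strip().lower() in ALLOWED_GENDERS:
        return None
    return f"Gender must be one of {sorted(ALLOWED_GENDERS)}; got {value!r}."


def check_record(record: dict) -> list:
    """Collect every validation message for one record."""
    problems = [check_id(record["Id"]), check_gender(record["Gender"])]
    return [p for p in problems if p is not None]


# "blue" is rejected for gender; the ID passes.
print(check_record({"Id": "17", "Gender": "blue"}))
```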
Excel is Still Problematic
• Even set up properly, Excel has significant
issues. Be aware that:
– It does not always plot data correctly.
– In some versions, math does not work correctly on
very large numbers.
– Exports into other packages do not work cleanly
and do not always generate error messages. (The
result is missing data without error/warning
messages.)
Excel 2007 … Awesome …
[Chart: Series1, vertical axis from 0 to 10, horizontal axis from 1 to 5]
R
SAS
You can fix this.
• Make sure to follow these instructions carefully and/or ask for
help from your IT person. If you tweak the wrong thing in the
registry you can render your machine unable to reboot!
1. In XP, click the Windows Start menu and choose Run. In Vista,
search for regedit from the Start menu, open it, and skip to step 3.
2. In the Run dialog, type regedit and click OK.
3. Open up the tree to this path:
HKEY_LOCAL_MACHINE ► SOFTWARE ► Microsoft ► Jet ► 4.0 ► Engines ► Excel
4. Double-click TypeGuessRows.
5. Type 0 (zero, not the letter o) in the DWORD editor and
click OK.
6. Repeat for this path:
HKEY_LOCAL_MACHINE ► Software ► Microsoft ► Office ► 12.0 ► Access Connectivity Engine ► Engines ► Excel
• Microsoft Access will silently change this setting!
– So watch this setting if you use Access.
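Why the TypeGuessRows setting matters: the Excel driver decides each column's type by looking at only the first few rows, and later values that do not fit the guessed type can come through as missing. The sketch below is a toy model of that behavior, not the driver's actual algorithm. The function names are hypothetical, and the scan depth of 8 is used purely for illustration.

```python
def is_number(text: str) -> bool:
    """True when the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_type(values: list, rows_to_scan: int) -> str:
    """Guess 'number' or 'text' from the first rows_to_scan values (0 means scan them all)."""
    sample = values if rows_to_scan == 0 else values[:rows_to_scan]
    return "number" if all(is_number(v) for v in sample) else "text"


def import_column(values: list, rows_to_scan: int) -> list:
    """Import a column, silently turning values that do not fit the guessed type into None."""
    if guess_type(values, rows_to_scan) == "number":
        return [float(v) if is_number(v) else None for v in values]
    return list(values)


# Eight numeric rows followed by a coded value.
column = ["1", "2", "3", "4", "5", "6", "7", "8", "unknown"]
shallow = import_column(column, 8)  # the coded value is lost without any warning
full = import_column(column, 0)     # every row is scanned, so nothing is lost
print(shallow[-1], full[-1])
```

In this toy model a shallow scan loses the ninth value silently, which is the "missing data without error/warning messages" problem described earlier.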
Rather than Excel
• Rather than doing all the hard work of setting
up a validated Excel workbook, you can use a
tool provided by the School of Medicine,
REDCap.
• You use Excel to make an easy template that
describes the data you need to collect, then
SCCI does the rest.
redcap.stanford.edu
Click here.
Everyone with permission to
use REDCap can see the demo
database.
Your work will appear here
until it is on the final “build”.
Text
Date text
Notes
Dropdown lists
Radio buttons
Explore the Excel Tutorial
• This is the REDCap Data Dictionary Demo File.
• It is just an Excel file that REDCap uses to build
the database (inside of MySQL).
Watch these to learn how to set it up.
Start to Finish
• Figure out…
– how to break up the questionnaire into on-screen
forms
– if questions generate multiple answers
– what to name each question
Annotated questionnaire (screenshot callouts):
• PHI
• Other demographics and medical information
• These are not mutually exclusive. So you need many yes and no variables.
• These are mutually exclusive so only one variable.
• This is an extra variable. (marked in two places)
• This is 3 variables.
• Variable names: First, middle, last, dob, age, country, racewhite, raceblack, raceasian, raceeast, raceother, racedetail, ishispanic, reason, reasonother
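The difference between "many yes and no variables" and "only one variable" can be sketched as follows. The variable names come from the slide, but which questionnaire item each one belongs to is an assumption, and the function name is hypothetical.

```python
# Race is "check all that apply", so each choice gets its own yes/no variable.
RACE_VARIABLES = ["racewhite", "raceblack", "raceasian", "raceeast", "raceother"]


def encode_race(checked: set) -> dict:
    """Turn the set of checked boxes into one yes/no value per race variable."""
    unexpected = checked - set(RACE_VARIABLES)
    if unexpected:
        raise ValueError(f"unknown choices: {sorted(unexpected)}")
    return {name: ("yes" if name in checked else "no") for name in RACE_VARIABLES}


# A subject who checked two boxes. A mutually exclusive question such as
# ishispanic needs only a single variable.
encoded = encode_race({"racewhite", "raceasian"})
record = {"ishispanic": "no", **encoded}
print(record)
```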
Screen shot of first build
Surveyor
med.stanford.edu/irt/survey/
• A great tool for collecting data into a safe
location
Surveyor
Analysis Tools
• R/R Commander
– R is the preferred statistical tool for most
statisticians at Stanford. Its help files are user-hostile
and the learning curve is a very rough climb.
• SAS with SAS/Enterprise Guide
R 2.9
R is a modern programming language with user-hostile help files….
Learning R
• Finally, a great introductory book
for R. It focuses exclusively
on data manipulation and
graphics instead of mixing
statistics with the language.
• The index is not great but
otherwise, it is ideal.
Slides/Notes on R
• Notes from my five, two-hour-long introductory
talks are here:
www.stanford.edu/~balise/HowToDoBiostatistics.htm
• Notes from a two-hour-long introduction to R for
life sciences can be found here:
www.stanford.edu/~balise/SPCTRM/RProgrammingLifeSciences20090331.pptx
How it Works
• R has two main websites. One describes the project:
http://www.r-project.org/
• The other has all the stuff you could ever want to
download:
http://cran.r-project.org/
• Because the project has people working all over the
globe, the software download site is “mirrored”
everywhere. The closest mirror is USA CA1 (aka UC
Berkeley).
http://cran.cnr.berkeley.edu/
• There is an R installer for all the common
operating systems:
• cran.cnr.berkeley.edu/bin/windows/base/
• cran.cnr.berkeley.edu/bin/macosx/
• cran.cnr.berkeley.edu/bin/linux/
• Each is basically self-explanatory.
Tell R to Update Now
• Update using Berkeley
Rcmdr
• Get the Rcmdr package.
– Go to the Packages menu in the R Console
window.
– Click Install packages.
– Tell it to use a mirror (Berkeley).
– Click Rcmdr then push OK.
• It will download Rcmdr plus a lot of other
packages that Rcmdr depends on.
• You only need to do this once but you want
to regularly use the Update option on the
Packages menu.
Starting Rcmdr
• Remember: capitalization matters … In the R
console, type:
library(Rcmdr)
• Then push Enter to get a new window.
R Commander 1.4
Rcmdr is a friendly, but incomplete, graphical user interface (GUI) for R.
Importing Data into R
• You can easily import data into R with Rcmdr
by using the Data menu. I will assume that
you have data stored in Excel.
– Excel is NOT a good way to enter and store data
but it is what is commonly used. Do not take my
use of Excel as an endorsement of the product.
Importing Excel
1) Pick Excel from the Data menu.
2) Type a short name. I suggest you use a capital letter for a
dataset name.
3) Navigate to where the file is located on your hard drive.
4) Click on the table name and push OK.
Data from Glenn A. Walker’s Common Statistical Methods for Clinical Research with SAS Examples
Adding a New Variable
• For the first analysis, you want to compare the Body Mass Index of
the subjects to a population value (28.4). The formula for BMI is:
\[ \mathrm{BMI} = \frac{\mathrm{WeightKG}}{\left(\mathrm{HeightCm}/100\right)^{2}} \]
• Compute the BMI and save it to the dataset.
Adding a New Variable (2)
• Type in the formula (you can double click the
variable names to save on typos).
• After hitting OK you can browse the new data
set by clicking on the View data set button.
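For readers who prefer to see the computation spelled out, here is the same BMI formula (weight in kilograms divided by the square of height in meters) as a short sketch. The variable names WeightKG and HeightCm follow the slide; the two subjects are made-up values, not the Walker data.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index: weight in kg divided by the square of height in meters."""
    return weight_kg / (height_cm / 100) ** 2


# Made-up subjects for illustration only.
subjects = [
    {"WeightKG": 70.0, "HeightCm": 175.0},
    {"WeightKG": 81.0, "HeightCm": 180.0},
]

# Add the new variable to every record, as the Compute New Variable dialog does.
for subject in subjects:
    subject["BMI"] = bmi(subject["WeightKG"], subject["HeightCm"])

print([round(s["BMI"], 1) for s in subjects])
```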
Look at the Data
• Never, EVER do a statistical test before you have looked at
your data graphically and with a numeric summary.
• Ask for a numeric summary of the entire dataset, not just
the variables that are in the analysis.
– Common sense applies here (summarizing 10,000 variables is not
a great idea) but it is a very good idea to look at everything to
see if any one variable suggests a problem.
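A numeric summary of every variable does not need much machinery. The sketch below is a rough illustration of the idea, not a replacement for the Rcmdr summary. The function name and the three records are made up, and a blank value is counted separately so that it stands out.

```python
import statistics


def summarize(records: list) -> dict:
    """Per-variable summary: counts, plus min/median/mean/max for numeric variables."""
    summary = {}
    for name in records[0].keys():
        values = [record[name] for record in records]
        present = [v for v in values if v not in ("", None)]
        numeric = [v for v in present if isinstance(v, (int, float))]
        info = {"n": len(present), "blank": len(values) - len(present)}
        # Only summarize numerically when every non-blank value is a number.
        if numeric and len(numeric) == len(present):
            info.update(
                min=min(numeric),
                median=statistics.median(numeric),
                mean=statistics.mean(numeric),
                max=max(numeric),
            )
        summary[name] = info
    return summary


# Made-up records; the third subject has a blank Race cell.
records = [
    {"Age": 34, "Race": "unknown"},
    {"Age": 51, "Race": "white"},
    {"Age": 46, "Race": ""},
]
summary = summarize(records)
print(summary)
```

Looking at the blank counts for every variable, not just the analysis variables, is often how a data entry problem first shows up.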
If you want to code….
summary() is a smart generic function. If you apply it to a dataset, you get
summaries of each variable. If you apply it to a model, you get information on
each of the predictors.
SAS 9.2 TS2
SAS is an old programming language where you
type commands and run a bunch of things at once.
Enterprise Guide 4.2
EG is a newish programming environment where you type commands or point and click.
Lots of Tools
• If you have questions about tools for data
management and analysis please ask.
med.stanford.edu/spctrm/biostatistician.html
clinicalinformatics.stanford.edu/consultation/