Hot Research Tools at Stanford

Raymond R. Balise, Ph.D.
Health Research and Policy/SPCTRM
Todd Ferris, MD
Stanford Center for Clinical Informatics

Topics
• Keeping your data secure
• Finding patient populations
• Tools for collecting and storing data
• Tools for analysis

Safety First
• Virus Scanner
• Disk and File Encryption
• Secure Email
• File Transfer
• Backup Tools

Getting Security Software
• Software licensed for the entire University can be found here: www.stanford.edu/services/ess/index.html
• You definitely want to have:
  – Sophos Anti-Virus
  – Stanford Desktop Tools
  – Security Self-Help Tool
  – BigFix Client
• You may want AFS & PGP

Sophos Anti-Virus (for both Windows & Mac OS)
• Watches for suspicious things and stops them until you authorize the software.
• If your quarantine has a file, get help.
• You can submit suspicious files.

Stanford Desktop Tools
• This allows you to install and update BigFix, Security Self-Help, Open AFS and other tools.
  – BigFix automatically checks for important software updates.
  – Security Self-Help checks for security weaknesses on your machine and lets you fix them.
  – Open AFS lets you access your UNIX account as if it were just another Windows hard drive.

Your UNIX Account
• You have a website made for you already:
  – www.stanford.edu/~YOUR_SUNET_ID
• UNIX stuff
  – You can use Stanford Desktop Tools to mount your UNIX drive just like another hard drive. I get stuff on the web quickly with Open AFS:
    www.stanford.edu/services/afs/intro/index.html
    www.stanford.edu/services/web/howto.leland.html
  – If you do not want AFS, you can also use SecureFX, which you can get from ESS.
  – Do NOT put confidential/HIPAA-sensitive stuff out there.

After AFS is Installed
My UNIX Space

SecureFX

WebAFS
• Only plan to use your AFS space occasionally? Or just want to be able to access your AFS space from any computer?
• Try WebAFS
• Log in to: afs.stanford.edu

BigFix Client
• Instead of worrying about applying all the patches you need, you can use BigFix.
• You will not typically notice it, but it will occasionally push a patch onto your machine and tell you to reboot.

Security – Hard Drives
• Unless it is encrypted, all the files on your computer’s hard drive are easily read.
• Stanford has licensed PGP whole disk encryption software and created a service called Stanford Whole Disk Encryption (SWDE).
• It secures your entire hard drive and can encrypt USB drives.
• If you have HIPAA-sensitive information, you must secure your computer: www.stanford.edu/services/encryption/wholedisk/

Security – Email
• Email provides all the confidentiality of a postcard.
• If you are sending HIPAA-sensitive information, you must secure your email: www.stanford.edu/services/secureemail/

Back up your work!
• Each year, on average, one in five of my students loses all their work. Plan on your computer being destroyed at the worst possible time this year.
  – Coffee, computer worm or virus, small child with refrigerator magnet, physical hard drive failure, theft, bicycle crash, etc.
• Every day, back up your work to more than one location.

Where to Back Up
• PLEASE use removable media if you have no network access.
  – Floppy disk, CD, DVD, flash media
• NEVER back up or share confidential data (HIPAA-sensitive protected health information) on mobile media without talking to security experts first.
• I used a Maxtor BlackArmor disk that has built-in encryption (but it does not work with PGP).
  – Ask your tech support person for recommendations.

Encrypted USB Drives
• USB drives (also called thumb drives) are a very convenient way to keep backups and move your data around.
• However, they are very easy to lose! NEVER store unencrypted, restricted data on a USB drive.
• You can encrypt at the file level (Excel, WinZip) – Good.
• You can encrypt the whole drive (PGP disk, TrueCrypt) – Better.
• You can have a hardware-encrypted USB drive – BEST!
  – There are many manufacturers; however, most are Windows only.
  – IronKey supports both Windows and Mac and is highly recommended.

Backup Tools
• There are many options for backup (local drive, server, online service).
• Properly implemented online backup services provide the most safety by storing the backup away from the original.
  – Many departments use the Iron Mountain backup service.
  – Individuals can get a Stanford discount for the Mozy backup service (mozy.com/stanford).
• Online services work great for less than 50 GB, but when you have large amounts of data, you need to consider other options.
• Why not just buy a 1 TB external USB drive and back up there?

Backup Tools (cont’d)
• External drives (more than one) can work well if they are encrypted and you remember to physically remove them from the location of the machine being backed up.
• Another option is to use software like CrashPlan (www.crashplan.com).
  – Allows backup to another machine (ideally in a different location, but on the same network).
  – Can encrypt the data being backed up.
  – Is free for personal use.
• NOTE: CrashPlan does not have a contract with Stanford. Consequently, its online service is not already approved for the storage of restricted data.

Building a Cohort
• Use the STRIDE cohort tool for work before IRB.
  – You can find out if you have enough subjects to continue with your study idea.
• There are separate tools for looking at the medical records after IRB.

Post IRB Tools

Collecting and Storing Data
• Excel
• REDCap
• Surveyor

What can a database do?
• Tracks who did what to every bit of information in the data capture system and when they did it
  – Is every change logged?
  – Can you roll back mistakes 2 days later?
• Controls what a user can see and modify
• Prevents you from entering garbage
  – Can I possibly enter blue for gender?
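To make the last two ideas concrete, here is a minimal R sketch of what "preventing garbage" and "logging every change" can look like. It is only an illustration: the function names (`validate_gender`, `log_change`), the list of allowed codes, the user name, and the layout of the audit log are all assumptions invented for this example, not part of any Stanford tool.

```r
# Hypothetical list of allowed codes; replace with your own codebook.
allowed_gender <- c("female", "male", "unknown", "refused")

# Return TRUE only for values on the allowed list (case-insensitive).
validate_gender <- function(value) {
  !is.na(value) & tolower(value) %in% allowed_gender
}

# Append one row to an audit log: who changed what, and when.
log_change <- function(log, user, field, old, new) {
  entry <- data.frame(
    when = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    user = user, field = field, old = old, new = new,
    stringsAsFactors = FALSE
  )
  rbind(log, entry)
}

# Start with an empty log that already has the right columns.
audit <- data.frame(
  when = character(), user = character(), field = character(),
  old = character(), new = character(), stringsAsFactors = FALSE
)

# "blue" is rejected, so nothing is logged; "Female" passes and is logged.
for (candidate in c("blue", "Female")) {
  if (validate_gender(candidate)) {
    audit <- log_change(audit, "user1", "gender", "unknown", tolower(candidate))
  }
}
print(audit)
```

A real data capture system does this for every field and every user automatically, which is the reason to prefer one over hand-built checks.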
Excel…
• I think Excel 2007 or 2008, in theory, can meet all these requirements if you have an extraordinarily talented (VBA) programmer.
• I tried and I could not implement a satisfactory database model.
• Anybody who is good enough to make it work will tell you to use a different tool.
• Excel is NOT a database, but it is not useless.

Excel 2003 vs. 2007
• Office 2007 file suffixes end with an x (.xlsx vs. .xls).
• New graphical user interface (ribbon instead of menus)
  – Push F1 to start Excel Help, then search for Interactive 2003 to find where they moved stuff.
• Microsoft Help is no longer an oxymoron… lots of videos.

Setting up a Spreadsheet
• Use column headings
  – Keep names short but meaningful
  – No spaces
  – No special characters
    • ~!@#$%^&*()_-
  – Use CamelCase
    • First letter of each word is capitalized
  – Use verbs

Include a Dummy Record
• Include a fake first patient
  – Make the width of the character fields as wide as the widest possible value
    • African-American is 16 characters wide, so use it for the fake subject’s race
    • X234567890123456 is a nice way to force the width to be 16 characters wide

NO Missing Data
• You want to have a value in every cell in your spreadsheets. If something is unknown, code it as “missing”, “unknown”, “refused”, “illegible”, “N/A”, etc.
• You want a blank cell to be a clear indicator that something is wrong.

Make it a Table
• If you have Excel 2007, convert the values to be a table.
  – Select the header record and the dummy record
• The context-specific Table tools show up when you have clicked anywhere inside the table.
  – Give the table a name
  – Pick a color scheme

Data Entry Help
• Row or column banding helps a LOT with data entry. If you scroll down the table, the column headings are still displayed.

Garbage In, Garbage Out
• Prevent bad data from getting into your system with validation.
  – In Excel 2003, click on the column, then open the Data menu and choose Validation…
  – In Excel 2007, click a cell in the dummy record, then click on the Data tab and choose Data Validation

Custom Validation
• By default, you can put anything in any cell.
• Change the IDs to only allow whole numbers of 0 or more.
Uncheck this

Validate Everything

Validation is Auto-filled
• The validation is filled in down the table as you add new records.
The triangles indicate a note

Custom Errors
• You can change and enhance the message. Click the validated cell(s) you want to modify and click Data Validation.

Excel is Still Problematic
• Even set up properly, Excel has significant issues. Be aware that:
  – It does not always plot data correctly.
  – In some versions, math does not work correctly on very large numbers.
  – Exports into other packages do not work cleanly and do not always generate error messages. (The result is missing data without error/warning messages.)

Excel 2007 … Awesome …
[Chart: Series1]

R

SAS

You can fix this.
• Make sure to follow these instructions carefully and/or ask for help from your IT person. If you tweak the wrong thing in the registry, you can render your machine unable to reboot!
1. In XP, click the Windows Start menu and choose Run; in Vista, search for and open regedit.
2. In the Run dialog (XP), type regedit and click OK.
3. Open up the tree to this path: HKEY_LOCAL_MACHINE ► SOFTWARE ► Microsoft ► Jet ► 4.0 ► Engines ► Excel
4. Double-click TypeGuessRows.
5. Type 0 (that is, zero, not the letter o) in the DWORD editor and click OK.
6. Repeat for this path: HKEY_LOCAL_MACHINE ► Software ► Microsoft ► Office ► 12.0 ► Access Connectivity Engine ► Engines ► Excel
• Microsoft Access will silently change this setting!
  – So watch this setting if you use Access.

Rather than Excel
• Rather than doing all the hard work of setting up a validated Excel workbook, you can use REDCap, a tool provided by the School of Medicine.
• You use Excel to make an easy template that describes the data you need to collect, then SCCI does the rest.

redcap.stanford.edu
Click here.
Everyone with permission to use REDCap can see the demo database.
Your work will appear here until it is on the final “build”.

[Screenshot labels: Text; Date text; Notes; Dropdown lists; Radio buttons]

Explore the Excel Tutorial
• This is the REDCap Data Dictionary Demo File.
• It is just an Excel file that REDCap uses to build the database (inside of MySQL).
Watch these to learn how to set it up.

Start to Finish
• Figure out…
  – how to break up the questionnaire into on-screen forms
  – if questions generate multiple answers
  – what to name each question

PHI
• These are not mutually exclusive, so you need many yes/no variables.
• This is an extra variable.
• These are mutually exclusive, so only one variable.

Other demographics and medical information
• This is an extra variable.
• This is 3 variables.

[Variable names shown: last, dob, racewhite, First, age, country, raceblack, raceasian, raceeast, ishispanic, reason, reasonother, middle, raceother, racedetail]

Screen shot of first build

Surveyor
med.stanford.edu/irt/survey/
• A great tool for collecting data into a safe location

Analysis Tools
• R/R Commander
  – R is the preferred statistical tool for most statisticians at Stanford. Its help files are user-hostile and the learning curve is a very rough climb.
• SAS with SAS/Enterprise Guide

R 2.9
R is a modern programming language with user-hostile help files…

Learning R
• Finally, a great introductory book for R. It focuses exclusively on data manipulation and graphics instead of mixing statistics with the language.
• The index is not great, but otherwise it is ideal.

Slides/Notes on R
• Notes from my five two-hour introductory talks are here: www.stanford.edu/~balise/HowToDoBiostatistics.htm
• Notes from a two-hour introduction to R for life sciences can be found here: www.stanford.edu/~balise/SPCTRM/RProgrammingLifeSciences20090331.pptx

How it Works
• R has two main websites. One describes the project: http://www.r-project.org/
• The other has all the stuff you could ever want to download: http://cran.r-project.org/
• Because the project has people working all over the globe, the software download site is “mirrored” everywhere. The closest mirror is USA CA1 (aka UC Berkeley): http://cran.cnr.berkeley.edu/
• There is an R installer for all the common operating systems:
  – cran.cnr.berkeley.edu/bin/windows/base/
  – cran.cnr.berkeley.edu/bin/macosx/
  – cran.cnr.berkeley.edu/bin/linux/
• Each is basically self-explanatory.

Tell R to Update Now
• Update using Berkeley

Rcmdr
• Get the Rcmdr package.
  – Go to the Packages menu in the R Console window.
  – Click Install packages.
  – Tell it to use a mirror (Berkeley).
  – Click Rcmdr, then push OK.
• It will download Rcmdr plus a lot of other packages that Rcmdr depends on.
• You only need to do this once, but you want to regularly use the Update option on the Packages menu.

Starting Rcmdr
• Remember: capitalization matters… In the R console, type: library(Rcmdr)
  Then push Enter to get a new window.

R Commander 1.4
Rcmdr is a friendly, but incomplete, graphical user interface (GUI) for R.

Importing Data into R
• You can easily import data into R with Rcmdr by using the Data menu. I will assume that you have data stored in Excel.
  – Excel is NOT a good way to enter and store data, but it is what is commonly used. Do not take my use of Excel as an endorsement of the product.

Importing Excel
1) Pick Excel from the Data menu.
2) Type a short name. I suggest you use a capital letter for a dataset name.
3) Navigate to where the file is located on your hard drive.
4) Click on the table name and push OK.
Data from Glenn A. Walker’s Common Statistical Methods for Clinical Research with SAS Examples

Adding a New Variable
• For the first analysis, you want to compare the Body Mass Index of the subjects to a population value (28.4). The formula for BMI is:

$$\mathrm{BMI} = \frac{\mathrm{WeightKG}}{\left(\mathrm{HeightCm}/100\right)^{2}}$$

Compute the BMI and save it to the dataset.

Adding a New Variable (2)
• Type in the formula (you can double-click the variable names to save on typos).
• After hitting OK, you can browse the new data set by clicking on the View data set button.

Look at the Data
• Never, EVER do a statistical test before you have looked at your data graphically and with a numeric summary.
• Ask for a numeric summary of the entire dataset, not just the variables that are in the analysis.
  – Common sense applies here (summarizing 10,000 variables is not a great idea), but it is a very good idea to look at everything to see if any one variable suggests a problem.

If you want to code…
summary() is a smart generic function. If you apply it to a data set, you get summaries of each variable. If you apply it to a model, you get information on each of the predictors.

SAS 9.2 TS2
SAS is an old programming language where you type commands and run a bunch of things at once.

Enterprise Guide 4.2
EG is a newish programming environment where you type commands or point and click.

Lots of Tools
• If you have questions about tools for data management and analysis, please ask.
  med.stanford.edu/spctrm/biostatistician.html
  clinicalinformatics.stanford.edu/consultation/
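
For readers who would rather type than click, the BMI and "Look at the Data" steps from the R Commander slides can be sketched directly in the R console. This is only a sketch under stated assumptions: the five rows of data are made up for illustration (they are not the Walker data), the dataset name `Dataset` is a placeholder, and the one-sample t test is one common way to compare a mean to a population value, not necessarily the test used in the talk.

```r
# Made-up values for illustration only; not the data from the slides.
Dataset <- data.frame(
  WeightKG = c(70, 85, 92, 60, 110),
  HeightCm = c(170, 180, 175, 160, 185)
)

# BMI = weight in kg divided by the square of height in meters.
Dataset$BMI <- Dataset$WeightKG / (Dataset$HeightCm / 100)^2

# Look at the data first: a numeric summary of every variable, then a plot.
print(summary(Dataset))
hist(Dataset$BMI, main = "BMI", xlab = "BMI (kg/m^2)")

# One possible comparison to the population value of 28.4 (assumed test).
result <- t.test(Dataset$BMI, mu = 28.4)
print(result)
```

The column names match the ones used in the formula slide, so the same expression, WeightKG / (HeightCm / 100)^2, is what you would type into the Rcmdr compute dialog.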