Lies, Damned Lies, and Health Physics
Some Random Comments About Statistics in Health Physics
Tom LaBone
Savannah River Chapter of the Health Physics Society
Aiken, SC
April 15, 2011

“There are three kinds of lies: lies, damned lies, and statistics.”
Mark Twain
“It is easy to lie with statistics.” “It is hard to tell the truth without statistics.”
Andrejs Dunkels

Today
Informal, mostly apocryphal discussion of what statistics really is, who practices statistics and how they do it, and why all of this is important to you as a health physicist
Main message of talk
  A good working knowledge of statistics is essential in any endeavor where data are collected and analyzed (e.g., health physics)
  Everyone in the room should become a statistician (of sorts)
No math is used in this presentation and no health physicists were harmed during its preparation

Health Physics and Statistics
Some HP “stat” books I used in school
  G. F. Knoll, Radiation Detection and Measurement, 1st Edition, 1979
  J. Shapiro, Radiation Protection, 1st Edition, 1972
  H. Cember, Introduction to Health Physics, 1st Edition, 1969
  R. D. Evans, The Atomic Nucleus, 1955
  P. R. Bevington, Data Reduction and Error Analysis for the Physical Sciences, 1st Edition, 1969
Statistics was a tool, a “wrench to turn a nut”
Is that all it is?

What is Statistics?
“Humans are good, she knew, at discerning subtle patterns that are really there, but equally so at imagining them when they are altogether absent.”
Carl Sagan in Contact

Signals and Noise
Useful information comes to us in the form of signals that form distinct patterns
The signals are contaminated with varying degrees of noise, which can make it difficult to see the signal

Seeing Patterns
In our evolutionary history, seeing patterns where none existed may have been less harmful than missing patterns that did exist
  Noise in the grass: is it just the wind or is it a lion?
So, we as a species got very good at seeing patterns, even in the absence of a signal

Apophenia
Apophenia is the experience of seeing meaningful patterns or connections in random or meaningless data
What do you see below?

Face on Mars
[Images: Viking 1 Orbiter; Mars Global Surveyor]

Face in Food, et cetera

Face in Data

Statistics is …
… a science that helps us to differentiate signal from noise and make decisions with a known probability of being wrong
… a very practical, decision oriented methodology developed to tame our natural tendency to be Apopheniacs
… based on the idea that variability and noise are natural and unavoidable
… a relatively modern science that is actively evolving, especially since cheap, powerful computers became available

Really, What is Statistics?
“Statistics is concerned with collecting, analyzing, and interpreting data in the best possible way, where the meaning of ‘best’ depends on the particular circumstances of the practical situation”
Chris Chatfield
Problem Solving: A Statistician’s Guide

Exploratory Data Analysis
Look at data (usually with graphics) and use our ability to see patterns in the data to
  Suggest hypotheses to test
  Assess validity of assumptions on which statistical inference will be based
  Support the selection of appropriate inferential tests
  Suggest ideas for further data collection

[Figure: pairwise plots of Pu239, Cm244, and Am241 for two data sets, “Air Filters” (Kinectrics Filters, All 3) and “Fecal Samples” (Fecals as of 3/5/2011). Points are labeled by sample number. Slopes annotated on the panels: 0.316, 0.236, 2.02, 1.38, 4.56, 6.09.]

Confirmatory Data Analysis
Use statistical tests to answer questions about the data along with the risks of reaching the wrong conclusion
  Is the material on the filters the same material that is in the fecal samples?
  Are the Pu-239 to Am-241 ratios in the fecal samples and air samples the same once we account for random noise?
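As a rough illustration of the ratio question above, the sketch below compares two slopes by simulation. Everything in it is an assumption made for illustration: the data are synthetic (not the filter or fecal measurements from the talk), the sample sizes and noise level are invented, the slope of a line through the origin stands in for the activity ratio, and the percentile bootstrap is one of several ways a statistician might attack the problem, not necessarily the method behind the slides.

```python
import random

def slope_through_origin(x, y):
    """Least-squares slope b for the model y = b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def bootstrap_slopes(x, y, n_boot, rng):
    """Resample (x, y) pairs with replacement and refit the slope each time."""
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(slope_through_origin([x[i] for i in idx],
                                           [y[i] for i in idx]))
    return slopes

def percentile_interval(values, level=0.95):
    """Simple percentile interval taken from the sorted bootstrap values."""
    s = sorted(values)
    lo = s[int((1 - level) / 2 * len(s))]
    hi = s[int((1 + level) / 2 * len(s)) - 1]
    return lo, hi

def make_synthetic(true_slope, n, noise_sd, rng):
    """Hypothetical data: y proportional to x plus Gaussian noise."""
    x = [rng.uniform(0.5, 12.0) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y

rng = random.Random(2011)  # fixed seed so the run is reproducible

# Made-up stand-ins for the two sample types (slopes chosen arbitrarily).
fecal_x, fecal_y = make_synthetic(true_slope=1.4, n=200, noise_sd=1.0, rng=rng)
filter_x, filter_y = make_synthetic(true_slope=2.0, n=30, noise_sd=1.0, rng=rng)

fecal_slope = slope_through_origin(fecal_x, fecal_y)
filter_slope = slope_through_origin(filter_x, filter_y)

# Bootstrap each data set independently, then look at the slope difference.
fecal_boot = bootstrap_slopes(fecal_x, fecal_y, 2000, rng)
filter_boot = bootstrap_slopes(filter_x, filter_y, 2000, rng)
diff_boot = [f - a for f, a in zip(fecal_boot, filter_boot)]
diff_lo, diff_hi = percentile_interval(diff_boot)

# If the interval for the difference covers zero, the data do not
# distinguish the two ratios at roughly the 95% level.
ratios_look_same = diff_lo <= 0.0 <= diff_hi

print(f"fecal slope  = {fecal_slope:.3f}")
print(f"filter slope = {filter_slope:.3f}")
print(f"95% interval for the difference = ({diff_lo:.3f}, {diff_hi:.3f})")
print("ratios look the same" if ratios_look_same else "ratios look different")
```

With real measurements, the two `make_synthetic` calls would be replaced by the measured activities, and the choice of model (through the origin or not, constant or proportional noise) is exactly the kind of assumption worth checking with a statistician.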

[Figure: Fecal Samples. Am-241 (mBq) plotted against Pu-239 (mBq); 95% CI = (1.33, 1.46)]

Data Dredging
Are the two Pu-239 to Am-241 ratios the same?
  If this question was asked before we saw the data, we can proceed with the test to answer it
  If this question was inspired by the data, then we should not test the same data to get the answer
Referred to as data snooping, data dredging, etc.
  Cancer clusters

Statistical Method
Define the problem
  Formulate your questions in such a way that unambiguous answers are possible
Collect data
  Collect data capable of answering your question
Analyze the data
Present the results in terms your audience can understand

Define the Problem
“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.”
John Tukey
“It is better to solve the right problem the wrong way than to solve the wrong problem the right way.”
Richard Hamming

Data Collection
Collect data that are capable of answering the question asked (Data Quality Objectives)
  Designed experiments
  Observational studies
Sampling
  You select samples from a population in order to make inferences about the population

GIGO
The collection of data is often the most time-consuming and expensive part of a study
Reverend Bayes and all of his horses can’t fix a bum dataset

Analyze the Data
All statistical procedures have assumptions
In practice, the assumptions of any given statistical procedure are violated to some degree
  Can the validity of the assumptions be verified?
  Can the validity of the answer be verified?
  How robust is your statistical procedure to violations of its assumptions?
Simple approximate solutions you can understand may be better than complex exact solutions that you can’t
Augment standard statistical analyses with simulations

Present Results
Technical answer versus the functional answer
  “the null hypothesis is not rejected”
  technically, “not rejected” ≠ “accepted”
  functionally, “not rejected” = “accepted”
Statistical significance and practical significance
Apply the “so what” test to your answers

What is a Statistician?
“Powerful spirits should only be called by the master himself”
Goethe
The Sorcerer's Apprentice

What is a Statistician?
Based on Chatfield’s definition of statistics, anyone who makes decisions based on the analysis of data might be called a statistician
However, the title statistician is usually reserved for a professional who has specialized training in the concepts, theoretical bases, and methodologies of statistics
Key difference between the sorcerer and his apprentice
  Contrary to what you might think, there is a lot of subjectivity and professional judgment in the practice of statistics
  Statistics is vast in scope and detail, and the apprentice does not know what he does not know
“It ain't what you don't know that gets you into trouble.
It's what you know for sure that just ain't so.”
Mark Twain

The Sorcerer’s Apprentice
We may not be statisticians, but we are clearly doing statistics, often without adult supervision
Doing our own statistics is a good thing, but we need to become better students of the black arts and consult the master before the brooms get out of control
“Should I refuse a good dinner simply because I do not understand the processes of digestion?”
Oliver Heaviside
[On being criticized for using formal mathematical manipulations without understanding how they worked]

How We Can be Better Statisticians
Master the basics
Learn the language
Play with your data
Use better software
Perform reproducible work
Consult with a real statistician

Master the Basics
Khan Academy
http://www.khanacademy.org/

Statistics MS/Certificate Distance Programs
University of South Carolina
Colorado State University
Texas A&M University
Penn State University

Concepts and Terminology
Specialized Concepts
Statistics has a very precise language all its own
  “the null hypothesis is not rejected”
  “not rejected” ≠ “accepted”
Questions and answers are not right unless you use the proper language to convey the proper concept
  Population versus sample, for example
  Some statisticians can be intolerant of laymen who misuse the language of statistics
Learn to phrase questions and interpret answers properly

Exploratory Statistics
Learn to play with your data and see if it is trying to tell you something new
Study graphs of your data
“There is no data that can be displayed in a pie chart, that cannot be displayed BETTER in some other type of chart.”
John Tukey

Software used for Statistics
I use the following software for statistical calculations (in order of usage)
  R
  Minitab
  SAS
  Spreadsheet (e.g., MS Excel, Gnumeric)
There are many others

Spreadsheets (Excel)
What some people can do in Excel is nothing short of amazing (but should they be doing it?)
  Amarillo Slim beat tennis champ Bobby Riggs at Ping-Pong, using a frying pan instead of a paddle
Spreadsheet Addiction by Patrick Burns
http://lib.stat.cmu.edu/S/Spoetry/Tutor/spreadsheet_addiction.html
  Problems with spreadsheet implementation
    Excel has a long history of doing bad stats
  Problems with spreadsheet paradigm
    Reproducible science

http://www.msnbc.msn.com/id/21033161/from/RS.1/ (9/28/2007)
M. G. Almiron et al., On the Numerical Accuracy of Spreadsheets, Journal of Statistical Software 34(4), 2010

Reproducible Research
Reproducible research refers to the idea that the ultimate product of research is the paper along with the full computational environment used to produce the results in the paper (the code, data, etc.) necessary for reproduction of the results
Raw Data -> Data Massaging -> Calculations -> Plots and Tables -> Final Paper

The R Project for Statistical Computing
R is a language and environment for statistical computing and graphics
R is available as Free Software under the terms of the GNU General Public License in source code form
It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS
Download from http://www.r-project.org/

Advantages of R
Command line interface rather than a GUI
  Promotes reproducible statistics
Open source
  Flexible licensing
  Availability of source code for peer review
  Bugs are public knowledge and are fixed quickly
New tests and methods tend to appear first in R
Many dozens of recently published books devoted to R
Free (and very good) community support available

Consult with a Statistician
If you are going to involve a statistician, do it at the study design and data collection phases
  If not, at least estimate how much it will cost to collect the data all over again
Anybody can analyze compelling data
“To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say
what the experiment died of.”
Sir Ronald Fisher

Twisted Answers to Crooked Questions
As health physicists there are times when a decision will be made, with or without good data and a proper statistical analysis
In such situations we base our decisions on professional judgment, often augmented with “statistics”
We must not fool ourselves about what we are doing … of all the wrong answers we have to choose from, this one is the best
We have no right to expect a statistician to endorse such mischief

The Apprentice Should Beware of …
The Management Prior
Being bamboozled by other people’s statistics
  “The only right way to do this is X [insert statistical method here]”
Being seduced by complexity

Statistics in the Workplace: Musings of a Sorcerer's Apprentice
Presentation to USC Stat Club
March 26, 2009
Main message
  A degree in statistics is a “Swiss Army Knife” that is very useful in any endeavor where data are collected and analyzed
  Everyone in the room should become a health physicist (I had no takers)