DMReview.com 09-21-07 OpenBI Forum:

Frank Harrell, Iowa State and useR!2007
By Steve Miller
Frank Harrell
I finally got to meet Frank Harrell. Over the last few years, Frank's been an
outstanding mentor for me, though he doesn't know it and, before mid-August,
wouldn't have known me from Adam.
Frank's comprehensive text, Regression Modeling Strategies, With Applications
to Linear Models, Logistic Regression, and Survival Analysis, has been a close
companion over the last year and a half, helping me catch up on the latest
statistical developments for predictive analytics. The book is great because it
gives equal weight to the problem domain, theory and math, and programming
solutions in the S+/R statistical languages. I've also learned quite a bit from
reading the excellent statistical and programming teaching documents on Frank's
Web site at Vanderbilt University: http://biostat.mc.vanderbilt.edu/FrankHarrell.
And perhaps best of all, Hmisc, Frank's S+/R package of miscellaneous
metadata, data manipulation, statistical, predictive modeling and graphical
functions, is an absolute godsend of helpful goodies for R programmers. I was
recently asked by a data warehousing colleague for an open source, quick-and-dirty data profiling recommendation. Without hesitation, I offered Hmisc's
describe function. The modified boxplot graph included in Hmisc is also a
personal favorite, able to consolidate a good deal of exploratory information in a
small space.
So when I saw the opportunity to not only participate in the international R user
conference (useR!2007), but also to attend a one-day modeling course taught by
Frank, I immediately made time in my schedule. Even then, as a late registrant, I
had to connive to secure a spot in the class. Apparently many others felt as I did
about the opportunity.
The preconference tutorial on regression modeling strategies didn't disappoint.
Starting with a 200-page handout to cover in six hours, the class, composed of
college teachers, researchers, analysts and graduate students, was challenged
to keep pace. I'm sure most thought much of the day would be review. We were,
however, exposed to a good deal of new material reflecting many of the latest
approaches in regression, classification and prediction, covered at a rapid pace.
Frank's soft-spoken demeanor belies his command of the discipline. A
biostatistician, he has seen just about everything in medical research,
epidemiology and health care evaluation - including a lot he doesn't like. Many
current "habits" in the trade are anathema to Frank. Stepwise regression is non
grata, as is the pernicious practice of recoding continuous attributes into
categories to perform logistic regression, discriminant analysis or data mining
classification. Frank's beef is that information is always lost when a continuous
variable is collapsed into categories. I made the mistake of asking whether it was a
lesser evil to use many categories rather than two. Frank's deadpan response:
don't do it. He's also intolerant of lazy analysis that treats strong relationships
as strictly linear, noting from experience that they often involve significant
curvature.
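The information-loss point is easy to check by simulation. The sketch below is my own illustration in Python, not material from Frank's class: it assumes a simple made-up model (y equals x plus noise, both standard normal) and compares how much of the variance in y is explained by the continuous x, by five equal-sized bins of x, and by a median split. The function names are hypothetical.

```python
import random
import statistics


def r2_linear(x, y):
    """Squared Pearson correlation between x and y."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def r2_binned(x, y, k):
    """Share of the variance in y explained by k equal-sized bins of x."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # row indices sorted by x
    grand = statistics.fmean(y)
    total = sum((b - grand) ** 2 for b in y)
    between = 0.0
    for j in range(k):
        idx = order[j * n // k:(j + 1) * n // k]  # rows falling in bin j
        bin_mean = statistics.fmean([y[i] for i in idx])
        between += len(idx) * (bin_mean - grand) ** 2
    return between / total


def simulate(n=10_000, seed=2007):
    """Assumed toy model: y = x + noise, with x and noise standard normal."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [a + rng.gauss(0, 1) for a in x]
    return x, y


x, y = simulate()
print(f"continuous x : R^2 = {r2_linear(x, y):.3f}")
print(f"five bins    : R^2 = {r2_binned(x, y, 5):.3f}")
print(f"median split : R^2 = {r2_binned(x, y, 2):.3f}")
```

Under this toy model the continuous predictor explains about half the variance in expectation, while the median split explains only about a third (0.5 times 2/pi). Five bins recover some of the gap but not all of it, which is consistent with Frank's answer.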
I could tell from discussions during breaks that many of us were guilty of at least
a few of the statistical sins noted in class. At the same time, we all realized how
invaluable a "retreat" with an expert like Frank can be. Perhaps BI analytics
practitioners should be required to participate in an annual seminar with an
authority to snuff out bad habits before they become too ingrained - and costly for
business. In fact, I think most professions could profit from regular "rebalancings"
with experts like Frank Harrell.
Statistics and ISU
Iowa State University in Ames, Iowa, might seem an unlikely location for an
international conference. Ames is approximately 350 miles from Chicago, about
45 minutes from the Des Moines airport, in the heartland of America. But ISU has
been a juggernaut in statistics seemingly forever, starting from its land-grant
roots of agricultural research. Snedecor Hall is today the epicenter of a thriving
program in theoretical, applied and computational statistics that is very pertinent
for business intelligence. Indeed, some of the current ISU research on statistical
graphics and visualization will hopefully find a home soon in next iterations of BI
tools - and I'd be a more-than-willing beta tester. ISU was ever hospitable as
well. Professor Diane Cook and the ISU staff enabled a quality eleventh-hour
conference experience for the 250+ participants from around the world. I even
felt a bit of a personal connection from the past as I strolled around campus,
recalling three texts from my undergraduate studies many years ago: Calculus by
Thomas, Economics by Samuelson, and Statistical Methods by Snedecor and
Cochran - the same Snedecor as the statistics building.
useR!2007 Presentations
Many of the presentations given in the two-day conference were relevant to BI;
others not so much. And with multiple concurrent presentations, it was
sometimes difficult to maneuver where desired, when desired. I was hopelessly
lost in discussions of "mixed models using residual maximum likelihood" and the
"analysis of soybean seed transcriptomics data." But I found the sessions on R
GUIs I and II and the discussion of Web analytics in R quite informative.
I was in my element at the two presentations on social science and statistics. The
theme of the combined session was the use of advanced statistical designs and
techniques to help assure the validity of observational study findings when the
gold standard of randomization isn't feasible - which is often the case in the
business world. The questions posed were quite relevant for BI and have been
addressed in other OpenBI Forum columns
(http://www.dmreview.com/authors/author_sub.cfm?authorId=1052295). The
approach outlined in these presentations involves matching subjects across
potential confounding variables so that comparison groups are "equal" on factors
other than the treatment - and can thus be compared validly. Each talk
addressed the optimization of some aspect of the matching problem. The first
speaker presented an approach and software for matched sampling and pair
matching. The second discussed problems of matching algorithms in very large
samples, emphasizing efficient use of computer code and machine resources.
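To make the matching idea concrete, here is a minimal sketch in Python of greedy one-to-one nearest-neighbor matching. It is a deliberately simple stand-in and not the speakers' software or algorithms, which I am not reproducing here. The function name, the optional caliper argument and the toy subjects are all hypothetical, and the covariates are assumed to be already on comparable scales.

```python
import math


def greedy_pair_match(treated, controls, caliper=None):
    """Pair each treated subject with its nearest unused control.

    treated, controls: dicts mapping subject id -> tuple of covariates.
    caliper: optional maximum distance allowed for a match.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    for t_id, t_cov in treated.items():
        best_id, best_dist = None, float("inf")
        for c_id, c_cov in available.items():
            d = math.dist(t_cov, c_cov)  # Euclidean distance on covariates
            if d < best_dist:
                best_id, best_dist = c_id, d
        if best_id is not None and (caliper is None or best_dist <= caliper):
            pairs.append((t_id, best_id))
            del available[best_id]  # match without replacement
    return pairs


# Hypothetical subjects described by (age, prior purchases)
treated = {"t1": (30.0, 1.0), "t2": (45.0, 0.0)}
controls = {"c1": (29.0, 1.0), "c2": (60.0, 0.0), "c3": (44.0, 0.0)}
print(greedy_pair_match(treated, controls))
```

Greedy matching depends on the order in which treated subjects are processed, and the inner loop scans every remaining control, so it scales poorly. Those are the kinds of weaknesses that optimal matching and large-sample algorithms aim to address.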
Though I didn't attend the "Teaching with R" sessions, I was able to get my
hands on the materials for review. Both undergraduate and master's-level
mathematical statistics courses at ISU now use R significantly for instruction,
complementing arduous mathematical proofs and derivations with the simulation
capabilities of R. Students might investigate the distributions of transformed
random variables, for example, by examining samples of size 10,000. Or
simulations might be used to examine the robustness of assumptions in
statistical inference. R's powerful graphics can be paired with sampling
techniques to visualize the distribution of order statistics. And, of course, the
bootstrap method of examining the precision of parameter estimates through
resampling is now de rigueur in the discipline. I love to play with large sample
simulations in R on my notebook, watching practice converge with theory. The
ascendance of simulation methods is, in my opinion, a major boon for learning
and understanding statistical thinking. What I would have given for such
computer support back in the day!
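A small sketch of the kind of exercise described above, written in Python rather than R and not taken from the ISU course materials. It assumes one textbook case, the transformation X = -ln(U) of a uniform variable, which yields an exponential distribution with mean 1 and variance 1, and then bootstraps the standard error of the sample mean. Function names, seeds and the number of resamples are my own choices.

```python
import math
import random
import statistics


def transformed_sample(n=10_000, seed=42):
    """Draw U from Uniform(0, 1] and return X = -ln(U), which is Exponential(1)."""
    rng = random.Random(seed)
    # 1 - random() lies in (0, 1], so the logarithm is always defined
    return [-math.log(1.0 - rng.random()) for _ in range(n)]


def bootstrap_se(data, stat=statistics.fmean, reps=200, seed=42):
    """Bootstrap standard error: resample with replacement, recompute stat."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [stat(rng.choices(data, k=n)) for _ in range(reps)]
    return statistics.stdev(estimates)


sample = transformed_sample()
print(f"sample mean     : {statistics.fmean(sample):.3f}  (theory: 1)")
print(f"sample variance : {statistics.variance(sample):.3f}  (theory: 1)")
print(f"bootstrap SE    : {bootstrap_se(sample):.4f}  (theory: 0.01)")
```

With 10,000 draws the simulated mean and variance should land close to their theoretical values, and the bootstrap standard error should sit near 1/sqrt(10,000) = 0.01, which is the "practice converging with theory" effect.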
John Chambers, the architect of S, the language that preceded R, was
the keynote speaker at the end of the first conference day. In 1998, Chambers
won a prestigious "Software System Award" from the Association for Computing
Machinery (ACM) for developing the S system. The ACM observed that Dr.
Chambers' work "will forever alter the way people analyze, visualize, and
manipulate data." Chambers' talk, "Programming with R," noted two tenets in the
development of S: rapid and effective exploration and trustworthy software. He
credited the open source movement with improving software quality through its
many eyes. Ever the programmer and architect, Chambers then highlighted
several important design considerations for S/R, finally producing code at the
end. An appreciative audience understood the R community's debt to Chambers'
seminal work.
Different perspectives on graphics and visualization in R created excitement the
second morning of the conference. The author of R's outstanding lattice graphics
gave a programming tips talk, and the developer of a new high-level, easy-to-deploy package called ggplot demoed his wares with a comprehensive statistical
example. At the conclusion of these two very useful presentations, the moderator
created a stir by opining that such static, programmed graphics, while certainly
valuable for analysis, were not in keeping with Chambers' mission for R to "enable
effective and rapid exploration of data." He then showed yet a third R visual
package, iplots, which is interactive and live. A spirited discussion of almost
religious proportions followed on the relative benefits of each. My take? Both
interactive and programmed graphics have significant roles in statistics and BI.
Keep them coming!
There were many sideline discussions of how to handle large data sets in R. In
the current implementation, the size of a data structure is limited by physical
memory. Users circumvent this limitation in a number of ways. As might be
expected, statisticians use sampling techniques to reduce the amount of data
needed for their models. Crafty programmers "chunk" large data into smaller
subsets that can be accessed and then discarded. Database-savvy users store
their data in relational tables, feeding R as needed. One presentation noted a
seamless use of a database to serve data to R for large-scale surveys. I like to
use the "pipe" function to filter data with agile languages like Python or Ruby
before loading R data frames. A programming competition winner developed a
package to handle data larger than memory with binary flat files. And R's
commercial cousin S-Plus from Insightful has developed a virtual memory large
data capability. For the sake of R's future in the business world, I hope the
core R development team will make the large data concern a priority soon.
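The "chunking" workaround can be sketched in a few lines. The example below is in Python, in keeping with the pre-filtering scripts mentioned above, and is not the code of any package presented at the conference. It streams a delimited file in fixed-size chunks and keeps only running totals, so memory use depends on the chunk size and not on the file size. The column name, chunk size and in-memory demo file are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(handle, column, chunk_size=1000):
    """Mean of one numeric CSV column, holding one chunk in memory at a time."""
    reader = csv.DictReader(handle)
    count, total = 0, 0.0
    for chunk in iter_chunks(reader, chunk_size):
        values = [float(row[column]) for row in chunk]
        count += len(values)
        total += sum(values)
        # the chunk is discarded here; only the running totals survive
    return total / count if count else float("nan")


def make_demo():
    """Small in-memory stand-in for a large file: amounts 2, 4, ..., 20."""
    lines = "".join(f"{i},{i * 2}\n" for i in range(1, 11))
    return io.StringIO("id,amount\n" + lines)


print(chunked_mean(make_demo(), "amount", chunk_size=3))
```

The same pattern works for any statistic that can be built from running totals, such as counts, sums and sums of squares. Statistics that need the whole distribution at once, such as an exact median, call for sampling or a database instead.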
useR!2008
My connection to R and the R community is well-chronicled in
http://www.dmreview.com/article_sub.cfm?articleId=1065015 and
http://www.dmreview.com/article_sub.cfm?articleId=1084643, but I must admit to
being continually amazed at the level of contributions of such a high-octane open
source community. As a result of its volunteer efforts, the latest in statistical
methods from top practitioners are readily and freely available months and even
years before the commercial competition. Almost all presenters at useR!2007
demoed new packages they'd developed to showcase their methods and
analytics. And the world-wide R community continues to expand dramatically.
Little wonder that R is now the lingua franca of academic statistical computing. Little
wonder as well that useR!2008 (http://www.statistik.uni-dortmund.de/useR-2008/)
in Dortmund, Germany, promises to be the largest and best yet.
References:
1. Frank E. Harrell, Jr. Regression Modeling Strategies: With Applications to
Linear Models, Logistic Regression, and Survival Analysis. Springer Series in
Statistics. 2001.
2. http://user2007.org/program/
Steve Miller is co-founder of OpenBI, LLC, a Chicago-based business intelligence (BI) services
firm that specializes in delivering analytic solutions with both open
source and commercial technologies. Miller has more than 30 years of
experience in intelligence and analytics, having migrated from health care
program evaluation, to database consulting with Oracle Corporation, to running a
fast-growing BI services business at Braun Consulting. Advances in technology
over that time have fundamentally enabled the use of quantitative methods for
business differentiation. OpenBI, LLC, is all about helping customers attain that
differentiation. You can reach him at steve.miller@openbi.com.