Syllabus and First Assignment for MPO 524 - MPO524-2014

Applied Data Analysis

Brian Mapes, RSMAS/MSC 366, mapes@miami.edu or bmapes@rsmas.miami.edu

Prerequisites: MPO 503 or 551 (basic oce/atm science), or permission of instructor.

Last year’s website is http://mpo524-2013.wikispaces.com/ ; we will probably work from that model a lot.

Book:

Environmental Data Analysis with MatLab, W. & J. Menke, Elsevier, 2011.

Get the PowerPoint slides: http://www.ldeo.columbia.edu/users/menke/edawm/index.htm

Software to install: Matlab or Octave; consider Panoply, IDV, Python, NCL, R, others

Matlab is licensed on any UM-owned computer. Your advisor can get a key for new installs. It is also on the vis cluster. Octave, http://www.gnu.org/software/octave/ , is the free work-alike, and can be installed on Mac OS X via Fink, MacPorts, or Homebrew.

Panoply, http://www.giss.nasa.gov/tools/panoply/ Great quick data visualizer.

GrADS, http://www.iges.org/grads/downloads.html , the Grid Analysis and Display System. Nice defaults for easy weather-climate displays, and can do a lot...

The IDV, Great big data visualizer (and menu-driven computing). Install as directed at https://www.rsmas.miami.edu/users/bmapes/MapesIDVcollection

Python, https://www.enthought.com/products/canopy/academic/

Python is the future of scientific computing. This distribution is free for academics (register from a .edu email address). Or consider UV-CDAT (http://uvcdat.llnl.gov/), a climate-oriented Python package with a GUI, although it looks a little less new-user friendly. For map plotting examples, see http://matplotlib.org/basemap/users/examples.html

NCL, NCAR Command Language, http://www.ncl.ucar.edu/ . Free, and it has so many useful utilities (such as data format conversion) that it is worth having even if you don’t use its detailed plotting. By all means install it!

R, http://www.r-project.org/ R is a free software environment for statistical computing and graphics. It runs on a wide variety of UNIX platforms, Windows and MacOS.

IDL is costly. There is free GDL at http://gnudatalanguage.sourceforge.net , version 0.9 (so is it much good yet?)

Excel (or OpenOffice spreadsheet) can be useful. (Hope one of you can demo it….)

Work and grading

Attendance, participation, and homework are 40% of the grade.

The best use of all our time is understanding what aspect of a data set some code or figure is measuring or depicting. So we will use a lot of canned codes, studying the structure and the results rather than coding per se.

A midterm or late-term (avoiding other classes’ time crunch) exam for 30% will cover the book-math knowledge I consider essential (the basics of probability and statistics, matrices, & Fourier decomposition), suitable for Comps questions.

You will choose a class project (30%). This might as well be a piece of your research: unify your life and work as much as you can in this frazzled world! Part of the Intro Assignment below is to begin identifying which dataset you will work on.

Graduate grades range from A to B if you put in good faith effort -- just to recognize better vs. worse a little bit -- since a B average is required to stay in grad school and we want everyone who works to be able to stay in the program! In short, I hope you won’t be too grade-focused, but quality is noticed and appreciated and recognized.

Course philosophy and motivation:

Everything we may achieve as scientists, and much of what we do as human beings, can be called “applied data analysis.” But what are our data, how good is our analysis, and what are we applying it toward? In this course, we will mainly study the tools of Analysis (including appreciation of the stat-math underpinnings, how-to examples on computers, and best practices in communication = graphics). Part of that is the nuts and bolts of handling relevant Data (data sources and servers, file formats, columns vs. rows in arrays, etc.). But let us first consider the larger point of it all: the Applied part.

In geosciences we study a vastly complex Earth, whose fluid parts especially are churning and changing on many timescales. In the face of that jungle of complications, what is rational inquiry, what constitutes knowledge, what can we hope to gain? Our predecessors in the past few centuries have cut many paths into this jungle with the machetes of raw description, simple sums (averaging), and mastery of physics and chemistry (the governing equations). Mechanization (with computers) has turned those old machetes into chainsaws for a couple of generations now. Let’s learn how to use them! But to what end? Addressing the third question above, human society has vital Interests that support your presence here and orient our work. To make true progress, our research needs not only sound methods and ethics, but also fluency and skill about its motivations and interpretations. Learning to be more effective in this broad sense is the never-ending, lifetime project of a whole scientific career, not just graduate school. My goal is to spark a lifelong interest in learning (and perhaps inventing!) new tools and methods, with a hungry eye to the new science realms these may open up.

Science as a game

I don’t mean anything disparaging here: “Games” can have high stakes and be played with utmost seriousness. I mean that science is a human community process with a set of norms and rules and incentives and rewards. For a maturing science like ours, where the first explorations and descriptions have mostly been done already and we are flooded with Big Data, a new Applied Data Analysis is often how we make a play in the game.

It helps to think backward from the goal (making a contribution to widely accepted Knowledge in your area). In the end, your piece of knowledge has to be communicated well: your graphical/numerical evidence and words have to combine to inject an idea clearly into others’ minds, and make it compelling. But it also has to be right. Others will reproduce any result worth being proud of, and you don’t want to be a person who is cited a lot because you got it wrong (at least not as a young scientist). So (working backward), how do you build such a piece of knowledge? You must master writing and computing and graphics of course, but… you first need to find or build something new, a result, and get it right.

At this stage (of thinking backwards) you see that science depends on presenting yourself with provisional results (“evidence”), and viewing them skeptically. You have to simulate in your own mind the whole scientific community you will need to convince, including jealous rivals, skeptical doubters of statistical significance, and categorizers who will say your thing is just a facet of a familiar thing, not a new thing. This gauntlet of doubting spirits is almost never wrong: almost all our provisional results are weak or disappointing in some way. The easy stuff has been done already. You need to learn to see these weaknesses yourself, quickly, so you can press on rather than waiting for others to find the weaknesses (later, slowly, losing your forward momentum). Self-skepticism is the most important skill in science for this reason of simple time efficiency. You must look at every provisional result with this question in mind: how quickly can I see all the ways this is wrong or meaningless or dull or misleading, despite its interesting appearance?

Backing up one more stage, how do you not get discouraged by the process of creating feeble failures all day? The game needs to be fun to you. Every figure you make is a puzzle, a challenge, a way to learn your mind and improve your mental simulations of the scientific community you hope to reach and join, if not of the Physical World. If this makes sense to you, you might be a scientist in the making. (If not, don’t worry, you may still earn a good grade and a degree and a paycheck in scientific data analysis, and actually have a happier life than freaks like me!)

Thinking about thinking

Our internal dialogue, and our external dialogue when we write and speak to others, are more effective when we know some of the tricks, conventions, and frameworks.

There are many, but here are four. One is important to the First Assignment below.

a. Hegelian dialectic

In attacking the jungle of the unknown or unarticulated, Hegel’s dialectic pattern of discourse seems to cut like a pair of scissors, unlike a simple assertion of claims that is more like a club. It is a one-two-three in terms of time stages: thesis, antithesis, synthesis. But each stage involves a pair of ideas. We invoke a pair of things that might seem to go together (a thesis), postulate their opposites (the antithesis), and then reconcile them (synthesis) by taking parts from each. The process can achieve (or take the listener through) an increment in knowledge, or perhaps even wisdom.

A Wikipedia example, on a grand but vague topic, illustrates the structure, and also its rhetorical feeling of profound progress:

Thesis : unconscious + unity

Antithesis : conscious + separation

Synthesis : conscious + unity

The word “synthesis” refers to combining or putting things together. Synthesis is the opposite of “analysis,” which is taking things apart. A dialectical synthesis combines the thesis and the antithesis; it never introduces a new concept not found in either the thesis or the antithesis. Most dialectics have two concepts per stage, in which case the synthesis stage incorporates one concept from the thesis and one from the antithesis. (http://en.wikipedia.org/wiki/Thesis-antithesis-synthesis)

b. Splitting and lumping (analysis and synthesis)

The words analysis and synthesis are more general, and also form a kind of mental scissors. Science often snips its way forward using the creative tension between reductionism vs. holism, or analysis vs. synthesis, or “splitting vs. lumping.” Some say these are two personality types, “splitters and lumpers”. Which are you?

Ultimately, we all use them both, and being able to switch views is powerful. But I do find that different people seem to have a predilection for one or the other.

c. Bayesian thinking

In science, applied logic like the philosophy tools above is not enough: we need a role for evidence. Another powerful pattern of scientific discourse to know about is Bayesian inference. This has a 3-step sequence in time, like human learning: Prior, Evidence, Posterior (Repeat). It explicitly acknowledges that prior information always conditions our view of evidence. If we are rational, evidence improves our posterior assessment (or knowledge, or information, or estimation) about the topic in question. Bayesian thinking is a humanistic, procedural view of knowledge and its creation, and for centuries has rankled stuffy scientists who take science’s myth of objectivity very literally and seriously (The Theory that Would Not Die, a 2011 book by Sharon McGrayne, describes the history of Bayesian reasoning). Today Bayesian thinking is on the rise, because it is powerful; this is one of the ways ours is sometimes called a “post-modern” age. But of course, if prior knowledge is so wrong that it misleads us about the meaning of evidence, then successive Bayesian inferences can lead us astray, and a paradigm jump may be needed to get out of the rut. And people with different priors may never convince each other, even with infinite amounts of evidence.

d. The systems approach

What system are you studying? What is a system?

The word comes from a meaning like a “standing bundle of things”: system (n.) from Greek systema "organized whole, body," from syn- "together" (see syn) + root of histanai "cause to stand" from PIE root *sta- "to stand" (see stet). Meaning "set of correlated principles, facts, ideas, etc." first recorded 1630s.

Thermodynamics speaks of open vs. closed systems, which can be thought about mainly in terms of responses to forcings (open) vs. internal conservation constraints (closed). Only the whole universe may be a truly closed system, although on short enough time scales other systems can often be treated usefully as almost-closed.

For splitters, a system always seems to be composed of (or can be decomposed into…by hypothesis 1 …is that different?) a set of sub-systems. These internal sub-systems are treated as “coupled,” with “feedbacks” carrying information or patterns (“signals”) between them. Sometimes those signals are big gobs of mass or energy, although they may be subtler, like “signals” or “triggers.”

For lumpers, a system may be seen as the proper unification of our view of elements which are inappropriate to speak of separately (the ocean-atmosphere system, the Earth system, the Solar system). Sometimes this synthesis gesture is only aspirational: we WISH to know how these things interact, and by putting them together in a box, at least the right questions are raised. Occasionally it seems more desperational, trying to put a lid and a label on a can of mess in order to sell it. (Note: I am a splitter, so this comment is unfair and provocative!) To say everything is one big coupled system is to say nothing at all (or merely OMMMM).
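Returning briefly to item c: the Prior, Evidence, Posterior (Repeat) cycle can be made concrete with a toy calculation. The sketch below is only an illustration, not course code. The two hypotheses, their rain probabilities, the observation sequence, and all function names are made up for the purpose. It shows three observers who see the same evidence but start from different priors.

```python
# Toy sketch of Prior -> Evidence -> Posterior (Repeat).
# All numbers and names here are hypothetical, chosen only for illustration.

# Assumed probability of a rainy day under each competing hypothesis
P_RAIN = {"wet_regime": 0.7, "dry_regime": 0.2}


def likelihood(observation):
    """Probability of one day's observation ('rain' or 'dry') under each hypothesis."""
    return {h: (p if observation == "rain" else 1.0 - p) for h, p in P_RAIN.items()}


def bayes_update(prior, like):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * like[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


def learn(prior, observations):
    """Apply the update once per observation; each posterior becomes the next prior."""
    belief = dict(prior)
    for obs in observations:
        belief = bayes_update(belief, likelihood(obs))
    return belief


observations = ["rain", "rain", "dry", "rain"]

# Same evidence, three different priors
open_minded = learn({"wet_regime": 0.5, "dry_regime": 0.5}, observations)
skeptic = learn({"wet_regime": 0.05, "dry_regime": 0.95}, observations)
dogmatist = learn({"wet_regime": 0.0, "dry_regime": 1.0}, observations)

for name, belief in [("open-minded", open_minded),
                     ("skeptic", skeptic),
                     ("dogmatist", dogmatist)]:
    print(name, {h: round(p, 3) for h, p in belief.items()})
```

After four days the open-minded observer strongly favors the wet regime, the skeptic is still leaning the other way, and the dogmatist, whose prior gave the wet regime zero probability, has not moved at all. That last case is the "rut" described above: no amount of evidence can revive a hypothesis that the prior has ruled out.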

1 from Greek hypothesis "base, basis of an argument, supposition," literally "a placing under," from hypo- "under" (see sub) + thesis "a placing, proposition"

Questionnaire

Name____________________ Year_____________ RSMAS division/ program__________

0. Brief summary of your research area/duties, especially data-analysis aspects:

I. Computer and data analysis experience: languages, fluency

II. Statistics and math background: (don’t look anything up, please just say don’t know if you don’t know offhand)

What is the standard deviation of this set of numbers: {5,6,7}?

What is the covariance between the ordered sets {5,6,7} and {1,2,3}?

If v is the variance of sin(t), what is the standard deviation of f(t) = sin(2t)+cos(3t)?

III. What you hope to learn from the class:

IV. Worries about the class mismatching your interests and needs:

V. Anything else?

HOMEWORK 1: DUE JAN 21, WHEN WE WILL DISCUSS IT IN CLASS.

MPO 524, Applied Data Analysis, spring 2014

0. Install or identify a computer where you can launch Matlab or Octave. Also, install or locate others of the above packages, as many as you care to. If you do the install chore now, you will be ready to reproduce codes whenever they come to you.

It is good to be flexible, and smart to be able to take somebody’s (say) R or Python code, run it, change a line, and run it again, rather than defining yourself as just “a Matlab user.” Python in particular is rocketing to prominence, and can be a great wrapper language (calling Matlab and R and NCL and other things). IDV is a quick install and is good for menu-driven (as opposed to code-writing) analyses and plotting, so I may use it sometimes in class, and I especially recommend it.

1. Read the 10-page book chapter attached, “1. INTRODUCTION”, from a 1980s textbook on System Identification (which we will otherwise not use).

a. After reading it, write a new 1-paragraph example of a System, like those given in the chapter, using the book’s terminology as much as you can.

Ideally, this could be your research/project (this may be a good topic for a high-level conversation with your advisor). Or examples from ordinary life can be good. In the book’s terminology, does your system call for a mental model, a graphical model, a mathematical model, or a software model? Is it a system you would have to “identify” from data? If so, could you do so through “experimental design” or would you have to just analyze a free-running data stream? Or is it a system suitable for “modeling” as the chapter defines that term (from first principles)?

b. Sketch a few kinds of graphs or plots that might usefully characterize the system’s activity (inputs, diagnostics of internal state, outputs), like the book has. You can just make up the data as squiggles -- the main point for now is axes and quantities (indicated by their units), and what key system relationships these graphs would indicate. Be creative! Talk to others! Then when you settle on an idea, think it through to some depth. This is just a conversation piece, not a research proposal, so don’t worry too much about loose ends (System definition is hard!)

c. Add a final few sentences at the end of all this thinking-it-through, about how tractable your System idea turned out in hindsight. Was your postulated “System” a bit too open to be tractable -- that is, too dominated by external factors? Or perhaps a bit too closed, with lots of internal complications but not enough room for inputs and outputs? Did it come from a synthesis (lumping things together), or an analysis (splitting sub-parts)? Upon reflection, is it better viewed as part of a larger System? Or does it have sub-Systems you may have glossed over?

2a. Find, bring, and be prepared to explain 1-3 key graphics that underlie your research area (or a possible class project topic of interest). Behind every living science field, with its funded proposals and well-defined projects, there lie some essential images depicting quantitative information. You will want to learn how these were made, and ideally gain the tools and knowledge to reproduce their form. Your graduate career may be based on extending or expanding them. Another good topic for an advisor conversation, perhaps…

2b. Write a few sentences about the nature of these figures, viewing them as artifacts of a discourse, or play-tokens in the “game” of science. Are they exploratory (a basic showing of a discovery or finding), or explanatory (idea-driven depictions that claim or assert something within a complex story about abstract ideas), or exploitation-oriented (measures of very practical matters, such as statistical measures of hindcast/forecast skill)? Are they statistical (reductions of much larger sets of numbers), or raw (all the values are shown)? Does color play a crucial role? Is it well used? Is ink used well, drawing the eye and showing crucial distinctions? Is area on the page used well, with the important distinctions it is trying to show being well spread out on the page and properly proportional?

2c. Find one horrible and one fantastic example of a data-depicting graphic, perhaps from another field of study via web search. Tufte has his old classics, but in today’s Data Age there is so much great work by brilliant people across so many fields, including great cautionary tales I suspect. Find something new! Could we learn to use it in our field? Or learn to avoid its pitfalls?
