Cranvastime: Interactive longitudinal and temporal data plots
Xiaoyue Cheng
August 26, 2011
1 Introduction
Cranvas interfaces to the Qt libraries (http://qt.nokia.com/) using the R packages qtbase and qtpaint
to provide interactive and dynamic graphics for large amounts of data. It provides programmable interactive
graphics in R, enabling users to link plots with analytical methods. Cranvas has several basic plot types:
scatterplots, histograms, barcharts, density plots, mosaic plots, maps, parallel coordinate plots and tours.
Cranvastime develops interactive plots for longitudinal and temporal data. This type of data is prevalent
in the study of human health and medical treatments, wages, prices and economic indicators, and
temperature and climate change. Longitudinal data plots show values for individuals, possibly
at irregular time intervals. The purpose is to explore individual variation, differences and similarities. Time
series plots tend to show just one measurement at each time point, but there is implicitly an interest in studying
seasonal trends and temporal dependencies.
Several existing software packages have explored the use of interactive graphics for time series
data, including Diamond Fast (Unwin and Wills, 1988), XQz (McDougall and Cook, 1994), XGobi and
GGobi (Cook and Buja, 1997), Fortune (Kotter and Theus), and TimeSearcher (Hochheiser and Shneiderman,
2001).
Diamond Fast ran on Macintosh computers and could explore, interrogate, and manipulate plots of
univariate or multivariate time series. Some transformations, including differences and smoothing, could be
performed on the series. A lot of the inspiration for the methods available in cranvastime comes from
Diamond Fast. XQz was developed to provide tools for interactively exploring time series using
Box-Jenkins approaches. XGobi and its successor GGobi are broader in focus: they are interactive
graphics tools for multivariate data which include some facilities for working with temporal data. Fortune was
influenced by Diamond Fast, focused on forecasting and transformations, and also implemented a few basic
interactive features. XQz, XGobi, GGobi, and Fortune were designed for Windows, Mac, and Linux systems.
TimeSearcher 1 used time boxes to query the time series dynamically, while its successors TimeSearcher 2
and 3 extended the applications to large data and forecasting.
2 Functionality
• Univariate time series
Wrap the time series by a particular period.
To examine the periodicity, particularly slight irregularities. A graphical approach like this
allows near-seasonal dependencies to be found.
Change the wrapping period on-the-fly.
Seasonality can take several forms: yearly, monthly, weekly. Setting the period enables users
to wrap in different units and explore different kinds of periodicity. Users are encouraged to set
multiple values in the 'shift' argument based on their knowledge of the data, then use those
values to explore the wrapping period.
Fold the series on itself completely by the wrapping period setting.
To see the fully wrapped time series directly, without intermediate stages, when users are confident
of the period.
Unfold/Reset the plot.
To completely unwrap the time series with a single keypress, which is much more convenient
than repeatedly pressing the regular unwrapping key.
Zoom in on the series to see the details.
When the time series is quite long but the window width is limited, users can zoom in on the series
to see the local details.
Pan the series which is zoomed in.
To move the zoomed-in series horizontally, so that users can view the long series from one end to
the other. It helps to understand the global patterns while in a local mode.
Compute the AutoCorrelation Function (ACF).
Autocorrelation is used to numerically summarize the seasonality of a time series. It is the
correlation between the series and its lagged self. The lag is set by the wrapping period.
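The lagged autocorrelation described in the last item can be sketched in base R with stats::acf. This is only an illustration on a synthetic series with a 12-step cycle; the names x, period and acf_at_period are assumptions made for the example and are not part of cranvastime.

```r
# Synthetic series with a 12-step cycle (illustrative data, not from cranvastime)
t <- 1:72
x <- sin(2 * pi * t / 12)

# Assumed wrapping period; in qtime this would be one of the 'shift' values
period <- 12

# acf() returns lags 0..lag.max, so lag k is stored at index k + 1
acf_values <- acf(x, lag.max = period, plot = FALSE)$acf
acf_at_period <- acf_values[period + 1]
acf_at_period
```

A value close to 1 at the chosen lag suggests that the series repeats itself at that period, which is what a well-aligned wrapped plot shows visually.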
• Multivariate time series
Separate the standardized variables vertically.
The variables involved may have different scales. To display them in one plot, cranvastime
first standardizes the variables, so the series are overlaid on one another. With this functionality
users can then separate the variables to see whether they have different features.
Wrapping, folding, resetting, zooming, and panning are the same as for univariate time series.
Compute ACF for each variable.
• Longitudinal data
Separate the time series of the subjects vertically.
Similar to separating multiple variables, in longitudinal data users may be interested in
comparing the groups/subjects/observations which are repeatedly measured over time. Hence it
is very useful to distinguish the groups vertically while keeping them on the same time axis.
Highlight the whole series for an individual when the brush covers any point in the series.
Sometimes the groups of longitudinal data are too close to tell apart by eye. Highlighting the whole
series helps to find the groups of interest.
Hold one of the series and shift it horizontally.
Groups may start from different time points. Holding a series and shifting it can make the groups start at
the same time, after which it is easy to compare their trends.
Wrapping, zooming, panning, and resetting are the same as for univariate time series.
• Functionality that is general to the whole cranvas package and also works for cranvastime
Label the information of a point when the cursor is near it to provide detailed information to the
user.
Change the size of points.
When there are too many points in the plot, people may want the size of points to be smaller;
when there are only a few points, the bigger points are better. So the functionality to change the
size is needed.
Change the transparency of points.
At times some of the points overlap while others are sparse. Making the points transparent is useful for seeing the density of points, because the overlapped areas are darker while the
sparse parts are lighter.
Brush an area and highlight all the points/segments in the area.
Users can then select some observations and link them with other types of plots. This is very
important for high-dimensional data, to understand relationships between the many characteristics
that were recorded.
3 Settings
3.1 Usage
qtime(time, y, data, period = NULL, group = NULL,
wrap = TRUE, shift = c(1, 7, 12, 24),
size = 2, alpha = 1, asp = NULL,
main = NULL, xlab = NULL, ylab = NULL, ...)
3.2 Arguments
time
The variable indicating time, which is displayed on the horizontal axis. This variable (for now) needs
to be consecutive integers 1, 2, 3, ..., n.
y
The variable(s) displayed on the vertical axis. It must be a formula with only a right-hand side at the
moment.
data
Mutaframe data to use.
period
The variable used to group the time series. It cuts a long time series into shorter periods, and
each period is then drawn as a single line. Good choices are `year', `month', or other
time resolutions. Defaults to NULL. When it is not NULL, the keys U and D can be pressed to separate the
groups or overlay them again to examine the patterns.
group
The variable used for longitudinal data grouping, usually `ID' or similar.
wrap
The switch for the wrapping mode. Defaults to TRUE. When it is TRUE, pressing the right arrow
or left arrow will wrap or unwrap the time series; when it is FALSE, pressing the arrows will zoom
in/out. The wrapping mode can be changed on-the-fly.
shift
Wrapping period selector. The default candidate periods are 1, 7 (for days of a week), 12 (for months), and 24 (for
hours).
size
Point size, default is 2.
alpha
Transparency level, 1 = completely opaque, default is 1.
asp
Ratio between width and height of the plot.
main
Main title for the plot.
xlab
Label on horizontal axis, default is name of x variable.
ylab
Label on vertical axis, default is name of y variable.
3.3 Keypress Events
Up
Increase size of points.
Down
Decrease size of points.
Right
Wrap the time series when wrap=TRUE; zoom in, centered on the last clicked point, when
wrap=FALSE.
Left
Return the time series to its position before the last wrapping step when wrap=TRUE; zoom out when
wrap=FALSE.
Shift+Right
When wrap=TRUE, the time series is folded directly to the width given by the maximal value in
argument shift.
Shift+Left
The time series is returned to its original x-axis position, whether wrap is TRUE or FALSE.
Plus(+)
Increase alpha level (starts at alpha=1 by default).
Minus(-)
Decrease alpha level (starts at alpha=1 by default).
U(up)
Separate the series groups vertically.
D(down)
Mix the series groups by shifting them back vertically.
Shift+U
For multivariate y's, separate them vertically.
Shift+D
For multivariate y's, mix them back together.
G(Gear)
Cycle the wrapping period through the values of argument 'shift'.
M(Mode)
Switch the mode for series selecting. There are two modes: on and off. The default is off. When
the argument 'group' is not null, users can turn it on to hold a series and shift the series horizontally
by dragging with the mouse. When the wrapping mode is FALSE, turning on the series selecting mode
makes it possible to pan the zoomed-in series by dragging with the mouse or pressing the
left/right arrows.
W(Wrap)
Switch the wrapping mode between TRUE and FALSE. When it is TRUE, an indicator of the
wrapping period is shown at the bottom right of the graph; otherwise no indicator is shown
at the bottom right.
4 Examples
4.1 Datasets
• NASA
The data are geographic and atmospheric measures on a very coarse 24 by 24 grid covering Central
America. The variables are: elevation, temperature (surface and air), ozone, air pressure, and cloud
cover (low, mid, and high). With the exception of elevation, all variables are monthly averages, with
observations for Jan 1995 to Dec 2000. These data were obtained from the NASA Langley Research
Center Atmospheric Sciences Data Center. The data are provided by package cranvas. More details
about the data, including descriptions of the variables, are available online at
http://www.amstat-online.org/sections/graphics/dataexpo/2006data.php.
• Remifentanil
The Remifentanil dataset in the package nlme records the pharmacokinetics of the drug remifentanil. It contains 2107 rows and 12
columns, giving the concentration of the drug at different times along with demographic information
about the subjects. The variables ID, Time, and conc are used in cranvastime.
• Wages
Wages data was collected to track the labor experiences of male high-school dropouts. The men were
between 14 and 17 years old at the time of the first survey. There are 15 variables, among which id,
lnw, and exper are used in cranvastime. More details can be found in package cranvas.
• Pigs
The data record the quarterly production and profits for raising UK pigs during 1967-1978. There
are 11 variables and 48 records. The dataset is also provided by package cranvas.
• Lynx
Lynx is a dataset of annual numbers of lynx trappings for 1821-1934 in Canada. It is provided by
package datasets, which comes automatically with R. It is a historic dataset, included among the cranvastime
examples for posterity.
• Sunspots
The data are monthly mean relative sunspot numbers from 1749 to 1983, collected at the Swiss Federal
Observatory, Zurich until 1960, then at the Tokyo Astronomical Observatory. More details are in package
datasets, which comes automatically with R. sunspot.month and sunspot.year provide univariate
time series of sunspots, which are used in cranvastime. It is a historic dataset, included among the cranvastime
examples for posterity.
4.2 Examples for Functionality
4.2.1 Create a cranvastime plot
Before creating a cranvastime plot, users should use the original data to create an augmented dataset as a
mutaframe. Some attributes, like color and brushed, are added as columns of the mutaframe. This is a
common setting for all the plot functions in package cranvas. Then function qtime is used to draw a time
plot. The code example is as follows.
library(cranvas)
data(nasa)
nasa11 <- subset(nasa, Gridx == 22 & Gridy == 21)
qnasa <- qdata(nasa11)
qtime(TimeIndx, ~ts, qnasa, shift = c(1, 12))
Figure 1: Cranvastime plot for the temperatures of one location in NASA data. Strong seasonality is visible.
Figure 1 is the simplest example, using the NASA data. One location is chosen from the 24x24 grid of locations,
and the 72 months from 1995-2000 are drawn. Strong seasonality can be seen in the plot. The ACF with lag=1 is
computed and shown at the bottom left.
4.2.2 Wrap the time series
The NASA data have strong seasonality, so we want to wrap the series to explore the periodicity. Pressing the
Right arrow once gives Figure 2(left). Continuing to press that key gives the plot in the right panel.
Figure 2: Wrapping the time series of NASA temperature data. The period is quite regular, following a
yearly cycle.
Wrapping stops when the series is maximally wrapped, that is, when there are only three time points left in the
plot. Pressing the Left arrow unwraps the series.
4.2.3 Change the wrapping period
There is an indicator at the bottom right of each plot above, which shows the wrapping period. It is 1 by
default. The plot shifts by this many points at each wrapping step. If the series is long, so that moving one point
at a time is too slow, or if we are confident about the seasonal period, we can set different periods in
the argument shift. For this data shift=c(1,12) is set, because we know the data is monthly
and 12 should be the yearly seasonality.
Figure 3: Wrapping the time series with a larger period (12). The cycles fit well with each other.
Pressing key G then changes the text in the indicator, as shown in Figure 3(top left). Pressing
the Right arrow once at a time then gives the other five graphs of Figure 3 in turn.
Similarly, pressing the Left arrow will unwrap the series.
4.2.4 Fold/unfold the time series
When the time series has a constant periodicity, users may want to skip the wrapping steps and go to
the fully wrapped plot directly. Pressing Shift+Right arrow does this. The period for
folding is set to the maximum of the values in the argument shift. To unfold the series, press Shift+Left
arrow. Unfolding can be used in ANY state of wrapping or folding, so it also acts as a reset function.
Figure 4: Folding the time series. The variation of temperatures in winter is larger than in summer.
In Figure 4, the left graph is the same as Figure 1. Folding the series does NOT require the user to change the
wrapping period to 12 by pressing G; pressing Shift+Right arrow alone gives the right plot.
A folded series can still be wrapped or unwrapped step by step by pressing the Right or Left arrow.
Pressing Shift+Left arrow returns the left graph.
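The effect of folding can be imitated in base R. The sketch below assumes that a fully folded plot amounts to drawing each value against its position within the cycle, with one line per cycle; this is an interpretation for illustration only, not the cranvastime implementation, and the names time_index, position and cycle are made up for the example.

```r
# Consecutive integer time index, as qtime currently requires
time_index <- 1:72
period <- 12  # assumed folding period (the maximum of 'shift')

# Position within the cycle (1..period) and the cycle each point belongs to
position <- (time_index - 1) %% period + 1
cycle <- (time_index - 1) %/% period + 1

folded <- data.frame(time_index, position, cycle)
head(folded[folded$cycle == 2, ], 3)
```

Plotting the measured values against position, grouped by cycle, would overlay the six years of a monthly series in the same way as the folded view.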
4.2.5 Zoom in / out
For a long series drawn in a small window, some local features are easily hidden by the dense
points. Zooming in/out is a way to solve the problem. In cranvastime, we first need to switch from
the wrapping mode to the zooming mode by pressing W, which gives the top left panel in Figure 5.
Comparing it to Figure 1, we can see that the wrapping period indicator at the bottom right has been removed. This
is the sign of the zooming mode.
Then we need to click somewhere in the graph to set the horizontal center of the zoom to
the place where we clicked. After clicking, a yellow brush pops up; remember that the hotspot
of the brush is the place that was clicked (top center of Figure 5).
Now we can press the Right arrow to zoom in on the area around the clicked place, as shown in the other
four plots of Figure 5. The magnification stops when there are three (or sometimes two) points left in the
graph.
Figure 5: Zooming in on the time plot makes the changes look gentler, while zooming out makes the spikes look
sharper.
To zoom out, press the Left arrow; we then get the last five pictures in reverse order. Pressing
W switches back to the wrapping mode.
4.2.6 Pan the series
Suppose we are at the bottom left panel of Figure 5 and would like to see another area of the series
at the same level of zooming. We can change the mode by first pressing M and then press the Left
and Right arrows, or use the mouse to drag the series to the area we are interested in.
Figure 6: Panning the local series. The local features look quite similar when panning.
Figure 6(top left) is the plot after pressing M to change the mode. The difference between the two
modes is whether the brush appears: in the panning mode, there is no brush. The top right panel is
obtained by pressing the Left arrow. The bottom two plots are the left and right ends of the series, in which case
pressing the Left or Right arrow will not move the series any further.
After panning, pressing M again returns to the zooming mode, where the Right/Left arrows
change the level of zooming.
4.2.7 Label the point of interest
Figure 7: Labels in different directions from points. When the point is near the lower boundary, its label will
be above the point. When the point is close to the right border, the label is on the left of the point.
The label shows up when the cursor is close to the point of interest. Whether it is 'close' enough
depends on the size of the points. For the example in Figure 9(left), it is hard to identify the points, but for
Figure 9(center/right), it is quite easy to get a label when moving the mouse.
Normally, the label tag is placed below and to the right of the point. But when the point is
near the border of the plot, the position of the label changes depending on the position of the point (Figure 7).
The label contains the x-axis and y-axis values, as well as the period or group variables if
they are not null.
If two or more points overlap, the information for all of them is shown in the label tag.
4.2.8 Brush the area
Figure 8: Two different sizes of the brush. Only one point is brushed in the left plot, while more than half
of the points are selected in the right plot.
When the user clicks the mouse in the plot, a square brush pops up. The initial size of the brush
is 1/30 of the range of the axes. There is a hotspot in one of the four corners of the brush square. Right
clicking and dragging it changes the size of the brush. Left clicking changes the position of the brush.
Figure 8 shows two different sizes of brush.
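The brush geometry described above can be sketched in base R as a rectangle test. The function brush_select and its arguments are hypothetical names for this illustration, and the choice of hotspot corner (top right here) is an assumption, since the text only says the hotspot sits in one of the four corners.

```r
# Illustrative data; 'brush_select' and its arguments are hypothetical names
x <- c(1, 10, 20, 31)
y <- c(0, 15, 30, 60)

brush_select <- function(x, y, x0, y0, frac = 1 / 30) {
  # Brush side lengths: a fraction of each axis range (1/30 initially)
  w <- diff(range(x)) * frac
  h <- diff(range(y)) * frac
  # Assumption: (x0, y0) is the hotspot corner and the brush extends
  # to the left of and below it
  x >= x0 - w & x <= x0 & y >= y0 - h & y <= y0
}

selected <- brush_select(x, y, x0 = 10.5, y0 = 16)
selected
```

Enlarging frac corresponds to dragging the hotspot to resize the brush, so that more points fall inside the rectangle.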
4.2.9 Change the size and transparency
Figure 9: Different sizes and transparency of the points. The left panel looks concise, but the center one
shows the exact positions of the observations. From the right graph we see several overlapped places.
Repeatedly pressing the Down arrow makes the points disappear (Figure 9 left). Repeatedly pressing
the Up arrow increases the size of the points (Figure 9 center).
Pressing the minus(-) key makes the points and lines lighter and lighter (Figure 9 right) until they can
hardly be seen, while pressing the plus(+) key makes them darker and finally returns them to the black color of the
initial plot.
4.2.10 Create a univariate time plot with period
The NASA data record temperatures for 72 months, and the time points are integers from 1 to 72.
The data also include a Year variable. To compare the temperatures of the same month across years, we can group
the series by year and get a line segment for each year. The following code gives Figure 10(left).
qtime(TimeIndx,~ts,qnasa,period=Year,shift=1)
Figure 10: Univariate time plot with period. It wraps the series fully. The right plot shows the same
functionality as described in the above subsections.
This plot also has the basic functionality as described in sections above. We can wrap the series, increase
the point size, brush some points, and get the label, as shown in Figure 10(right).
4.2.11 Separate the period groups
Although Figure 10(left) looks similar to Figure 4(right), setting the argument period makes something
different possible: separating the period groups. Pressing key U (for up) once gives Figure 11(left). The
distances between the lines are larger than in Figure 10(left), and the labels on the y-axis have changed to the values
of the variable Year. Continuing to press U gives Figure 11(center) and finally Figure 11(right). In
Figure 11(right) all the lines are completely separated.
Figure 11: Separating the period groups. Year 2000 had an earlier temperature peak than other years.
Pressing key D (for down) mixes the lines gradually and finally returns Figure 10(left).
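One way to picture the gradual separation is as a vertical offset added per group. The base R sketch below assumes values already scaled to [0, 1] and a hypothetical step parameter that grows with each press of U; the function separate is an illustration, not the cranvastime code.

```r
# Illustrative values scaled to [0, 1], with a period group for each point
y01 <- c(0.0, 0.5, 1.0, 0.2, 0.6, 0.9)
year <- c(1995, 1995, 1995, 1996, 1996, 1996)

# 'step' is a hypothetical separation amount: 0 = overlaid, 1 = fully separated
separate <- function(y01, group, step) {
  idx <- as.integer(factor(group))  # 1 for the first group, 2 for the next, ...
  y01 + (idx - 1) * step
}

partly <- separate(y01, year, step = 0.5)  # groups still overlap
fully <- separate(y01, year, step = 1)     # each group in its own band
fully
```

With step = 1 each group occupies its own unit-high band, which matches the fully separated panel; decreasing step back to 0 corresponds to pressing D.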
4.2.12 Make a multivariate time series plot
If the dataset contains multiple variables on the same time line, we may ask questions such as:
do those variables have the same period? Do the peaks and valleys match each other?
Cranvastime handles multivariate time series simply by putting all the variables in the argument
y as a formula. The code below gives an example and produces Figure 12.
qtime(TimeIndx,~ts+ca_med,qnasa,shift=c(1,12))
Figure 12: Time series plot for two variables in NASA data. Data are standardized.
The variables are automatically rescaled to lie between 0 and 1. The ACF values for both variables are listed
too.
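Rescaling to the unit interval can be sketched in base R as follows. The helper rescale01 and the two example columns are hypothetical; the sketch only illustrates min-max rescaling and does not reproduce how cranvastime performs it internally.

```r
# Hypothetical helper: rescale a numeric vector to the interval [0, 1]
rescale01 <- function(v) {
  r <- range(v, na.rm = TRUE)
  (v - r[1]) / (r[2] - r[1])
}

# Two illustrative variables on very different scales
df <- data.frame(a = c(280, 290, 300, 295), b = c(0.1, 0.5, 0.3, 0.2))
scaled <- as.data.frame(lapply(df, rescale01))
scaled
```

After rescaling, both variables share the same vertical range, so their shapes can be compared in one panel even though their original units differ.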
4.2.13 Separate the variables for multivariate time series
To separate the variables, we can press Shift+U. For Figure 12, this keypress gives Figure 13(left).
The y-axis labels and the ACF indicator also move with the lines.
Figure 13: Separating the variables. The two variables have the same period, but the months of the peak values
do not match. The peaks of ts are two or three months later than the valleys of ca_med.
From Figure 13(left) we see that the variables have the same period, but the peaks do not match. To see the
pattern clearly, we can press Shift+Right arrow, which gives Figure 13(center). Here we find the peak of
ca_med is about 3 months later than that of ts.
Pressing Shift+D mixes the lines back. Figure 13(right) is the result of Shift+D from Figure 13(center). Then
pressing Shift+Left arrow makes the plot go back to Figure 12.
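A visual impression of a lag between two series can be checked numerically. The base R sketch below uses two synthetic series, with b constructed as a copy of a delayed by three steps, and searches candidate lags for the highest correlation. The data and the names a, b and best_lag are assumptions for illustration; cranvastime itself does not compute this.

```r
# Synthetic monthly series; 'b' repeats 'a' three steps later (illustrative)
t <- 1:72
a <- sin(2 * pi * t / 12)
b <- sin(2 * pi * (t - 3) / 12)

# Correlate a[t] with b[t + k] for candidate lags k = 0..6
n <- length(a)
lags <- 0:6
cors <- sapply(lags, function(k) cor(a[1:(n - k)], b[(1 + k):n]))

best_lag <- lags[which.max(cors)]
best_lag
```

The lag with the largest correlation is the shift at which the two series line up best, which is the quantity read off by eye from the folded, separated plot.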
4.2.14 Make a longitudinal plot
Longitudinal data is another important application of cranvastime. Since each of the groups/subjects/observations
is measured multiple times, one line for each group is necessary. To identify the groups, the argument
group is needed in function qtime.
Here we use the Remifentanil data as an example. The following code makes Figure 14.
library(nlme)
Remi <- Remifentanil[complete.cases(Remifentanil),]
Remi$ID <- factor(Remi$ID)
qRemi <- qdata(Remi)
qtime(Time,~conc,qRemi,group=ID)
Figure 14: Longitudinal time plot of Remifentanil data. A few groups have significantly different patterns
and magnitudes.
4.2.15 Separate the groups
As with separating the period groups in Section 4.2.11, the longitudinal groups are separated/mixed by pressing key U/D. Figure 15(left) is obtained by pressing U once from Figure 14, and continuing to
press U finally gives Figure 15(right), where all the groups are completely separated.
Figure 15: Separating the longitudinal groups. For most of the groups the conc increases for a while and
drops with a convex curve.
4.2.16 Highlight the whole series when brushing
This functionality is not yet available interactively on-the-fly; instead, a couple of commands are needed. For the
Remifentanil example, we need the following code:
link_var(qRemi) <- "ID"
link_type(qRemi) <- "self"
Then brushing any point in the plot will highlight the whole line, as shown in Figure 16(left). With a wider
brush, we can also select many groups simultaneously (Figure 16 right).
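The effect of linking on ID can be illustrated without cranvas. The base R sketch below uses a small made-up data frame and a logical vector standing in for the brush; the names df, brushed and highlighted are assumptions for the example and do not mirror the mutaframe internals.

```r
# Illustrative longitudinal data: three subjects, three measurements each
df <- data.frame(
  ID = rep(c("s1", "s2", "s3"), each = 3),
  Time = rep(1:3, times = 3),
  conc = c(5, 3, 1, 6, 4, 2, 7, 5, 3)
)

# Suppose the brush covers only the last point of subject "s2" (row 6)
brushed <- seq_len(nrow(df)) == 6

# Self-linking on ID: highlight every row whose ID matches a brushed row
highlighted <- df$ID %in% unique(df$ID[brushed])
df[highlighted, ]
```

Brushing one point therefore selects all rows that share its ID, which is why the whole line lights up in the plot.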
Figure 16: Selecting the groups of interest. The brush only contained the tail points of the group(s). Groups
selected on the right stopped the measurement at about the same time.
To cancel this kind of brushing, the following code is needed. If you are using a different dataset, change
the name qRemi to the name you use.
link_var(qRemi) <- NULL
4.2.17 Hold the series and shift horizontally
For some reason (for example, the groups do not start at the same time) we may want to shift one
or more groups horizontally to match other groups. In cranvastime, pressing key M switches to a mode
in which we can drag a series with the mouse or shift it by pressing the Left/Right arrow.
Figure 17: Selecting one group, and shifting it to the right. The series is highlighted when it is dragged.
Continuing with the Remifentanil example, we switch the mode by pressing M, after which the brush is no longer shown
in the graph. Move the mouse slowly to a point in the series that we want to move. When the cursor is
close enough, the whole line is highlighted, as shown in Figure 17(left). Dragging it to the right (pressing
the Right arrow does the same thing) gives Figure 17(center) and (right).
To recover the initial plot, press M to switch back to the wrapping mode, and then press Shift+Left
arrow.
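Aligning groups that start at different times can also be done numerically before plotting. The base R sketch below shifts each group so that its first observation falls at time 0; the data frame and the column TimeAligned are made up for the illustration and are not produced by cranvastime.

```r
# Illustrative groups that start at different times
df <- data.frame(
  ID = c("a", "a", "a", "b", "b", "b"),
  Time = c(1, 2, 3, 4, 5, 6),
  value = c(10, 8, 6, 11, 9, 7)
)

# Shift each group so that its first observation falls at time 0
df$TimeAligned <- df$Time - ave(df$Time, df$ID, FUN = min)
df
```

This is the programmatic counterpart of dragging every series to a common starting point, after which the trends of the groups can be compared directly.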
5 Summary
In conclusion, cranvastime provides interactive longitudinal and temporal plots and makes it possible to
wrap, zoom, pan, brush, and identify time series dynamically. This part of the proposal was completed
successfully.
The statistical time series analysis procedures are not as developed in cranvastime as proposed. Some
basic statistics, like the ACF, are calculated and presented on the plot, but more work needs to be done.
Features that were implemented but were not in the proposal include modifying the wrapping period on-the-fly,
vertically separating the overlapped series by period groups or longitudinal groups, and dragging/shifting the
groups of interest. These features are useful for analyzing seasonality and comparing
groups.
Future work will explore statistics for summarizing correlation
between series, interactive shifting tools for exploring lags between multivariate measurements, overlaying
additional clues such as smoothed series in sync with developments in the cranvas package, and the
efficiency of the user interaction.
Acknowledgement
I would like to thank Prof. Dianne Cook and Prof. Heike Hofmann, who are the mentors of this project, as
well as the cranvas-core group, who also helped me a lot.
Cranvastime is funded by Google Summer of Code 2011.
References
Cook, D. and Buja, A. (1997). Manual controls for high-dimensional data projections. Journal of Computational and Graphical Statistics, pages 464-480.
Hochheiser, H. and Shneiderman, B. (2001). Visual specification of queries for finding patterns in time-series
data. In Proceedings of Discovery Science, pages 441-446. Citeseer.
Kotter, T. and Theus, M. Fortune: a system for interactive statistical graphics for time series.
home.vrweb.de/~martin.theus/Fortune_JCGS.pdf.
McDougall, A. and Cook, D. (1994). Exploring time series using interactive graphics.
http://stat-graphics.org/movies/time-series.html.
Unwin, A. and Wills, G. (1988). Eyeballing time series. In ASA Proceedings of the Section on Statistical
Graphics, American Statistical Association, Alexandria, VA, pages 263-268.