Cranvastime: Interactive longitudinal and temporal data plots

Xiaoyue Cheng

August 26, 2011

1 Introduction

Cranvas interfaces to the Qt libraries (http://qt.nokia.com/) through the R packages qtbase and qtpaint to provide interactive and dynamic graphics for large amounts of data. It provides programmable interactive graphics in R, enabling users to link plots with analytical methods. Cranvas has several basic plot types: scatterplots, histograms, barcharts, density plots, mosaic plots, maps, parallel coordinate plots and tours.

Cranvastime develops interactive plots for longitudinal and temporal data. This type of data is prevalent in the study of problems related to human health and medical treatments, wages, prices and economic indicators, and temperature and climate change. Longitudinal data plots show values for individuals, possibly at irregular time intervals. The purpose is to explore individual variation, differences and similarities. Time series plots tend to show just one measurement for each time, but there is implicitly an interest in studying seasonal trends and temporal dependencies.

Several earlier software systems have explored the use of interactive graphics for time series data, including Diamond Fast (Unwin and Wills, 1988), XQz (McDougall and Cook, 1994), XGobi and GGobi (Cook and Buja, 1997), Fortune (Kotter and Theus), and TimeSearcher (Hochheiser and Shneiderman, 2001). Diamond Fast ran on Macintosh computers and could explore, interrogate and manipulate plots of univariate or multivariate time series. Some transformations, including differences and smoothing, could be performed on the series. A lot of the inspiration for the methods available in cranvastime comes from Diamond Fast. XQz was developed to provide tools for interactively exploring time series using Box-Jenkins approaches.
XGobi and its successor GGobi are broader in focus: they are interactive graphics tools for multivariate data that include some facilities for working with temporal data. Fortune was influenced by Diamond Fast; it focused on forecasting and transformations and also implemented a few basic interactive features. XQz, XGobi, GGobi, and Fortune were designed for Windows, Mac, and Linux systems. TimeSearcher 1 used time boxes to query the time series dynamically, while its successors TimeSearcher 2 and 3 extended the applications to large data and forecasting.

2 Functionality

• Univariate time series

  - Wrap the time series by a particular period. This helps to examine the periodicity, particularly slight irregularities. A graphical approach like this makes it possible to find near-seasonal dependencies.

  - Change the wrapping period on-the-fly. Seasonality can take several forms: yearly, monthly, weekly. Setting the period enables users to wrap in different units and so explore different kinds of periodicity. Users are encouraged to set multiple values in the 'shift' argument based on their knowledge of the data, then use those values to explore the wrapping period.

  - Fold the series on itself completely by the wrapping period setting. This shows the fully wrapped time series directly, without intermediate stages, when users are confident of the period.

  - Unfold/Reset the plot. This completely unwraps the time series with a single keypress, which is much more convenient than pressing the regular unwrapping key repeatedly.

  - Zoom in on the series to see the details. When the time series is quite long but the window width is limited, users can zoom in on the series to see local details.

  - Pan the zoomed-in series. This moves the zoomed-in series horizontally, so users can view the long series from one end to the other. It helps to understand the global patterns while in a local mode.

  - Compute the AutoCorrelation Function (ACF). Autocorrelation is used to numerically summarize the seasonality of time series.
    This is the correlation between the series and its lagged self. The lag is set by the wrapping period.

• Multivariate time series

  - Separate the standardized variables vertically. The variables involved may have different scales. To display them in one plot, cranvastime first standardizes the variables, so the series are mixed together. With this functionality users can then separate the variables to see whether they have different features.

  - Wrapping, folding, resetting, zooming and panning are the same as for univariate time series.

  - Compute the ACF for each variable.

• Longitudinal data

  - Separate the time series of the subjects vertically. As with separating multiple variables, users of longitudinal data may be interested in comparing the groups/subjects/observations that are repeatedly measured over time. Hence it is very useful to distinguish the groups vertically while keeping them on the same time axis.

  - Highlight the whole series for an individual when the brush covers any point in the series. Sometimes the groups in longitudinal data are too close to distinguish by eye. Highlighting the whole series helps to find the groups of interest.

  - Hold one of the series and shift it horizontally. Groups may start from different time points. Holding a series and shifting it can make the groups start at the same time, which makes it easy to compare their trends.

  - Wrapping, zooming, panning and resetting are the same as for univariate time series.

• Functionality that is general to the whole cranvas package and also works for cranvastime

  - Label a point when the cursor is near it, to provide detailed information to the user.

  - Change the size of points. When there are too many points in the plot, people may want the points to be smaller; when there are only a few points, bigger points are better. So the functionality to change the size is needed.

  - Change the transparency of points. At times some of the points overlap while others are sparse.
    Making the points transparent is useful to see the density of points, because the overlapped areas are darker while the sparse parts are lighter.

  - Brush an area and highlight all the points/segments in the area. Users can then select some observations and link them with other types of plots. This is very important for high-dimensional data, to understand relationships between the many characteristics that were recorded.

3 Settings

3.1 Usage

    qtime(time, y, data, period = NULL, group = NULL, wrap = TRUE,
        shift = c(1, 7, 12, 24), size = 2, alpha = 1, asp = NULL,
        main = NULL, xlab = NULL, ylab = NULL, ...)

3.2 Arguments

time    The variable indicating time, which is displayed on the horizontal axis. This variable (for now) needs to be consecutive integers 1, 2, 3, ..., n.

y       The variable(s) displayed on the vertical axis. It must be a formula with only a right hand side at the moment.

data    Mutaframe data to use.

period  The variable used to group the time series; it cuts a long time series into shorter periods, and each period is then drawn as a single line. It is best set to `year', `month', or another time resolution. Default is NULL. When it is not NULL, the keys U and D can be pressed to separate the groups or overlay them to examine the patterns.

group   The variable used for longitudinal data grouping, usually `ID' or similar.

wrap    The switch for the wrapping mode. Default is TRUE. When it is TRUE, pressing the right or left arrow wraps or unwraps the time series; when it is FALSE, pressing the arrows zooms in/out. The wrapping mode can be changed on-the-fly.

shift   Wrapping period selector. The default possible periods are 1, 7 (for days in a week), 12 (for months), 24 (for hours).

size    Point size, default is 2.

alpha   Transparency level, 1 = completely opaque, default is 1.

asp     Ratio between width and height of the plot.

main    Main title for the plot.

xlab    Label on horizontal axis, default is name of x variable.
ylab    Label on vertical axis, default is name of y variable.

3.3 Keypress Events

Up           Increase size of points.

Down         Decrease size of points.

Right        Wrap the time series when wrap=TRUE; zoom in, centered on the last clicked point, when wrap=FALSE.

Left         Return the time series to its position before the last wrapping step when wrap=TRUE; zoom out when wrap=FALSE.

Shift+Right  When wrap=TRUE, the time series is folded directly to the width of the maximal value in argument shift.

Shift+Left   The time series is returned to its original x-axis position, whether wrap is TRUE or FALSE.

Plus(+)      Increase alpha level (starts at alpha=1 by default).

Minus(-)     Decrease alpha level (starts at alpha=1 by default).

U(up)        Separate the series groups vertically.

D(down)      Mix the series groups by shifting them back vertically.

Shift+U      For multivariate y's, separate them vertically.

Shift+D      For multivariate y's, mix them back together.

G(Gear)      Cycle the wrapping period through the values of argument 'shift'.

M(Mode)      Switch the mode for series selecting. There are two modes: on and off. Default is off. When the argument 'group' is not null, users can turn it on to hold a series and shift it horizontally by dragging with the mouse. When the wrapping mode is FALSE, turning on the series selecting mode makes it possible to pan the zoomed-in series by dragging with the mouse or pressing the left/right arrows.

W(Wrap)      Switch the wrapping mode between TRUE and FALSE. When it is TRUE, an indicator of the wrapping period is shown at the bottom right of the graph; otherwise no indicator is shown.

4 Examples

4.1 Datasets

• NASA
  The data are geographic and atmospheric measures on a very coarse 24 by 24 grid covering Central America. The variables are: elevation, temperature (surface and air), ozone, air pressure, and cloud cover (low, mid, and high).
  With the exception of elevation, all variables are monthly averages, with observations for Jan 1995 to Dec 2000. These data were obtained from the NASA Langley Research Center Atmospheric Sciences Data Center. The data are provided by package cranvas. More details about the data, including descriptions of the variables, are available online at http://www.amstat-online.org/sections/graphics/dataexpo/2006data.php.

• Remifentanil
  The Remifentanil dataset in the package nlme describes the pharmacokinetics of the drug Remifentanil. It contains 2107 rows and 12 columns, giving the concentration of the drug at different times along with demographic information on the subjects. The variables ID, Time and conc are used in cranvastime.

• Wages
  The Wages data were collected to track the labor experiences of male high-school dropouts. The men were between 14 and 17 years old at the time of the first survey. There are 15 variables, among which id, lnw and exper are used in cranvastime. More details can be found in package cranvas.

• Pigs
  The data record the quarterly production and profits for raising UK pigs during 1967-1978. There are 11 variables and 48 records. The dataset is also provided by package cranvas.

• Lynx
  Lynx is a dataset of annual numbers of lynx trappings for 1821-1934 in Canada. It is provided by package datasets, which comes automatically with R. It is a historic dataset, included among the cranvastime examples for posterity.

• Sunspots
  The data are monthly mean relative sunspot numbers from 1749 to 1983. They were collected at the Swiss Federal Observatory, Zurich until 1960, then at the Tokyo Astronomical Observatory. More details are in package datasets, which comes automatically with R. The datasets sunspot.month and sunspot.year provide univariate time series of sunspots, which are used in cranvastime. It is a historic dataset, included among the cranvastime examples for posterity.
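Section 3.2 notes that the time variable currently has to be consecutive integers, so time series objects such as lynx need a little preparation before plotting. The following is a minimal sketch in base R of that preparation, together with the folded position and the ACF at the wrapping lag described in Section 2. The helpers as_time_frame and fold_index, the column names, and the period of 10 are illustrative assumptions; none of them are part of cranvas, and the modulo rule is only one plausible reading of "folding".

```r
# Hypothetical helpers for illustration; they are NOT part of cranvas.

# Convert a univariate ts object into a data frame whose first column is a
# consecutive integer index (1, 2, ..., n), as qtime expects for `time`.
as_time_frame <- function(x) {
  stopifnot(is.ts(x))
  data.frame(
    index = seq_along(x),         # consecutive integers
    stamp = as.numeric(time(x)),  # original time stamps (here: years)
    value = as.numeric(x)
  )
}

# Position of each time point after folding the series completely by `period`
# (assumed here to be a simple modulo of the integer index).
fold_index <- function(index, period) {
  (index - 1) %% period + 1
}

lynx_df <- as_time_frame(lynx)    # lynx ships with the datasets package
lynx_df$folded <- fold_index(lynx_df$index, period = 10)  # 10 is an arbitrary choice

# Autocorrelation at the lag equal to the chosen period (element 1 is lag 0).
acf10 <- acf(lynx_df$value, lag.max = 10, plot = FALSE)$acf[11]

head(lynx_df, 3)
```

A data frame prepared this way could presumably be passed through qdata() and plotted with a call such as qtime(index, ~value, qdata(lynx_df)), following the pattern used in Section 4.2.1.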

4.2 Examples for Functionality

4.2.1 Create a cranvastime plot

Before creating a cranvastime plot, users should use the original data to create an augmented dataset as a mutaframe. Some attributes, like color and brushed, are added as columns of the mutaframe. This is a common setting for all the plot functions in package cranvas. Then function qtime is used to draw a time plot. The code example is as follows.

    library(cranvas)
    data(nasa)
    # Keep the time series of a single grid location
    nasa11 <- subset(nasa, Gridx == 22 & Gridy == 21)
    # Create the augmented mutaframe
    qnasa <- qdata(nasa11)
    # Draw the time plot; wrapping periods of 1 and 12 are available
    qtime(TimeIndx, ~ts, qnasa, shift = c(1, 12))

Figure 1: Cranvastime plot for the temperatures of one location in NASA data. Strong seasonality is visible.

Figure 1 is the simplest example using the NASA data. One location is chosen from the 24x24 locations, and the 72 months from 1995-2000 are drawn. Strong seasonality can be seen in the plot. The ACF with lag=1 is computed and shown at the bottom left.

4.2.2 Wrap the time series

The NASA data have strong seasonality, so we want to wrap the series to explore the periodicity. Pressing the Right arrow once gives Figure 2 (left). Continuing to press that key gives the plot in the right panel.

Figure 2: Wrapping the time series of NASA temperature data. The period is quite regular, following a yearly cycle.

Wrapping stops when the series is maximally wrapped, that is, when there are only three time points left in the plot. Pressing the Left arrow unwraps the series.

4.2.3 Change the wrapping period

There is an indicator at the bottom right of each plot above, which shows the wrapping period. It is 1 by default. The plot shifts by this many points at each wrapping step. If moving one point at a time is too slow for a long series, or if we are confident about the seasonal period, we can set different periods in the argument shift. For this data shift=c(1,12) is set, because we know the data are monthly, so 12 should be the yearly seasonal period.

Figure 3: Wrapping the time series with a larger period (12).
The cycles fit well with each other.

Pressing key G changes the text in the indicator, as shown in Figure 3 (top left). Pressing the Right arrow once at a time then gives the other five graphs of Figure 3 in turn. Similarly, pressing the Left arrow unwraps the series.

4.2.4 Fold/unfold the time series

When the time series has a constant periodicity, users may want to skip the wrapping steps and go to the fully wrapped plot directly. Pressing Shift+Right arrow does this. The period for folding is set to the maximum of the values in the argument shift. To unfold the series, press Shift+Left arrow. Unfolding can be used in ANY state of wrapping or folding, so it also acts as a reset function.

Figure 4: Folding the time series. The variation of temperatures in winter is larger than in summer.

In Figure 4, the left graph is the same as Figure 1. Folding the series does NOT require the user to change the wrapping period to 12 by pressing G; Shift+Right arrow alone gives the right plot. A folded series can still be wrapped or unwrapped slowly by pressing the Right or Left arrow. Pressing Shift+Left arrow gives the left graph again.

4.2.5 Zoom in / out

When a long series is drawn in a small window, some local features are easily hidden by the dense points. Zooming in/out is a way to solve the problem. In cranvastime, we first need to switch from the wrapping mode to the zooming mode by pressing W, which gives the top left panel in Figure 5. Comparing it to Figure 1, we can see that the wrapping period indicator at the bottom right has been removed. This is the sign of the zooming mode. Then we need to click somewhere in the graph to confirm the horizontal center around which the series will be zoomed. After clicking, a yellow brush will pop up; remember that the hotspot of the brush is the place that was clicked.
(See the top center panel of Figure 5.) Now we can press the Right arrow to zoom in on the area around the clicked place, as shown in the other four plots of Figure 5. The magnification stops when there are three (or sometimes two) points left in the graph.

Figure 5: Zooming in on the time plot makes the changes look gentler, while zooming out makes the spikes look sharper.

To zoom out, press the Left arrow; this gives the last five pictures in reverse order. Pressing W switches back to the wrapping mode.

4.2.6 Pan the series

Suppose we stay in the bottom left panel of Figure 5 and would like to see another area of the series at the same level of zooming. We can change the mode by first pressing M and then pressing the Left and Right arrows, or use the mouse to drag the series to the area we are interested in.

Figure 6: Panning the local series. The local features look quite similar when panning.

Figure 6 (top left) is the plot after pressing M to change the mode. The difference between the two modes is whether the brush appears. In the panning mode, there is no brush. The top right panel is obtained by pressing the Left arrow. The bottom two plots are the left and right ends of the series, in which case pressing the Left or Right arrow will not move the series any further. After panning, pressing M again goes back to the zooming mode, where the Right/Left arrows change the level of zooming.

4.2.7 Label the point of interest

Figure 7: Labels in different directions of points. When the point is near the lower boundary, its label is above the point. When the point is close to the right border, the label is on the left of the point.

The label shows up when the cursor is close to the point of interest. Whether the cursor is 'close' enough depends on the size of the points. For the example in Figure 9 (left) it is hard to identify the points, but for Figure 9 (center/right) it is quite easy to get a label when moving the mouse.
Normally the label tag is placed below and to the right of the point. But when the point is near the border of the plot, the position of the label changes depending on the position of the point (Figure 7).

The content of the label includes the x-axis and y-axis values, as well as the period or group variables if they are not null. If two or more points overlap, the information for all of them is shown in the label tag.

4.2.8 Brush the area

Figure 8: Two different sizes of the brush. Only one point is brushed in the left plot, while more than half of the points are of interest in the right.

When the user clicks the mouse in the plot, a square brush pops up. The initial size of the brush is 1/30 of the range of the axes. There is a hotspot in one of the four corners of the brush square. Right clicking and dragging it changes the size of the brush. Left clicking changes the position of the brush. Figure 8 shows two different sizes of brush.

4.2.9 Change the size and transparency

Figure 9: Different sizes and transparency of the points. The left panel looks concise, but the center one shows the exact positions of the observations. From the right graph we see several overlapped places.

Repeatedly pressing the Down arrow makes the points disappear (Figure 9 left). Repeatedly pressing the Up arrow increases the size of the points (Figure 9 center). Pressing the minus (-) key makes the points and lines lighter and lighter (Figure 9 right) until they can hardly be seen, while pressing the plus (+) key makes them darker and finally returns them to the black color of the initial plot.

4.2.10 Create a univariate time plot with period

The NASA data contain temperatures for 72 months, and the time points are integers from 1 to 72. But the data also have a Year variable. To compare the temperatures of the same month across years, we can group the series by year and get a line segment for each year. The following code gives Figure 10 (left).
    qtime(TimeIndx, ~ts, qnasa, period = Year, shift = 1)

Figure 10: Univariate time plot with period. It wraps the series fully. The right plot shows the same functionality as described in the above subsections.

This plot also has the basic functionality described in the sections above. We can wrap the series, increase the point size, brush some points, and get the label, as shown in Figure 10 (right).

4.2.11 Separate the period groups

Although Figure 10 (left) looks similar to Figure 4 (right), setting the argument period makes something different possible: separating the period groups. Pressing key U (for up) once gives Figure 11 (left). The distances between the lines are larger than in Figure 10 (left), and the labels on the y-axis have changed to the values of the variable Year. Continuing to press U gives Figure 11 (center) and finally Figure 11 (right). In Figure 11 (right) all the lines are completely separated.

Figure 11: Separating the period groups. Year 2000 had an earlier temperature peak than other years.

Pressing key D (for down) mixes the lines gradually and finally gives Figure 10 (left) again.

4.2.12 Make a multivariate time series plot

If the dataset contains multiple variables on the same time line, we may ask questions such as: do those variables have the same period? Do the peaks and valleys match each other? In cranvastime we can handle multivariate time series simply by putting all the variables in the argument y as a formula. The code below gives an example and produces Figure 12.

    qtime(TimeIndx, ~ts + ca_med, qnasa, shift = c(1, 12))

Figure 12: Time series plot for two variables in NASA data. Data are standardized.

The variables are automatically rescaled between 0 and 1. The ACFs for both variables are listed too.

4.2.13 Separate the variables for multivariate time series

To separate the variables, we can press Shift+U. For Figure 12, this keypress event gives Figure 13 (left). The labels on the y-axis and the ACF indicator also change with the lines.
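
The rescaling to [0, 1] from Section 4.2.12 and the vertical separation from Section 4.2.13 can be mimicked in base R. This is only a sketch of the idea and not the cranvas implementation: rescale01 and separate_series are hypothetical helper names, the toy data stand in for two variables on different scales, and the rule of shifting each variable by a fixed gap is an assumption.

```r
# Hypothetical helpers for illustration; they are NOT part of cranvas.

# Rescale a numeric vector to the unit interval [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Rescale every column, then shift column j upward by (j - 1) * gap.
# gap = 0 overlays the series; gap = 1 separates them completely.
separate_series <- function(df, gap = 1) {
  scaled <- as.data.frame(lapply(df, rescale01))
  for (j in seq_along(scaled)) {
    scaled[[j]] <- scaled[[j]] + (j - 1) * gap
  }
  scaled
}

# Toy data standing in for two variables measured on different scales.
toy <- data.frame(a = c(10, 20, 30, 20), b = c(0.5, 0.1, 0.3, 0.9))
separate_series(toy, gap = 1)
```

Intermediate values of gap between 0 and 1 would correspond to partially separated series, loosely analogous to repeated key presses in the interactive plot.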

Figure 13: Separating the variables. Two variables have the same period, but the months for the peak values do not match. The peaks of ts are two or three months later than the valleys of ca_med.

From Figure 13 (left) we see that the variables have the same period, but the peaks do not match. To see the pattern clearly, we can press Shift+Right arrow, which gives Figure 13 (center). Here we find the peak of ca_med is about 3 months later than that of ts. Pressing Shift+D mixes the lines back. Figure 13 (right) is the result of Shift+D from Figure 13 (center). Pressing Shift+Left arrow then returns the plot to Figure 12.

4.2.14 Make a longitudinal plot

Longitudinal data is another important application of cranvastime. Since each of the groups/subjects/observations has multiple measurements over time, one line for each group is necessary. To identify the groups, the argument group is needed in function qtime. Here we use the Remifentanil data as an example. The following code makes Figure 14.

    library(nlme)
    # Drop incomplete rows and treat the subject ID as a factor
    Remi <- Remifentanil[complete.cases(Remifentanil), ]
    Remi$ID <- factor(Remi$ID)
    qRemi <- qdata(Remi)
    # One line per subject
    qtime(Time, ~conc, qRemi, group = ID)

Figure 14: Longitudinal time plot of Remifentanil data. A few groups have significantly different patterns and magnitudes.

4.2.15 Separate the groups

As with separating the period groups in Section 4.2.11, the longitudinal groups are separated/mixed by pressing key U/D. Figure 15 (left) is obtained by pressing U once from Figure 14, and continuing to press U finally gives Figure 15 (right), where all the groups are completely separated.

Figure 15: Separating the longitudinal groups. For most of the groups the conc increases for a while and drops with a convex curve.

4.2.16 Highlight the whole series when brushing

This functionality is not yet available interactively on-the-fly; instead, a couple of commands are needed.
For the Remifentanil example, we need the following code:

    link_var(qRemi) <- "ID"
    link_type(qRemi) <- "self"

Then brushing any point in the plot highlights the whole line, as shown in Figure 16 (left). With a wider brush, we can also select many groups simultaneously (Figure 16 right).

Figure 16: Selecting the groups of interest. The brush only contained the tail points of the group(s). Groups selected on the right stopped the measurement at about the same time.

To cancel this kind of brushing, the following code is needed. If you are using a different dataset, change the name qRemi to the name you use.

    link_var(qRemi) <- NULL

4.2.17 Hold the series and shift horizontally

For some reasons (for example, the starting times of the groups are not the same) we may want to shift one or more groups horizontally to match other groups. In cranvastime, pressing key M switches to the mode in which we can drag a series with the mouse or shift it by pressing the Left/Right arrow.

Figure 17: Selecting one group, and shifting it to the right. The series is highlighted when it is dragged.

Staying with the Remifentanil example, we switch the mode by pressing M, after which the brush is no longer shown in the graph. Move the mouse slowly to a point in the series that we want to move. When the cursor is close enough, the whole line is highlighted, as shown in Figure 17 (left). Dragging it to the right (pressing the Right arrow does the same thing) gives Figure 17 (center and right). To recover the initial plot, press M to switch back to the wrapping mode, and then press Shift+Left arrow.

5 Summary

In conclusion, cranvastime develops interactive longitudinal and temporal plots and makes it possible to wrap, zoom, pan, brush, and identify time series dynamically. This part of the proposal was finished very well. The statistical time series analysis procedures are not yet as developed in cranvastime as proposed. Some basic statistics, like the ACF, are calculated and presented on the plot.
But more work needs to be done. Things that are done but were not in the proposal include modifying the wrapping period on-the-fly, vertically separating the overlapped series by period groups or longitudinal groups, dragging/shifting the groups of interest, etc. These implementations are very neat and useful for analyzing seasonality and comparing groups.

Future work will explore statistics for summarizing correlation between series, interactive shifting tools for exploring lags between multivariate measurements, overlaying additional cues such as smoothed series in sync with developments in the cranvas package, and the efficiency of the user interaction.

Acknowledgement

I would like to thank Prof. Dianne Cook and Prof. Heike Hofmann, who are the mentors of this project, as well as the cranvas-core group, who also helped me a lot. Cranvastime is funded by Google Summer of Code 2011.

References

Cook, D. and Buja, A. (1997). Manual controls for high-dimensional data projections. Journal of Computational and Graphical Statistics, pages 464-480.

Hochheiser, H. and Shneiderman, B. (2001). Visual specification of queries for finding patterns in time-series data. In Proceedings of Discovery Science, pages 441-446. Citeseer.

Kotter, T. and Theus, M. Fortune: a system for interactive statistical graphics for time series. home.vrweb.de/~martin.theus/Fortune_JCGS.pdf.

McDougall, A. and Cook, D. (1994). Exploring time series using interactive graphics. http://stat-graphics.org/movies/time-series.html.

Unwin, A. and Wills, G. (1988). Eyeballing time series. In ASA Proceedings of the Section on Statistical Graphics, American Statistical Association, Alexandria, VA, pages 263-268.