TimeCleanser: A Visual Analytics Approach for Data Cleansing of Time-Oriented Data
Theresia Gschwandtner, Wolfgang Aigner, Silvia Miksch, Johannes Gärtner, Simone Kriglstein, Margit Pohl, Nik Suchy

Motivation

Overview
  TimeCleanser: special quality checks for time-induced problems
  Evaluation of TimeCleanser
  Results
  Derived design principles
  Conclusion

TimeCleanser – Special Focus on Time
  Time
    Point in time
    Interval
  Time-Oriented Values
    Values that change over time
  Multiple Data Sets
  [Figure: sales per hour]

TimeCleanser – Design and Development
  Tight collaboration with software company
  Working at the company site for over 3 years
  Problem analysis, design, and implementation
  Frequent feedback sessions
  Evaluation

Requirement Analysis
  Taxonomy of time-oriented quality problems [Gschwandtner et al., 2012]
  Real-life experience of partner company

TimeCleanser

Time Checks – Examples
  Intervals
    Same durations
    Minimum and maximum duration
    Obligatory gaps, e.g., break in the night
    [Figure: time axis with a gap between 8pm and 7am]
  Temporal range
    IDs should cover same temporal range (with some tolerance), e.g., different departments
    [Figure: entries for IDs A and B along a time axis]

Time-Oriented Value Checks – Examples
  Valid minimum and maximum values within a given temporal range, e.g., sales of one hour
    [Figure: value over time]
  Valid value sequences, e.g., ready – start – operate – end
    [Figure: values X, Y, Z over time]

Multiple Data Sets Checks – Examples
  Data should cover same temporal range (with some tolerance)
  Contain intervals of same length
  Contain time stamps of same precision
  [Figure: data sets A and B on a time axis, with time stamps 8:02, 9:01 and 8:00, 9:00]

Summary – Checks
  Syntax Checks
  Time Checks
    Valid overall temporal range
    Durations/interval length
    Missing time point or interval
    Entries for different IDs cover same temporal range
  Time-Oriented Value Checks
    Valid minimum and maximum values within a given temporal range
    Values which do not change for too long
    Valid timing of values
    Valid value sequences
    Valid intervals between subsequent values
  Multiple Data Sets
    Cover same temporal range
    Contain intervals of equal length
    Contain time stamps of same precision

Visualizations
  Overview of values over time
  Difference plot of subsequent data values
  Heatmap of interval lengths and data values

Evaluation – Focus Group
  Participants:
    2 data analysts of our partner company (target users)
    2 HCI experts
  Session: 2 scenarios (GPS data and working hours)
  Tasks:
    1. Remove syntax errors
    2. Check interval lengths
    3. Check plausibility of velocity values (GPS data set only)
    4. Check validity of working hours and of weekly profiles (working hours data set only)
  Audio and video recording

Design Principle 1: Data cleansing is a sequential task with loops
  correct syntax
  assign semantic roles: From, To, Value, Differentiator
  overview
  zoom & filter
  analysis & details on demand
  time values, data values, sequences, multiple data sets

Design Principle 2: Complex quality problems are best spotted with visualizations
  ‘You get a picture of the data set, not only of erroneous entries, but also of how the data looks like and how it should look like.’

Design Principle 3: Visualizations and raw data tables are complementary

Design Principle 4: Algorithmic means are suited to identify precisely definable errors

Design Principle 5: Original data needs to be preserved
  Correct data right away for further processing
  Confer with customers later
  Quickly undo changes

Design Principles – Summary
  1. Data cleansing is a sequential task with loops
  2. Complex quality problems are best spotted with visualizations
  3. Visualizations and raw data tables are complementary
  4. Algorithmic means are suited to identify precisely definable errors
  5. Original data needs to be preserved

Negative Points and Possible New Features
  More interactive features would be necessary (HCI experts)
  Synchronized zooming for multiple visualizations
  Linking and brushing between visualizations and data tables
  Statistics about string lengths to support the detection of outliers
  Use of wildcards and regular expressions for filter functionality
  A one-page statistical summary of the data set (e.g., minimum, maximum, average, distribution)

Conclusion
  Very close collaboration with target users
  Systematic list of data quality checks
  Sequence of cleansing steps
  Design principles for data cleansing support (with special focus on time-oriented data)
  Need for visualizations for complex error detection and cleansing tasks

Results – Topics which were discussed
  Traditional methods
  Workflow
  Advantages of TimeCleanser
  Attitudes towards visualizations
  Intertwinedness of analytical and visual methods
  Negative points and possible new features
  Design principles

Syntax Checks

Time-Oriented Value Checks

Evaluation – Questions
  (1) Does the prototype help the target users to perform data cleansing tasks?
  (2) Is an integration of visualization methods useful?
  (3) What are the advantages and disadvantages in comparison with the data cleansing methods they have used so far?
  (4) For which tasks are visualization methods, common data cleansing analysis methods, and a combination of both suitable?
  (5) Which interaction methods for the visualizations are useful to support users' working steps to perform data cleansing tasks?
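
Illustration: interval checks
  The interval checks listed under Time Checks (minimum and maximum duration, obligatory gaps such as the break in the night) could be sketched in Python roughly as follows. This is not TimeCleanser's implementation. The function name `check_intervals`, the tuple-based interval format, the thresholds, and the 8pm to 7am gap window are assumptions chosen for illustration, and the gap is assumed to wrap around midnight.

```python
from datetime import datetime, time, timedelta

def check_intervals(intervals, min_duration, max_duration,
                    gap_start=time(20, 0), gap_end=time(7, 0)):
    """Flag intervals with implausible durations or overlapping a nightly gap.

    `intervals` is a list of (start, end) datetime pairs (hypothetical format).
    The obligatory gap is assumed to run from `gap_start` in the evening
    to `gap_end` the next morning.
    """
    problems = []
    for idx, (start, end) in enumerate(intervals):
        duration = end - start
        if duration < min_duration:
            problems.append((idx, "too short"))
        elif duration > max_duration:
            problems.append((idx, "too long"))

        # An interval respects the gap only if it lies within one calendar
        # day, starts no earlier than gap_end, and ends no later than gap_start.
        inside_day_window = (
            start.date() == end.date()
            and start.time() >= gap_end
            and end.time() <= gap_start
        )
        if not inside_day_window:
            problems.append((idx, "overlaps obligatory gap"))
    return problems
```

  Called with a 30-minute minimum and an 8-hour maximum, an interval from 19:00 to 21:00 would be reported as overlapping the gap, and a 5-minute interval as too short. Precisely definable rules of this kind are the ones the deck assigns to algorithmic means; judging plausibility is left to the visualizations.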
TimeCleanser

Design Principle 1: Data cleansing is a sequential task with loops
  correct syntax
  assign semantic roles
  overview
  zoom & filter
  analysis & details on demand
  time values, data values, sequences, multiple data sets
  Additions to Shneiderman's Visual Information Seeking Mantra [Shneiderman, 1996]:
    ‘correct syntax first, assign semantic roles, overview, zoom and filter, then analysis and details-on-demand’
  Additions to Keim's Visual Analytics mantra [Keim et al., 2008]:
    ‘correct syntax first – assign semantic roles – overview – analyse – show the important – zoom, filter and analyse further – details on demand’

Lessons Learned
  1. Automatic methods are preferred in cases which are easily defined
  2. Visualizations are superior when judging plausibility
  3. Analysts appreciated the use of visualizations as an interactive analysis tool
  4. Efficient connection of visualizations to raw data and a side-by-side display are important

TimeCleanser – Design and Development
  Tight collaboration with software company
  Working at the company site for over 3 years
  Problem analysis, design, and implementation – CEO, data analysts, software developers, VA experts
  Frequent feedback sessions – CEO, VA experts, software developers
  Evaluation – data analysts, HCI experts

Design Principle 4: Algorithmic means are suited to identify precisely definable errors
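
Illustration: a precisely definable error
  One example of an error that can be defined precisely is an invalid value sequence, such as a deviation from ready – start – operate – end. A minimal Python sketch of such a check is given below. It is an illustration under assumptions, not the tool's actual algorithm: the function name `find_sequence_violations`, the list-of-strings input, and the rule that the cycle restarts with "ready" after "end" are all hypothetical.

```python
# Expected order of states, taken from the slide example
# "ready - start - operate - end". Restarting the cycle with
# "ready" after "end" is an assumption made for this sketch.
EXPECTED_ORDER = ["ready", "start", "operate", "end"]

def find_sequence_violations(values, expected_order=EXPECTED_ORDER):
    """Return (index, previous, current) for every invalid transition.

    `values` is a chronologically ordered list of state names.
    """
    # Map each state to the only state allowed to follow it.
    successor = {
        state: expected_order[(i + 1) % len(expected_order)]
        for i, state in enumerate(expected_order)
    }
    violations = []
    for i in range(1, len(values)):
        previous, current = values[i - 1], values[i]
        # Unknown states have no successor and are always reported.
        if successor.get(previous) != current:
            violations.append((i, previous, current))
    return violations
```

  For the sequence ready, operate, end this reports one violation at index 1, because "operate" follows "ready" directly. Whether a flagged entry is really an error or a legitimate exception remains a judgment for the analyst, ideally alongside the raw data table and the visualizations.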