What are “Data Clones”?

DATA CLONE DETECTION AND VISUALIZATION IN SPREADSHEETS
Felienne Hermans, Ben Sedee, Martin Pinzger, and Arie van Deursen
Delft University of Technology, Infotron
Netherlands
AGENDA
Introduction
Motivation
Part A – Data Clone Detection
Part B – Clone Visualization
Implementation and Evaluation
Case Study
Conclusion
ABOUT THE PAPER
• Spreadsheets are widely used in industry
• Spreadsheets are error-prone
• Numerous companies have lost money because of spreadsheet errors
One of the causes: copy-pasting
• In this paper:
– Study cloning (copy-pasting) in spreadsheets
– Provide an approach to detect and visualize clones in spreadsheets
– Evaluate the proposed approach
Introduction
INITIAL RESEARCH PROJECT
• Study the “gap” between business users and programmers
• Surprising finding – some programmers were heavily involved in business
In Excel
NEW RESEARCH INTEREST
Impact of Spreadsheets on Business
• 95% of all US firms use spreadsheets for financial reporting
• 90% of all analysts in industry perform calculations in spreadsheets
• 50% of spreadsheets form the basis for decisions
Finding – the impact of spreadsheets grows over time
SPREADSHEETS ARE “UNDER THE RADAR”
• No lists of spreadsheets
• No tracking
• No clear owner
• No proper documentation
Complex spreadsheets without documentation can lead to serious errors
PREVIOUS RESEARCH ON SPREADSHEETS
• Focused on analyzing and testing the formulas
• The data in spreadsheets was overshadowed
CLONE DETECTION IN SOURCE CODE
• Text-based techniques – perform little or no transformation of the raw source code before attempting to detect identical lines of code
• Token-based techniques – apply a lexical analysis (tokenization) to the source code and subsequently use the tokens as a basis for clone detection
• AST-based techniques – use parsers to obtain a syntactic representation of the source code, typically an abstract syntax tree (AST). The clone detection algorithms then search for similar subtrees in this AST
• PDG-based approaches – use program dependence graphs (PDGs), which contain information of a semantic nature, such as the control and data flow of the program
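To make the difference between the first two families concrete, here is a minimal sketch in Python. It is an illustration only, not any particular tool's implementation: the function names, the regular expression and the `ID`/`NUM` placeholders are assumptions made for this example.

```python
import re

def text_fingerprint(line: str) -> str:
    # Text-based: almost no transformation, only whitespace is dropped.
    return "".join(line.split())

def token_fingerprint(line: str) -> tuple:
    # Token-based: lex the line, then abstract identifiers and numbers
    # into placeholders so that renamed copies still compare equal.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", line)
    abstracted = []
    for tok in tokens:
        if re.fullmatch(r"[A-Za-z_]\w*", tok):
            abstracted.append("ID")
        elif tok[0].isdigit():
            abstracted.append("NUM")
        else:
            abstracted.append(tok)
    return tuple(abstracted)

original = "total = price * quantity"
reformatted = "total = price*quantity"
renamed = "sum   = cost * amount"

# Text-based matching survives reformatting but not renaming.
print(text_fingerprint(original) == text_fingerprint(reformatted))
print(text_fingerprint(original) == text_fingerprint(renamed))
# Token-based matching also catches the renamed copy.
print(token_fingerprint(original) == token_fingerprint(renamed))
```

The approach in this talk builds on the text-based family, applied to cell values rather than lines of code.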
HORROR STORIES
• In 2003 TransAlta lost US$24 million because of a copy-paste error in a spreadsheet
• The Federal Reserve made a copy-paste error in their consumer credit statement, which could have led to a difference of US$4 billion
RELATION TO CLONES IN SOFTWARE
• Clones in spreadsheets have a similar effect to clones in software systems
– They make spreadsheets error-prone
– They have a negative effect on quality and maintenance
– They can lead to severe unexpected behavior
Motivation
AIM
• We do not aim to change users’ behavior
• Instead, mitigate the risks by detecting and visualizing the copy-paste relationships
• Understand the impact and develop a strategy to automatically detect data clones in spreadsheets
THREE RESEARCH QUESTIONS
1. How often do data clones occur in
spreadsheets?
2. What is the impact of data clones on
spreadsheet quality?
3. Does our approach to detect and visualize
data clones in spreadsheets support users in
finding and understanding data clones?
WHAT ARE “DATA CLONES”?
• Formula results copied as data to other parts of a spreadsheet
Part A – Data Clone Detection
SOME TERMS TO REMEMBER
• Clone
• Clone cluster
• Matching clone clusters
• Near-miss clone clusters
EXAMPLE USED IN FOLLOWING SLIDES
Worksheet name: Problem Data (copied data)
Worksheet name: Eff4 (worksheet that contains the formula)
CLONE DETECTION TECHNIQUE
The technique is based on an existing text-based clone detection algorithm.
Algorithm steps:
• Cell classification – divide the cells into data cells, formula cells and empty cells
• Lookup creation – a lookup table of all cells is created, with the cell value as key and a list of locations as the value
• Pruning – remove all values from the lookup table that do not occur both in a formula cell and in a constant cell
• Cluster finding – the algorithm looks for clusters of neighboring cells that are all contained in a clone, and that are all either formula cells or constant cells
• Cluster matching – each formula cluster is matched with each constant cluster
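The first three steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Breviz implementation: the sheet is modeled as a hypothetical dictionary from cell address to content, a string starting with `=` is treated as a formula, and a second hypothetical dictionary holds the computed value of every non-empty cell.

```python
from collections import defaultdict

def classify(content):
    """Step 1: label a cell as 'empty', 'formula' or 'data'."""
    if content is None or content == "":
        return "empty"
    if isinstance(content, str) and content.startswith("="):
        return "formula"
    return "data"

def build_lookup(contents, values):
    """Step 2: map each cell value to the locations (and kinds) where it occurs."""
    lookup = defaultdict(list)
    for location, content in contents.items():
        kind = classify(content)
        if kind != "empty":
            lookup[values[location]].append((location, kind))
    return dict(lookup)

def prune(lookup):
    """Step 3: keep only values seen in both a formula cell and a data cell."""
    return {
        value: places
        for value, places in lookup.items()
        if {"formula", "data"} <= {kind for _, kind in places}
    }

# Hypothetical sheet: C1 is computed, A2 holds the same number typed or pasted in.
contents = {"A1": 10, "B1": 32, "C1": "=A1+B1", "A2": 42, "B2": 7, "C2": ""}
values = {"A1": 10, "B1": 32, "C1": 42, "A2": 42, "B2": 7}

candidates = prune(build_lookup(contents, values))
print(candidates)
```

Only the value 42 survives pruning here, because it is the only one that appears both as a formula result and as a constant. Steps 4 and 5 then work on the surviving locations.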
STEP 1: CELL CLASSIFICATION
Data cells
Formula cells
STEP 2: LOOKUP CREATION
Cell value as key
List of locations as value
STEP 3: PRUNING
Remove:
Not a clone
STEP 4: CLUSTER FINDING
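One way to picture this step is as a flood fill over the candidate cells. The sketch below is an assumption-laden illustration rather than the tool's actual code: cells are hypothetical `(row, column)` tuples, and two cells count as neighbors when both their row and column distance are at most `step_size` (the deck only says that StepSize is a search radius in cells, so this distance measure is a guess).

```python
from collections import deque

def find_clusters(cells, step_size=1):
    """Group (row, col) locations into clusters of cells within step_size of each other."""
    remaining = set(cells)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster, queue = {start}, deque([start])
        while queue:
            row, col = queue.popleft()
            # Collect every unvisited cell inside the search radius.
            near = {
                p for p in remaining
                if abs(p[0] - row) <= step_size and abs(p[1] - col) <= step_size
            }
            remaining -= near
            cluster |= near
            queue.extend(near)
        clusters.append(cluster)
    return clusters

# Hypothetical candidate cells of one kind (for example, all constant cells left after pruning).
cells = {(0, 0), (1, 0), (2, 0), (10, 5), (11, 5)}
for cluster in find_clusters(cells, step_size=1):
    print(sorted(cluster))
```

Because a cluster must consist of cells of one kind, the function would be called once for the formula cells and once for the constant cells.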
STEP 5: CLUSTER MATCHING
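A possible reading of this step, sketched in Python: every formula cluster is compared with every constant cluster, and a pair is reported when enough values agree. The scoring rule used here (the share of the formula cluster's values that also occur in the constant cluster) is an assumption for illustration; the slides do not spell out how the match percentage is computed, and all names are hypothetical.

```python
from collections import Counter

def match_percentage(formula_values, constant_values):
    """Share of the formula cluster's values that also occur in the constant cluster."""
    if not formula_values:
        return 0.0
    shared = Counter(formula_values) & Counter(constant_values)
    return 100.0 * sum(shared.values()) / len(formula_values)

def match_clusters(formula_clusters, constant_clusters, min_match=100.0):
    """Compare each formula cluster with each constant cluster."""
    matches = []
    for f_index, f_values in enumerate(formula_clusters):
        for c_index, c_values in enumerate(constant_clusters):
            pct = match_percentage(f_values, c_values)
            if pct >= min_match:
                matches.append((f_index, c_index, pct))
    return matches

# Hypothetical clusters, given as lists of cell values.
formula_clusters = [[1.5, 2.5, 4.0, 8.0, 9.5]]
constant_clusters = [
    [1.5, 2.5, 4.0, 8.0, 9.5],  # exact copy
    [1.5, 2.5, 4.0, 8.0, 7.0],  # one value changed after copying
    [3.0, 6.0],                 # unrelated
]

print(match_clusters(formula_clusters, constant_clusters, min_match=100.0))
print(match_clusters(formula_clusters, constant_clusters, min_match=80.0))
```

With a threshold of 100% only the exact copy is reported; lowering it to 80% also reports the cluster in which one value was edited, which is the near-miss case.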
INPUT PARAMETERS
• StepSize – indicates the search radius in number of cells. Used in step 4 (cluster finding)
• MatchPercentage – 100% means the values have to match exactly; lower percentages allow for the detection of near-miss clones. Used in step 5 (cluster matching)
• MinimalClusterSize – sets the minimal number of cells that a cluster has to consist of
• MinimalDifferentValues – represents the minimal number of different values that have to occur in a clone cluster
• Region for finding clones – indicates whether clones are searched within worksheets, between worksheets, between spreadsheets, or a combination of those
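As a sketch of how these parameters might be bundled and checked, consider the following Python settings object. The class name, field names, default values and the three boolean region flags are assumptions made for illustration; they are not taken from the tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CloneDetectionSettings:
    step_size: int = 1                 # search radius in cells (cluster finding)
    match_percentage: float = 100.0    # 100 = exact clones only (cluster matching)
    minimal_cluster_size: int = 5      # illustrative default
    minimal_different_values: int = 3  # illustrative default
    within_worksheets: bool = True     # hypothetical region flags
    between_worksheets: bool = True
    between_spreadsheets: bool = False

    def __post_init__(self):
        if self.step_size < 1:
            raise ValueError("step_size must be at least 1")
        if not 0 < self.match_percentage <= 100:
            raise ValueError("match_percentage must be in (0, 100]")
        if self.minimal_different_values > self.minimal_cluster_size:
            raise ValueError("a cluster cannot hold more distinct values than cells")

    def accepts_cluster(self, values):
        """Apply the two size filters to the values of one cluster."""
        return (
            len(values) >= self.minimal_cluster_size
            and len(set(values)) >= self.minimal_different_values
        )

settings = CloneDetectionSettings(
    step_size=2, match_percentage=80,
    minimal_cluster_size=9, minimal_different_values=9,
)
print(settings.accepts_cluster(list(range(9))))  # nine distinct values
print(settings.accepts_cluster([1] * 9))         # nine cells, one distinct value
```

The MinimalDifferentValues filter is what rejects the second cluster: a run of identical numbers is large enough but carries too little information to count as a copy.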
WAYS TO VISUALIZE CLONES
• First way – dataflow diagrams showing relations between worksheets
• Second way – pop-ups indicating source and copied data
Part B – Clone Visualization
DATA FLOW DIAGRAM
Rectangles indicate Worksheets
Arrows indicate Formula Dependencies
Dashed Arrows indicate Data Clone Dependencies
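The notation above maps naturally onto a graph description. As a rough sketch (not how Breviz renders its diagrams), the Python function below emits Graphviz DOT text with boxes for worksheets, solid edges for formula dependencies and dashed edges for data clone dependencies. The function name and the arrow direction (from the worksheet holding the formula to the worksheet holding the copy) are assumptions.

```python
def to_dot(worksheets, formula_deps, clone_deps):
    """Build a DOT description of a worksheet-level dataflow diagram."""
    lines = ["digraph dataflow {", "  node [shape=box];"]
    for name in worksheets:
        lines.append(f'  "{name}";')
    for source, target in formula_deps:
        lines.append(f'  "{source}" -> "{target}";')
    for source, target in clone_deps:
        # Dashed edge: values were copied rather than linked by a formula.
        lines.append(f'  "{source}" -> "{target}" [style=dashed];')
    lines.append("}")
    return "\n".join(lines)

# The two worksheets from the running example; Eff4 holds the formulas.
dot = to_dot(
    worksheets=["Eff4", "Problem Data"],
    formula_deps=[],
    clone_deps=[("Eff4", "Problem Data")],
)
print(dot)
```

The resulting text can be passed to any DOT renderer to obtain a picture.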
DATA FLOW DIAGRAM
[Figure: dataflow diagram of the example, with worksheets Eff4 and Problem Data]
POP-UPS
IMPLEMENTATION
• Spreadsheet Analysis Tool – Breviz
• Implemented using
– C# 4.0 with Visual Studio 2010
– The GemBox component to read Excel files
• Also available as a service on Infotron’s website
Implementation and Evaluation
EVALUATION
• Two approaches:
– Quantitative: analyzed a subset of the EUSES corpus
– Qualitative: studied two real-life cases
QUANTITATIVE EVALUATION
• Goal:
– To answer first research question:
1. How often do data clones occur in spreadsheets?
– Evaluate the algorithm’s performance in terms of execution time and precision
SPREADSHEETS FROM THE EUSES CORPUS
• Used by several researchers to evaluate
spreadsheet algorithms
• 11 different domains, such as educational, financial, inventory and biology
• 4223 real-life spreadsheets
• 1711 spreadsheets contain formulas
Quantitative Evaluation
PARAMETERS FOR EVALUATION
• MinimalClusterSize – various values
• MinimalDifferentValues – various values
• MatchPercentage – 100%
• Clone search was not performed between domains
Quantitative Evaluation : EUSES Corpus
DETERMINING FALSE POSITIVES
• Manual inspection of the detected clones, determining whether:
– The detected clone clusters indeed have the same values
– One clone cluster indeed contains formulas and the other copied data
– The headers of the clone clusters match
FINDINGS
• Precision:
– Precision = (number of spreadsheets in which a clone was verified) / (total number of spreadsheets detected), expressed as a percentage
– Results:
For MinimalClusterSize = 5, MinimalDifferentValues = 3 (lowest meaningful values)
No. of spreadsheets detected = 157
No. of verified spreadsheets = 86
Precision = 54.8%
Precision rises for higher values of these two parameters
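The precision figure can be re-derived from the two counts on this slide. A small Python check, with a hypothetical helper name:

```python
def precision(verified: int, detected: int) -> float:
    """Percentage of detected spreadsheets in which a clone was verified."""
    return 100.0 * verified / detected

# Counts reported for MinimalClusterSize = 5 and MinimalDifferentValues = 3.
print(f"{precision(86, 157):.1f}%")
```

This reproduces the 54.8% shown above.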
FINDINGS (CONTD.)
• False Positives:
– Header values
– When values of two input parameters are below 6
– When some data are calculations while others are
input
– Array Formulas
FINDINGS (CONTD.)
• Performance:
For the 1711 spreadsheet files in the EUSES corpus:
running time = 3 hours, 49 minutes and 14 seconds
= 8.1 seconds per file on average
FINDINGS (CONTD.)
• Clone occurrence:
1711 spreadsheet files in the EUSES corpus contain formulas, which means around 5% (86/1711) of these spreadsheets contain verified clones
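Both corpus-level figures follow from numbers given on the slides. The short Python sketch below recomputes them; the helper name is hypothetical, and the inputs are the 1711 files with formulas, the 86 verified files, and the total running time of 3 hours, 49 minutes and 14 seconds.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert a duration to a number of seconds."""
    return hours * 3600 + minutes * 60 + seconds

files_with_formulas = 1711
verified_files = 86
total_seconds = to_seconds(3, 49, 14)

clone_share = 100.0 * verified_files / files_with_formulas
seconds_per_file = total_seconds / files_with_formulas

print(f"share of files with verified clones: {clone_share:.1f}%")
print(f"average running time per file: {seconds_per_file:.1f} s")
```

The share comes out at about 5%, and the average running time at roughly 8 seconds per file.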
FINDINGS (CONTD.)
• Observations:
– The impact of cloning on quality cannot yet be concluded
– Mostly, one spreadsheet is used for calculations, while the other is used for reporting
– Copies are used to sort
– Sometimes, the format of the clones did not match
QUALITATIVE EVALUATION
• Goal:
– To answer the second and third research questions:
2. What is the impact of data clones on spreadsheet
quality?
3. Does our approach to detect and visualize data
clones in spreadsheets support users in finding and
understanding data clones?
– Evaluate data clone detection and visualization
approach
SETUP
• Two case studies: South-Dutch FoodBank and Delft University
• Analyzed real-life spreadsheets in both case studies
• Asked the owner the following questions:
– Is this a real clone, in other words: did you copy this data?
– Did this clone lead to errors or problems?
– Could this clone be replaced by a formula link?
– Questions about the clones and the approach:
• Do you know why no direct links were used initially?
• How did the pop-ups help you in understanding the found data clones?
• How did the dataflow diagrams help you in understanding the found data clones?
Case Study
SOUTH-DUTCH FOODBANK
• Uses spreadsheets to keep track of food processed per month at distribution centers
• Processes an average of 130 kg of food per month
• Supplies 27 local foodbanks
• Problem: the results did not balance and food remained in the center
• Possible cause: copy-pasting practice
• Received 31 sheets for analysis
FINDINGS
• Settings:
31 spreadsheets
MinimalClusterSize = 9
MatchPercentage = 80%
MinimalDifferentValues = 9
StepSize = 2
• Performance: 3 hours, 9 minutes and 39 seconds, about 6 minutes per file
• 145 clones detected, 61 of which were near-miss
• False positives: 1
• Near-miss: some updated values were correct; 25 of the 61 near-miss clones were actual errors
• One exact clone was actually an error (data copied into the wrong column)
• Result of the analysis: after fixing the clones, the overall results balanced
Case Study: FoodBank
DELFT UNIVERSITY
• A budget spreadsheet had to be created for a grant proposal
• The spreadsheet calculates the salary costs of different employees
• Salaries are raised every year; the creator of the spreadsheet calculated them once and copied the results to different places in the spreadsheet
FINDINGS
• Settings:
15 worksheets
MinimalClusterSize = 9
MinimalDifferentValues = 9
MatchPercentage = 80%
StepSize = 2
• Performance: 3 seconds
• 8 exact clones
• No errors, but the analysis was very useful for improving the spreadsheet
Case Study: Delft University
RESEARCH QUESTIONS REVISITED
1. How often do data clones occur in spreadsheets?
A. Both the EUSES corpus and the case studies suggest that clones occur often in spreadsheets
2. What is the impact of data clones on spreadsheet quality?
A. Exact clones (matching 100%) mainly impact perspective, while near-miss clones cause trouble
3. Does our approach to detect and visualize data clones in spreadsheets support users in finding and understanding data clones?
A. The visualization helped users quickly get an overview of the otherwise hidden copy dependencies
CONCLUSION
• Data clones in spreadsheets are common
• Data clones in spreadsheets often indicate
problems and weaknesses in spreadsheets
• The algorithm is capable of detecting data clones quickly with 80% precision
• Approach supports spreadsheet users in
finding errors and possibilities for improving a
spreadsheet
QUESTIONS?