Software Visual Analytics Tools and Techniques for Better Software Lifecycle Management prof. dr. Alexandru (Alex) Telea Department of Mathematics and Computer Science University of Groningen, the Netherlands www.cs.rug.nl/svcg Introduction Who am I? • professor in computer science / visualization @ RuG (since 2007) • chair/steering committee of ACM SOFTVIS / IEEE VISSOFT 2007-2013 • 7 PhD students, over 35 MSc students • over 150 international publications in data / software visualization www.cs.rug.nl/~alext www.solidsourceit.com Topics of this lecture • program comprehension • software maintenance and evolution • software visual analytics Slides: www.cs.rug.nl/~alext/SVA Data Visualization: Principles and Practice A. K. Peters, 2008 www.cs.rug.nl/svcg What you will see • • lots of tools, use cases, applications all shown techniques were applied in the real IT industry source code text duplication code repositories code quality code dependencies design and metrics P2P networks program behavior stock exchange evolution metrics structure evolution team analysis www.cs.rug.nl/svcg History of Visual Data Analysis 1985 Scientific Visualization: - engineering - geosciences - medicine 1995 Information Visualization: - finance - telecom - business management 2000 Software Visualization: - the software industry! www.cs.rug.nl/svcg When is data visualization useful? 1. Too much data: • do not have time to analyze it all (or read the analysis results) • show an overview, discover which questions are relevant • refine search 2. Qualitative / complex questions: • cannot capture question compactly/exactly in a query • question/goal is inherently qualitative: understand what is going on 3. Communication: • transfer results to different (non technical) stakeholders • learn about a new domain or problem www.cs.rug.nl/svcg When is visualization NOT useful? 1. Queries: • if a question can be answered by a compact, precise query, why visualize? 
• “what is the largest value of a set”
2. Automatic decision-making:
• if a decision can be automated, why use a human in the loop?
• “how to optimize a numerical simulation”

Key thing to remember:
• visualization is mainly a cost vs benefit (or value vs waste) proposition
• cost: effort to create and interpret the images
• benefits: problem solved by interpreting the images
• discussion in software engineering: lean development [Poppendieck 2006]

✗ B. Lorensen, On the Death of Visualization, Proc. NIH/NSF Fall Workshop on Visualization Research Challenges, 2004
✗ S. Charters, N. Thomas, M. Munro, The end of the line for Software Visualisation? Proc. IEEE VISSOFT, 2003
✗ S. Reiss, The paradox of software visualization, Proc. IEEE VISSOFT, 2005
✓ J. J. van Wijk, The Value of Visualization, Proc. IEEE Visualization, 2005
M. Poppendieck, T. Poppendieck, Lean Software Development, Addison-Wesley, 2006

Software Visualization
Definition:
• “Software visualization is concerned with the static or animated 2D or 3D visual representation of information about software systems based on their structure, history, or behavior in order to help software engineering tasks” [Diehl, 2006]
For whom:
• developers
• testers
• architects
• managers
• business
Goals:
• understand large code bases quicker
• develop and refactor quicker
• create and manage large test suites quicker
• invest testing effort more efficiently
• understand very large software systems
• compare code with design documents
• reverse-engineer structure from code
• overview long-duration projects
• correlate quality with process decisions
• assess quality of process and product decisions
• support in/outsourcing activities
• communicate with technical stakeholders easily
• reduce cost/time, increase quality and productivity!

Software Visualization – Really needed?
Surveys: • software industry forecast: 457 billion $ (2013), 50% larger than in 2008 [www.infoedge.com] • comparison: total US health care spending 2.5 trillion $ (2009) [www.usatoday.com/news/health] • 80% of development costs spent on maintenance [Standish’84, Corbi’99] • 50% of this is spent for understanding the software! Practice: • 40% engineers find SoftVis indispensable, 42% find it not critical [Koschke ’02] • visual tools or tool plugins become increasingly accepted in software engineering • Visual Studio, Eclipse, Rational Rose, Together, JReal, DDD, … • See new Visual Studio 2010 dependency visualization! Research: • Software visualization is now an established research area • own events: ACM SoftVis, IEEE VISSOFT • related events: InfoVis, VAST, ICSE, I(W|C)PC, WCRE, OOPSLA, SIGSOFT, ECOOP, … A few billion of lines of code later: Using static analysis to find bugs in the real world (A. Bessey et al.), CACM, 2010 Software visualization in software maintenance, reverse engineering, and re-engineering: a research survey (R. Koschke), JSME, 2003 Measuring the ROI of software process improvement (R. van Solingen), IEEE Software, 2004 The paradox of software visualization (S. Reiss), IEEE VISSOFT, 2005 The end of line for software visualisation? (S. Charters, N. Thomas, M. 
Munro), IEEE VISSOFT, 2003

Software Visualization vs Visual Programming
[Figure: diagram relating the software, the user, and the tools (software visualization tool, visual programming tool); edge labels: is examined by, creates, is analyzed by, generates, is read by]

Software Visualization vs Visual Programming
Visual Programming
• usually for small programs
• mostly used in forward engineering (FE)
• usually at ‘component’ level
• is still not very popular
• complemented by ‘classical’ textual programming
• can work at several levels of detail
• quite well integrated in the engineering pipeline
Software Visualization
• for small…huge programs
• most used in reverse engineering (RE)
• at line-of-code…system level
• starts becoming quite popular
• complemented by ‘classical’ code text reading
• can work at several levels of detail
• not yet tightly integrated in the engineering pipeline
We’ll talk about software visualization, not visual programming

Software Visualization – Methods
1. Multivariate visualization
• datasets: tables
• techniques: table lenses, parallel coordinates
2. Relational visualization
• datasets: trees / hierarchies, general graphs, compound digraphs
• techniques: hierarchical node-link layouts, treemaps, force-directed layouts, matrix plots, bundled layouts, graph splatting
3. Text visualization
• datasets: general documents, source code, diagrams
• techniques: dense pixel techniques, areas of interest
4. Interaction techniques
• context and focus, multiple views, semantic zooming
Will discuss all these with InfoVis and SoftVis examples…

Software Visualization Pipeline
[Figure: pipeline diagram. Software analysis sub-pipeline: software engineering (design, development, testing, maintenance), software data, data acquisition, internal data, data filtering, enriched data; activities shown: static analysis, dynamic analysis, debugging, evolution analysis, graph extraction, metric computation, dataflow analysis, control flow analysis. Software visualization sub-pipeline: data mapping (graph layouts, attribute mapping), visual representations, graphics rendering (graph drawing, texture mapping, antialiasing, interaction), image, human interpretation, insight and decisions]
The two sub-pipelines need to be strongly connected!

1. Software structure visualization
Goal:
• get insight in how a program is structured, from code lines up to modules and packages
• different levels of detail = different visualizations
• containment relations (usually)
• can also be association relations (e.g. provides, requires, uses, calls, owns, …)
[Figure: levels ordered by level of complexity: code lines, functions, classes, files, components, packages, applications]

1.1. Code structure visualization
Source code
• show code in structured way, helps understanding purpose
• lexical highlighting: show code type
• indentation: show structure
• scalability: limited by font size
• expressivity: typically limited by complexity of lexical constructs
• implemented by most modern development tools
[Figure: lexical highlighting (Visual C++ 9.0); syntax highlighting (Xcode 2.0)]

Source code structure
Syntax highlighting
• more structure:
• generalize lexical highlighting to full code syntax
• adapt shaded cushions idea from treemaps to code blocks
[Figure: source code, syntax tree + cushion texture; cushion profile f(x) over x, with border size]
G. Lommerse, F. Nossin, L. Voinea, A.
Telea, The Visual Code Navigator: An Interactive Toolset for Source Code Investigation, IEEE InfoVis, 2005

Source code structure
• classical code editor
• indentation shows some structure
Images from CSV tool (www.cs.rug.nl/svcg/SoftVis/VCN)

Source code structure
• blend in structure cushions…
• color shows construct type (functions, loops, control statements, declarations, …)

Source code structure
• more structure cushions…

Source code structure
• cushions generalize syntax highlighting
[Figure: ‘flat’ syntax highlighting vs ‘cushioned’ syntax highlighting]

Source code structure
Brushing
• show details on demand as function of the mouse pointer
• widespread concept in InfoVis
move the mouse somewhere…

Source code structure
Brushing – spotlight cursor
• fade out cushions as function of distance to mouse pointer
• easy to do: blend a radial transparency texture
bring the text around in focus…

Source code structure
Brushing – structure cursor
• highlight (desaturate) cushion under mouse
• emphasizes structure
…or focus on a whole syntactic block

Source code structure
Dense pixel techniques
• how to show 10000 lines of code on one screen?
• adapt table lens idea
• zoom out text view (keep layout)
• replace characters by single pixels
The SeeSoft tool from AT&T [Eick et al.
‘92] zoomed-out code drawn as pixel lines color shows code age source code at usual font size • scales to ~50 KLOC on one screen • correlations between files become possible • syntactic structure not emphasized www.cs.rug.nl/svcg Source code structure Dense pixel techniques • good to show data attributes (code age, bugs, quality metrics, …) • does not show structure VTK class library • 1 C++ file • 10 headers • ~7000 lines VTK library: www.kitware.org/vtk www.cs.rug.nl/svcg Source code structure Add structure cushions • we start seeing things source headers comment class iteration if ….. VTK library: www.kitware.org/vtk www.cs.rug.nl/svcg Source code structure Enhancements • use two metrics per structure block • area color: code complexity [McCabe ‘76] • border color: number of casts • multiple correlated views • well-known InfoVis technique • table lens: find outliers (complex functions) • code view: get detailed insight Example • find most complex functions in two files • check if they do many typecasts A. Telea, L. Voinea, SolidFX: An Integrated Reverse-engineering Environment for C++, ACM SOFTVIS, 2008 Image created with the SolidFX tool (www.solidsourceit.com) www.cs.rug.nl/svcg 1.2. Module structure Compound digraphs • particular type of graphs, very frequent in SoftVis • two types of edges • containment – e.g. software hierarchy (files, modules, classes, functions, …) • association – e.g. software interactions (calls, inherits, uses, provides, requires, …) • how to visualize this? Compound digraph of a C Hello World program Node-link layouts • (very) bad choice! • mix containment and association edges • occlusions, clutter • layout may change significantly if you • add/remove a few nodes/edges • change the layout parameters slightly • how to teach one to read such a layout? 
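Whatever layout is chosen, the underlying compound digraph keeps its two edge types apart. The following is a minimal Python sketch of such a structure, not the representation used by any tool in these slides; the class name `CompoundDigraph`, its methods, and the hello-world nodes (`src`, `hello.c`, `libc`, `stdio.h`, `greet`) are illustrative assumptions.

```python
class CompoundDigraph:
    """Nodes joined by containment edges (a tree) and association edges."""

    def __init__(self):
        self.parent = {}        # containment: child -> container (None for roots)
        self.associations = []  # association: (source, target, kind) triples

    def add_node(self, node, parent=None):
        self.parent[node] = parent

    def add_association(self, source, target, kind):
        self.associations.append((source, target, kind))

    def ancestors(self, node):
        """Containers of `node`, innermost first."""
        chain = []
        current = self.parent[node]
        while current is not None:
            chain.append(current)
            current = self.parent[current]
        return chain

    def root_of(self, node):
        """Top-level container of `node` (the node itself if it is a root)."""
        chain = self.ancestors(node)
        return chain[-1] if chain else node

    def cross_root_associations(self):
        """Association edges whose endpoints lie in different top-level containers."""
        return [(s, t, k) for (s, t, k) in self.associations
                if self.root_of(s) != self.root_of(t)]


# Hypothetical hello-world system: folders contain files, files contain functions.
g = CompoundDigraph()
g.add_node("src")
g.add_node("hello.c", parent="src")
g.add_node("main", parent="hello.c")
g.add_node("greet", parent="hello.c")
g.add_node("libc")
g.add_node("stdio.h", parent="libc")
g.add_node("printf", parent="stdio.h")
g.add_association("main", "greet", "calls")    # stays inside hello.c
g.add_association("main", "printf", "calls")   # crosses from src to libc

print(g.ancestors("main"))
print(g.cross_root_associations())
```

Keeping containment as a parent map and associations as a separate edge list is one simple design choice; it makes it explicit that a layout has to deal with two different relations rather than one uniform edge set.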
[Figure legend: folders & files; functions]

Nested layouts
Principle
• show containment by spatial nesting of boxes
• show associations by node-link diagrams
• SHriMP layout (Simple Hierarchical Multi-Perspective) [Storey and Muller ’95]
• very popular, tens of tool implementations
• limited scalability, clutter
Images: SoftVision tool [Telea et al. ’02]; Creole tool (www.thechiselgroup.org/creole)

Nested layouts
Variations of the principle
• extend the hierarchical DAG layout [Sugiyama et al. ’81] to incorporate nesting
• lay out children containers using original method
• compute container bounding box
• lay out parent level using sizes from children
• route edges using orthogonal paths (eliminate crossings)
aiSee tool (www.absint.com/aisee)
• one of the best compound digraph layouts in existence
• clean and clear images, ‘engineering diagram’ look
• can show 2..3 hierarchy levels simultaneously
G. Sander, Graph Layout through the VCG tool, Proc. Graph Drawing, 1995
Practical intro: T. Würthinger, Visualization of Program Dependence Graphs, http://ssw.jku.at/Research/Papers/Wuerthinger07Master

Hierarchically bundled edges
Best existing technique for showing containment and associations
• application: modularity / coupling assessment
• hard to quantify in a metric…
• …but extremely easy to see!
Modular system
• blue = caller, red = called
• all functions in the yellow file call the purple class
• green file has many self-calls
Monolithic system
• blue = virtual, green = static functions
• red class has many virtual calls (possible interface class)
Decoupled system
• many intra-module calls
• few inter-module calls
• typical for library software

Hierarchically bundled edges
• if associations are correlated with structure, HEBs show this clearly!
• images are intuitive for most users
• edge overlaps instead of edge crossings!
• overlaps communicate information • easy to visually follow a bundle • built-in scalability (bundling = free aggregation) Structure and associations of C# system Enhancements • node and edge color coding • edge alpha blending • saturated = high density = many edges • transparent = low density = few edges • aggregation and navigation • collapse and expand nodes • replace multiple edges by single one • shaded hierarchy cushions • emphasize structure Image generated with SolidSX tool (www.solidsourceit.com) www.cs.rug.nl/svcg Multiple views Code view SolidSX tool (www.solidsourceit.com) Tree browser Treemap Table lens HEB view Images generated with SolidSX tool (www.solidsourceit.com) www.cs.rug.nl/svcg Enhancements Image-based edge bundles (IBEB) • produce a simplified view of system modularity original HEB new IBEB • A is connected to B… •…but how? • A1 is connected to B1… • A2 is connected to B2 A. Telea, O. Ersoy, Image-based edge bundles: Simplified visualization of large graphs, CGF 2010 www.cs.rug.nl/svcg 1.3. Attributes and structure Use the 3rd dimension: city metaphor • xy plane: structure (treemap technique) • z axis: metrics • color + xy size : extra metrics JHotDraw system (www.jhotdraw.org) visualized with CodeCity (www.usi.inf.ch/phd/wettel/codecity.html) R. Wettel, M. Lanza, Program comprehension through software habitability, IEEE ICPC, 2007 see also Codstruction tool (codstruction.wordpress.com) www.cs.rug.nl/svcg 1.4. 
Architecture Diagrams Diagrams • design or architectural information • can be also generated from source code (reverse engineering) UML (Unified Modeling Language) • most accepted (but not only) diagram notation in the engineering community • class • object • sequence • use case • statecharts • activity • package • component • deployment • variants of node-link layouts • special semantics (icons, positions, text annotations) • most often drawn by hand • UML 1.0 most known and used • UML 2.0 and beyond: more powerful but quite complex The UML Notation Standard (www.uml.org) www.cs.rug.nl/svcg UML Diagrams • add software metrics to support e.g. software quality analysis • show several metrics per element with icons • metric encoded as icon size, shape, color • compare metrics of the same element • compare metrics across elements The MetricView tool A: visualization B: UML browser C: metric layout D: metric data MetricView tool (www.cs.rug.nl/svcg/SoftVis/ArchiVis) www.cs.rug.nl/svcg UML Diagrams • smoothly navigate between structure and metrics • use transparency diagram: opaque metrics: transparent diagram: transparent metrics: opaque metrics structure MetricView tool (www.cs.rug.nl/svcg/SoftVis/ArchiVis) www.cs.rug.nl/svcg UML Diagrams • 2D vs 3D again • same idea as CodeCity but • use different xy layout • can also show associations • flip scene to smoothly navigate from 2D to 3D view 2D view (=3D seen from above) 3D view MetricView tool (www.cs.rug.nl/svcg/SoftVis/ArchiVis) www.cs.rug.nl/svcg Metric lens • embed a small table lens within each diagram element • scroll/sort columns in sync for all elements Graphical editor • areas = subsystems • LOC and complexity metrics • most classes have low metrics • 1 large, complex class (hard to maintain code!) 
• luckily, not in the system core

Area metrics
• show areas of interest: groups of components sharing some property
• show metrics defined on areas of interest (complexity, testability, portability, size, …)
• continuous rainbow colormap: continuous (smooth) metrics
• discrete colormaps: discrete metrics

2. Software Behavior Analysis
What is software behavior?
• the collection of information that a running program generates
• also called a program trace
• data: the values generated by the program
• internal values (e.g. local variables, stack, registers, …)
• external values (e.g. graphical/console input/output, log files, …)
• control: the states the program passes through
• data collection
• log files
• instrumentation (debuggers, profilers, code coverage analyzers, …)
Use cases
• optimization: determine performance bottlenecks (profiling)
• testing: check program correctness vs specifications
• debugging: find incorrect program constructions leading to test failures
• education: learn algorithm behavior by seeing its execution
• research: test/refine new algorithms

Behavior Analysis Challenges
E. Dijkstra, Comm. ACM 11(3), 1968

Program trace visualization
Callstack view
• visualize call stack depth over time with additional metrics (e.g. cache hits/misses)
• like Jinsight’s per-thread execution view
• shaded cushions: ‘ruler’ for the y axis
• antialiasing: same idea as for the timeline view
[Figure: callstack view with cache L2 miss and cache L1 miss rows; x axis: time (samples)]

Program trace-and-structure visualization
• program trace: icicle plot (as before)
• program structure: treemap
• lay out the two on top of each other
• correlate them by means of mouse interaction
[Figure: x axis: time (samples)]
J. Trümper, A.
Telea, ViewFusion: Correlating Structure and Activity Views for Execution Traces, TPCG 2012
Tool implementation: www.softwarediagnostics.com

Visualizing software deployment
Gammatella
• deployment yields multiple copies of a given software system
• collect runtime data from actual deployed copies
• visually analyze it to
• understand platform-specific problems
• detect potential bottlenecks, bugs, optimization possibilities, …
[Figure: Gammatella views: execution bar, file view, code view, system view, metric view]

Memory visualization
• $N$ concurrent processes $p_1, \ldots, p_N$
• $p_i$ allocates blocks $a_{ij} = (mem_{start}, t_{alloc}, mem_{end}, t_{dealloc})$, with $t_{alloc} < t_{dealloc}$ and $mem_{start} < mem_{end}$
• blocks do not overlap in space but may overlap in time
Visualization usage
• understand how memory allocators perform in practice
[Figure: memory visualization with the MemoView tool; axes: address space, time; color = process ID; memory fill]
Application
• embed previous visualization in a matrix, one cell per bin
• color = waste
• application: Symbian OS, Nokia

Execution trace-and-structure visualization
Execution traces
$T = \{ c_i \}$, $c_i = (f_{caller}, f_{callee}, t)_i$, $f \in \mathrm{ProgramFunctions}$, $t \in [0, t_{max}]$
• set of function-call events
Visualization
• hierarchical edge layout (program structure)
• massive sequence view (calls over time)
[Figure: execution traces in the ExtraVis tool]
ExtraVis tool: www.win.tue.nl/~dholten/extravis

Massive sequence view
• combines structure with calls
Top
• system structure (icicle plot)
Bottom
• calls over time
• 1 call $c = (f_{caller}, f_{callee}, t)$ → 1 line $(x_{start}, y, x_{end}, y)$
• $y = t$
• $x_{start}$ = position of $f_{caller}$ in structure view (red)
• $x_{end}$ = position of $f_{callee}$ in structure view (green)
• similar / regular call patterns visible
• zoomable view
• importance-based sampling, like in MemoView
Application
JHotDraw graphics editor
• create new drawing
• insert 5 figures
• repeat above 3 times
• use sequence view to find execution of the above operations
• verify results on source code with the HEB view
[Figure: sequence view annotated with phases: start-up, new drawing, insert 1 to insert 5, repeat 1 to repeat 3, cleanup]

3. Software Maintenance and Evolution
• five main phases
[Figure: lifecycle timeline; labels: delivery, time, size?, cost?]
“Software development does not stop when a system is delivered but continues throughout the lifetime of the system” [Sommerville, 2004]
Software evolution
• the set of maintenance activities done after the 1st software release until its lifetime end
• duration and cost of maintenance are far larger than those of the first four phases
• 80% of the lifecycle costs, 50% of which go to understanding
• hence, an excellent terrain for SoftVis!

Costs of Maintenance
• defect removal cost as function of lifecycle phase [Balaram ‘04]
• relative cost of correcting an error [Boehm et al ‘00]
“This planet has invested so far $2300 billion in the maintenance of COBOL programs” [Foster ’91]
Relative maintenance cost
• 60% [Hanna 1993]
• 70% [Lientz & Swanson 1980]
• 80% [Brown 1990, Coleman 1994, Pfleeger 2001]
• 90% [Sommerville 2004]
• apparently there is a steady rise…
S. Balaram (senior VP HP R&D, Bangalore) Building software with quality and speed, www.symphonysv.com
B. Boehm, C. Abts, S. Chulani, Software development cost estimation approaches – A survey, Ann. Soft.
Eng., 2000 Maintenance Pipeline end users developers bugs, improvements, features, … estimate cost vs benefit change management tracking system estimate needed effort change request find affected code change management impact analysis release planning analysis / decision-making (focus of evolution SoftVis) describe changes revise current system write change plan design changes code changes standard development pipeline new bug system is used in real life… new requirement environment and requirements evolve… test changes system release Software Evolution Lehman’s 8 Laws: • The Law of Continuing Change (1974) • The Law of Increasing Complexity (1974) • The Law of Self Regulation (1974) • The Law of Conservation of Organizational Stability (1980) • The Law of Conservation of Familiarity (1980) • The Law of Continuing Growth (1980) • The Law of Declining Quality (1996) • The Feedback System Law (1996) Derived from feedback system theory Validated empirically in practice over many years M. Lehman, J. Ramil, P. Wernick, D. Perry, W. Turski, Metrics and Laws of Software Evolution—The Nineties View,” Proc. 4th Intl. IEEE Software Metrics Symposium (METRICS '97), 1997, available online at: http://www.ece.utexas.edu/~perry/work/papers/feast1.pdf www.cs.rug.nl/svcg Laws of Software Evolution – Layman’s vs Lehman’s view Essentially the 8 laws say this: • Software must and will change whether you like it or not • Maintenance keeps being pumped in unless you want a disaster • Stuff that matters (size, quality, complexity, …) changes slowly on the average • Understanding effort keeps being pumped in unless you want a disaster • Things will get worse unless you keep pumping in effort to fix them • It’s not simple – we actually don’t know what happens out there! Hence • SoftVis and evolution analysis do have a case! 
My own view
• SoftVis has its best chance to be meaningful in software evolution and maintenance
– highest benefits and return-on-investment
– by far largest datasets
– we don’t really know how to make design and development more efficient

So where are we?
Are you bored?
• good. This is the real world out there in software engineering
• we must intimately understand this world if we want to use SoftVis to improve it
• if not, we’ll just generate pretty & useless pictures
So where does SoftVis go?
• impact analysis
• quality analysis
• decision making support
So where are the pretty pictures?
• they come up next
• but remember: they must be focused to serve a real purpose
• lean engineering principle applies:
– diminish waste
– increase value

Software repositories
• also called source control management (SCM) systems
• central tool for storing change
– revisions
– files, folders
– delta storage to minimize required space (using e.g. cmp or diff functions)
– typically semantics-agnostic: just store changes in files
– collaborative work (shared check-in, check-out, permissions)
– more advanced systems also store change requests (e.g. CM/Synergy)
– work with command-line, web, or IDE interfaces
Examples
• CVS – pretty old and outdated; revisions are per-file (physical, not logical); no atomic commits; …
• Subversion (SVN) – probably the best (known) open-source SCM; atomic commits; efficient binary file support; …
• CM/Synergy – automated build; change request support; however, more complex than CVS/Subversion
• ClearCase – build auditing; dependency management; high scalability
• Git, Mercurial, Visual SourceSafe, …

3.1.
Evolution at line level • take one file in a repository • visualize its changes across two versions line groups version detail • unit of analysis: line blocks (as detected by diff) • shows insertions, deletions, constant blocks, drift • cannot handle more than 2 versions Evolution at line level • visualization tool: CVSscan • correlated views, details on demand L. Voinea, A. Telea, J. J. van Wijk, CVSscan: Visualization of source code evolution, ACM SOFTVIS, 2005 code metric bar: code size change per separate line time metric bar: code size change in time detailed code view around mouse cursor move mouse in code view: show code that will be inserted / was removed from current location Evolution at line level – several files file lines • extend idea to a few (1..4) files • stack several line-level views, one per file • line color = construct type • helps doing cross-file correlations • large size jumps = code refactoring • less wavy patterns = stable code • horizontal patterns = unchanged code comments time (version) function bodies strings function headers Multiscale visualization • we often want to see files with similar evolution • define a file-level evolution similarity metric • good choice: similarity of change moments • cluster all files using this metric • bottom-up agglomerative clustering • visualize cluster tree and user-selected ‘cut’ in the tree C1 C2 C4 C5 (level of detail = size) C1 C2 C3 visualization of selected cut: one cushion per cluster C4 C5 C3 cluster tree: icicle plot; color = cluster cohesion 3.2. Evolution at file level Trend Analyzer (SolidTA) www.solidsourceit.com • 2D dense pixel layout • x = time, y = files; one file = one horizontal band; one version = one band segment time (version) project activity number of commits files sorted by order in folders color = version author ID Evolution at file level • which are the most changed files, and who worked in those? 
• sort files on activity, color on author ID
[Figure labels: most active files; files sorted by activity]
More on SolidTA: www.solidsourceit.com

Evolution at file level
• what type of code came when in the project’s lifetime?
• sort files on age, color on file type
[Figure labels: oldest files; activity ~ age…; files sorted by age; color legend: .tcl, .py, .h, .cxx, .in, .pdf]

Evolution at file level
• which are the bug reports / fixes and how do they correlate?
• color on presence of keywords “bug” and “fix” in commit logs
[Figure labels: files sorted by age; color legend: no hits, “bug”, “fix”, “bug” and “fix”]

Multiscale visualization
What is the risk of releasing the software now?
• show bug-report density using texture splatting on the evolution view
• group files by change similarity
• color by directory name to correlate file similarity with file location
Findings
• largest debugging activity localized to a single folder
• changes in that folder do not propagate outside (change-similar files are in 1 cluster)
• hence, system is well-structured for localized maintenance!
L. Voinea, A. Telea, How do changes in buggy Mozilla files propagate? ACM SOFTVIS 2006

3.3. Project level visualization
• upper-level decision makers do not have time to look at the evolution of each file
• visualize aggregated evolution metrics at project level
Team risk analysis
• software projects are done by developer teams over years
• find if team composition is risky for the project’s maintenance
Steps
• extract project evolution from software repositories
• compute impact of each developer over each file / function / …
• visualize impact evolution in an aggregated manner

Project level visualization
• aggregate impact (#files modified by each developer) over time
• visualize resulting time series using the ‘theme river’ metaphor [Havre et al ‘02]
Project A (open-source)
• software grows in time
• impact is balanced over most developers
Project B (commercial)
• software grows in time at about the same rate
• but one developer owns most of the code
• what if this person leaves the team?!
More details on SolidTA: www.solidsourceit.com

4. Software Visual Analytics
“The science of analytical reasoning facilitated by interactive visual interfaces” [Wong and Thomas ’04, Thomas and Cook ’05]
The Sensemaking Loop
• going from ‘raw’ data to meaning (semantics)
• data → hypothesis → (in)validation → conclusions
• in simple terms: combine analysis and visualization
P. Wong, J. Thomas, Visual analytics, IEEE Comp. Graphics & Applications, 24(5), 2004
J. Thomas, K. Cook, Illuminating the Path: The R&D Agenda for Visual Analytics, NVAC, 2005

Visual analytics vs Software visualization
Related but not identical (each row: SoftVis | Visual Analytics)
• Goal: present data | solve problems
• Methods: visualization | visualization and analysis
• Challenges: visual clarity, scalability, interaction effectiveness, … | assist the user on the whole path from raw data to finding a solution
• Techniques: mapping, layout, interaction | data mining + all SoftVis techniques
• History: relatively new (since ~2002) | well-established (since ~1995)
Integration: data mining + SoftVis = Software Visual Analytics

Software Visual Analytics Challenges
So Software Visual Analytics = SoftVis + data mining; what’s the big deal?
Data size
• exabytes or more (visualize entire ‘data warehouses’)
• SoftVis techniques do not scale to this size
Data heterogeneity
• “any type” of data: how to capture this in a uniform data model?
• missing, incomplete, incorrect, conflicting data: how to analyze / visualize this?
Multidisciplinary • databases, data mining, SoftVis: traditionally separate communities • understanding it all: too much for any single community Infrastructure • need integral / longitudinal solutions, not small, isolated applications / prototypes • development effort becomes a major bottleneck (how to do this as a researcher?) • how to evaluate integral solutions within a limited budget? www.cs.rug.nl/svcg Software Visual Analytics Visual Analytics applied to Software Engineering graphs, dense pixel displays, charts, … reasons for process/product quality degradation, bugs, low performance, low productivity, what-if questions, … problem model (capture essentials from refined data stream) high-level analysis (static analysis, clone detection, quality metrics) low-level analysis (parsing, instrumentation, repository data mining) source code, design documents, program execution, repositories Example 1: Build Optimization 1. Context • major embedded software company (NASDAQ 100) • industrial 17.5 MLOC code base of C code • modified daily by >500 developers worldwide 2. Problem • high build time (>9 hours) • modifying a header causes very long recompilations • testing becomes very hard; perfective maintenance (refactoring) nearly impossible 3. Questions • why is the build time so long? • what impact has a code change on the build time? • how is a change impact spread over the entire code base? • how to refactor the code to improve modularity and build time? A. Telea, L. Voinea, Visual Software Analytics for the Build Optimization of Large-Scale Software Systems, Comp. Statistics, 2010, to appear Build Optimization Three analyses – three tools in a unified toolset TableVision tool Build process analysis • why is the build slow? INavigator tool CM/Synergy repository Extracted data Build cost model Dependency analysis • how does a code change affect build time? IRefactor tool Refactoring analysis • how to rewrite code to improve build time? 
Gathering Raw Data
• measure build time using UNIX tools time(x)
• build time = CPU + I/O + network + paging + other processes
[Figure: build time compared with CPU time (large / small), attributing the remainder to I/O time + network*; annotations: preventive actions, corrective actions]
build is I/O bound!
* assume no other CPU-intensive processes besides compilation

Sensemaking: First Steps
• simple histogram of build time
[Figure: histogram; axes: translation units, time (sec)]
Build time depends significantly on the translation unit!
A useful build cost model must consider the per-unit build cost and not only the number of translation units

Build Cost Model – First Attempt
Build cost: “how much it costs to build a file”
• sources: number of lines of code in the source + (in)directly included headers
• binaries: negligible (linking is cheap)
• headers: zero (headers don’t get compiled)

Build impact: “how much it costs to rebuild the system when a file is modified”
• sources: build cost of the source itself
• headers: number of sources using that header

Example
[Figure: dependency graph with two libraries (BC = 0 each), two sources (BC = 3, BI = 3 and BC = 2, BI = 2) and three headers (BI = 1, BI = 2, BI = 2)]
* both application and system headers are considered

Build Cost Model – Validation
[Figure: headers sorted on model’s impact; axis labels: low-impact headers, high-impact headers; curves: model’s impact, build time measurements; annotation: 12th highest-impact header (reality) classified as 21st (model)]
• model is close to reality but not perfect
• deviations are important!
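The first-attempt model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names, line counts and include graph are made up (the edges of the slide's example figure are not recoverable from the text), and the helper names transitive_includes, build_cost and build_impact are hypothetical, not part of any tool mentioned in these slides.

```python
from typing import Dict, List, Set

# Hypothetical toy system (not the slide's example): two sources, three headers.
SOURCES: Set[str] = {"c1.c", "c2.c"}
LOC: Dict[str, int] = {"c1.c": 100, "c2.c": 50, "h1.h": 10, "h2.h": 20, "h3.h": 30}
INCLUDES: Dict[str, List[str]] = {
    "c1.c": ["h1.h", "h2.h"],
    "c2.c": ["h2.h"],
    "h2.h": ["h3.h"],  # h3.h is only included indirectly by the sources
}


def transitive_includes(f: str, includes: Dict[str, List[str]]) -> Set[str]:
    """All headers included by f, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(includes.get(f, []))
    while stack:
        h = stack.pop()
        if h not in seen:  # the 'seen' set also guards against include cycles
            seen.add(h)
            stack.extend(includes.get(h, []))
    return seen


def build_cost(f: str, sources: Set[str], loc: Dict[str, int],
               includes: Dict[str, List[str]]) -> int:
    """First-attempt build cost: LOC of a source plus LOC of its included headers."""
    if f not in sources:
        return 0  # headers are not compiled; binaries are treated as negligible
    return loc[f] + sum(loc[h] for h in transitive_includes(f, includes))


def build_impact(f: str, sources: Set[str], loc: Dict[str, int],
                 includes: Dict[str, List[str]]) -> int:
    """First-attempt build impact of modifying file f."""
    if f in sources:
        return build_cost(f, sources, loc, includes)
    # header: number of sources that use it, directly or indirectly
    return sum(1 for s in sources if f in transitive_includes(s, includes))


if __name__ == "__main__":
    for name in sorted(LOC):
        print(name, build_cost(name, SOURCES, LOC, INCLUDES),
              build_impact(name, SOURCES, LOC, INCLUDES))
```

In this toy graph the three headers get impacts 1, 2 and 2. Note that a header's impact here only counts its client sources and ignores how expensive each client is to build.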
More details on SolidBA: www.solidsourceit.com

Build Cost Model – Refinement
Build cost: “how much it costs to build a file”
• sources: number of (in)directly included headers
• binaries: negligible (linking is cheap)
• headers: zero (headers don’t get compiled)

Build impact: “how much it costs to rebuild the system when a file is modified”
• sources: the build cost of the source itself
• headers: sum of build costs of all sources including the header (in)directly

Example
[Figure: dependency graph with two libraries (BC = 0 each), two sources (BC = 3, BI = 3 and BC = 2, BI = 2) and three headers (BI = 3, BI = 5, BI = 5)]

Build Cost Model 2 – Validation
[Figure: headers sorted on first model’s impact; axis labels: low-impact headers, high-impact headers; curves: refined model’s impact, first model’s impact, build time measurements; annotation: refined model classifies outlier correctly]
• refined model delivers the same header order (in terms of impact) as the actual measurements

Build Cost Model 2 – Validation
Let’s look at the whole picture
[Figure: header files; curves: actual time measurements, first model, refined model]
• refined model nicely matches reality, including subtle ‘outliers’
• why is this so? (see next slide)

Build Cost Model 2 – Validation
Analyze deeper:
• compilation cost dominated by I/O (preprocessing headers)
• I/O cost dominated by file opening/closing on this platform
• hence the justification of impact = total # of opened headers

Conclusions
To reduce build time, we should:
• either massively accelerate network (highly costly / complex)
• reduce per-header build impact (header impact analysis)
• reduce impact of change on build time (header refactoring)

System-wide impact analysis
1. Find subsystems that are expensive to build
2. For a subsystem, find headers that have high build impact
3. Zoom in to highest-impact headers
4,5. For a high-impact header, see how its impact spreads over sources
6. For a header, see its cost breakdown over its include-set
[Figure: tool views labeled 1 to 6]

Subsystem-level impact analysis
Method
• color system tree by cost (blue=low, red=high)
• select desired subsystem
• right panel shows build impact for each header / source in that subsystem
Findings
• most headers have a low build impact
• however, a few have a very high impact
• touching those incurs a high build cost!
  – because they are used in many sources
  – because they include many headers

Refactoring analysis
• OK, we have a high-impact header h: how easy is it to reduce that impact?
• visualize the build cost distribution of h over the sources which use it

Case 1: easy refactoring
• build cost spread unevenly over the targets including selected header h
• to decrease cost due to h, we only need to change a few targets
build impact of h is located mainly in one single place!

Refactoring analysis
Case 2: difficult refactoring
• build cost spread evenly over the targets including selected header h
• to decrease cost due to h, we need to change almost all targets
[Figure labels: selected high-impact header, build impact of h]

Refactoring analysis - Refinement
• not all headers change equally often (e.g. system headers)
• new metrics:
  – build impact * change frequency
  – impact distribution: impact (%) of a header contained in the 10% most expensive of its targets
• easy & quick to use
[Figure: headers sorted by impact*change]
flat, low distribution (~15%): impact is spread uniformly over all targets. Hence, we cannot improve by refactoring a few targets
skewed distribution (50%): half of impact is concentrated in 10% most expensive targets.
Hence, refactoring these is an interesting option

Refactoring support
• OK, we found a high-impact header; how to decide on a refactoring plan?
• show header-to-client dependencies using a hierarchical DAG layout

Example 1: MIXTmet.h, used by 38 sources, high impact
• build impact due to direct header inclusion
• hard to decrease via refactoring

Example 2: WS_support.h, used by 48 sources, high impact
• build impact channeled via one intermediate header: WS_sim1_support.h (*)
• simpler refactoring may be possible

Refactoring support (2)
• say we want to include a header: is this potentially expensive?
• show header’s own include graph colored by build impact

Example: TDMD_types.h, used by 30 sources
• not a high-impact header itself
• but it includes high-impact headers!
• hence using this header introduces potentially expensive changes
[Figure labels: cdefs.h, pyconfig.h, TDMD_types.h]

Visual Tool: INavigator

Refactoring support (3)
How much costlier does the system build become if we add an #include?
• select a “source” header – the one in which we want to #include
• select a “destination” header – the one to be #included
• show the build cost increase

Example: What if we #include DNCHUI_chset.h in TDMD_types.h?
[Figure labels: source, target]
build impact increases from 9633 to 9783, i.e. about 1.5%

Refactoring support (4)
• previous methods OK for manual header-by-header refactoring only
How to refactor a large system?
• system $S = \{f_i\}_i$, $S = \mathit{Headers} \cup \mathit{Sources}$
• header $h_i \in \mathit{Headers}$, $h_i = \{s_j\}_j$, $s_j \in \mathit{Symbols}$ (function declarations, variables, types, macros, …)
• include relations $\mathit{inc} : S \to \mathcal{P}(\mathit{Headers})$, $\mathit{inc}(f) = \{h_i \mid f \text{ includes } h_i\}$
• symbol use relations $\mathit{use} : \mathit{Symbols} \to \mathcal{P}(\mathit{Headers})$, $\mathit{use}(s) = \{h_i \mid s \text{ is used by } h_i\}$
• in typical systems, not all symbols $s_j \in h$ in a header are used together

Automatic refactoring idea
• find high-impact header $h$ (see last slides)
• split $h$ into $h_1, h_2$ with $h_1 \cup h_2 = h$ by putting symbols used together in the same $h_i$
• recursively split $h_1, h_2$
• replace $\mathit{inc}(h)$ by $\mathit{inc}(h_1)$ and/or $\mathit{inc}(h_2)$

Refactoring support (4)
• intuitively: put symbols used together (by many sources) in same header
• include newly created headers instead of original ‘monolithic’ one
• why this is good
  – decrease build costs (by decreasing the included code)
  – decrease build impact (by decreasing the number of included headers)

The IRefactor analysis tool
• suggests refactoring possibilities and shows the gain in build impact

Refactoring visualization
[Figure labels: header, symbols; 2 colors]
• Refactoring cost (how many files must include both headers after refactoring): color scale from 0 to ≥5
• Build impact (build impact of the header): color scale from min to max
Best refactoring candidates:
• low refactoring cost
• high build impact parents
• low build impact children

Refactoring visualization
[Figure labels: suggested decomposition levels; header under analysis; color: refactoring cost (how many additional headers); color: refactoring benefit (% reduction in build impact); decomposition details (how to split symbols in smaller headers and how to #include these headers)]

Refactoring visualization
Example of bad candidate for header refactoring
[Figure labels: refactoring cost, refactoring gain]
As we gain benefits, we also increase costs

Example 2: Post-Mortem Assessment
Situation
• client: established embedded software producer
• product: 8 years evolution (2002-2008)
  – 3.5 MLOC of code in C166 dialect (1881 files)
  – 1 MLOC of headers (2454 files)
  – 15 releases
  – 3 teams = ~60 people (2 x EU, 1 x India)
• product failed to meet requests, at end...

Questions
• what happened right/wrong?
• how to prevent such errors in the future?

A. Telea, L. Voinea, Case Study: Visual Analytics in Software Product Assessments, IEEE Vissoft, 2009

Our context
Constraints
• data: source code repository only
• time: answers needed in max. 1 week
• we were unfamiliar with the application
Questions
• how to get the most insight & best address questions with these constraints?
• remainder of this talk: description of our ‘visual analytics’ approach

Methodology
[Diagram: raw data, data enrichment, enriched data, visualization; stakeholders ask, interpret and refine]
• data enrichment: static analysis, repository evolution analysis
• enriched data: call graphs, dependency graphs, static metrics, code duplication, modification requests (MRs), authors, changes / type
• visualization labels: charts, treemaps, document graphs, timelines
• results: present observations & questions, final conclusions
• perfect instance of a visual analytics process:
  – multiple data types, multiple tools
  – tight combination of data extraction, processing, visualization
  – incremental hypothesis refinement

Requirements: MR Duration
[Figure: files over time; labels: MR related check-in, R1.3 start]
• little increase in the file curve – most activity in old files
• suggests too long maintenance & closure of requirements

Requirements: MR Duration
[Figure: time graph of # commits referring to MRs within a given id range; axis: MR ids (1 bar = 100 MRs)]
• in mid 2008, activity related to MRs from 2006 still takes place

Team: Code Ownership
[Figure: package / module / file hierarchy; color scales: #developers (1 to >8), #modification requests (MRs) (1 to >90), MR closure in days (1 to >30); legend: team A, team B]
• large part of software affected by long open-standing MRs
• most of these are assigned to team A (largest team)…
• …and this team was reported to have communication problems!
[Figure legend, continued: team C]

Code: Dependencies
[Figure: package / module / file hierarchy; edges: uses (= call, type, variable, macro, …), is used; labeled package: iface]
• most dependencies occur via the iface, basicfunctions and platform packages
• filter out these allowed dependencies…
• …to discover unwanted dependencies
• these are accesses that bypass established interfaces

Code: Call graph
• high coupling at package level
• this image does not tell us very much
• select only modules which are mutually call dependent…
• …to discover layering violations
• not a strict layering in the system (as it should be)

Code: Quality Metrics
Moderate code + dependency growth
• does not explain product’s problems
Average complexity/function > 20
Total complexity: up 20% in R1.3
• testing can be hard!
• possible cause of product’s problems

Code: Duplication
External duplication
• links: modules that contain similar code blocks of >25 LOC
Internal duplication
• color: # duplicated blocks within a file (scale from 1 to 60)
Little external/internal duplication
Arguably not a problem for testing

Documentation delay
• 30% of files are documentation
• updated regularly
• grow in sync with rest of code base
[Figure: file counts over time; 854 doc + html, 1688 other files]
[Figure: docs (sorted on activity) over time]
• 40% of docs frequently updated
• rest seem to be ‘stale’
Code is arguably well documented…
…so refactoring is likely to be doable
Start from up-to-date docs

Example 3: Database visual analysis
Situation
• client: top-3 Swiss bank
• product: 8 years evolution (2004-2012)
  – Oracle/SQL/MS Access database solution
  – ~5000 tables, 60000 fields
  – mix of TS-SQL, Visual Basic, MS Access macros
  – code needed 24-hour uptime
• product was unmaintainable, at end...
Questions
• how can we understand the business logic?
• how can we refactor the database design for better maintenance?

Stakeholders
[Diagram labels]
• Technical personnel
• New report implementations: how to efficiently communicate changes to business layer?
[Diagram labels, continued; steps numbered 1 to 4]
• Business experts
• New report implementations: how to efficiently validate new reports?
• Final stakeholders
• Source code (Access, SQL, Toad, Oracle, VB, …)
• Business logic (BL) specifications: how to efficiently translate BL into technical details?
• Business logic (rules, conditions, limits, …)
• Reporting requests: business-level specifications, tight deadlines
• Business reports (the final facts & figures)

Visual Analytics Solution
Access Analyzer (SolidAA)
• visual end-to-end data flow analysis across entire reporting platform
• targeted question: “Where does this (report) data come from?”
• targeted users: business & technical
• fully handles any MS Access / SQL database
• technical details: full parsing of MS Access, SQL, symbolic Visual Basic interpretation (!)
[Figure labels: root, root]

Benefits
• reverse engineering cost: days
• learning cost: days
• development cost: few months
• client estimated savings: ~500 KEUR

More details on SolidAA: www.solidsourceit.com

Conclusions
Software Visual Analysis
• Effective and efficient for answering concrete problems in the IT industry
  – program comprehension
  – reverse engineering
  – software maintenance and evolution
  – software quality assessment
• Many techniques
  – program analysis
  – data visualization
  – lots of system integration and tool building effort
• Challenging but worthwhile
  – break the old patterns
  – learn new techniques
  – big cost savings: years / hundreds of EUR

Thank you for your interest!
Alex Telea
a.c.telea@rug.nl
www.cs.rug.nl/svcg

Tool References
Type of data / problem: Tool
• Source code / bugs: SeeSoft
• Source code / C++ syntax, queries: CSV (see VCN)
• Source code / testing, bugs: Tarantula, Gammatella
• Package & library interfaces: DreamCode
• Code evolution / file level: SolidSTA
• Code evolution / syntax level: CodeFlows
• Generic graphs & software architectures (1): SoftVision (see VCN), ArgoUML
• Software architectures (2): Rigi, aiSee/aiCall/VCG, Creole
• Generic graphs and trees: GraphViz, Tulip
• Compound digraphs: SolidSX
• UML diagrams: SugiBib
• UML diagrams and metrics: MetricView
• Dynamic memory allocation logs: MemoView
• Program traces and structure: ExtraVis
• Software behavior (execution): Jinsight, Jumpstart
• Code clones: SolidSDD
• Software structure and metrics in 3D: CodeCity

Tool References
Tool: URL
• CSV (see VCN): www.cs.rug.nl/svcg/SoftVis
• Tarantula: www.cc.gatech.edu/aristotle/Tools/tarantula
• DreamCode, CodeFlows: www.cs.rug.nl/svcg/SoftVis
• SolidSTA, SolidSX: www.solidsourceit.com
• SoftVision (see VCN): www.cs.rug.nl/SolidSX
• Rigi: www.rigi.csc.uvic.ca
• Creole: www.thechiselgroup.org
• GraphViz: www.graphviz.org
• Tulip: www.labri.fr/tulip
• ArgoUML: www.argouml.org
• SugiBib: www.sugibib.de
• ExtraVis: www.win.tue.nl/~dholten/extravis
• MetricView, MemoView: www.cs.rug.nl/SoftVis
• aiSee/aiCall/VCG: rw4.cs.uni-sb.de/~sander
• Code clones: www.solidsourceit.com
• Treemaps: www.cs.umd.edu/hcil/treemap-history