Modeling the HTML DOM and Browser API in Static Analysis of JavaScript Web Applications
ESEC/FSE 2011
Anders Møller, Magnus Madsen and Simon Holm Jensen

Motivation
• How can we help developers writing JavaScript web applications?
  – by providing tools for finding bugs early in the development cycle
• In this work we focus on finding bugs in the way JavaScript programs interact with the web browser

JavaScript in a browser
[Diagram labels: user interaction, rendering, web browser, events, DOM manipulation, The Document Object Model, JavaScript code]

Example
[Code listing not preserved; its annotations follow]
• The el.button property is always absent (it is undefined)
• An HTMLImageElement object does not have a button property
• Unreachable
• The programmer has confused el and ev

TAJS: Type Analysis for JavaScript
[S.H. Jensen, A. Møller and P. Thiemann, SAS '09]
A tool for static analysis of plain JavaScript
– the starting point for our work
– flow-sensitive dataflow analysis
– interprocedural
– whole-program analysis
– intended for non-minified, non-obfuscated code

Bug Finding
We look for general errors such as:
– dead or unreachable code
– invocations of built-in functions with an incorrect number of arguments or wrong argument types
– dereferencing undefined
– reading absent properties
– etc.

Contributions
We extend the static analysis of TAJS to reason about JavaScript that executes in a browser:
– how to model the browser API?
– how to model the HTML page?
– how to model the event system?
• 100s of non-standardized objects and functions
• complex prototype hierarchy of the W3C DOM
• many kinds of events
• dynamic registration of event handlers

Architecture
[Diagram: TAJS together with the Browser API, the DOM model and the flow graph extension; the result is a list of potential errors]
• JavaScript code: <script>...</script>
• Event handler code: <div onclick="..."/>
• Named tags: <form id="foo">...</form>

The Browser API
• The global window object
  – history, location, navigator, screen
  – alert(...), print(...), encodeURI(...)
  – setTimeout(...), setInterval(...)
  – addEventListener(...)
• Non-standard and legacy functionality

The HTML DOM
• The Document Object Model (W3C)
  – tree-like structure
  – e.g. one JavaScript object for each HTML tag
• HTMLInputElement, HTMLFontElement, etc.
  – arranged in a large prototype hierarchy
• Huge number of properties and functions
  – most properties are string or integer constants

The HTML DOM
• Important functions
  – createElement(...)
  – getElementById(...)
  – getElementsByName(...)
  – getElementsByTagName(...)
• The analysis tracks elements by tag, ID and name:
  <img id="foo" name="bar"/>

Prototype Hierarchy
The complete model has ~250 objects and ~500 properties

Choice of Abstraction
Model the DOM objects as:
• single abstract object
• single abstract object for every element kind
• abstract object for every element in the initial HTML page
[Figure: <img>, <img> and <div> elements grouped under each option; one option is marked "Our Choice"]

Straightforward Hierarchy?
• The image tag looks pretty innocent:
  <img src="a.png" alt=""/>
• Image objects can be created in several ways:
  new Image();
  document.createElement("img");

Example
[Code listing not preserved]

Image Prototype Hierarchy
[Diagram nodes]
• Object (prototype obj)
• HTMLImageElement (prototype obj), HTMLImageElement (instance obj), HTMLImageElement (constructor obj, attached to window)
• Image (prototype obj), Image (instance obj), Image (constructor obj, attached to window)
• Creation sites: new Image(); document.createElement("img");
Blue arrows are internal prototype links
Red arrows are external prototype links

Registration of Event Handlers
• Directly in the HTML source
  – <div onclick="...">
• Using the Browser API
  – setTimeout(...), setInterval(...)
  – addEventListener(...)
• Writes to "magic properties"
  – x.onclick = ...
  – special properties that have side effects on the DOM when written to

Tracking Event Handlers
Separate event handlers based on their kind
– page load (onload)
– keyboard (onkeypress, ...)
– mouse (onclick, onmouseover, ...)
– timed (setTimeout, setInterval, ...)
– etc.

Flow graph Extension
Event handlers are executed by introducing an event-handler loop
– separates page load event handlers from other event handlers
– executes event handlers in two non-deterministic loops

Evaluation
• With these extensions TAJS can reason about JavaScript applications that run in a browser
• Is the analysis precise enough to be useful?

Benchmarks
Evaluated on a series of benchmarks:
– Chrome Experiments
– Internet Explorer 9 Test Drive
– 10K Challenge
– A List Apart
– (excluding benchmarks that use eval or jQuery, or that are not relevant for JavaScript)

Research Questions
Q1: Ability to show absence of errors?
The analysis is able to show that
• 85-100% of call sites are safe
• 80-100% of property reads are safe

Research Questions
Q2: Ability to locate sources of errors?
– We randomly introduce spelling errors
– The analysis is able to pinpoint most of them (details in the paper)

Research Questions
Q3: Precision of computed call graph?
The analysis is able to show that 90-100% of call sites are monomorphic

Research Questions
Q4: Precision of inferred types?
– boolean, number, string, object and undefined
– the analysis is able to show that the average type size is 1.0-1.3
• e.g. if the average type size is 1.0 then every read in the program results in values of a single type

Research Questions
Q5: Ability to detect dead or unreachable code?
– found several unreachable functions
– most appear to be unused library code copied and pasted directly into the benchmark programs

Future / Current Work
• Dynamically generated code
  – eval
• Library support
  – jQuery, MooTools, etc.
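
Looking back at the Flow graph Extension slide, the two non-deterministic loops can be pictured with a small concrete harness. This is only a sketch under stated assumptions: runEventLoops, choose and the sample handlers are hypothetical names invented for illustration, and the real analysis reasons abstractly over all orders rather than executing one of them.

```javascript
// Hypothetical harness (not TAJS internals): load handlers run in a first
// loop, all other handler kinds run in a second loop.
function runEventLoops(loadHandlers, otherHandlers, choose, steps) {
  const trace = [];
  // Loop 1: page-load handlers only, picked in an arbitrary order.
  for (let i = 0; i < steps && loadHandlers.length > 0; i++) {
    const handler = loadHandlers[choose(loadHandlers.length)];
    trace.push(handler());
  }
  // Loop 2: keyboard, mouse, timed and other handlers, also in arbitrary order.
  for (let i = 0; i < steps && otherHandlers.length > 0; i++) {
    const handler = otherHandlers[choose(otherHandlers.length)];
    trace.push(handler());
  }
  return trace;
}

// Stand-in for non-determinism: a random index below n.
const choose = (n) => Math.floor(Math.random() * n);

const state = { ready: false };
const loadHandlers = [() => { state.ready = true; return "load"; }];
const otherHandlers = [
  () => (state.ready ? "click:ok" : "click:too-early"),
  () => (state.ready ? "timer:ok" : "timer:too-early"),
];

const trace = runEventLoops(loadHandlers, otherHandlers, choose, 3);
console.log(trace);
```

Because the load loop always comes first, no handler in the second loop ever observes the page as not yet loaded, whichever order the handlers are picked in.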

Conclusion
Extended previous work to reason precisely about JavaScript programs that execute in a browser-based environment. This allows us to discover general errors such as:
• reading absent properties
• dereferencing null or undefined
• invoking functions with incorrect arguments
• etc.

DOM Modules & Levels
[Table; support marks only partly preserved]
Module \ Level    Level 0    Level 1    Level 2    Level 3
Year              ~1996      1998       2000       2004
Core Module       - ()
HTML Module       - ()
Event Module      - - ()
CSS Module        - - () ()
Browser API       - - -
In addition we support the HTMLCanvasElement from HTML5.

Soundness Issues?
Assignment to computed property names
  foo[bar] = "baz"
  foo[bar] = function() {...}
If the exact value of bar is unknown:
– it could be a write to a "magic property"
– or a registration of an event handler
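
A minimal sketch of the issue above, runnable in plain JavaScript without a browser. The makeElement and assign helpers and the handlers array are hypothetical stand-ins: they imitate a DOM element whose onclick property has a side effect when written, and are not part of TAJS or the DOM API.

```javascript
// Hypothetical stand-in for a DOM element (not a real DOM or TAJS API):
// writing to `onclick` has a side effect, like a "magic property".
function makeElement() {
  const handlers = [];
  const el = { handlers };
  Object.defineProperty(el, "onclick", {
    set(fn) { handlers.push(fn); }, // the write registers an event handler
    get() { return handlers[handlers.length - 1]; },
  });
  return el;
}

// The property name is computed, so an analysis that cannot determine
// `key` exactly has to assume that any property may be written.
function assign(target, key, value) {
  target[key] = value;
}

const el = makeElement();
assign(el, "title", "baz"); // ordinary property write
assign(el, "on" + "click", function () { return "clicked"; }); // handler registration

console.log(el.title);           // prints "baz"
console.log(el.handlers.length); // prints 1
```

Both calls to assign look identical at the assignment site, yet only the second one registers a handler. That is why an unknown value of bar forces the analysis to account for both possibilities.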