Rethinking Browser Performance

Leo Meyerovich

The browser wars are back, and this time around, performance may determine the winner. Browser performance has dictated the engineering of popular websites like Facebook and Google, so the web community faces a crisis: balancing a high-level, accessible ecosystem on one side against squeezing out more performance on the other. We have been reexamining core assumptions in how browsers are architected and have found the design space to be wide open, taking us from proxies in the cloud to multicore computers in our pockets.

Empirically, web application speed impacts the bottom line for software developers. Near its inception, tracking of Google Maps showed that site speedups correlated with usage spikes. Similarly, faster return times for search queries increased usage. [[marissa]] Photo sharing sites have witnessed similar performance-driven phenomena, even if a typical user would not attribute their usage to it. Google's experience has even led them to include page load time in their AdWords ranking algorithm. When the user experience is the money-maker, lowering performance barriers matters.

We already spend an inordinate amount of time on the web every day [[statistic]] -- is there really a performance problem browser developers should dwell upon? First, as seen with Google's actions, performance has a huge impact on user experience. Second, and more compelling, we have hit a wall with mobile devices. To be precise, we hit the power wall. Moore's Law holds, so we are still shrinking our transistors, but we cannot simply keep clocking them up or spending them on complex hardware optimizations: because of energy (battery life) and power (heat) constraints, hardware architects have shifted focus to simpler but parallel architectures. An iPhone's application core is at least 13x slower than a MacBook Pro's two cores in terms of clock cycles, and the gap widens once the mapping from cycles to high-level operations is considered. Something has to change.

Browser developers are facing a fascinating design space in trying to get the most out of our available cycles. As consumers of browsers, we benefit from their efforts, and as developers, we might find lessons for our own programs. In the following sections, after sketching browser performance bottlenecks, we examine three fundamental axes for optimizing browsers. Understanding this design space motivates our research in parallelizing browsers from the bottom up.

Performance Needs for Today and Tomorrow

We first dispel the myth that the browser is network-bound. The predictability of user requests [[cite]] and the rise of browser-side storage interfaces allow the network to be taken off of the critical path within many applications. For applications where this is not true, mobile broadband eliminates many of the remaining network concerns.

Even though we should not consider browsers on laptops and handhelds to be network-bound in theory, poor infrastructure design by both web developers and browser designers may make it so in practice. The IE8 team benchmarked loading the top 25 websites on various browsers, and, including network time, all of the browsers took 3.5-4s to fully load them [[cite]]. In another study, they found that the sites spent, on average, 850ms of CPU time. Once we consider prefetching, caching, and networking advances, the CPU time looks even more conspicuous. Where is time being spent?

Despite the recent emphasis on faster JavaScript runtimes, these sites spent only 3% of their CPU time inside the JavaScript runtime. A JavaScript-heavy website like a webmail client bumps that up to 14%; most of the time goes to laying out the webpage and painting it, and, to a lesser extent, more typical compiler frontend tasks like lexing and parsing.
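As a flavor of how such time accounting is done, here is a minimal C++ sketch of phase-level timing for a browser-like pipeline. The phase names and empty bodies are illustrative stand-ins of our own, not measurements from any real engine.

    #include <chrono>
    #include <cstdio>
    #include <map>
    #include <string>

    // Accumulated wall-clock time per named phase (illustrative phases only).
    static std::map<std::string, double> phase_ms;

    // RAII timer: charges its own lifetime to one phase.
    struct PhaseTimer {
        std::string name;
        std::chrono::steady_clock::time_point start;
        explicit PhaseTimer(std::string n)
            : name(std::move(n)), start(std::chrono::steady_clock::now()) {}
        ~PhaseTimer() {
            auto end = std::chrono::steady_clock::now();
            phase_ms[name] +=
                std::chrono::duration<double, std::milli>(end - start).count();
        }
    };

    void parse()  { PhaseTimer t("parse");  /* lex and parse the page */ }
    void layout() { PhaseTimer t("layout"); /* size and position boxes */ }
    void paint()  { PhaseTimer t("paint");  /* rasterize to the screen */ }

    int main() {
        parse(); layout(); paint();
        for (const auto& p : phase_ms)
            std::printf("%-8s %8.3f ms\n", p.first.c_str(), p.second);
    }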
The story is worse on handhelds and netbooks: even new CPUs like the Atom are still an order of magnitude slower on most benchmarks than their laptop counterparts. From font handling to matching CSS selectors, a lot needs to be faster.

We must also target future workloads. We predict increased use of client-side storage, renewing interest in personal storage and databases. Scripting is playing an increasing role, both in how it interacts with browser libraries and in standalone computations. Finally, we note a push toward 3D and graphics support: as augmented reality applications mature, like Google Maps, Virtual Earth, and Photosynth, we expect the demand to grow even further, especially in the handheld space. Graphics accelerators have large power, energy, and performance benefits over thicker multicore solutions, so we expect to see them even in mobile devices, partially addressing at least one of these coming demands.

The Three Axes of Performance

We do not see one silver bullet for browser performance, but by systematically breaking down how performance can be improved, we can start comparing the feasibility of different approaches. If browsers are CPU-bound, we should reanalyze how our CPU cycles are being used. This leads to three natural axes for optimization.

Axis 1: Use Fewer Cycles

If we consider a traditional desktop application struggling for CPU cycles, we go down the traditional road of optimizing for them. Can we get the desired job done with fewer operations? We are witnessing three basic approaches:

Reduce functionality. Phones have traditionally been under-provisioned environments; this has led to standards like WAP for writing paltry sets of features. A cursory examination of web applications optimized for iPhones reveals similar principles: websites simply remove much of the functionality and subtle rich touches, easing the CPU load. This is a solution for developers bound by the failings of the web environment, but a problem to solve for those creating these environments.

Avoid the abstraction tax. Our group ran a simple experiment: what happens if we naively reimplement Google Maps in C? We witnessed performance improvements of almost two orders of magnitude. During our work on three libraries within browsers, we saw a similar trend: by implementing components more directly, skipping various indirection and safety layers, we observed drastic improvements. Various communities are showing this desire to avoid the browser abstraction; for example, the iPhone, Android, and Native Client platforms let developers code very close to the hardware. Such solutions concern us because they give up (hopefully) more productive high-level languages that are more divorced from the hardware. Furthermore, the high-level nature of the web stack has been useful for building an ecosystem of programs like search engines and browser extensions, which has not emerged to the same extent in lower-level approaches. (A toy illustration of this tax appears below.)
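To make the abstraction tax concrete, here is a small C++ microbenchmark of our own devising (it is not the Google Maps experiment): the same sum computed over heap-allocated, virtually dispatched values, a caricature of a dynamic runtime's data representation, and over a flat array.

    #include <chrono>
    #include <cstdio>
    #include <memory>
    #include <vector>

    // Boxed, dynamically dispatched representation: one heap object per
    // value and a virtual call per access.
    struct Node { virtual double value() const = 0; virtual ~Node() = default; };
    struct Num : Node {
        double v;
        explicit Num(double d) : v(d) {}
        double value() const override { return v; }
    };

    int main() {
        const int n = 2'000'000;
        std::vector<std::unique_ptr<Node>> boxed;
        std::vector<double> flat(n);
        for (int i = 0; i < n; ++i) {
            boxed.push_back(std::make_unique<Num>(i));
            flat[i] = i;
        }
        // Time a callable that returns the sum it computed.
        auto time = [](auto f) {
            auto t0 = std::chrono::steady_clock::now();
            double s = f();
            auto t1 = std::chrono::steady_clock::now();
            std::printf("sum=%g  %.1f ms\n", s,
                std::chrono::duration<double, std::milli>(t1 - t0).count());
        };
        time([&] { double s = 0; for (auto& p : boxed) s += p->value(); return s; });
        time([&] { double s = 0; for (double d : flat) s += d; return s; });
    }

The flat loop should win handily on typical hardware because of allocation, pointer chasing, and dispatch overhead in the boxed version; real browser layers add safety checks and far deeper indirection than this.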
Optimize languages and libraries. Ideally, we can shift the burden to compiler and library developers. Interest in optimizing JavaScript has drastically increased, and as layout engines have been rewritten to be more standards compliant, performance improvements have been incorporated as well. However, while our experience suggests there is a lot of room for sequential optimizations, it is unclear whether the developers of a multimillion-line codebase can maintain fragile optimizations like L1 cache tuning. Proebsting's observation about the alternative -- compiler writers automating such optimizations -- is worth recalling: once a language is reasonably implemented, compiler optimizations might give 4% improvements a year, while hardware has given on average 60% (largely by exploiting Moore's Law and clock scaling). Now that JavaScript runtimes are no longer naively interpreted, we suspect Proebsting's Law will soon start to apply.

Axis 2: Use Parallel Cycles

Even if we have exhausted our budget of sequential operations per second, we can follow Proebsting's Law and look toward exploiting hardware. Increased performance will come largely through increased parallel operations per second. Given an annual energy (power) savings rate of roughly 20-25%, we expect about one additional core per device every year over the next decade. Not taking advantage of a 10x speedup from hardware (as opposed to, say, another projected 2-4x speedup through increasingly difficult compiler optimizations) would be an odd choice.

Unfortunately, while previous hardware advances have allowed us to largely reuse our existing languages and libraries, sufficient automatic parallelization has remained tantalizingly distant, even for functional and dataflow languages. More generally, it is not obvious that we can parallelize programs like browsers. We note some concerns in parallelizing browsers and ways they are being assuaged:

Can browser libraries exploit parallelism? Our group has created and implemented parallel algorithms for lexing and CSS selector matching (which can be thought of as matching a lot of regular expressions against a big tree). Even more surprisingly, we were able to design an algorithm that performs basic layout processing -- determining the sizes and positions of elements on a page -- in parallel, and we are currently implementing it and iterating on its design. We are not alone in exploring this space. For example, parallel parsing has long been considered in the literature, and progress has been sufficient that Firefox's parsers are being parallelized. (A fork-join sketch of tree matching appears after this list of questions.)

Can we exploit parallelism through concurrent calls? We do not just want to parallelize the handling of individual calls into libraries. For example, can two different scripts interact with the layout library at the same time? Part of our process of designing new parallel libraries is to look for such opportunities and to think about how to detect the guarantees the libraries need in order to exploit them. For example, visually independent components, such as those within <iframe> elements, correspond to actors whose layout computations are largely distinct from those of sibling elements.

Will parallelization make browsers more brittle? To increase the integrity of browser runtimes, developers have partitioned core libraries like the layout engine into processes, benefiting from address-space separation and simplifying the management of resources like CPU time. A large part of our focus has been on libraries, where we have been using task-parallel systems like TBB and Cilk++. This forces clearer code structure and interfaces. Our sequential optimizations, like L1 cache tuning, are much more brittle.
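As a flavor of the library-level parallelism in question, here is a small fork-join sketch that counts the nodes of a toy DOM matching a single tag selector, parallelizing across subtrees. It uses std::async for self-containment where our actual libraries use systems like TBB, and a real selector matcher handles far more than tag names.

    #include <future>
    #include <string>
    #include <vector>

    // A toy DOM node: just a tag name and children.
    struct Node {
        std::string tag;
        std::vector<Node> children;
    };

    // Count nodes matching one tag selector, forking per subtree.
    // The depth cutoff keeps us from spawning a thread per node.
    int countMatches(const Node& n, const std::string& tag, int depth = 0) {
        int hits = (n.tag == tag) ? 1 : 0;
        if (depth >= 3) {  // deep in the tree: stay sequential
            for (const auto& c : n.children) hits += countMatches(c, tag, depth);
            return hits;
        }
        std::vector<std::future<int>> futs;
        for (const auto& c : n.children)
            futs.push_back(std::async(std::launch::async, [&c, &tag, depth] {
                return countMatches(c, tag, depth + 1);
            }));
        for (auto& f : futs) hits += f.get();
        return hits;
    }

    int main() {
        Node p1{"p", {}}, p2{"p", {}};
        Node div{"div", {p1, p2}};
        Node body{"body", {div}};
        Node root{"html", {body}};
        return countMatches(root, "p") == 2 ? 0 : 1;
    }

The queries here are independent reads over an immutable tree, which is exactly the kind of guarantee a parallel library must detect or demand before it can safely fork.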
Given energy concerns, parallelization should be work efficient. A common trick in parallelization is to locally duplicate computations in order to avoid communication and synchronization overhead. Work efficiency means that if we were to simulate a parallel algorithm on a sequential computer, it should take about the same amount of time as the sequential algorithm. A common theme in our algorithms is speculative execution: we guess an invariant, process in parallel based on it, and patch up our computation wherever the invariant turns out to be false. For example, supporting concurrent transactional modifications to a webpage is work-efficient if most of those transactions are disjoint and therefore require little rollback. We have already exploited this in two of our trickier algorithms, parallel lexing and layout. The sketch below illustrates the speculative pattern on lexing.
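As a minimal sketch of this speculative pattern, consider lexing whitespace-separated tokens (our toy stand-in for a real lexer). Chunks are lexed in parallel under the guess that each chunk begins at a token boundary; where the guess fails, the patch is a cheap merge rather than a rescan.

    #include <algorithm>
    #include <cctype>
    #include <cstdio>
    #include <future>
    #include <string>
    #include <vector>

    // Tokens are maximal runs of non-space characters (a toy lexer).
    static std::vector<std::string> lexRange(const std::string& s,
                                             size_t lo, size_t hi) {
        std::vector<std::string> toks;
        size_t i = lo;
        while (i < hi) {
            while (i < hi && std::isspace((unsigned char)s[i])) ++i;
            size_t j = i;
            while (j < hi && !std::isspace((unsigned char)s[j])) ++j;
            if (j > i) toks.push_back(s.substr(i, j - i));
            i = j;
        }
        return toks;
    }

    // Speculatively lex fixed-size chunks in parallel, guessing that each
    // chunk starts at a token boundary; patch up where the guess fails.
    std::vector<std::string> parallelLex(const std::string& s, size_t chunks) {
        size_t step = s.size() / chunks + 1;
        std::vector<std::future<std::vector<std::string>>> futs;
        for (size_t c = 0; c < chunks; ++c) {
            size_t lo = c * step, hi = std::min(s.size(), lo + step);
            futs.push_back(std::async(std::launch::async, lexRange,
                                      std::cref(s), lo, hi));
        }
        std::vector<std::string> out;
        for (size_t c = 0; c < chunks; ++c) {
            auto toks = futs[c].get();
            size_t lo = c * step;
            // The guess fails exactly when a token straddles the boundary:
            // merge its two halves instead of rescanning.
            bool straddle = c > 0 && lo < s.size() && !out.empty() &&
                            !std::isspace((unsigned char)s[lo]) &&
                            !std::isspace((unsigned char)s[lo - 1]);
            size_t k = 0;
            if (straddle && !toks.empty()) out.back() += toks[k++];
            for (; k < toks.size(); ++k) out.push_back(std::move(toks[k]));
        }
        return out;
    }

    int main() {
        for (const auto& t : parallelLex("a quick brown fox", 3))
            std::printf("[%s] ", t.c_str());
        std::printf("\n");
    }

For long inputs, patching touches only the handful of tokens on chunk boundaries, so the parallel version does essentially the same total work as the sequential one: the work-efficiency property described above.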
We found some large yet simple opportunities for parallelism, such as the new lexing and CSS selector libraries. However, many other computations span large amounts of code (like layout), and there is a standing challenge in enabling web designers to productively write parallel scripts such as animations.

Axis 3: Compute Elsewhere

If we cannot effectively exploit the cycles available on a personal device, perhaps we can use cycles elsewhere. For example, as latency decreases and bandwidth increases, a model like cloud computing becomes appealing. Web application developers already do this, for example by running database queries on servers and performing UI computations on clients. Recently, we have witnessed reincarnations of X for browsers, allowing a thin client to display the results of running a browser elsewhere, or even just proxying individual plugins like Flash. It is worth reexamining how much computation we can (and should) redistribute.[1]

User experience requirements combined with hardware constraints hint at the limits of offloading computation. For the user experience, perceived latency is crucial. For example, browsers are now optimized to begin displaying parts of a webpage before all of it is available, even though this increases the overall load time. Perceived latency requirements vary by the type of task. Film -- continuous, non-interactive motion -- is generally shot at 24 frames per second, allowing 42ms to compute and render a frame. Many closed-loop systems, in which a user gets feedback while interacting with a system, such as by watching a mouse cursor move on a screen, have an upper bound of 100ms before tasks like verbally communicating or moving an object suffer significantly. For lower bounds, while hand tracking allows delays of 50-60ms, other domains are less forgiving; for example, head-mounted displays with such long delays cause nausea. Finally, we note that there is a difference between delay and sampling rate: gestural and audio interactions should have samples processed at intervals on the order of milliseconds (and without jitter). For something like a hand drum with different strokes, both requirements are in place.

Considering the costs involved in an end-to-end system, and often just the cost of the networking step, it becomes clear that some computations are best left on client devices for the foreseeable future. Assume a wireless device is a thin client for a proxied browser living in the cloud. Even on Internet2, the round trip time from a NYC hub to an LA one is 40ms; luckier users might be communicating between Seattle and LA, wasting only 15ms on this step. Getting to a hub from a client device might cost 5-10ms from the device to a tower and 10ms from the tower to the hub, with similar costs to reach the proxy at the other end. An LCD has a sampling rate of 5-20ms with a delay of 10-60ms, and an input device like a mouse might poll somewhere around every 5-10ms. Adding all these numbers together, even without considering actual computation time, we see that while streaming a movie might be fine, playing games requires some client support.
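Adding up the figures above makes the budget concrete. The following back-of-the-envelope C++ sketch uses the per-hop numbers just given; treating each access leg as crossed twice per round trip (input up to the proxy, pixels back down) is our own reading, and actual routes vary.

    #include <cstdio>

    // Round-trip budget for a proxied browser, in milliseconds, using the
    // per-hop figures above as {best, worst} ranges.
    struct Range { double lo, hi; };

    int main() {
        Range input   {5, 10};   // mouse polling interval
        Range access  {15, 20};  // device -> tower -> hub, one end, one way
        Range hubs    {15, 40};  // hub <-> hub round trip (Seattle vs. NYC)
        Range sample  {5, 20};   // LCD sampling interval
        Range display {10, 60};  // LCD delay
        // Assumption: each access leg is crossed twice per round trip.
        double lo = input.lo + 4 * access.lo + hubs.lo + sample.lo + display.lo;
        double hi = input.hi + 4 * access.hi + hubs.hi + sample.hi + display.hi;
        std::printf("round trip: %.0f-%.0f ms vs. 42 ms/frame, 100 ms feedback\n",
                    lo, hi);
    }

Under these assumptions, even the best case (95ms) exceeds the 42ms frame budget before any computation happens, and the worst case (210ms) blows through the 100ms closed-loop bound, which is why buffered video survives proxying while games do not.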
There are further hardware concerns; for example, [[connectivity]]. Another interesting cost is bandwidth. Browser use, in certain age groups, rivals TV use: proxying rich experiences carries a bandwidth expense that must scale to support mainstream use, and while TV streams might be shared between users, browsing sessions are more individual. Energy factors in again: proposals to increase bandwidth for devices, such as MIMO, often come at the expense of battery life. Finally, there are economic costs. Web server farms cache many of their computations, so they require relatively little computation per user; because there will be less benefit from consolidating devices, proxying browser computations means that the expenses a user typically pays for in a device will need to be somehow covered by proxy providers. While we view latency as the dominating concern, energy, connectivity, monetary cost, and bandwidth also carry significant weight.

The situation is not entirely glum. Large computations should still run on a server. Even interactive computations might be partitioned: for example, we experimented with a real-time mouse cursor but a delayed scrollbar. Furthermore, breaking the browser experience out of a single device may still happen in order to enable new features, such as migrating a browsing session from a laptop to a phone when we leave the house, or remotely executing Flash scripts for today's slower handhelds. There is also the appeal of P2P systems, which may help boost bandwidth and lower latency, and which we are beginning to study. It seems that proxying solutions are best for larger or non-interactive experiences. It is not always clear where to draw this line: for example, Gmail has shown that even though emails should be stored on the client, email search should be performed elsewhere. Finally, we note that computations too intensive for a client device will likely be performed in parallel on the server, so in a sense they are even more suited for parallelization. Off-device computation will happen, but we must still focus on the on-device case.

Conclusion

We are facing an exciting time of architectural transition. Productivity tools such as high-level languages, large libraries, and software as a service are becoming efficient enough to replace traditional approaches. However, we are finding the need for much better performance, especially in the emerging computing class of handhelds. While we have found a lot of room for sequential optimizations, their opportunity cost is high, so we instead advocate exploiting hardware-driven improvements. This is taking place in the form of networked computation and local, parallel computation. Parallelizing our computations is a conservative choice independent of where the computation occurs. We are finding parallelizing on-device browser computations to be the most enticing direction for improving performance.

[1] We mean partitioning computations across different devices. It is also possible to partition over time. For example, search engines might cache popular queries, thick clients might prefetch content, and slower compilers might create faster bytecodes.