Rethinking Browser Performance

Leo Meyerovich
The browser wars are back, and performance might determine the winner this time around. Browser performance has dictated the engineering of popular websites like Facebook and Google, so the web community faces a crisis: balancing a high-level, accessible ecosystem on one side against squeezing out more performance on the other. We have been reexamining core assumptions in how we architect browsers and have found the design space to be wide open, taking us from proxies in the cloud to multicore computers in our pockets.
Empirically, web application speed affects the bottom line for software developers. Near its inception, tracking of Google Maps showed that site speedups correlated with usage spikes. Similarly, faster return times for search queries increased usage [[marissa]]. Photo-sharing sites have witnessed similar performance-driven phenomena, even if a typical user would not attribute their usage to it. Google's experience has even led the company to include page load time in its AdWords ranking algorithm! When the user experience is the money-maker, lowering performance barriers matters.
We already spend an inordinate amount of time on the web every day [[statistic]] -- is there really a performance problem browser developers should dwell upon? First, as seen with Google's actions, performance has a huge impact on user experience. Second, and more compelling, we have hit a wall with mobile devices. To be precise,
we hit the power wall. Moore’s Law holds, so we’re still shrinking our transistors,
but we can’t just keep clocking up or using those transistors for complex hardware
optimizations: because of energy (battery life) and power (heat) constraints,
hardware architects switched focus to simpler but parallel architectures. An
iPhone’s application core is at least 13x slower than a MacBook Pro’s two cores in
terms of clock cycles, and it gets worse once the mapping from cycles to high-level
operations is considered. Something has to change.
Browser developers are facing a fascinating design space in trying to get the most
out of our available cycles. As consumers of browsers, we benefit from their efforts,
and as developers, we might find lessons for our own programs. In the following
sections, after giving an idea of browser performance bottlenecks, we examine three
fundamental axes for optimizing browsers. Understanding this design space
motivates our research in parallelizing browsers from the bottom up.
Performance Needs for Today and Tomorrow
We first dispel the myth that the browser is network-bound. The predictability of user requests [[cite]] and the rise of browser-side storage interfaces allow the network to be taken off the critical path within many applications. For applications where this is not true, mobile broadband eliminates many of the remaining network concerns.
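To make this concrete, the following minimal C++ sketch shows the cache-first pattern that client-side storage enables; the ResourceCache class, its in-memory map, and the placeholder fetch are our own illustrative stand-ins, not any browser's actual storage API:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Toy cache-first loader: consult local storage before touching the network,
// so that in the common case no round trip sits on the critical path.
class ResourceCache {
public:
    std::string load(const std::string& url) {
        auto it = store_.find(url);
        if (it != store_.end()) return it->second;  // hit: network avoided entirely
        std::string body = fetch(url);              // miss: pay the round trip once...
        store_[url] = body;                         // ...then serve locally thereafter
        return body;
    }
private:
    static std::string fetch(const std::string& url) {
        return "<html>body of " + url + "</html>";  // placeholder for a real request
    }
    std::unordered_map<std::string, std::string> store_;  // stand-in for client-side storage
};

int main() {
    ResourceCache cache;
    cache.load("https://example.com/app.js");                       // slow path: simulated fetch
    std::cout << cache.load("https://example.com/app.js") << '\n';  // fast path: served locally
}
```

Predictable requests keep the hit rate high, which is exactly what takes the network off the critical path.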
Even though we should not consider browsers on laptops and handhelds to be
network-bound in theory, poor infrastructure design by both web developers and
browser designers may make it so in practice. The IE8 team benchmarked loading the top 25 websites in various browsers and, including network time, each took 3.5-4s to load fully [[cite]]. In another study, they found that the sites spent, on
average, 850ms of CPU time. Once we consider prefetching, caching, and networking
advances, the CPU time looks even more conspicuous.
Where is time being spent? Despite the recent emphasis on faster JavaScript
runtimes, these sites spent only 3% of their CPU time inside the JavaScript runtimes. A JavaScript-heavy website like a webmail client bumps that share up to 14%; most of the CPU time goes to laying out the webpage and painting it and, to a lesser extent, to more typical compiler frontend tasks like lexing and parsing. The story is worse on handhelds and netbooks because even new CPUs like the Atom are still an order of magnitude slower on most benchmarks than their laptop counterparts. From font
handling to matching CSS selectors, a lot needs to be faster.
We must also target future workloads. We predict increased usage of client-side storage, renewing interest in personal storage and databases. Scripting is playing an increasing role, both in how it interacts with browser libraries and in standalone computations. Finally, we note a push toward 3D and graphics support: as augmented-reality applications like Google Maps, Virtual Earth, and Photosynth mature, we expect the demand to grow even further, especially in the handheld space. Graphics accelerators have large power, energy, and performance benefits over thicker multicore solutions, so we expect to see them even in mobile devices, partially addressing how at least one of these domains will be served.
The Three Axes of Performance
We don’t see one silver bullet for browser performance, but by systematically
breaking down how performance can be improved, we can start comparing the
feasibility of different approaches. If browsers are CPU-bound, we should reanalyze
how our CPU cycles are being used. This leads to three natural axes for optimization:
Axis 1: Use Fewer Cycles
If we consider a desktop application struggling for CPU cycles, we go down the traditional road of optimizing for them. Can we get the desired job done with fewer operations? We're witnessing three basic approaches:
Reduce functionality. Phones have traditionally been under-provisioned
environments; this has led to standards like WAP for writing paltry sets of
features. A cursory examination of web applications optimized for iPhones
reveals similar principles. Websites will simply remove much of the
functionality and subtle rich touches, easing the CPU load. This is a solution
for developers bound by the failings of the web environment, but a problem
to solve for those creating these environments.
Avoid the abstraction tax. Our group ran a simple experiment: what happens if we naively reimplement Google Maps in C? We witnessed performance improvements of almost two orders of magnitude. During our work on three libraries within browsers, we saw a similar trend: by implementing components more directly, skipping various indirection and safety layers, we observed drastic improvements (a toy illustration of this tax appears after this list). Various communities are showing this desire to avoid the browser abstraction. For example, the iPhone, Android, and Native Client platforms let developers code very close to the hardware. We are wary of such solutions because they give up (hopefully) more productive high-level languages that stay divorced from the hardware. Furthermore, the high-level nature of the web stack has been useful for building an ecosystem of programs like search engines and browser extensions, which has not emerged to such an extent in lower-level approaches.
Optimize languages and libraries. Ideally, we can shift the burden to compiler and library developers. Interest in optimizing JavaScript has drastically increased, and as layout engines have been rewritten to be more standards-compliant, performance improvements have been incorporated as well. However, while our experiences suggest there is a lot of room for sequential optimizations, it is unclear whether the developers of a multimillion-line codebase can feasibly maintain fragile optimizations like L1 cache tuning. Proebsting's observation about the alternative, compiler writers automating such optimizations, is worth recalling: once a language is reasonably implemented, compiler optimizations might give 4% improvements a year, while hardware has given on average 60% (largely taking advantage of Moore's Law and clock scaling). Now that JavaScript runtimes are no longer naively interpreted, we suspect Proebsting's Law will soon start to apply.
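As a toy illustration of the abstraction tax (our own microbenchmark sketch, not the Google Maps experiment itself), the following C++ program traverses the same data twice: once through heap-allocated objects and virtual dispatch, standing in for the browser's indirection and safety layers, and once as a flat array:

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <vector>

// "DOM-ish" representation: heap-allocated nodes behind a virtual interface.
struct Node { virtual ~Node() = default; virtual int width() const = 0; };
struct Box : Node {
    int w;
    explicit Box(int w) : w(w) {}
    int width() const override { return w; }
};

int main() {
    const int n = 1 << 22;
    std::vector<std::unique_ptr<Node>> dom;  // indirection: pointer chasing + vtables
    std::vector<int> flat(n);                // "close to the metal": contiguous ints
    for (int i = 0; i < n; ++i) { dom.push_back(std::make_unique<Box>(i)); flat[i] = i; }

    auto time = [](const char* label, auto f) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = f();
        auto t1 = std::chrono::steady_clock::now();
        std::cout << label << ": sum=" << sum << " in "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n";
    };
    time("virtual", [&] { long long s = 0; for (const auto& p : dom) s += p->width(); return s; });
    time("flat",    [&] { long long s = 0; for (int w : flat) s += w; return s; });
}
```

On typical hardware the flat pass wins handily, mostly through cache locality and the absence of per-element calls; a browser pays such taxes at far larger scale, layer upon layer.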
Axis 2: Use Parallel Cycles
Even if we have exhausted our budget of sequential operations per second, we can follow Proebsting's Law and look toward exploiting hardware. Increased performance will come largely through increased parallel operations per second. Given a roughly 20-25% energy savings rate every year, over the next decade we expect about one additional core per device each year. Not taking advantage of a 10x speedup from hardware (as opposed to, say, another projected 2-4x speedup through increasingly difficult compiler optimizations) is an odd choice.
Unfortunately, while previous hardware advances have allowed us to largely reuse
our existing languages and libraries, sufficient automatic parallelization has
remained tantalizingly distant, even for functional and dataflow languages. More
generally, it is not obvious that we can parallelize programs like browsers.
We note some concerns when parallelizing browsers and ways they are being
assuaged:
Can browser libraries exploit parallelism? Our group has designed and implemented parallel algorithms for lexing and CSS selector matching (which can be thought of as matching a lot of regular expressions against a big tree). Even more surprisingly, we were able to design an algorithm that performs basic layout processing -- determining the sizes and positions of elements on a page -- in parallel, and we are currently implementing it and iterating on its design. We are not alone in exploring this space. For example, parallel parsing has long been considered in the literature, and progress has been sufficient that Firefox's parsers are being parallelized.
Can we exploit parallelism through concurrent calls? We do not just want
to parallelize the handling of individual calls into libraries. For example, can
two different scripts interact with the layout library at the same time? Part of
our process of designing new parallel libraries is to look out for such opportunities and to think about how to detect the guarantees the libraries need in order to exploit them. For example, visually independent components, such as
within <iframe> elements, correspond to actors whose layout
computations are largely distinct from sibling elements.
Will parallelization make browsers more brittle? To increase the
integrity of browser runtimes, developers have partitioned core libraries like
the layout engine into processes, benefiting from address-space separation
and simplifying the management of resources like CPU time. A large part of
our focus has been on libraries, where we have been using task-parallel systems like TBB and Cilk++; these force clearer code structure and interfaces. Our sequential optimizations, like L1 cache tuning, are much more brittle by comparison.
Given energy concerns, parallelization should be work-efficient. A common trick in parallelization is to locally duplicate computations in order to avoid communication and synchronization overhead. Work efficiency means that if we were to simulate a parallel algorithm on a sequential computer, it should take about the same amount of time as the sequential algorithm. A common theme in our algorithms is speculative execution: we guess an invariant, process in parallel based on it, and patch up our computation wherever the invariant turns out to be false (a minimal sketch of this pattern follows this list). For example, supporting concurrent transactional modifications to a webpage is work-efficient if most of those transactions are disjoint and therefore require little rollback. We have already exploited this in two of our trickier algorithms, parallel lexing and layout.
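To give the flavor of this speculation, here is a deliberately tiny C++ sketch of chunked parallel lexing -- our simplification for illustration, not the actual algorithm. Each chunk is lexed in parallel under the guessed invariant that it begins at a token boundary; wrong guesses are patched afterwards by merging the tokens split across a seam, a fix-up whose cost is proportional to the number of chunks, preserving work efficiency:

```cpp
#include <algorithm>
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Lex one chunk, speculating that it begins at a token boundary.
std::vector<std::string> lex_chunk(const std::string& s, size_t lo, size_t hi) {
    std::vector<std::string> toks;
    std::string cur;
    for (size_t i = lo; i < hi; ++i) {
        if (s[i] == ' ') { if (!cur.empty()) { toks.push_back(cur); cur.clear(); } }
        else cur += s[i];
    }
    if (!cur.empty()) toks.push_back(cur);
    return toks;
}

int main() {
    const std::string input = "parallel lexing via speculation then patch up the seams";
    const size_t nchunks = 4;
    const size_t step = (input.size() + nchunks - 1) / nchunks;

    // Speculate: lex every chunk in parallel as if it started on a boundary.
    std::vector<size_t> starts;
    std::vector<std::future<std::vector<std::string>>> futs;
    for (size_t lo = 0; lo < input.size(); lo += step) {
        starts.push_back(lo);
        futs.push_back(std::async(std::launch::async, lex_chunk,
                                  std::cref(input), lo, std::min(lo + step, input.size())));
    }

    // Patch up: where a chunk actually began mid-token, merge its first token
    // into the previous chunk's last token. Cost is O(#chunks), so the
    // algorithm stays work-efficient.
    std::vector<std::string> tokens;
    for (size_t c = 0; c < futs.size(); ++c) {
        std::vector<std::string> part = futs[c].get();
        bool split = c > 0 && input[starts[c] - 1] != ' ' && input[starts[c]] != ' ';
        if (split && !part.empty() && !tokens.empty()) {
            tokens.back() += part.front();
            part.erase(part.begin());
        }
        tokens.insert(tokens.end(), part.begin(), part.end());
    }
    for (const auto& t : tokens) std::cout << t << '\n';  // original tokens, in order
}
```

Real lexers must speculate about richer state (inside a string, inside a comment), but the shape is the same: guess, run in parallel, and pay only for the seams that were guessed wrong.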
We found some large yet simple opportunities for parallelism, such as the new
lexing and CSS selector libraries. However, many other computations span large
amounts of code (like layout), and there is also a standing challenge in enabling web
designers to productively write parallel scripts such as animations.
Axis 3: Compute Elsewhere
If we cannot effectively exploit the cycles available on a personal device, perhaps we
can use some elsewhere. For example, as latency is decreasing and bandwidth is
increasing, a model like cloud computing is appealing. Web application developers
already do this, such as by running database queries on servers and performing UI
computations on clients. Recently, we have witnessed reincarnations of X for
browsers, allowing a thin client to display the results of running a browser
elsewhere, or even just proxying individual plugins like Flash. It is worth
reexamining how much computation we can (and should) redistribute. (By redistributing, we mean partitioning computations across different devices; it is also possible to partition over time: search engines might cache popular queries, thick clients might prefetch content, and slower compilers might create faster bytecodes.) User experience requirements combined with hardware constraints hint at the limits of offloading computation.
For the user experience, perceived latency is crucial. For example, browsers are now optimized to begin displaying parts of a webpage before all of it is available, which increases the overall load time but lowers the perceived one. Perceived latency requirements vary by the type of task. Film -- continuous, non-interactive motion -- is generally shot at 24 frames per second, allowing 42ms to compute and render an animation frame. Many closed-loop systems, in which a user gets feedback while interacting with a system, such as by watching a mouse cursor move on a screen, have an upper bound of 100ms before tasks like verbally communicating or moving an object significantly suffer. For lower bounds, while hand tracking allows delays of 50-60ms, other domains are less forgiving: head-mounted displays with such long delays cause nausea, for example. Finally, we note that there is a difference between delay and sampling rate: gestural and audio interactions should have samples processed at intervals on the order of milliseconds (and without jitter). For something like a hand drum with different strokes, both requirements are in place.
Considering the costs involved in an end-to-end system -- often just the costs of the networking step -- it becomes clear that some computations are best left on client devices for the foreseeable future. Let's assume a wireless device is a thin client for a proxied browser living in the cloud. Even on Internet2, the round trip time from a NYC
hub to an LA one is 40ms. Luckier users might be communicating between Seattle and LA, only wasting 15ms on this step. Getting to a hub from a client device might
be 5-10ms from the device to a tower and 10ms from the tower to the hub, with
similar costs to get to the proxy at the other end. An LCD has a sampling interval of 5-20ms with a delay of 10-60ms, and an input device like a mouse might poll somewhere around every 5-10ms. Adding all these numbers together, even without
considering actual computation time, we see that while streaming a movie might be
fine, playing games requires some client support.
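As a worked tally (the split between hops is our own reading of the figures above, taking worst cases and excluding all computation time):

```cpp
#include <iostream>

int main() {
    // Worst-case figures from the text, in milliseconds.
    int client_to_tower = 10;  // device radio hop (5-10ms)
    int tower_to_hub    = 10;  // backhaul from tower to Internet hub
    int hub_to_hub      = 20;  // half of the 40ms NYC-LA round trip
    int hub_to_proxy    = 20;  // "similar costs" on the proxy's side
    int one_way = client_to_tower + tower_to_hub + hub_to_hub + hub_to_proxy;

    int lcd_sample = 20;       // LCD sampling interval (5-20ms)
    int lcd_delay  = 60;       // LCD delay (10-60ms)
    int mouse_poll = 10;       // input polling (5-10ms)

    int total = 2 * one_way + lcd_sample + lcd_delay + mouse_poll;
    std::cout << total << " ms\n";  // 210ms, with zero time spent computing
}
```

Even before any computation, this roughly doubles the 100ms closed-loop budget, which is why interactive workloads need client support.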
There are further hardware concerns. For example, [[connectivity]]. Another
interesting cost is bandwidth. Browser use, in certain age groups, rivals TV use:
proxying rich experiences has an associated bandwidth expense which must scale to
support mainstream use. While TV streams might be shared between users,
browsing sessions are more unique. Energy factors in again: proposals to increase
bandwidth for devices, such as MIMO, are often at the expense of battery life. Finally,
we note the economic costs. Web server farms cache many of their computations, requiring little recomputation; browsing sessions see less benefit from such consolidation, so proxying browser computations means that the expenses a user typically pays for a device must somehow be covered by proxy providers. While we view latency as the dominating concern, energy, connectivity, bandwidth, and economics also weigh significantly.
The situation is not entirely glum. Large computations should still live on a server. Even interactive computations might be partitioned: for example, we experimented with a real-time mouse cursor but a delayed scrollbar. Furthermore, breaking the browser experience out of a single device may still happen in order to enable new features, such as migrating a browsing session from a laptop to a phone when we leave the house, or remotely executing Flash scripts for today's slower handhelds. There is also the appeal of P2P systems, which may help boost bandwidth and lower latency; we are beginning to study them.
It seems that proxying solutions are best for larger or non-interactive experiences. It is not always clear where to draw this line: for example, Gmail has shown that even though emails should be stored on the client, email search should be performed elsewhere. Finally, we note that computations too intensive for a client device will likely be performed in parallel anyway, and, in a sense, are probably even more suited for parallelization. Off-device computation will happen, but we must still focus on the on-device case.
Conclusion
We are facing an exciting time of architectural transition. Productivity-oriented approaches such as high-level languages, large libraries, and software as a service are proving efficient enough to replace traditional ones. However, we're finding the need for much better performance, especially in the emerging computing class of handhelds. While we have found a lot of room for sequential optimizations, their opportunity cost is high, so we instead advocate exploiting hardware-driven improvements. This is taking place in the form of networked computation and local, parallel computation. Parallelizing our computations is a conservative choice independent of where the computation occurs. We are finding parallelizing on-device browser computations to be the most enticing direction for improving performance.