FOCUS: ALGORITHMS

Excellence in Search: An Interview with David Chaiken

John Favaro, Intecs SpA, Italy

IEEE Software | Published by the IEEE Computer Society

In June 2011, IEEE Software associate editor John Favaro interviewed David Chaiken, chief architect of search engine giant Yahoo, about algorithms and today’s practitioner. Chaiken gave a keynote speech at SATURN 2011 on “Architecture at Internet Scale” that stressed a set of timeless principles that software engineers seemingly have to relearn continuously.

John Favaro: Tell us a bit about your own background. How and when did you get into computer science?

David Chaiken: I’ve been hacking for as long as I can remember. When I was four years old, my parents sat me down in front of a card punch at RAND. My first program was a greatest-common-denominator solver, which saved me calculation time on my elementary school math homework. I didn’t really get a formal introduction to computer science until I studied at MIT many years later.

John Favaro: You’re an architect now. Does that mean you’re not a programmer anymore?

David Chaiken: I hope that’s not the case! All of our architect job descriptions at Yahoo state, “We value architects who do enough hands-on implementation work to keep current with technology trends inside and outside the company.” To be realistic, I don’t have time to do significant programming as the chief architect of Yahoo, but I try to make time to write some code every year.

John Favaro: In your list of timeless principles that software engineers learn and relearn, the fundamental, overriding principle “zero” is mastering complexity. But right after that, you list a principle you call “algorithms first.” Could you explain a bit what you mean by that? You seem to be giving algorithms a disproportionately high place in the priority of your list of principles.

David Chaiken: Formally speaking, mastering complexity requires a proof of the asymptotic computation, storage, and communication needs of a system.
While we don’t always do formal specifications and proofs of the properties of our algorithms, the underlying behavior of the algorithms factors into our capacity modeling, and therefore our capital and operational expense planning, in a fundamental way. For our larger systems, understanding asymptotic behavior is essential to keeping our business running. In the example that I presented at SATURN 2011, it was absolutely critical to know the specific factors that caused exponential behavior in the common case for one of our advertising systems. Kevin Lang of Yahoo Labs wrote a beautiful analysis called “Notes on Tractability of NGD Ad Serving” that had a profound effect on the way that we rebuilt this system.

John Favaro: What is the role of algorithms in the search industry? What are some of the subareas in search where algorithms have been essential in their development?

David Chaiken: At Yahoo, we view search systems first and foremost as vehicles for delivering the results of algorithms: discovery, content analysis, machine learning, indexing, query analysis, and ranking. Yahoo Labs delivers many of these algorithms, which are instantiated as the inner loops of tasks running on processing pipelines, Hadoop grids, online serving systems, and analytics warehouses.

We believe that the greatest potential for innovation in search is in task completion, which requires a new set of algorithms. By task completion, we mean delivering answers, not links, and satisfying the underlying needs that drive people to search. For example, we need to understand that the query “New York pizza” might mean pizza restaurants in New York, but (especially at meal time for people outside the tri-state area searching on a mobile phone) it often means, “Where is a nearby restaurant that serves New York-style pizza?” Delivering on task completion requires better query analysis, new content sources, and understanding the underlying semantics of the Web of Objects.
It also requires shifting our optimization algorithms from one-dimensional objective functions that determine link order to two-dimensional objective functions that drive page layout.

John Favaro: Tell us a bit about the place of algorithms at Yahoo. Is it true that there’s an entire group of theoretical computer scientists dedicated only to the elaboration and analysis of algorithms? How do they interact with the “normal” programmers? Do the “normal” programmers encounter problems and then throw them over the fence to the theoreticians? Or do they have a more proactive role?

David Chaiken: Yahoo has the best track record of moving algorithms from research to products of any company that I’ve seen in my career. We have such a good track record precisely because we intentionally break down the barriers between researchers and their colleagues in product and operations groups. In some cases, researchers write the code that’s deployed to production. In many other cases, researchers work directly with applied scientists who have the unique blend of talent and patience needed to tune all of the dials required to run an algorithm at scale, sometimes delivering results to hundreds of millions of consumers.

Getting the research-to-product pipeline right is a team sport. I’ve been in code reviews that included the researchers who wrote the original paper on an algorithm, architects who wrote the class structure and templates for a product, applied scientists who tuned and extended the algorithm, programmers who understood how to get the best performance out of the runtime environment, and operations engineers who deployed the system at scale.

The awesome track record of Yahoo Research may lead to the assumption that our theoretical computer scientists aren’t engaged in products. If they’re publishing such high-quality work, then they can’t possibly be engaged in making money, right? It turns out that this assumption isn’t true. To the contrary, Yahoo researchers publish great papers because they’re firmly grounded in the challenges of developing and deploying premier online media products.

DAVID CHAIKEN
David Chaiken is chief architect at Yahoo, where he oversees the technical architecture of all the company’s consumer, advertising, and platform products. Chaiken has been hacking since his parents sat him down in front of an IBM card punch more than 40 years ago. Over his career, he has built voice search products for consumers, mobile enterprise applications, network management systems, project management software, a large-scale multiprocessing system, and five or so information appliances. His favorite technologies include the RSA encryption algorithm, the C programming language, the ARM instruction set architecture, the Fedora distribution of Linux, and the build-on-grid-push-to-serving design pattern. Chaiken earned a PhD in electrical engineering and computer science from MIT.

John Favaro: The search industry seems to be a perfect example of an industry where algorithms are absolutely critical. What are some other examples of industries that come to your mind?

David Chaiken: Two other facets of the digital media industry come instantly to mind. Computational advertising, a discipline led by Andrei Broder, uses algorithms to attack the problem of delivering the right ad to the right consumer at the right time. That might be an unfair answer, because Andrei and his team are great at recasting online advertising problems as search problems. Content optimization, an area led by Raghu Ramakrishnan, applies machine learning and optimization algorithms to personalizing webpages and other digital media products for individual consumers.

Other industries include database technology, graphics (from consumer cameras to blockbuster movies), robotics, security, biotechnology, high-energy physics; my goodness, it’s hard to stop. How long a list do you want?

ABOUT THE AUTHOR
JOHN FAVARO is a senior consultant at Intecs SpA in Pisa, Italy. His research interests include efficient safety analysis of critical systems, real-time architectural patterns, and requirements engineering. Favaro has an MS in electrical engineering and computer science from the University of California, Berkeley. Contact him at john@favaro.net.

January/February 2012 | IEEE Software

John Favaro: I can see how a software engineer in a company like Yahoo would need to be up to speed in algorithms, but honestly, why should the rest of us software engineers care about algorithms today? Isn’t that all “under the hood” now, and we don’t have to worry about it anymore if we’re just programming up mundane IT support tasks?

David Chaiken: I definitely understand this view, because I was oblivious to algorithms for my first 18 years of interaction with computers. At the ripe old age of 22, I was working in data communications at Motorola and realized that I was missing some foundational knowledge that limited my ability to progress from what you might call mundane IT support tasks. I found myself (re)inventing graph algorithms like depth-first search and breadth-first search, and wondering which approach to use.

Put it this way: if you’re satisfied with a career of programming mundane IT support tasks, feel free to ignore algorithms. When you’re ready to take a step up, let your curiosity lead you to study the domains that you need to succeed. It’s almost guaranteed that you’ll find some domain of algorithms that is relevant to what you want to do.

John Favaro: How do you see the state of teaching algorithms in universities now? Are students still getting an adequate grounding like they did 30 to 40 years ago? Or are universities slipping? What do you think an ideal curriculum in algorithms would look like?
David Chaiken: The population of programmers and other information technology professionals has changed radically in size, demographics, interests, and specializations over the last 30 to 40 years. Students have also changed: they demand to be entertained in addition to being taught. However, the fundamentals still seem the same: start with a semester course that surveys different classes of algorithms and trains students in complexity and performance analysis. Then, teach the relevant algorithms in each of the domain-specific topics that students choose to take. Of course, having Ron Rivest as a professor, with one or two guest lectures from Tom Cormen [Eds.: two of the coauthors of the classic Big Book of Algorithms], probably makes me biased toward the high end of the value that I place on a graduate-level course in algorithms.

John Favaro: What sources do you consider to be the best today for software engineers to learn about algorithms and to stay abreast of things?

David Chaiken: Isn’t that why we invented the Internet and the Web in the first place? My understanding is that Bob Taylor wanted his ARPA-funded research groups to work with each other; Tim Berners-Lee wrote HTTP to allow physicists to stay current with each other’s work; and Marc Andreessen created a better way to browse the research content. Jerry Yang, David Filo, and the rest of the gang figured out how to make the technology useful for the rest of the world, but don’t forget that the online world was originally by the nerds, for the nerds!

In addition to using the Web to find what I need, what I still read cover to cover is Communications of the ACM, which has a mix of academic- and industry-focused content that satisfies some of the need to stay current. Every Yahoo employee has access to the ACM and IEEE digital libraries plus Safari Books Online, which provide great technical references. Sometimes, it’s important to go back to our foundational texts.
I just pulled out the CLR book yesterday to do some analysis of IPv6 to IPv4 address hashing, which came up when Yahoo was planning for World IPv6 Day.