The Software Architect Elevator
Redefining the Architect’s Role in the Digital Enterprise
Gregor Hohpe

The Software Architect Elevator, by Gregor Hohpe

Copyright © 2020 Gregor Hohpe. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editors: Ryan Shaw, Melissa Duffield
Development Editor: Melissa Potter
Production Editor: Deborah Baker
Copyeditor: Octal Publishing, LLC
Proofreaders: Kim Wimpsett, Justin Billing
Indexer: Judith McConville
Cover Design: Randy Comer
Interior Designer: Monica Kamsvaag
Illustrators: Rebecca Demarest, Jose Marzan Jr.

April 2020: First Edition

Revision History for the First Edition
2020-04-07: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492077541 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Software Architect Elevator, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-07754-1
[LSI]

Foreword by Simon Brown

My aspiration to become a software architect stemmed from my interest in the technical side of software design. I really enjoy discussions about how we can best use technology to solve a problem, and how to create codebases that are highly modular, well-structured, and easy to work with. What nobody tells you, though, is that these technical aspects are just one part of the architecture puzzle. It’s not just about technology and designing software. It’s about designing software and solving problems within a specific organizational context, and being aware of what’s happening around you, so that you can successfully navigate and influence that context where necessary. It’s crucial, therefore, that architects realize they need to communicate and influence at different levels, with different audiences, both inside and outside of their immediate team environment.

As an industry, however, we do a relatively poor job teaching software developers how to move into software architecture roles, let alone providing help for those who are currently in such a role. This is especially true for the nontechnical aspects. A quick browse of your favorite bookstore will reveal a plethora of books about software architecture, architectural styles, architectural patterns, DevOps, automation, enterprise architecture, Lean, Agile, and so on. You’ll find far fewer books related to people and communication. And it’s even rarer to find a single book that covers all of these topics. The Software Architect Elevator fills this gap by discussing an architect’s role from a broader set of perspectives than usual.
It will teach you how to avoid the traditional, somewhat dysfunctional “business versus IT” mindset, how to see the bigger picture to map and influence the organizational landscape, how to make effective decisions, how to deal with vendors, and how to communicate across all levels of an organization. All of this is essential for those who want to be successful in their role as an architect. References to additional reading complement the practical tips and techniques presented in the book. Many of the relatable stories will, unfortunately, sound far too familiar! Although Gregor’s stories will relate more to people working in larger organizations with a traditional IT function, many of them are equally applicable to the newer wave of “digital companies.” I’ve been surprised to see some of these situations play out in such organizations, too!

In summary, this is a fabulous book for current and aspiring architects, going beyond what you will find in other books on the subject. It’s a great way to fast-track the collection of tools in your architecture toolbox. I thoroughly recommend this book to aspiring software architects and CTOs alike. Whether you’re looking to broaden your skills and get a feel for what architecture is all about, or you’ve been tasked with improving organizational productivity and performance, there’s something here for everybody.

Simon Brown, author of Software Architecture for Developers

Foreword by David Knott

I remember the first time I was asked to form an architecture team within an IT function. I didn’t know what it meant, but thought it sounded cool, and was confident that I could figure it out. That confidence lasted about five minutes, until a team member asked whether we were going to be technology architects or enterprise architects, and I realized that I didn’t know the difference!
Twenty years later, I am privileged to be chief architect of a global organization, and although I still haven’t found a perfect job description for architects, I have learned that being comfortable with ambiguity is one of the most important attributes of a good architect—asking awkward questions, as my team member illustrated, being another! This book will help you understand what being an architect is like by painting a vivid picture of an architect’s life and mission in the current phase of the information technology revolution. Riding the architecture elevator is how my team and I spend our time: racing from one part of our organization to another, connecting, explaining, questioning, and trying to make good decisions about complex systems with imperfect information. The elevator takes us from code to business strategy and back again, all within the same day. Architecture has been intermittently in and out of fashion within enterprise technology, and architects are sometimes accused of “not making anything.” I believe that architects make two things that are of vital importance and in short supply: they make sense and they make decisions. Whenever architects help their organizations understand a world that is increasingly difficult to grasp, figure out what decisions need to be taken, and help take those decisions in a rational way at the right time, then they have had a good day at the office. And, as this book explains, if you’re not taking meaningful decisions (see Chapter 6), making them explicit, and helping people understand them, you’re not doing architecture. However, these are difficult skills to master. Humans have been shown to be notoriously bad at understanding complexity and at making good decisions with limited information. Architects can help themselves and their companies by adopting techniques and ways of thinking which have been won through years of experience. 
They can create understanding by making sure that they turn learning curves into ramps rather than cliffs and can make better decisions by adopting the language of the market (see Chapter 18), as well as by selling options to the business (see Chapter 9). One of the reasons that architecture has been in and out of fashion is that what organizations need from architects has changed. At many points in my career, the organizations I worked for believed that they wanted me to define their current state and future state, and to figure out the path between them. This was an understandable belief: it seems reasonable to want to know where we are, where we want to go, and how we are going to get there. But it was also based on a static view of the world, in which all change was deviation from a steady state. In today’s world, the technology running any organization must be dynamic, and the organization must be able to change that technology to adapt to economies of speed (see Chapter 35). The job of architects now is to create the conditions for speed and dynamism within their organizations: to satisfy the design goals of change velocity and service quality at the same time (and to help people understand that these goals are not in conflict; see Chapter 40). If you still believe that the job is to define future state architectures delivered through a multiyear plan, you would do well to read Part V of this book. The image of the architecture elevator is apt because it is one of continuous motion running through the center of an organization. Elevators are also a transformational technology: they are one of the inventions which made skyscrapers possible, and changed our skylines forever. If you want to be an architect then you are signing up for a life of movement and transformation. If you are curious and have a need to explain, a desire to connect, and the drive to make decisions, then it might be the job for you. 
You still won’t get a job description that describes everything you need to do, but this book will help you figure it out.

Dr. David Knott, Chief Architect, HSBC

About This Book

As the digital economy changes the rules of the game for traditional enterprises, the role of architects also fundamentally changes. Rather than focus on technical implementations alone, they must connect the organization’s penthouse, where the business strategy is set, with the technical engine room, where the enabling technologies are implemented. Only if both parts are connected can IT change its role from a cost center to a competitive digital advantage. Making this connection by walking from one organizational floor to the next won’t work. Instead, modern architects bypass existing structures by taking the fast track: the Architect Elevator. This book helps (aspiring) architects embrace a new view of what it means to be an architect and equips them to ride the architect elevator across many levels, aligning organization and technology and effecting lasting change.

A Chief Architect’s Life: It’s Not That Lonely at the Top

A lot is expected from IT leaders and chief architects: they must maneuver in an organization in which IT is often still seen as a cost center, operations means “run” as opposed to “change,” and middle management has become cozy, understanding neither the business strategy nor the underlying technology. All the while, they are expected to stay up to date with the latest technology, manage vendors, translate buzzwords into a solid strategy, and recruit top talent. It’s no surprise, then, that senior software and IT architects have become some of the most sought-after IT professionals around the globe. With such high expectations, though, what does it take to become a successful chief architect? And after you get there, how do you keep up?
When I became a chief IT architect, I wasn’t expecting any magic answers, but I was looking for a book that would at least spare me from having to reinvent the wheel all the time. I attended many CIO/CTO events, which proved useful but focused mainly on high-level direction instead of on how to actually accomplish the mission on a technical level. Having been unable to find such a book, I decided to collect my experience of over two decades as software engineer, consultant, startup cofounder, and chief architect into a book of my own.

What Will I Learn?

This book is organized into major sections that correspond to an architect’s journey of supporting a large-scale IT transformation. The journey begins close to the IT engine room and slowly inches up to the organizational penthouse:

Part I, Architects
Understanding the qualities of an architect in the enterprise context

Part II, Architecture
Redefining architecture’s value proposition as a change driver

Part III, Communication
Conveying technical topics effectively to a variety of stakeholders

Part IV, Organizations
Using an architectural mindset to understand organizational structures and systems

Part V, Transformation
Effecting lasting change in an organization

Part VI, Epilogue: Architecting IT Transformation
Living the life of a change agent

You are invited to read this book from beginning to end, following the progression from technical to organizational topics. However, you are just as welcome to peruse the book and start reading whichever chapter piques your interest, using the extensive cross-references I’ve provided to aid your nonlinear navigation. After all, that’s how the internet works, so I figured it would probably also work for my book.

This isn’t a technical book. It’s a book about how to grow your horizon as an architect to effectively apply your technical skill in large organizations.
This book won’t teach you how to configure a Hadoop cluster or how to set up container orchestration with Docker and Kubernetes. Instead, it teaches you how to reason about large-scale architectures; how to ensure your architecture benefits the business strategy; how to leverage vendors’ expertise; and how to communicate critical decisions to upper management.

Is It Proven to Work?

If you’re looking for a scientifically proven, repeatable “method” of transforming a technical organization, you might be disappointed (but if you are, please let me know). This book’s structure is rather loose, and you might even be annoyed at having to read through little anecdotes when all you want is the one bit of advice you need in order to be successful. However, that’s what the life of an architect is like. You won’t be able to copy-paste other people’s decisions, but you can learn from their experience to make better decisions of your own.

This book is based on my daily experiences of two decades in IT, which led me through being a startup cofounder (lots of fun, not lots of money), system integrator (made tax audits more efficient), consultant (lots of PowerPoint), author (collecting and documenting insights), internet software engineer (building the future), chief architect of a large multinational organization (tough, but rewarding), and CTO advisor (lots of insights and sharing).

I felt that taking a personal account of IT transformation might be appropriate because architecture is by nature a somewhat personal business. When looking at a famous building, you can easily identify the architect from afar. White box: Richard Meier; all crooked: Frank Gehry; looks like it’s made from fabric: Zaha Hadid. Although not quite as dramatic, every (chief) IT architect also has their personal emphasis and style that’s reflected in their works.
The collection of insights that make up this book reflects my personal point of view but is written such that the “nuggets” can be easily extracted and put to broader use. Sidebars show you experiences from both traditional and digital companies.

Architects are busy people. I therefore tried to package my insights into anecdotes that are easy to consume and hopefully even a bit fun to read. I hope you’ll experience a mix of “I’m not the only person facing this problem” and “that’s a new way of looking at things” along the way. There’s a lot more to say about architecture and transformation than would ever fit into a book. You’ll therefore find many references to other books and articles that help you dive deeper into any particular topic.

Tell Me a Story

I chose to structure the book as a collection of stories because in our complex world, telling stories is a great way to teach. Studies have shown that people remember stories much better than sheer facts, and there appears to be evidence that listening to a story activates additional parts of our brain that help with understanding and retention. Aristotle already knew that a good speech contains not only logos, the facts and structure, but also ethos, a credible character, and pathos, emotions, usually triggered by a good story.

To transform an organization, you don’t need to solve mathematical equations. You need to move people, and that’s why you need to be able to tell a good story and paint a compelling vision. It’s fine to start out by using some of the attention-catching slogans from this book (“Zombies will eat your brain!”) and later supplement them with your own stories. Have you seen people cry and laugh when watching movies, even though they know full well that the story is fictitious and all the acting is fake? That’s the power of storytelling in action.

Conventions Used in This Book

This book contains many real-life stories that highlight the contrast between traditional and digital companies.
The respective examples are indicated by the following icons:

The “manager” icon indicates examples describing how traditional IT organizations think and work.
The “digital native” icon indicates examples describing how modern, “digital” organizations operate.
This icon signifies a general note or comment.
This icon indicates a warning or caution.

Staying Up-to-Date

My brain doesn’t stop generating new ideas with the publication date. To see what’s on my mind and to chime in:

Follow me on Twitter: https://twitter.com/ghohpe
Find me on LinkedIn: http://www.linkedin.com/in/ghohpe
Find bigger ideas and articles on my blog: https://architectelevator.com/blog

O’Reilly Online Learning

NOTE: For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

You can access the web page for this book, where we list errata and any additional information, at https://oreil.ly/Software_Architect_Elevator. Email bookquestions@oreilly.com to comment or ask technical questions about this book. For more information about our books and courses, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

Many people have knowingly or unknowingly contributed to this book through hallway conversations, meeting discussions, manuscript reviews, Twitter dialogues, or casual chats over a beer. It’s challenging to give due credit to all of the people I learned from, but I’d like to highlight a few whose input has significantly shaped this book. Michael Plöd, Simon Brown, Jean-Francois Landreau, and Michele Danieli have been a substantial source of suggestions and feedback. Matthias “Maze” Reik has been an enormously thorough proofreader, while Andrew Lee spotted a few more typos. My former boss, Barbara Karuth, reviewed and approved many stories that emerged from insightful conversations with current and former colleagues. And, last but certainly not least, Kleines Genius provided untiring moral support.

Part I. Architects

Architects have an exciting but sometimes challenging life in corporate IT. Some managers and technical staff might consider them to be overpaid ivory tower residents who, detached from reality, bestow their thoughts upon the rest of the company with slides and wall-sized posters, while their quest for irrelevant ideals causes missed project timelines. On the upside, IT architects have become some of the most sought-after IT professionals as traditional enterprises are looking to transform their IT landscape to compete with digital disruptors. Ironically, though, many of the most successful digital companies have a world-class software and systems architecture, but don’t have architects at all. So, what makes a person an architect, besides that it’s printed on their business card?

What Architects Are Not

Sometimes, it’s easier to describe what something isn’t rather than trying to come up with an exact definition of what it is.
In the case of architects, exaggerated expectations can paint a picture of someone who solves intermittent performance problems in the morning and then transforms the enterprise culture in the afternoon. This leads to a scenario in which architects are pulled into several roles that clearly miss the purpose of being an architect:

Senior developer
Developers often feel they need to become an architect as the next step in their career (and their pay grade). However, becoming an architect and becoming a superstar engineer are two different career paths, with neither being superior to the other. Architects tend to have a broader scope, including organizational and strategic aspects, whereas engineers tend to specialize and deliver running software. Mature IT organizations understand this and offer parallel career paths.

Firefighter
Many managers expect architects to be able to troubleshoot and solve any crisis based on their broad understanding of the current system landscape. Architects shouldn’t ignore production issues, because these provide valuable feedback on possible architectural weaknesses. But an architect who runs from one fire drill to the next won’t have time to think about architecture. Architecture isn’t operations.

Project manager
Architects must be able to juggle many distinct but interrelated topics. Their decisions also take into account—and affect—project timelines, staffing, and required skill sets. As a result, upper management often comes to rely on architects for project information, especially if the project manager is busy filling out status report templates (Chapter 30). This is a slippery slope for an architect: the work is valuable, but it distracts from the architect’s main responsibility.

Scientist
Architects need to sport a sharp intellect and must be able to think in models and systems (Chapter 10), but the decisions they make impact real business projects.
Hence, many organizations separate the role of the chief architect from that of a chief scientist. Personally, I prefer the title chief engineer to highlight that architects produce more than paper. Lastly, although scientists may get their papers published by making things sound complex and difficult to understand, an architect’s job is the inverse: making complex topics easy to digest (Chapter 18).

Many Kinds of Architects

Architects operate at different levels of abstraction. Just as real-life architecture has city planners, building architects, landscape architects, and interior architects, IT architects have many specializations: network architects, security architects, software architects, solution architects, enterprise architects, and many more. Just like in the real world, no one type of architect is more important than the others. For example, living in a house with great architecture in a poorly planned city with endless traffic jams but few public facilities is going to be just as frustrating as living in a house with poor architecture in a well-functioning city. The same is true in IT: your beautifully designed and perfectly modularized application isn’t any good if it solves the wrong problem or duplicates an existing application. Likewise, if the application is unable to connect to the corporate network, few users will be able to appreciate it. Therefore, it’s not about which type of architect is more important; it’s about getting all types of architects to work together.

Architects Deal with Nonrequirements

It’s commonly assumed that developers deal with functional requirements, whereas architects deal with nonfunctional requirements, often referred to as the “ilities”: scalability, maintainability, availability, interoperability, and so on. The reality isn’t as simple, though. I find that more often, architects deal with nonrequirements. This term doesn’t indicate things that aren’t required; rather, it refers to requirements that aren’t stated anywhere.
This includes context, tacit assumptions, hidden dependencies, and other things that were never spelled out. Unearthing these implicit requirements and making them explicit is one of an architect’s most valuable contributions. Again, this work can take place at any level from enterprise architect to software architect—it’s the connection that counts.

Measuring an Architect’s Value

Articulating an architect’s value isn’t always easy. I often explain to people that if an IT system can still absorb high rates of change after many years, the project team probably included a good architect. Now, waiting several years to assess an architect’s value is slightly impractical. Instead, we can see architects bringing value in several dimensions:

Architects “connect the dots”
Often, each individual element of an IT architecture is well thought out and well run, but the sum of all these fine systems still isn’t delivering what the business needs. Architects look between the boxes to make sure interdependencies are well understood.

Architects see trade-offs
System design and development involves innumerable decisions. Most meaningful decisions don’t just have upsides, but also downsides. Architects see both sides of the coin and balance trade-offs in line with overarching goals and principles.

Architects look beyond products
Too much of IT decision-making is driven by product selection (Chapter 16). Architects look beyond the product names and feature lists to distill decision options and trade-offs.

Architects articulate strategy
IT’s purpose is to support the business strategy. Architects establish this linkage by translating business needs into technical drivers.

Architects fight complexity
IT is complex. Architects harmonize to reduce complexity, for example, through governance in the form of architecture review boards and inception (see Chapter 32). Fighting complexity also includes “retiring” systems (in the Blade Runner sense of the word), lest you end up living among zombies (Chapter 12).
Architects deliver
Staying grounded in reality and receiving feedback on decisions from real project implementations is vital for architects. Otherwise, control remains an illusion (Chapter 27).

So, architects do a lot more than draw pretty architecture diagrams!

Architects as Change Agents

Today’s successful architects aren’t just IT specialists, they’re also major change agents. Architects must therefore possess a special set of skills beyond just technology. The chapters in this part prepare you for this role by teaching you how to:

Chapter 1, The Architect Elevator
Transcend organizational levels by riding the architect elevator.

Chapter 2, Movie-Star Architects
Adopt multiple personas that might resemble movie characters.

Chapter 3, Architects Live in the First Derivative
Live in the first derivative.

Chapter 4, Enterprise Architect or Architect in the Enterprise?
Connect business and IT.

Chapter 5, An Architect Stands on Three Legs
Bring more than skill because that’s just one of the three legs architects stand on.

Chapter 6, Making Decisions
Exercise good decision-making discipline in the face of uncertainty.

Chapter 7, Question Everything
Get to the root of problems by questioning everything.

Chapter 1. The Architect Elevator
From the Penthouse to the Engine Room and Back

Tall buildings need someone to ride the elevator.

Architects play a critical role as a connecting and translating element, especially in large organizations where departments speak different languages, have different viewpoints, and drive toward conflicting objectives. Many layers of management only exacerbate the problem as communicating up and down the corporate ladder resembles the telephone game.1 The worst-case scenario materializes when people holding relevant information or expertise aren’t empowered to make decisions, whereas the decision makers lack relevant information.
Not a good state to be in for a corporate IT department, especially in the days when technology has become a driving factor for most businesses.

The Architect Elevator

Architects can fill an important void in large enterprises: they work and communicate closely with technical staff on projects, but are also able to convey technical topics to upper management without losing the essence of the message (Chapter 2). Conversely, they understand the company’s business strategy and can translate it into technical decisions that support it. If you picture the levels of an organization as the floors in a building, architects can ride what I call the architect elevator: they ride the elevator up and down to move between a large enterprise’s boardroom and the engine room where software is being built. Such a direct linkage between the levels has become more important than ever in times of rapid IT evolution and digital disruption.

Stretching the analogy to that of a large ship, if the bridge officers spot an obstacle and need to turn the proverbial tanker, they will set the engines to reverse and the rudder to hard starboard. But if in reality the engines are running full speed ahead, a major disaster is inevitable. This is why even old steamboats had a pipe to echo commands directly from the captain to the boiler room and back. In large enterprises, architects need to play exactly that role!

Some Organizations Have More Floors Than Others

Coming back to the building metaphor, the number of floors an architect has to ride in the elevator depends on the type of organization. Flat organizations might not need the elevator at all—a few flights of stairs are sufficient. This also means that the up-and-down role of an architect might be less critical: if management is keenly aware of the technical reality at the necessary level of detail and technical staff have direct access to senior management, fewer “enterprise” architects are needed.
We could say that digital companies live in a bungalow and hence don’t need the elevator. However, classic IT shops in large organizations tend to have many, many floors above them. They work in a skyscraper so tall that a single architect elevator might not be able to span all levels. In this case, it’s OK if a technical architect and an enterprise architect meet in the middle and cover their respective “halves” of the building. The value of the architects in this scenario shouldn’t be measured by how “high” they travel but by how many floors they span.

In large organizations, the folks in the penthouse might make the mistake of seeing and valuing only the architects in the upper half of the building. Conversely, many developers or technical architects consider such “enterprise” architects less useful because they don’t code. This can be true in some cases—such architects often enjoy life in the upper floors so much that they aren’t keen to take the elevator down ever again. But an “enterprise” architect who travels halfway down the building to share the strategic vision with technical architects can have a significant value.

Not a One-Way Street

Invariably you will meet folks who ride the elevator, but only once to the top and never back down. They enjoy the good view from the penthouse too much and feel that they didn’t work so hard to still be visiting the grimy engine room. Frequently you can identify these folks by statements like: “I used to be technical.” I can’t help but retort: “I used to be a manager” (it’s true) or “Why did you stop?
Were you no good at it?” If you want to be more diplomatic (and philosophical) about it, cite Fritz Lang’s movie Metropolis, in which the separation between penthouse and engine room almost led to the city’s complete destruction before people realized that “the head and the hands need a mediator.” In any case, the elevator is meant to be ridden up and down. Eating caviar in the penthouse while the basement is flooded isn’t the way to transform corporate IT. Riding the elevator up and down the organization is also an important mechanism for the architect to obtain feedback on decisions and to understand their ramifications at the implementation level. Long project implementation cycles don’t provide a good learning loop (Chapter 36) and can lead to an “Architect’s Dream, Developer’s Nightmare” scenario, in which the architects have achieved their abstract ideals, but the actual implementation is impractical. Allowing architects to only enjoy the view from high up invariably leads to the dreaded authority without responsibility antipattern.2 This pattern can be broken only if architects have to live with, or at least observe, the consequences of their decisions. To do so, they must keep riding the elevator. High-Speed Elevators In the past, IT decisions were fairly far removed from the business strategy: IT was pretty “vanilla,” and the main parameter, or key performance indicator (KPI), was cost. Therefore, riding the elevator wasn’t as critical because new information was rare. Nowadays, though, the linkage between business goals and technology choices has become much more direct, even for “traditional” businesses. For example, the desire for faster time-to-market to meet competitive pressures translates into the need for an elastic cloud approach to computing, which in turn requires applications that scale horizontally and thus should be designed to be stateless.
Targeted content on customer channels necessitates analytical models, which are tuned by churning through large amounts of data via a Hadoop cluster, which in turn favors local hard-drive storage over shared-network storage. The fact that in one or two sentences a business need has turned into application or infrastructure design highlights the need for architects to ride the elevator. Increasingly, they need to take the express elevator, though, to keep up with the pace at which business and IT are intertwined. In traditional IT shops, the lower floors of the building can be exclusively occupied by external consultants (Chapter 38), which allows enterprise architects to avoid getting their hands dirty. However, because it focuses solely on efficiency and ignores economies of speed (Chapter 35), such a setup doesn’t perform well in times of rapid technology evolution. Architects who are used to such an environment must stretch their role from being pure consumers of vendors’ technology roadmaps to actively shaping them. To do so, they need to develop their own IT worldview (Chapter 16). Other Passengers If you are riding the elevator up and down as a successful architect, you might encounter other folks riding with you. You might, for example, meet business or nontechnical folks who learned that a deeper understanding of IT is critical to the business. Be kind to those folks, take them with you, and show them around. Engage them in a dialogue—it will allow you to better understand business needs and goals. They might even take you to the higher floors you haven’t been to. You might also encounter folks who ride the elevator down merely to pick up buzzwords to sell as their own ideas in the penthouse. We don’t call these people architects. People who ride the elevator but don’t get out are commonly called lift boys. They benefit from the ignorance in the penthouse to pursue a “technical” career without touching actual technology.
You might be able to convert some of these folks by getting them genuinely interested in what’s going on in the engine room. If you don’t succeed, it’s best to maintain the proverbial elevator silence, avoiding eye contact by examining every ceiling tile in detail. Keep your “elevator pitch” for those moments when you share the cabin with a senior executive, not a mere messenger. The Dangers of Riding the Elevator You would think that architects riding the elevator up and down are highly appreciated by their employer. After all, they provide significant value to businesses transforming their IT to better compete in a digital world. Surprisingly, such architects can encounter resistance. Both the penthouse and the engine room might actually have grown quite content with being disconnected: the company leadership is under the false impression that the digital transformation is proceeding nicely, whereas the folks in the engine room enjoy the freedom to try out new technologies without much supervision. Such a disconnect between penthouse and engine room resembles a cruise ship heading for an iceberg with the engines running at full speed ahead: by the time the leadership realizes what’s going on, it’s likely too late. I was once criticized by the engine room for pushing the corporate agenda against the will of the developers, while at the same time corporate leadership chastised me for wanting to try new solutions just for fun. Ironically, this likely meant I had found a good balance. One can liken such organizations to the Leaning Tower of Pisa, where the foundation and the penthouse aren’t vertically aligned. Riding the elevator in such a building is certainly more challenging. When stepping into such an environment, the elevator architect must be prepared to face resistance from both sides. No one ever said being a disruptor is easy, especially as systems resist change (Chapter 10).
The best strategy in these situations is to start linking the levels carefully, waiting for the right moment to share information. For example, you could begin by helping the folks in the engine room convey to management what great work they are doing. It will give them more visibility and recognition while you gain access to detailed technical information. Other corporate denizens not content with you riding the elevator can be found on the middle floors: seeing you whiz by to connect leadership and the engine room makes them feel bypassed. Thus, the organization has an “hourglass” shape of appreciation for your work: top management sees you as a critical transformation enabler, whereas the folks in the engine room are happy to have someone to talk to who actually understands and appreciates their work. The folks in the middle, though, see you as a threat to their livelihood, including their children’s education and their vacation home in the mountains. This is a delicate affair. Some might even actively block you on your way: being stopped at every floor to give an explanation—aka aligning (Chapter 30)—makes riding the elevator hardly faster than taking the stairs. Lastly, because folks riding the elevator are rare, being good at one thing often leads others to conclude that you aren’t good at anything else. For example, architects giving meaningful and inspiring presentations to management are often assumed to not be great technologists, even though that’s the very reason their presentations are meaningful. So, every once in a while, you’re going to want to let the upper floors know that you can hold your own down in the engine room. Flattening the Building Instead of tirelessly riding the elevator up and down, why not get rid of all those unnecessary floors? After all, the digital companies your business is trying to compete with have far fewer floors. Unfortunately, you can’t simply pull some floors out of a building.
And blowing the whole thing up just leaves you with a pile of rubble, not a lower building. The folks on the middle floors are often critical knowledge holders about the organization and IT landscape, especially if there’s a large black market (Chapter 29), so the organization can’t function without them in the near term. Flattening the building little by little can be a sound long-term strategy, but it takes time because it requires fundamental changes to the company culture. It also changes or eliminates the role played by the folks inhabiting the middle floors, who will put up fierce resistance. This isn’t a fight an architect can win. However, an architect can start to loosen things up a little bit; for example, by getting the penthouse interested in information from the engine room or by providing faster feedback loops. 1 In the telephone game, children form a circle and relay a message from one child to the next. By the time the message returns to the originator, it typically has completely changed along the way. 2 “Authority Without Responsibility,” Wikiwikiweb, 2004, https://oreil.ly/WhXg-. Chapter 2. Movie-Star Architects Most Architects Carry Multiple Personas The architect walk of fame What should an architect be doing besides riding the elevator? Let’s try another analogy: movie characters. Before the movie starts, you get to watch advertisements or short films. In our case, it’s a short film about the origin of the word architect: it derives from the Greek ἀρχιτέκτων (architekton), which roughly translates into “master builder.” Keeping in mind that this word was meant for people who built houses and structures, not IT systems, we should note that the word implies “builder,” not “designer”—an architect should be someone who actually builds, not someone who only draws pretty pictures.
An architect is also expected to be accomplished enough in their profession to deserve the attribute of being a “master.” Now to the main feature… The Matrix: The Master Planner If you ask tech folks to name a prototypical architect in the movies, they’ll likely mention The Matrix trilogy. The Architect of the Matrix is (per Wikipedia) a “cold, humorless, white-haired man in a light-gray suit,” qualities he largely owes to the fact that he is a computer program himself. Wikipedia also notes that the Architect “speaks in long logical chains of reasoning,” something that many IT architects are known to do. So perhaps the analogy holds? Fun fact: Vint Cerf, one of the key architects of the internet, bears a remarkable resemblance to the Matrix Architect. Considering Vint designed much of the Matrix we live in, this might not be pure coincidence. The Matrix Architect is also the ultimate authority: he designed the “Matrix” (the computer program that simulates reality to humans who are being farmed by machines as an energy source) and knows and controls everything. The enterprise architect is sometimes seen as such a person—the all-knowing decision maker. Some even wish themselves into such a role, partly because it is neat to be all-knowing and partly because it gets you a lot of respect. Naturally, this role model has some issues: all-knowingness turns out to be a little too ambitious for humans, leading to poor decision-making and all sorts of other problems. Even if the architect is a super-smart person, they can base decisions on only those facts that are known to them. In large companies with complex IT landscapes, it would be impossible to stay in touch with all the technology that is in place, no matter how often they ride the elevator (Chapter 1) down to the engine room. They’ll therefore inevitably need to rely on presentations, documents, or statements from management on the middle floors.
Such an information channel to the supreme decision maker has severe challenges: every floor through which such a document passes understands its value as an influencing vehicle. This means that the middle floors are tempted to inject their favorite messages and project proposals, often irrespective of any technical merit. Further up, any real technical content or potentially controversial topics are sure to be removed. As a result, the top architect is being fed indirect, distorted, and often biased information. Making decisions based on such information is dangerous. I have observed senior management briefings in which the proposed solutions were more a list of people’s favorite projects than actual solutions. Interestingly, and fortunately, despite having less IT experience, top management sensed that there was a missing link between the two. In summary: corporate IT is no movie, and its role isn’t to provide an illusion for humans being farmed as power sources. We should be cautious with this architect model. Edward Scissorhands: The Gardener A slightly more fitting analogy for enterprise architects is that of a gardener. I tend to depict this metaphor with a character from one of my favorite movies, Edward Scissorhands. Large-scale IT is much like a garden: things evolve and grow on their own, with weeds growing the fastest. The role of the gardener is to trim and prune what doesn’t fit and to establish an overall balance and harmony in the garden, keeping in mind the plants’ needs. For example, shade-loving plants should be planted near large trees or bushes, just like automated testing and continuous integration (CI)/continuous delivery (CD) will be happier in the neighborhood of rapidly evolving systems. A good gardener, just like a good architect, is no dictatorial master planner and certainly doesn’t make all the detailed decisions about which direction a blade of grass should grow—Japanese gardens being a possible exception.
Rather, gardeners see themselves as caretakers of a living ecosystem. Some gardeners, like Edward, are true artists! I like this analogy because it has a soft touch to it. Complex enterprise IT does feel organic, and good architecture has a sense of balance, which we can often find in a nice garden. Top-down governance with weed killer is unlikely to have a lasting effect and usually does more harm than good. Whether this thinking leads to a new application for The Nature of Order,1 I am not sure yet. I should go read it. Vanishing Point: The Guide Erik Dörnenburg, ThoughtWorks’ head of technology, Europe, introduced me to another very apt metaphor. Erik works closely with many software projects, which tend to loathe the ostensibly all-knowing, all-decision-making architect who is disconnected from reality. Erik even coined the term architecture without architects, which might cause some architects to worry about their career. Erik likens an architect to a tour guide, someone who has been to a certain place many times, can tell a good story about it, and can gently guide you to pay attention to important aspects and avoid unnecessary risks. This is a guiding role: tour guides cannot force their guests to follow their advice, except maybe those who drop off a busload of tourists at a tourist-trap restaurant in the middle of nowhere. This type of architect needs to “lead by influence” and must be hands-on enough to earn the respect of those whom they’re leading. The tour guide also stays along for the ride and doesn’t just hand a map to the tourists like some consultant architects are known to do. An architect who acts as a guide often depends on strong management support because evidence that good things happened due to their guidance can be subtle. In purely “business case–driven” environments, this can limit the “tour guide” architect’s impact or career.
An unconventional guide from another of my favorite movies is the blind DJ Super Soul from the 1971 road movie Vanishing Point. Like so many IT projects, the movie’s protagonist, Kowalski, is on a death march to meet an impossible deadline and overcome numerous obstacles along the way. He isn’t delivering code, but a 1970 Dodge Challenger R/T 440 Magnum from Denver to San Francisco—in 15 hours. Kowalski is being guided by Super Soul, who has tapped the police network, just like architects plugging into the management network, to get access to crucial information. The guide tracks Kowalski’s progress and keeps the hero clear of all sorts of traps that police (i.e., management) have set up. After Super Soul is compromised by “management,” the “project” goes adrift and ends like too many IT projects: in a fiery crash. The Wizard of Oz Architects can sometimes be seen as wizards who can solve just about any technical challenge. Although that can be a short-term ego boost, it’s not a good job description and expectation to live up to. Hence, by the “wizard” architect analogy, I don’t mean an actual wizard waving a magic wand but the “Mighty Oz”: a video projection that appears large and powerful but is in fact controlled by an ordinary man who uses the big machinery to garner respect. A gentle dose of such engineered deception can be of use in large organizations in which “normal” developers are rarely involved in management discussions or major decisions. This is where the “architect” title can be used to make oneself a bit more “great and mighty.” The projection can garner the respect of the general population and can even be a precondition to taking the elevator to the top floors. Is this cheating? I would say “no” as long as you don’t get so enamored with the wizardry that you forget about your technical roots. Superhero? Superglue!
Similar to the wizard, a common expectation of an architect is that of the superhero: if you believe some job postings, enterprise architects can single-handedly catapult companies into the digital age, solve just about any technical problem, and are always up to date on the latest technology. These are tough expectations to fulfill, so I’d caution any architect against taking advantage of this common misconception. Amir Shenhav from Intel appropriately pointed out that instead of the superhero we need “super glue” architects—the people who hold architecture, technical details, business needs, and people together across a large organization or complex projects. I like this metaphor because it resembles the analogy of an architect being a catalyst. We just need to be a little careful: being the glue (or catalyst) means understanding a good bit about the things you glue together. It’s like being a good matchmaker: you need to find matching parts, and to do that you need to understand what your parts are made of. Making the Call Which type of architect should you be? First, there are likely many more types and movie analogies. You could play Inception and create architectural dream worlds with a (dangerous) twist. Or be one of the two impostors debating Chilean architecture in There’s Something About Mary or (more creepily) Anthony Royal in the dystopian drama High-Rise—the opportunities are manifold. In the end, most architects exhibit a combination of these prototypical stereotypes. Periodic gluing, gardening, guiding, impressing, and a little bit of all-knowing every now and then can make for a pretty good architect. 1 Christopher Alexander, The Nature of Order (Berkeley, CA: Center for Environmental Structure, 2002). Chapter 3.
Architects Live in the First Derivative In a Constantly Moving World, Your Current Position Isn’t Very Meaningful Deriving the need for architecture Defining a system’s architecture is a balancing act between many, often conflicting goals: flexible systems can be complex; high-performing systems can be difficult to understand; easy-to-maintain systems can take more effort to construct initially. Although this is what makes an architect’s work so interesting, it also makes it difficult to pin down what exactly drives architectural decisions. Rate of Change Defines Architecture If I had to name one primary factor that influences architecture, I’d put rate of change at the top of my list, based on reasoning about the inverse question: when does a system not need any architecture at all? Although this isn’t a natural question for an architect to ask (nor to answer), it can reveal what system property makes architecture valuable. In my mind, the only system that wouldn’t benefit from architecture is one that doesn’t change at all. If everything about a system is 100% fixed, just getting it working somehow seems good enough. Now, reverting the logic back to the original proposition, it appears natural that the rate of change is a major driver of architecture’s value and architectural decisions. It’s easy to see that a system that doesn’t need to change much will have a substantially different architecture than one that needs to absorb frequent changes over long periods of time. Good architects, therefore, deal with change. This means that they live in the system’s first derivative: the mathematical expression for how quickly a function’s value changes.1 Once we understand the influence change has on architecture, it’s useful to consider the various forms of change affecting an IT system.
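As a brief aside, the derivative metaphor can be made precise with the standard calculus definition (general mathematics, not specific to this book): the first derivative of a function measures its instantaneous rate of change,

```latex
f'(t) \;=\; \lim_{\Delta t \to 0} \frac{f(t + \Delta t) - f(t)}{\Delta t}
```

If we (loosely) let f(t) stand for the functionality a system delivers at time t, an architect living in the first derivative worries less about the current value f(t) and more about f'(t): how quickly that value can change.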
The first change that comes to mind is a change in functional requirements, but there’s a lot more: changes in the volume of traffic or data to be processed, a change of the runtime environment to the cloud, or changes to the business context, such as using the system in different languages or by different people. Change = Business as Unusual? Despite the popular saying that “the only constant is change,” traditional IT organizations tend to have a somewhat uneasy relationship with change. This mindset is often revealed by a popular engine room slogan: “never touch a running system” (Chapter 12). When change can’t be avoided, IT departments neatly package it into a project. The most celebrated part of an IT project is the end, or launch, which ironically is often the first time real users actually get to use the system. The reason for celebration is that things can return to “business as usual,” that is, stable operations without any change. Packaging change into projects reflects an organization’s belief that “no change” is the normal, desired state and “change” is the intermittent, unusual state. Thus, many organizational systems are designed to control and prevent change: budgeting processes limit spending on change; quality gates limit changes going to production; project planning and requirements documents limit scope changes. Transforming a software delivery organization such that it embraces constant change requires adjusting these processes to support rather than prevent change without ignoring the (generally useful) motivation for setting them up in the first place. That’s not an easy task and is the reason why this book devotes an entire part to transformation (Part V).
Varying Rates of Change Technology is a fast-moving field: we don’t think much of IT products carrying a three-part version number: “well, if you’re still on 2.4.14, I can’t help you much; it’s really time to upgrade to .15.” Luckily, not everything in IT moves fast: the most common processor architecture, the base for Intel’s x86 processors, originates from 1978. The ARM chips that dominate today’s mobile devices are based on a design from around 1985. Both Linux and Windows operating systems are well past their teenage years, and even Java passed the 20-year mark at version 9 some years ago, closely followed by the Spring Framework, which has surpassed a respectable 15 years. Naturally, such low rates of change can largely be observed in lower layers of the so-called IT stack: hardware and operating systems have such a vast installed base and so many dependencies that the cost of an all-out replacement would be huge. Hence, we tend to see more evolution than revolution here. These technologies are essentially the base of the pyramid (Chapter 28), giving us a stable foundation to build on. On top, things move a lot faster. For example, the popular AngularJS framework was essentially replaced by the very different Angular framework just five years after its inception. Google’s Fabric framework also lived just five years before being subsumed by Firebase. And Google Mashup Editor, one of my favorites of the day, survived a mere two years. Although we’re surely sad to witness products’ early demise, the rate at which new products and tools arrive paints an even more dramatic picture. For example, a look at the Cloud Native Interactive Landscape offered by the Cloud Native Computing Foundation (CNCF) will quickly convince you that building modern applications requires a fast-growing list of ingredients. Things are moving fast and are only getting faster. If rate of change is a driver for architecture, it looks like we’ll need more of it!
A Software System’s First Derivative If the first derivative is an architect’s primary concern, how does this somewhat abstract concept translate into the reality of systems architecture? We can get a hint by thinking about which part of a software system determines its rate of change. For a custom-built system, the critical element for change is the build toolchain, the part that converts source code into an executable format that is subsequently deployed onto the runtime infrastructure. All changes to the software (better) go through this build and deployment toolchain. Knowing that the software toolchain is the first derivative, we see that increasing a software system’s rate of change requires a well-tuned toolchain (Chapter 13). A software system’s first derivative is its build and deployment toolchain. It’s no surprise, then, that in recent years the industry has put much attention and effort into reducing friction in software delivery: continuous integration (CI), continuous deployment (CD), and configuration automation are all aspects of increasing the first derivative of software systems and thus speeding up software delivery. Without such innovations, daily or hourly software deployments wouldn’t be possible, and companies wouldn’t be able to compete in digital markets, which thrive on constant improvement and frequent updates. Whereas build systems previously were the proverbial shoemaker’s children, meaning they didn’t get a lot of attention, they now run on the same type of infrastructure as the production systems. Containerized, fully automated, elastic, cloud-based, on-demand build systems are quickly becoming the norm. Teams building and maintaining such sophisticated build systems clearly live in the first derivative!
Designing for the First Derivative When designing a system for change, it’s again helpful to think about the opposite—the aspects that impede change:

Dependencies: Too many interdependencies between a system’s components will result in small changes needing adjustments in many places, increasing both effort and risk. Systems with fewer interdependencies—for example, because they are modular and cleanly separate responsibilities—localize changes and can therefore generally absorb a higher rate of change. The research conducted by the authors of the book Accelerate2 shows that decoupling system components is the biggest contributor to sustained software delivery.

Friction: Both cost and risk of change increase with friction, generated, for example, by long lead times for infrastructure provisioning or numerous manual deployment steps. Teams that live in the first derivative therefore ensure that their software build chain is fully automated.

Poor quality: There’s a common misbelief that good quality requires extra time and effort. The inverse is actually true: poor quality slows down software delivery. Changes to a poorly tested or poorly built system take more time and are more likely to break things.

Fear: Often ignored, a programmer’s attitude has a major impact on the rate of change. Poor quality and low levels of automation make change a risky proposition. Developers will thus be afraid of making changes. This leads to code rot, which in turn increases the risk of change—a nasty spiral.

The list shows that an architect has several levers with which they can increase velocity, some technical in nature and others that relate to team attitude. It’s another example of how technical and organizational architecture go hand in hand. Confidence Brings Speed If fear slows you down, confidence should speed you up. Automated tests do just that: they give teams confidence and thus increase the rate of change.
That’s why sufficient test coverage shouldn’t be measured in the percentage of lines of code covered. Rather, it should be measured by whether teams can make changes confidently. Propose to a development team that they let you delete 20 arbitrary lines from their source code. Then, they’ll run their tests—if they pass, they’ll push the code straight into production. From their reaction, you’ll know immediately whether their source code has sufficient test coverage. Despite an abundance of tools that are supposed to speed up software delivery, the determining factor remains decidedly human. The change that’s never made out of fear cannot be accelerated by the world’s best toolchain. Rate of Change Trade-Offs Increasing an organization’s rate of change is not an all-or-nothing affair and involves balancing trade-offs. Borrowing one more time from the routinely overstretched analogy between IT architecture and building architecture yields useful advice on the multiple facets of designing for change. If either a large software project or housing project is undertaken without a conscious decision about its architecture, the “default” architecture converges toward the “Big Ball of Mud,” also referred to by its real-world incarnation of a shantytown (Chapter 8). A shantytown, or slum, is generally constructed using cheap materials and unskilled labor. Low cost and a broad labor pool are actually desirable properties. Additionally, local changes, such as adding a wall or even another floor, are often quick and inexpensive—in contrast to fancier high-rise buildings. However, besides not providing a very comfortable living environment, slums also lack common infrastructure, such as a well-built electrical or sewer system. The lack of such infrastructure ultimately limits their rate of growth. This is a good reminder that optimizing for local or short-term change can inhibit global or long-term change.
Multispeed Architectures If a system’s rate of change influences its architecture, it would seem natural to construct a system such that components are separated by rate of change. This approach forms the basis for the popular concepts of two-speed architecture or bimodal IT, which suggest that traditional companies looking to become competitive in a digital world should initially increase the rate of change in the interaction layer (“Systems of Engagement”) while keeping legacy systems (“Systems of Record”) stable. In doing so, rapid changes can supposedly be applied to the customer-facing systems, whereas the record-keeping systems are kept stable and reliable. Although dividing systems by rate of change is a fair idea, this particular approach has significant shortcomings. First, it’s based on the flawed assumption that one can move faster by compromising quality (Chapter 40). Otherwise we wouldn’t need to keep a low rate of change in systems of record to maintain their reliability. Second, a company will be hard-pressed to localize change into the interaction layer. For example, the addition of a simple field to the system of engagement typically also requires a change to the system of record, coupling the two systems’ rates of change: if the system of record follows a six-month release cycle, there won’t be much speed inside this two-speed architecture. It turns out that the separation between systems of engagement and systems of record is artificial and doesn’t line up well with the overall rate of change from a business or end-user perspective. This insight is underlined by the fact that hardly any digital business follows such a setup. Digital companies only know one speed: fast. Separating rate of change along a different dimension might well be beneficial, though.
For example, a company’s accounting or payroll system will likely have a lower rate of change and can utilize a different architecture from the core business systems, which form a competitive differentiator for the organization and hence should support a higher rate of change. The Second Derivative If the first derivative describes a software system’s rate of change, then, following our mathematical analogy, increasing the rate of change depends on a positive second derivative. Using the speed of a car as an analogy, a car’s speed is the first derivative of its position: it defines how much distance it can cover over a given time interval. Accelerating—that is, increasing the speed—is the second derivative. Back in IT, the second derivative is the essence of most transformation programs: they aim to increase the rate of change in an organization or its IT systems. Thus, for an organization to appreciate and successfully conduct a transformation program, it first needs to appreciate the importance of the first derivative; that is, it must understand economies of speed (Chapter 35). It’s hard to sell a stronger engine and a shorter gear ratio for faster acceleration to someone who prefers to coast along on cruise control. Rate of Change for Architects Lastly, technical systems and organizations aren’t the only systems that need to increase their rate of change. Architects do, too, because new technologies arrive at an ever-faster pace, leaving architects with the enormous challenge of staying up to date. If they don’t, they might be relegated to life in the ivory tower (Chapter 1), far away from the engine room. How can architects expect to keep up in today’s world of rapid innovation? Trying to do so by yourself appears futile—no one can stay current on everything. Instead, architects should be part of a trusted but diverse network of experts, which can provide unbiased information.
When you sit near a large IT budget that’s being vied for by vendors, you’ll have many folks wanting to update you on new technologies, or rather products (Chapter 16). However, neutrality is an architect’s major asset, so they’re expected to cut through the buzzword fog to discern what’s really new and what’s just clever repackaging of old concepts. Even though living in a world that’s moving ever faster can be tiring, it’s also what keeps architects’ jobs interesting and makes architecture more valuable. So, embrace life in the first derivative! 1 The derivative of a function measures the sensitivity to change of the function’s output value with respect to a change in its input value. 2 Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations (Portland, Oregon: IT Revolution, 2018). Chapter 4. Enterprise Architect or Architect in the Enterprise? The Upper and Lower Floors of the Ivory Tower Architecture from the ivory tower When I was hired as an enterprise architect, the head of enterprise architecture to be more precise, I had little idea what enterprise architecture really entailed. I also wondered whether my team should be called the “Feet of Enterprise Architecture,” but that contemplation didn’t find much love. The driver behind the tendency to prefix titles with “head of” was aptly described in an online forum I stumbled upon:1 This title typically implies that the candidate wanted a director/VP/executive title but the organization refused to extend the title. By using this obfuscation, the candidate appears senior to external parties but without offending internal constituencies. I am not particularly fond of the “head of xyz” title because it focuses on the person heading (no pun intended) a team rather than accomplishing a specific function. I’d rather name the person by what they need to achieve, assuming that they don’t do this alone but have a team supporting them. 
All title prefixes aside, when IT folks meet an enterprise architect, their initial reaction is to place this person high up in the penthouse (Chapter 1), where they draw pretty pictures that bear little resemblance to reality. To receive a warmer welcome from IT staff, one should therefore be careful with the label enterprise architect. However, what is an architect who works at enterprise scale supposed to be called, then? Enterprise Architecture The recurring challenge with the title enterprise architect is that it could describe a person who architects the enterprise as a whole (including the business strategy level) or someone doing IT architecture at the enterprise level (as opposed to a departmental architect, for example). To help resolve this ambiguity, let’s defer to the defining book on the topic, Enterprise Architecture as Strategy by Jeanne Ross, Peter Weill, and David Robertson.2 Here, we learn the following: Enterprise architecture is the organizing logic for business processes and IT infrastructure reflecting the integration and standardization requirements of the company’s operating model. Following this definition, enterprise architecture (EA) isn’t a pure IT function but also considers business processes, which are part of a company’s operating model. In fact, the book’s most widely publicized diagram shows four quadrants depicting business operating models with higher or lower levels of process standardization (uniformity across lines of business) and process integration (sharing of data and interconnection of processes). Giving industry examples for all quadrants, Weill and Robertson map each model to a suitable high-level IT architecture strategy. For example, a data and process integration program might yield little value if the business operating model is one of highly diversified business units with few shared customers. 
For such enterprises, IT should instead provide a common infrastructure, on top of which each division can implement its diverse processes. Conversely, a business that’s composed of largely identical units, such as a franchise, benefits from a highly standardized application landscape. The matrix demonstrates perfectly how EA forges the connection between the business and IT. Only if the two are well aligned does IT provide value to the business. Connecting Business and IT Connecting business and IT is easier if the business side of the organization also has a well-defined architecture. Luckily, as business environments become more complex and digital disruptors force traditional enterprises to evolve their business models more rapidly, the notion of business architecture has gained significant attention in recent years. Business architecture translates the structured, architectural way of thinking (Chapter 8) that’s guided by a formalized view of components and interrelationships into the business domain. Rather than connecting technical system components and reasoning about technical system properties such as security and scalability, business architecture describes “the structure of the enterprise in terms of its governance structure, business processes and business information.”3 The business architecture essentially defines the company operating model, including how business areas are structured and integrated, derived from the business strategy. Meanwhile, the IT architecture builds the corresponding IT capabilities. If the two work seamlessly side by side, you don’t need much else. In the more likely case that the two aren’t well connected, you need something to pull them together. Therefore, here’s my proposed definition of enterprise architecture: Enterprise architecture is the glue between business and IT architecture. This definition clarifies that EA, unlike IT architecture at the enterprise level, isn’t an IT function. 
Accordingly, the EA team should be positioned close to the company leadership and not be buried deep within the IT organization, so that it can balance business, technical, and organizational considerations. The definition also implies that once business and IT are tightly interlinked, you won’t need much EA, which is one reason why you don’t find much EA within so-called digital giants. But don’t panic! The translation between business needs and IT architecture remains a domain that’s perennially short of talent. It appears that most folks find comfort on one or the other side of the fence, but only a few can, and choose to, credibly play in both worlds. It’s a good time to be an enterprise architect. Most digital giants don’t have EA departments because their business and IT are tightly interlinked. IT Is from Mars, Business Is from Venus The strict separation between IT and business that is commonly found in enterprises seems troublesome to me. I tend to jest that in the old days, when everything was running on paper instead of computers, companies also didn’t have a separate “paper” department and a CPO—the chief paper officer. In digital companies, business and IT are inseparable; IT is the business, and the business is IT. Connecting business and IT gives EA a whole new relevance but also new challenges. It’s like adding a mid-floor elevator that connects the business folks in the penthouse with the IT folks in the engine room because the respective elevators don’t quite reach each other. Although highly valuable, in the long run such an enterprise architecture department’s objective must be to make itself obsolete, or at least smaller, by extending the respective elevators. But no worries: rapid changes in both the business and technical environments make it unlikely that the need for enterprise architecture will disappear altogether. 
Building a fruitful, bidirectional connection between business and IT architecture becomes easier if the business architecture is at a level of maturity comparable to that of IT architecture. More often than not, though, business architecture is even less mature as a domain than IT architecture. That’s not because businesses have no architecture; rather, it’s because the folks doing business architecture were not identified as such but were the business leaders, division heads, or COOs. Also, designing the business was attributed to business acumen rather than structured thinking. Where the business produced architecture-like artifacts, they often ended up being “functional capability maps” that don’t include any lines (Chapter 23). Supporting the business is the ultimate goal and raison d’être of all enterprise functions. Positioning IT architecture on par with business architecture highlights, though, that the days when IT was a simple order-taker that provided a commodity resource at the lowest possible cost are (luckily) over. In the digital age, IT is a competitive differentiator and opportunity driver, not a commodity like electricity. Digital giants like Google or Amazon aren’t technology companies; they are advertising or fulfillment companies that understand how to use technology for competitive advantage. Therefore, the common excuse that “Google and Amazon are technology companies while we are an insurance company/bank/manufacturing business” no longer holds. These companies will compete with you, and if you want to be competitive, you also need to change your view of IT. It’s not an easy thing to do, but the digital giants have demonstrated how powerful that insight is. Value-Driven Architecture The scale and complexity of doing architecture at enterprise scale makes large-scale IT architecture exciting, but it also presents one of the biggest dangers. 
It’s far too easy to become lost in this complexity and have an interesting time exploring it without ever producing tangible results. Such instances are the source of the stereotype that EA resides in the ivory tower and delivers little value. EA teams therefore need to have a clearly articulated path to value: any effort that is made needs to be paid back by providing value to the organization. Another danger lies in the long feedback cycles. Judging whether someone performs good EA takes even longer than judging good application architecture. Even though the digital world forces shorter cycles, many EA plans still span three to five years. Thus, enterprise architecture can become a hiding ground for wanna-be cartographers. That’s why enterprise architects need to show impact (Chapter 5). Fools with tools Some enterprise architects associate themselves closely with a specific EA tool that captures the diverse aspects of the enterprise landscape. These tools allow structured mapping from business processes and capabilities, ideally produced by the business architects, to IT assets such as applications and servers. Make sure that your tools work for you, not the other way around! Done well, such tools can be the structured repository that builds the bridge between business and IT architecture. Done poorly, they become a never-ending discovery and documentation process that produces a deliverable that’s missing an emphasis (Chapter 21) and is outdated by the time it’s published. Needless to say, such a deliverable provides little value. Visit All Floors My definition of EA also implies that some IT architects, who aren’t enterprise architects, work at enterprise scope. These are largely the folks I refer to in this book. Because they are the technical folks who have learned to ride the elevator (Chapter 1) to the upper floors to engage with management and business architects, they are a critical element in any IT transformation. 
How is being an “enterprise-scale architect” different from being a “normal” IT architect? First, everything is bigger. Many large enterprises are conglomerates of different business units and divisions, each of which can be a multibillion-dollar business and can be engaged in a different business model. As things get bigger, you will also find more legacy: businesses grow over time or through acquisitions, both of which breed legacy. This legacy isn’t confined to systems; it also lives in people’s mindsets and ways of working. Enterprise-scale architects must therefore be able to navigate organizations (Chapter 34) and complex political situations. Performing true EA is as complex and as valuable as fixing a Java concurrency bug. There’s enormous complexity at all levels, but the good news is that you can use similar patterns of thinking at the different levels. For example, software architects need to balance their system’s granularity and interdependencies: a giant monolith is rather inflexible, whereas a thousand tiny services will be difficult to manage and can incur significant communication overhead. The exact same considerations apply to business architecture when considering the size of divisions and product lines. Lastly, EA also faces the same trade-offs when having to decide which systems should be centralized, which simplifies governance but can also stifle local flexibility. Architecture, if taken seriously, provides significant value at all levels. Enterprises resemble a fractal structure: the more you zoom in or out, the more things look similar. The short film Powers of Ten, produced in 1977 by Charles and Ray Eames for IBM, illustrates this beautifully: the film zooms out from a picnic in Chicago by one order of magnitude every 10 seconds until it reaches 10²⁴, showing a sea of galaxies. Subsequently, it zooms in until at 10⁻¹⁸ it shows the realm of quarks. Interestingly, the two views don’t look all that different. 
1 Keith Rabois, Quora, May 11, 2010, “What does ‘Head’ usually mean in job titles like ‘Head of Social,’ ‘Head of Product,’ ‘Head of Sales,’ etc.?”, https://oreil.ly/5LmbY. 2 Jeanne W. Ross, Peter Weill, and David C. Robertson, Enterprise Architecture as Strategy: Creating a Foundation for Business Execution (Boston, MA: Harvard Business Review Press, 2006). 3 Object Management Group website, http://www.omg.org/bawg. Chapter 5. An Architect Stands on Three Legs A Three-Legged Stool Does Not Wobble A three-legged stool does not wobble What do IT architects do? You could say that they are the people who make IT architecture, but that leaves you with having to define what architecture is, which we won’t do until Part II. More interesting yet, what sets a good architecture apart from an average one? And what does an architect become after many successful years? A penthouse resident (Chapter 1)? Hopefully not! A chief technology officer (CTO)? Not a bad choice. Or do they remain a (more senior) architect? That’s what famous building architects do, after all. It’s time to have a look at the progression of architects. Skill, Impact, Leadership When asked to characterize the seniority of an architect, I apply a simple framework: a successful architect must stand on three “legs”:

Skill: The foundation for practicing architects. It requires knowledge and the ability to apply it to solve real problems.

Impact: The measure of how well an architect applies his or her skill to benefit the company.

Leadership: Determines whether an architect advances the state of the practice.

This nomenclature maps well to other professional fields that rely on highly trained and experienced individuals. For example, in the medical field, after studying and acquiring skill, doctors practice and treat patients before they go on to publish in medical journals and pass their learnings on to the next generation of doctors. The legal field works similarly. 
Let’s have a brief look at each “leg.” Skill Skill is the ability to apply relevant knowledge, which can relate to specific technologies (such as Docker) or architectures (such as microservices architectures). Such knowledge can usually be acquired by taking a course, reading a book, or perusing online material. Most (but not all) professional certifications focus on verifying knowledge, partly because it’s easily mapped to a set of multiple-choice questions. Skill brings this knowledge to life by successfully applying it to specific problems. For example, defining the right domain boundaries and service granularity for a complex microservices architecture is a skill. Knowledge is like having a drawer chest full of tools. Skill implies knowing when to open which drawer and which tool to use. Impact Impact is measured by the benefit achieved for the business, usually in the form of additional revenue or reduced cost. Faster time to market or the ability to incorporate unforeseen requirements late in the product cycle also positively affects revenue and therefore counts as impact. Focusing on impact is a good exercise for architects to not drift off into PowerPoint-land. As I converse with colleagues about what distinguishes a great architect, we often identify rational and disciplined decision making (Chapter 6) as a key factor in translating skill into impact. This doesn’t mean that just being a good decision maker makes you a good architect. You still need to know your stuff. Leadership The leadership leg acknowledges that experienced architects do more than make architecture. Mentoring junior architects can save a new generation of architects many years of learning by doing. Senior architects should also further the state of the field as a whole; for example, by sharing what they’ve learned or mental models they’ve developed. 
Such sharing can be done via numerous channels, including academic publications, magazine articles, teaching at university, teaching professional courses, speaking at conferences, or blogging. When someone with the title “senior architect” proposes to meet me, I tend to do a quick internet search for their name before I reply. If nothing much comes up, I have doubts as to how “senior” they are. It will also make it more difficult for them to get my time. A Chair Can’t Stand on Two Legs Just as a stool cannot stand on two legs, it’s important to appreciate the balance between the three aspects. Skill without impact is where new architects start out as students or apprentices. But soon it is time to get out into the world and make an impact—architects who don’t make an impact don’t have a place in a for-profit business. Impact without leadership is a typical place for architects who are deeply ingrained in projects but “don’t get out much.” Such architects will plateau at an intermediate level, which is bad for them and their employer. The architects will likely hit a glass ceiling in their career because they won’t be able to see beyond their current environment. Likewise, such an architect won’t lead the company to much-needed innovative or transformative solutions, ultimately limiting their impact. Many companies are penny wise and pound foolish by not placing sufficient emphasis on nurturing their architects. They fear that any distraction from daily project work will be unproductive. However, they miss out on growing world-class architects. Mature companies, in contrast, go as far as formalizing the aspect of leadership as “give back”: for example, IBM distinguished engineers and fellows are expected to demonstrate giving back to the community both internally (e.g., via mentoring) and externally (e.g., via conference presentations or publications). 
Lastly, leadership without (prior) impact lacks foundation and might be a warning signal that an architect has become an ivory tower resident with a weak link to reality. This undesirable effect can also occur when the impact stage of an architect lies many years or even decades back: the architect might preach methods or insights that are no longer applicable to current technologies. Although some insights are timeless, others age with technology: putting as much logic as possible into the database as stored procedures because it speeds up processing is no longer a wise approach, as databases often turn out to be the bottleneck in modern web-scale architectures. The same is true for architectures that rely on nightly batch cycles. Modern 24/7 real-time processing doesn’t know any nighttime. The Virtuous Cycle But there’s more to the three facets of being a good architect: each element contributes back to the others, as shown in Figure 5-1. As an architect applies their skill to generate impact, they also learn which skills to prioritize to maximize that impact. Most likely you learned a lot of things that don’t easily translate into daily life in corporate IT—the Ackermann function (Chapter 39) is one of my favorites. Given the rate of innovation in our field, being able to prioritize your learning time is a major asset. So, there’s a symbiotic relationship between building up skill and applying it. Figure 5-1. An architect’s virtuous cycle Often, the best way to learn something is to apply it to a real-world problem. That’s why my house is full of home automation. It’s not that I really need all things to be automated; most of these were learning projects. Exercising leadership further amplifies an architect’s impact: 10 well-mentored junior architects will surely generate more impact than one senior architect. As architects, we know that scaling vertically (getting smarter) works only up to a certain level and can lead to a single point of failure (you!). 
Therefore, you need to scale horizontally (Chapter 30) by deploying your knowledge to multiple architects. The scarcity of good architects makes this step more important than ever. Interestingly, though, mentoring not only benefits the mentee, but also the mentor. The old saying that to really understand something you need to teach it to someone else is most true for architecture. Likewise, giving a talk or writing a paper (Chapter 18) requires you to sharpen your thoughts, which often leads to renewed insight. Also, in a fast-moving world, mentors can receive reverse mentoring1 about new technologies or approaches, which can often help offload existing assumptions that no longer hold true (Chapter 26). Authoring books and sharing openly has given me access to the most amazing communities and has allowed me to have much more impact. Lastly, sharing openly and demonstrating thought leadership offers another huge benefit: it can give you access to a powerful community of other thought leaders, which in turn makes you a better architect. Most tight-knit communities share certain expectations for their members. While usually not spelled out, they typically involve giving back to the community in the form of conference talks, authoring books or blog posts, or contributing to open source projects. You Spin Me Right Round… Experienced architects will correctly interpret this 1980s reference (others can resort to Wikipedia2) to mean that an architect doesn’t complete the virtuous cycle just once. This is partly driven by ever-changing technologies and architectural styles. A person might already be a thought leader in relational databases, but they might need to acquire new skills in NoSQL databases. The second time around acquiring skill is usually significantly faster because you can build on what you already know. 
After a sufficient number of cycles, we might in fact experience what the curmudgeons always knew: that there is really not much new in software architecture and that we’ve seen it all before. Another reason to repeat the cycle is that the second time around our understanding can be at a much deeper level. The first time around we might have learned how to do things, but only after the second time might we understand why. For example, it’s likely no misrepresentation that writing Enterprise Integration Patterns3 is a form of thought leadership. Still, some of the elements such as the pattern icons or the decision trees and tables in the chapter introductions were more accidental than based on deep insight. It’s only now in hindsight that we understand them as instances of a visual pattern language or pattern-aided decision approaches. Thus, it’s often worthwhile to make another cycle. Architect as Last Stop? Even though architects have one of the most exciting jobs, some people might be sad to see that being an architect implies that you’ll likely remain one for most of your career. I am not so worried about that. First, this puts you in a good peer group of CEOs, presidents, doctors, lawyers, and other high-end professionals. Second, in technically minded organizations, software engineers should feel the same: your next career step should be to remain a software engineer, except a senior one, or staff engineer or perhaps a principal engineer. The goal is, therefore, to detach the job title of software engineer or IT architect from a specific seniority level. At many digital organizations the software engineer career ladder reaches all the way to the senior vice president level, with commensurate standing and compensation. Some organizations even include a chief engineer, which, if you think about it, might be a better title than chief architect. Personally, I prefer to get better at what I like doing than trying to chase something else just for the title. 
Keep architecting! 1 Jennifer Jordan and Michael Sorell, “Why Reverse Mentoring Works and How to Do It Right,” Harvard Business Review, Oct. 3, 2019, https://oreil.ly/bjAET. 2 Wikipedia, “You Spin Me Round (Like a Record),” https://oreil.ly/fDcRP. 3 Gregor Hohpe and Bobby Woolf, Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions (Boston, MA: Addison-Wesley, 2003). Chapter 6. Making Decisions Deciding Not to Decide Is a Decision (IT) Life is full of choices You buy a lottery ticket and win. What a fantastic decision! You cross the road at night, on red, on a busy street, slightly intoxicated, and with your eyes closed, and arrive safely on the other side. Also a good decision? Doesn’t sound like it. But what’s the difference? Both decisions had positive outcomes. In the latter case, though, we judge by the risk involved, while in the former we focus on the outcome, ignoring the ticket price and the (usually low) odds of winning. However, you can’t judge a decision by the outcome alone, simply because you didn’t know the outcome when you made the decision. Here’s another exercise: in front of you is a very large jar. It contains 1,000,000 pills. They all look the same, are all tasteless and benign—except one, which will kill you instantly and painlessly. How much money does someone have to pay you to take a pill from this jar? Most people will answer 1 million dollars, 10 million dollars, or outright refuse. However, the same people are quite willing to cross the road on a red light (with their eyes open), which carries the same risk as swallowing a couple of pills. It’d be difficult to argue that the 30 seconds you saved by crossing on red would have earned you the equivalent of a few million dollars. Humans are terrible decision makers, especially when small probabilities and grave outcomes like death are involved. 
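The difference between judging a decision by its outcome and judging it by its risk boils down to expected value. A minimal sketch in Python; the lottery odds, jackpot, and ticket price below are hypothetical figures chosen for illustration, not numbers from the text:

```python
def expected_value(outcomes):
    """Sum of probability-weighted payoffs for a decision."""
    return sum(p * payoff for p, payoff in outcomes)

# Hypothetical lottery: a $2 ticket with a 1-in-300-million shot at $100M.
p_win = 1 / 300_000_000
lottery = [(p_win, 100_000_000), (1 - p_win, 0.0)]
net = expected_value(lottery) - 2.0  # subtract the ticket price

print(f"Expected net value of playing: ${net:.2f}")  # about -$1.67
# Buying the ticket loses money on average; even in the rare case
# that you win, it was still a bad decision when you made it.
```

The point of the sketch is that the expected value is fixed before the draw, so the quality of the decision can be judged without knowing the outcome.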
Kahneman’s book Thinking, Fast and Slow1 shows so many examples of how our brain can be tricked that it can make you wonder how humanity could get this far despite being such terrible decision makers. I guess we had a lot of tries. Making decisions is a critical part of an enterprise-scale architect’s job. Being a good architect therefore warrants a conscious effort to become a better decision maker. The Law of Small Numbers Contrived examples make erratic or illogical behavior quite apparent. But when faced with complex business decisions, poor decision-making discipline often isn’t as obvious. I attended weekly operations meetings that labeled weeks “good” or “bad” based on the number of critical infrastructure outages. I relabeled those weeks as “lucky” because lowering the number and severity of incidents in the long run is the real metric to observe. Hoping for a week with fewer outages is the corporate IT equivalent of the (flawed) roulette strategy of “after five times black it’s gotta be red!” My shocker version of highlighting such flawed thinking consists of a (fictitious) sequence of events during Russian Roulette: “click—I am a genius!—boom.” Kahneman calls this “the law of small numbers”: people tend to jump to conclusions based on sample sizes that are way too small to be significant. For example, zero outages in a week are no cause for celebration in a large enterprise. Google’s mobile ads team used rigid metrics for A/B testing experiments that affected ad appearance or selection. The dashboard included metrics to check click-through rates (more clicks = more money) but also to understand whether ads distract from the search results (users come for search, not ads). Each metric’s confidence interval represented the range that 95% of sample sets would randomly fall into. 
If your experiment’s improvement landed inside the confidence interval, you’d need to extend the experiment to get valid data before implementing the suggested change (for normal distributions, the confidence interval narrows with the square root of the number of sample points). Alas, not all data leads to better decisions. When selecting a product, IT often compiles extensive requirement lists that are summed up into scores. However, when you pick the “winner” with a score of 82.1 over the “loser” with 79.8, it would be challenging to prove the statistical significance of this decision. Still, numeric scores might be better than traffic-light comparison tables that rate each attribute as “green,” “yellow,” or “red.” A product might get “green” for allowing time travel but “red” for requiring planned downtime. Although this might make it look roughly equivalent to one with the opposite properties, I know which one I’d prefer. Sadly, traditional IT organizations often reverse engineer such score charts from a preferred outcome so that they have data to back up their preference. Others are designed to protect the status quo by demanding quirks only present in existing products. I have seen IT requirements analogous to demanding that a new car must rattle at 60 mph and have a squeaky door so that it can appropriately replace the existing one. Bias Kahneman’s book lists many ways in which our thinking is biased. For example, confirmation bias describes our tendency to interpret data in such a way that it supports our own hypotheses. The Google Ad dashboard was designed to overcome this bias. 
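The earlier parenthetical note, that confidence intervals narrow with the square root of the number of sample points, is easy to verify numerically. A rough sketch using the normal approximation for a proportion; the 5% click-through rate and the sample sizes are made-up figures for illustration, not Google's:

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate half-width of a 95% confidence interval for a
    proportion p observed over n samples (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.05  # hypothetical 5% click-through rate
for n in (1_000, 4_000, 16_000):
    print(f"n={n:>6}: +/-{ci_half_width(p, n):.4f}")
# Each quadrupling of the sample size halves the interval:
# n= 1,000: +/-0.0135   n= 4,000: +/-0.0068   n=16,000: +/-0.0034
```

Doubling the precision thus costs four times the data, which is why extending an inconclusive experiment until it yields valid results can take a while.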
Another well-known bias is prospect theory: when faced with an opportunity, people tend to favor a smaller but guaranteed gain over the uncertain chance for a larger one: “A sparrow in the hand is better than the pigeon on the roof.” When it comes to taking a loss, however, people are likely to take a (long) shot at avoiding the penalty over coughing up a smaller amount for sure. We tend to “feel lucky” when we can hope to escape a negative event, an effect called loss aversion. You have likely seen project managers avoid the certain loss in short-term velocity for performing a major refactoring because the payoff in system stability or sustained velocity is uncertain. The following scenario shows how loss aversion tricks us into making irrational decisions. When you offer someone a coin toss that makes them pay $100 on heads but gives them $120 on tails, the expected return of taking the gamble is $10 (0.5 × –$100 + 0.5 × $120)—easy money. However, most people will kindly decline due to their loss aversion. Losing $100 to them feels worse than the chance to gain $120. Most people will only accept the offer when the payout is between $150 and $200. Priming Another phenomenon, priming, can influence decisions based on recent data we received. In the extreme case, when faced with enormous uncertainty, it can make us pick a number we recently heard or saw even if it’s totally unrelated. This effect plays a role when many people answer one million dollars when faced with the one-million-pills example. Priming is routinely used in retail scenarios. When you go to buy a piece of clothing—let’s say a sweater—the store clerk is almost guaranteed to first show you something expensive, even outside your price range. A sweater for $399? It’s made from cashmere and feels very soft and comfortable; tempting, but it’s simply too expensive. But the almost-as-nice sweater for $199 seems a reasonable compromise, and you’ll happily buy it. 
Next door, decent sweaters can be had for $59. You fell victim to priming, setting a context that influences your decision. Priming can even make you walk more slowly if your mindset is on elderly people.2 William Poundstone’s book Priceless: The Myth of Fair Value3 shows that products that no one actually buys can shift purchasing behavior significantly, thanks to priming. When presented with a choice between a “premium” beer for $2.60 and a “bargain” one for $1.80, about two-thirds of test subjects (students) chose the premium beer. Adding a third, “super-premium” beer for a whopping $3.40 shifted the students’ desire so that 90% ordered the premium beer and 10% the super-premium. If we are such horrible decision makers, what can we do to get better at it? Understanding these pitfalls can help you avoid or at least compensate for them. However, mathematics can also help. Micromort One of the most interesting classes that I took at Stanford was Ron Howard’s class on decision analysis, which was entertaining, thought-provoking, and challenging. Decision analysis helps us think rationally about our earlier jar-with-pills example. A one-in-one-million chance of dying is called one micromort. Taking one pill from the jar amounts to being exposed to exactly one micromort. The amount you are willing to pay to avoid this risk is called your micromort value. Micromorts help us reason about decisions with small probabilities but very serious outcomes, such as deciding whether to undergo surgery that eliminates lifelong pain but fails with a 1% probability, resulting in immediate death. To calibrate the micromort value, it helps to consider the risks of different life activities: a day of skiing clocks in at between one and nine micromorts, whereas motor vehicle accidents amount to about 0.5 per day. So a ski trip can run you some five micromorts—the same as swallowing five pills. Is it worth it?
You’d need to compare the enjoyment value you derive from skiing against the trip’s cash expense plus the “cost” of the micromort risk you are taking. So how much should you demand to take one pill? Most people’s micromort value lies between $1 and $20. Assuming a prototypical value of $10, the ski trip that might cost you $100 in gas and lift tickets costs you an extra $50 in risk of death. You should therefore decide whether a day in the mountains is worth $150 to you. This also shows why a micromort value of $1,000,000 makes little sense: you’d hardly be willing to pay $5,000,100 for a one-day ski trip unless you are filthy rich! Lastly, the model helps you judge whether buying a helmet for $100 is a worthwhile investment for you if it cuts the risk of death in half. The micromort value isn’t the same for all people. It goes up with income (or rather, consumption) and goes down with age. This is to be expected, as the monetary value you assign to your remaining life increases with your income. A wealthy person should easily decide to buy a $100 helmet, whereas a person who is struggling to make ends meet is more likely to accept the risk. As you age, the likelihood of death from natural causes unstoppably increases until it reaches about 100,000 micromorts annually, or almost 300 per day, by the age of 80. At that point, the value derived from buying a risk reduction of two micromorts is rather small. Luckily, Ron Howard and Ali Abbas have captured the mathematics of decision making in their book Foundations of Decision Analysis.4 The book isn’t cheap, though, listing at around $200. Should you buy a book for $200 that could make you a better decision maker? Think about it… Model Thinking Decision models can go a long way toward making us better decision makers. Thanks to George Box, it’s well known that “all models are wrong, but some are useful.”5 So, don’t dismiss a model just because it makes simplifying assumptions.
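For instance, the ski-trip arithmetic from earlier can be captured as a tiny model, simplifying assumptions and all. This is a sketch; the function and parameter names are mine, and the numbers are the illustrative ones from the text:

```python
def trip_cost(cash_expense, micromorts, micromort_value):
    """Total cost of an activity: cash outlay plus the priced risk of death."""
    return cash_expense + micromorts * micromort_value

# A $100 ski day at 5 micromorts, valuing each micromort at $10, "costs" $150.
ski_day = trip_cost(cash_expense=100, micromorts=5, micromort_value=10)

# With a $1,000,000 micromort value, the same day would cost $5,000,100.
absurd = trip_cost(cash_expense=100, micromorts=5, micromort_value=1_000_000)
```

Plugging in your own micromort value turns a vague feeling about risk into a number you can actually compare against the enjoyment of a day in the mountains.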
It’s likely to help you make a much better decision than your gut. The best overview of models and their application I have come across is Scott Page’s Coursera course on Model Thinking. He also recently published the content in his book The Model Thinker.6 Decision trees are very simple models that help us make more rational decisions (see Figure 6-1). Suppose that you want to buy a car, but there’s a 40% chance that the dealer will offer a $1,000 cash-back promotion starting next month. You need a car now, so if you defer the purchase, you’ll need to rent a car for $500 for the coming month, even if the rebate doesn’t come through. What should you do? If you buy now, you’ll pay the list price, which we calibrate to $0 for simplicity’s sake. If you rent first, you are down by $500 with a 40% chance to gain $1,000, so the expected value is 0.4 × $1,000 – $500 = –$100, lower than the $0 of buying now. You should buy the car now. Figure 6-1. A decision tree helps you decide whether to buy a car now Let’s make the scenario a bit more interesting: assume that an insider offers to tell you whether the cash-back promotion happens next month or not. He asks $150 for this information. Should you buy it? Having this information, your new decision tree (see Figure 6-2) would allow you to buy now if you’re told that there’s no cash back (in 60% of cases) and to buy later if there is (in 40% of cases). Having information up front increases the expected value to 0.6 × 0 + 0.4 × (1,000 – 500) = $200. As your current best scenario (i.e., buying now) yielded a value of $0, it’s worth paying $150 for the extra information. Figure 6-2. Should you pay someone to tell you whether there will be a rebate? How do you know that the chance of the cash back is exactly 40%? You don’t. But using the model helps you reason in the face of uncertainty. You can rerun the model for a 50% likelihood and see whether your decision changes.
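The decision tree’s expected values, including what the insider’s information is worth, can be checked with a short Python sketch. The variable names are mine; the probabilities and dollar amounts come from the example above:

```python
# Expected values for the car-purchase decision tree described in the text.
p_rebate = 0.4
rebate, rent = 1_000, 500  # cash-back amount and one month of rental

buy_now = 0                            # list price, calibrated to $0
rent_first = p_rebate * rebate - rent  # -100: worse than buying now

# With perfect foresight, buy now when there is no rebate (60% of cases)
# and rent first to collect the rebate when there is one (40% of cases).
with_info = (1 - p_rebate) * buy_now + p_rebate * (rebate - rent)  # 200

# The information is worth the difference between the informed outcome
# and the best uninformed choice, so the insider's $150 price is a deal.
value_of_information = with_info - max(buy_now, rent_first)  # 200
```

Rerunning with `p_rebate = 0.5` is a one-line change, which is exactly the kind of sensitivity check the text suggests.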
IT Decisions Deadly pills, premium beers, and car dealer rebates—how do we bring our learnings back to IT decision making? Many IT decisions—especially those related to cybersecurity risks or system outages—share similar characteristics of small probability but severe downsides. Therefore, separating likelihood from impact and baselining probabilities can help remove emotion, resulting in more rational decisions. Maybe you even find it useful to define a concept of microfail for your systems: a one-in-a-million chance of a catastrophic system failure. A classic case for decision making is system uptime. Suppose that a single server can achieve 99.5% availability, meaning that 99.5% of the time it will be available to your application’s users. This means that over the course of an average month, which has 730 hours, the system can be “down” for 730 / 200 = 3.65 hours. That’s not horrible, but also not great. 99.9% is generally considered a good uptime—the allowed downtime would be less than roughly 45 minutes per month. However, to achieve this, you generally need redundant hardware, meaning that you need a second set of servers ready to go in case your primary server fails. This will double your hardware cost, often require additional failover machinery, and in some cases also double your software license cost. Are three fewer hours of downtime per month worth double the cost? Sounds like a perfect case for decision analysis! Avoiding Decisions With all this science behind decision making, what’s the best decision? It’s the one that you don’t need to take! That’s what Martin Fowler indicated when he observed that “one of an architect’s most important tasks is to eliminate irreversibility in software designs.”7 Those are the decisions that don’t need to be made or can be made quickly because they can be easily changed later thanks to built-in options (Chapter 9).
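Coming back to the uptime comparison for a moment: the downtime arithmetic is simple enough to write down as a one-function sketch. The helper name is mine, and 730 hours approximates an average month, as in the text:

```python
def allowed_downtime_minutes(availability, hours_per_month=730):
    """Permitted downtime per month for a given availability fraction."""
    return (1 - availability) * hours_per_month * 60

single_server = allowed_downtime_minutes(0.995)    # 219 minutes, about 3.65 hours
redundant_setup = allowed_downtime_minutes(0.999)  # 43.8 minutes, under 45 minutes
```

The difference between the two figures, roughly three hours per month, is what the doubled hardware and license cost actually buys you.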
In a well-designed software system, decisions aren’t as final as when taking deadly pills from a jar. 1 Daniel Kahneman, Thinking, Fast and Slow (New York: Farrar, Straus and Giroux, 2013). 2 John A. Bargh, Mark Chen, and Lara Burrows, “Automaticity of Social Behavior,” Journal of Personality and Social Psychology 71, no. 2 (Aug. 1996): 230-244. 3 William Poundstone, Priceless: The Myth of Fair Value (New York: Hill and Wang, 2011). 4 Ronald A. Howard and Ali E. Abbas, Foundations of Decision Analysis (Prentice Hall, 2015). 5 George Box, “Science and Statistics,” Journal of the American Statistical Association (1976). 6 Scott E. Page, The Model Thinker: What You Need to Know to Make Data Work for You (New York: Basic Books, 2018). 7 Martin Fowler, “Who Needs an Architect?,” IEEE Software, July/August 2003, https://oreil.ly/djeuH. Chapter 7. Question Everything Wer Nicht Fragt, Bleibt Dumm! The architect riddler It’s a common misconception that chief architects know everything better than “normal” architects—why else would they be the “chief”? Such thinking is actually pretty far from the truth. Hence, I often introduce myself as a person who knows the right questions to ask. Wrangling one more reference from the movie The Matrix, visiting the chief architect is a bit like visiting the Oracle: you won’t get a straight answer, but you will hear what you need to hear. Five Whys Asking questions isn’t a new technique and has been widely publicized in the “five whys” approach devised by Sakichi Toyoda as part of the Toyota Production System. It’s a technique to get to the root cause of an issue by repeatedly asking why something happened. If your car doesn’t start, you should keep asking “why” to find out that the starter doesn’t turn because the battery is dead because you left the lights on because the beeper that warns you of parking with your lights on didn’t sound because of an electronics problem.
So, before you jump-start the car, you should fix the electronics to keep the problem from happening again. In Japanese the method is called naze-naze-bunseki (なぜなぜ分析), which roughly translates into “why, why analysis.” I therefore consider the “five whys” more of a guideline to not give up too early—you surely didn’t cheat if you identified the actual root cause with just four whys. The technique can be quite useful but requires discipline because people can be tempted to inject their own preferred solutions or assumptions into their answers. I have seen people conducting root-cause analysis on production outages repeatedly answer the second or third question with “because we don’t have sufficient monitoring” and the next one with “because we don’t have enough budget.” The equivalent answer from the car example would be “because the car is old.” That’s not root-cause analysis but opportunism or excuse-ism, a word that made it into the Urban Dictionary, but not yet into Merriam-Webster. Repeatedly asking questions can annoy people a bit, so it’s good to have the reference to the Toyota Production System handy to highlight that this is a widely adopted and useful technique and not you just being difficult. It’s also helpful to remind your counterparts that you are not challenging their work or competence, but that your job requires you to understand systems and problems in detail so that you can spot potential gaps or misalignments. Whys Reveal Decisions and Assumptions When conducting architecture reviews, “why” is a useful question because it helps draw attention to the decisions (Chapter 8) that were made as well as the assumptions and principles that led to those decisions. Too often, results are presented as “god-given” facts that “fell from the sky” or wherever you believe the all-deciding divine creator (the real chief architect!) resides. 
Uncovering the assumptions that led to a decision can provide much insight and increase the value of an architecture review. An architecture review should validate not only the results but also the thinking and decisions behind them. To emphasize this, one should request an architecture decision record1 from any team submitting an architecture for review. Unstated assumptions can be the root of much evil if the environment has changed since the assumptions were made. For example, traditional IT shops often write elaborate graphical configuration tools that could be replaced with a few lines of code and a standard software development tool chain. Their decisions are based on the assumption that writing code is slow and error prone, which no longer holds universally true, as we learn once we overcome our fear of code (Chapter 11). If you want to change the behavior of the organization, you often need to identify and overcome outdated assumptions first (Chapter 26). Coming back to The Matrix, the explanation given by the Oracle—“You didn’t come here to make the choice, you’ve already made it. You’re here to try to understand why you made it”—could make a somewhat dramatic but very appropriate opening to an architecture review. A Workshop for Every Question A clear and present danger of asking questions in large organizations lies in the fact that people often don’t know, can’t express, or are unwilling to give the answer. Their counterproposal is usually to hold a meeting, most likely a very long one, which is labeled as a “workshop,” with the purported goal of sharing and documenting the answer. In the actual workshop, though, it frequently turns out that the answer is unknown, leaving you with the job of answering your own questions. The team might also bring external support to defend against you asking too many undesired questions.
Asking questions in traditional organizations might get you not insights but defensiveness that covers up a lack of decision discipline. Soon, your calendar will be full of workshop invitations, allowing teams to blame you for being the bottleneck that slows their progress because you aren’t available for their important meetings. And they aren’t even lying! Such organizational behavior is an example of systems resisting change (Chapter 10). If your goal is to not just review architecture proposals but also to change the behavior of the organization, you need to take up this challenge and change the system. For example, you can redefine the expectations for architecture documentation, for example to increase transparency, and obtain management buy-in for doing so. If satisfactory documentation isn’t produced before the meeting, the workshop must be canceled. If teams are unable to produce such documentation, you can offer them architects who perform this task on a project basis. The actual workshop becomes more effective when you moderate and work off a list of concrete questions. Cutting the scheduled time in half brings additional focus. On the upside, running architecture documentation workshops and sketching bank robbers (Chapter 24) can give you an invaluable set of system documentation that you can later use as a reference. This effort requires good writing skills (Chapter 18) and adequate staffing, which you can obtain only by taking the architect elevator (Chapter 1) to the upper floors and clearly articulating the value of documenting system architectures. For example, such documents could allow faster staff ramp-up, reveal architectural inconsistencies, and allow rational, fact-based decision making, which in turn supports evolution toward a harmonized IT landscape. In top-down organizations, sometimes you need to lob things to the top so they can trickle back down.
No Free Pass Occasionally, teams that are sent into an architecture review would like to just obtain a “rubber stamp” for what they have done, and they aren’t excited about you asking any questions at all. These are often the same candidates who answer the “why” questions with “because we have no time” after they purposefully waited until the very last minute. For such cases, I have a stated principle of, “You can avoid my review, but you cannot get a free pass.” If management decides that no architecture review is needed because it doesn’t see architecture as a first-class citizen, I’d rather avoid the review altogether than hold a show trial. I see this as in line with my professional reputation: be tough but fair, and make tasty hamburgers out of sacred cows. My boss once summarized this in a nice compliment: she stated that she likes to have the architecture team involved because “we have nothing to sell, no one can fool us, and we take the time to explain things well.” This would make a nice mandate for any architecture team. If you’re wondering about the meaning of the German subtitle of this chapter, it’s from the title song of the German version of Sesame Street, which rhymes nicely and goes “Wieso, weshalb, warum, wer nicht fragt, bleibt dumm!,” which roughly translates into “Why, how come, what for? Those who don’t ask remain stupid!” Don’t remain stupid! 1 Michael Nygard, “Documenting Architecture Decisions,” Relevance, Nov. 15, 2011, https://oreil.ly/1sniB. Part II. Architecture Defining architecture isn’t an easy task—there appear to be almost as many definitions of IT architecture as there are practicing architects. Beyond Software Architecture Most software architecture definitions cite a system’s elements and components plus their interrelationships. In my view, this covers only one aspect of architecture.
First, IT architecture is much more than software architecture: unless you outsourced all your IT infrastructure into the public cloud, you need to architect networks, datacenters, computing infrastructure, storage, and much more. And even if you did, you still need a deployment architecture, a data architecture, and a security architecture. Second, defining which “components” you are focusing on constitutes a significant aspect of architecture. A manager once stated that he couldn’t understand the many network issues despite all the network stuff “being there.” His view was a physical one: Ethernet cables plugged into servers and switches. The complexity of network architecture, however, lies in virtual network segregation, routing, address translation, and much more. Different stakeholders see different parts of the architecture. Three Kinds of Architecture When speaking about architecture, people routinely refer to three quite different concepts, all of which relate to IT but are very different in nature:

1. A system’s architecture, defined by its structure, as in “microservices architecture”
2. The act of defining a system’s structure, as in “the architecture committee”
3. A team that is involved in defining architecture, as in “we’re setting up enterprise architecture”

So, while every system has an architecture, not every organization has an architecture (unit), and even if it does, it may not get much architecture done. To make things a little less confusing, when I mention “architecture,” I generally refer to a system’s properties. For organizational aspects, I speak about “architects”—it’s based on humans, after all. There Always Is an Architecture When speaking about a system’s architecture, it’s worth pointing out that all systems have one. You can’t build anything out of several pieces that doesn’t have any structure. Even clumping everything together into a giant monolith is an architecture decision.
Once we come to this realization, statements like “we don’t have time for architecture” aren’t particularly meaningful. It’s simply a matter of whether you consciously choose your architecture or whether you let it happen to you. History has shown that the latter approach invariably leads to the infamous Big Ball of Mud1 architecture, also referred to as shantytown. Although that architecture does allow for rapid implementation without central planning or specialized skills, it also lacks critical infrastructure and doesn’t make for a great living environment. Fatalism isn’t a great enterprise architecture strategy, so I suggest you pick your architecture. The Value of Architecture Because there always is an architecture, an organization should be clear on what it expects from setting up an architecture function. Setting up an architecture team and then not letting it do its job—for example, by routinely overriding architecture decisions with management decisions—is actually worse than intentionally letting things drift into a Big Ball of Mud: you pretend to define your architecture, but in reality you don’t. Worse yet, good architects don’t want to be in a place where architecture is seen as a form of corporate entertainment. If you don’t take architecture seriously, you won’t be able to attract and retain serious architects. IT management often believes that “architecture” is a long-term investment that will only pay off far into the future. Although this is true for some aspects—for example, managed system evolution over time—architecture can also pay off in the short term, such as when you can accommodate a customer requirement late in the development cycle, when you gain leverage in vendor negotiations because you avoided lock-in, or when you can easily migrate your systems to a new datacenter location. Good architecture can also make a team more productive by allowing concurrent development and testing of components.
Generally, good architecture buys you flexibility. In a rapidly changing world, this seems like a smart investment. Principles Drive Decisions Architecture is a matter of trade-offs: there rarely is one single “best” architecture. For example, the option to be able to move your application to the cloud likely increases cost and complexity. Architects therefore must take the context into consideration when making architectural decisions, because that context will help them weigh the trade-offs against one another. Architects should also strive for conceptual integrity, that is, uniformity across system designs. This is best accomplished by selecting a well-defined set of architecture principles that are consistently applied to architectural decisions. Deriving these principles from a declared architecture strategy assures that the decisions support the strategy. Vertical Cohesion A good architecture is not only consistent across systems but also considers all layers of a software and hardware stack. Investigating new types of scale-out compute hardware or software-defined networks is useful, but if all your applications are inflexible monoliths with hardcoded IP addresses, you gain little. Architects therefore not only need to ride the elevator (Chapter 1) across the organization but also up and down the technology stack. Vertical cohesion doesn’t stop at technology, but also needs to consider the business architecture. For example, many IT decisions can’t be made by IT alone but require input from the business and an understanding of the business structure and context. Architecting the Real World The real world is full of architectures, not just building architectures but also cities, corporate organizations, or political systems. The real world must deal with many of the same issues faced by large enterprises: lack of central governance, difficult-to-reverse decisions, enormous complexity, constant evolution, slow feedback cycles.
That’s why architects should walk through the world with open eyes, always looking to learn from the architectures they encounter. Architecture in the Enterprise When defining architecture in large organizations, architects need to know a lot more than how to draw UML diagrams. They need to be able to do the following, as well:

Chapter 8, Is This Architecture? Distinguish whether something is architecture in the first place.
Chapter 9, Architecture Is Selling Options Be able to sell options to the business.
Chapter 10, Every System Is Perfect… Tackle complexity by thinking in systems.
Chapter 11, Code Fear Not! Know that configuration isn’t better than coding.
Chapter 12, If You Never Kill Anything, You Will Live Among Zombies Hunt zombies so that they don’t eat your brain.
Chapter 13, Never Send a Human to Do a Machine’s Job Automate everything and make the rest self-service.
Chapter 14, If Software Eats the World, Better Use Version Control! Think like software developers as everything becomes software defined.
Chapter 15, A4 Paper Doesn’t Stifle Creativity Build platforms and set standards that don’t stifle creativity.
Chapter 16, The IT World Is Flat Navigate their IT landscape with an undistorted world map.
Chapter 17, Your Coffee Shop Doesn’t Use Two-Phase Commit Gain architecture insights from waiting in line at the coffee shop.

1 Brian Foote and Joseph Yoder, “Big Ball of Mud,” Laputan.org, Nov. 21, 2012, http://www.laputan.org/mud. Chapter 8. Is This Architecture? Look for Decisions! Would you pay an architect for this? Part of my job as chief architect is to review and approve system architectures. When I ask teams to show me “their architecture,” I frequently don’t consider what I receive to be an architecture document. Their counter-question of “what do you expect?” isn’t so easy for me to answer: despite many formal definitions, it isn’t immediately clear what architecture is or whether a document really depicts an architecture.
Too often we have to fall back to the “I know it when I see it” test famously applied by a US Supreme Court judge to obscene material.1 We’d hope that identifying architecture is a more noble task than identifying obscene material, so let’s try a little harder. I am not a big believer in all-encompassing definitions but prefer to use lists of defining characteristics or tests that can be applied. One of my favorite tests for architecture documentation is whether it contains any nontrivial decisions and the rationale behind them. Defining Software Architecture So many attempts at defining software architecture have been made that the Software Engineering Institute (SEI) maintains a reference page of these definitions. Among the most widely used is this definition from Garlan and Perry, from 1995: The structure of the components of a system, their interrelationships, and principles and guidelines governing their design and evolution over time. In 2000 the ANSI/IEEE Std 1471 chose the following definition (adopted as ISO/IEC 42010 in 2007): The fundamental organization of a system, embodied in its components, their relationships to each other and the environment, and the principles governing its design and evolution. The Open Group adopted a variation thereof for TOGAF: The structure of the components, their interrelationships, and principles and guidelines governing their design and evolution over time. One of my personal favorites is from Desmond D’Souza and Alan Cameron Wills’s book on the Catalysis method:2 The set of design decisions about any system that keeps its implementors and maintainers from exercising needless creativity. The key point here isn’t that architecture should dampen all creativity, but needless creativity, of which I witness ample amounts. It also highlights the importance of making decisions (Chapter 6).
Architectural Decisions These well-thought-out definitions aren’t easy to work with, however, when someone walks up with a PowerPoint slide showing boxes and lines (Chapter 23), claiming, “This is my system architecture.” The first test I tend to apply is whether the documentation contains meaningful decisions. After all, if no decisions needed to be made, why employ an architect and prepare architectural documentation? Martin Fowler’s knack for explaining the essence of things using extremely simple examples motivated me to illustrate the “architectural decision test” with the simplest example I could think of, drawing from the (admittedly limping) analogy to building architecture. Consider the drawing of a house on the left-hand side in Figure 8-1. It has many of the elements required by the popular definitions of systems architecture: we see the main components of the system (door, windows, roof) and their interrelationships (door and windows in the wall, roof on the top). We might be a tad thin, though, on principles governing its design, but we do notice that we have a single door that reaches the ground and multiple windows, which follows common building principles. Figure 8-1. Is this architecture? Yet, to build such a house I wouldn’t want to pay an architect. This house is “cookie-cutter,” meaning I don’t see any nonobvious decisions that an architect would have made. Consequently, I wouldn’t consider this architecture. Let’s compare this to the sketch on the right side of the figure. The sketch is equally simple, and the house is almost the same, except for the roof. This house has a steep roof and for a good reason: the house is designed for a cold climate where winters bring extensive snowfall. Snow is quite heavy and can easily overload the roof of the house. A steep roof allows the snow to slide off and be easily removed thanks to gravity, a pretty cheap and widely available resource.
Additionally, an overhang prevents the sliding snow from piling up right in front of the windows. To me, this is architecture: nontrivial decisions have been made and documented. The decisions are driven by the system context; in this case, the climate: it’s unlikely that the customer explicitly stated a requirement that the roof not be crushed by snowfall. Additionally, the documentation highlights relevant decisions and omits unnecessary noise. If you believe these architectural decisions were pretty obvious, let’s look at a very different house in Figure 8-2. Figure 8-2. Great architecture on a napkin This house in Figure 8-2 is quite different: the walls are made out of glass. While providing a stellar view, glass walls have the problem that the sun heats up the building, making it feel more like a greenhouse than a residence. The solution? Extending the roof well beyond the glass walls keeps the interior in the shade, especially in summer when the sun is high in the sky. In the winter, when the sun is low on the horizon, the sun reaches through the windows and helps warm the building interior. Again, the architecture is defined by a fairly simple but fundamental decision, documented in an easy-to-understand format that highlights the essence of the decision and the rationale behind it. Fundamental Decisions Needn’t Be Complicated If you think the idea of building an overhanging roof isn’t all that original or significant, try buying one of the first homes to feature such a design; for example, Case Study House No. 22 in Los Angeles by architect Pierre Koenig. It’s easily among the most recognized residential buildings in Los Angeles, or beyond (aided by Julius Shulman’s iconic photograph), and it surely isn’t for sale. You can tour it, though, if you sign up far in advance. Significant architectural decisions may look obvious in hindsight, but that doesn’t diminish their value.
No one is perfect, though: UCLA PhD students have measured that the overhang works better on the south-facing facade than on the west- or east-facing ones.3 Fit for Purpose The simple house example also highlights another important property of architecture: rarely is an architecture simply “good” or “bad.” Rather, architecture is fit or unfit for purpose. A house with glass walls and a flat roof might be regarded as great architecture, but probably not in the Swiss Alps, where it will collapse after a few winters or suffer from a leaking roof. It also doesn’t do much good near the equator, where the sun’s path in the sky remains fairly constant throughout the year. In those regions, you are better off with thick walls, small windows, and lots of air conditioning. Assessing the context and identifying implicit constraints or assumptions in proposed designs is an architect’s key responsibility. Architects are commonly described as the people dealing with nonfunctional requirements. I generally refer to hidden assumptions as nonrequirements—requirements that were never explicitly stated (Part I). Architecture isn’t good or bad, it’s fit or unfit for a purpose. Even the dreaded Big Ball of Mud can be “fit for purpose”; for example, when you need to make a deadline at all costs and can’t care much about what happens afterward. This may not be the context you wish for, but just like houses in some regions have to be earthquake proof, some architectures have to be management proof. Passing the Test Having stretched the overused building architecture analogy one more time, how do we translate it back to software systems architecture? Systems architecture doesn’t need to be something terribly complicated. It must include, however, significant decisions that are well documented and are based on a clear rationale.
The word “significant” can be open to some interpretation and depend on the level of sophistication of the organization, but “we separate frontend from backend code” or “we use monitoring” surely have the ring of “my door reaches the ground so people can walk in” or “I put windows in the walls so light can enter.” Instead, when discussing architectures, let’s talk about what isn’t obvious. For example, “do you use a service layer and why?” (some people may find even this obvious) or “why are you spreading your application across multiple cloud providers?” A good test is whether the chosen option also has downsides—decisions without downsides are unlikely to be meaningful.

All meaningful decisions have downsides.

It’s quite amazing how many “architecture documents” don’t pass this relatively simple test. I hope our set of house sketches provides a simple and nonthreatening way to provide feedback and to motivate architects to better document their designs and decisions. 1 Wikipedia, “Jacobellis v. Ohio,” Sept. 7, 2019, https://oreil.ly/EwvpU. 2 Desmond F. D’Souza and Alan Cameron Wills, Objects, Components, and Frameworks with UML: The Catalysis Approach (Boston: Addison-Wesley Professional, 1998). 3 P. La Roche, “The Case Study House Program in Los Angeles: A Case for Sustainability,” in Proc. of Conference on Passive and Low Energy Architecture (2002). Chapter 9. Architecture Is Selling Options In Uncertain Times It’s Good to Have a Few Options Options on sale I am frequently asked about the value of architecture, sometimes out of actual curiosity, and at other times as a (welcome) challenge. Sadly, I also consistently find just how difficult it can be to answer this seemingly harmless question in a succinct and convincing manner for a nontechnical audience. I thus consider having a good answer to this question a valuable skill for any senior architect.
A colleague once suggested that an architect’s key performance indicator (KPI) should be the number of decisions made. While decision making is a defining element of doing architecture, I had a feeling that making as many decisions as possible isn’t what drives my profession. Measuring an architect’s contribution by the number of decisions they’re making reminded me of trying to measure developers’ productivity in lines of code written. That metric is widely known as a bad idea because poor developers tend to write verbose code with lots of duplication, whereas good developers find short and elegant solutions to complex problems. After a little bit of pondering, I remembered one of Martin Fowler’s most popular articles that also involves decision making, but from a very different point of view. Reversing Irreversible Decision Making Many conventional definitions of software architecture include the notion of making difficult- (or expensive-) to-reverse decisions. Ideally, these decisions would be made early in the project to give the project a direction and avoid “analysis paralysis,” the dangerous state in a project when requirements gathering drags on without any code being written. Making critical decisions early comes with a major challenge, though: the beginning of the project is also the time of highest ignorance because little is known about the project as well as the technologies to be used. Therefore, architects are generally expected to draw on their ability to abstract from their past experience to get those decisions “right.” Consistent project cost and timeline overruns have hinted, though, that deciding the system structure early in a project is difficult at best, even for an all-knowing architect (Chapter 2). 
Martin Fowler concluded some time ago that the opposite is actually true: “one of an architect’s most important tasks is to eliminate irreversibility in software designs.”1 So, instead of entrusting all crucial decisions to one person, a project can be better off by minimizing the number of early and irreversible decisions. For example, choosing a flexible or modular design can localize the scope of a later change and thus minimize the extent of upfront decision making. Now one could posit that deciding on a modular design is a second-degree up-front decision—we’ll come back to that point later. The desire to make decisions up front is frequently driven by the project’s surrounding structures and processes as opposed to technical needs. For example, time-consuming budget approval and procurement processes may require teams to make product selections well before development can start. Likewise, enterprise software and hardware vendors have a tendency to push for early tooling decisions in order to secure a deal. They might promise unsuspecting IT management spectacular results, including reducing or removing the need for expensive programmers, if only their tool is chosen right from the start. So if the organization is better off with an architect not making decisions, how do we eloquently articulate this to upper management? Deferring Decisions with Options Communicating to upper management becomes easier if you avail yourself of the business’s concepts and vocabulary. Along the way you may even discover business concepts that lend themselves to a new way of looking at IT. Financial services present us with just that: options. Decision making is a common activity in financial services, especially in stock trading. Buying shares in a company requires you to put up cash now in hopes of a future return—somewhat similar to buying a new car (Chapter 6), though the future price is unknown. 
Now, if you could travel into the future and see the stock price one year from now, making a decision would be very easy as long as you could still buy the stock at today’s price. Time travel isn’t available quite yet, but the example makes clear that being able to defer a decision while fixing the parameters has value. That’s intuitive because you’ll know more by the time the decision has to be made, allowing you to make a better decision. I tend to buy my ski passes on the day, as I’ll have checked for good weather and snow conditions the night before. I choose to forgo a prepurchase discount for the value of being able to defer my decision. The closest approximation to time travel in financial services is the concept of a financial option. An option is defined as “the right, but not the obligation, to execute a financial transaction at fixed parameters in the future.” It’s a lot simpler to understand with an example: you may acquire the option to buy a stock for $100 (the “strike price”) in a year’s time (assuming a European option). After a year passes, it’s trivial to decide whether to exercise this option: if the stock trades higher than $100, you can instantly make money by exercising your option to buy the stock for $100 and selling it at a profit. If the actual stock price is less than $100, you let the option expire, meaning you don’t use your right to buy at $100. Incidentally, this doesn’t mean buying the option was a bad decision (see Chapter 6). An option allows you to defer a decision: instead of deciding to buy or sell a stock today, you can buy the option today and thus acquire the right to make that decision in the future.2 Good IT architecture can also offer options. For example, by coding in Java or another language that’s widely supported, you are offering an option to run that software on different operating systems, deferring that decision until a later date.
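The exercise decision is mechanical enough to express in a few lines of code. A minimal sketch using the chapter’s $100 strike (the sample stock prices are invented for illustration):

```python
def call_payoff(stock_price: float, strike: float) -> float:
    """Payoff of a European call at expiry: exercise only if profitable."""
    return max(stock_price - strike, 0.0)

# With a $100 strike, the decision at expiry is trivial:
print(call_payoff(120.0, 100.0))  # stock above strike: exercise and pocket 20.0
print(call_payoff(90.0, 100.0))   # stock below strike: let the option expire, 0.0
```

Note that the second case still loses only what was paid for the option itself, which is exactly the deferred-decision value the text describes.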
Luckily, your option won’t expire as long as Java keeps being supported on many platforms. Options Have Value The financial industry knows quite well that deferring decisions has value, and therefore an option has a price, C. There’s a whole market for buying and selling options and other derivatives. Two very smart gentlemen, Fischer Black and Myron Scholes, scored a Nobel prize for computing the value of an option, a formula known as the Black-Scholes formula:3

\[ C(S, t) = N(d_1)S - N(d_2)Ke^{-r(T-t)} \]
\[ d_1 = \frac{1}{\sigma\sqrt{T-t}}\left[\ln\left(\frac{S}{K}\right) + \left(r + \frac{\sigma^2}{2}\right)(T-t)\right] \]
\[ d_2 = d_1 - \sigma\sqrt{T-t} \]

There’s a lot going on here, but we can see how a few key parameters influence the price. For example, a higher strike price (K) reduces the value of the option, as we’d expect. We can also see that if the option is valid right now (T = t), the option price becomes the current price (S) minus the strike price (K). So, it’s nice to have mathematical proof that options have value: if architects sell options, that means they bring value! An Architecture Option: Elasticity Luckily, IT architects need neither complex formulas nor a Nobel prize. All you need to do is design your system such that it defers decisions. You accomplish that by providing options that can be exercised in the future. A classic example is server sizing: before deploying an application, traditional IT teams would conduct a lengthy sizing study to calculate the amount of hardware required to run the application. Sadly, infrastructure sizing leads to only one of two possible outcomes: either too large or too small, which results in either wasted money or a poorly performing application. What a great opportunity to defer some decisions! For this example, the option the architect creates is horizontal scaling, allowing compute resources to be added or subtracted at a later time. Clearly, this option has a value: infrastructure can be sized according to the application’s actual needs and can grow (or shrink) as required.
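The parameter effects of the Black-Scholes formula can also be checked numerically. A minimal sketch in Python, deriving the standard normal CDF from math.erf; all input values below are illustrative, not taken from the text:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, r, sigma, T, t=0.0):
    """Black-Scholes price of a European call: N(d1)*S - N(d2)*K*exp(-r*(T-t))."""
    tau = T - t                                   # time to maturity
    d1 = (log(S / K) + (r + sigma**2 / 2.0) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return norm_cdf(d1) * S - norm_cdf(d2) * K * exp(-r * tau)

# As noted above, a higher strike price (K) reduces the option's value:
at_the_money = black_scholes_call(S=100, K=100, r=0.02, sigma=0.2, T=1.0)
higher_strike = black_scholes_call(S=100, K=110, r=0.02, sigma=0.2, T=1.0)
assert higher_strike < at_the_money
```

Even this toy version makes the chapter’s point tangible: the option on a $100 stock has a positive price today, well before anyone knows where the stock will trade.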
Also, this option isn’t free, given that the system has to be designed to be able to scale out; for example, by making application components stateless or by using a distributed database. Essentially, you’re paying for the option with increased complexity. Given that complexity is one of the primary factors slowing down system delivery, it’s no small price to pay. Also, to take advantage of the application’s scale-out capability, you likely need to deploy the application on an elastic cloud platform, which might lock you into a particular vendor. So, in effect, you’re paying for one option by giving up another. Architecture options are rarely free. For example, you may pay with increased complexity or loss of another option. Just like financial options, architecture options also allow you to hedge your bets in case you want to limit your downside if a desired outcome doesn’t materialize. For example, providing an abstraction from a vendor-specific interface can hedge against the vendor increasing license fees or going out of business. Strike Prices Now all the architect can do is offer the option for sale, describing the nature and price of the option. Someone has to decide whether to buy it. As mentioned a moment ago, making an application horizontally scalable or adding a layer of indirection isn’t free, so while it might be good architectural practice, decision discipline teaches us to examine whether this option is actually needed. The financial world sells options with different strike prices, which is the price you pay per share when you exercise the option in the future. It’s easy to see (and reflected in the Black-Scholes formula) that options with a lower strike price command a higher up-front price: the lower the price to execute the option in the future, the higher your potential gain. It’s useful to note that the option still has value even if the strike price is higher than today’s price—after all, the price might increase in the future.
The effect translates easily into the earlier IT example: by migrating to a cloud provider, we can lower the strike price for horizontal scaling to near zero, thanks to full automation. However, this reduction in strike price isn’t free: you’ll most likely pay by being locked into this specific provider’s APIs, access control, account setup, and machine types. So, the strike price for switching providers will be high. To reduce the strike price for switching cloud providers, you can build an abstraction layer that allows you to move your applications to any cloud provider by clicking a button. Container platforms make this feasible, but you also need to abstract all your storage, billing, and access control needs. You may also be bound by commercial agreements. So, aiming for a near-zero-cost cloud migration carries a huge up-front development cost: this is an expensive option to buy. Considering the low chance of needing to switch providers, this option might not be worth buying.4 Minimizing the strike price—that is, the switching cost from one vendor to another—is often seen as the architectural ideal, but it’s rarely the most economical choice. Alternatively, consciously managing your application’s dependencies and deploying in containers might be a better balance. It carries a higher strike price—migrating will still incur some effort—but has a much lower up-front investment. Good architects offer a range of options at different strike prices and costs instead of aiming for a minimum strike price at all costs. Uncertainty Increases an Option’s Value Consequently, just as with financial markets, pricing and buying architecture options takes some consideration. There’s a second factor, though, that has a major impact on the value of an option: uncertainty. The more uncertain about the future I am, the more value I derive from deferring a decision.
For example, the option to scale horizontally isn’t that valuable if my application is built for a small and constant number of users. However, if I am building an internet-facing application that could have 100 or 100,000 users, the option becomes much more valuable. The same is true in the financial world: the Black-Scholes formula contains a critical parameter, σ (“sigma”), which indicates the volatility. You’ll find sigma squared in the numerator of the formula for d1, indicating a strong correlation between volatility and option price. The business not wanting to be involved in technical decisions leads to suboptimal decision making because IT alone can’t judge the value of an option. Instead, it’s the architect’s job to translate technical options into meaningful choices for the business. Therefore, architects who put up options for sale need to understand the context and its volatility. Most likely, such input needs to come from the business side and can’t be provided by IT alone. This implies that the business side stating that it doesn’t want to be involved in technical decisions is a bad idea, because it will lead to suboptimal decision making. Time Is Fleeting Another parameter influences an option’s value: time. The time at which the option can be exercised—that is, the option’s maturity date—is represented by the parameter T in the Black-Scholes formula, whereas the current time is identified as t. The further out the maturity is in the future, the higher the option’s value. This makes intuitive sense because your uncertainty increases the further you are looking into the future, making the option more valuable. Architects and project managers typically work under different time horizons and thus value the same option differently.
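Both parameter effects, volatility and time to maturity, can be checked numerically against the formula. A sketch with invented, illustrative parameters (an at-the-money option with a 2% risk-free rate):

```python
from math import log, sqrt, exp, erf

def bs_call(S, K, r, sigma, tau):
    """European call value for time to maturity tau = T - t."""
    N = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))  # standard normal CDF
    d1 = (log(S / K) + (r + sigma**2 / 2.0) * tau) / (sigma * sqrt(tau))
    return N(d1) * S - N(d1 - sigma * sqrt(tau)) * K * exp(-r * tau)

# Higher volatility makes the same option more valuable...
by_volatility = [bs_call(100, 100, 0.02, sigma, 1.0) for sigma in (0.1, 0.2, 0.4)]
assert by_volatility == sorted(by_volatility)

# ...and so does a longer time horizon: compare a project manager's
# six months against an enterprise architect's ten years.
assert bs_call(100, 100, 0.02, 0.2, 0.5) < bs_call(100, 100, 0.02, 0.2, 10.0)
```

The same option, same strike, same stock: only the uncertainty and the horizon differ, and the value changes substantially.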
This effect can help explain why architects and project managers often debate the merit of architecture options: project managers typically have a shorter time horizon than enterprise architects, who need to assure architectural integrity over many years and sometimes decades. Due to the different time horizons, each of them arrives at a different perceived (and, in fact, calculated) value for the same option. Interestingly, during such arguments, both parties are making rational but different decisions because their input parameters differ. A model, such as the options model, can help reduce such arguments to differences in input parameters and thus lead to better decision making. Real Options The idea of applying options theory outside of financial instruments isn’t limited to IT and is referred to as real options.5 Real options guide corporate investment decisions, such as acquisitions or buying real estate, and are commonly broken down into categories,6 which map very well to software architecture and projects:

Option to defer: the ability to make an investment, such as adding a feature, at a later time.

Option to abandon: the ability to use or resell parts of a project in case the project as planned has to be abandoned. In IT architecture, this option can equate to building self-contained modules or services that can be salvaged for use in other projects.

Option to expand: the ability to increase capacity; for example, to scale out an application by adding hardware.

Option to contract: the ability to elegantly reduce capacity; for example, by using elastic infrastructure.

Just like with buying hot chocolate (Chapter 17), we can learn from looking at the real world outside IT. Arbitrage In the financial world, markets are generally assumed to be efficient, meaning instruments are priced fairly according to their risk and expected return.
Every once in a while, though, someone figures out a way to make immediate returns through arbitrage, an opportunity to profit at no risk. Architects should similarly look out for such opportunities where they can provide options at very low cost. For example, using an open source object-relational mapping (ORM) framework is both best practice and an inexpensive option to make switching database vendors easier. Agile and Architecture Some Agile developers question architecture’s value because it was closely associated with a big, up-front-design approach that would look to make all decisions at the outset. Understanding architecture as providing options, you can easily see that the opposite is true. Both Agile methods and architecture are ways to deal with uncertainty, meaning that working in an Agile fashion allows you to benefit more from architecture. The value of both Agile methods and architecture increases with uncertainty, so they are friends, not foes. Evolutionary Architecture What should you do if meaningful options aren’t known, or at least not known far enough in advance? In that case, you need an architecture that can evolve along with your increased understanding of technology and customer needs—an approach that’s described as evolutionary architecture.7 Just like in natural history, what sets evolution apart from a series of changes is a fitness function that guides change by examining how well a solution serves an intended purpose. Choosing the right fitness function can now become the evolutionary architect’s contribution, rather than choosing a specific architecture up front. If you feel that’s an application of the well-known motto “all problems can be solved with one more level of indirection,” you might be onto something.
Amplifying Metaphors When I first shared the “selling options” metaphor with a senior financial services executive, the former head of asset management, he instantly embraced the metaphor and quickly concluded that higher volatility increases the value of an option. Translating this back into IT, he stated that in times of high uncertainty, as we are facing them today both in business and technology, the value of architecture options also increases. Businesses should therefore invest more into architecture. Isn’t it fantastic when a person from a different field adopts a metaphor and takes it to the next level? 1 Fowler, “Who Needs an Architect?” 2 You can also use options for selling shares, the so-called put options. These are commonly used to hedge against major drops in a stock price, essentially acting like an insurance policy for your investment. 3 Wikipedia, “Black–Scholes Model,” https://oreil.ly/2ZcmI. 4 Gregor Hohpe, “Don’t Get Locked Up into Avoiding Lock-in,” MartinFowler.com (2019), https://oreil.ly/jWDAW. 5 Stewart C. Myers, Determinants of Corporate Borrowing (Cambridge, MA: MIT Sloan School of Management, 1976). 6 Lenos Trigeorgis, Real Options: Managerial Flexibility and Strategy in Resource Allocation (Cambridge, MA: MIT Press, 1996). 7 Neal Ford, Rebecca Parsons, and Patrick Kua, Building Evolutionary Architectures (Sebastopol, CA: O’Reilly Media, 2017). Chapter 10. Every System Is Perfect… For What It Was Designed to Do! Analyzing system behavior Much of what architects do is reason about the behavior of complex systems: systems that have many pieces and complex interrelationships. There’s an entire field dedicated to such reasoning, called systems thinking or complex systems theory. While popular software architecture definitions focus on a system’s components and interrelationships, systems thinking emphasizes behavior (Chapter 8).
As architects, we should view structure simply as a means to achieve a desired behavior. Thinking in systems helps us do so. Heater as a System A residential heater provides a canonical example of a system, which we also look at when we realize that control is an illusion (Chapter 27). As demonstrated in Figure 10-1, a heating system’s typical architecture diagram would depict the components and their relationships: a furnace generates hot water or air, a radiator or air duct delivers the heat to the room, and a thermostat controls the furnace. The structural/control system theory point of view, shown at the top of the figure, considers the thermostat the central element: it switches the furnace on and off as needed to regulate the room temperature. Figure 10-1. A structural view (top) and a systems view (bottom) of a heater In contrast, the systems thinking point of view, at the bottom of Figure 10-1, focuses on the room temperature as the central variable and the reasons why it is influenced: the burning furnace increases the room temperature while heat dissipation to the outside reduces it. Heat dissipation depends on both the room and the outside temperature: in cold weather, more heat dissipates through walls and windows. That’s why smart heating systems increase the heating power in cold weather. In a way, systems thinking is a parallel universe that looks at the same system from a completely different angle, an angle that helps us better understand why we are building something. Feedback Loops Systems thinking helps us understand interrelated behavior; for example, feedback loops. The room thermostat establishes a negative feedback loop, which is typical for control systems: if the room temperature is too high, the furnace turns off, letting the room cool down again.
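The thermostat’s negative feedback loop is easy to sketch as a toy simulation. All constants below (heating rate, dissipation coefficient, starting temperature) are invented for illustration:

```python
def simulate_room(minutes, outside_temp, setpoint=21.0, hysteresis=0.5):
    """Toy thermostat loop: the furnace switches on below the setpoint minus
    hysteresis and off above the setpoint plus hysteresis, one-minute steps."""
    temp, furnace_on, history = 15.0, False, []
    for _ in range(minutes):
        if temp < setpoint - hysteresis:
            furnace_on = True              # too cold: turn the furnace on
        elif temp > setpoint + hysteresis:
            furnace_on = False             # too warm: turn it off
        heating = 0.08 if furnace_on else 0.0
        dissipation = 0.002 * (temp - outside_temp)  # loss grows with the gap
        temp += heating - dissipation
        history.append(temp)
    return history

# In winter the loop settles near the setpoint, oscillating slightly
# within the hysteresis band:
winter = simulate_room(360, outside_temp=0.0)
assert all(20.0 < t < 22.0 for t in winter[-60:])

# But the self-stabilizing range is limited: a heater cannot cool the room
# in the heat of summer, so the temperature drifts toward the outside.
summer = simulate_room(360, outside_temp=35.0)
assert summer[-1] > 22.0
```

The oscillation within the band and the breakdown in summer are exactly the two behaviors the text attributes to negative feedback loops.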
Negative feedback loops usually aim to keep a system in a relatively stable state—the room temperature will still oscillate slightly depending on the hysteresis of the thermostat and the inertia of the heating system. The self-stabilizing range of most systems is limited, though: a heater cannot cool a room in the heat of summer or compensate for an open window during winter. Positive feedback loops behave in the opposite way: an increase in one system variable fuels a further increase. We know the dramatic effects of such behavior from explosives (heat releases more oxygen to burn hotter), nuclear reactions (a classical “chain reaction”), or hyperinflation (a spiral of price and wage increases). Another positive feedback loop consists of more cars on the road leading to investments in roads as opposed to public transit, which makes it more compelling to commute by car. Likewise, rich people tend to have more investment options to achieve higher returns, leading to a “the rich getting richer” symptom, as for example described in Piketty’s Capital in the Twenty-First Century.1 Positive feedback loops can be dangerous due to their “explosive” nature. Policies are often designed to counteract such positive feedback loops with negative ones; for example, by applying higher tax rates to higher incomes or by increasing gasoline tax while subsidizing public transit. However, it’s difficult to balance out the exponential character of positive feedback loops. Thinking in systems helps us reason about such effects. Organized Complexity Gerald Weinberg2 highlighted the importance of thinking in systems by dividing the world into three areas: organized simplicity is the realm of well-understood mechanics, such as levers or electrical systems consisting of discrete resistors and capacitors. You can calculate exactly how these systems behave. 
On the other end of the spectrum, unorganized complexity doesn’t allow us to understand exactly what’s going on, but we can model the system as a whole statistically because the behavior is unorganized, meaning the parts don’t interrelate much. Modeling the spread of a virus falls into this category. The tricky domain is the one of organized complexity, where structure and interaction between components matter, but the system is too complex to solve it by using a formula. This is the area of systems. And the area of systems architecture. System Effects If we can’t determine system behavior with mathematical formulas, how can systems thinking help us? Complex systems, especially systems involving humans, tend to be subject to recurring system effects or patterns. These effects explain why fishermen keep overfishing, depleting their own livelihood, and why tourists flocking to the same destination destroy exactly what attracted them. Understanding these patterns allows us to better predict system behavior and influence it. Donella Meadows’s book Thinking in Systems3 contains a list of common effects, including these typical ones: Bounded rationality, a term coined by Nobel laureate Herbert A. Simon, captures the effect that people will generally do what is rational, but only within the context that they observe. For example, if an apartment building has a central heating system without consumption-based billing, people will leave the heater on all day and open the windows to cool down the apartment as needed. Obviously, this is a giant waste of energy and leads to pollution, resource depletion, and global warming. However, if your bounded context is just that of the temperature in your apartment and your wallet, this behavior is the rational thing to do, whether you like it or not: keeping the heater running allows you to control the room temperature more easily as you avoid the inertia of the heating system having to warm up. 
The idea of the tragedy of the commons derives from the concept of the commons, a shared pasture in old Irish and English villages that was open to grazing by all the villagers’ animals. As this common resource is free, villagers are incentivized to acquire more cattle to feed on the commons. Of course, as the commons is a finite resource, this behavior will lead to resource depletion and poverty; hence, the tragedy. One reason such a system doesn’t self-regulate is delay: the effect of the wrong behavior will only become apparent when it is too late. The complexity of these effects is underlined by the fact that Elinor Ostrom, the only woman to win the Nobel Prize in Economics, famously debunked4 the concept of the tragedy of the commons. Understanding System Behavior Systems documentation, especially in IT, tends to depict the static structure but rarely the behavior of the system. Ironically, however, the system’s behavior is what’s most interesting: systems generally exist to exhibit a certain, desirable behavior. For example, the heating system was created to keep the temperature in your house at a comfortable level. Server infrastructure is made redundant to increase availability. In both cases the system structure is simply a means to an end.

A system’s structure is simply a means to achieve a desired behavior.

The difficulty in deriving system behavior from its components can be illustrated by the heating system in my apartment, which supplies both floor heating and wall radiators with hot water and comprises a handful of major components: the gas burner heats the water inside a primary circuit driven by a built-in pump. Two additional external pumps feed the hot water from the primary circuit to the floor heating and wall radiators, respectively. A misconfiguration caused the secondary pumps to not draw enough water, and therefore heat, from the primary circuit, which quickly overheated.
This, in turn, caused the gas heater to shut off for a fixed duration, leading to a lack of heating power: naturally, the house cannot get warm when the heater is not burning. Because the house wouldn’t warm up, the technician’s intuition was to increase the burner’s heating power. However, this only exacerbated the problem: the system wasn’t able to move enough heat away from the primary circuit, so increasing the gas burner’s power only overheated it faster. After almost a dozen attempts, the heating system still wasn’t operating as designed: the technicians understood the individual system components but didn’t comprehend the complex system behavior. Seems a little complicated? For architects, this stuff is our daily bread and butter. Understanding complex interrelationships between system components and influencing them to achieve a desired behavior is what architects do. Often a good diagram (Chapter 22) will help. Influencing System Behavior Most of what users see from a system are events: things happening as a result of the system behavior, which in turn is determined by the system structure, which is often invisible. If the users are unhappy with those events, such as the heater shutting off despite the room being cold, they often try to effect a change, such as setting the room thermostat higher, without analyzing or changing the system behind them. The book Inviting Disaster5 provides dramatic examples of how misunderstanding a system led to major catastrophes such as the Three Mile Island nuclear reactor incident or the capsizing of the Ocean Ranger drilling platform. In both cases, compromised system displays led operators to perform the very action that caused the disaster because they didn’t understand the underlying system and its behavior from the events they observed. Their mental model deviated from the real system, causing them to make fatal decisions.
It has repeatedly been observed that humans are particularly bad at steering systems that have slow feedback loops, meaning those that exhibit reactions to changing inputs only after a significant delay. A classic example is MIT’s “beer game” in which participants on average perform almost 10 times worse than the ideal scenario.6 Overuse of credit cards is another classic example: people keep piling on debt until they are no longer able to pay even the interest and wonder how they got themselves into such a mess. Also, humans who don’t think about the system as a whole are prone to taking actions that have the opposite of the intended effect. For example, people react to overly full work calendars by setting up “blockers,” which make the calendars even fuller. Instead, we need to understand and fix what causes the full calendars; for example, a misaligned organizational structure that requires too many alignment meetings. You can’t fix a system by merely addressing the symptoms. Understanding system effects can help you devise more effective ways to influence the system and thus its behavior. For example, transparency is a useful antidote to the bounded rationality effect because it widens people’s bounds. An example from Donella Meadows’s book illustrates that having the electricity meter visible in the hallway caused people to be more conservative with their energy consumption without additional rules or penalties. Interestingly, systems thinking can be applied to both organizational and technical systems. We’ll learn this, for example, when we scale an organization (Chapter 30). John Gall’s Systems Bible7 gives a humorous but also insightful account of the ways in which systems behave, often against our intention or intuition. Systems Resist Change Changing systems is difficult not only because of their complex structure, but also because most of them actively resist change. 
Organizational systems’ resistance to change ensures longevity, for example through well-defined processes, but presents a challenge when a shift in the environment requires the organization to change. Frederic Laloux8 describes it as a key characteristic of amber organizations: they are built on the assumption that what worked in the past will work in the future, and it often served them well over thousands of years. As described in Chapter 7, if you request better documentation for architecture reviews, “the system” might respond by scheduling lengthy workshops that drain your available time. If you increase pressure, the system will respond with substandard documentation that increases your review cycles. You must therefore get to the root of the problem and highlight the value of good documentation, properly train architects, and allocate time for this task in project schedules. Most organizational systems have settled into a steady state over time and serve their purpose well enough. If the business environment demands a different system behavior, the system will actively resist by wanting to revert to its previous state. It’s like trying to push a car out of a ditch: the car keeps rolling back until you finally get it over the hump. This system effect is what makes organizational transformation so challenging. 1 Thomas Piketty, Capital in the Twenty-First Century (Boston: Belknap Press, 2014). 2 Gerald M. Weinberg, An Introduction to General Systems Thinking (Dorset House, 2001). 3 Donella H. Meadows, Thinking in Systems: A Primer (White River Junction, VT: Chelsea Green Publishing, 2008). 4 David Bollier, “The Only Woman to Win the Nobel Prize in Economics Also Debunked the Orthodoxy,” Evonomics, July 28, 2015, https://oreil.ly/9Na0H. 5 James R. Chiles, Inviting Disaster: Lessons from the Edge of Technology (New York: Harper Business, 2002). 6 John D. Sterman, “Modeling Managerial Behavior,” Management Science, Vol. 35, No. 3 (March 1989), https://oreil.ly/wrtzb.
7 John Gall, The Systems Bible, Third Edition (Walker, MN: General Systemantics Press, 2002). 8 Frederic Laloux and Ken Wilber, Reinventing Organizations: A Guide to Creating Organizations Inspired by the Next Stage in Human Consciousness (Nelson Parker, 2014). Chapter 11. Code Fear Not! Programming in a Poorly Designed Language Without Tool Support Is No Fun Who dares run this code? Yoda, the wise teacher of Jedi apprentice Luke Skywalker in the Star Wars movies, knows that fear leads to anger; anger leads to hate; hate leads to suffering. Likewise, corporate IT’s fear of code and the love of configuration can lead it down a path to suffering from which it is difficult to escape. Beware of the dark side, which has many faces, including vendors peddling products that “only require configuration,” as opposed to tedious, error-prone coding. Sadly, most complex configuration really is just programming, albeit in a poorly designed, rather constrained language without decent tooling or useful documentation. Fear of Code Corporate IT, which is often driven by operational considerations, tends to consider code the stuff that contains all the bugs, causes performance problems, and is written by expensive external consultants (Chapter 38) who are difficult to hold liable because they’ll have long moved to another project by the time the problems surface. Some IT leaders even proudly proclaim that they are a “proper business” and not a software development company, so they shouldn’t bother with coding stuff. The most grotesque example of fear of code I have observed was corporate IT providing application servers as a shared service. Once you deploy code on them, you’d no longer receive operational support. It’s like voiding a car’s warranty after you start the engine—after all, the manufacturer has no idea what you will do to it! Corporate IT’s eternal fear of code plays to the advantage of enterprise vendors who offer configuration as the safe alternative to coding. 
As we shall see, that’s a rather short-sighted proposition. Good Intentions Don’t Lead to Good Results IT’s aversion to coding originates from a good principle. Most enterprise IT rightly follows a buy-over-build strategy: buying commercial off-the-shelf (COTS) solutions not only saves IT departments time and money but also lets someone else worry about regular updates and security patches. Once purchased, solutions can be customized and configured to the enterprise’s specific needs. Likewise, common libraries and open source tools are a great way of reusing existing work. Open source tools are also often accompanied by an extensive community that can provide support and make technology adoption easier. For example, who would want to write their own XML serializer? There’s a library for that. There’s a catch, though…well, actually, two: first, if you expect to configure a piece of software that you bought, you are relying on the vendor having anticipated your specific customization need, meaning the vendor gave you the option (Chapter 9). Doing this well would mean the vendor has perfected big, up-front design, correctly anticipating all possible requirements, while the rest of us are still trying to become more Agile (Chapter 31). Second, configuration means working in an abstraction provided by the software vendor. Now, abstractions are generally a good thing because they allow you to get away from the nitty-gritty details, but some abstractions also come with downsides. Levels of Abstraction: Simplicity Versus Flexibility Raising the level of abstraction is one of the primary techniques that makes developers’ lives easier. Thanks to abstraction, very few programmers still write assembly code, read single data blocks from a hard disk, or put individual data packets onto the network. This level of detail has been nicely wrapped behind high-level languages, files, and socket streams.
These programming abstractions are very convenient and dramatically increase productivity: try doing without them! If abstractions are this useful, you might legitimately wonder whether adding further abstraction layers could boost productivity even more. For example, you could use libraries or services for all business functions. Ultimately, you could do away with coding altogether and allow solution development simply by, let’s say, configuration. If this sounds a bit too good to be true, that’s because it is. When raising the level of abstraction, you face a fundamental dilemma: how do you make a really simple model without losing too much flexibility? For example, if a developer needs rapid direct access to any file location, the file stream abstraction actually gets in the way because it requires reading files sequentially. The best abstractions are therefore those that solve and encapsulate the difficult part of the problem while leaving the user with sufficient flexibility. If an abstraction takes away too many or the wrong things, it becomes overly restrictive and no longer applicable. If it takes away too few things, it didn’t accomplish much in terms of simplification and hence isn’t very valuable. Or as Alan Kay elegantly stated: Simple things should be simple, complex things should be possible.1 MapReduce, a framework for distributed data processing, is a positive example: it abstracts away the gnarly parts of distributed data processing, such as controlling and scheduling many worker instances, dealing with failed nodes, aggregating data across nodes, and so on. But it nevertheless leaves the programmer enough flexibility to solve a wide range of problems and was used extremely widely within Google. When Are We Configuring? So, if configuration promises us to abstract away the details of programming, we should look a little closer at the trade-offs that were made.
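To make the MapReduce example concrete, here is a toy, single-process sketch of the idea (not Google’s implementation): the “framework” half handles grouping and aggregation, while the user supplies only the map and reduce functions—simple things stay simple, yet a wide range of problems remains possible.

```python
# Toy MapReduce sketch: the framework groups mapped key/value pairs by key
# and reduces each group; distribution and fault tolerance are omitted.
from collections import defaultdict
from functools import reduce

def map_reduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):   # user-supplied map step
            groups[key].append(value)
    # user-supplied reduce step, applied per key
    return {key: reduce(reducer, values) for key, values in groups.items()}

# Word count expressed entirely through the two user functions:
docs = ["to be or not to be"]
counts = map_reduce(docs,
                    mapper=lambda doc: [(w, 1) for w in doc.split()],
                    reducer=lambda a, b: a + b)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The user never touches the grouping machinery, but by swapping `mapper` and `reducer` the same skeleton computes sums, maxima, joins, and more.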
But before we get there, it turns out that it’s not even trivial to determine when something is configuration as opposed to coding. The notion of configuration mostly arises from conflating several unrelated aspects:

The representation (e.g., visuals versus text)

Whether you provide data or instructions

Whether you make changes before or after deployment

Let’s dissect each of these a bit. Model Versus Representation Coding abstractions such as libraries take away implementation details, but you’re still coding, although against more powerful objects and methods. Enterprise software abstraction often comes in different packaging: a graphical user interface (GUI) that enables spiffy drag-and-drop demos, which make the whole exercise seem trivial. At first sight, we might believe that painting a thin visual veneer over an existing programming model can provide a higher level of abstraction. Many business users might at first agree: typing in commands surely looks like coding, whereas drawing diagrams feels a lot more like PowerPoint. Unfortunately, that’s an illusion: a GUI changes the representation, but not the underlying model. A complex model, such as a workflow engine that includes concepts like concurrency, synchronization, correlation, long-running transactions, compensating actions, and more, inherently carries heavy conceptual weight: there’s a lot of stuff to consider. Wrapping it in pretty visual packaging can make it more appealing, but it won’t remove this weight. If your synchronization bar is drawn in the wrong place, your workflow is just as broken as when making a coding mistake. This isn’t to say visual representations have no value. For example, representing workflows as visual graphs can be naturally expressive. But although they may reduce some of the initial learning curve, they generally don’t scale very well: once applications grow, it becomes difficult to follow what’s going on via a giant canvas of symbols.
Zooming out means text won’t be readable anymore. Debugging and version control can also be a nightmare given that these tools mostly lack familiar diff functions. To test whether the visuals are just a thin veneer or really a better model, I generally apply two tests when vendors provide a demo of visual programming tools: I ask them to enter a typo into one of the fields where the user is expected to enter some logic. Often this leads to cryptic error messages or obscure exceptions in generated code later on. This is “tightrope programming”: as long as you stay exactly on the line, everything is well. One misstep and the deep abyss awaits you. I ask them to leave the room for two minutes while we change a few random elements of their demo configuration. Upon return, they would have to debug and figure out what was changed. So far, no vendor has taken the bait; they presumably know that failure doesn’t respect abstraction.2 Code or Data? Or Both? Leaving visuals aside, at which level of abstraction can we call something “configuration” versus “high-level programming”? We’ve seen that despite repeated vendor messaging, a visual user interface doesn’t suffice. Many programmers will tell you that files in XML (or JSON or YAML) syntax are configuration. However, anyone who has programmed in XSLT, which uses XML syntax, can attest that this isn’t configuration but heavy-duty declarative programming. There’s nothing simple about it. A better decision criterion could be whether what you provide to the system is executable or pieces of data. If the algorithm is predefined and you supply only a few key values, it may be fair to call this configuration. For example, let’s assume a program needs to classify users of different ages into children, adults, and seniors. The code will contain a chain of if-else or a switch statement. A configuration file could now supply the values for the decision thresholds; for example, 18 and 65. 
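A minimal sketch of this classification example (hypothetical code, not from any particular product) shows both sides of the coin: the thresholds are plain data, but supplying them in the wrong order still breaks the program.

```python
# The algorithm (the if-else chain) is fixed in code; the "configuration"
# supplies only the two threshold values, e.g. loaded from a config file.
def classify(age, thresholds):
    child_max, senior_min = thresholds
    if age < child_max:
        return "child"
    elif age >= senior_min:
        return "senior"
    else:
        return "adult"

# With the intended configuration, everything works:
print(classify(40, (18, 65)))  # adult

# Swap the two "safe" configuration values and the program still runs,
# but now every age below 65 counts as a child:
print(classify(40, (65, 18)))  # child
```

Typing two numbers spares the user from syntax and operator precedence, but not from misunderstanding how the algorithm consumes them.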
This would fit our definition of configuration. We might now conclude that changing those values is safe: typing in a number keeps you from having to understand programming language syntax and operator precedence. Alas, it doesn’t save you from screwing up the program. If you accidentally enter the values 65 and 18, the program is likely to not work as expected. The exact program behavior in this case is impossible to predict as it depends on the way the algorithm is coded. If the program checks for children first, you may have declared everyone as a child, whereas if the program checks for seniors first, you may have made everyone a senior. So while configuration is safer, it isn’t foolproof. The distinction between code and data blurs further when the data you enter determines execution order. For example, the “data” you enter may be a sequence of instruction codes. Or the data may resemble a declarative programming language; for example, to configure a rules engine or even XSLT. Aren’t coding instructions just data for the execution engine? Von Neumann3 would have said so. Apparently, it’s not so black and white. Deployment at Design-Time Versus Runtime Another way configuration is commonly distinguished from code is that we can change configuration after the application is deployed. This is certainly useful given that we can’t foresee some parameters until runtime; for example, the number of servers we need (Chapter 9). The distinction is based on the underlying assumption, though, that changing code is something that’s slow (because you have to rebuild and redeploy the whole application) and risky (because you may be introducing new defects). Microservices architectures and automated build-test-and-deploy chains put quite a few question marks behind these assumptions: they enable teams to rebuild, test, and deploy application code rapidly, repeatably, and with high quality. 
Rather than trying to anticipate changes for configuration, you may want to invest in your tool chain to allow incremental, rapid deployment. This doesn’t mean that configuration is useless, but it does mean that modern software delivery has given us other tools to achieve much of what configuration was intended to do. If we can make changes in the code, we don’t have to decide a priori which parameters we allow to be configured later, leading to much simpler code. Plus, we benefit from a vast range of tools like version control, editor support, and automated testing. When enterprise vendors tout their configuration suites, I challenge them to speed up their software delivery instead. The lack of tooling makes the common assumption that configuration is safer a questionable one. For example, “configuration changes” have caused major outages at several cloud service providers.4 Higher-Level Programming In many cases, what’s being passed off as configuration is really higher-level programming. For example, when composing distributed systems by connecting them via named message channels, “configuration” files often determine over which channel(s) a component communicates. Two components talk to each other when they use the same channel. Entering this data in local XML configuration files seems convenient, but it’s prone to mistakes because a simple typo would mean that components don’t communicate or are chained together in the wrong order. Composing a messaging system isn’t a matter of configuration, but a high-level programming model for the composition layer of the system. Treating the configuration files as first-class citizens by checking them into source control and by creating validation and management tools5 can help debugging and troubleshooting enormously. Configuration Programming Whenever there’s a choice to be made—in our case programming versus configuration—you can be assured that someone has found a compromise.
In our case this would be configuration programming:6 an approach that advocates the use of a separate configuration language to specify the coarse-grained structure of programs. Configuration programming is particularly attractive for concurrent, parallel, and distributed systems that have inherently complex program structures. Configuration Hiding as Code? So, is there a good place for configuration? Yes, for example, injecting runtime parameters into highly distributed programs or setting up cloud infrastructure (Chapter 14) are great use cases for configuration. Oddly, many of these approaches run under the moniker infrastructure as code (IaC) these days, even though most of the tools really are configuration. Someone must have felt that code sounds more powerful than configuration. Abstractions are a very useful technique, but believing that labeling something as “configuration” is going to eliminate complexity or the need to hire developers is a fallacy. Instead, think about whether this “configuration” is really higher-level programming. And in either case, make sure that it undergoes the same best practices of design, testing, version control, and deployment management that define modern software delivery. Otherwise, you’d have created a proprietary, poorly designed language without tooling support. Then you would have been better off coding. 1 Wikiquote, “Alan Kay,” https://oreil.ly/SBC39. 2 Gregor Hohpe, “Failure Doesn’t Respect Abstraction,” The Architect Elevator (blog), January 21, 2019, https://oreil.ly/ejTmy. 3 Wikipedia, “von Neumann architecture,” https://oreil.ly/ilzNC. 4 Benjamin Treynor Sloss, “An Update on Sunday’s Service Disruption,” Inside Google Cloud (blog), June 3, 2019, https://oreil.ly/yaGr6. 5 Gregor Hohpe, “Visualizing Dependencies,” Enterprise Integration Patterns (blog), July 12, 2004, https://oreil.ly/1j4-7. 6 FOLDOC, “Configuration Programming,” https://oreil.ly/DkiV0. Chapter 12.
If You Never Kill Anything, You Will Live Among Zombies And They Will Eat Your Brain The night of the living legacy systems Corporate IT lives among zombies: old systems that are half alive and have everyone in fear of going anywhere near them. They are also tough to kill completely. Worse yet, they eat IT staff’s brains. It’s like Shaun of the Dead minus the funny parts. Despite being a reality in corporate IT, living legacy systems are becoming more difficult to justify in a world that’s changing faster and faster. It’s time to put some zombies to rest. Legacy Legacy systems are built on outdated technology and are often poorly documented but (ostensibly) still perform important business functions. In many cases, the exact scope of the function they perform is not completely known. Ironically, most legacy systems generate a lot of revenue because otherwise they would have been killed a long time ago. When discussing what sets modern “digital” companies apart from traditional ones, “lack of legacy” regularly comes up as a key factor. Systems fall into the state of legacy because technology moves faster than the business: life insurance systems often must maintain data and functionality for decades, rendering much of the technology used to build the system obsolete. With a bit of luck, the systems don’t have to be updated anymore, so IT might be inclined to “simply let it run,” following the popular advice to “never touch a running system.” Unfortunately, changing regulations or security vulnerabilities in old versions of the application or the underlying software stack are likely to interfere with such an approach. Traditional IT sometimes justifies their zombies with having to support the business: how can you shut down a system that may be needed by the business? They also feel that digital companies don’t have such problems because they are too young to have accumulated legacy. 
The 150 Google developers attending Michael Feathers’s talk about Working Effectively with Legacy Code1 might make us question this assumption. Because Google’s systems evolve rapidly, they also accumulate legacy more quickly than traditional IT. So it’s not that they have been blessed with a lack of legacy; they must have found a better way of dealing with it. Fear of Change Systems become legacy zombies by not evolving with the technology. This happens in classic IT largely because change is seen as a risk (Chapter 26). Once again: “never touch a running system.” System releases are based on extensive, often manual test cycles that can last months, making updates or changes a costly endeavor. Worse yet, there’s no “business case” for updating the system technology. This widespread logic is about as sound as considering changing the oil in your car a waste of money—after all, the car still runs if you don’t. And it even makes your quarterly profit statement look a little better; that is, until the engine seizes. Slogans like “Never touch a running system” reflect the belief that change bears risk. A team from Credit Suisse described how to counterbalance this trap in its aptly titled book Managed Evolution.2 The key driver for managed evolution is to maintain agility in a system. A system that no one wants to touch has no agility at all: it can’t be changed. In a very static business and technology environment, this might not be all that terrible, but that’s not the environment we live in anymore! In today’s environment, the inability to change a system becomes a major liability for IT and the business. Hoping for the Best Isn’t a Strategy Most things are the way they are for a reason. This is also true for the fear of change in corporate IT. These organizations typically lack the tools, processes, and skills to closely observe production metrics and to rapidly deploy fixes in case something goes awry.
They hence try to test for all scenarios before deploying and then run the application more or less “blind,” hoping that nothing breaks. This behavior aims to maximize MTBF—the mean time between failures. While increasing the time between failures is a worthwhile approach, focusing on MTBF alone has two major downsides. First, it slows down hardware provisioning and software deployment due to excessive up-front testing. It also leads to a situation where the response to an actual failure becomes “this wasn’t supposed to happen.” It’s unlikely that those are the words you want to hear from an operations team. Such teams often ignore the other side of the equation: the mean time to recovery (MTTR). This metric indicates how quickly a system can recover from an error. Modern teams look at both aspects. As an analogy, you’d want to use fire-retardant materials but also have a fire brigade that can be onsite in a few minutes. The top benchmark for incident response time I observed was at a large chemical factory where the fire brigade would be at the incident site in 45 seconds (!). Airports generally achieve two to three minutes.3 Traditional organizations “hope for the best” by relying on ways to maximize MTBF, whereas modern organizations also “prepare for the worst” by minimizing MTTR. Reducing MTTR involves very different mechanisms such as high system transparency, version control, and automation. In fact, reducing MTTR is such a game changer for IT organizations that it’s one of the four software delivery performance measures used by the authors of the book Accelerate.4 Version Upgrades The zombie problem is not limited to systems written in PL/1 running on an IBM/360, though. Often, updating basic runtime infrastructures like application servers, JDK versions, browsers, or operating systems scares the living daylights out of IT, causing version updates to be deferred until the vendor ceases support.
The natural reaction then is to pay the vendor for extended support to avoid the horror scenario of having to migrate your software to a new version. Often the inability to migrate cascades across multiple layers of the software stack: one cannot upgrade to a newer JDK because it doesn’t run on the current application server version, which can’t be updated because it requires a new version of the operating system, which deprecates some library or feature the software depends on. I have seen IT shops that are stuck on Internet Explorer 6 because their software utilizes a proprietary feature not present in later versions. Looking at the user interfaces of most corporate applications, you would find it difficult to imagine that they eked out every little bit of browser capability. They surely would have been better off not depending on such a peculiar feature and instead being able to benefit from browser evolution. Such a line of thought requires a conscious trade-off between optimizing for the short term versus assuring long-term velocity (Chapter 3). Ironically, IT’s widespread fear of code (Chapter 11) leads it down a dark and narrow road toward heavily customized frameworks. Version upgrades become very difficult and expensive to make, and another zombie grows. Anyone who has done an SAP upgrade can relate. Run Versus Change The fear of change is even encoded in many IT organizations that separate “run” (operating) from “change” (development), establishing that running software doesn’t imply change. Rather, it’s the opposite of change, which is done by application development—those guys who produce the flaky code IT is afraid of. Structuring IT teams this way will guarantee that systems will age and become legacy because no change could be applied to them. You might think that by not changing running systems, IT can keep the operational cost low. 
Ironically, the opposite is true: many IT departments spend more than half of their IT budget on “run” and “maintenance,” leaving only a fraction of the budget for “change” that can support the evolving demands of the business. That’s because running and supporting legacy applications is expensive: operational processes are often manual; the software may not be stable, necessitating constant attention; the software may not scale well, requiring the procurement of expensive hardware; lack of documentation means time-consuming trial-and-error troubleshooting in case of problems. These are reasons why legacy systems tie up valuable IT resources and skills, effectively devouring the brains of IT that could be applied to more useful tasks; for example, delivering features to the business. Planned Obsolescence When selecting a product or conducting a request for proposal (RFP), classic IT tends to compile a list containing dozens or hundreds of features or capabilities that a candidate product has to offer. Often, these lists are created by external consultants unaware of the business need or the company’s IT strategy. However, they can produce very long lists, and longer appears to be better to some IT staff, whose main motivation lies in demonstrating that the selection was “thorough.” To cite another car analogy, this is a bit like evaluating a car by having an endless list of more or less (ir)relevant features like “must have a 12V lighter outlet,” “speedometer goes above 200 km/h,” “can turn the front wheels,” and then scoring a BMW versus a Mercedes for these. How likely this is to steer (pun intended) you toward the car you will enjoy the most is questionable at best. One item routinely missing from such “features” lists is planned obsolescence: how easy is it to replace the system? Can the data be exported in a well-defined format? Can business logic be extracted and reused in a replacement system to avoid vendor lock-in? 
During the new product selection honeymoon, this can feel like discussing a prenup5 before the wedding—who likes to think about parting ways when you are about to embark on a lifelong journey? In the case of an IT system, you better hope the journey isn’t lifelong; systems are meant to come and go. So better to have a prenup in place than being held hostage by the system (or vendor) you are trying to part with. If It Hurts, Do It More Often How do you break out of the “change is bad” cycle? As mentioned earlier, without proper instrumentation and automation, making changes is not only scary but indeed risky. The reluctance to upgrade or migrate software is similar to the reluctance to build and test software often. Martin Fowler issued the best advice to break this cycle: “If it hurts, do it more often.” Behind the provocative name sits the insight that deferring a painful task generally makes it disproportionately more painful: if you haven’t built your source code in months, it’s guaranteed not to go smoothly. Likewise, if the application server your software is running on is three versions behind, you’ll have the migration from hell. Performing such tasks more frequently provides a forcing function to automate some of the processes; for example, with automated builds or test suites. Dealing with migration problems will also become routine. This is the reason emergency workers train regularly; otherwise, they’ll freak out in case of an actual emergency and won’t be effective. Of course, training takes time and energy. But what’s the alternative? Culture of Change Digital companies also have to deal with change and obsolescence. The going joke at Google was that every API had two versions: the obsolete one and the not-yet-quite-ready one. Actually, it wasn’t a joke, but pretty close to reality. Dealing with constant change is painful at times—every piece of code you write could break at any time because of changes in its dependencies. 
But living this culture of change allows Google to keep up the pace (Chapter 35), which is the most important of today’s IT capabilities. Sadly, it’s rarely listed as a performance indicator for project teams. Even Shaun knows that zombies can’t run fast. 1 Michael Feathers, Working Effectively with Legacy Code (Upper Saddle River, NJ: Prentice Hall, 2004). 2 Stephan Murer and Bruno Bonati, Managed Evolution: A Strategy for Very Large Information Systems (Berlin: Springer, 2011). 3 Wikipedia, “Airport crash tender,” https://oreil.ly/e4DNF. 4 Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations (Portland, OR: IT Revolution, 2018). 5 A prenuptial agreement often clarifies asset division in case of a divorce. Chapter 13. Never Send a Human to Do a Machine’s Job Automate Everything; What You Can’t Automate, Make a Self-Service Sending a machine to do a human’s job Who would have thought that you can learn so much about large-scale IT architecture from the movie trilogy The Matrix? Acknowledging that the Matrix is run by machines, it should not be completely surprising to find some nuggets of system design wisdom, though: Agent Smith teaches us that one should never send a human to do a machine’s job after his deal with Cypher, one of Morpheus’ human crew members, to betray and hand over his boss failed. Automate Everything! There’s a certain irony in the fact that corporate IT, which has largely established itself by automating business processes, is often not very automated itself. Early in my corporate career, I shocked a large assembly of infrastructure architects by declaring my strategy as: “automate everything and make those parts that can’t be automated a self-service.” The reaction ranged from confusion and disbelief to mild anger. Still, this is exactly what Amazon et al. have done. 
And it has revolutionized how people procure and access IT infrastructure along the way. These companies have also attracted the top talent in the industry to build said infrastructure. If corporate IT wants to remain relevant, this is the way it ought to be thinking! It’s Not Only About Efficiency Just like test-driven development is not a testing technique (it’s primarily a design technique), automation is not just about efficiency but primarily about repeatability and resilience. A vendor’s architect once stated that automation shouldn’t be implemented for infrequently performed tasks because it isn’t economically viable. Basically, the vendor calculated that writing the automation would take more hours than would ever be spent completing the task manually (the vendor also appeared to be on a fixed-price contract). I challenged this reasoning with the argument of repeatability and traceability: wherever humans are involved, mistakes are bound to happen, and work will be performed ad hoc without proper documentation. That’s why you don’t send humans to do a machine’s job. The error rate is actually likely to be the highest for infrequently performed tasks because the operators lack routine. The second counterexample is disaster scenarios and outages: we hope that they occur infrequently, but when they happen, the systems had better be fully automated to make sure they can return to a running state as quickly as possible. The economic argument here isn’t about saving manpower but minimizing the loss of business during the outage, which far exceeds the manual labor cost. To appreciate this thinking, you need to understand economies of speed (Chapter 35). Otherwise, you may as well argue that the fire brigade should use a bucket chain because all those fire trucks and pumps are not economically viable given how rarely buildings actually catch fire.
Repeatability Grows Confidence When I automate tasks, the biggest immediate benefit I usually derive is increased confidence. For example, when I wrote the original self-published version of the book in Markdown, I had to maintain two slightly different versions: the ebook version used hyperlinks for chapter references, whereas the print version used chapter numbers. After quickly becoming tired of manually converting between the formats, I developed two simple scripts that switch between print and epub versions of the text. Because it was easy to do, I also made the scripts idempotent, meaning that running a script multiple times caused no harm. With these scripts at hand, I didn’t worry for even a split second about switching between formats because I could be assured that nothing would go wrong. Automation is hugely liberating and hence speeds up work significantly. Self-Service Once things are fully automated, users can directly execute common procedures in a self-service portal. To provide the necessary parameters—for example, the size of a server—they must have a clear mental model of what they are ordering. Amazon Web Services provides a good example of an intuitive user interface, which not only alerts you that your server is reachable from any computer in the world but even detects your IP address to make it easy to restrict access. By contrast, when filling out the spreadsheet required to order a Linux server, I was told to just copy the network settings from an existing server because I wouldn’t be able to understand what I need anyway. Designing good user interfaces can be a challenging but valuable exercise for infrastructure engineers who are largely used to working in hiding on rather esoteric “plumbing.” It’s also a chance for them to show the Pirate Ship (Chapter 19), which is far more exciting than all the bits and pieces it’s made out of. Self-service gives you better control, accuracy, and traceability than semimanual processes.
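The idempotent conversion described above can be sketched in a few lines. The function below is a hypothetical illustration, not the author’s actual script: it rewrites ebook-style Markdown chapter links into print-style chapter numbers. Because the link pattern no longer matches once converted, running it a second time changes nothing, which is exactly what makes it safe to run without a second thought.

```python
import re

# Hypothetical link format: "[Chapter 13](ch13.md)" in the ebook source
LINK = re.compile(r"\[Chapter (\d+)\]\(ch\d+\.md\)")

def to_print_edition(text: str) -> str:
    """Rewrite chapter links to plain chapter numbers.

    Idempotent: after the first conversion the pattern no longer
    matches, so applying the function again causes no harm.
    """
    return LINK.sub(lambda m: f"Chapter {m.group(1)}", text)
```

Applying `to_print_edition` twice yields the same result as applying it once, so the script can be run at any time without first checking which format the text is currently in.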
Self-service doesn’t at all imply that infrastructure changes become a free-for-all. Just like a self-service restaurant still has a cashier, validations and approvals apply to users’ change requests. However, instead of a human having to re-enter a request submitted in free-form text or an Excel spreadsheet, when a self-service request is approved the workflow pushes the requested change into production without further human intervention and without the associated possibility of error. Self-service also reduces input errors: free-form text fields or Excel spreadsheets rarely perform validations, so input errors lead to lengthy email cycles or pass through unnoticed. An automated approach gives immediate feedback to the user and makes sure the order actually reflects what the user needs. Beyond Self-Service Self-service portals are a major improvement over emailing spreadsheets. However, the best place for configuration changes is the source code repository, where approvals can be handled via pull requests and merge operations. Approved changes trigger an automated deployment into production. Source code management has long known how to administer large volumes of complex changes through review and approval processes, including commenting and audit trails. You should leverage these processes for configuration changes so that you can start to think like a software developer (Chapter 14). Because it seems that any good idea needs a buzzword these days, using a source repository to manage code and configuration is now referred to as “GitOps.” Most enterprise software vendors pitch GUIs as the ultimate in ease of use and cost reduction. However, in large-scale operations the opposite is the case: manual entry into user interfaces is cumbersome and error prone, especially for repeated requests or complex setups. If you need 10 servers with slight variations, would you want to enter this data 10 times by hand?
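Scripted against an API, those 10 slightly varied servers become a few lines of data instead of 10 manual form entries. The sketch below is purely illustrative: the `ServerSpec` fields and the commented-out `provision()` call stand in for whatever a real provisioning API would offer; they are assumptions, not any vendor’s actual interface.

```python
from dataclasses import dataclass

@dataclass
class ServerSpec:
    """Illustrative server order; fields are hypothetical."""
    name: str
    cpus: int
    ram_gb: int

def build_specs(count: int = 10) -> list[ServerSpec]:
    # Slight variations expressed as data: the last two nodes get
    # more CPUs, everything else stays identical.
    return [
        ServerSpec(name=f"web-{i:02d}",
                   cpus=8 if i >= count - 2 else 4,
                   ram_gb=16)
        for i in range(count)
    ]

specs = build_specs()
# for spec in specs:
#     provision(spec)   # hypothetical API call: one per server,
#                       # repeatable, reviewable, version-controllable
```

The point isn’t the ten lines saved; it’s that the variation now lives in one reviewable expression rather than in ten error-prone form submissions.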
Fully automated configurations should therefore be done via APIs, which can be integrated with other systems or scripted as part of higher-level automation. I once set a rule that no infrastructure changes could be made from a user interface but had to be done through version-controlled automation. This put a monkey wrench into many vendor demos. Allowing users to specify what they want and providing it quickly in high quality would seem like a pretty happy scenario. However, in the digital world, you can always push things a little further. For example, Google’s “zero-click search” initiative, which resulted in Google Now, considered even one user click too much of a burden, especially on mobile devices. The system should anticipate the users’ needs and answer before a question is even asked. It’s like going to McDonald’s and finding your favorite Happy Meal already waiting for you at the counter. Now that’s customer service! An IT world equivalent may be autoscaling, which allows the infrastructure to automatically provision additional capacity in high-load situations without any human intervention. Automation Is Not a One-Way Street Automation usually focuses on the top-down part; for example, configuring a piece of low-level equipment based on a customer order or the needs of a higher-level component. However, we will learn that control can be an illusion (Chapter 27) wherever humans are involved. Also, “control” necessitates two-way communication that references the current system state: when your room is too hot, you want the control system to turn on the air conditioning instead of the heater. The same is true in IT system automation: to specify how much hardware to order or what network changes to request, you likely first need to examine the current state. Therefore, full transparency on existing system structures and a clear vocabulary are paramount.
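The thermostat analogy boils down to a minimal reconciliation rule: automation observes the current state, compares it to the desired state, and only then picks the corrective action. The snippet below is a deliberately tiny sketch of that two-way loop, not any particular control system’s API:

```python
def corrective_action(current_temp: float, desired_temp: float,
                      tolerance: float = 0.5) -> str:
    """Two-way control: the chosen action depends on the observed
    state, not just on the desired state."""
    if current_temp > desired_temp + tolerance:
        return "cool"   # room too hot: air conditioning, not the heater
    if current_temp < desired_temp - tolerance:
        return "heat"
    return "idle"       # within tolerance: do nothing
```

The same shape applies to infrastructure automation: "order two more servers" only makes sense after reading how many are currently running, which is why transparency on the current state comes first.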
In one case, it took us weeks just to understand whether a datacenter had sufficient spare capacity to deploy a new application. All the order-process automation in the world doesn’t help if it takes weeks to understand the current state of affairs. If you manage to fully automate and make your infrastructure immutable, meaning no manual changes are allowed at all, you can start working under the assumption that reality matches what’s specified in the configuration scripts. In that case, transparency becomes trivial: you just look at the scripts. While such a setup is a desirable end-state, it might take significant effort to consistently implement across a large IT estate. For example, legacy hardware or applications might not be automatable. Explicit Knowledge Is Good Knowledge Tacit knowledge is knowledge that exists only in employees’ heads but isn’t documented or encoded anywhere. Such undocumented knowledge can be a major overhead for large or rapidly growing organizations because it can easily be lost and requires new employees to relearn things the organization already knew. Encoding tacit knowledge, which existed only in an operator’s head, into a set of scripts, tools, or source code makes these processes visible and eases knowledge transfer. Tacit knowledge is also a sore spot for any regulatory body whose job it is to assure that businesses in regulated industries operate according to well-defined and repeatable principles and procedures. Full automation forces processes to be well defined and explicit, eliminating unwritten rules and undesired variation inherent in manual processes. As a result, automated systems are easier to audit for compliance. Ironically, classic IT often insists on manual steps in order to maintain separation of duty, ignoring the fact that manually approving an automated process achieves both separation of duty and repeatability. A Place for Humans If we automate everything, is there a place left for humans?
Computers are much better at executing repetitive tasks, but even though we humans are no longer unbeatable at the board game Go, we are still number one in coming up with new and creative ideas, designing things, or automating stuff. We should stick to this separation of duty and let the machines do the repeatable tasks without fearing that Skynet will take over the world at any moment. Chapter 14. If Software Eats the World, Better Use Version Control! When Your Infrastructure Becomes Software-Defined, You Need to Think Like a Software Developer Software eats infrastructure If software does indeed eat the world, it will have IT infrastructure for breakfast: the rapidly advancing virtualization of infrastructure from VMs and containers to serverless architectures turns provisioning code onto a piece of hardware into a pure software problem. While this is an amazing capability and one of the major value propositions of cloud computing, corporate IT’s uneasy relationship with code (Chapter 11) and lack of familiarity with the modern development life cycle can make this a dangerous proposition. SDX: Software-Defined Anything Much of traditional IT infrastructure is either hardwired or semi-manually configured: servers are racked and cabled, network switches are manually configured with tools or configuration files. Operations staff, who endearingly refer to their equipment as “metal,” are usually quite happy with this state of affairs: it keeps the programmer types away from critical infrastructure where the last thing you need is bugs and stuff like “Agile” development, which is still widely misinterpreted (Chapter 31) as doing random stuff and hoping for the best. This is rapidly changing, though, and that’s a good thing. The continuing virtualization of infrastructure makes resources that were once shipped by truck or wired by hand available via a call to a cloud service provider’s API.
It’s like going from haggling in a car dealership and waiting four months for delivery just to find out that you should have ordered the premium seats after all, to hailing an Uber from your mobile phone and being shuttled off three minutes later. Virtualized and programmable infrastructure is an essential element to keeping up with the scalability and evolution demands of digital applications. You can’t run an Agile business model when it takes you four weeks to get a server and four months to get it on the right network segment. Operating system–level virtualization is by no means a new invention, but the “software defined” trend has extended to software-defined networks (SDNs) and full-blown software-defined datacenters (SDDC). If that isn’t enough, you can opt for SDX—software-defined anything, which includes virtualization of compute, storage, network, and whatever else can be found in a datacenter, hopefully in some coordinated manner. Other marketing departments coined the term infrastructure as code (IaC), apparently oblivious to the fact that their tools mostly accomplish it via configuration, not code (Chapter 11). As so often, it’s possible to look into the future of IT by reading Google’s research papers describing its systems of five-plus years ago (the official paper on Borg,1 Google’s cluster manager, was published in 2015, almost a decade after its internal introduction). To get a glimpse of where SDN is headed, look at what Google has done with the so-called Jupiter Network Architecture.2 If you are too busy to read the whole thing, this three-liner will do to get you excited: Our latest-generation Jupiter network [...] delivering more than 1 Petabit/sec of total bisection bandwidth. This means that each of 100,000 servers can communicate with one another in an arbitrary pattern at 10 Gb/s.
Such capability can be achieved only by having a network infrastructure that can be configured based on the applications’ needs and is considered as an integral part of the overall infrastructure virtualization. The Loomers’ Riot? New tools necessitate a new way of thinking, though, to be useful. It’s the old “a fool with a tool is still a fool.” I actually don’t like this saying because you don’t have to be a fool to be unfamiliar with a new tool and a new way of thinking. For example, many folks in infrastructure and operations are far detached from the way contemporary software development is done. This doesn’t make them fools in any way, but it prevents them from migrating into the “software-defined” world. They might never have heard of unit tests, continuous integration (CI), or build pipelines. They may have been led to believe that “Agile” is a synonym for “haphazard” and also haven’t had enough time to conclude that immutability is an essential property because rebuilding/regenerating a component from scratch beats making incremental changes. As a result, despite being the bottleneck in an IT ecosystem that demands ever-faster changes and innovation cycles, operations teams are often not ready to hand over their domain to the “application folk” who can script the heck out of the software-defined anything. One could posit that such behavior is akin to the Loomer Riots because the economic benefits of a software-defined infrastructure are too strong for anyone to put a stop to it.3 At the same time, it’s important to get those folks on board who keep the lights on and who understand the existing systems the best. So, we can’t ignore this dilemma. If software eats the world, there will be only two kinds of people: those who tell the machines what to do and those for whom it’s the other way around. Explaining to everyone What Is Code?4 can be a useful first step. Having more senior management role models who can code would be another good step. 
However, living successfully in a software-defined world isn’t a simple matter of learning programming or scripting. Software Developers Don’t Undo, They Re-Create A vivid example of how software developers think differently is reversibility; that is, the ability to quickly revert to a known stable state if a new configuration isn’t working. When our team requested the ability to revert to a known good infrastructure configuration state from an infrastructure vendor, the response was that this would require an explicit “undo” script for each possible action, a huge additional investment in their eyes. Apparently, they didn’t think like software developers. With manual updates, reverting to a known good state is very difficult and time consuming at best. In a software-defined world, it’s much easier. Experienced software developers know that if their automated build system can build an artifact, such as a binary image or a piece of configuration, from scratch, they can easily revert to a previous version. So, rather than explicitly undoing a change, these developers reset version control to the last known good version, rebuild from scratch, and republish this “undone” configuration, as illustrated in Figure 14-1. Figure 14-1. A traditional and a version-controlled mindset This mindset stems from software being ephemeral—re-creating it from scratch isn’t a major effort. By making infrastructure software-defined, it can also become ephemeral. This is a huge shift in mindset, especially when you consider the annual depreciation cost of all that hardware. But only by thinking this way can you reap the true benefit of being software defined. In complex software projects, rolling things back is a perfectly normal procedure, often instigated by the so-called “build cop” after failing automated tests cause the build to go “red.” The build cop will ask the developer who checked in the offending code to make a quick fix or simply revert that code submission.
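The difference between the two mindsets fits in a few lines: instead of writing an inverse operation for every possible change, you keep versioned sources and re-create the artifact from any revision. The snippet below is a schematic illustration of that idea, with a plain dictionary standing in for a version-control history and a trivial function standing in for a build system:

```python
# Schematic: a "build" is a pure function of versioned sources,
# so reverting needs no undo script; you simply rebuild from the
# last known good revision.
history = {
    1: {"replicas": 2, "port": 8080},   # known good configuration
    2: {"replicas": 2, "port": 9090},   # bad change just submitted
}

def build(revision: int) -> dict:
    """Re-create the deployed configuration entirely from the
    versioned sources of the given revision."""
    return dict(history[revision])

deployed = build(2)   # the new configuration misbehaves...
deployed = build(1)   # ...so rebuild from revision 1: no undo logic needed
```

No operation in this sketch knows how to reverse itself; reversibility falls out of the combination of versioned sources and a repeatable build.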
Configuration automation tools have a similar ability to regain a known stable state and can be applied to reverting and automatically reconfiguring infrastructure configurations. Melt the Snowflakes Software-defined infrastructure shuns the notion of “snowflake” or “pet” servers—servers that have been running for a long time without a reinstall, have a unique configuration (just as every snowflake is unique, “snowflake servers” are those that don’t match a standard configuration), and are manually maintained with great care. “This server has been up for three years” isn’t bragging rights but a risk: who could re-create this “pet” server if it does go down? In a software-defined world, a server or network component can be reconfigured or re-created automatically with ease, similar to re-creating a Java build artifact. You no longer have to be afraid to mess up a server instance because it can easily be re-created via software in minutes. Software-defined infrastructure therefore isn’t just about replacing hardware configuration with software, but primarily about adopting a rigorous development life cycle based on disciplined development, automated testing, and CI. Over the past decades, software teams have learned how to move quickly while maintaining quality. Turning hardware problems into software problems allows you to take advantage of this body of knowledge. Automated Quality Checks One of Google’s critical infrastructure pieces was a router, which would direct incoming traffic to the correct type of service instance. For example, HTTP requests for maps.google.com would be forwarded to a service serving up maps data, as opposed to the search page. The router was configured via a file consisting of hundreds of regular expressions. Of course, this file was under version control, as it should be.
Despite rigorous code reviews, invariably someday someone checked a misconfiguration into the service router, which immediately brought down most of Google’s services because the requests weren’t routed to the corresponding service instance. Luckily, the previous version was quickly restored thanks to version control. Google’s answer wasn’t to disallow changes to this file, because that would have slowed things down. Rather, automatic checks were added to the code submit pipeline to make sure that syntax errors or conflicting regular expressions are detected before the file is checked into the code repository. When working with software-defined infrastructure, you need to work like you would in professional software development. Use Proper Language One curiosity about Google is that no one working there ever used buzzwords like “big data,” “cloud,” or “software-defined datacenter” because Google had all these things well before these buzzwords were created by industry analysts. Much of Google’s infrastructure was already software defined more than a decade ago. As the scale of applications grew, configuring the many process instances that were being deployed into the datacenter became tedious. For example, if an application consists of seven frontends, 1 through 7, and two backends, A and B, frontends 1 through 4 would connect to backend A, whereas frontends 5 to 7 would connect to backend B. Maintaining individual configuration files for each instance would be cumbersome and error prone, especially as the system scales up and down. Instead, developers generated configurations via a well-defined functional language called Borg Configuration Language (BCL), which supports templates, value inheritance, and built-in functions like map() that are convenient for manipulating lists of values. 
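Generating such a mapping programmatically, rather than hand-maintaining one file per instance, might look like the sketch below. Plain Python stands in for BCL here, since BCL itself isn’t publicly available; the names and the assignment rule come straight from the example in the text:

```python
# Illustrative stand-in for a generated configuration: compute each
# frontend's backend assignment instead of maintaining nine files.
def generate_config(num_frontends: int = 7) -> dict[str, str]:
    """Frontends 1-4 connect to backend A, frontends 5+ to backend B."""
    return {
        f"frontend-{i}": ("backend-A" if i <= 4 else "backend-B")
        for i in range(1, num_frontends + 1)
    }

config = generate_config()
# Scaling up or down means changing one number, not editing files by hand.
```

The payoff is the same one BCL delivered: the rule lives in one place, so scaling the system changes a parameter instead of multiplying hand-edited files.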
Learning a custom functional language to describe deployment descriptors avoids the trap of configuration files (Chapter 11) but may not be everyone’s cup of tea; for software developers, though, that’s the natural approach. When configuration programs became more complex, causing testing and debugging configurations to become an issue, folks wrote an interactive expression evaluator and unit testing tools. That’s what software people do to solve a problem: solve software problems with software! The BCL example highlights what a real software-defined system looks like: well-defined languages and tooling that make infrastructure part of the software development life cycle. GUIs for infrastructure configuration, which vendors often like to show off, should be banned because they don’t integrate well into a software life cycle, aren’t testable, and are error prone. Software Eats the World, One Revision at a Time There’s much more to being software defined than a few scripts and configuration files. Rather, it’s about making infrastructure part of your software development life cycle (SDLC). First, make sure your SDLC is fast but disciplined, and automated but quality oriented. Second, apply the same thinking to your software-defined infrastructure; or else you may end up with SDA, Software-Defined Armageddon. 1 A. Verma et al., “Large-Scale Cluster Management at Google with Borg,” Google, Inc., https://oreil.ly/uGbf5. 2 Amin Vahdat, “Pulling Back the Curtain on Google’s Network Infrastructure,” Google AI Blog, August 18, 2015, https://oreil.ly/JWczw. 3 After the introduction of the power loom in the UK in the early 1800s led to widespread unemployment and reduction in wages among loomers, they organized to destroy this new type of loom. 4 Paul Ford, “What Is Code?” BusinessWeek, June 11, 2015, https://oreil.ly/n2hmb. Chapter 15.
A4 Paper Doesn’t Stifle Creativity A Solid Platform Gives Developers a Blank Sheet of Paper Creativity knows no boundaries Today’s IT departments must meet two major but seemingly conflicting goals. On one hand, the business environment puts pressure on IT spend; on the other, digital disruptors require IT to increase the rate of change and innovation. One of IT’s major cost levers is harmonization of the IT landscape: reducing the number of different applications and technologies in use provides better economies of scale, better negotiating power with vendors, and fewer skills requirements, which can be a major factor in times of skill scarcity. At first sight, such an effort does seem at odds with innovation, though; how can a company be innovative if too many parameters are fixed? Doesn’t innovation require freedom to experiment and questioning established norms and standards? Interestingly, some harmonization not only doesn’t get in the way of innovation but actually boosts it. Following a recurring theme from this book, we can once again get a hint from the real world: paper. A4 Paper One of the most well-known standards—at least outside the US—is the standard for paper sizes. The most common size of paper used around the world for printing or writing is A4 size paper. A4 paper sets a precise standard of 210 mm wide × 297 mm long for a sheet of paper. At first glance, setting such a standard may appear both arbitrary and constraining. On a second look, though, it’s neither. The family of DIN A paper sizes,1 defined in 1922, is far from arbitrary. The ratio between length and width is always equal to the square root of 2. Thanks to this unique property, two sheets of a smaller size put next to each other along the long edge are the same size as a single sheet of paper of the next larger size. For example, two A4 papers make an A3 paper.
And if you are out of A5 paper, you can fold a sheet of A4 paper in the center and tear it to receive two perfectly sized sheets of A5 paper. Pretty handy, huh? But there’s more. If two A4-size sheets make an A3-size sheet, and two A3 sheets make an A2 sheet, and so on, 16 A4 sheets make up an A0 size sheet. But how big should such a sheet be? Easy: one square meter, again with the edge dimensions having a square root of 2 relationship, resulting in a size of 841 mm × 1189 mm. So, if you ever wonder whether three sheets of common “80 gram” paper require extra postage, you can quickly compute that each sheet weighs 1/16th of a square meter’s 80 grams: 80/16 = 5 grams, per the paper’s classification. For comparison, try calculating the weight of three letter-size sheets of #20 paper in ounces.2 On top of all this, standardizing paper sizes eliminates the need to select from myriad paper sizes. It also stacks neatly and allows the use of same-size sleeves, envelopes, drawers, paper punches, and copiers, so you don’t have to worry about any of those. A4-size paper is so ubiquitous that even my laptop is A4 size so that it will neatly fit into any briefcase that is designed to hold a sheet of paper. Importantly, despite being rather prescriptive, the paper standard doesn’t stifle creativity. You can still draw and write on it, whatever you prefer. I haven’t seen a person who was unable to work on a blank sheet of paper due to its particular dimensions. It’s fair to say that A4 paper actually increases creativity because it allows users to focus on the creative aspects—what they put on the paper, as opposed to dealing with paper format ecosystem entropy. So, when we are standardizing IT components, we should look for a result that resembles paper formats: standardize what simplifies life and achieves economies of scale, but give users a blank sheet of paper to work on.
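The arithmetic behind the A series is simple enough to verify in a few lines. The sketch below derives every A size from the two defining rules: A0 covers one square meter with a 1:√2 aspect ratio, and each smaller size halves the long edge (rounding down to whole millimeters, per the DIN convention):

```python
def a_size(n: int) -> tuple[int, int]:
    """Width and height of A<n> paper in millimeters."""
    # A0: one square meter with aspect ratio 1 : sqrt(2)
    w, h = round(1000 / 2 ** 0.25), round(1000 * 2 ** 0.25)  # 841 x 1189
    for _ in range(n):
        # Each smaller size halves the long edge, rounded down
        w, h = h // 2, w
    return w, h

def sheet_weight_g(n: int, grammage: float = 80.0) -> float:
    """Weight of one A<n> sheet, given the paper's g/m^2 rating."""
    return grammage / 2 ** n   # an A<n> sheet covers 1/2^n square meters

# a_size(4) -> (210, 297); sheet_weight_g(4) -> 5.0
```

Three sheets of 80-gram A4 paper thus weigh exactly 15 grams; the same calculation for letter-size #20 paper involves reams, pounds, and a 22 × 17 inch base sheet.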
Product Standards Restrict, Interface Standards Enable When IT departments are looking to harmonize their portfolio, they usually aim to standardize products (Chapter 32); for example, which databases or application servers should be used across applications. Standardizing products reduces diversity and can save money by bundling purchasing power, a classic economies of scale (Chapter 35) maneuver: the bigger a company’s spend on a particular product or vendor, the better a deal it can likely secure. However, unlike A4-size paper, such product standards do tend to limit developers’ choices and are therefore quite unpopular. The most successful technical standards in the world, in contrast, have been those that affect how products or components can be combined. We call such standards interface standards or compatibility standards. The most dramatic example of an interface standard that affects IT is the Hypertext Transfer Protocol (HTTP). HTTP enabled the internet revolution because it allowed any browser to connect to any web server, implemented in any programming language or technology. As a result, parts became easily interchangeable and enabled independent evolution. For example, anyone could develop a higher-performing web server without having to replace all browsers. Platform Standards There’s a useful and increasingly common approach that, done right, combines the benefits of interface and product standards to act more like A4-size paper than a corporate rule book. These standards are referred to as platform standards or simply platforms. Platform standards essentially split the IT landscape into two parts: a lower layer that standardizes those elements that are unlikely to form a competitive differentiator and an upper layer of in-house-developed software that provides direct business value and competitive differentiation.
This concept of platforms has long been known to the car industry, where multiple, outwardly distinct vehicle models share the same “infrastructure” of chassis, suspension, safety equipment, and engine options. Because these components require significant engineering effort and cost but are less visible to the end customer, it makes sense to reuse them across as many models as possible. Meanwhile, interior and exterior elements differ among models as they often serve as differentiating factors across market segments. For a nice model that allows plotting elements by how visible and how commoditized they are, I highly recommend Wardley maps.3 Back in IT, layering certainly isn’t a new idea. If anything, it’s one of the oldest concepts to reduce complexity and achieve reuse (Chapter 28). The best candidates for the lower layer are traditionally found in the networking and hardware environments. There are two reasons for this. First, for most enterprises, there’s little business value in different types of processor architectures, networking equipment, monitoring frameworks, or application servers. Second, their lower rate of change (Chapter 3) makes it easier to standardize them into a common base layer. Layers Versus Platforms So, if platforms use layering, which is a well-known concept, what makes platforms different and interesting? At least three aspects spring to mind:

Self-service
In traditional IT, the interaction between the layers occurred by means of service requests or emailing spreadsheets (Chapter 13), vaguely based on a model that the lower layers hold the power (that’s governance, after all!) and the folks in the upper layer have to beg for access. Modern platforms, epitomized by cloud service providers, turn this concept on its head by allowing people in the upper layers to request services through online portals or APIs. It’s customer centricity applied to IT services.
Dividing line
The dividing line between the IT layers used to be infrastructure versus applications, often even reflected in the organization’s structure, where you’d find an application team and an infrastructure team. Cloud computing platforms have shifted the boundary dramatically and keep shifting it. For example, serverless computing shifts the platform all the way up to the code for a single function.

Center of gravity
Modern platforms don’t just focus on the compute runtime, such as network, servers, and storage as previous approaches did. They also include software delivery tool chains because they are a key element that defines delivery velocity (Chapter 3). They also often include monitoring and communication, such as service meshes. As a result, they offer applications a well-rounded ecosystem of services.

Done well, standardizing lower layers doesn’t constrain what functionality can be delivered to the business. However, it relieves development teams from having to choose and operate a whole stack of software and hardware. It also channels developers’ creative energy into those parts that generate business value as opposed to developing yet-another-persistence-framework. Interestingly, this matches my favorite definition of software architecture: design decisions that keep implementors from exercising needless creativity (Chapter 8). At a major financial services provider, we defined an Agile Delivery Platform that was a lot more than just a private cloud runtime; it included an on-premises source code repository, a containerized build tool chain, common monitoring and visualization, and security features. It became the de facto platform for new application delivery and sped up adoption of modern development techniques. Digital Discipline Digital companies are great examples of high velocity necessitating discipline (Chapter 31). They realized that strictness in some aspects actually boosts the rate of innovation.
Often, this strictness comes in the form of an A4-style platform. For example, Google, which is well-known for rapid innovation, has very strict platform standards (Chapter 32) for application deployment and operations: there’s essentially one way to deploy an application, on one type of operating system, observed by one monitoring framework. Google found exactly the right level of abstraction to allow people to innovate where it matters without exercising needless creativity. Google is a great example of enforcing strict platform standards that nevertheless boost the speed of innovation. Avoid the Skipping Stones There’s a very silly TV show called Takeshi’s Castle, which makes contestants compete by enduring several rather sadistic exercises, much to the enjoyment of the audience. An all-time favorite is Skipping Stones, called “Dragon God Pond” in the Japanese original. The contestants are tasked with crossing a pond filled with a murky, rather uninviting liquid via a sequence of stepping stones. It wouldn’t be funny, though, if there weren’t a catch: most stones are solid, but some are merely floating pieces of Styrofoam. They are visually indistinguishable, but designed to quickly give way to any unlucky contestant’s misstep, resulting in spectacular falls, which are best watched in slow motion. Some platforms seem to make their customers play Takeshi’s Castle—their components appear solid, but some suddenly give way. IT platforms give way by deprecating components, having inconsistent interfaces, or being poorly integrated. Needless to say, this isn’t much fun for the contestant: you. So, don’t build platforms that look like the Skipping Stones! Instead, follow a few critical aspects to assure that your platform is solid, but flexible enough to spur adoption and innovation:

Choose a useful level of abstraction
Would standardizing pens and pencils still improve creativity or run the risk of stifling it?
Useful standards are those that shield significant complexity but can be utilized by a wide range of tools: you can draw on A4 paper with pen, pencil, chalk, watercolor, and more, so it's a concrete standard that allows many uses.

Constantly fine-tune

Nothing is eternal, especially in IT. The same holds true for IT standards. They need to be able to evolve along with the technology and new insights. Today's best innovation platform can be a roadblock in just a few years.

Keep it up to date

Although your customers may want your platform to be stable, they don't want it to be outdated or full of security holes for lack of patches. Keep your product versions up to date!

Make it real

Standards that just exist on paper are unlikely to be followed. Therefore, make sure your standards come alive in ready-to-use tools and platforms. Many people may not care for A4 paper, but if it's the easy choice available in any store, they probably don't mind.

Reward compliance

You want to reward people who adopt the standards; for example, by offering lower prices, better service, or shorter provisioning times when compared to nonstandard solutions.

Cloud providers don't set standards just on paper but provide an implementation that allows rapid delivery through self-service interfaces. Cloud platforms also continuously evolve and grow, making them excellent examples of solid platforms that enable innovation.

One of the critical decisions making the Agile Delivery Platform a success was to regularly update the platform, against common practice (and advice from the infrastructure teams). The traditional approach, which would have required all application owners' agreement before updating, would have made the platform outdated within just a few months.
After initially mostly offering virtual machines as infrastructure as a service (IaaS), most cloud providers now offer platform as a service (PaaS) for applications and functions as a service (FaaS)/"serverless" for single code snippets. Focusing on common (de facto) standards like Docker for containers has fueled the creation of platforms and boosted the rate of innovation among platform users.

One Size Might Not Fit All Tastes

As powerful as platforms and standards are, establishing global standards can be harder than expected. For example, despite all the virtues of A4-size paper, so-called "letter-size" paper at 8.5 × 11 inches remains a standard in the United States. Though Wikipedia describes its precise origin as "not known"4—the most credible hypothesis ascribes it to historical manufacturing by hand—a migration to DIN-sized paper appears unlikely. Until then, I'll have to use two different paper cartridges for my venerable HP LaserJet 4 and be frequently reminded to PC LOAD LETTER.5

1 DIN stands for Deutsches Institut für Normung, which is the current name of the German national institution that sets official domestic standards.
2 20 is the weight in pounds of a ream (500 sheets) of Bond paper, which measures 22 × 17 inches. Converting that to ounces per sheet is left as an exercise to the reader.
3 S. Wardley, "Wardley maps," Medium.com, March 7, 2018, https://oreil.ly/bk3sL.
4 Wikipedia, "Paper Size," https://oreil.ly/et7UH.
5 Wikipedia, "PC LOAD LETTER," https://oreil.ly/ou-b8.

Chapter 17. The IT World Is Flat

Without a Map, Any Road Looks Promising

Living in the Middle Kingdom—by Kwong Hing Yen (江慶人)

Maps have been valuable tools for millennia, despite most of them, especially world maps, being quite badly distorted. The fundamental challenge of plotting the surface of a sphere onto a flat sheet of paper forces maps to make compromises when depicting angles, sizes, and distances—if the earth were flat, things would be much easier.
For example, the historically popular Mercator projection provides true angles for seafarers, meaning you can read an angle off the map and use the same angle on the ship's compass (compensating for the discrepancy between geographic and magnetic north). The price to pay for this convenient property, which avoids distorting angles, is area distortion: the further away countries are from the equator, the larger they appear on the map. That's why Africa looks disproportionately small on such maps,1 a trade-off that might be acceptable when navigating by boat: misestimating the distance is likely a lesser problem than heading in the wrong direction.

Plotting the surface of a sphere also presents the challenge of deciding where the "middle" is. Most world maps conveniently position Europe in the center, supported by 0 degrees longitude (the prime meridian) going through Greenwich, England. This depiction results in Asia being in the "East" and the Americas being in the "West." The keen observer will quickly conclude that when living on a sphere, notions of West and East are somewhat relative to the viewpoint of the beholder.

The same type of thinking likely motivated the residents of East Asia to historically put their country in the middle of the map and even name it accordingly: 中國, the "middle kingdom." Although many centuries later we might regard such a worldview as a tad self-centered, at the time it simply made practical sense: having the most detail about places that are near you makes putting your starting point in the middle of your map natural. It also roughly lines up the map boundaries with your travel limits.

IT landscapes are also vast, and navigating a typical enterprise's range of products and technologies can be as daunting as sailing around Cape Horn. Despite some similarities, each IT landscape tends to be its own planet, making universal IT world maps hard to come by.
Aside from some useful attempts like the Big Data Landscape by Matt Turck, enterprise architects therefore often rely on maps provided by their vendors.

Vendors' Middle Kingdoms

As chief architect of a large company, you'll quickly gather new friends: account managers, (presales) solution architects, field CTOs, and sales executives, to name a few. Their job is to sell their products to large enterprises like yours that rely heavily on external hardware, software, and services. It makes sense to buy systems that aren't a competitive differentiator or to lease them via a software as a service (SaaS) model. Creating an accounting system yourself is in most cases as valuable as creating your own electricity. It's important to have such things, but they won't give you any competitive advantage. So just as you're unlikely to benefit from operating your own power plant, you should also abstain from building your own accounting system.

Enterprise vendors are also an important source of information, especially for architects, as vendors keep close track of industry trends. Do keep in mind, however, that the information you are given might be skewed by the vendor's worldview. That's because enterprise vendors live in their own middle kingdom, generally depicting their home state disproportionately large and accepting a fair degree of distortion on the periphery. Distortion can take the form of vendors defining product categories or buzzwords by features that only their product has. For example, I have seen "Zero Trust" pegged to safe web browsing and "GitOps" tied to Kubernetes. Both are a stretch of the imagination at best.

I often joke that if you had no concept whatsoever of what a car is and only ever talked to one specific German automaker, you'd walk away with the firm belief that a star emblem on the hood is a defining feature of an automobile.
IT architects in large enterprises must therefore develop their own, balanced worldview so that they can safely navigate the treacherous waters of enterprise architecture and IT transformation. Vendors' distortion doesn't imply deception; it's largely a byproduct of the context people grew up in. If you develop databases, it's natural to view the database as the center of any application: after all, that's where the data is stored. Server and storage hardware are viewed as parts of a database appliance, whereas application logic becomes a data feed. Conversely, to a storage hardware manufacturer, everything else is just "data," and databases are lumped into a generic "middleware" segment.

It's like me on my first trip to Australia considering a quick hop to New Zealand because I thought it was so super close. Realizing that it's still a good three-and-a-half-hour flight from Melbourne to Auckland proved that my world map is also distorted on the periphery.

Plotting Your World Map

To avoid falling into the "star on the hood" or the "it's all a database" trap, it's important that your architecture team first develops its own, undistorted map of the IT landscape—a great exercise for enterprise architects (Chapter 4). Luckily, the world of IT is flat, so it's a bit easier to plot on a whiteboard or a piece of paper. Your own map gives you a much better, product-neutral understanding and may, for example, illustrate that a car's drive train is much more relevant than the hood emblem. Any architect who carries a product name in their title likely carries the vendor's map rather than their own.

It's OK to draw the map piece by piece, starting, for example, where a new product needs to be set up or an old one replaced—rate of change (Chapter 3) is once again a good indicator for architecture. Another good starting point is where existing products represent critical differentiators for the company.
Drawing your map requires you to piece together information from various sources, which will often be distorted. Maybe one day we'll have an AI-driven application that can do this for enterprise architecture the same way smartphones can stitch together multiple photos into a panorama. Until then, you have to collect information from vendors, blogs, industry analyst reports, and your infrastructure and development teams. Resist the temptation to simply ask your favorite two- or three-lettered enterprise supplier to make the map for you. For one, it will once again be distorted, and second, at today's rate of innovation, many vendor maps are outdated as well.

When placing countries and territories on your map, focus on function and relationships as opposed to product names. Describing the architecture of a big data system as "Microsoft SQL Server" is no more useful than claiming the architecture of a house is "Ytong."2 Both may be good choices, but neither describes the architecture. Because IT architecture operates between the buzzwords and the product names, it's less concerned with the pieces than with how they are put together. This is why it's so important to look not only at the boxes but also the lines (Chapter 23).

Defining Borders

Where to place the "borders" in your map is a key aspect of doing architecture in the enterprise. Although we all like to think of boundary-free architecture, if we want to establish a meaningful map and vocabulary for our enterprise, we need to place some borders. For example, should our "data" continent be separated into data warehouses, data lakes, data marts, and databases? Would databases then break down into relational and NoSQL databases, which could further break down into graph databases, object stores, and so on? Would you want to distinguish managed cloud databases like DynamoDB or Spanner from other databases? Would you want to separate operational databases from those used for analytics?
There are many ways to slice, and defining these boundaries is a key element of doing architecture at an enterprise level. The term reference architecture even comes to mind, but remember that architecture isn't a copy-paste exercise. You need to define the continents and countries that are meaningful for your organization, your business strategy, and your business architecture, as illustrated in Figure 16-1.

Figure 16-1. A plausible database continent

A colleague of mine conducted a thorough mapping exercise for application monitoring that includes black-box monitoring, white-box monitoring, troubleshooting, log analysis, alerting, and predictive monitoring. All are distinct but interrelated aspects of an application monitoring solution. Many vendors, especially those with a history in application performance monitoring, will also include performance testing because that's the middle of their map. Whether you want to do the same or whether it's part of the development tool chain is your decision.

When I see most reference architectures, I feel that they ought to print a disclaimer at the bottom, similar to those used for movies: "Any similarity to real persons or systems is purely coincidental."

Charting Territory

As soon as your IT world map has undisputed borders, you can start populating "countries" with vendor products that may be in use or available on the market. The map will help you assess how well a vendor's product fits your map. Some products may not completely cover the gap, while others have significant overlap with solutions already in place. Placing products on an IT world map is a bit like playing Tetris: the piece that fits best depends on what you already have in place. This means that rather than picking the "best" product, you should select the one that fits best.

Most large IT organizations govern their product portfolio (Chapter 32) via a standards group.
Standards reduce product diversity and allow enterprises to harvest economies of scale; for example, by bundling purchasing power. When defining standards, the world map can be an enormous help because it can determine what kind of standards you'll want and at which level you'd want to apply them. For example, defining different types of databases or data stores on your "database continent" can tell you whether you need a different standard for relational databases and NoSQL databases or whether you distinguish lightweight use cases from mission-critical ones. Having a good map is essential to navigating the complexity of vendor offerings.

A vivid example of the difficulty of discussing product fit without a good map came up in a conversation about a web portal: a divisional IT manager using a shared web portal lamented the lack of documentation on port forwarding. The project's architect, assuming port forwarding is done in a web server, replied that a web server isn't part of their solution. Much debate and confusion ensued because the division implemented port forwarding in an integrated network management tool, not in a web server. They used different world maps and continued to talk past each other for some time.

Looking at the map to get the proverbial "lay of the land" can help a lot to resolve misunderstandings. For example, a map might show that port forwarding is part of the concept of an Application Delivery Controller (ADC), which manages web traffic by including functions such as reverse proxying, load balancing, and also port forwarding. You can utilize a web server as an ADC in simple cases or purchase an integrated product like F5.

Ironically, conducting the worthwhile exercise of plotting your own IT world map can be challenged by traditional IT managers as "academic." This can be especially amusing in Germany, where IT management is littered with PhDs (not necessarily in any technical major) who carry the title "Dr." as part of their legal name.
If pragmatic means "haphazard," I am happy to be in the "academic" camp: I am paid to think and plan, not to play product lottery.

Product Philosophy Compatibility Check

When plotting vendor offerings onto your map, it's not enough to understand the vendor's current product portfolio; you also need to know where it is headed—the world of IT never stands still. That's why I first like to understand whether a vendor's and our worldviews align. Meeting with vendors' senior technical staff, such as a CTO, is most effective when discussing worldview and comparing maps because too many "solution architects" are just glorified technical salespeople, who navigate purely off the vendor's map, the "middle kingdom" so to speak. I need a world map, though.

When an account manager starts the meeting with "please help us understand your environment," which roughly translates into "please tell me what I should sell to you," I typically preempt the exercise by asking the senior person about their product philosophy. It's a bit of a big word, but it's helpful in shifting the conversation to the vendor's world map. I prefer to ask vendors two key questions to understand their world map:

What base assumptions did you have to make?

No one can operate on a completely empty map without borders, so the vendor must have made choices and picked boundaries. The answer to this question tells you where the edge of their map is.

What's the toughest problem you had to solve?

The answer to this question will tell you where the center of their map is.

Discussing what base assumptions and decisions are baked into a product gives you great insight into a vendor's world map (see Figure 16-2), both about the center and the edge (remember the IT world is flat, so it has edges).

Figure 16-2. A product vendor's core and periphery

Naturally, this works only when talking to someone who is actually defining the vendor's corporate or product strategy.
Looking at the company leadership page can help you identify the right people. Looking at the leadership's history can also help you understand "where they come from"; that is, under which assumptions they operate.

When asking these questions of a monitoring vendor, it became clear that the core of their map was being able to monitor running applications without having access to the source code. This feature is particularly useful if you look at the problem from an operational point of view, especially if you work in an organization that separates "change" from "run" (Chapter 12). However, in a "you build it, you run it" environment where development teams are directly involved in operational aspects, this intellectual property would be less valuable. You can end up paying for something that you don't need. Understanding the vendor's world map can help you make better decisions.

Comparing world maps isn't about finding out which one is right and which one is wrong; it's about comparing worldviews. For example, I believe that a good programming language and a disciplined software development life cycle (SDLC) beat "easy" configuration (Chapter 11). That's because I come from a software engineering mindset. Other folks might be happy to not have to hassle with git stash and compilation errors and prefer the vendor's configuration tools.

Shifting Territory

While the real world is relatively static (continental drift is pretty slow, and the trend of splitting countries in the '90s has also slowed down a bit), the world of IT is changing faster than ever. Because it's difficult for a vendor to change its product philosophy, you will likely encounter old products with a new coat of paint on them. Your job as an architect is to look through the shiny new paint and see whether there's any rust or filler underneath.

1 "The True Size Of Africa," Information Is Beautiful, Oct. 14, 2010, https://oreil.ly/yeVps.
2 Ytong is the name of a popular brand of aerated concrete bricks used in Europe for building construction.

Chapter 17. Your Coffee Shop Doesn't Use Two-Phase Commit

Learn About Distributed System Design While in the Queue!

Grande, durable, nonatomic, soy chai latte

When designing solutions, architects often look at technical solutions like ACID (Atomic, Consistent, Isolated, Durable) transactions and binary values in order to craft a well-defined and perfect system. In reality, though, designing complex systems isn't that easy, so there's one more source of design guidance that you should consider: the real world.1

Hotto Cocoa o Kudasai

You know you're a geek when going to the coffee shop gets you thinking about interaction patterns between loosely coupled systems. This happened to me on a trip to Japan. Some of the more familiar sights in Tokyo are the numerous Starbucks coffee shops, especially in the areas of Shinjuku and Roppongi. After stretching my limited Japanese skills by muttering "Hotto Cocoa o Kudasai" ("A hot chocolate, please"), I returned to my bubble of foreigner-ness and started thinking about how Starbucks processes drink orders.

Starbucks, like most other businesses, is primarily interested in maximizing throughput of orders because more orders equal more revenue. Interestingly, the optimization for throughput results in a concurrent and asynchronous processing model: when you place your order, the cashier marks a coffee cup with the details of your order (e.g., tall, nonfat, soy, dry, extra hot latte with double shot) and places it into the queue, which is quite literally a queue of coffee cups lined up on top of the espresso machine. This queue decouples cashier and barista, allowing the cashier to keep taking orders even if the barista is momentarily backed up. If the store becomes busy, multiple baristas can be deployed in a competing-consumer scenario,2 meaning that they work off items in parallel without duplicating work.
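The cup queue and the competing baristas map directly onto a work queue with competing consumers. Here is a minimal Python sketch of that setup (my illustration, not code from the book; names like `cashier` and `barista` are purely descriptive):

```python
import queue
import threading

orders = queue.Queue()  # the row of labeled cups on top of the espresso machine
made = []               # completed drinks, tagged with the barista who made them

def cashier(order_list):
    # The cashier keeps taking orders regardless of how backed up the baristas are.
    for order in order_list:
        orders.put(order)

def barista(name):
    # Competing consumers: each barista grabs the next cup off the shared queue.
    # Queue.get removes the item, so no drink is ever made twice.
    while True:
        order = orders.get()
        if order is None:        # shutdown signal
            break
        made.append((name, order))
        orders.task_done()

baristas = [threading.Thread(target=barista, args=(f"barista-{i}",)) for i in range(2)]
for b in baristas:
    b.start()
cashier(["latte", "drip coffee", "mocha", "chai"])
orders.join()                    # wait until every real order is done
for _ in baristas:
    orders.put(None)             # one shutdown signal per barista
for b in baristas:
    b.join()
```

Adding a second barista requires no change to the cashier, which is exactly the decoupling the queue buys you.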
Asynchronous processing models can be highly scalable but are not without challenges. Still waiting for my hot chocolate, I started thinking about how Starbucks dealt with some of these issues. Maybe we can learn something from the coffee shop about designing successful asynchronous messaging solutions?

Correlation

Parallel and asynchronous processing means that drink orders aren't necessarily completed in the same order in which they were placed. This can happen for two reasons. First, order processing time varies by type of beverage: a blended smoothie takes more time to prepare than a basic drip coffee. A drip coffee ordered last might thus arrive first. Second, baristas might make multiple drinks in one batch to optimize processing time. Starbucks therefore has a correlation problem: drinks that are delivered out of sequence must be matched up to the correct customer. Starbucks solves the problem with the same "pattern" used in messaging architectures: a correlation identifier3 uniquely marks each message and is carried through the processing steps. In the US, most Starbucks use an explicit correlation identifier by writing your name on the cup at the time of ordering, calling it out when the drink is ready. Other countries might correlate by the type of drink. When I had difficulties in Japan understanding the baristas calling out the types of drinks, my solution was to order extra-large "venti" drinks because they're uncommon and therefore easily identifiable, that is, "correlatable."

Exception Handling

Exception handling in asynchronous messaging scenarios presents another challenge. What does the coffee shop do if you can't pay? They will toss the drink if it has already been made or otherwise pull your cup from the "queue." If they deliver you a drink that's incorrect or unsatisfactory, they will remake it. If the machine breaks down and they cannot make your drink, they will refund your money.
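Stepping back to the correlation identifier: attaching a unique ID to each request is the software equivalent of writing the name on the cup, and it lets out-of-order completions find their customer. A minimal sketch in Python (illustrative only; `place_order` and `deliver` are invented names, not an API from the book):

```python
import uuid

pending = {}  # correlation ID -> the customer's order, like names written on cups

def place_order(drink):
    # Generate a unique correlation identifier and attach it to the order.
    correlation_id = str(uuid.uuid4())
    pending[correlation_id] = drink
    return correlation_id

def deliver(correlation_id):
    # Drinks may finish in any order; the ID, not the position in line,
    # determines which customer receives which drink.
    return pending.pop(correlation_id)

id_latte = place_order("latte")        # ordered first
id_drip = place_order("drip coffee")   # ordered second

first = deliver(id_drip)   # the simpler drink finishes first, out of order
second = deliver(id_latte)
```

The key property is that delivery depends only on the identifier carried with the message, never on arrival order.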
Apparently, we can learn quite a bit about error-handling strategies by standing in the queue! Just like Starbucks, distributed systems often cannot rely on two-phase-commit semantics that guarantee consistent outcomes across multiple actions. They therefore employ the same error-handling strategies.

Write Off

The simplest error-handling strategy is doing nothing. If the error occurs during a single operation, you just ignore it. If the error happens during a sequence of related actions, you can ignore the error and continue with the subsequent steps, ignoring or discarding any work done so far. This is what the coffee shop would do when a customer is unable to pay: discard the drink and move on. Doing nothing about an error might seem like a bad plan at first, but in the reality of a business transaction, this option might be perfectly acceptable: if the loss is small, building an error correction solution is likely more expensive than just letting things be. When humans are involved, correcting errors also has a cost and might delay serving other customers. Moreover, error handling can lead to additional complexity—the last thing you want is an error-handling mechanism that has errors. So, in many cases "simple does it."

I worked for a number of ISPs that would choose to write off errors in the billing/provisioning cycle. As a result, a customer might end up with active service but would not get billed. The revenue loss was small enough that it didn't hurt the business, and customers rarely complained about getting free service. Periodically, they would run reconciliation reports to detect the "free" accounts and close them.

Retry

When simply ignoring an error won't do, you might want to retry the failing operation. This is a plausible option if there's a realistic chance that a renewed attempt will actually succeed; for example, because a temporary communications glitch has been fixed or an unavailable system has restarted.
Retrying can overcome intermittent errors, but it doesn't help if the operation violates a firm business rule. Starbucks will retry making your beverage if it's not to your liking, but they won't if the power is out. When encountering a failure in a group of operations (i.e., a "transaction"), things become simpler if all components are idempotent, meaning they can receive the same command multiple times without duplicating the execution. You can then simply reissue all operations because the receivers that already completed them will simply ignore the retried operation. Shifting some of the error-handling burden (i.e., detecting duplicate messages) to the receivers thus simplifies the overall interaction.

It's amazing how frequently a basic retry operation succeeds in systems that were built out of zeros and ones. The common saying that defines insanity as "doing the same thing over and over again and expecting different results" apparently doesn't apply to computer systems.

Compensating Action

The final option to put the system back into a consistent state after a failed operation is to undo the operations that were completed so far. Such "compensating actions" work well for monetary transactions that can recredit money that has been debited. If the coffee shop can't make the coffee to your satisfaction, it will refund your money to restore your wallet to its original state. Because real life is full of failures, compensating actions can take many forms, such as a business calling a customer to ask them to ignore a letter that has been sent or to return a package that was sent in error. The classic counterexample to compensating an action is sausage making. Some actions are not easily reversible.

Transactions

All of the strategies described so far differ from a two-phase commit that relies on separate prepare and execute phases.
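The idempotent receiver described above is what makes blind retries safe. One way to sketch it in Python (an illustration under my own naming, not code from the book): the receiver remembers the IDs of messages it has already handled and acknowledges, but does not re-execute, a retried message.

```python
class IdempotentReceiver:
    """Receiver that executes each uniquely identified command at most once."""

    def __init__(self):
        self.seen = set()      # message IDs already handled
        self.processed = []    # commands actually executed

    def handle(self, message_id, command):
        # A retried message carries the same ID, so it is detected as a
        # duplicate and acknowledged without being executed a second time.
        if message_id in self.seen:
            return "duplicate-ignored"
        self.seen.add(message_id)
        self.processed.append(command)
        return "processed"

receiver = IdempotentReceiver()
first = receiver.handle("msg-42", "make latte")
retry = receiver.handle("msg-42", "make latte")   # sender retried after a timeout
```

The sender can now reissue any message it is unsure about, which is far simpler than coordinating who completed what.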
In the Starbucks example, a two-phase commit would equate to waiting at the cashier desk with the receipt and the money on the table until the drink is finished. Once the drink is added to the items on the table, money, receipt, and drink can change hands in one swoop. Neither the cashier nor the customer would be able to leave until this "transaction" is completed. Using such a two-phase-commit approach would eliminate the need for additional error-handling strategies, but it would almost certainly hurt Starbucks's business because the number of customers it can serve within a set time interval would decrease dramatically. This is a good reminder that a two-phase-commit approach can make life a lot simpler, but it can also hurt the free flow of messages (and therefore the scalability) because it has to maintain stateful transaction resources across multiple, asynchronous actions. It's also an indication that a high-throughput system should be optimized for the happy path instead of burdening each transaction for the rare case when something goes wrong.

Backpressure

Despite working asynchronously, the coffee shop cannot scale infinitely. As the queue of labeled coffee cups gets longer and longer, Starbucks can temporarily reassign a cashier to work as a barista. This helps reduce the wait time for customers who have already placed an order while exerting backpressure on customers still waiting to place their order. No one likes waiting in line, but not yet having placed your order provides you with the option to leave the store and forgo the coffee or to wander to the next, not-very-far-away coffee shop.

Conversations

The coffee shop interaction is also a good example of a simple but common conversation pattern4 that illustrates sequences of message exchanges between participants.
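The backpressure idea described above, refusing new work when the system is saturated rather than queuing it without bound, can be sketched with a bounded queue (a Python illustration of mine; the rejection message is obviously made up):

```python
import queue

order_queue = queue.Queue(maxsize=3)  # only so many cups fit on the machine

def try_place_order(order):
    # A full queue pushes back on the caller instead of accepting unbounded
    # work; the customer can wait, leave, or walk to the next coffee shop.
    try:
        order_queue.put(order, block=False)
        return "accepted"
    except queue.Full:
        return "rejected: try the shop next door"

results = [try_place_order(f"drink-{i}") for i in range(5)]
```

Making the rejection explicit at the entry point keeps the wait time bounded for everyone already in the queue.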
The interaction between two parties (customer and coffee shop) consists of a short synchronous interaction (ordering and paying) and a longer, asynchronous interaction (making and receiving the drink). This type of conversation is quite common in purchasing scenarios. For example, when an order is placed on Amazon, the short synchronous interaction assigns an order number, whereas all subsequent steps (charging the credit card, packaging, shipping) are performed asynchronously. Customers are notified via email (asynchronous) when the additional steps complete. If anything goes wrong, Amazon usually compensates the customer (refunds the payment) or retries (resends the lost goods).

Canonical Data Model

A coffee shop can teach you even more about distributed system design. When Starbucks was relatively new, customers were both enamored and frustrated by the new language they had to learn just to order a coffee. Small coffees are now "tall," while a large one is called "venti." Defining your own language is not only a clever marketing strategy but also establishes a canonical data model5 that optimizes downstream processing. Any uncertainties (soy or nonfat?) are resolved right at the "user interface" by the cashier, thus avoiding a lengthy dialogue that would burden the barista.

Welcome to the Real World!

The real world is mostly asynchronous: our daily lives consist of many coordinated but asynchronous interactions, such as reading and replying to email, buying coffee, etc. This means that an asynchronous messaging architecture can often be a natural way to model these types of interactions. It also means that looking at daily life can help design successful messaging solutions. Domo arigato gozaimasu!6

1 This chapter was published (in slightly different form) in IEEE Software, Vol. 22, and Best Software Writing, ed. J. Spolsky (Apress).
2 Gregor Hohpe, "Competing Consumers," Enterprise Integration Patterns, https://oreil.ly/NShD-.
3 Gregor Hohpe, "Correlation Identifier," Enterprise Integration Patterns, https://oreil.ly/NkR28.
4 Gregor Hohpe, "Conversation Patterns," Enterprise Integration Patterns, https://oreil.ly/gwvQ.
5 Gregor Hohpe, "Canonical Data Model," Enterprise Integration Patterns, https://oreil.ly/8SU8U.
6 "Thank you very much!"

Part III. Communication

Architects don't live in isolation. It's their job to gather information from disparate departments, articulate a cohesive strategy, communicate decisions, and win supporters at all levels of the organization. Communication skills are therefore paramount for architects. Conveying technical content to a diverse audience is challenging, though, because many classical presentation or writing techniques don't work well for highly technical subjects. For example, slides with single words superimposed on dramatic photographs may draw the audience's attention, but they aren't going to convey the intricacies of your cloud computing platform strategy. Instead, architects need to focus on a communication style that emphasizes content, but in an engaging and approachable manner.

You Can't Manage What You Can't Understand

"You can't manage what you can't measure" is a common management slogan. However, for the measurements to be meaningful, you have to understand the dynamics of the system you are managing. Otherwise, you can't tell which levers you should pull to influence the system behavior (Chapter 10). Understanding what you are managing becomes an enormous challenge for decision makers in a world in which technology invades all parts of personal and professional lives. Even though business executives aren't expected to code a solution themselves, ignoring technological evolution and capabilities invariably leads to missed business opportunities or missed expectations when IT systems don't deliver what the business needs.
Managing complex technology projects by timeline, staffing, and budget considerations alone is no longer going to suffice in a digital world that demands ever-faster delivery of functionality at high quality (Chapter 40). Architects must help close the gap between technical knowledge holders and high-level decision makers by clearly communicating the ramifications of technical decisions for the business; for example, through development and operational cost, flexibility, or time to market. It’s not only the “business types” who face challenges in understanding complex technology, though. Even architects and developers cannot possibly keep up with all aspects of intricate technical solutions, forcing them to also rely on easy-to-understand but technically accurate descriptions of architectural decisions and their implications. Getting Attention Technical material can be very exciting, but ironically more so to the presenter than to the audience. Keeping attention through a lengthy presentation on code metrics or datacenter infrastructure can be taxing for even the most enthusiastic audience. Decision makers don’t just want to see the hard facts; they also want to be engaged and motivated to support your proposal. Architects therefore have to use both halves of their brain: not only to make the material logically coherent but also to craft an engaging story. Pushing (Less) Paper The technical decision papers published by my team in the past yielded much praise, but also unexpected criticism like, “All you architects do is produce paper.” You might want to preempt such criticism by reminding people that documentation provides value in numerous ways: Coherence Agreeing on and documenting design principles and decisions improves consistency of decision making and thus preserves the conceptual integrity of the system design. Validation Structured documentation can help identify gaps and inconsistencies in the design. Clarity of thought You can write only what you have understood. 
Education New team members become productive faster if they have access to good documentation. History Decisions (Chapter 8) are based on a specific context, which may have changed since. Documentation can help you understand that context. Stakeholder communication Architecture documentation can help steer a diverse audience to the same level of understanding. Nevertheless, there seems to be an unfounded resistance to writing documentation among development teams. If someone claims that writing their thoughts down is too much effort, I routinely challenge them that this is likely because they haven’t really understood things in the first place. Useful documentation doesn’t imply reams of paper; quite the opposite: short documents are more likely to be read. That’s why most technical documents that my teams write are subject to a five-page limit. Isn’t the Code the Documentation? Never shy of arguments, some developers claim that the source code is their documentation. So writing anything down is just duplication, right? They might have a point, as long as all audience groups have access to the code, the code is well structured, and tools such as search are available. Still, your source code is highly unlikely to explain your value proposition and your critical decisions to your executive sponsors. For that, you’re going to want to take the Architect Elevator (Chapter 1) up to the penthouse, equipped with a crystal-clear piece of documentation. Generating diagrams and documentation from code can be useful, but the resulting visuals often struggle to help people see the forest for the trees. Also, they don’t do a great job of explaining why things were done the way they are, because they generally fail to place the appropriate emphasis. Defining what is “interesting” or “noteworthy” luckily remains a human task. 
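To make the “generating diagrams from code” point concrete, here is a minimal, hypothetical sketch (the module names are invented for illustration) that extracts import relationships from Python source and emits a Graphviz DOT diagram. It also illustrates the limitation discussed above: the generated picture shows what depends on what, but never why.

```python
import ast

def module_dependencies(sources):
    """Extract import edges from a dict of {module_name: source_code}."""
    edges = []
    for name, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                # "import ledger" -> edge (name, "ledger")
                edges.extend((name, alias.name) for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                # "from notifications import email" -> edge (name, "notifications")
                edges.append((name, node.module))
    return edges

def to_dot(edges):
    """Render the dependency edges as a Graphviz DOT digraph."""
    lines = ["digraph deps {"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in sorted(set(edges))]
    lines.append("}")
    return "\n".join(lines)

# Invented example modules standing in for a real codebase.
sources = {
    "billing": "import ledger\nfrom notifications import email",
    "ledger": "import storage",
}
print(to_dot(module_dependencies(sources)))
```

Feeding the resulting DOT text to Graphviz yields a box-and-line diagram for free, but deciding which of those boxes are “interesting” remains the human task the text describes.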
Choosing the Right Words Technical writing is difficult, as evidenced by user manuals, which must rank as some of the most ridiculed pieces of literature, if we can even call them that. They might be surpassed in lack of empathy only by tax form instruction sheets. Architects must therefore be able to engage readers who wasted years of their career perusing poorly written manuals and who may never want to read anything technical again outside of the occasional Dilbert comic. Careful choice of words and clean sentence structures go a long way toward assisting readers in grasping difficult concepts. Communication Tools This part helps overcome some common challenges of creating engaging technical communication and highlights that documentation can be a tremendously useful tool for architects: Chapter 18, Explaining Stuff Helping management reason about complex technical topics requires you to build a careful ramp for the audience. Chapter 19, Show the Kids the Pirate Ship! Excite your audience by showing not just the building blocks but also the pirate ship. Chapter 20, Writing for Busy People Busy executives won’t read every line you write, so make it easy for them to navigate your documents. Chapter 21, Emphasis Over Completeness There’s always too much to tell. Focus on the essence. Chapter 22, Diagram-Driven Design Not only can a picture say more than a thousand words, but it can actually help you design better systems. Chapter 23, Drawing the Line Your architecture doesn’t just include a list of components, but also their relationships. You must draw a line. Chapter 24, Sketching Bank Robbers Technical staff might struggle to create a good picture of a system, even though they know it best. Help them by sketching bank robbers. Chapter 25, Software Is Collaboration Version control/continuous integration isn’t just for software development. It’s a key part of collaboration. Chapter 18. Explaining Stuff Build a Ramp for the Reader, Not a Cliff! 
Build a ramp, not a cliff for the reader—by Miu Tsutsui Martin Fowler occasionally introduces himself as a guy “who is good at explaining things.” Although this certainly has a touch of British Understatement™, it also highlights a critically important but rare skill in IT. Too often, technical people either produce an explanation at such a high level that it is almost meaningless or spew out reams of technical jargon with no apparent rhyme or reason. Build a Ramp, Not a Cliff A team of architects once presented a new hardware and software stack for high-performance computing to a management steering committee. The material covered everything from workload management down to storage hardware. It contrasted vertically integrated stacks like Hadoop and the Hadoop Distributed File System (HDFS) against standalone workload management solutions like Platform Load Sharing Facility (LSF). In one of the comparison slides, “POSIX compliance” jumped out as a selection criterion. While this may be entirely appropriate, how do you explain to someone who knows little about filesystems what this means, why it is important, and what the ramifications are? We often refer to learning curves as steep, meaning it is tough for newcomers to become familiar with, or “ramp up” on, a new system or tool. I tend to assume my executive audience is quite intelligent (you don’t get that high up simply by brown-nosing and playing politics), so they can in fact climb up a pretty steep learning ramp. What they cannot do is climb up a vertical cliff. Building a logical sequence that enables the audience to draw conclusions in an unfamiliar domain can be “steep” but doable. Being bombarded with out-of-context acronyms or technical jargon constitutes a “cliff.” “POSIX compliance” is a cliff for most people. 
You can turn it into a ramp by explaining that POSIX is a standard programming interface for file access, which is widely adhered to by Unix distributions, thus reducing lock-in in case you’re maintaining multiple Linux flavors. With this ramp, executives can reason that because they already standardized on a single Linux distribution, POSIX compliance doesn’t add much value. It’s also not relevant for vertically integrated systems like Hadoop, which include the filesystem. By building a ramp out of just a few words, you managed to involve someone who isn’t deeply technical in the decision-making process. The ramp might not take the audience into the depths of POSIX versions and Linux flavors, but it provides a mental model to reason within the scope of the proposed decision. A steep ramp is suitable for a quick climb but becomes tiresome if you are trying to lead your audience up Mount Everest. Therefore, consider how high (or deep) your audience needs to go to reason about what is presented. When defining terms, define them within the context of your problem, highlighting the relevant properties and omitting irrelevant detail. For example, details about POSIX history and Linux Standard Base aren’t pertinent to the decision above and should be omitted. Mind the Gap The ramp should not only provide a reasonable incline but also avoid gaps or jumps in logic. Experts often don’t perceive these gaps because their mind silently fills them in. This is a phenomenal feature of our brain, but an audience not intimately familiar with the topic is likely to stumble over even a minor gap and lose track of the line of reasoning. This effect is known as the curse of knowledge: once you know something, it’s very hard to imagine how someone else learns it. 
At a discussion about network security, a team of architects presented their requirement that servers located in the untrusted network zone have separate network interfaces, so-called NICs, for incoming and outgoing network traffic to avoid a direct network path from the internet to trusted systems. They continued with a statement that the vendor’s “three-NIC design” cannot meet their requirement. To me, this made no sense: why is a server with three network interfaces unable to support a design requiring two interfaces, one for incoming traffic and one for outgoing? The answer was “obvious” to those who were familiar with the context: each server uses one additional network interface for backup and another for management tasks, bringing the number of required ports to four, which clearly exceeds three. Skipping this detail created a gap large enough for the audience (and me) to stumble. How big a gap they are creating is difficult for the presenter to judge. That’s the curse of knowledge. In the example above, just a few words or two additional labeled lines in the diagram would have been enough to bridge the gap. That, however, doesn’t imply that the gap itself was small—it might have been narrow, but plenty deep. Presenting your line of reasoning to a person not familiar with the topic and asking them to “teach back” what you explained, similar to holding a pop quiz (Chapter 21), can be a great help in finding gaps. First, Create a Language When preparing technical conversations, I tend to use a two-step approach: first, I set out to establish a basic mental model based on plain vocabulary without product names or acronyms. Once equipped with this, the audience is able to reason in the problem space and to discern the relevance of parameters. This mental model doesn’t have to be anything formal; it merely needs to give the audience a way to make a connection between the different elements being described. 
In the aforementioned filesystem example, I would first describe how file access is composed of a layered stack, spanning from hardware (i.e., disks) and basic block storage (like a SAN) to filesystems and, ultimately, the operating system, which hosts the applications on top. This explanation doesn’t even occupy half a slide and would nicely fit into a picture of layered blocks (see Figure 18-1). As a second step, I can use this vocabulary to explain that Hadoop is integrated from the application layer all the way down to the local filesystem and disks, without any SAN or the like. This setup has specific advantages, such as low cost and data locality, but requires you to build applications for this particular framework. In contrast, standalone filesystems for high-performance computing, for example, GPFS or pNFS, either build on top of standard filesystems or provide “adapters” that make the proprietary filesystem available through widespread APIs, such as POSIX. Figure 18-1. Comparing filesystems You can depict this in a diagram by having the Hadoop “stack” reach all the way from top to bottom, whereas other systems provide “seams,” including POSIX compliance. The audience can now easily understand why the POSIX feature is important, but HDFS doesn’t need to provide it. Consistent Level of Detail Determining the appropriate level of detail to support the line of reasoning is difficult. For example, we pretended “POSIX” is a single thing when in reality there are many different versions and components, the Linux Standard Base, and so on. The ability to draw the line at roughly the right level of detail is an important skill of an architect. Many developers or IT specialists love to inundate their audience with irrelevant jargon. Others consider it all terribly obvious and leave giant gaps by omitting critical details. As so often, the middle ground is where you want to be. Drawing the line at the correct level of detail depends on you knowing your audience. 
If your audience is mixed, building a good ramp is all the more important because it allows you to catch up folks who aren’t familiar with the details without boring those who are. The highest form is building a ramp that audience members already familiar with the subject matter appreciate despite not having learned anything new. This is tough to achieve but is a noble goal to aim for. Building a steep but logical ramp allows those unfamiliar with the topic to get up to speed without boring those who are. Getting the level of detail “just right” is usually a crapshoot, even if you do know the audience. At least as important, though, is sticking to a consistent level of detail. If you describe high-level filesystems on slide one and then dive into bit encoding on magnetic disks on slide two, you are almost guaranteed to either bore or lose your audience. Therefore, strive to find a line that maintains cohesion for reasoning about the architectural decision at hand, without leaving too many “dangling” aspects. Algorithm-minded people would phrase this challenge as a graph partition problem: your topic consists of many elements that are logically connected, just like a graph of nodes connected by edges. Your task is to split the graph (i.e., to cover only a subset of the elements) while minimizing the number of edges (i.e., logical connections) being cut. I Wanted to Have Liked To, but Didn’t Dare Be Allowed This poor translation of Karl Valentin’s famous quote “Mögen hätt’ ich schon wollen, aber dürfen habe ich mich nicht getraut” reminds me of the biggest challenge in explaining technical matters: too many architects believe their audience will never “get” their explanations anyway. Some are also afraid that presenting technical detail will make them appear unfit for management. Therefore, even though they might have been able to, they shy away from attempting to present technical concepts to a senior audience. In my view, this is a missed opportunity. 
I see every interaction with management also as a teaching opportunity. It’s the basis for the Architect Elevator. Others go a step further and actually prefer to confuse management with random jargon, acronyms, and product names so that their “decisions” (often simply preferences or vendor recommendations) aren’t unnecessarily put into question by the audience. This usually happens when technical teams, which see approval meetings as a nuisance rather than an opportunity to gather feedback, play off management’s insecurity when it comes to technical topics. Every interaction with senior management is also a teaching opportunity. Use it! I have a rather critical view of such behavior and generally advise management not to approve anything that isn’t crystal clear to them. After all, if something isn’t easily comprehensible, it’s due to lack of clarity, not the audience. Your role as an architect is to build a broad understanding of the ramifications of decisions and assumptions that were made. Without it, big problems are bound to pop up. For example, if a few years down the road an IT system can no longer serve the business needs, it is often due to a constraint or an invalid assumption that was made but never clearly communicated. Communicating decisions and explaining trade-offs clearly protects both you and the business. Chapter 19. Show the Kids the Pirate Ship! Why the Whole Is Much More Than the Parts This is what people want to see When you look at the cover of a box of LEGOs you don’t see a picture of each individual brick that’s inside. Instead, you see the picture of an exciting, fully assembled model, such as a pirate ship. To make it even more exciting, the model isn’t sitting on a living room table but is positioned in a life-like pirate’s bay with cliffs and sharks—Captain Jack Sparrow would be jealous. What does this have to do with communicating system architecture and design? Sadly, not much, but it should! 
Technical communication too frequently does the opposite: it lists all the individual elements in painstaking detail but forgets to show the pirate ship. The results are tons of boxes (and hopefully some lines; see Chapter 23), without a clear gestalt or overall value proposition. Is this a fair comparison, though? LEGO is selling toys to kids, whereas architects need to explain the complex interplay between components to management and other professionals. Furthermore, IT professionals have to explain issues like network outages due to flooded network segments, something much less fun than playing pirates. I’d posit that the analogy holds and we can learn quite a few things from the pirate ship for the presentation of IT architecture. Grab Attention The initial purpose of the pirate ship is to draw attention among all the other competing toy boxes. While kids come to the toy store to hunt for new and shiny toys, many corporate meeting attendees are there because they were delegated by their boss, not because they want to hear your content. Grabbing their attention and getting them to put down their smartphones requires you to show something exciting. Sadly, many presentations start with a table of contents, which I consider rather silly. First, it isn’t exciting: it’s like a list of assembly instructions instead of the ship. Second, the purpose of a table of contents is to allow a reader to navigate a book or a magazine. If the audience must sit through the entire presentation anyhow, there is no point in giving them a table of contents at the beginning. Starting a presentation with a table of contents isn’t useful, because the audience doesn’t get to jump to Chapter 3. It also makes for a boring start: have you ever seen a movie that begins with the outline of its storyline? The old adage of “tell them what you are going to tell them,” which is vaguely attributed to Aristotle, certainly doesn’t translate into a slide showing a table of contents. 
You are going to tell them how to build a pirate ship! Build Excitement The moment children and your audience look at the pirate ship, they should feel excitement. How cool is this? There are sharks and pirates, daggers and cannons, chests of gold, and the parrot. You can feel the story unfold in your head just as you are reading the list of play pieces. Why should PaaS, API gateways, web application firewalls, and build pipelines tell a less exciting story? It’s a story of gaining speed in the treacherous waters of the digital world, where automated tests and build pipelines keep you safe despite the fast pace. Automated deployments industrialize your delivery, and PaaS allows your fleet to grow and shrink as needed while you’re trying to avoid running ashore in the vicious land of vendor lock-in. That’s at least as exciting as a pirate story! I am convinced that IT architecture can be much more exciting and interesting than people commonly believe. In an interview with my friend Yuji back in 2004, I explained that software development is quite a bit more exciting than it appears from the outside—it is as exciting as you make it. If you regard software development as a pile of LEGOs, you haven’t seen the pirate ship! People who find software and architecture boring or just a necessary tedium haven’t scratched the surface of software design and architecture thinking. They also haven’t understood that IT is no longer a means to an end but an innovation driver for the business. They consider IT as randomly stacking LEGO bricks, when in reality we are building exciting pirate ships! Focus on Purpose Coming back to the pirate ship, the box also clearly shows the purpose of the pieces inside. The purpose isn’t for the bricks to be randomly stacked together but to build a cohesive, balanced solution. The whole really is much more than the sum of the parts in this case. 
It’s the same with system design: a database and a few servers are nothing special, but a scale-out, masterless NoSQL database is quite exciting. Alas, the technical staff who had to put all the pieces together are prone to dwell on said pieces instead of drawing attention to the purpose of the solution they built. They feel that the audience should appreciate all the hard work that went into assembling the pieces as opposed to the usefulness of the complete solution. Here’s the bad news: no one is interested in how much work it took you; people want to see the results you achieved. Pirate Ship Leads to Better Decisions A pirate ship can do more than build excitement. It can also be a tool to make better decisions. My Architect Elevator workshops include an exercise to draw a system architecture. To see different ways of illustrating a common architecture, I picked a system that’s quite well understood by most attendees: an application monitoring system. I hand each group of attendees about a dozen cards, each of which contains a common monitoring component, such as a log aggregator, a time series database, thresholds, or alerting, and ask them to draw an architecture containing these pieces. Attendees will typically draw diagrams that put the components into a logical sequence; for example, by data flow, as demonstrated in Figure 19-1. Sometimes components are further grouped into major functions, such as data collection, data processing, and user interface. That’s what architecture diagrams normally look like. Figure 19-1. A typical architecture drawing of a monitoring system After looking at such diagrams, I ask an innocent-sounding question: “What’s this system’s purpose?” Initially, attendees mention detecting anomalies and alerting someone. After some contemplation and prodding, the architects start to see the bigger picture. They correctly identify the real purpose of a monitoring system as maximizing system availability by minimizing system downtime. 
This is easily validated by assuming the opposite: the only time you don’t need any monitoring is if you don’t care about system availability. Soon after, participants realize that the original picture shows only half the equation: a monitoring system is useful only if a detected problem can be analyzed and corrected. Based on this insight, they start to augment or redo the diagrams to show the pirate ship; that is, the main purpose, as depicted in Figure 19-2. Figure 19-2. Showing the pirate ship of a monitoring system They can now add the equivalent of the shark and the parrot by illustrating the purpose in the center of the diagram: minimize MTTR, the time from the error to recovery. As the MTTR spans the whole circle, we can think about both sides: how long does it take to detect an outage, and how long does it take to resolve it? Thanks to the completed model, this aspect is apparent, and we can better reason about whether the company should invest in an upgraded monitoring system. Investing in a monitoring system that reduces the time to detect outages from half an hour to a few minutes thanks to better sensors and smarter analytics may seem like a good idea. If resolving an outage takes several hours, though, the picture changes: spending, for instance, half a million dollars to reduce the MTTR from 4.5 hours to 4.1 hours doesn’t look that great anymore. Instead, you’d be looking to reduce the time spent resolving outages. This can be achieved, for example, by better transparency across systems or higher levels of automation (Chapter 13) that can quickly roll back the deployed software to an earlier, stable version. Drawing a better picture has helped us make better decisions (Chapter 22). The Product Box A successful concept similar to the pirate ship is the product box, one of Luke Hohmann’s “innovation games,” from his book of the same title.1 This game asks participants to design a physical retail box for their product. 
To be appealing to potential buyers, such a box should show common usages and highlight benefits instead of just features. Thinking of your product like a retail item can help you focus on tangible benefits instead of technical features. If teams do well, they’ll put an exciting pirate ship on the cover, as shown in Figure 19-3. Figure 19-3. A product box for cloud computing Designing the Pirate Ship Drawing a pirate ship is generally a new, and occasionally uncomfortable, exercise for product and engineering teams. A few techniques can overcome the initial hurdles. Show Context The LEGO box cover image shows the pirate ship within a useful context, such as a (fake) pirate’s bay. Likewise, the context in which an IT system is embedded is at least as relevant as the intricacies of the internal design. Hardly any system lives in isolation, and the interplay between systems is often more difficult to engineer than the innards of a single system. So you should show a system in its natural habitat. Many architecture methods begin with a system context diagram. While well intentioned, too often it fails to be useful because it aims for a complete system specification without placing an emphasis (Chapter 21). Such diagrams show an endless ocean, but not the pirate ship. The Content on the Inside LEGO toys also show the exact part count and their assembly, but they do so on a leaflet inside the box, not on the cover. Correspondingly, technical communication should display the pirate ship on the first page or slide and keep the description of the bricks and how to stack them together for the subsequent pages. Get your audience’s attention, then take them through the details. If you do it the other way around, they might all be asleep by the time the exciting part finally comes. Consider the Audience Just like LEGO has different product ranges for different age groups, not every IT audience is suitable for the pirate ship. 
To some levels of management that are far removed from technology, you may need to show the little duckie made from a handful of LEGO DUPLO bricks. Pack Some Pathos Some might feel that excitement is a bit too frivolous for a serious workplace discussion. That’s where you should look back at Aristotle, who gave us great advice on communicating, some 2,300 years ago (Figure 19-4). He concluded that a good argument is based on logos, facts and reasoning; ethos, trust and authority; and pathos, emotion! Most technical presentations deliver 90% logos, 9% ethos, and maybe 1% pathos. From that starting point, a small extra dose of pathos can go a long way. You just have to make sure that your content can match the picture presented on the cover: pitching a pirate ship and not having the cannons inside the box is bound to lead to disappointment. Figure 19-4. Three modes of persuasion Play Is Work While on the topic of toys: building pirate ships would be classified by most people as playing—something that is commonly seen as the opposite of work. Pulling another reference from the ’80s movie archives reminds us that “all work and no play makes Jack a dull boy.” Let’s hope that lack of play doesn’t have the same effect on IT architects as it had on the author Jack in the movie The Shining—he went insane and attempted to kill his family. But it certainly stifles learning and innovation. Most of what we know we didn’t learn from our school teachers, but from playing and experimenting. Sadly, most people seem to have forgotten how to play, or were told not to, when they entered their professional life. This happens due to social norms, pressure to always be (or appear) productive, and fear. Playing knows no fear and no judgment; that’s why it gives you an open mind for new things. Playing is learning, so in times of rapid change architects need to play more. 
If playing is learning, times of rapid change that require us to learn new technologies and adapt to new ways of working should re-emphasize the importance of playing. I actively encourage engineers and architects in my team to play. Interestingly, LEGO offers a successful method called Serious Play for executives to improve group problem solving. They might be building pirate ships. 1 Luke Hohmann, Innovation Games: Creating Breakthrough Products Through Collaborative Play (Boston: Addison-Wesley), 2007. Chapter 20. Writing for Busy People Don’t Expect Everyone to Read Word for Word If you don’t have time to read, look at the pictures Most organizations are full of boring documents that remain largely unread. That doesn’t mean that documentation is a bad idea. Done well, it’s still the best vehicle to proverbially get everyone on the same page across a wide audience. Over time, brief but accurate technical position and decision papers have become a trademark of my architecture teams. While the title of this chapter is a pun on the titles of popular books such as Japanese for Busy People, it is intentionally ambiguous: we are writing for a busy audience, and we are busy authors ourselves. Writing Scales Sadly, writing takes much more effort than reading, but skimping on writing is penny-wise and pound-foolish because the written word has enormous advantages over the spoken word or slide presentations: It scales You can address a large audience without gathering everyone in one room (podcasts, admittedly, can also accomplish that). It’s fast People read two to three times faster than they can listen. It’s searchable You can find what you want to read quickly. It can be edited and versioned Everybody sees the same, versioned content. So, writing pays off when you have a large (or important) enough audience. 
The biggest benefit, though, is Richard Guindon’s insight that “writing is nature’s way of letting you know how sloppy your thinking is.” That alone makes writing a worthwhile exercise because it requires you to sort out your thoughts so that you can put them into a somewhat cohesive storyline. Unlike most slide decks, well-written documents are also self-contained, so they can be widely distributed without further commentary. Quality Versus Impact The catch with writing is that although you can to some extent force people to (at least pretend to) listen to you, it’s much more difficult to force anyone to read your text. I remind writers that “the reader is by no means required to turn the page. They decide based on what they read so far.” Assuming the topic is interesting and relevant to the readership, I have repeatedly observed a nonlinear relationship between the quality of the writing and the attention it receives, which is a good proxy metric for the impact of a technical paper. If the paper doesn’t meet a minimum bar for quality—for example, because it is verbose, poorly structured, full of typos, or displayed in some ridiculous, difficult-to-read font—people won’t read it at all, resulting in zero impact. I call this the “trash-bin” zone, named after the likely reader reaction. At the other end of the spectrum, the additional impact from quality improvements ultimately tapers off as the document approaches the “gold-plating” zone. So, you want to get the quality of your writing into the “sweet spot” and then focus on content instead of polishing further. While the sweet spot depends on the topic and the audience, I posit that the trash-bin zone is wider, and therefore more dangerous, than most developers believe. 
Key influencers—your most important readers—are very busy people and tend to shy away from anything that is more than a few pages long, perhaps unless it is from a high-paid consultancy, in which case they make someone else read it because they paid so much money for it. A senior executive once refused to read a paper because his first name was misspelled on the cover page. I think he was right. For this impatient readership, clarity of wording and brevity aren’t nice-to-haves: a lack thereof will quickly put your paper quite literally into the trash-bin zone. Blatant typos or grammar issues are like the proverbial fly in the soup: the taste is arguably the same, but the customer is unlikely to come back for more. “In the Hand”—First Impressions Count When Bobby Woolf and I wrote Enterprise Integration Patterns, the publisher highlighted the importance of the “in the hand” moment, which occurs when a potential buyer picks the book from the shelf to give a quick glimpse at the front and back cover, maybe the table of contents, and to leaf through. The reader makes the purchasing decision at this very moment, not when they stumble on your ingenious conclusion on page 326. This is one reason why we included many diagrams in that book: almost all facing pages contain a graphical element, such as an icon (aka “Gregorgram”), a pattern sketch, a screenshot, or a UML diagram: roughly 350 in total. We wanted to send a strong message to potential readers that it isn’t an academic book, but a pragmatic and approachable one. Technical papers should do the same: use a clean layout, insert a handful of expressive diagrams, and, above all, keep it short and to the point! To assess what a short paper will “feel” like to the reader without wasting printer paper, I zoom out my WYSIWYG editor far enough that all pages appear on the screen, as illustrated in Figure 20-1.
I can’t read the text anymore, but I can see the headings, diagrams, and overall flow; for example, the length of paragraphs and sections. This is exactly how a reader will see it when flipping through your document to decide whether it’s worth reading. If they see an endless parade of bullet points, bulky paragraphs, or a giant mess, the paper will leave “the hand” quite quickly as gravity teleports it into the recycling bin. Figure 20-1. Zooming out from a technical paper The Curse of Writing: Linearity Text is linear: one word comes after the other, one paragraph after the previous. However, hardly any relevant technical topic is one-dimensional. One of the major challenges of technical writing (or speaking) is therefore to map a complex topic space into a linear storyline. For the algorithmically inclined, writing is a bit like coding a graph traversal problem: you can go breadth-first or depth-first. Breadth-first means that you cover all your topics at a high level, gradually descending down into the detail. Depth-first covers each topic in depth before moving on to the next topic. A well-thought-out logical structure can help overcome this limitation. It’s easier to traverse a tree than a complex graph with many loops. Barbara Minto captures the essence of this approach in her book The Pyramid Principle.1 The “pyramid” in this context denotes the hierarchy of content; that is, a tree, not the pyramids in IT (Chapter 28). A Good Paper Is Like the Movie Shrek Most animated movies have to entertain multiple audiences: the kids who love the cute characters plus the adults who had to shell out 30 bucks to take the family to the movies and spend two hours watching cute characters. Great animated movies like Shrek manage to address both audiences by including humor for kids and adults. The audiences might laugh at slightly different scenes but aren’t distracted by each other. Technical papers that address a diverse audience should aim to do the same. 
They need to supply technical detail while also highlighting important decisions and recommendations, so they can be read at two levels. A few simple techniques can help make reading your paper a little bit like watching Shrek: Storytelling headings These replace an executive summary: your reader should get the gist of the paper just by reading the headings. Headings like “introduction” or “conclusion” aren’t telling a story and have no place in a short paper. Anchor diagrams These provide a visual cue for important sections. Readers who flip through a paper likely pause at a diagram, so it’s good to position them strategically. Sidebars These are the short sections that are offset in a different font or color, indicating to the reader that this additional detail can be safely skipped without losing the train of thought. This way, executives can just read the headings and look at the diagram to get the essence of your paper in a minute or two (Figure 20-2). Most readers will read the paper but might skip the callouts, whereas specialists will pay particular attention to the detail in the callouts. This way, you can help break the curse of linearity a tiny bit by giving different readers different paths through the document. Figure 20-2. Breaking the curse of linearity Making It Easy for the Reader After a positive first impression, your readers will begin reading your paper. For advice on technical writing, I recommend the book Technical Writing and Professional Communication,2 which sadly appears out of print but is widely available used. It covers a lot of ground in its 700 pages, including authoring different types of documents, such as resumes. I find the sections toward the end on parallelism and paragraph structure most helpful. Parallelism demands that all entries in a list follow the same grammatical structure; for example, all start with a verb or an adjective. 
A counterexample would be the left column of the following, with the right-hand side showing a better approach:

System A is preferred because:      System A is preferred due to:
- It’s faster                       - Performance
- Flexible                          - Flexibility
- We want to reduce cost            - Economics
- Stable                            - Stability

Inconsistent writing uses too many of your reader’s brain cells just to parse the text instead of focusing on your message. Taking the “noise” out of the language reduces friction and allows your reader to focus on the content. Parallelism is not only useful in lists but also in sentences; for example, when drawing analogies or contrasting. Each paragraph should focus on a single topic and introduce that topic at the beginning, like this very paragraph: readers can glean from the first few words that this paragraph is about paragraphs. They can also rest assured that I don’t start talking about lists halfway through, so if they already know how to write a good paragraph, they can safely skip this one. That’s why “It is further important to note that in some circumstances one has to pay special attention to…” makes for a very poor paragraph opening. Lists, Sets, Null Pointers, and Symbol Tables Most programming languages support sets—i.e., unordered collections of elements—but books (and speeches) don’t: every list has an order. Because you can’t avoid it, you’d better choose the order consciously. Valid options are time (chronological), structure (relationships), or ranking (importance). Note that “alphabetical” and “serendipitous” aren’t valid choices. “How is this ordered?” has become a standard question I ask when reviewing documents containing a list or grouping. Loose usage of the word this as a stand-alone reference is another pet peeve of mine; for example, stating that “this is a problem” without being clear what “this” actually refers to.
Jeff Ullman cites such a “non-referential this” as one of the major impediments to clear writing, exemplified in his canonical example:3 If you turn the sproggle left, it will jam, and the glorp will not be able to move. This is why we foo the bar. Do we foo the bar because the glorp doesn’t move or because the sproggle jammed? Programmers well understand the dangers of dangling pointers and Null Pointer Exceptions, but they don’t seem to apply the same rigor to writing—maybe because your readers don’t throw a stack trace at you? Another fantastic piece of advice from Minto is the following: Making a statement to a reader that tells him something he doesn’t know will automatically raise a logical question in his mind […] the writer is now obliged to answer that question. The way to ensure total reader attention, therefore, is to refrain from raising any questions in the reader’s mind before you are ready to answer them. My translation for software engineers: when writing, assume that your readers use a single-pass compilation algorithm and don’t have access to a complete symbol table. This means that forward references aren’t allowed: you can only refer to terms and concepts that were already introduced. For the algorithmically minded, you’ll need to do a topological sort on your topic graph. What if there’s a circle? You’ll get a stack overflow, just like your audience! Following this simple advice will place your technical paper above 80% of the rest, because, sadly, the bar for technical documents is so low. Lastly, make sure to avoid unsubstantiated claims. An internal presentation once stated on the first slide: “only technology ABCD has proven to be a viable solution.” When I asked for proof, it turned out that none existed due to “lack of time and funding.” These aren’t just wording issues, but fatal flaws. A reader no longer wants to see page 2 if they cannot trust page 1.
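The topological-sort metaphor above can be made literal. As a purely illustrative sketch (the topic names and dependencies are invented for this example), Python’s standard graphlib module can linearize a topic graph into a valid reading order, and it reports the cycles that leave a document with no linear storyline at all:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical topic graph for a short paper: each topic maps to the
# topics that must already have been introduced (no forward references).
topics = {
    "problem statement": set(),
    "current architecture": {"problem statement"},
    "proposed design": {"problem statement", "current architecture"},
    "migration plan": {"proposed design"},
}

try:
    # static_order yields each topic only after all its prerequisites.
    order = list(TopologicalSorter(topics).static_order())
    print(" -> ".join(order))
    # problem statement -> current architecture -> proposed design -> migration plan
except CycleError:
    # A cycle means no linear reading order exists: break the loop
    # into separate sections before writing.
    print("circular dependency: restructure the topics first")
```

If two topics depend on each other, graphlib raises CycleError instead of looping forever, which is the writer’s cue to restructure rather than the audience’s cue to overflow.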
I refer to this phenomenon as the “hourglass presentation”: it starts with a lot of buzzwords and promises, then becomes very narrow, and ends with bold requests for funding and headcount. In der Kürze liegt die Würze4 In technical writing, your readers are not out to appreciate your literary creativity, but to understand what you are saying. Therefore, less is more when it comes to word count. Although Walker Royce5 spends a good part of his book musing about English words, his advice on brevity and editing is sound. His paraphrased citation from Zinsser6 on the usage of “I might add,” “It should be pointed out,” and “It is interesting to note,” hits the mark: If you might add, add it. If it should be pointed out, point it out. If it is interesting to note, make it interesting. Royce also gives many concrete suggestions on how to replace long-winded expressions or “big” words with single, simple words, thereby not only reducing noise but also aiding non-native speakers. If you are up to a more rigorous evaluation of properly linking words into sentences and you are willing to put up with a few tirades and snipes, I recommend Barzun’s Simple & Direct,7 which isn’t simple, but pedantically direct. Our team’s internal editing cycles routinely cut word count by 20 to 30% despite including additional material or detail. To the first-time author this might be shocking, but Saint-Exupéry’s adage that “perfection is achieved not when there is nothing more to add, but when nothing is left to take away” is especially true for technical papers (and good code for that matter). I actually edited this very chapter down by 15%. When this type of cruel editing was first bestowed upon me by a professional copy editor, I felt that the document no longer sounded “like me.” Over the years, I have come to appreciate that being crisp and accurate is a great way to have a technical paper sound like me. 
Longer, more personal pieces like this book allow some “slack” to help the reader keep attention after many pages. Unit Testing Technical Papers The most effective vehicle for improving technical papers is to hold a writer’s workshop.8 Such a workshop entails attendees discussing a paper, which they have read, while the author is allowed to listen but not to speak. This setup simulates someone reading and trying to understand a paper. The author must remain silent because they cannot pop out of their paper to explain to each reader what was really meant—a document must be self-contained. Because writer’s workshops are time-intensive, they are best applied after the paper has gone through an initial review. Technical Memos A document doesn’t need to be all-encompassing—who reads an encyclopedia, anyway? Twenty years ago, Ward Cunningham defined the notion of a technical memo, a document that describes a particular aspect of the system, in his Episodes pattern language:9 Maintain a series of well formatted technical memoranda addressing subjects not easily expressed in the program under development. Focus each memo on a single subject. […] Traditional, comprehensive design documentation […] rarely shines except in isolated spots. Elevate those spots in technical memos and forget about the rest. Keep in mind, though, that writing technical memos is more useful, but not necessarily easier, than producing reams of mediocre documentation. The classic example of this noble idea gone wrong is a project wiki full of random, mostly outdated, and incohesive documentation. This isn’t the tool’s fault (the wiki was not quite coincidentally also invented by Ward); rather, it’s due to a lack of emphasis over completeness (Chapter 21) by the writers. The Pen Is Mightier Than the Sword, but Not Mightier Than Corporate Politics Producing high-quality position papers can lead to an unexpected amount of organizational headwind.
The word perfection is invariably used with a negative connotation by those who are poor writers or want to avoid sharing their team’s work. Ironically, these are often the same departments that love to be entertained by colorful vendor presentations. Other teams claim that their “Agile” approach spares them from any need to produce documentation, notwithstanding the fact that those teams have no running code to show either. Agile software development places the emphasis on producing working code that is worth reading, but multiyear IT strategy plans are unlikely to manifest themselves in code alone. Alas, good documents seem to be even more difficult to find than good code. Some corporate denizens actively resent writing clear and self-contained documents because they prefer to “tune” their story for each audience. Naturally, this approach doesn’t scale (Chapter 30). Writing good documents in an organization that is generally poor at writing can give you significant visibility, but it can also rock the political system. The first time I sent a positioning paper on digital ecosystems to senior management, a person complained to both my boss and my boss’s boss about me not having “aligned” the paper with her. Communication is a mighty tool, and some people in your organization will fight hard to control it. Pick your targets wisely. 1 Barbara Minto, The Pyramid Principle: Logic in Writing and Thinking (Upper Saddle River, NJ: Prentice Hall, 2010). 2 Leslie A. Olsen and Thomas N. Huckin, Technical Writing and Professional Communication, 2nd ed. (New York: McGraw-Hill, 1991). 3 Jeff Ullman, “Viewpoint: Advising students for success,” Communications of the ACM 52, No. 3 (March 2009). 4 Literally, “brevity gives spice,” ironically translating into “short and sweet.” 5 Walker Royce, Eureka!: Discover and Enjoy the Hidden Power of the English Language (New York: Morgan James Publishing, 2011). 
6 William Zinsser, On Writing Well: The Classic Guide to Writing Nonfiction (New York: Harper, 2006). 7 Jacques Barzun, Simple & Direct (New York: Harper Perennial, 2001). 8 Richard P. Gabriel, Writers’ Workshops & the Work of Making Things: Patterns, Poetry… (Upper Saddle River, NJ: Pearson Education, 2002). 9 John Vlissides, James O. Coplien, and Norman L. Kerth, Pattern Languages of Program Design 2 (Reading, MA: Addison-Wesley, 1996). Chapter 21. Emphasis Over Completeness Show the Forest, Not the Trees Can you spot the performance bottleneck in this database schema? When sharing a diagram, you might receive feedback such as, “System ABC is missing.” Even though it’s well intentioned, completeness shouldn’t be your architecture diagrams’ primary goal. Rather, you should depict the appropriate scope. What’s the right scope? One that’s big enough to be meaningful, small enough to be comprehensible, and cohesive enough to make sense. In large organizations, there’s a constant danger of being overcome by the sheer size and complexity of the environment. So, putting some blinders on is allowed, and in fact encouraged. Diagrams Are Models When discussing architecture diagrams, it’s good to remind ourselves why we draw them in the first place. Architecture diagrams are models of reality (Chapter 22). The most common model of reality we use in daily life is a map: maps help us decide where to go and how to get there. To do so, maps select a specific scope and emphasis. For example, a Chicago street map that shows only half of downtown would be awkward. However, including all of Lake Michigan wouldn’t be very useful, just like adding Springfield at the same scale. A map designer chooses conscious boundaries and a conscious level of detail based on the map’s intended purpose. Models, whether maps or architecture diagrams, aren’t about being right or wrong. In fact, they’re all wrong (Chapter 6) because they aren’t reality. 
The opening paragraph of William Kent’s book Data and Reality1 aptly reminds us: “Rivers do not have dotted lines in them and freeways are not painted red.” Instead of trying to make models right, you should think about whether your models are useful. To answer that question, though, you need to first know what the model’s use, or purpose, is. For a model to be useful, it needs to help you answer a question or make a better decision. Otherwise, your diagram is just art, and having looked at thousands of architecture diagrams, my impression is that most architects aren’t particularly gifted artists. So, before setting out to draw a specific diagram or design a presentation slide, you must first decide which questions you are looking to answer. A broad “lay of the land” might be needed to build your world map (Chapter 16), but it isn’t very useful as an architecture diagram. Think of it this way: a travel bureau will show you beaches and palm trees, not a map of the whole continent. All models are wrong, but some are useful. To know which ones, you must first know which question you’re trying to answer. When deciding on a diagram’s scope and boundaries, I am not always able to do so a priori. Sometimes, I need to have the diagram in front of my eyes to decide whether I prefer to split it into two. I therefore almost always work iteratively. The Five-Second Test Architecture diagrams or slides are designed to get a specific point across and therefore must place a clear emphasis. This distinguishes them from reference books or manuals, which aim to be comprehensive. Still, too many slides I see try to give the big picture as the best possible approximation of reality without knowing which part is actually worth looking at. When faced with overly “noisy” slides, I tend to apply a strict but useful five-second rule, which isn’t related to food safety:2 I show the audience a slide for a mere five seconds and ask them to describe what they saw. 
In most cases, the responses boil down to a few words from the headline and statements like “two yellow boxes and one blue barrel below.” If you are aiming to convey a shared database pattern, you likely succeeded, but most authors will be disappointed to hear such a dramatic simplification of their precious content. Slides that don’t pass this test are likely to confuse the audience when first shown: viewers’ eyes will chase across the visuals, trying to discern what’s important and what’s the meaning of it all. During that time your audience isn’t listening to you explaining the content because they’re busy with the visuals. Of course, you will show the actual slide for more than five seconds, but first impressions count—for every slide you show. A useful presentation technique is to verbally introduce the concept of the next slide before actually showing it. The audience is more likely to listen because they aren’t distracted by the new visual and you’re building up a bit of suspense. Naturally, this requires you to know which slide comes next rather than use the slides as a reminder of what to talk about. Some organizations create slides that try to act like documents, meaning they are also meant to be read as handouts. The resulting slideument, a term coined by Garr Reynolds,3 is rarely a useful presentation and never passes the five-second test because there’s way too much content on a slide. Sadly, most of them don’t make a meaningful document either because they usually lack a clear structure and storyline. Interestingly, Martin Fowler realized that there’s a use case for documents created in a presentation tool, for which he’s coined the term Infodecks.4 Nancy Duarte shows a similar approach with SlideDocs.5 Both can be a useful communication medium when being read as opposed to being projected. A Pop Quiz I participate in many architecture reviews and decision boards.
While such boards often exist due to an undesirable separation of decision makers and knowledge holders (Chapter 1), many large enterprises depend on them to harmonize their technical landscape and to gain an overview across many functional silos. The topics for these meetings can be fairly technical in nature, making me skeptical whether the audience is truly following along. A presentation pop quiz consists of blanking the presentation and having a member of the audience explain what they saw and understood. It’s a test for the presenter, not the audience. To test whether the decision body understands what they are deciding, I inject a pop quiz6 into the presentation by telling the presenter to pause and blank the slide (hitting “B” will do this in PowerPoint) and asking who among the audience would like to recap what was said up to this point in their own words. Sadly, this exercise is more likely to trigger nervous laughter, frantic staring at the floor, and sudden checking of incoming emails as opposed to a good summary. As a result, I might ask the presenter to briefly recap the key points for everyone’s benefit. It’s also useful to highlight to the audience that this is a test for the presenter, not for them. Simple Language I don’t exclude myself from the pop quiz. When replaying what the speaker said, I often intentionally use very simple language to make sure I really capture the essence. In a presentation about network security architecture in the untrusted network zone, after watching a handful of rather busy slides, I summarized the speaker’s statement as follows: “What worries you is the black line going all the way from top to bottom?” His resounding “yes” confirmed both that I had correctly summarized the issue and that the presenter took away an insight into how to better communicate this very aspect. 
This technique might seem overly simplistic at first, but it validates that there is a solid connection between the model being presented (such as vertical lines depicting legal network paths from the internet to the trusted network) and the problem statement (direct paths pose a security risk). Removing all noise and reducing the statement down to the “black line” sharpens the message. Diagramming Basics If I had to name the number-one enemy of useful architecture diagrams, it would likely be Visio’s default 10-point font size and skimpy line width, augmented by poor user judgment regarding component placement. It’s really in the same league as PowerPoint’s autosize feature that gives people an endless supply of bullet points to neutralize their audience. True, the tool isn’t solely responsible, but Visio’s default settings, which are tuned for detailed engineering schematics, lure the user into creating visuals that are unsuitable for projecting something evocative on the wall. My advice for creating diagrams that can convey a clear message without dumbing down the content therefore starts with the following basic techniques. Avoid the Ant Font Text that isn’t readable isn’t adding value, so avoid ant fonts7 unless you consider “I know you can’t read this” an engaging introduction to a slide. Using sans-serif fonts of decent size and good color contrast will be appreciated in any presentation. I can’t count the number of times I see slides that contain tiny fonts but consist of 50% empty space that could have been used for larger boxes and larger fonts, similar to Figure 21-1. Architecture diagrams aren’t the place for minimalism—go bold. Figure 21-1. Use the available space to make text easily readable Most tools allow you to set defaults for line width and font sizes. Use them. Also, periodically zoom down the diagram on your screen to 25% to see what’s still readable.
Maximize the Signal-to-Noise Ratio Differences in elements that don’t have meaning are nothing but distractions. Therefore, reduce visual noise; for example, by properly aligning elements and using a consistent form and shape (see Figure 21-2). It’s also good to be careful with too much decoration, such as rounded corners, shadows, and so on—they can distract from the core message you’re trying to convey. If things look different, make sure that this expresses meaning, as detailed in Chapter 23. Figure 21-2. Make same things look the same Great advice on placement, visual layout, and emphasis can be found in Nancy Duarte’s book slide:ology.8 Let Arrows Point One of my most frequent maneuvers in presentation tools is to increase the size of arrowheads. If you use directed arrows to express semantics (Chapter 23), you’re going to want them to be easily recognizable, as depicted in Figure 21-3. If direction isn’t critical to understanding the diagram, omit the arrowheads to reduce noise. Figure 21-3. If direction is important, make the arrowhead big enough to see If your tool won’t cooperate, place a triangle over the line; never let the tool be an excuse for poor diagrams. It’s like the cook coming out of the kitchen and telling you that your meal isn’t tasty because the farmer didn’t grow tasty tomatoes. You’re unlikely to be extraordinarily sympathetic. Legends Are Crutches Although they’re a standard feature in scientific circles and charts exported from Excel, a visual legend requires a viewer to correlate patterns or colors in a diagram with explanatory labels below or next to it. Having the label where the data is located is much easier to digest, as shown in Figure 21-4. Figure 21-4. Label your data as opposed to making the reader read a legend Therefore, use legends only when absolutely unavoidable. Most of the time you can remove clutter or increase the size of boxes to put the labels where they belong.
I have redrawn stacked bar graphs exported from Excel to have better control over sizing and labeling. The investment of a mere five minutes saved a room full of executives much time and effort in reading the data. Layer Visually As we already learned, a good document reads like watching the movie Shrek (Chapter 20). The same is true for diagrams, which might be required to illustrate complex interrelationships that lead to a particular system behavior (Chapter 10). They should nevertheless pass the five-second test by having a clear high-level structure that is visible first, augmented by additional detail that reveals itself later and doesn’t interfere with the big picture. Figure 21-5 first reveals that the system consists of two identical zones, after which you can “zoom in” to see how each zone is internally composed. Figure 21-5. Give your diagrams a clear high-level structure I occasionally use a build slide or incremental reveal for this purpose, acknowledging that there are strong opinions for and against such visual effects. I find build slides to work well as they give the viewer time to understand each element before adding the next batch. For this to work, let the new elements simply appear, avoiding any temptation to select one of those amazing spiral-twist-fade-rotate reveals. If you layer a diagram perfectly, the visuals will reveal themselves incrementally. If you can’t quite get there every time, incremental build slides are a reasonable substitute. The Style of Elements Most architects will develop their own visual style over time and can use it as a valuable branding tool. Many of my technical papers and diagrams are easily recognizable—for example, when sitting on someone’s desk—thanks to a consistent set of colors and a bold, almost cartoon-like style that favors large lettering over subtle aesthetics. My diagrams virtually always have lines (Chapter 23), but I keep the lines’ semantics to two or at most three concepts.
Each type of relationship that I depict with lines should be intuitive. For example, I might depict a data flow with broad, gray arrows, whereas control flow is shown in thin, black lines, as illustrated in Figure 21-6, which depicts the Control Bus pattern from Enterprise Integration Patterns.9 Figure 21-6. The Control Bus pattern illustrates line semantics The line width suggests that a large amount of data flows through the system’s data flow while the control flow is much smaller but significant. The best visual style, borrowed from advice on writing, is the one “that keeps solely in view the thought one wants to convey.”10 Making a Statement When preparing a slide or a document paragraph, the title sets the tone for a clear and focused statement. For most circumstances, I prefer titles that are full sentences because the title alone tells the essence of the story. Using this approach also assures that each slide or paragraph focuses on a single main statement. I make an exception for keynote presentations to a large and diverse audience for which I use titles consisting of single words or short phrases like the Architect Elevator (Chapter 1). Such short titles mesh well with simple visuals that are truly a visual aid to me, the speaker, to draw the audience’s attention and help them memorize the content via a visual metaphor. For technical presentations that are prepared for a review or decision-making session, however, I prefer clear statements, with which one can either agree or disagree. These statements are much better represented as full sentences, akin to the storytelling headings (Chapter 20) in documents for busy people.
In such cases, “Stateless services and automation support elastic scale-out” is a better title than “Server Architecture.” What you certainly want to avoid are verbose phrases or crippled sentences that confuse the reader but don’t make any form of statement: “Server infrastructure and application architecture overview diagram (abstracted for simplicity’s sake).” Trust me, I’ve seen even worse. Twenty Slides, One Story When structuring presentations, I realize that too many technical presentations tell one story per slide. While it’s good to focus on one message per slide, the sequence of messages needs to form a cohesive story, as demonstrated in the bottom half of Figure 21-7. Interestingly, you can test this easily using PowerPoint’s Outline View, which shows all slide headings in a sidebar. Figure 21-7. Telling one story supported by slides aids flow and saves time Creating this cohesion not only yields a single storyline with a more logical flow, it also drastically shortens the time needed to present. If each slide tells a new story, the speaker will easily spend half a minute looking at and introducing each slide. Multiply this by the typical 20 to 30 slides in a presentation, and you’ll find that having a connected storyline can save you up to 15 minutes. So, when someone worries that they don’t have enough time for their content, I advise them to make sure that they have a single story. For a good collection of slide decks that tell a story, I recommend a visit to https://speakerdeck.com. Nothing Is Confusing in and of Itself The closing advice I give to anyone creating documents and visuals that need to explain complex topics is the following: things might be complicated, but whether it’s confusing, that’s up to you. 1 William Kent, Data and Reality: A Timeless Perspective on Perceiving and Managing Information in Our Imprecise World, 3rd ed. (Westfield, NJ: Technics Publications, LLC, 2012). 2 Wikipedia, “Five-Second Rule,” https://oreil.ly/1Z397.
3 Garr Reynolds, “Slideuments and the Catch-22 for Conference Speakers,” Presentation Zen (blog), April 5, 2006, https://oreil.ly/yw45r. 4 Martin Fowler, “Infodeck,” MartinFowler.com, Nov. 16, 2012, https://oreil.ly/yvgTq. 5 Nancy Duarte, “PowerPoint Presentations vs. Slidedocs,” Duarte.com, https://oreil.ly/MjKny. 6 A pop quiz is a short test given by a teacher in class without prior announcement. It goes without saying that this is fairly unpopular with students. 7 Neal Ford, Matthew McCullough, and Nathaniel Schutta, Presentation Patterns: Techniques for Crafting Better Presentations (Boston: Addison-Wesley Professional, 2012). 8 Nancy Duarte, slide:ology: The Art and Science of Creating Great Presentations (Sebastopol, CA: O’Reilly Media, 2008). 9 Hohpe and Woolf, Enterprise Integration Patterns. 10 Barzun, Simple & Direct.

Chapter 22. Diagram-Driven Design

Cheating in a Picture Is Much Harder Than Cheating in Words

Designing with Diagrams

Some years ago, the Crested Butte Enterprise Architecture Summit once again proved that sticking a bunch of geeks in a remote town can lead to creative results. In our case, the result was an A-to-Z list of 26 new development strategies, starting from activity-driven development (ADD) and ending on zero-defect development (ZDD). Domain-driven design (DDD) was dedicated to Eric Evans’s fantastic book Domain-Driven Design.1 However, another “DDD” sprang to mind: diagram-driven design, and it turned out that there’s actually a serious idea behind the fun exercise.

Presentation Skills: More Than a Wide Stance

While working for Google in Japan, I created and taught a class on presentation skills for engineers, which included some common ideas of using strong, impactful visuals inspired by books like Presentation Zen.2 Following my own advice equipped me with high-resolution graphics of confident managers, fuel gauges indicating that your mileage may indeed vary, shoes that apparently do not fit all, and so on.
Still, however impactful fancy graphics may be, for most technical presentations a wide stance, deep voice, and Steve Jobs–like hand gestures (turtleneck optional) are unlikely to teach the audience how a multicloud strategy increases your system architecture complexity. Instead, you need “meat”: what design alternatives did the team have? How do they differ? What design principles made you choose one over the other? What are the main building blocks of the systems and how do they interact (Chapter 23)? How did you track down that performance bottleneck and what did you learn from it? When Garr Reynolds, the author of Presentation Zen, came to Google to talk about his book, he acknowledged that technical discussions often require detailed diagrams or even snippets of source code. He suggested providing those as a handout instead of including them in the presentation, to make it easier for the audience to read and digest them. Still, most technical presentations I see do contain source code or diagrams to explain technical concepts in detail, so we’d better figure out how to do so effectively. Ed Tufte already ran bullet points through the grinder by blaming them for the inaction that led to the Space Shuttle Columbia disaster3 upon re-entry (and he might not be wrong, judging from the slides they put together). “Death by PowerPoint” was immortalized by a Dilbert comic strip as early as 2000. You can’t fit a lot of source code on a slide either, especially if you are using a verbose language with checked exceptions. That leaves you with diagrams as your main communication vehicle for technical concepts.

Diagramming as Design Technique

Back at Crested Butte, we looked at our list and pondered whether some of our concoctions actually had meaning.
Interestingly, as we discussed how to draw meaningful diagrams in a later session, I highlighted the importance of a consistent visual vocabulary, which would omit unnecessary details but highlight the essence of the design decisions (Chapter 8). During this discussion, we realized that to draw a good picture, you need to have a decent design in the first place. If reality is completely convoluted, it’s hard to depict order in retrospect. Taking this thought a step further, we realized that good diagramming contributes to good system design in general. Diagram-driven design had become a reality! When talking about diagram-driven design, I don’t imply that we’d generate code from UML diagrams. I am pretty firmly rooted in Martin Fowler’s UML as Sketch4 camp, meaning UML is a picture to aid human comprehension, not a programming language or specification. If people don’t quite agree, I refer to Grady Booch, who as co-creator of the UML remarked that “The UML was never intended to be a programming language.”5 Instead, I am talking about a picture that conveys important concepts—the proverbial big picture that does not get caught up in irrelevant details.

Designing with Diagrams

An excellent example of designing with diagrams is the book Enterprise Integration Patterns, coauthored by Bobby Woolf and myself in 2003. The book defines a pattern language for designing asynchronous messaging solutions, which is represented both in text form and as a set of icons. The consistent visual style and the simple composition model of messaging solutions allow the visual language to become a design tool, as shown in the illustration from the book (Figure 22-1).

Figure 22-1. Designing with Enterprise Integration Patterns

The resulting diagrams aren’t just illustrations but also help validate the design. For example, the visual language reminds you that for every splitting or distribution of messages, you need to aggregate them back later.
It also validates a logical grouping of the elements. One great historical example of diagram-driven design is the graphical train schedules created to plot trains’ paths along two axes by distance (vertical) and time (horizontal). The faster a train moves, the steeper the line will drop. The lines on such charts intersect where trains running in opposing directions pass each other (see Figure 22-2). On a single-track railroad, you’d want to make sure that these occur at a station where there are two tracks and platforms. Having the train schedules laid out visually is a great design aid. Well-known examples of these maps go back to Étienne-Jules Marey and his book La Méthode Graphique (1878). They are also featured prominently in Tufte’s The Visual Display of Quantitative Information,6 perhaps the standard text on charting and diagramming.

Figure 22-2. Visually designing train schedules

Diagram-Driven Design Techniques

Once you embrace diagramming as a design technique, you’ll find several connections between good visual design and good system design, which we examine in the following sections.

Establish a Visual Vocabulary and Viewpoints

Good diagrams use a consistent visual language. A box means something (for example, a component, a class, a process), a solid line something else (maybe a build dependency, data flow, or an HTTP request), and a dashed line means something else yet. No, you don’t need a Meta-Object Facility and correctness-proven semantics, but you need to have an idea of what element or relationship you are depicting, and how. Picking this visual vocabulary is important to define the architectural viewpoint you are going to concern yourself with, such as source code dependencies, runtime dependencies, call trees, or allocation of processes to machines. Good design is often tied to the ability to think in abstractions. Diagrams are visual abstractions and can be instrumental in this process.
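A consistent visual vocabulary can even be written down and enforced. As a minimal sketch (not from the book; the component names and style choices are my own assumptions), the following generates Graphviz DOT for one viewpoint—runtime dependencies—where each visual feature carries exactly one meaning: a box is a component, a solid edge a synchronous call, a dashed edge an asynchronous message.

```python
# Sketch: a tiny, consistent visual vocabulary encoded as Graphviz DOT.
# One viewpoint (runtime dependencies); one meaning per visual feature.
# The component names ("web", "orders", "billing") are hypothetical.

STYLES = {"sync": "solid", "async": "dashed"}  # each line style has one meaning

def to_dot(components, relations):
    lines = ["digraph runtime_view {", "  node [shape=box];"]  # box = component
    for c in components:
        lines.append(f'  "{c}";')
    for src, dst, kind in relations:
        lines.append(f'  "{src}" -> "{dst}" [style={STYLES[kind]}];')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(
    ["web", "orders", "billing"],
    [("web", "orders", "sync"), ("orders", "billing", "async")],
)
print(dot)
```

Because the vocabulary lives in one place (the `STYLES` table), the diagram cannot quietly drift into using a dashed line for something new—any variation without a defined meaning simply cannot be drawn.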
Limit the Levels of Abstraction

One of the most frequent problems I encounter in technical documents is a wild mix of different levels of abstraction (the same problem can be found in source code). For example, the way configuration data affects a system’s behavior can be described like this:

The system configuration is stored in an XML file, whose “timetravel” entry can be set to either true or false. The file is read from the local filesystem or alternatively from the network, but then you need NFS access or to have Samba installed. It uses a SAX parser to avoid building the whole DOM tree in memory. The “Config” class, which reads these settings, is a singleton because…

In these few sentences you learn about the file format, project design decisions, implementation detail, performance optimizations, and more. It’s rather unlikely that a single reader is actually interested in this smörgåsbord of facts. Now try to draw a picture of this paragraph! It will be nearly impossible to get all of these concepts onto a single sheet of paper. Drawing a diagram thus forces us to clean up our thinking by considering one level of abstraction at a time. While drawing a picture doesn’t automagically make the problem of mixing abstractions disappear, it puts it in your face much more bluntly than a meandering chain of prose, which from afar might not look all that bad. A well-known German proverb proclaims that Papier ist geduldig (“paper is patient”), meaning paper is unlikely to object to what garbage you scribble on it. Diagrams are a little less patient. If you do compare architecture diagrams to modern art, you’ll want the Mondrian, not the Pollock.

Reduce to the Essence

Billboard-sized database schema posters, which include every single table, stick to a single level of abstraction but are still fairly useless because they try to convey reality without placing an emphasis (Chapter 21).
When shrunk down to fit on a single presentation slide, they start to look like abstract art—something better placed in a museum than in architecture documentation. Therefore, omit unimportant detail to concentrate on what’s most relevant! The same is true for system design: it’s important to know “what kind of thing” your system is; for example, by defining a system metaphor (Chapter 24).

Find Balance and Harmony

Limiting the levels of abstraction and scope does not yet guarantee a useful diagram. Good diagrams lay out important entities such that they are logically grouped, relationships become naturally clear, and an overall balance and harmony emerge. If such a balance doesn’t emerge, it may just be that your system doesn’t have one. I once reviewed a relatively small module of code that consisted of a rather entangled mess of classes and relationships. When the developer and I tried to document this module, we just couldn’t come up with a half-decent way to sketch what was going on. After a lot of drawing and erasing we came up with a picture that vaguely resembled a data-processing pipeline. We subsequently refactored the entangled code to match this new system metaphor. It improved the structure and testability of the code significantly, thanks to diagram-driven design! A well-balanced diagram will show coupling, cohesion, and a high-level structure, concepts that equally help with good system design.

Indicate Degrees of Uncertainty

When looking at a piece of code, you can always figure out what was done, but it’s much harder to understand why it was done. It can be even more difficult to understand which decisions were made consciously and which ones simply happened. When creating diagrams, you have more tools at hand to express these nuances, so you should use them. For example, you can use a hand-drawn sketch to convey that your design is merely a basis for discussion.
Once you have full agreement and want to convey that every detail is critical, you can use a visual style that resembles an engineering blueprint. Many books, including Eric Evans’s, use this technique effectively. That’s also the reason this book uses sketches: we are discussing architecture approaches and ways of thinking, not concrete tools and processes. When drawing, consider the precision versus accuracy dilemma: “next week it will be roughly 15.235 degrees” doesn’t make sense, as it’s precise but inaccurate. Don’t make precise-looking slides if you know they aren’t accurate.

Diagrams Are Art

Diagrams can (and should) be beautiful—little works of art, even. I am a firm believer that system design has a close relationship to art and (nontechnical) design. Both visual and technical design start with a blank slate and virtually unlimited possibilities. Decisions are often influenced by multiple, usually conflicting, forces. Good design resolves these forces to create a functional solution, which attains a good balance and some degree of beauty. This may explain why many of my friends who are great (software) designers and architects have an artistic vein or at least an interest in art.

No Silver Bullet (Point)

Not all diagrams are useful as a design technique. Drawing a messy picture won’t make your poor design any better. Beautiful marchitecture diagrams,7 which have little to do with the actual system being built, are also of limited value. In technical discussions, though, I have observed many occasions for which drawing a good diagram has greatly improved the conversation and the resulting design decisions. If you are unable to draw a good diagram (and it isn’t due to lack of skill), it might just be because your actual system structure is not what it should be.

1 Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software (Upper Saddle River, NJ: Addison-Wesley, 2003).
2 Garr Reynolds, Presentation Zen: Simple Ideas on Presentation Design and Delivery, 3rd ed. (New Riders, 2019). 3 Edward Tufte, “PowerPoint Does Rocket Science: and Better Techniques for Technical Reports,” EdwardTufte.com, https://oreil.ly/kDihX. 4 Martin Fowler, “UML as Sketch,” MartinFowler.com, https://oreil.ly/WLUgR. 5 Mark Collins-Cope, “Interview with Grady Booch,” Objective View Magazine, Issue 12, Sept. 12, 2014, https://oreil.ly/HGc5j. 6 Edward R. Tufte, The Visual Display of Quantitative Information (Cheshire, CT: Graphics Press, 2001). 7 Marchitecture denotes marketing pictures disguised as architecture.

Chapter 23. Drawing the Line

Architecture Without Lines Likely Isn’t One

A functional architecture of a car

The sketch above depicts the architecture of a car. All the important components are there, including their relationships: the engine is under the hood; passenger seats are appropriately located inside the passenger compartment, close to the steering wheel; wheels are assembled nicely at the bottom of the car in the chassis. This diagram appears to fulfill most definitions of architecture (except my favorite one because I am looking for decisions; see Chapter 8). However, it does precious little to help you understand how a car functions: could you omit the gas tank because it’s far away from the engine, anyway? Are engine and transmission side by side under the hood by coincidence or do they have a special relationship? Does the car need exactly four wheels or will three also do? If you had to build the car in stages, what subset would make sense to assemble first? Would just the cabin with the seats be a good start? How can you distinguish a good car from a bad one? Which aspects are common in virtually all cars (e.g., the wheels being at the bottom) and which ones vary (Porsche 911, VW Beetle, or DeLorean owners would be quick to point out that their engine isn’t under the hood)? The picture doesn’t really answer any of these questions.
It depicts the location of the components, but it doesn’t convey their relationships or function in the overall system “car.” Even though the picture is factually correct and actually reasonably detailed, it doesn’t allow us to reason much about the system it is describing, especially its behavior. Coincidentally, it might also not be a good example of diagram-driven design (Chapter 22).

Behold the Line!

The critical elements that are missing in the picture are the lines connecting the components. Without lines, it’s quite difficult to represent rich relationships. The line is so important that boxes, labels, and lines suffice to make up Kent Beck’s only half-joking Galactic Modeling Language.1 Without lines, there wouldn’t be much of a modeling language left. Also, as often stated, “the lines are more interesting than the boxes.” Where does stuff usually go wrong? In the integration between two well-tested pieces. Where do I need to look to achieve strong or loose coupling? Between the boxes. How do I tell a well-structured architecture from a Big Ball of Mud?2 By the lines. The importance of lines is most easily understood from a simple example, illustrated in Figure 23-1.

Figure 23-1. Without lines, an architecture diagram is rather meaningless

The system on the left and the system on the right are made from the same components, A, B, C, and D. Would the two systems have different properties and behaviors? The system on the left has a neat, layered architecture, which provides clear dependencies and makes it easy to replace a component with a different one. It can also suffer from long latency because messages or commands have to travel through each component in sequence. Also, each component can become a single point of failure: if C fails, the chain is broken, and the system is unable to function. The system on the right has almost the exact opposite properties: interdependencies are a bit messy, making it difficult to replace a component.
However, the system provides shorter communication paths and is more resilient: if C fails, A can still talk to D. Now imagine that this diagram had no lines. You would never know whether the system is built like the one on the left or the one on the right, resulting in a rather meaningless architecture diagram.

If I see an architecture diagram without lines, I am inclined to reject it because it won’t convey the system’s behavior.

Therefore, if I see an architecture diagram without any connecting lines, I am skeptical as to whether it qualifies as a meaningful depiction of an architecture. Unfortunately, many diagrams fail this basic test.

The Metamodel

Stating that the diagram of the car doesn’t show any relationships isn’t quite true. The picture does contain two primary relationships between components:

Containment
One box is enclosed by another.

Proximity
Some boxes are close to one another, whereas others are farther apart.

Containment corresponds to real-world semantics in this drawing: seats are actually contained inside the passenger cell, and the hood (the engine compartment, to be more precise) houses engine and transmission. Engine and transmission are also next to each other, giving them proximity, which underlines that they share a strong relationship: one makes little sense without the other. But the proximity semantics in this picture are relatively weak: the gas tank and spare tire are also next to each other, but for the function of the car this doesn’t have any meaning. The vague correspondence of proximity in the diagram to real-life proximity has no relationship to function and thus renders an odd mix of a logical and physical representation. I routinely challenge diagrams that limit relationships between components to containment. Such diagrams make it difficult to reason about the system, as seen in the car example we looked at earlier.
Reasoning about the system is one of the main purposes of drawing an (architecture) diagram, so we need to do better. Diagrams that are based only on containment and proximity generally could have been just as easily represented as an indented bullet list: subbullets are contained by outer bullets, and bullets next to each other are in proximity. In our example, you would end up with a list like this (showing only a portion to avoid death by bullet points):

Hood
    Engine
    Transmission
Passenger cell
    Speedometer
    Steering wheel
    Four seats

In this case, the picture doesn’t say the proverbial 1,000 words. The list and the picture are just different projections of the same tree structure. And people say intentional programming is difficult!3 You might like the picture better than the list, but you must be aware that both representations have the same richness, or poorness, of expression. The picture adds the size and shape of the boxes, which aren’t represented in the textual list, but the semantics of size and shape in our example are unclear: all components are rectangles, but the wheels are circles. It’s a crude approximation of reality, but for reasoning about the system it doesn’t add much.

The Semantics of Semantics

When I was told for the first time that “UML sequence diagrams have weak semantics,” I was doubtful whether this rather academic statement had any relevance for me as a normal programmer. The short answer is: “yes, it does.” Prior to UML 2, sequence diagrams depicted only one possible sequence of interactions between objects, albeit allowing for concurrency. They couldn’t express the complete set of legal interaction sequences, such as loops (repeating interactions) or branches (either/or choices). Because loops and branches are some of the most fundamental control flow constructs, sequence diagrams’ weak semantics rendered them essentially useless as a specification. UML 2 improved the semantics but at the cost of much reduced readability.
Why worry so much about the semantics of a diagram? The purpose of design diagrams or engineering drawings is to give viewers an understanding of the system, particularly the system behavior. A drawing is a model, so it’s by definition wrong (Chapter 6). However, it can be useful; for example, by allowing the viewer to reason about the system. The visual elements, such as boxes and lines, must neatly map to concepts in the abstract model so that the viewer can build the model in their head. For the viewer to grasp the meaning of the drawing, the visual elements need semantics: semantics is the study of meaning.

Elements—Relationship—Behavior

Without lines, it is impossible to ascertain a system’s behavior. It’s like listing the ingredients for a meal without the recipe. Whether something tasty comes out primarily depends on the way it’s prepared: potatoes can turn into French fries, gratin, boiled potatoes, mashed potatoes, baked potatoes, fried potatoes, hash browns, and more. A meaningful architecture diagram, therefore, needs to depict the relationships between components and provide semantics for these relationships. Electric circuit diagrams provide a canonical example of system behavior that depends heavily on connections between components. One of the most versatile elements in analog circuitry is the operational amplifier, or op-amp for short. Paired with a few resistors and a capacitor or two, this element can act as a comparator, amplifier, inverted amplifier, differentiator, filter, oscillator, wave generator, and much more. The system’s behavior, which varies widely, doesn’t depend on the list of elements, but solely on how they are connected. In the world of IT, a database can act as a cache, ledger, file storage, data store, content store, queue, configuration input, and much more. How the database is connected to its surrounding elements is fundamental, just like the op-amp.
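The point that behavior lives in the lines, not in the component list, can be made concrete with the earlier four-component example. The sketch below is mine, not the book’s: Figure 23-1 isn’t reproduced here, so the meshed edge set on the right is an assumption; only the left-hand chain A–B–C–D is described in the text. A tiny reachability check shows that the same components, wired differently, behave differently when C fails.

```python
# Same four components, two different sets of lines (connections).
# A simple graph search shows the wiring alone determines behavior:
# the layered chain loses the A-to-D path when C fails; the meshed
# variant (a hypothetical wiring for the right-hand system) survives.

def reachable(edges, start, goal, failed=frozenset()):
    """Can 'start' still communicate with 'goal', skipping failed components?"""
    adj = {}
    for a, b in edges:  # build adjacency from undirected edge pairs
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    frontier, seen = [start], {start}
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                frontier.append(nxt)
    return False

layered = [("A", "B"), ("B", "C"), ("C", "D")]              # left: strict chain
meshed = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D")]   # right: messier wiring

# Both systems work when everything is healthy:
assert reachable(layered, "A", "D") and reachable(meshed, "A", "D")
# If C fails, the chain is broken, but the meshed system still connects A to D:
assert not reachable(layered, "A", "D", failed={"C"})
assert reachable(meshed, "A", "D", failed={"C"})
```

A boxes-only diagram hands you just the `components`; every conclusion above required the `edges`.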
Architecture Diagrams

If you feel that this is about as much as one could and should say about a contrived sketch of a car, rest assured that I get to see many architecture diagrams without any lines. These diagrams do depict proximity, simply because some boxes have to be next to each other, but whether any semantics are tied to this fact remains unclear. If you are lucky, proximity represents a form of “layering” from top to bottom, which in turn implies a dependency from things on “top” to things “further down.” In the worst case, proximity was defined by the order in which the author drew the boxes. So-called “capability diagrams” or “functional architectures” are particularly likely to be devoid of lines. These diagrams tend to list (pun intended) capabilities that are needed to perform a certain business function. For example, to manage customer relationships you need customer channels, campaign management, a reporting dashboard, etc. The set of capabilities forms a “laundry list” of things that are needed, but it brings us no closer to an architecture than listing windows, doors, and a roof does for a house. I, therefore, prefer such input to be represented as textual lists so that this distinction becomes clear. Wrapping text in boxes doesn’t constitute architecture.

UML

Speaking of lines, UML has a beautiful abundance of line styles: in a class diagram, classes (boxes) can be connected through association (a simple line), aggregation (with a hollow diamond on one end), composition (a solid diamond), or generalization (triangle). Navigability can be indicated by an open arrow, and a dependency by a dashed line. On top of this, multiplicities—for example, a truck having four to eight wheels but only one engine—can be added to the relationship lines.
In fact, UML class diagrams allow so many kinds of relationships that Martin Fowler decided to split the discussion into two separate chapters inside his defining book UML Distilled.4 Interestingly, UML allows composition to be visually expressed through a line or as containment, that is, drawing one box inside the other. With such a rich visual vocabulary, why invent your own? The challenge with UML notation is that you can appreciate the nuances of the relationship semantics between classes only if you have in fact read UML Distilled or the UML specification. That’s why such diagrams aren’t as useful when addressing a broad audience: the visual translation of solid diamond versus hollow diamond or solid line versus dotted line isn’t immediately intuitive. This is where containment works well: a box inside another is easily understood without having to add a legend.

Beware of Extremes

As so often, the opposite of bad is also troublesome. I have seen diagrams in which elements have different shapes, sizes, colors, and border widths; connecting lines have solid arrows, open arrows, or no arrows; are dotted or dashed; and come in different colors. These cases either result from sloppiness, in which case the visual variation has no meaning and is simply “noise,” or from a metamodel that’s so rich (or convoluted) that a diagram likely isn’t the right way to convey it. The rule I apply is that any visual variation in a diagram should have meaning—in other words, semantics. If it doesn’t, the variance should be eliminated to reduce visual noise, which only distracts the viewer and, worse yet, can cause the viewer to interpret this noise as semantics that were in fact never intended. Because you cannot look inside the viewer’s head, such misunderstandings or misinterpretations are difficult to detect.
In short: making all boxes the same size won’t crimp your artistic talent, but it will make clear to the viewer that the model behind the diagram considers all boxes to have the same properties. It’ll also help draw attention to the lines. The standard text on charting and diagramming is Tufte’s The Visual Display of Quantitative Information5 plus his subsequent books. Although the books initially focus on display of numeric information, later volumes cover broader aspects, including many examples that package complex concepts into diagrams that remain crisp and easy to grasp.

1 “Galactic Modeling Language,” Wikiwikiweb, https://oreil.ly/XT4lF. 2 Neal Harrison, Brian Foote, and Hans Rohnert, Pattern Languages of Program Design 4 (Boston: Addison-Wesley, 1999). 3 “Intentional Programming,” Wikiwikiweb, https://oreil.ly/5bGf-. 4 Martin Fowler, UML Distilled: A Brief Guide to the Standard Object Modeling Language, 3rd ed. (Boston: Addison-Wesley Professional, 2003). 5 Edward R. Tufte, The Visual Display of Quantitative Information (Cheshire, CT: Graphics Press, 2001).

Chapter 24. Sketching Bank Robbers

Architects as Police Sketch Artists

That’s what he looked like!

With a demanding job like that of an architect in a large IT organization, it’s a healthy exercise to do more of those things you enjoy and fewer of those you don’t enjoy. Of course, this requires you to know what you truly enjoy (and truly despise) in the first place—a task that can be a little more challenging than it sounds, especially for left-brained IT architects. The latter is generally more easily answered: in my case it’s 8 a.m. meetings with no particular objective that end up in a monologue by the highest-paid person. The former usually takes a bit more reflection. Over the years, I have realized that one of my favorite work activities is to listen to system owners or solution architects describe their system, often in fragments, and to draw a cohesive picture for them.
The most satisfying moment happens when they exclaim, “That’s exactly what it looks like,” without them having been able to draw the picture themselves. This exercise is also a great opportunity to learn about those system details that aren’t documented anywhere. Asking people to tell you about their system so that you can draw it for them may remind you of the old joke that describes consultants (Chapter 38) as those people who borrow your watch to tell you what time it is (and charge you a lot of money for it). Drawing expressive architecture diagrams, though, is a bit more involved than reading the time off a watch. It extracts people’s knowledge and presents it in a way that they weren’t able to create themselves. Being able to build a system doesn’t automatically mean the same person is gifted at representing it in an intuitive way. Therefore, helping such a person draw a picture of their system can be quite valuable. I liken this task to that of a police sketch artist.

Everyone Saw the Perpetrator

If a bank is robbed and you ask those people who saw the perpetrator to draw a picture, you’ll likely end up with stick figures or very rough sketches. In any case, you won’t get anything particularly useful even though the witnesses have a firsthand account of the person. Knowing something, being able to articulate it, and being able to draw it are three very different skills. That’s why a professional police sketch artist is usually brought in, especially in cases where security cameras could not get precise footage. The artist interviews the witnesses, asking them a series of questions that they can easily answer, such as “Was the person tall?” Based on the descriptions, the artist draws the picture, frequently obtaining feedback from the witnesses.
After initially giving trivial facts like “He was tall,” people end up confirming, “He looked just like that!”

A Police Sketch Artist

Being a police sketch artist is a fairly specialized job whose education includes both art and human anatomy. For example, a police sketch artist will undergo training in dental and bone structure because they influence the appearance of the suspect. The same is true for architecture artists: they need to have a minimum level of artistic skill, probably not quite at the level of the criminal sketch artist, but they must also have the mental model and visual vocabulary to express architectural concepts. Interestingly, sketch artists break down the problem and work with well-known “patterns”: after initially asking very broad questions like “tell me about the person,” the artist will guide the witness with these typical patterns, for example, ethnicity, or defining features such as nose, eyes, or hair. To exaggerate, they won’t discover that the person had two ears, two eyes, and one nose (if they don’t, that’s certainly worth mentioning!), but they do drive toward discriminating and defining features, just like we do when we try to tell whether something is architecture (Chapter 8). In the world of IT, we would do the equivalent. For example, when looking at data storage, we’d ask if it’s an RDBMS or a NoSQL DB, perhaps a combination, whether it uses caching, replication, and so on.

Sketching Architectures

When assuming the role of an “architecture sketch artist,” I tend to combine two different approaches:

The System Metaphor

First, I look for noteworthy or defining features; for instance, for the key decisions (Chapter 8). Is it a pretty vanilla website for a customer to review information, like a customer information portal? Or, is it rather a new sales channel, or even a piece of a cross-channel strategy? Is it designed to handle tons of volume, or is it rather an experiment that will see little traffic but must evolve very quickly?
Or is it a spike to test out new technologies, where the use case is secondary? After I have established this frame, I start filling in the details. I am a big fan of Kent Beck’s notion of a system metaphor that describes what kind of “thing” the system is. As Kent wisely states in Extreme Programming Explained:1

We need to emphasize the goal of architecture, which is to give everyone a coherent story within which to work, a story that can easily be shared by the business and technical folks. By asking for a metaphor we are likely to get an architecture that is easy to communicate and elaborate.

In the same book, Kent also states that “Architecture is just as important in XP [Extreme Programming] projects as it is in any software project,” something to be kept in mind by folks who are tempted to shun architecture because they are Agile (Chapter 31). Just like with diagram-driven design (Chapter 22), architecture sketching can also be a useful design technique. If the picture makes no sense (and the architecture sketch artist is talented), something might be inconsistent or wrong in the architecture.

Viewpoints

As soon as I have a rough idea about the nature of the system, I let the metaphor drive which aspects or viewpoints to examine. This is where doing an architecture sketch differs from performing an architecture analysis. An analysis typically walks through a fixed, structured set of aspects, as defined, for example, by methods such as C4 or arc42. This is useful as a “checklist” to uncover missing aspects or gaps. In contrast, a police sketch artist doesn’t want to draw the details of a person’s trouser finishings (hemmed? cuffed?), but wants to highlight those characteristics that are unique or noteworthy. The same is true for the architecture sketch artist.
Following a fixed set of viewpoints always runs the risk of becoming a paint-by-numbers exercise in which one fills in every section of a template but forgets to place an emphasis (Chapter 21) or omits critical points in the process. I therefore find the viewpoint descriptions in Nick Rozanski and Eoin Woods’s Software Systems Architecture2 useful because they don’t prescribe a fixed notation, but highlight concerns and pitfalls. Nick and Eoin also separate perspectives from views. When sketching an architecture, you are most likely interested in a specific perspective, such as performance or security, that spans multiple viewpoints, for example, a deployment or functional view.

Visuals

Each artist has their own style, and to some degree architecture sketches will also differ. I am not a big fan of molding all system documentation into a single notation because we are not creating a system specification (that’s in the code), but a sketch that gives humans a better vehicle to reason about the system. For me, it’s important that every visual feature of the notation has meaning in the context, or perspective, that we are analyzing. Otherwise, it’s just noise. Of course, the diagram must not only show the components but also their relationships (Chapter 23). The best diagrams are rich in expressiveness but don’t require a legend because the notation is intuitive from the start, or because the viewer can learn the notation from simple examples and apply what they learned to more complex aspects of the diagram. This is very much how user interfaces work: no user wants to read a long manual, but they will use what they see to build a mental model and use it to set expectations for how more complex features should work. Why not think of a diagram as a user interface? You might feel that it lacks interactivity, and you are right, but viewers navigate complex diagrams very much like users navigate user interfaces.
Architecture Therapy

Grady Booch drew an analogy between having teams depict their architecture and family therapy,3 which asks children to draw a picture of their family in a method referred to as Kinetic Family Drawings (KFD). The drawings give therapists insight into the family dynamics, such as proximity, hierarchy, or behavioral patterns. I have experienced the same with development teams, so you shouldn’t outright discard their drawings as meaningless or incomplete, but derive insight into the team’s thinking and hierarchy from them: Is the database in the middle of it all? Maybe the schema designer is calling the shots in the team (I know a case of that happening). Are there many boxes, but no lines? Probably the team’s thinking is focused on structural concerns but ignores system behavior. This is often the case when the architect is too far removed from code and operational aspects.

That’s Wrong! Do It Again!

A common situation when sketching an architecture for someone else is them stating, “This is wrong!” This is a good thing; it means that you discovered a mismatch between your understanding and theirs. If you hadn’t drawn it, you would never have realized it. Also, if you assume you are a reasonable proxy for subsequent consumers of the diagram, you also saved them from the same misunderstanding. Therefore, sketching out an architecture is almost always an iterative process. Bring an eraser.

1 Kent Beck, Extreme Programming Explained: Embrace Change (Boston: Addison-Wesley, 1999).
2 Nick Rozanski and Eoin Woods, Software Systems Architecture: Working with Stakeholders Using Viewpoints and Perspectives, 2nd ed. (Upper Saddle River, NJ: Addison-Wesley, 2011).
3 Grady Booch, “Draw Me a Picture,” IEEE Software 28, no. 1 (Jan./Feb. 2011).

Chapter 25. Software Is Collaboration

Got Git?
Much has (rightly) been said and written about the differences between IT architecture and classic building architecture, which we often refer to in our metaphors. For example, although buildings do evolve over time (just very slowly1), achieving high rates of change at low cost is something that brick-and-mortar objects can’t do. But many other kinds of artifacts can, and it’s not at all limited to software development.

Who Says Software Is for Computers Only?

Enterprises spend significant effort creating, revising, and sharing documents, be they strategic plans, schedules, design documents, or status reports (Chapter 30). Typically, these documents need input from multiple parties and undergo iterations and quality checks until they are released. Such artifacts are really a form of software—they surely aren’t hardware, even though they may be printed on physical paper on occasion (witnessing someone print 25 copies of a large slide deck on digital transformation will forever be burned into my mind). So, if documents are in fact software, and if we want to optimize and accelerate our collaboration and communication, we might be able to learn a bit by looking at how software delivery teams, especially widely distributed open source teams, work.

Version Control

The one tool you won’t be able to pry out of any developer’s cold, dead hands is version control (Chapter 14). Version control is the safety net that gives developers the confidence to move fast because they have the assurance that they can revert quickly in case they take a wrong turn. One of the most popular version control tools these days is Git. The model behind this software takes some getting used to, but once you adapt your flow, you’ll never want to go back to anything else. I wrote the precursor to this book in Markdown,2 a simple text format. I used Git for version control and Dropbox for file synchronization with the publishing engine.
After the book was published, I kept ideas for additional chapters (like this one) in a backlog. Without thinking too much about it, I reverted to writing the backlog in Microsoft Word, as these chapters weren’t done and weren’t going to be published soon. I instantly noticed that my rate of progress slowed down: Should I remove or rewrite this paragraph? What if I change my mind later and want to keep it? I’d better make a copy and “park” it somewhere for later. By the way, where did I keep the latest version? Should I use the Track Changes feature instead? When working with text files and version control, I would not have spent a second on any of this because I’d be assured that I could revert to a prior version at any time. I’d also be able to see all the changes I made over time, so I could track progress easily. Of course, I could have checked my Word documents into Git or a document management system like Microsoft SharePoint. However, two main benefits would be missing: First, version comparison between Word files is much more laborious than on simple text files. Word’s review mode tracks history but is geared toward minor revision changes as opposed to the iterative creation of a document. More important, the build toolchain that produces the book works with Markdown files, so I would not enjoy the benefits of Continuous Integration, meaning I can create a preview copy of the book any time I make a change. Anyone who has looked at a corporate file server notices that I am not the only one who appreciates version control. You’ll find 20-some copies of the same document, with the filename suffixed with a version number, prefixed with the date (for easy sorting), tagged with the last author’s initials to indicate branching, and so on. Someone had the right idea but stumbled in the implementation.

Single Source of Truth

Version control is a powerful tool, but only if all team members look at the same version.
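To illustrate why plain text lends itself so well to version comparison: diffing two drafts of a Markdown paragraph is a one-liner with standard tooling. Here is a small sketch using Python’s difflib; the file names and sample sentences are invented for illustration:

```python
import difflib

# Two drafts of the same (made-up) Markdown snippet.
old = [
    "# Version Control\n",
    "Version control is a safety net for developers.\n",
]
new = [
    "# Version Control\n",
    "Version control is the safety net that lets developers move fast.\n",
]

# A unified diff pinpoints exactly what changed between the drafts,
# which is what makes reviewing plain-text documents nearly effortless.
diff = difflib.unified_diff(old, new, fromfile="draft_v1.md", tofile="draft_v2.md")
print("".join(diff))
```

The output marks the removed line with `-` and the added line with `+`; doing the same comparison on two binary Word files requires a specialized tool.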
Emailing around documents that are kept on local drives means that each person has their own source of truth, which leads to friction in the best case and lost information in the worst. Therefore, version control must be coordinated among team members. The most transformative change in collaboration patterns I have witnessed was the advent of Google Docs (then called “Writely”) around 2006, and it wasn’t due to my seven years of drinking Google Kool-Aid. Google Docs popularized a browser-based document editing model that allows multiple users to simultaneously edit the same document. Interestingly, when Google Docs first became available internally at Google for dogfooding (Chapter 37), its feature maturity resembled that of Microsoft Word 5.0 from 1989: getting two bullet points to be the same size was already a challenge. Still, being able to collaborate in real time on a shared document fundamentally changed the way people worked together. No time was wasted on maintaining, mailing, finding, or merging multiple versions of documents. Almost all “my version versus your version” discussions went away because it was clear that the team worked toward a single shared outcome. Adding collaborators became easy and natural. Having to go back to sharing Word and PowerPoint documents by email has been a rather frustrating experience.

Trunk-Based Development

Most version control tools allow branching. Branches are separate versions of the codebase, often used to develop a special feature that’s not yet ready to be released. The major advantage of branching is that a person working in a branch can make many changes without having to worry about what else is going on. Alas, that freedom is usually short lived.
Sooner or later, the branch must be “merged” back into the authoritative version, also called the trunk, following the analogy of a version “tree.” Unfortunately, while a person was working in the branch, time wasn’t standing still: many other changes occurred to the documents or source code. As a result, merging becomes a rather unpleasant and often wasteful exercise: perhaps someone copyedited the paragraph that you just rewrote. That’s wasted effort! Also, while you are working in your branch, no one else can benefit from what you have done. If branches remind you of locally stored document versions, you are onto something. A version control system where each person works in their own branch doesn’t really help much in terms of collaboration. As a result, many folks advocate trunk-based development,3 an approach that mandates that all changes go into a single authoritative version of the codebase or document. Naturally, doing so avoids any drift between different authors’ versions. However, how can you put unfinished work into the main version of a document? There are quite a few options:

- The most obvious but also most underused solution is to break down big changes into a series of smaller tasks (Chapter 30).
- Software teams use feature toggles to enable or disable a feature, allowing code to be integrated into the system but not yet available to users. The equivalent for presentations are hidden slides: you can happily work on those knowing that they won’t be shown to the audience.
- Making very short-lived branches that last only a single day is also OK and won’t break the trunk-based model. This way you can iterate and tinker and merge before you leave work.
- Having code in the trunk doesn’t mean it’s instantly released to production. Many teams use separate release branches, which undergo additional review and testing. The equivalent for documents and presentations would be to cut a PDF at a known good state for subsequent distribution.
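In their simplest form, the feature toggles mentioned above are nothing more than a flag checked at runtime. A minimal sketch (the flag names, such as `beta_reports`, are invented for illustration):

```python
# A minimal feature-toggle sketch: unfinished code lives in the trunk
# but stays invisible to users until its flag is flipped.
FEATURE_FLAGS = {
    "beta_reports": False,  # still under development
    "new_login": True,      # finished and enabled
}

def is_enabled(feature: str) -> bool:
    """Return True if the named feature is turned on (unknown features are off)."""
    return FEATURE_FLAGS.get(feature, False)

def render_menu() -> list:
    """Build the user-visible menu; toggled-off features are hidden."""
    menu = ["Home", "Documents"]
    if is_enabled("new_login"):
        menu.append("Login")
    if is_enabled("beta_reports"):
        menu.append("Reports")  # not shown until the toggle is flipped
    return menu

print(render_menu())  # the unfinished Reports feature stays hidden
```

Real toggle systems read flags from configuration or a service rather than a hardcoded dict, but the principle is the same: the code is integrated and continuously built, yet invisible, much like a hidden slide in a deck.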
Always Be Ready to Ship

When collaborating on a slide deck, multiple authors usually contribute parts, which are then reviewed over the course of multiple iterations until the deck is considered good enough and meets the corporate style guidelines. The key question is: When is something “good enough”? To me, the most important elements of a presentation are the key messages and the storyline they are woven into (Chapter 20). Interestingly, both can be done well before painstakingly aligning all the boxes and converting graphics to the corporate color palette. A presentation with a solid storyline and simple graphics is also far more impactful than a half-finished one with fancy stock photographs, so we should work on those aspects first. When working on slides, we can learn from modern software development techniques such as Agile development and DevOps, which aim to always have software that could be released if need be. I tend to ask my teams: “What if we had to present in one hour?” Do we have a core storyline and some essential slides to support it? When you are at that point, you can refine and improve slides with much less stress. Always being ready to ship highlights the difference between working iteratively and incrementally.4 Many people make slides incrementally and have only half a slide deck after half the time has elapsed. They are not ready to present. Following the DevOps mindset nudges you to work iteratively, meaning you have a rough version of the whole story that you can share immediately if needed (see Figure 25-1).

Figure 25-1. Building presentations incrementally versus iteratively

Style Versus Substance

Some folks might counter that even if a storyline is decent, presenting it in rough packaging means it “isn’t ready” or even “not professional.” I am a big fan of good design and spend a fair amount of time giving my stage presentations a clean and professional appearance.
However, if I have to choose between a solid message and pretty pictures, I’d have to choose the message because I am an architect and not an artist—analogous to the Agile Manifesto’s preference for working software over documents. Documentation is important, but if you can have only one or the other, you’d want to pick working software. You’ll encounter the same argument when working in formats like Markdown or simple collaboration tools. Because such systems aren’t as full featured as desktop-publishing or word-processing tools, teams that are used to focusing on visuals over content often dismiss them as not meeting their needs, failing to realize that this is exactly what they need.

Transparency

On many software projects, you can see a monitor or a glowing orb that shows the project’s current build status. Just by walking by, you can see how many builds have been made, how many are green (free of errors), and how many are red. Such a project is fully transparent, which builds trust outside the project and motivation within. The same level of transparency can be applied to any project; for example, showing how many servers were migrated out of an old datacenter over time or how many systems have become compliant with the IT security guidelines. On a prior team, we had a glowing LED display that showed the total number of pushes to our source code repository. Not only was it a great conversation piece, it also led to a minor celebration when four digits weren’t sufficient anymore. In an enterprise context, you are likely to encounter two major hurdles against such transparency. First, project managers prefer to “massage” their message carefully in status meetings instead of sharing it widely. Second, many teams don’t have the relevant data ready at hand. Whereas the former is annoying, the latter is worse: How do they steer the project if they don’t have the vital metrics at their fingertips?
Pairing

The most debated practice of modern software delivery is pair programming. However, when producing slides or documents, you’ll find a joint working session to be much more productive than emailing redlined documents and comments back and forth. I have seen slide review cycles that oscillate for weeks and months between review meetings, assigning tasks, people making changes (often misunderstanding what was discussed), and reconvening. If everybody sat in a room and developed the slides together, they could have been done in a few hours. “Pairing” on slide decks—I call it “pair PowerPointing”—can save lengthy review and edit cycles and generally leads to better results.

Resistance

Of course, there’s always resistance. Star Wars couldn’t possibly have had nine episodes’ worth of storyline without The Resistance. Besides purely politically motivated arguments against transparency, you may find that people consider working in text formats like Markdown to be “too technical.” I was once alerted by a large company’s digital innovation branch that “Markdown is too technical.” My initial reaction was to ask them whether it’s the hash mark or the star that tripped them up… More seriously, you’ll find that working with a version control system like Git is not to everyone’s liking and carries a learning curve. During my early days using Git, I missed staging a new file. When I checked out an older branch, the file was still in my working directory (it wasn’t under Git’s control), so I concluded that I should delete it. When reverting to the original branch, I was shocked to find that my file didn’t come back. Thank God for hard drive backups. When asking people to embrace version control, it’s important to first teach the concepts of a version control system (commits, branches, and so on) in the context of real work scenarios. It then becomes easier to get used to Git’s occasionally quirky model.
When they’re past this hurdle, though, people will consider working without version control like driving without a seatbelt.

1 Stewart Brand, How Buildings Learn: What Happens After They’re Built (New York: Penguin Books, 1995).
2 A simple text-based language, originally intended to author web pages without having to learn HTML.
3 Paul Hammant et al., “Trunk Based Development,” https://trunkbaseddevelopment.com.
4 Jeff Patton, “Don’t Know What I Want, But I Know How to Get It,” Jeff Patton and Associates website, https://oreil.ly/biPNX.

Part IV. Organizations

Architects in the enterprise live at the intersection of the technical and business worlds. In fact, getting these two pieces to work together seamlessly is one of an architect’s key contributions (Chapter 4). Therefore, a good architect needs to understand not only the interplay between system components, but also the interplay in that large and dynamic system known as an organization.

Organizational Architecture: The Static View

The most common depiction of an organization’s architecture is the organizational chart (“org chart”). These charts depict who reports to whom, and one can measure people’s importance by how far they are from the CEO. Assuming you count from zero in good computer-science tradition, I am often at level two or three, below a group CEO, a divisional CEO, and perhaps a COO in between. For an architect in a large organization, this isn’t bad at all—many people find themselves at level 6 or 7. Luckily, org charts have lines and thus pass our test for architecture diagrams (Chapter 23). Computer-science-educated folks may recognize an org chart as a tree: an acyclic, connected, directed graph with a single root (math folks consider trees to be undirected, but that’s fine also). Alas, it’s only showing part of the picture: depicting the static structure tells us little about how people interact to make the business work.
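The tree property is easy to check mechanically. As a small sketch (the reports-to mapping below is invented for illustration, and every manager is assumed to appear as a key), an org chart is a tree exactly when there is a single root and every walk up the chain of command reaches that root without cycling:

```python
def is_tree(reports_to: dict) -> bool:
    """Check that a reports-to mapping (person -> manager, root has None)
    forms a tree: exactly one root, and no reporting cycles.
    Assumes every manager also appears as a key in the mapping."""
    roots = [p for p, mgr in reports_to.items() if mgr is None]
    if len(roots) != 1:
        return False
    for person in reports_to:
        seen = set()
        while reports_to[person] is not None:
            if person in seen:  # cycle: someone ultimately reports to a subordinate
                return False
            seen.add(person)
            person = reports_to[person]
        # each walk up ended at a person with no manager, i.e., the single root
    return True

# A tiny, invented org: the CEO is the single root.
org = {"ceo": None, "cto": "ceo", "cfo": "ceo", "architect": "cto"}
print(is_tree(org))  # True
```

A mapping with two roots, or one where A reports to B and B reports to A, fails the check, which is precisely why real org charts always render as neat pyramids.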
Organizational Architecture: The Dynamic View

An org chart depicts engineering, manufacturing, marketing, and finance departments as separate pillars of the organizational pyramid. However, in reality, engineering must design a product that can be easily and reliably manufactured, marketed to customers, and sold at a profit. How well organizations work is rarely defined by the organization’s structure—most organizations will have the aforementioned functions—but by how those functions interact: How slow or fast are their development cycles? Do they work in a Waterfall or an Agile model? Who talks to the customers (who, interestingly, aren’t depicted in the org chart)? Coworkers also routinely talk to one another to solve problems without following the lines in the organizational pyramid. This is a good thing because otherwise managers would quickly become communication bottlenecks. In many cases, the org chart shows the control flow of the organization (for example, to give budget approvals), whereas the data flow is much more open and dynamic. Ironically, the way people actually work with one another is rarely depicted in a diagram. Part of the reason might be that this data is difficult to gather; the other part could be that it doesn’t look nearly as neat as the org chart pyramid. When people coordinate and communicate electronically, the actual, dynamic organizational structure can be observed more easily. For example, if developers collaborate via a version control system, we can analyze code reviews or check-in approvals to see the real collaboration taking place. Google had another interesting system that allowed you to see which people sit near a given person. Because interaction and collaboration are often still based on ad hoc conversations, physical proximity can be a better predictor of collaboration patterns than the org chart structure.
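Mining the version control history for the de facto collaboration structure can be as simple as counting which authors touch the same files. A minimal sketch (the commit log below is entirely hypothetical; a real analysis would extract author/file pairs from the actual history, e.g., via `git log`):

```python
from collections import Counter
from itertools import combinations

# Hypothetical check-in log: (author, file touched).
commits = [
    ("alice", "billing.py"), ("bob", "billing.py"),
    ("alice", "ui.py"), ("carol", "ui.py"),
    ("bob", "billing.py"), ("alice", "billing.py"),
]

# Group authors by the files they touched.
authors_per_file = {}
for author, path in commits:
    authors_per_file.setdefault(path, set()).add(author)

# Count author pairs per shared file: an edge list of who really collaborates.
pair_counts = Counter()
for authors in authors_per_file.values():
    for pair in combinations(sorted(authors), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common())  # the de facto collaboration graph
```

The resulting pairs form the edges of the dynamic organization chart, which often looks quite different from the static reporting pyramid.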
The Matrix (Not the Movie)

In large organizations, people can have multiple reporting lines: a “dotted line” to their project or program manager, and a “solid line” to their department or “line manager.” Such an arrangement is often part of a so-called matrix organization, in which people report horizontally to the project and vertically to their manager. Or is it the other way around? If you find this a little confusing, you’re not alone. High-performance delivery organizations generally shun such arrangements, making sure people are fully assigned to, and responsible for, a single project. I often jest that I want all people working on a project to be on the same boat, without life vests and with no rescue lines to other parts of the organization. A team needs a shared success or, if it so happens, a shared failure. Don’t worry, they are all able to swim.

Organizations as Systems

As architects, we know well how to design systems; for example, when to apply horizontal scaling, loose coupling, or caching. We often are also trained in systems thinking (Chapter 9), which teaches us how to reason about the relationship between elements in a system and the overall system behavior, driven, for example, by positive or negative feedback loops. However, we often hesitate to apply such rational thinking to organizations because organizations have a very human face, which makes us feel bad about degrading our nice and not-so-nice coworkers into the boxes and lines (Chapter 23) of some system architecture. However, even though they’re composed of individuals, large organizations behave much like complex systems, including technical ones. Therefore, as architects we can apply our architectural mindset and rational systems thinking to large organizations in order to understand and influence them. It’s a bit like a reverse-engineering, debugging, and refactoring exercise.

Organizations as People

All rational reasoning aside, organizations are made up of individuals.
We also shouldn’t forget that for many of them, work is just a small part of their lives: they have families to take care of, bills to pay, doctors to visit, home repairs to make, or hangovers from last night’s party to overcome. Understanding organizations depends on understanding people’s emotions and motivations. This can be a stretch for left-brain-type architects, but one they need to make. Consider this yoga for your brain.

Navigating Large Organizations

Dealing with organizations can be challenging for architects. However, many concepts that are well known in the context of architecting systems can also be applied to understanding organizations:

Chapter 26, Reverse-Engineering Organizations: To bring lasting change, you need to help organizations unlearn existing beliefs.
Chapter 27, Control Is an Illusion: Command-and-control structures aren’t a one-way street.
Chapter 28, They Don’t Build ’Em Quite Like That Anymore: Pyramids went out of vogue 4,500 years ago, but are still widely used in IT.
Chapter 29, Black Markets Are Not Efficient: High-friction organizations breed black markets, which are dangerous.
Chapter 30, Scaling an Organization: Experience in distributed systems design can be applied to organizations.
Chapter 31, Slow Chaos Is Not Order: Slow-moving things can seem well coordinated when in reality they’re just slow-motion chaos.
Chapter 32, Governance Through Inception: Governance by decree is difficult and better done by planting ideas.

Chapter 26. Reverse-Engineering Organizations

Learning Is Hard; Unlearning Is Much Harder

To change a system’s observed behavior, you need to change the system itself (Chapter 10). For organizational systems, the systemic behavior is primarily guided by its culture. A significant portion of this culture derives from shared beliefs held by the organization’s members. So, to permanently change an organization’s observed behavior, you need to identify and change those beliefs.
Unfortunately, these shared beliefs aren’t written down anywhere; there aren’t any motivational posters for shared beliefs. Also, most people won’t even be aware that they carry them. So, you’ll need to apply one of your well-honed engineering skills: reverse engineering.

Dissecting IT Slogans

A good starting point for reverse-engineering an organization’s hidden beliefs is its popular slogans. Anyone who has worked in IT for a little while surely has heard the saying “never touch a running system” (Chapter 12). Why would people not want to touch a system that’s running? Apparently because they believe that change is risky: if you touch it, you might break it. Deeper down, they may also believe that fixing broken things is cumbersome, so it’s better not to break them in the first place. The well-known IT slogan “never touch a running system” reflects the underlying belief that change is risky. And, worse yet, it also assumes that not changing anything bears no risk. Importantly, though, there’s an additional assumption behind this simple slogan: if you don’t touch the system, all will be fine. This belief, that no change implies no risk, is worrisome. First, from an operational perspective, systems that aren’t maintained will rot and, for example, use outdated libraries and operating systems that pose security risks. Also, in the digital world, which is constantly evolving, standstill is regress: competitors move ahead with frequent updates and rapid feature evolution. Ultimately, not changing can be fatal for organizations—consider Kodak, Blockbuster, or BlackBerry. Second, you’ll notice how simple slogans can become self-fulfilling prophecies. When you avoid changing a system for a long time, this actually does increase the risk of change: important details will have been long forgotten, and undocumented manual steps increase the odds that something will go wrong. Such experiences confirm and fuel the belief.
Unknown Beliefs

Not all organizational beliefs manifest in slogans, though. In most cases, people might not even be aware that they carry certain beliefs until their assumptions are being challenged. I had that very experience at a Munich beer festival. The well-known Munich Oktoberfest has a springtime cousin, the Starkbier Fest (“strong beer festival”). As the name suggests, this festival serves beer with an alcohol content about 50% higher than the Oktoberfest, in the same 1-liter jugs. Needless to say, “having a beer or two” can make the way home somewhat challenging. The more surprised I was when my younger colleague commented that he drove to the festival in a convertible to take advantage of the sunny weather. My immediate reaction was: “Are you out of your mind to drive to a beer fest?” His calm answer was, “No, I leave the car here.” Not only did I feel old, I also realized that I had been carrying a fundamental belief about cars: if you drive somewhere, you (hopefully) come back with the same car; otherwise, it would be difficult to go somewhere else tomorrow. What broke this assumption, which was useful in the past? Car sharing—the ability to pick up a car near your home, to rent it by the minute, and to leave it at your destination. Without ubiquitous smartphones, GPS, telematics, and other good stuff, my assumption was handy and never challenged; now it limited my thinking. Because most people are unaware of the assumptions they hold, you can’t just ask them what their beliefs are. If you had asked me about my beliefs about cars, I might have said that you must have liability insurance and fuel in the tank.

Beliefs Are Proven Until Disproven

Beliefs stick because people often have living proof or firsthand experience. When the environment changes, though, and the belief is no longer applicable, their past experience makes them reluctant to change.
Think of kids who learned that touching a stove top is a bad idea. Some of them learned this the very hard and painful way; others were told many times. They therefore embraced a useful belief that prevents accidents. Alas, the invention of the induction cooktop makes this belief obsolete: induction cookers heat pots directly through electromagnetic fields, leaving the cooktop surface relatively cool. You will find it difficult, though, to make children touch this new cooktop, because they learned their lesson well. The best way might be to touch the cooktop yourself to demonstrate the change. Because most people have living proof for their beliefs, just telling them otherwise is unlikely to succeed. The same is true in IT: most IT folks will be able to tell you a story in which someone did touch a running system, causing it to fail and requiring operations teams to stay up 48 hours straight trying to get it back up and running. Simply telling people that they are wrong or that magically everything is different is unlikely to be successful. Trying to convince them that change isn’t risky by spewing out buzzwords and acronyms like TDD, IaC, Git, and Spinnaker is like explaining to kids that an induction stove is safe to touch by citing Faraday’s law. Instead, you might start by explaining the tenets of DevOps, such as deployment automation, version control, and automated testing. Also, making smaller changes reduces the risk of each change. Better yet, you can demonstrate the effect with a real software delivery project.

Unlearning Old Habits

Bringing change into an organization can easily stumble over existing, strongly held beliefs that are part of the existing culture. For example, selling Continuous Delivery to a person who equates change with risk is going to be quite a challenge. We therefore need to help the organization unlearn these old habits before lasting change can take hold.
Learning new things isn’t easy, but unlearning old habits, especially ones that served us well many times, is much more difficult.1 It seems that replacing existing beliefs requires you to free up a memory slot in the brain first before you can program it again. Common IT Beliefs When trying to reverse-engineer existing beliefs, there’s one ray of hope: many IT organizations hold similar beliefs because they were subject to the same learning, or priming (Chapter 6). Hence, the following list of beliefs can give you a head start: Speed and Quality Are Opposed (“Quick and Dirty”) The so-called project management triangle is both one of the most popular and most dangerous tools in IT management because it purports that scope, time, and resources have a simple relationship. For example, twice as many people would be able to accomplish the same work in half the time. Worse yet, it assumes that by compromising quality, things can be sped up further. While the triangle might work for simple, physical tasks, it surely doesn’t work in software delivery, where the opposite is often the case. For example, if a developer wanted to secretly sabotage a software project, a suitable way would be to introduce subtle, hard-to-find bugs. How, then, can lower quality speed things up? Modern developers know that in software development the opposite effect takes place: we speed things up by automating them, which also increases quality (Chapter 40) and repeatability. Quality Can Be Added Later Classic software projects end with a “QA” phase, the quality assurance phase, which consists of a team of testers checking whether the deliverable is of high quality. Behind this common approach rests a fundamental belief that quality is something that can be added to an existing work product: if something of poor quality goes into QA, it comes out with higher quality.
Detecting bugs and reworking can improve some aspects of software, but it cannot correct fundamental deficiencies in the internal quality of a software system, such as its structure or testability. Those must be built in from the beginning. Methods such as shift-left testing2 follow this approach. This belief and the previous one are related. If you work under the assumption that quality is added at the end, compressing the schedule will most likely decimate the (manual) QA activity, actually resulting in lower quality. All Problems Can Be Solved with More People or Money Also guided by the scope-resources-time triangle, some organizations assume that more people reduce time under constant scope. There’s a common saying that when a project manager in a typical bank tells the business that it’s impossible to deliver the project in three months, the business responds by asking, “How much more money do you need?” Fred Brooks documented four decades ago3 that adding people requires onboarding and increases communication overhead, both of which will slow down a project. Also, large projects often come to a grinding halt due to excessive complexity. Adding more resources is likely to increase complexity, exacerbating the problem. Therefore, if you want to speed up a project, look to reduce friction instead of adding more resources. If your car’s (or your organization’s) handbrake is set, you’ll want to release the brake, not step harder on the gas pedal. Following a Proven Process Leads to Proven Good Results Much of an organization’s way of working is encoded in well-defined processes, which aim to reduce risk, control spending, and assure high-quality deliverables. Many large organizations even have entire departments whose job it is to define and update processes. Even though most processes, such as approvals or budget reviews, are well intended, following them assures only one thing: that the process was followed.
There’s a big leap from someone checking a box or completing a task to actually achieving the desired result, such as reducing spend or assuring architecture compliance. Especially if processes are cumbersome, people will be inclined to just obtain the necessary process check marks without fulfilling the intention of the process, possibly leading to a flourishing black market (Chapter 29). Some organizations therefore police and audit for process compliance by investigating projects and their implementations, resulting in a “catch me if you can” kind of game. Attempting to achieve desirable results via processes and checklists typically stems from a lack of transparency. If you can’t see what a project does or what kind of code it develops, the next best thing you can do is make sure a certain process was followed. Modern development and deployment practices such as central code repositories, automated code quality checks, automated policy checkers, and cloud runtimes provide a much higher level of transparency and allow much more effective compliance checks. Late Changes Are Expensive or Impossible Did you ever wonder why a typical IT project has an endless list of requirements that appears to anticipate any possible use case or scenario for the next five years? It’s based on a simple belief: the business has learned that late changes are expensive or even impossible. An old joke goes that IT tends to deliver only half of what it promised, so the business asks for twice as much in the hope that they get the right half. This belief stems from IT service providers’ common practice of charging large sums for late changes. Because they compete aggressively on the initial bid for a project, they compensate by charging astronomical sums for changes during project execution, when the customer doesn’t have a lot of alternatives.
Even internal projects that are subject to budget approvals or are poorly architected might have reconfirmed this belief because they are difficult to change later. Welcoming late changes is a key tenet of Agile development, dispelling this common belief. It’s best illustrated by Mary Poppendieck’s observation that “a late change in requirements is a competitive advantage.” Agility Opposes Discipline Because Agile development welcomes change during a project, it is often seen as being at odds with stable processes. After all, change and stability are opposites of each other. Following this logic, some organizations even believe that in the absence of rigid steering and control mechanisms, things descend into utter chaos. However, the opposite is true: Agile development is actually a very disciplined process because speed and lack of discipline don’t mix (Chapter 31). Agile methods prioritize delivering value early and maintain velocity through rigid adherence to regular (re)planning and tracking of progress and quality, something often missing in traditional projects. The Unexpected Is Undesired After spending a lot of time creating a plan, traditional organizations expect things to go according to that plan. Deviations from the plan or unexpected outcomes are undesired and considered a failure. However, it’s exactly when something unexpected happens that the most learning takes place. That’s because the unexpected can tell you that you made a poor assumption or that an error was hidden in the system. Therefore, successful businesses run experiments to verify or falsify a hypothesis. Either way, the outcome implies learning and isn’t a failure at all. This means that traditional enterprises learn less, which is dangerous in a world that’s constantly changing. Rather than avoiding deviations, enterprises should identify valuable hypotheses that they can test quickly and cheaply. Therefore, minimizing the cost of experimentation is a better goal than minimizing deviation.
Reprogramming the Organization Given how strongly existing beliefs influence an organization’s ability to transform, how do you best go about identifying and changing them? There is no magic three-step recipe, but you can combine several behaviors to tackle them: Observe carefully You can’t just ask people about their beliefs, because most of the time they aren’t even aware of them. Instead, observe how people behave and look for unusual or unexpected decisions. Then, consider which belief would make such a decision appear rational. Ask questions Keep asking people (Chapter 7) why they would choose a particular option to uncover what drives their behavior. Explain carefully Acknowledge the usefulness of their belief in the past, but explain what has changed since then. Define new beliefs Because it’s difficult for people to unlearn things, establish clear new beliefs that can replace the old ones. Be patient Change takes time (Part V). Your goal isn’t to turn everyone’s beliefs upside down but to identify and dislodge those beliefs that impede the change you are trying to bring. Reverting too many beliefs will make folks insecure and confused. Handed-Down Beliefs Whereas most beliefs stem from actual experience, others are handed down through generations. A classic (unconfirmed) story of beliefs taking on a life of their own involves monkeys, water, and bananas: several primates were in a cage with bananas hung in the middle. As soon as any monkey reached for the tempting bananas, all monkeys would be sprayed with cold water, something they weren’t particularly fond of. Every now and then, a monkey was replaced with a new one. If the new arrival reached for the bananas, the others would be quick to hold it back because they knew what was going to happen. Even after all the original monkeys had been replaced and none of the resident monkeys had ever been sprayed with water, the best practice of “don’t touch the bananas” lived on.
This story isn’t based on scientific evidence, but sometimes it does feel that despite all the technical advances we have made, our basic behavior hasn’t evolved that much from reaching for bananas. 1 Barry O’Reilly, Unlearn: Let Go of Past Success to Achieve Extraordinary Results (New York: McGraw-Hill, 2018). 2 Wikipedia, “Shift-Left Testing,” https://oreil.ly/iotex. 3 Frederick P. Brooks, The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition (Boston: Addison-Wesley Professional, 1995). Chapter 27. Control Is an Illusion It’s When You’re Told Exactly What You Want to Hear Who’s in control here? While working in Asia, I’ve become accustomed to sharing a few of my personal details before presenting to a group of people. I like the idea because it doesn’t have the flavor of bragging about professional accomplishments; rather, it gives the audience an impression of the speaker’s background to better understand what shaped their thinking. In a presentation to a group of CEEMA (the Central-Eastern Europe, Middle East, and Africa region) COOs and CIOs, I once opened with a slide summarizing my core beliefs in the form of the pin buttons that many people used to wear in the 1980s. The one slogan that received immediate attention was “Control is an illusion.” Even more attention was paid to my explanation: “You feel that you have control when people tell you exactly what you want to hear.” Perhaps this wasn’t the kind of control these senior executives wanted to have over their business. The Illusion How can control be an illusion? “Having control” is based on the assumption that a direction set from the top down is actually being followed and has the desired effect. And this can be a big illusion. How would you know that it does, if you are simply sitting at the top, pushing (control) buttons instead of working alongside the staff?
You can rely on management status reports, but then you make a major assumption that the presented information reflects reality. This might be yet another big illusion. Steven Denning uses the term semblance of control1 in contrast to actual control for this phenomenon in large organizations. A more cynical version would be to claim that the inmates are running the asylum. In either case, it’s not the state you want your organization to be in. Control Circuits A brief look at control theory sheds some light on where the illusion originates. Control circuits, such as a room thermostat, keep a system in a stable condition—in this example, keeping a room at a constant temperature. They do so based on sensors and feedback: the thermostat senses the room temperature and turns the furnace on when the room is cold. When the desired temperature is reached, it turns off the furnace. The feedback loop compensates for external factors such as the outside temperature or someone opening the window. This runs counter to many project planning approaches that attempt to predict all factors up front and then look to execute according to plan. It’s like running the heater for exactly two hours and then blaming the cold weather for the room not being warm enough. Embarrassingly, a cheap thermostat can give us better control than some project managers. A Two-Way Street Jeff Sussna describes the importance of feedback loops in his book Designing Delivery,2 drawing on the notion of cybernetics. While most people think of cyborgs and terminators when they hear the term, cybernetics is actually a field of study that deals with “control and communication in the animal and the machine.” Such control and communication is almost always based on a closed signaling loop. 
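The closed feedback loop described above can be made concrete with a few lines of code. The sketch below is purely illustrative: the `Thermostat` class, its hysteresis band, and the temperatures are assumptions for the example, not taken from any real control system.

```python
# A minimal closed-loop controller, as in the thermostat analogy:
# sense the room, compare against the setpoint, actuate the furnace.

class Thermostat:
    def __init__(self, setpoint, hysteresis=0.5):
        self.setpoint = setpoint      # desired temperature
        self.hysteresis = hysteresis  # dead band to avoid rapid toggling
        self.furnace_on = False

    def step(self, measured_temp):
        """One iteration of the feedback loop."""
        if measured_temp < self.setpoint - self.hysteresis:
            self.furnace_on = True    # too cold: turn the furnace on
        elif measured_temp > self.setpoint + self.hysteresis:
            self.furnace_on = False   # warm enough: turn it off
        return self.furnace_on

t = Thermostat(setpoint=21.0)
print(t.step(19.0))  # cold room: furnace turns on
print(t.step(21.2))  # within the dead band: stays on
print(t.step(22.0))  # above setpoint: furnace turns off
```

Because the loop keeps sensing, it compensates for an opened window or a cold night without anyone predicting those factors up front, which is exactly what a fixed plan ("run the heater for two hours") cannot do.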
When we portray large organizations as “command-and-control” structures, we often focus only on the top-down steering part, and less on the feedback from the “sensors.” But not using the sensors means one is flying blind, possibly with a feeling of control, but one that’s disconnected from reality. It’s like driving a car at night without headlights and turning the steering wheel with no clue where the car is actually headed—a very dumb idea. It’s shocking to see how such behavior, which borders on absurdity, can establish itself in large organizations. Problems on the Way Up Even if an organization uses sensors—for example, by obtaining the infamous status reports—not all is necessarily well. Anyone who has heard the term watermelon status understands: these are the projects whose status is “green” on the outside but “red” on the inside, meaning that they are presented as going well but in reality suffer from serious issues. Corporate project managers and status reporters aren’t straight-out liars, but they might be overly optimistic or take some literary license to make their project look good. “700 happy passengers reach New York after Titanic’s maiden voyage” is also factually correct, but not the status report you’d want to get. Observing how much trust some senior executives place in PowerPoint slides might make you believe that it not only has a built-in spell checker but also a lie detector. Digital companies are generally suspicious of fabricated presentations and “massaged” messages, but instead believe in hard data, preferably rendered in live metrics dashboards. Google’s Mobile Ads team in Japan reviewed the performance of all ad experiments, run as A/B tests, every week, and decided which experiments should be accepted into production, which ones should be rejected, and which ones needed to run longer in order to become conclusive. The decisions were based on hard user data, not projections or promises. 
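The weekly accept/reject/run-longer decision described above can be grounded in a simple statistical rule over the experiment data. The sketch below uses a standard two-proportion z-test with illustrative thresholds; it is an assumption for the example, not a description of Google's actual process.

```python
import math

def decide(conv_a, n_a, conv_b, n_b, min_samples=10_000):
    """Classify an A/B experiment as 'accept', 'reject', or 'continue'.
    conv_*: conversions observed; n_*: users exposed. Thresholds are illustrative."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)                # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se                               # two-proportion z-score
    if abs(z) >= 1.96:                                 # significant at ~95%
        return "accept" if z > 0 else "reject"
    if n_a + n_b < min_samples:
        return "continue"                              # inconclusive: needs more data
    return "reject"                                    # large sample, no effect

print(decide(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000))  # prints "accept"
```

The point is that the verdict follows mechanically from user data; there is no room for a "massaged" narrative between the numbers and the decision.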
Working based off hard data can be frustrating because getting a solution running doesn’t yet earn you much praise: that’s expected anyway. Praise comes once your solution receives attention and traffic from actual users— data that’s much harder to fabricate. Smart Control Some control circuits take in more feedback signals and refine how they drive the system. For example, some heating systems measure the outside temperature to predict energy loss through windows and walls. Google’s Nest thermostat takes it a step further: it takes in additional information, such as the weather forecast (the sun helps warm the house), and when you are usually home or away. It also learns about the inertia of the heating system, which can lead to overheating the house due to heat capacity left in the radiators when the furnace is switched off. Nest is thus called a “learning” or “smart” thermostat; it takes in additional signals and optimizes what it does based on that feedback. It would be fantastic if we could apply that label to more project managers. Saupreiß, ned so Damischer When people speak about command-and-control structures, they are quick to cite the military, which, after all, is run by commanders. The military organization most equated with stodginess and “iron discipline” is the Prussian army. For people living in Bavaria, in Germany’s southern region, Prussia is externalized in the concept of the Preiß, a not-so-friendly term referring to people born north of their state. Ironically, the Prussian military understood very well that one-way control is merely an illusion. Carl von Clausewitz wrote a 1,000-page tome, On War, in the early 1800s, in which he cites sources of friction: the external gap between desired and actual outcomes (uncertainty) and the internal gap between plans and the actions of an organization. 
In his book The Art of Action,3 Stephen Bungay extends this concept into three gaps, as illustrated in Figure 27-1: the knowledge gap between what you’d like to know and actually do know, the alignment gap between plans and actions, and the effects gap between what you expect your actions to achieve and what actually happens. Figure 27-1. The three gaps that can make control an illusion (adapted from Bungay) You can recognize organizations that believe they can close these gaps by the methods they apply. They try to close the knowledge gap by generating thick requirements documents, which are often outdated by the time they are completed. They try to tackle the alignment gap by developing extensive project plans and micromanaging according to them. Lastly, the effects gap is a bit more difficult to close. Such organizations try nonetheless, either through the aforementioned watermelon status reports or by using proxy metrics (Chapter 40), things that are easily measured but don’t quite reflect reality. Essentially, these organizations create their own reality, or illusion. Many IT organizations believe they can close the knowledge, alignment, and effects gaps, an approach that was disproven some 200 years ago. Unlike some IT organizations, the Prussians already knew that the gaps couldn’t be eliminated. Instead, they accepted the gaps and adjusted their management style accordingly, replacing concrete orders with the concept of Auftragstaktik, best translated as “mission command” or “directive.” Understanding the purpose of the mission enabled the troops to adjust to unforeseen circumstances (knowledge or effects gaps) without having to report back to central command. This saved valuable time and led to better decisions in the local context. The difference between an order and a mission command becomes clear with a simple example. Suppose that a small platoon is given the order to take a hill.
As it climbs up the hill, the soldiers realize that there’s no resistance at all and progress easily. Should they march to the top? If they have been given a simple command, they’ll do so. This might be correct if the intent, the Auftrag, was to take the hill as a strategic position. However, if the intent was to attack troops positioned on the hill, advancing to the top makes little sense. Auftragstaktik doesn’t mean people are left to do whatever they deem appropriate. It’s based on discipline, but active discipline, one that respects the commander’s intentions, as opposed to passive obedience, which demands blind execution. Also, when deciding on an action, teams pull from a well-defined repertoire of tactics that are well known and extensively trained. So, the Preißn weren’t so stodgy after all and were possibly ahead of many modern IT departments. Actual Control: Autonomy Ironically, it turns out that giving teams decision autonomy actually increases control as it accepts the gaps and avoids operating in an illusion. But be careful: many organizations equate autonomy to “everyone does what they think is best.” Now, unfortunately, that’s not autonomy, that’s anarchy: whether we like it or not, anarchists do what they believe is right. Everyone doing what they think is best isn’t autonomy, it’s anarchy. How can large-scale IT organizations establish autonomy without falling into anarchy? In my experience, it requires the interplay between three elements (see Figure 27-2): Enablement It may sound trivial, but first you need to enable people to perform their work. Sadly, corporate IT knows many mechanisms that disable people: HR processes that restrict recruiting, approval processes that take weeks to provision servers, black markets (Chapter 29) that are inaccessible to new hires. Just like a thermostat connected to a furnace with a plugged gas line won’t do much good, high friction negates autonomy. 
In IT, platforms such as cloud computing can enable employees and at the same time assure consistency in “tactics,” as they provide a common set of tools to select from. Feedback Autonomous teams can make better decisions because they have the shortest feedback cycles (Chapter 36). This way, they can learn fast and improve. This works only if teams see the consequences of their decisions. If the thermostat is mounted in a different room from the radiator, there’s no control circuit. Strategy To make good decisions, teams need to be able to distinguish which decisions are good. They therefore need to have specific goals; for example, revenue generated or quantifiable user engagement. These goals aren’t specific commands but overall objectives to be achieved. A thermostat is useful only if someone sets the desired temperature. Figure 27-2. Strategy, feedback, and enablement separate autonomy from anarchy This system won’t work if you omit one or more elements: strategy without enablement will lead to zero progress but lots of frustration. Autonomy without strategy or feedback will resemble anarchy, as teams can’t judge the appropriateness of their decisions. And enablement without strategy will only make anarchy more efficient. For example, the Spotify Squad Model4 found that increasing alignment on a common strategy supports increasing autonomy. Many enterprise architecture teams (Chapter 4) set direction without being responsible for the consequences. Applying the logic that lack of strategy and feedback leads to anarchy implies a rather disconcerting outcome. Another insight might surprise traditional organizations that are looking to give teams more autonomy: autonomous teams need better management. Managing nonautonomous teams is comparatively easy: they’ll largely do as told. Autonomous teams, in contrast, require leadership: they need to be told the overall intent and goals.
So, ironically, organizations that are looking to increase autonomy in their teams might need to strengthen management first. Autonomous teams need better management. Controlling the Control Loop Even though a control circuit’s job is to keep a system in a steady state without someone having to monitor it, observing the circuit’s behavior can still be useful in a larger context, meaning we shouldn’t blindly trust the autopilot. For example, if the air filter in a forced-air heating system becomes clogged or the furnace collects soot, it will take longer to warm the house under otherwise identical conditions. A “dumb” thermostat will simply run the heater longer, covering up the issue. A smart control system, in contrast, can measure the length of the thermostat’s duty cycle; for instance, how long the furnace runs to reach or maintain a certain room temperature. If this duty cycle extends, the controller can give a hint that the system is no longer operating as efficiently as before. Therefore, a control loop shouldn’t be a black box; instead, it should expose health metrics based on what it has “learned.” Advanced cloud features like server autoscaling, which are able to absorb sudden load spikes without human intervention, are handy, but they can also mask serious problems. For example, if a new version of the software performs poorly, the infrastructure might automatically compensate for the problem by deploying more servers. You might just find out a bit later via your monthly bill. In control theory, observing the behavior of a control loop is considered an “outer loop” that observes the behavior of the inner control loop and can trigger adjustments to the system. 1 Steve Denning, “Ten Agile Axioms That Make Managers Anxious,” Forbes, June 17, 2018, https://oreil.ly/Dn1es. 2 Jeff Sussna, Designing Delivery: Rethinking IT in the Digital Service Economy (Sebastopol, CA: O’Reilly, 2015).
3 Stephen Bungay, The Art of Action: How Leaders Close the Gaps between Plans, Actions and Results (London: Nicholas Brealey Publishing, 2010). 4 Henrik Kniberg, “Spotify Engineering Culture (part 1),” Spotify Labs, March 27, 2014, https://oreil.ly/d3MAI. Chapter 28. They Don’t Build ’Em Quite Like That Anymore No One Lives in a Foundation The Great Pyramid at 30% completion: effort completion, that is The great pyramids are impressive buildings and attract hordes of tourists even several millennia after their construction. The attraction results not only from the engineering marvel, such as the perfect alignment and balance, but also from the fact that pyramids are quite rare. Besides the US one-dollar bill, you’ll find them only in Egypt, Central America, and IT organizations! Why IT Architects Love Pyramids Pyramids are a fairly common sight in IT architecture diagrams and tend to give architects, especially the ones nearer to the penthouse, a noticeable sense of satisfaction. In most cases, the pyramid diagram indicates a layering concept with a base layer that includes functionality commonly needed by the upper layers. For example, the base layer could contain generic functions, while the next layer up would contain industry-specific functions, followed by functionality for specific business functions, and being topped off with customer-specific configuration (Chapter 11). Layering is a very popular and useful concept in systems architecture because it constrains dependencies between system components to flow in a single direction, as opposed to a Big Ball of Mud (Chapter 8). Depicting the layers in the shape of a pyramid suggests that the upper layers are much smaller and more specialized than the base layers, which provide most of the common functionality. IT is enamored with this model because it implies that a large portion of the base layer code can be shared or acquired given that it’s identical across many businesses and applications. 
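In code, the layering idea boils down to dependencies flowing in one direction only: each layer calls downward, never upward. The miniature sketch below illustrates this with three functions standing in for the three layers; the function names and the currency example are assumptions made up for illustration.

```python
# Layering in miniature: each function may call only the layer(s) below it,
# so dependencies flow in a single direction (top of pyramid -> base).

# --- base layer: generic functionality, shareable across businesses ---
def format_currency(amount, symbol="$"):
    return f"{symbol}{amount:,.2f}"

# --- industry layer: builds on the base layer ---
def invoice_total(line_items):
    return format_currency(sum(line_items))

# --- application layer: the specialized "tip" of the pyramid ---
def print_invoice(line_items):
    print(f"Total due: {invoice_total(line_items)}")

print_invoice([1200.0, 99.5])  # prints "Total due: $1,299.50"
```

Note that the base function knows nothing about invoices: that ignorance is precisely what makes it shareable, while upper layers can be replaced without touching what lies beneath them.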
For example, a better Object-Relational Mapping (ORM) framework or a common business component such as a billing system is unlikely to present a competitive advantage and should simply be bought. Meanwhile, necessary and valuable customizations can be performed in the “tip” with relatively little effort or by lesser-skilled labor. The analogy is consistent with the pyramids of Giza, where the top third of the pyramid’s height makes up only roughly 4% of the material. Organizational Pyramids The other place littered with pyramids is slide decks depicting organizational structures, where they refer to hierarchical structures. Almost all organizations are hierarchical: multiple people from a lower tier report to a single person on the next upper tier, resulting in a directed tree graph, which, when the root is placed on the top, resembles a pyramid. Even “flat” organizations tend to have some hierarchy, as a single person generally acts as chairman or CEO. Such a setup makes sense because directing work takes less effort than actually conducting the work, meaning an organization needs fewer managers or supervisors than workers (unless they are trying to buy love; see Chapter 38). Having fewer leaders also helps with consistent decision making and setting a single strategic direction. No Pyramid Without Pharaoh Still, there’s a good reason why the Egyptians abandoned the idea of building pyramids some 4,500 years ago: the base layers of a pyramid require an enormous amount of material. It’s estimated that the Great Pyramid of Giza consists of more than two million blocks weighing in at several tons each. Assuming workers toiled day and night over the course of one decade, they would have had to lay a large limestone block roughly every two to three minutes. Three quarters of the material had to be laid for the first 50 meters of height alone. Even though the result is undoubtedly impressive and long lasting, it can hardly be called efficient.
The economics of building pyramids can function only if there’s an abundance of cheap or forced labor (historians still debate whether the pyramids were built by slaves or paid workers) or a pharaoh’s unbelievable accumulation of wealth. In addition to resources, one also needs to bring a lot of patience. Building pyramids doesn’t mix well with economies of speed (Chapter 35). Some of the pyramids in Egypt weren’t even finished during the pharaoh’s lifetime. No One Lives in a Foundation Functional pyramids as we find them in IT system designs face another challenge: the folks building the base layer not only need to move humongous amounts of material, they also must anticipate the needs of the teams building the upper layers. That’s a lot more difficult to do in IT pyramids where things tend to evolve over time (Chapter 3). Building an IT pyramid purely from the bottom up incurs several problems: First, the lower layers alone don’t provide much value to the business—they are merely a foundation for more things to come. The result is a large investment with a slow return of value, not something a typical business is looking for. It also negates the Agile principle of “use before reuse.” Building a base layer means designing functions to be reused later without first actually using them. This can be a guessing game at best. Lastly, it also dangerously ignores the Build-Measure-Learn cycle (Chapter 36) of learning what’s needed from observing actual usage. What if the business expected a different pyramid? No one likes to live in a foundation. Therefore, delivering only a base layer has limited business value. Not limited to pyramids but applicable to any layered system is the challenge of defining the appropriate seams between the layers. Done well, these seams form abstractions that hide the complexity of the layer below while leaving the layer above with sufficient flexibility (Chapter 11). 
It’s not easy to find examples that work well—like abstracting packet-based network routing behind data streams (sockets)—but when implemented well, this enables major transformations like the internet. Normal IT teams can’t expect to be quite that lucky. Building Pyramids from the Top If you’re determined to build an IT pyramid, the best way to do so is from the top down. This is not something you can do with real pyramids, but software does allow us to defy gravity in this case. When I mention “top down,” I am referring to the way the pyramid is constructed, not the way the project is managed. Ironically, “top-down management” leads to pyramids being built bottom up. To build an IT pyramid from the top, you start with a specific application or service that delivers customer value, thus assuring use before considering reuse and avoiding the dangerous notion of “reusable.” When multiple applications can utilize a specific feature or functionality, you can let the related components “sift down” into a lower layer of the pyramid, making them more broadly available. Building the pyramid this way ensures that the base layer contains functionality that’s actually needed as opposed to functions that some people, often the enterprise architects (Chapter 4) far away from actual software development, believe might be needed somewhere, sometime. “Reusable” can be a dangerous word. It describes components that were designed to be widely used but aren’t. Anticipating some needs ahead of time, such as the much-mentioned ORM framework, is fine. So is building some of the pyramid base layers, like operating systems. In many ways, cloud computing is a massive base layer, but one with very nice seams. Building pyramids from the top can lead to some amount of duplication; for example, if two independent development teams build similar functionality that’s not yet part of the base layer.
Transparency across teams—for example, by using a common source code repository or a common service registry—can help detect such duplication early. While too much duplication may be undesired, we must keep in mind that avoiding duplication isn’t free (Chapter 35). I have seen base service layers that force a consumer to make many remote calls even to execute a simple function. This approach was chosen by the base-layer architects because it ostensibly provides more flexibility. The first client developer coding against this interface described his experience in quite unkind words, citing well-known issues such as sequencing, partial failure, and maintaining state. The base-layer team’s retort was a new dispatcher layer on top of their service layer to “enhance the interaction.” The team was building the pyramid from the bottom up. Building the pyramid from the top down also typically results in much more usable APIs (programming interfaces) into the lower layers. Because in the layered model the consumers of the lower layers live on top, building APIs from the top equates to being customer centric: rather than guessing what your customers (other development teams in this case) might want, you derive it from actual use. Celebrating the Base Layer Building pyramids is popular in IT for another reason: the completion of the pyramid’s base layer provides a proxy metric for actual product success. It allows teams to claim major progress without any meaningful validation of business impact. It’s analogous to developers’ love of building frameworks: you get to devise your own requirements, and upon delivery of those requirements, you declare success without any actual user having validated the need or the implementation. In other words, designing pyramid base layers allows penthouse architects (Chapter 1) to purport the notion that they are connected to the engine room without facing the scrutiny of actual product development teams or, worse yet, real customers.
It’s ironic that the folks highest up in the organizational pyramid love to design the bottom layers of the IT system pyramid, far away from actual users. The reason is clear: building successful applications is more difficult than building generic and unvalidated base layers. Unfortunately, by the time the bluff becomes apparent, the penthouse architects are almost guaranteed to have moved on to another project. Living in Pyramids While building IT pyramids can be debated, organizational pyramids are largely a given: we all report to a boss, who reports to someone else, and so on. In large organizations, we often define our standing by how many people are above us in the corporate hierarchy. The key consideration for an organization is whether it actually lives in the pyramid; in other words, whether the lines of communication and decision making follow the lines in the hierarchy. If that’s the case, the organization will face severe difficulties in times that favor economies of speed (Chapter 35) because pyramid structures can be efficient, but they are neither fast nor flexible: decisions travel up and down the hierarchy, often suffering from a bottleneck in the coordination layer (Chapter 30). Luckily, many organizations don’t actually work in the patterns described by the org chart but follow a concept of feature teams, tribes, or squads. These organizational elements typically have complete ownership of an individual product or service: decisions are pushed down to the level of the people actually most familiar with the problem. This speeds up decision making and provides shorter feedback loops. Some organizations are looking to speed things up by overlaying communities of practice over their structural hierarchy, bringing people with a common interest or area of expertise together.
Communities can be useful change agents, but only if they are empowered and have clear goals (Chapter 27). Otherwise, they run the risk of becoming communities of leisure, a hiding place for people to debate and socialize without measurable results. We should wonder, then, why organizations are so enamored with org charts that they adorn the second slide of almost any corporate project presentation. My hypothesis is that static structures carry a lower semantic load than dynamic structures: when presented with a picture showing two boxes, A and B, connected by a line, the viewer can easily derive the model: A and B have a relationship. One can almost imagine two physical cardboard boxes connected by a piece of string. Dynamic models are more difficult to internalize: if A and B have multiple lines between them that depict interaction over time, possibly including conditions, parallelism, and repetition, it’s much more difficult to imagine the reality the model is trying to depict. Often, only an animation can make it more intuitive. Hence, we are more content with static structures even though understanding a system’s behavior (Chapter 10) is generally much more meaningful than seeing its structure. It Always Can Get Worse Running an organization as a pyramid can be slow and inhibit feedback cycles, which are needed to drive innovation. However, some organizations have a pyramid model that’s even worse: the inverse pyramid. In this model, a majority of people manage and supervise a minority of people doing actual work. Besides the apparent imbalance, the inevitable need of the managers to obtain updates and status reports from the workers is guaranteed to grind progress to a halt. Such pathetic setups can occur in organizations that once depended completely on external providers (Chapter 38) for IT implementation work and are now starting to bring IT talent back in-house.
It can also happen during a crisis, such as a major system outage, which gets so much management attention that the team spends more time preparing status calls than resolving the issue. A second antipattern ironically occurs when organizations attempt to fix the issues inherent in their hierarchical pyramid setup. They supplement the existing top-down reporting organization (often referred to as a line organization) with a new project organization. The combination is typically called a matrix organization (for once, this isn’t a movie reference) as people have a horizontal reporting line into their project and a vertical reporting line into the hierarchy. However, organizations that are not yet flexible and confident enough to give project teams the necessary autonomy (Chapter 27) are prone to creating a second pyramid, the project pyramid. Now employees struggle not only with one but with two pyramids. Building Modern Structures If pyramids aren’t the way to go, how should you build systems, then? I view both systems and organizational design as an iterative, dynamic process that’s driven by the need to deliver business value. When building IT systems, you should add new components only if they provide measurable value. Once you observe a sizable set of common functions, it’s good to push those down into a common base layer. If you don’t find such components, that’s also OK. It simply means that a pyramid model doesn’t fit your situation. Chapter 29. Black Markets Are Not Efficient But They Reveal How Things Actually Get Done I got anythin’ you need, bro A common complaint about large organizations is that they are slow and mired in processes that are designed to exert control (Chapter 27) as opposed to supporting people in getting their work done quickly. For example, I used to be allowed to make technical decisions involving tens of millions of dollars, but I had to obtain management approval to purchase a $200 plane ticket. 
By the time I got the approval, often the fare had increased. Most organizations consider such processes as crucial to keeping the organization running smoothly. “What would happen if everyone did what they wanted?” is the common justification. Most organizations never dare to find out, not because they fear chaos and mayhem, but because they fear that everything will be fine, and the people creating and administering the processes will no longer be needed. Black Markets to the Rescue Ironically, beneath the covers of law and order, such organizations are intrinsically aware that their processes hinder progress. That’s why these organizations tolerate a “black market” where things get done quickly and informally without following the self-imposed rules. Such black markets often take the innocuous form of needing to “know who to talk to” to get something done quickly. You need a server urgently? Instead of following the standard process, you call your buddy who can “pull a few strings.” Setting up an official “priority order” process, usually for a higher price, is fine. Bypassing the process to get special favors for those who are well connected is a black market. If the answer to “how long does it take to get a server?” is “it depends on who’s asking,” then you have a black market. Another type of black market can originate from “high up.” While it’s not uncommon to offer different service levels, including “VIP support,” providing senior executives with support that ignores the very process- or security-related constraints imposed by the executives in the first place is a black market. Such a black market appears, for example, in the form of executives sporting sexy mobile devices that are deemed too insecure for employees, notwithstanding the fact that executives’ devices often contain the most sensitive data. 
Black Markets Are Rarely Efficient What these examples have in common is that they are based on unwritten rules and undocumented, or sometimes secret, relationships. That’s why black markets are rarely efficient, as you can see from countries where black markets constitute a major portion of the economy: black markets are difficult to control and deprive the government of much-needed tax income. They also tend to circumvent balanced allocation of resources: those with access to the black market will be able to obtain goods or favors that others cannot. Black markets therefore stifle economic development because they don’t provide broad and equal access to resources. This is true for countries as much as large enterprises. Black markets stifle innovation because they don’t provide equal access to resources. The digital world democratizes access, which is exactly the opposite. In organizations, black markets often contribute to slow chaos (Chapter 31), in which the outside of the organization appears to be disciplined and structured, but the reality is completely different. They also make it difficult for new members of the organization to gain traction because they lack connections into the black market, presenting one way systems resist change (Chapter 10). Black markets also cause inefficiency by forcing employees to learn the black-market system. Knowing how to work the black market is undocumented organizational knowledge that’s unique to the organization. The time it takes employees to learn the black market doesn’t benefit the organization and presents a real but rarely measured cost. Once acquired, the knowledge doesn’t benefit the employee either, because it has no market value outside of the organization. 
Ironically, this effect can contribute to large organizations tolerating black markets: it aids employee retention because much of their knowledge consists of undocumented processes, special vocabulary, and black-market structures, which ties them to the organization. Worse yet, black markets break necessary feedback cycles: if procuring a server is too slow to compete in the digital world, the organization must resolve the issue and speed up that process. Circumventing it in a black-market fashion gives management a false sense of security, which often goes along with fabricated heroism: “I knew we could get it done in two days.” Amazon can get it done in a few minutes for a hundred thousand customers. The digital transformation is driven by democratization; that is, giving everyone rapid access to resources. That’s exactly the opposite of what a black market does. You Cannot Outsource a Black Market Another very costly limitation of black markets is that they cannot be outsourced. Large organizations tend to outsource commodity processes like human resources or IT operations, exactly those areas that are subject to black-market economies. Specialized outsourcing providers have better economies of scale and lower cost structures, partly because they follow officially established processes. Because services are now performed by a third-party provider, and processes are contractually defined, the unofficial black-market bypass no longer works. Essentially, the business has subjected itself to a work-to-rule slowdown. Organizations that rely on an internal black market, therefore, will experience a huge loss in productivity when they outsource part of their service portfolio. Beating the Black Market How do you avoid running the organization via a black market? More control and governance could be one approach: just like the DEA cracks down on the black market for drugs, you could identify and shut down the black-market traders.
However, one must recall that the IT organization’s black market isn’t engaged in trading illegal substances. Rather, people circumvent processes that don’t allow them to get their work done. Knowing that overambitious control processes caused the black market in the first place makes more control and governance an unlikely solution. Still, some organizations will be tempted to do so, which is a perfect example of doing exactly the opposite of what has the desired effect (Chapter 10). You can’t eliminate black markets with more control and governance. After all, those are the very mechanisms that caused the black market in the first place. The only way to avoid a black market is to build an efficient “white market,” one that doesn’t hinder progress but enables it. An efficient white market reduces people’s desire to maintain an alternate black-market system, which does take some effort after all. Trying to shut down the black market without offering a functioning white market is likely to result in resistance and substantial reduction in productivity. Self-service systems are a great tool to starve black markets because they remove the human connection and friction by giving everyone equal access, thus democratizing the process. If you can order IT infrastructure through a self-explanatory tool that provides fast provisioning times, there’s much less motivation to do it “through the back door.” Automating undocumented processes is cumbersome, though, and often unwelcome because it can highlight the slow chaos (Chapter 31). Feedback and Transparency Black markets generally originate as a response to cumbersome processes, which result from process designers prioritizing reporting and control: inserting a checkpoint or quality gate at every step provides accurate progress tracking and valuable metrics. However, it makes people using the process jump through an endless sequence of hurdles to get anything done. 
That’s the reason I have never seen a single user-friendly HR or expense reporting system. Forcing the people designing processes to use them for their own daily work can highlight the amount of friction the processes create and thus provide a valuable feedback loop (Chapter 27). This means no more VIP support but support that’s good enough for everyone to use. Wouldn’t everyone like to be treated like a VIP? Similarly, HR teams should apply for their own job postings to experience the process firsthand. When recruiting, I routinely apply for my own job openings so I can detect any hurdles in the process. Transparency is a good antidote to black markets. Black markets are inherently nontransparent, providing benefit to only a small subset of people. When users gain full transparency of the official processes, such as ordering a server, they might be less inclined to want to order one from the black market, which does carry some overhead and uncertainty. For example, will a black-market server be supported or perhaps reallocated during the next inventory sweep? Therefore, full transparency should be embedded into an organization’s systems as a main principle. Replacing a black market with an efficient, democratic white market also makes control less of an illusion (Chapter 27): if users employ official, documented, and automated processes, the organization can observe actual behavior and exert governance; for example, by requiring approvals or issuing usage quotas. No such mechanisms exist for black markets. The main hurdle to drying up black markets is that improving processes has a measurable up-front cost while the cost of the black market is usually not measured. This gap leads to the cost of no change (Chapter 33) being perceived as being low, which in turn reduces the incentive to change. Chapter 30. Scaling an Organization How to Scale an Organization? The Same Way You Scale a System!
Horizontal scaling seems more natural The digital world is all about scalability: millions of websites, billions of hits per month, petabytes of data, more tweets, more images uploaded. To make this work, architects have learned a ton about scaling systems: make services stateless and horizontally scalable, minimize synchronization points to maximize throughput, keep transaction scope local, avoid synchronous remote communication, use clever caching strategies, and shorten your variable names (just kidding!). With everything around us scaling to never-before-seen throughput, the limiting element in all of this is bound to be us, the human users, and the organizations we work in. You might wonder, then, whether IT architects, who know so much about scalability, can apply their expertise to scaling and optimizing throughput in organizations. I might have become an architect astronaut1 suffering from oxygen deprivation due to exceedingly high levels of abstraction, but I can’t help but feel that many of the scalability and performance approaches known to experienced IT architects can just as well be applied to scaling organizations. If a coffee shop (Chapter 17) can teach us about maximizing a system’s throughput, maybe our knowledge of IT systems design can help improve an organization’s performance? Component Design—Personal Productivity Increasing throughput starts with the individual. Some folks are simply 10 times more productive than others. For me it’s hit or miss: when I am “in the zone,” I can be incredibly productive but lose traction just as quickly when I am being frequently interrupted or annoyed by something. So, I won’t bestow on you any great personal advice, but instead refer you to the many resources like GTD (Getting Things Done),2 which advises you to minimize your inventory of open tasks (making the Lean folks happy) and to break down large tasks into smaller ones that are immediately actionable. 
For example, “I really ought to replace that old clunker” turns into “visit three dealerships this weekend.” Incoming stuff is categorized and either immediately processed or parked until it’s actionable, thus reducing the number of concurrent threads. The suggestions are very sound, but as always it takes a bit of trust and lots of discipline to succeed at implementing them. Avoid Sync Points—Meetings Don’t Scale Let’s assume people individually do their best to be productive and have high throughput, meaning we have efficient and effective system components. Now we need to look at the integration architecture, which defines the interaction between components; in other words, people. One of the most common interaction points (short of email, more on that later) surely is the meeting. The name alone gives some of us goose bumps because it suggests that people get together to “meet” one another, but doesn’t define any specific agenda, objective, or outcome. Meetings are synchronization points—a well-known throughput killer. From a systems design perspective, meetings have another troublesome property: they require multiple humans to be (mostly) in the same place at the same time. In software architecture, we call this a synchronization point, widely known as one of the biggest throughput killers. The word “synchronous” derives from Greek and essentially means things happening at the same time. In distributed systems for things to happen at the same time, some components must wait for others, which is quite obviously not the way to maximize throughput. The longer the wait for the synchronization point, the more dramatic the negative impact on performance becomes. In some organizations finding a meeting time slot among senior people can take a month or longer. Such resource contention on people’s time significantly slows down decision making and project progress (and hurts economies of speed; see Chapter 35). 
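This kind of contention on people’s time is the human version of lock contention; a toy sketch (all names and timings invented) makes the effect visible. Eight workers that could proceed in parallel must pass through one shared lock, the code equivalent of a mandatory meeting slot, so observed concurrency never exceeds one:

```python
import threading
import time

# A single shared lock acts like a mandatory meeting slot: eight
# workers take turns instead of running in parallel.
meeting_slot = threading.Lock()
counter_guard = threading.Lock()   # protects the two counters below
active = 0
max_active = 0

def worker():
    global active, max_active
    with meeting_slot:             # wait for the shared slot
        with counter_guard:
            active += 1
            max_active = max(max_active, active)
        time.sleep(0.005)          # pretend to do the actual work
        with counter_guard:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(max_active)  # prints 1: the synchronization point serialized everyone
```

Remove the shared lock and `max_active` climbs toward eight; with it, total elapsed time grows linearly with the number of workers, which is exactly what waiting for scarce calendar slots does to an organization.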
The effect is analogous to locking database updates: if many processes are trying to update the same table record, throughput suffers enormously as most processes just wait for others to complete, eventually ending up in the dreaded deadlock. Administrative teams in large organizations acting as transaction monitors underline the overhead caused by using meetings as the primary interaction model. Worse yet, full schedules cause people to start blocking time “just in case,” a form of pessimistic resource allocation, which has exactly the opposite of the intended effect on the system behavior (Chapter 10). Getting together can be useful for brainstorming, critical discussions, or decisions, but the worst kind of meeting must be the status meeting. If someone wants to know where a project stands, why would they want to wait for the next status meeting that takes place in a week or two? To top it off, many status meetings I attended had someone read text off a document that wasn’t distributed ahead of the meeting lest someone read through it and escape the meeting. Interrupts Interrupt—Phone Calls When you can’t wait for the next meeting, you tend to call the person. I know this well, as I log half a dozen incoming calls a day, which I routinely don’t answer (they typically lead to an email starting with the phrase “I was unable to reach you by phone,” whose purpose I never quite understood). Phone calls have short wait times when compared to meetings, but are still synchronous and thus require all resources to be available at the same time. How many times have you played “phone tag,” where you were unable to answer a call just to experience the reverse when you call back? I am not sure there’s an analog to this in system communication (I should know; after all, I am documenting conversation patterns),3 but it’s difficult to imagine this as effective communication.
Phone calls are “interrupts” (they are blockable by muting your ringer), and in an open environment, they not only interrupt you but also your coworkers. That’s one reason that Google Japan’s engineering desks were by default not equipped with phones—you had to specifically request one, which was looked upon as a little old fashioned. The damage ringing phones can do in open office spaces was already illustrated in Tom DeMarco and Tim Lister’s classic Peopleware.4 The “tissue trick” won’t work anymore with digital phones, but luckily virtually all of them have a volume setting. My pet peeve related to phones is people busting into my office while I am talking on the speaker phone, so I’d like to build a mini project to illuminate an “on air” sign while I am on the phone. Piling on Instead of Backing off Retrying an unsuccessful operation is a typical conversation pattern. It’s also a dangerous operation because it can escalate a small disturbance in a system into an onslaught of retries, which brings everything to a grinding halt. That’s why Exponential Backoff5 is a well-known pattern and forms the basis of many low-level networking protocols, such as Carrier-Sense Multiple Access with Collision Detection (CSMA/CD), which is a core element of the Ethernet protocol. Ironically, humans tend not to back off if a phone call fails, but to pile on: if you don’t pick up, they tend to call you at ever shorter intervals to signal that it’s urgent. Ultimately, they will back off, but only after burdening the system with overly aggressive retries. Such behavior contributes to uneven resource utilization: either everyone seems to be calling you or it’s extremely quiet. Asynchronous communication with queues, in contrast, can perform traffic shaping—spikes are absorbed by the queue, allowing the “service” to process requests at the optimal rate without becoming overloaded.
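Exponential backoff itself fits in a few lines; this is a minimal sketch with “full jitter,” and the function and parameter names are made up for illustration. Instead of redialing at ever shorter intervals, a well-behaved caller would wait a randomized, exponentially growing time between attempts:

```python
import random

# Minimal exponential backoff with "full jitter": each retry waits a
# random time up to base * 2**attempt, capped at `cap` seconds.
def backoff_delays(attempts, base=0.1, cap=10.0):
    return [random.uniform(0, min(cap, base * (2 ** n)))
            for n in range(attempts)]

# A polite "caller" would sleep for these durations between retries.
print(backoff_delays(5))
```

The randomization (jitter) matters as much as the exponential growth: it spreads retries out so that many callers who failed at the same moment don’t all retry at the same moment again.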
That’s why I prefer to receive an email starting with “I was unable to reach you by phone”: I converted a synchronous operation into an asynchronous one. Asynchronous Communication—Email, Chat, and More In corporate environments, email tends to draw almost as much ire as meetings. It has one big advantage, though: it’s asynchronous. Instead of being interrupted, you can process your email whenever you have a few minutes to spare. Getting a response might take slightly longer, but it’s a classic “throughput over latency” architecture, best described by Clemens Vasters’s analogy of building wider bridges, not faster cars, to solve the perennial congestion on the two-lane floating bridge that’s part of Washington State Route 520 between Seattle and Redmond. Email also has drawbacks, the main one being people flooding everyone’s inbox because the perceived cost of sending mail is zero. Unfortunately, the cost of reading an email isn’t. You must therefore have a good inbox filter if you want to survive. Also, mail isn’t collectively searchable—each person has their own record of history. I guess you could call that an eventually consistent architecture of sorts and just live with it, but it still seems horribly inefficient. I wonder how many copies of that same 10 MB PowerPoint presentation plus all its prior versions are stored on a typical Exchange server. Integrating chat with email can overcome some of these limitations: if you don’t get a reply or the reply indicates that a real-time discussion is needed, the “reply by chat” button turns the conversation into quasi-synchronous mode: it still allows the receiver to answer at will (so it’s asynchronous) but allows for much quicker iterations than mail. Products like Slack, which favor a chat/channel paradigm, also enable asynchronous communication without email.
Systems architects would liken this approach to tuple spaces, which, based on a blackboard architectural style, are well suited for scalable, distributed systems thanks to loose coupling and avoiding duplication. Asking Doesn’t Scale—Build a Cache! Much of corporate communication consists of asking questions, often via synchronous communication. This doesn’t scale because the same questions are asked again and again. Architects would surely introduce a cache into their system to offload the source component, especially when they receive repeated requests for basic information, such as a photo of a new team member. In such cases, I simply type the person’s name into Google and reply with a hyperlink to an online picture, asking Google instead of another person. Search scales, but only if the answers are available in a searchable medium. Therefore, if you receive a question, reply so that everyone can see (and search) the answer; for example, on an internal forum—that’s how you load the cache. Taking the time to explain something in a short document or forum post scales: 1,000 people can search for and read what you have to share. 1,000 one-on-one meetings to explain the same story would take half of your annual work time. One cache killer that I have experienced is the use of different templates, which aim for efficiency but hurt data reuse. For example, when I answer requests for my resume with a link to my home page or LinkedIn, I observe a human transcribing the data found online into a prescribed Word template. Some things are majorly wrong in the digital universe. Poorly Set Domain Boundaries—Excessive Alignment Even though some communication styles might scale better than others, all will ultimately collapse under heavy traffic because humans can handle only so much throughput, even in chat or asynchronous communication. The goal therefore mustn’t only be to tune communication but also to reduce it. 
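Loading the cache, as described above, is exactly what memoization does in code; here is a toy sketch in which the function name and the question are invented for illustration. The “expert” (the original source of the answer) is consulted only once, no matter how many times the same question arrives:

```python
from functools import lru_cache

calls = 0  # how often the "expert" actually gets interrupted

@lru_cache(maxsize=None)
def ask_expert(question: str) -> str:
    """Stand-in for interrupting a busy colleague; the cache means
    each distinct question reaches the expert only once."""
    global calls
    calls += 1
    return f"answer to: {question}"

for _ in range(1000):
    ask_expert("where is the team photo?")

print(calls)  # prints 1: 999 repeat questions were served from the cache
```

A searchable forum post or wiki page plays the role of `lru_cache` here: answer once in a findable place, and repeated questions never reach the original source again.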
Large corporations suffer from a lot of unnecessary communication, caused, for example, by the need “to align.” I often jest that “aligning” is what I do when my car doesn’t run straight or wears the tires unevenly. Why I need to do it at work all the time puzzled me, especially as “alignment” invariably triggers a meeting with no clear objective. In corp speak, to align means to coordinate on an issue and come to some sort of common understanding or agreement. A common understanding is an integral part of productive teamwork, but the act of “aligning” can start to take on a life of its own. My suspicion is that it’s a sign of misalignment (pun intended) between the project and organizational structures: the people who are critical to a project’s success or are vital decision makers aren’t part of the project, requiring frequent “steering” and “alignment” meetings. The system design analog for this problem is setting domain boundaries poorly, drawing on Eric Evans’s Domain-Driven Design6 concept of a Bounded Context.7 Slicing a distributed system across poorly set domain boundaries is almost guaranteed to increase latency and burden both the system and its developers, who must grapple with increased complexity. Sam Newman would surely agree.8 Self-Service Is Better Service Self-service generally has poor connotations: if the price were the same, would you rather eat at McDonald’s or in a white-tablecloth restaurant with waiter service? If you are a food chain looking to optimize throughput, though, would you rather be McDonald’s or the quaint Italian place with five tables? Self-service scales. Requesting a service or ordering a product by making a phone call or emailing spreadsheet attachments for someone to manually enter data doesn’t scale, even if you lower the labor cost with near- or offshoring. 
To scale, automate everything (Chapter 13): make all functions and processes available online on the intranet, ideally both as web interfaces and as (access protected) service APIs so that users can layer new services or custom user interfaces on top; for example, to combine popular functions. Staying Human Does scaling organizations like computer systems mean that the digital world shuns personal interaction, turning us into faceless email and workflow drones that must maximize throughput? I don’t think so. I very much value personal interaction for brainstorming, negotiation, solution finding, bonding, or just having a good time. That’s what we should maximize face-to-face time for. Having someone read slides aloud or calling me the third time to ask the same question could be achieved many times faster by optimizing communication patterns. Am I being impatient? Possibly, but in a world in which everything moves faster and faster, patience might not be the best strategy. High-throughput systems don’t reward patience. 1 Joel Spolsky, “Don’t Let Architecture Astronauts Scare You,” April 21, 2001, Joel on Software (blog), https://oreil.ly/MafCn. 2 Wikipedia, “Getting Things Done,” https://oreil.ly/PRfdu. 3 Hohpe, “Conversation Patterns,” Enterprise Integration Patterns, https://oreil.ly/qHzFw. 4 Tom DeMarco and Timothy Lister, Peopleware: Productive Projects and Teams, 3rd ed. (Upper Saddle River, NJ: Addison-Wesley, 2013). 5 Wikipedia, “Exponential Backoff,” https://oreil.ly/A4QbL. 6 Eric Evans, “About Domain Language,” Domain Language (website), https://oreil.ly/m71x1. 7 Martin Fowler, “Bounded Context,” MartinFowler.com, https://oreil.ly/AtY88. 8 Sam Newman, Building Microservices: Designing Fine-Grained Systems (O’Reilly, 2015). Chapter 31. Slow Chaos Is Not Order Going Fast? Bring Discipline! Agile or just fast? The next turn will tell.
We all have our pet peeves or hot buttons, things that we’ve come across often enough that, despite their insignificance, really annoy us. In private life, these issues tend to revolve around things like toothpaste tubes: cap off versus cap on, or squeezed from the bottom versus from the middle. Such differences have been known to put numerous marriages and live-in relationships in danger (hint: a second tube runs about $1.99). In the corporate IT world, pet peeves tend to be more technical in nature. Mine is people using the word agile without having understood its meaning, almost two decades after the Agile Manifesto was authored. Surely you have overheard conversations like the following:

“What’s your next major deliverable?” “Dunno—we are Agile!”

“What’s your project plan?” “Because we are Agile, we are so fast that we couldn’t keep the plan up to date!”

“Could I see your documentation?” “Don’t need it—we are Agile!”

“Could you tell me about your architecture?” “Nope—Agile projects don’t need this!”

And when one dares to ask how the teams know that they are Agile, you’re sure to hear the following response: “We are guaranteed to be Agile because we’re officially certified!” Such ignorance is topped only by statements that Agile methods aren’t suited for your company or department because they are too chaotic for such a structured environment. Ironically, the opposite is usually the case: corporate environments often lack the discipline to implement Agile processes.

Fast Versus Agile
My first annoyance about the widespread abuse of the word agile is repeatedly having to remind people that the method is called “Agile,” not “fast,” and for a good reason. Agile methods are about hitting the right target through frequent recalibration and embracing change rather than trying to predict the environment and eliminating uncertainty. Firing from afar at a moving target is fast, but not Agile: you will likely miss. 
Agile methods allow course corrections along the way, more like a guided missile (though I am not fond of the weapons analogy). Agile quickly gets you where you need to be. Running in the wrong direction faster isn’t a method, but foolishness.

Speed and Discipline
When observing something that moves fast, it’s easy to feel a sense of chaos: too many things are happening at the same time for someone to judge how it all really fits together. A good example is a Formula 1 pit stop: screech, whir, whir, roar, and the car has four new tires in under four seconds (refueling is no longer allowed in F1 racing). Watching this process happen at such high speed leaves one feeling slightly dizzy, as if it were some sort of miracle, or in fact a bit chaotic. If you watch the procedure a few times, ideally in slow motion, you can appreciate that few teams are more disciplined than a pit stop crew: every movement is precisely choreographed and practiced hundreds of times. After all, at F1 speed a second longer in the pit means lagging almost 100 meters behind. Moving fast in the IT world likewise requires discipline. Automated tests are your safety belt—how else would you be able to deploy code into production at a moment’s notice, e.g., in case of a serious problem? The most valuable time for an online retailer to deploy code is right in the middle of the holiday season, when customer traffic is at its peak. That’s when a critical fix or a new feature can have the biggest positive impact on the bottom line. Ironically, that’s exactly the time when most corporate IT shops impose a “frozen zone,” during which they forbid the deployment of code changes. Making a code push in peak traffic takes confidence. Having iron discipline and lots of practice can make you more confident and fast. Fear will slow you down. Confidence without discipline will make you crash and burn. 
Fast and Good
Agile development overcomes the perception that things are either fast or of high quality by adding a new dimension (Chapter 40). This admittedly makes it difficult to really grasp the concept without seeing it in action. I often claim that “Agile cannot be taught, it can only be shown,” meaning that you should learn Agile methods by working on an Agile team, and not from a textbook. I describe the attributes required for fast software development and deployment as follows:

Velocity
Development velocity assures that you can make code changes swiftly. If the code base is fraught with technical debt, such as duplication, you will lose speed right there.

Confidence
Once you’ve made a code change, you must have confidence in your code’s correctness, e.g., through code reviews, rigorous automated tests, and small, incremental releases. If you lack confidence, you will hesitate, and you can’t be fast.

Repeatable
Deployment must be repeatable, usually by being 100% automated. All your creativity should go into writing great features for your users, not into making each deployment work. Once you decide to deploy, you must depend on the deployment working exactly as it did the last 100 times.

Elastic
Your runtime must be elastic because once your users like what you built, you must be able to handle the traffic.

Feedback
You need feedback from monitoring to make sure you can spot production issues early and to learn what your users need. If you don’t know in which direction to head, moving faster is no help.

Secure
And last but not least, you need to secure your runtime environment against accidental and malicious attacks, especially when deploying new features frequently, which may contain, or rely on libraries that contain, security exploits.

In unison, these qualities make for a disciplined but fast-moving and agile development process. People who haven’t seen such a process live often cannot quite believe how liberating it is to work with confidence. 
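How these qualities interlock can be sketched in a few lines of code. This is a minimal illustration, not a real deployment tool: all names here (run_tests, deploy, health_check, rollback) are hypothetical placeholders for whatever automation a team actually uses.

```python
# Sketch of a confidence-gated, repeatable release step.
# Every name below is a hypothetical stand-in, not an API from any tool.

def release(build, run_tests, deploy, health_check, rollback):
    """Deploy only with confidence: automated tests gate the release,
    and a failed post-deploy health check triggers an automatic rollback."""
    if not run_tests(build):        # confidence: tests decide, not gut feeling
        return "blocked: tests failed"
    deploy(build)                   # repeatable: same automated path every time
    if not health_check():          # feedback: monitoring catches issues early
        rollback()
        return "rolled back: health check failed"
    return "released"
```

Because every deployment runs through exactly the same automated path, a push during peak holiday traffic is no scarier than one on a quiet Tuesday.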
Even with the 15-year-old build system for my Enterprise Integration Patterns website I don’t hesitate for a second to delete all build artifacts to rebuild and redeploy them from scratch. Slow-Moving Chaos If high speed requires high discipline (or ends up in certain disaster), is it true then that slow speed allows sloppiness? While not logically equivalent, the reality shows that this is usually the case. Once you look under the cover of traditional processes, you realize that there’s a lot of messiness, rework, and uncontrolled black markets (Chapter 29). For example, US auto plants in the 1980s apparently dedicated up to one-quarter of the floor space to rework.1 No wonder Japanese car companies came in and ate their lunch with a disciplined, zero-defect approach, which acknowledged that stopping a production line to debug a problem is more effective than churning out faulty cars. These manufacturing companies were disrupted 30 years ago much in the same way digital companies are disrupting slow and chaotic service businesses now. Hopefully, you can learn something from their mistakes! Alarmingly, you can find the same level of messiness in corporate IT: why would it take two weeks to provision a virtualized server? For one, because most of this time is spent in queues (Chapter 35), and second, because of “thorough testing.” Hold on, why would you need to test a virtual server that should be provisioned in a 100% automated and repeatable fashion? And why would it take two weeks? Usually because the process being followed isn’t actually 100% automated and repeatable: a little duct tape is added here, a little optimization is done over there, a little script is edited, and someone forgot to mount the storage volumes. Oops. That’s one reason to never send a human to do a machine’s job (Chapter 13). Once you look under the veil of “proven processes,” you quickly discover chaos, defined as a state of confusion or disorder. 
It’s just so slow-moving that you have to look a few times to recognize it. A good way to test for chaos is to request precise documentation of the aforementioned proven processes: most of the time it doesn’t exist, is outdated, or is not meant to be shared. Yeah, right…

ITIL to the Rescue?
If you challenge IT operations about slow chaos, you will likely receive a stare of disbelief and a reference to ITIL,2 a proprietary but widely adopted set of practices for IT service management. ITIL provides common vocabulary and structure, which can be of huge value when supplying services or interfacing with service providers. ITIL is also a bit daunting, consisting of five volumes of some 500 pages each. When an IT organization refers to ITIL, I am generally suspicious that there’s a gap between perception and reality; i.e., does the organization really follow ITIL, or is the label used to deflect further investigation into the slow chaos? A few quick tests give valuable hints: I ask a sysadmin which ITIL process they primarily follow. Or I ask an IT manager to show me the strategic analysis of the customer portfolio described in section 4.1.5.4 of the volume on service strategy. Most of the time we find that the ITIL ideal and the ITIL reality differ dramatically. I prominently displayed a set of ITIL manuals in my office to thwart anyone’s temptation to hand-wave their way through a conversation. ITIL itself is a very useful collection of service management practices. However, just as placing a math book under your pillow didn’t get you an “A” grade in school, merely referencing ITIL doesn’t repel slow chaos.

Objectives Require Discipline
Many organizations are managed by objectives and grant teams autonomy in achieving these objectives. While this is in general a sound approach, it can fail spectacularly in organizations that lack discipline, because teams may use any means to achieve the objective, compromising on base values such as quality. 
If reaching objectives by any means is rewarded, result-oriented objectives can actually cause a lack of discipline. A provider’s large datacenter migration project had been set a clear goal of migrating a certain number of applications into a new datacenter location (a quite sensible objective). Alas, the provider had difficulties reliably provisioning servers in the new datacenter, causing many migration issues. To drive out this issue first, I suggested creating an automated test that repeatedly placed orders for servers in a variety of configurations and validated that all servers were delivered to spec. We would then start application migration once reliable provisioning was proven. The project manager exclaimed that if they did that, they’d never migrate a single application in 10 years! The team preferred to migrate applications regardless of their underlying problems, just so that they could achieve the project objective. Setting output-oriented objectives therefore requires an agreed-upon discipline as a baseline for achieving those objectives. This is why the Prussian ideal of Auftragstaktik (Chapter 27) depended on active discipline: increasing an organization’s discipline allows more far-reaching and meaningful objectives to be set.

The Way Out
You may be asking yourself: why does no one clean up the slow chaos? Many traditional but successful organizations simply have too much money (Chapter 38) to really notice or bother with it. They must first realize that the world has changed from pursuing economies of scale to pursuing economies of speed (Chapter 35). Speed is a great forcing function for automation and discipline. For most situations besides dynamic scaling, it’s OK if provisioning a server takes a day. But if it takes more than 10 minutes, you know there’ll be the temptation to perform a piece of it manually. And that’s the dangerous beginning of slow-moving chaos. 
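The provisioning test suggested for the datacenter migration above could look roughly like the following sketch. The order_server function is a hypothetical stand-in for the provider’s ordering API; the book names no such API, so treat this purely as an illustration of the idea.

```python
def validate_provisioning(order_server, configs, rounds=3):
    """Repeatedly order servers in a variety of configurations and check
    that each delivered machine matches its spec. `order_server` is a
    hypothetical stand-in for the provider's ordering API; here it is
    assumed to return the spec of the machine actually delivered."""
    failures = []
    for attempt in range(rounds):
        for spec in configs:
            delivered = order_server(spec)
            if delivered != spec:
                failures.append((attempt, spec, delivered))
    return failures  # an empty list means reliable provisioning is proven
```

Application migration would start only once this test comes back clean, which is exactly the discipline the project team resisted.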
Instead, let software eat the world (Chapter 14) and don’t send humans to do a machine’s job (Chapter 13). You’ll be fast and disciplined. 1 John Roberts, The Modern Firm: Organizational Design for Performance and Growth (Oxford: Oxford University Press, 2007). 2 “ITIL—IT service management,” Axelos (website), https://oreil.ly/PN_Mj. Chapter 32. Governance Through Inception I Am from Headquarters, I Am Here to Help You Corporate governance circa 1984 Corporate IT tends to have its own vocabulary. A top contender for the most frequently used phrase must be to align, which translates vaguely into the activity of holding a meeting with no particular objective beyond mulling over a topic and coming to some sort of agreement short of an official approval. Large IT organizations tend to get slowed down (Chapter 30) by doing this a lot. After alignment, governance likely comes in second. Living in Perfect Harmony Governance generally describes the act of harmonizing and standardizing things across the organization by means of rules, guidelines, and standards. IT harmonization done well increases purchasing power through economies of scale, reduces downtime thanks to less operational complexity, and boosts IT security by eliminating unnecessary diversity. While pursuing harmonization is a rather worthwhile goal, governance can also do harm; for example, by converging on a lowest common denominator, which in the end doesn’t meet the business’s need. Also, many enterprises standardize on an all-encompassing solution that ends up being too expensive for many use cases. Lastly, if the wrong things are standardized, it can stifle creativity. Harmonization can reduce cost and complexity, increase uptime, and strengthen cybersecurity. But, if done in the abstract, it can also stifle innovation, lead to a lowest-common-denominator solution, or propose overengineered and overpriced universal solutions. 
One common cause of suboptimal standards is that those setting the standards don’t have the necessary skill set and the full context of a situation. Worse yet, these teams often lack meaningful feedback on the effect of the standards they set. Things may look orderly from the top—e.g., everyone uses the same type of laptop—but developers lack administrative access and main memory, while frequent travelers must lug around a desktop-equivalent monster laptop. In many large IT organizations, top decision makers don’t use the very tools they standardize. For example, they rarely use the standard workplace or HR tools (Chapter 29), because they’re entitled to special solutions or they have admins who do this work for them. They hence can lack both the situational context and a critical feedback cycle. Exerting governance in an existing organization or one that grew by acquisition involves migrating from the “wrong” system implementation to the “standard.” Such migrations bring cost and risk without an apparent benefit for the local entity, making enforcement difficult. The enemy of governance is the “shadow IT,” which describes local development outside the reaches of central governance. The Value of Standards Standardization has enormous value, as epitomized by what happened during a devastating 1904 fire in Baltimore, Maryland. When much of downtown Baltimore was ablaze, firemen from surrounding cities rushed to help with their fire engines. Sadly, many of these firefighters and much of their equipment ended up standing idle because the fire hose connections of other cities’ departments wouldn’t fit Baltimore’s fire hydrants. The National Fire Protection Association was quick to learn from this disaster and in 1905 established a standard for fire hose connections, still known as the “Baltimore Standard.” Corporate governance typically starts by defining a set of standards that are to be adhered to. 
A standards organization will define and administer these standards for many types of products that are being used. For example, they may decree that software ABC shall be used as the internet browser and vendor product XYZ for databases. But if we look at the real world, the most successful standards have been of a different nature. The standards with the biggest economic impact have been compatibility or interface standards: specifications that allow interchangeability of parts. Fire hoses and hydrants are a great example, as is HTTP. In an IT environment, interface standards translate to standardizing interfaces rather than products; for example, standardizing on HTTP or a specific minimum version of HTML, as opposed to setting Internet Explorer as the browser. The most successful IT standards over the past half-century have been TCP/IP and HTTP—these brought us universal connectivity and the internet. However, neither is a product standard, but both are interface standards. Also, both are open standards. Interface Standards Interface standards bring flexibility and network effects: when many elements can interconnect, the benefit to all participants increases. The internet, originally based on the HTML and HTTP standards for content and connectivity, is the perfect example. Thanks to these standards, any browser could connect to any web server regardless of the implementation technology used. Such effects also highlight again how lines are more interesting than boxes (Chapter 23). Enterprises must therefore articulate their main driver behind setting standards: standardizing vendor products aims to reduce cost and complexity through economies of scale, while compatibility or connectivity standards boost flexibility and innovation. Both are useful, but call for different types of standards. Not all interface standards look like interfaces. For example, when standardizing inside an enterprise, elements, or “boxes,” can act as connecting elements. 
Monitoring and version control systems are great examples: while they are components, their purpose is to connect many applications so that one can gain a unified view across software development or operational status, respectively. That’s why in my view it’s more beneficial to standardize the version control system than standardize the integrated development environment (IDE) that developers use: the former is a connecting element, while the latter is a node. Storing all sources in a single repository allows easy reuse or shared code ownership, something that shared IDEs can’t do. It’s more beneficial to standardize connecting elements, such as monitoring or source control, than endpoints such as laptops or IDEs. Google took this to the extreme by storing (almost) all of its source code in a single version control system. Mapping Standards However, setting standards, even for interfaces, isn’t quite as simple. For example, sizing all fire hose connections the same turns out to not be such a good idea. For a standard to be useful, it needs to be based on a common worldview and vocabulary (Chapter 16) that specifies the standard’s scope. For example, IT standards for databases, application servers, or integration run the risk of being meaningless without a distinction of the types of databases or servers under consideration. The Baltimore fire hydrant standard distinguishes two kinds of standards, one for pumper connections and one for fire hose connections. Pumper connections feed water from the hydrant to a pump truck and have a large diameter. Hose connections feed an individual fire hose and are smaller in diameter. For example, if an organization wants to standardize database products, you’d need to first define whether you standardize relational databases separately from NoSQL databases and, if so, whether you want to distinguish between document and graph databases (Chapter 16). 
Only then should you look at products: before you visit a car dealer you should know whether you want a minivan or a two-seater sports car. Or visit Porsche—they seem to be making everything these days. For storage, you need to distinguish a SAN from a NAS and differentiate backup storage from direct-attached storage (DAS). And you may be looking into HDFS and converged/so-called “hyperconverged” storage (a storage virtualization layer over local disks).

Governance by Decree
Enforcing standards can be a bit like herding cats, even when the economic value is blatantly obvious. For example, almost one hundred years after the Baltimore standard, fighting large fires such as the Oakland Hills Fire of 1991 is still impeded by cities not following the standard.1 Often, the deviation from the standard is a historical artifact or purposely driven by vendors to gain lock-in. In many organizations, a diagnostic “police” will visit different entities to ascertain their standard compliance, which gives rise to the joke about the biggest lie in a corporate environment: “I am from headquarters; I am here to help you.” Cybersecurity can be a useful vehicle to drive standardization: nonstandard or outdated versions of software generally carry a higher risk of vulnerability than well-maintained, harmonized environments. A specific challenge is posed by users who also use a standard, in addition to their own solution. They thus will correctly proclaim “yes, we do drive BMW cars,” in line with a corporate standard that they do so, despite the parking lot being full of Mercedes, Rolls-Royces, and Yugos. In another phenomenon, users employ a standard, but for the wrong purpose. For example, they may use the standard BMW as a meeting room for four people, and don’t actually drive it (they prefer Mercedes for that). Sounds absurd? I have seen many similarly absurd examples in corporate IT! 
Governance Through Infrastructure
Interestingly, in my seven years at Google no one ever mentioned the word governance (or SOA or big data, for that matter). Knowing that Google has both a fantastic service architecture and world-leading big data analytics, you might guess that it also has strong governance. In fact, Google has extremely strong governance in the places where it matters most; for example, runtime infrastructure. Employees were free to write code in Emacs, vi, Notepad, IntelliJ, Eclipse, or any other editor, but there was basically only one way to deploy software to the production infrastructure, on one kind of OS (in the old days, you could choose between 32 or 64 bit), on one kind of hardware. While occasionally painful, this strictness worked because most software developers would put up with pretty much anything to have their software run on a Google-scale infrastructure: it was, and likely still is, a decade ahead of what most other companies were using. The governance didn’t need to take the form of a decree because the system was vastly superior to anything else, making any deviation from it a guaranteed waste of time. If the corporate car is a Ferrari or has a flux capacitor for time travel,2 people won’t be running to the VW dealer. In Google’s case, the flux capacitor was the amazing “Borg” deployment and machine management system, which has been publicly described in a Google research paper.3 For Google the system’s economies of scale worked so well that in the end it became reasonable to have everyone drive a Ferrari while enjoying the fast pace.

Runtime Governance
Netflix exerts governance over application design and architecture by running its infamous Chaos Monkey against deployed software to verify that the software is resilient. Noncompliant software will be pummeled in production by automated compliance testers. Hardly any organization that brags about its corporate governance group would have the guts to do the same. 
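The spirit of such runtime governance can be sketched in a few lines. This is a toy model only: real tools such as Netflix’s Chaos Monkey terminate actual production instances, not entries in a Python list, and the is_healthy function below is a hypothetical health check, not part of any real tool.

```python
import random

def chaos_probe(instances, is_healthy, kills=1, seed=None):
    """Toy resilience probe: terminate random instances, then verify that
    the service still answers. `is_healthy` is a hypothetical health
    check evaluated over the surviving instances."""
    rng = random.Random(seed)
    survivors = list(instances)
    for _ in range(min(kills, len(survivors))):
        survivors.remove(rng.choice(survivors))  # "pummel" a random instance
    return is_healthy(survivors)
```

Software that fails this probe is noncompliant by demonstration rather than by decree, which is precisely what makes this form of governance so effective.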
Inception
In large IT organizations the motivation is generally a little less pronounced and the infrastructure a little less advanced. If you’ve been to the movies in recent years, you must have come across Inception, an ingenious Christopher Nolan flick depicting corporate criminals who steal trade secrets from victims’ subconscious minds. The title derives from the plot, in which the corporate team usually operates in “read only” mode to extract secrets from the victim’s memory, but for its big coup must actively implant an idea into a victim’s mind to cause him to take a particular action—a process referred to as “inception.” In the movie, the tricky part is to make the victim truly believe it was his idea. If we could perform inception, corporate governance would be much easier: IT units would independently come to the conclusion to use the same software. This isn’t quite as absurd as it sounds because there’s one magic ingredient in today’s IT world that makes it possible: change. With change comes the need to update systems (still have Lotus Notes somewhere?) and the opportunity to set new standards without any additional migration costs. You “simply” have to agree on which incarnation of the new piece of technology you want to employ, for example for a software-defined network, a big data cluster, or an on-premises platform-as-a-service. That you have to do by inception. Inception in corporate IT works only if the governing body is ahead of the rest of the world, so it can set the direction before the widespread need arises. Acting as an educator, it supplies new ideas to its audience and can inject, or incept, ideas, such as demand for a specific product or standard. In a sense, that’s what marketing has been doing for centuries: creating demand for the product that manufacturing happened to have built. In times of change, the “new” will ultimately replace the “old” and through constant inception the landscape becomes more standardized. 
The key requirement is that “central” needs to innovate faster than the business units so that when a division requests a big data analytics cluster, corporate IT already has clear guidance and a reference implementation. Doing so requires foresight and funding, but beats chasing business units for noncompliance and facing migration costs. The Emperor’s New Clothes Traditional IT governance can also cause an awkward scenario best described as the “emperor’s new clothes”: a central team develops a product that exists primarily in slide decks, so-called vaporware. When such a product is decreed as a standard, which is essentially meaningless, customers may happily adopt it because it’s an easy way to earn a “brownie point,” or even funding, for standard compliance without the need for much actual implementation. In the end everyone appears happy, except the shareholders: it’s a giant and senseless waste of energy. Governance Through Necessity In an interesting book about refugee camps in the Western Sahara,4 I learned that almost everyone in these camps who owns a car has the same older car models, either a Land Rover all-terrain vehicle or an early 1990s Mercedes sedan. Together, these models make up more than 90% of all local cars, with 85% of sedans being Mercedes—a corporate governor’s dream! Why? Residents chose an inexpensive and very reliable car that could withstand the rough terrain and heat. The standardization came through simple necessity, however: buying another model of car would mean not being able to take advantage of the existing skill set and the pool of available spare parts. In an environment of economic constraints, these are major considerations. Corporate IT has the same forces, especially regarding IT skill set availability for new technologies. 
The observed diversity in corporate environments is therefore a rich company problem (Chapter 38): the scarcity of skills or resources just isn’t strong enough to drive joint decision making—such shortages can easily be solved with more money. You could also argue that the refugee camps had the advantage of a so-called greenfield installation, even though that term seems awfully inappropriate for people being displaced in the desert. 1 Momar Seck and David Evans, “Major U.S. Cities Using National Standard Fire Hydrants, One Century After the Great Baltimore Fire,” NISTIR 7158, National Institute of Standards and Technology. 2 A reference to the ’80s movie Back to the Future. 3 A. Verma et al., “Large-Scale Cluster Management at Google with Borg.” 4 Manuel Herz, From Camp to City: Refugee Camps of the Western Sahara (Lars Müller, 2012).

Part V. Transformation
When setting up modern technology in a large IT organization, you’ll invariably find that there’s an impedance mismatch. Using elastic billing from a cloud provider won’t work well if you have to make an annual budget forecast anyway. And being able to provision infrastructure with an API call becomes a lot less exciting if there’s a two-month approval process attached to it. Therefore, the final leg of your architect journey is being able to change the way organizations work.

Change Is Risky
Bringing change into large organizations is rewarding but challenging, requiring you to utilize everything you’ve learned so far: you must first use your architectural thinking to understand how complex organizations work and which “levers” you may have. Superb communication skills help you garner support, while leadership skills are needed to effect a lasting change. Last, your IT architect skills are needed to plan and implement the technical changes necessary for the organization to work in a different way. 
As an architect you are best qualified to understand how technical and organizational changes depend on each other so that you can cut the Gordian knot of interdependencies. Citing The Matrix one more time (after all, Neo is quite a change agent in a tough environment!), the exchange between the Architect and the Oracle provides apt context:

The Architect: You played a very dangerous game.
The Oracle: Change always is.

Interestingly, in The Matrix, the Architect is the main entity trying to prevent change. You should identify with Neo instead, making sure to have an Oracle to back you up.

Not All Change Is Transformation
Not every kind of change deserves to be called “transformation.” You change the layout of the furniture in your living room, but you transform (or maybe convert) your house into a club, retail store, or place of worship. The word trans-form has its origin in Latin with a literal meaning of “to change shape or structure.” When we speak of IT transformation, we therefore imply not an incremental evolution, but a fundamental restructuring of the technology landscape, the organizational setup, and the culture. Basically, expect to have to turn the house upside down, cut it into pieces, and put it back together in a new shape.

Bursting the Boiler
A prevalent risk in corporate transformation agendas is upper management recognizing the need for change and subsequently applying pressure to the organization to become faster, more agile, more customer centric, etc. However, the organization, and especially middle management, is often not ready to transform and attempts to achieve the targets set by upper management within the old way of working. This can put enormous strain on the organization and is unlikely to meet the ambitions. I compare this to a steam engine that is surpassed by a fast electric train. In an attempt to speed up, the steam-engine operator throws more coal onto the fire to increase the boiler pressure. 
While it may initially speed up the steam engine, soon the boiler will burst. You can’t compete with an electric train by putting more pressure on the boiler. Instead, you need to devise a new engine that can keep up. That’s what architects do.

Why Me?
As an architect, you might think: “Why me? Isn’t this where the high-paid consultants come in?” They can certainly help, but you can’t just inject change from the outside with a slide deck. Lasting change must come from the inside through role models, rapid feedback cycles, celebrated achievements, and much more. To effect lasting change in an organization you’ll need to understand the following:

Chapter 33, No Pain, No Change!
Organizations are unlikely to change if there’s no pain.

Chapter 34, Leading Change
You must show a better way of doing things.

Chapter 35, Economies of Speed
Organizations need to think in economies of speed instead of economies of scale.

Chapter 36, The Infinite Loop
Running in circles is an essential part of digital organizations.

Chapter 37, You Can’t Fake IT
You must be digital on the inside to be digital on the outside.

Chapter 38, Money Can’t Buy Love
There is no SKU for transformation.1

Chapter 39, Who Likes Standing in Line?
You can speed up organizations by waiting less instead of working more.

Chapter 40, Thinking in Four Dimensions
To transform, organizations need to think in new dimensions.

1 SKU = Stock Keeping Unit, used for order and inventory management.

Chapter 33. No Pain, No Change!
And Watching Late-Night TV Does Not Help… Go, go, gooooo! A colleague of mine once attended a “digital showcase” event in his company, which highlighted many innovative projects and external hackathons the company had organized. Upon returning to his desk, though, he found himself in the same old corporate IT world where he is forced to clock time, cannot get a server in less than three weeks, and isn’t allowed to install software on his laptop. 
He was wondering whether he was caught in some twisted incarnation of two-speed IT, but that made little sense; after all, his project was part of the fast-moving “digital” speed.

Stages of Transformation

I had a different answer: transformation is a difficult and time-consuming process that doesn’t happen overnight. People just don’t wake up one day and behave completely differently, no matter how many TED Talks they listened to the day before. (A talk I once attended illustrated how difficult it is to change which part of the body you dry first with your towel after taking a shower in the morning. I guess the speaker was right—I never changed that.) To illustrate the stages a person or an organization tends to go through when transforming their habits, I drew up the example of someone changing from eating junk food to leading a healthy lifestyle. With no scientific evidence, I quickly came up with 10 stages:

1. You eat junk food. Because it’s tasty.
2. You realize eating junk food is bad for you. But you keep eating it. Because it is tasty.
3. You start watching late-night TV weight-loss programs. While eating junk food. Because it is so tasty.
4. You order a miracle exercise machine from the late-night TV program. Because it looked so easy.
5. You use the machine a few times. You realize that it’s hard work. Worse yet, no visible results were achieved during the two weeks you used it. Out of frustration you eat more junk food.
6. You force yourself to exercise even though it’s hard work and the results are meager. You’re still eating some junk food, though.
7. You force yourself to eat healthier, but find it not tasty.
8. You actually start liking vegetables and other healthy food.
9. You become addicted to exercise. Your motivation changed from losing weight to doing what you truly like.
10. Friends ask you for advice on how you did it. You have become a source of inspiration to others.
Change happens incrementally, and it will take a lot of time plus dedication.

Digital Transformation Stages

Drawing the analogy between my colleague’s situation and my freshly created framework, I concluded that they must be somewhere between stages 3 and 4 on their transformation journey. What he attended was the digital equivalent of watching late-night miracle solutions. Maybe the company even invested in or acquired one of the nifty startups, which are young, hip, and use DevOps. But upon returning to his desk, he found that the organization was still eating lots of junk food. I suggest that the transformation scale from 1 to 10 isn’t linear: the critical steps occur from stage 1 to 2 (awareness, not to be underestimated!), from 5 to 6 (overcoming disillusionment), and from 7 to 8 (wanting instead of forcing yourself). I would therefore give his company a lot of credit for starting the journey, but warn them that disillusionment is likely to lie ahead.

Wishful Thinking Sells Snake Oil

It can be amazing how gullible smart individuals and organizations become when they are presented with miracle claims for a better life. As soon as people or organizations have entered stage 3, whole industries that are built on selling “snake oil” eagerly await them, overweight individuals and slow-paced corporate IT departments alike: late-night weight-loss commercials and shiny demos showing business people building cloud solutions in no time. As Russell Ackoff once pointedly put it in “A Lifetime of Systems Thinking”:1

Managers are incurably susceptible to panacea peddlers. They are rooted in the belief that there are simple, if not simple-minded, solutions to even the most complex of problems.

When you are looking for a quick change, it’s difficult to resist, especially if you don’t have your own world map (Chapter 16).
Digital natives have it easy because, as the name suggests, they were born on the upper levels of the digital transformation scale and never had to make it through this painful change process. Others feel the pain and tend to search for an easy way out. The problem is that this approach will never get you beyond stage 5, where real change hasn’t happened yet.

Tuning the Engine

Not everyone who buys snake oil is a complete fool, though. Many organizations adopt worthwhile practices but don’t understand that these practices don’t work outside of a specific context. For example, sending a few hundred managers to become certified Scrum Masters doesn’t make you agile. You need to change the way people think and work and establish new values. Holding a standup meeting every day that resembles a status call where people report 73% progress also doesn’t transform your organization. It’s not that standup meetings are a bad idea, rather the opposite, but they are about much more than standing up.2 Real transformation has to go far beyond scratching the surface and change the system. Systems theory (Chapter 10) teaches us that to change the observed behavior of a system, you must change the system itself. Everything else is wishful thinking. It’s like wanting to improve a car’s emissions by blocking the exhaust pipe. If you want a cleaner-running car, there’s no way around going all the way back to the engine and tuning it, or transforming the car into an electric one. When you want to change the behavior of a company, you need to go to its engine: the people and the way they are organized. This is the burdensome, but only truly effective, way.

Help Along the Way

Some enterprise IT vendors do resemble the folks selling overpriced workout machines on late-night TV: their products work, but not quite as advertised, and they are in fact overpriced. A good walk in the park every day likely produces the same results for free.
You just need to be smart enough to know that and disciplined enough to stick to it. Many enterprise IT vendors provide genuine innovation to their customers, but at a price. Enterprise vendors range from “old school” to “selling an imitation of the new world to old enterprises” to “truly new world.” The further left on this scale your organization is, the more you will pay. My goal, therefore, is to build sufficient internal skill to use products as far to the right on that spectrum as possible. As I once stated in a slightly exaggerated way: “Corporate IT tends to pay for its stupidity. If you are stupid, you’d better be rich!” An organization that doesn’t yet have the required skill pays “tuition,” a concept well known in German as Lehrgeld. If spending the money helps them do better next time, it’s a good investment. As always, I make sure to document such decisions (Chapter 8).

The consultants and enterprise vendors that surround traditional enterprises (Chapter 38) have a limited incentive to fully transform their clients into digital companies: digital companies tend to shun consultants and largely employ open source technology, often developed by themselves. Because externals are set to profit from the transformation path itself, they are useful in helping an enterprise start the transformation, as this brings the willingness to invest money. However, they aren’t quite as keen to catapult their customers into a state where their advice or products are no longer needed. This love-hate relationship is likely to affect the role an architect plays in the transformation effort: you can’t achieve it without external help, but you have to be aware that it’s co-opetition rather than true collaboration.

The Pain of Not Changing

The biggest risk during the transformation journey is suffering a relapse after having bought “snake oil,” only to realize that it doesn’t achieve the promised results, or at least not as quickly as anticipated.
This risk is particularly high at stages 4 or 5 of my model. The inevitable pain of changing makes the lure of the easy path, that is, not changing or giving up halfway, a clear and present danger. The long-term effects of not changing are easily put aside because that pain isn’t happening yet. Plus, you have already accepted the current state, even if it clearly isn’t optimal. The certainty of knowing the current state proves to be a major force against change, which carries a large amount of uncertainty: who knows whether all the projected benefits will actually materialize? It could be getting worse for all we know. This is one of the many ways we are biased and thus poor decision makers (Chapter 6). IT organizations, especially operations teams, tend to equate change with risk (Chapter 26). The insight that change was needed often comes much later, when the cost of not having changed becomes blatantly and painfully apparent. Sadly, at that time the list of available options tends to be much shorter, or empty. This is true for individuals (“I wish I had started a healthier life when I was young”) as well as organizations (“We wish we had cleaned up our IT before we were disrupted”). When people reflect on their lives, they are much more likely to regret the things they didn’t do than the things they did. The logical conclusion is simple: do more things and keep doing those that work well.

Getting Over the Hump

A linear chain of events has one tricky property: the probability of making it through all steps computes as the product of the individual transition probabilities between each step and the next. Let’s say you are a quite determined person and have a 70% chance of making it from one step to the next, even though the machine you ordered from late-night TV didn’t work quite as advertised. If you compound this probability across the 9 steps needed to go from stage 1 to stage 10, you arrive at a 4% probability, or 1 in 25, of making it to the goal.
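The compounding is easy to check numerically. A minimal sketch, using the 70% figure from this chapter (and a fifty-fifty chance for comparison):

```python
# Probability of completing a linear chain of steps when each
# transition succeeds independently with probability p.
def completion_probability(p: float, steps: int = 9) -> float:
    return p ** steps

# A determined person: 70% chance per transition across 9 steps.
print(round(completion_probability(0.7), 3))   # ~0.04, i.e., about 1 in 25

# A fifty-fifty chance per transition.
print(round(completion_probability(0.5), 5))   # ~0.00195, i.e., 1 in 512
```

Because the probabilities multiply, even a modest improvement at a single weak step raises the overall odds noticeably.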
If you assume a fifty-fifty chance at each step, which might be more realistic (just look on eBay for barely used exercise machines), you end up with (1/2)⁹ ≈ 0.2%, or 1 in 512 (!). “Against All Odds” comes to mind, even though it’s probably not Phil Collins’s best song. The biggest enemy of change is complacency: if things aren’t so bad, the motivation to change is low. Organizations can artificially increase the pain of not changing, e.g., by creating fear or conjuring a (fake) crisis before the real crisis occurs. Such a strategy can work but is risky. It cannot be applied many times, as people will start ignoring the repeated “fire drill.” Still, conjuring a crisis beats undergoing a real crisis. Many organizations only really start to change when they have a “near-death” experience. The problem is that near-death often results in actual death.

1 Russell Ackoff, “A Lifetime of Systems Thinking,” The Systems Thinker (website), https://oreil.ly/DP_Ea.
2 Jason Yip, “It’s Not Just Standing Up: Patterns for Daily Standup Meetings,” MartinFowler.com, Feb. 21, 2016, https://oreil.ly/Le5-n.

Chapter 34. Leading Change
The Island of Sanity in the Sea of Desperation

Don’t get voted off the island!

Demonstrating positive results from a different way of doing things in a small team can help overcome complacency and the fear of uncertainty, and thus is a good way to start a transformation. We shouldn’t forget, though, that the “trailblazers” on such teams have a doubly tough job: they need to overcome the pain of change and do so in an environment that’s still at stage 1 of the transformation journey. This is comparable to eating healthy when everyone around you at the table is having tasty cake and the restaurant has nothing healthy on the menu at all. To succeed, you need firm belief and perseverance.
The corporate IT equivalent of trying to eat healthy at the cake party is trying to be Agile when it takes four weeks to get a new server, or when contemporary development tools and hardware aren’t allowed because they violate corporate security standards. You’ve got to be willing to swim upstream to effect change.

A Tractor Passing the Race Car

One particular danger of leading change with a different approach is that the existing, slow approaches are often more suitable for the current environment. This is a form of systems resisting change (Chapter 10) and can result in your fancy new software/hardware/development approach being pummeled by the old, existing ways. I compare this to building a full-fledged race car, only to find out that in your corporate environment each car has to pull three tons of baggage in the form of rules and regulations. And instead of a nice, paved racetrack, you find yourself in a foot-deep sea of process mud. You will find that the old corporate tractor slowly but steadily passes your shiny new Formula 1 car, which is busily throwing up mud while shredding its rear tires. In such a scenario, it becomes difficult to argue that you have devised a better way of doing things. It’s therefore critical to change processes and culture along with introducing new technology. A race car in a tractor-pulling contest will be laughable at best. You need to build a proper road before it makes sense to commission a race car. You also need to employ your communication skills (Part III) to secure management support when setbacks happen.

Setting Course

To motivate people for change, you can either dangle the digital carrot, painting pictures of a happy, digital life on the far horizon, or wield the digital stick, warning of impending doom through disruption. In the end, you’ll likely need a little bit of both, but the carrot is generally the more noble approach.
For the carrot to work, you need to paint a tangible picture of the alternate future and set visible, measurable targets based on the company strategy. For example, if the corporate strategy is based on increasing speed to reduce time to market, a tangible and visible goal would be to cut the release cycle for your division’s software products or services in half (or more) every year. If the goal is resilience, you set a goal of halving average recovery times (Chapter 12) for outages. Some goals can even be enforced through automation. Digital companies may enforce a goal to improve resilience by deploying a chaos monkey (Chapter 32) that randomly disables components. Setting goals can be a tricky affair, as the organization might meet the goals without completing the intended change. For example, setting a reduction in the number of outages as a goal surfaces several issues. First, it incentivizes hiding outages; second, it’ll make teams invest in more up-front testing, slowing down the organization. Lastly, it’s not necessarily the number of outages that negatively affects the business, but the total observed downtime.

Venturing Off the Mainland

You cannot expect everyone to instantly join you on your journey, though, simply because you’re telling stories about the magic land awaiting them in the far distance. You will surely find some explorers or adventurers-at-heart who are willing to get on the boat based solely on your vision or charisma. Some may not even believe your promises, but find sailing to unknown shores more appealing than just sitting around. These folks are your early adopters and can become powerful multipliers for your mission. Find them, connect them in a community, and take them along. Others will wait to see whether your ship actually floats. Be kind to them and pick them up for the journey once they are ready. These folks may actually be more committed, as they overcame an initial hurdle or fear.
Yet others will want to see you return with your ship loaded with gold. That’s also fine—some have to see to believe. So you need to be patient and recruit for your transformation journey in waves.

Burning the Ships

Even after folks have joined you on the transformation journey, the chance of a relapse is high: on your journey you will encounter storms, pirates, sharks, sandbanks, icebergs, and other adverse conditions. Captains of a digital transformation have to be skilled sailors, but also strong leaders. A tough approach is to “burn the ships,” derived from the story that upon arriving on a new shore, the captain would burn the ships so no one could propose to retreat and go back home. I am not sure whether this approach really increases the odds of success. You want a team that’s committed and believes in success, as opposed to one that has doubts but no ship on which to return.

Offshore Platforms

Some companies’ change programs sail far off the mainland to overcome the constraints imposed by the old world. Copying what they observe in successful so-called “digital” companies, teams move into colorful buildings with open seating plans and baristas, use Apple laptops full of open source stickers, and wear shorts or hoodies. Such units, resembling offshore drilling platforms far from the mainland—run under fancy labels like “innovation center,” “digital hub,” or “digital factory”—can be a lot of fun, but suffer from several major issues:

1. These new islands often don’t have a meaningful bridge back to the mainland, meaning they largely operate in isolation. They therefore don’t act as a transformation vehicle for the main island. My cynical advice for such a setup is: “If you want to show that smart people in an ideal environment can create valuable things, you could have just bought Facebook stock.”
2. Such islands often don’t have economic pressure because they are well-funded by the mothership.
They thus end up being “digital trust fund” playgrounds that don’t deliver concrete business value. Those setups can be handy for press releases and corporate tours, but not for working in rapid-value delivery cycles.
3. And last, copying digital leaders’ working environments isn’t going to make you “digital.” This fallacy, known as the cargo cult,1 ignores the mechanisms behind the visual facade. A barista stand doesn’t magically accelerate your release cycles: you can’t copy-paste culture.

So, just building a new island in a different ocean isn’t going to help with an organization’s transformation. You need to strike a balance between sufficiently reducing constraints and still being relevant to the mainland. How do you find the right balance? The best approach I have found is to keep iterating (Chapter 36).

The Island of Sanity

Still, the temptation to create a better working environment for at least a subset of the organization can be strong. I followed this approach, which I refer to as building an “island of sanity in the sea of desperation,” myself in the year 2000. Back then, just before the internet bubble burst, our somewhat traditional consulting company vied for talent with internet startups like WebVan and Pets.com (a plastic bag and a sock puppet decorate my private internet bubble archive). I therefore helped create an environment that would be attractive to such candidates and was successful in recruiting a stellar team of top-notch technologists. Sooner or later, though, your island will become too small for the people on it, causing them to feel constrained in their career options. If the island has drifted far from the mainland because the mainland hasn’t changed much at all, reintegration will be very difficult, increasing the risk that people leave the company altogether. That’s what happened to most of my team in 2001.
Second, people will wonder why they have to live on a small and remote island when other companies offer the same desirable (corporate) lifestyle on their extensive mainland. Wouldn’t that seem much easier? Or, as a friend once rather pointedly challenged me: “Why don’t you just quit and let them die?” While transformation is hard work, you’ve also got to know when you’re trying too hard.

Skunkworks That Works

People working in a separate location can, however, create significant innovations and transform the mothership. Perhaps the best-known example is the IBM PC, which was developed in Boca Raton, Florida, far away from IBM’s New York headquarters. The development bypassed many corporate rules, for example, by mostly using parts from outside manufacturers, by building an open system, and by selling through retail stores. It’s hard to imagine where IBM (and the rest of the computer industry) would be if they hadn’t created the PC. IBM was certainly not a company used to moving quickly, with insiders claiming that it “would take at least nine months to ship an empty box.” But the prototype for the IBM PC was assembled in one month, and the computer was launched only one year later, which required not only development but also manufacturing to be set up. Several factors likely contributed to the team’s success:

This skunkworks was tasked with launching a real, sustainable product on the market. It wasn’t a playground.

The team streamlined many processes but didn’t circumvent all corporate guidance. For example, its products passed the standard IBM quality assurance tests and thus gained acceptance on the mainland. The team didn’t deliver a toy but a successful commercial product.

Lastly, teams back on the mainland probably didn’t see this project as a threat. They were simply convinced that it was impossible for IBM to make a computer for less than $15,000 and were happy to be proven wrong.
These factors led to the IBM PC becoming a positive example of an ambitious project team questioning existing assumptions while being led by existing management. A more recent example showing that large-scale transformation can work is Microsoft under CEO Satya Nadella, who opted not to sail off the mainland but rather led the transformation by “rediscovering the soul of Microsoft.”2

Leaving Your Island Will Get Your Feet Wet

You also need to be aware that most systems (Chapter 10) operate at a local optimum. While that local optimum might be extremely far removed from the much more agile and fast way digital organizations work, it’s usually still better than the “surrounding” operating modes that you end up with when you make a small change to the system. For example, an organization may only be able to push code into production every six months, which is a practical joke in the digital world. However, it has managed to establish processes that make this cadence workable. If you simply change the release cycle to three months, you will make people’s lives worse and may hurt product quality and even the company’s reputation. Hence, you should first introduce automated build and deployment tools to form the basis for faster releases. Sadly, doing so also makes the operations staff’s lives worse because they are already very busy with production support, and now they must also attend training and learn new tools. They will also make mistakes while doing so. In your view, the organization might live on a tiny molehill while you know of a high mountain of gold somewhere else. However, between the molehill and the mountain lies a muddy swamp. Because you won’t be able to leap straight to the mountain of gold, you first have to get folks off the molehill, convincing them to keep moving after their feet get wet and muddy. That’s why you must communicate a clear vision and prepare them for tougher times ahead before the new optimum can be reached.
The Country of the Blind

One shouldn’t underestimate the resistance to change and innovation in large and successful enterprises that have “done things this way” for a long time. H. G. Wells’s short story “The Country of the Blind” comes to mind: an explorer falls down a steep slope and discovers a village in a valley that is completely separated from the rest of the world. Unbeknownst to the explorer, a genetic disease has rendered all of the villagers unable to see. Upon realizing this peculiarity, the explorer feels that because “the one-eyed man is king” in this town, he can teach and rule the villagers. However, his ability to see proves to be of little advantage in a place designed for blind people, without windows or lights. After struggling to take advantage of his gift, the explorer is to have his eyes removed by the village doctor to cure his strange obsessions. Oddly, two versions of this story exist, each with a different ending: in the original version, the explorer escapes the village after struggling back up the slope. The revised story has him observe that a rockslide is about to destroy the village; he’s the only one able to escape, along with his blind girlfriend. In either case, it’s not a happy ending for the villagers. Be careful not to fall into the “in the land of the blind, the one-eyed man is king” trap. Complex organizational systems settle into specific patterns over time and actively resist change. If you want to change their behavior, you have to change the system.

1 Wikipedia, “Cargo Cult,” https://oreil.ly/GpesJ.
2 Satya Nadella, Hit Refresh: The Quest to Rediscover Microsoft’s Soul and Imagine a Better Future for Everyone (New York: HarperBusiness, 2017).

Chapter 35.
Economies of Speed
Death by Efficiency Is Slow and Painful

Economies of scale versus economies of speed

Large companies looking to speed up are used to optimizing the way they work: they can make production a few percent more efficient, negotiate a slightly higher discount from vendors, and reduce budget by printing in black and white. Sadly, though, their digital competitors don’t move 10% faster, but 10 times faster, leaving traditional IT departments somewhat puzzled as to how this is even possible.

30,000 Times Faster

A quick example showing that a 10-times speed-up can still be a quite conservative figure comes from setting up a version control system. A large IT organization looking to define a standard for source control invested six months of community work to conclude that the company should be using Git (Chapter 25). However, it was considered too difficult to migrate other projects off Subversion, so both products were recommended. The preparation cycle for the global architecture steering board meeting took another month, bringing the total elapsed time to seven months, or roughly 210 days.

Some tasks that would take traditional organizations months of preparation and approvals, digital companies can accomplish in a few minutes.

A modern IT organization or startup would have spent a few minutes deciding on the product and have accounts set up, a private repository created, and the first commit made in about 10 minutes. The speed-up factor between the two examples comes to 210 days * (24 hours/day) * (60 minutes/hour) / 10 minutes ≈ 30,000! If that number alone doesn’t scare you, keep in mind that one organization published a paper (without selecting or implementing a product such as Bitbucket, GitHub, or GitLab) and is merrily dragging along its legacy. Its “decision” is thus about as meaningful as prescribing that men should wear black shoes, but that brown is also allowed for historical reasons.
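The back-of-the-envelope ratio is worth reproducing, if only to see how quickly the orders of magnitude pile up:

```python
# Elapsed time of the traditional decision process, in minutes:
# 210 days of committees, papers, and steering-board preparation.
traditional_minutes = 210 * 24 * 60

# Elapsed time for a modern setup: decide, create a repo, first commit.
modern_minutes = 10

speedup = traditional_minutes / modern_minutes
print(round(speedup))  # 30240, i.e., roughly 30,000 times faster
```

Note that the traditional figure counts calendar time, not effort; that is exactly the point, because the customer waits in calendar time.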
Meanwhile, the other organization is already committing code in a live repository. Admittedly, large organizations have more parties to align, existing source repositories, and many other factors that make it difficult to set up a shared service in 10 minutes. However, if you augment the timeline to include vendor selection, license negotiation, internal alignment, paperwork, and setting up the running service, the ratio could well end up in the hundreds of thousands. Should these organizations be scared? Yes!

Old Economies of Scale

How can modern organizations act orders of magnitude faster than traditional ones? Traditional organizations pursue economies of scale, meaning they are looking to benefit from their size. Size can indeed be an advantage, as can be seen in cities: density and scale provide short transportation and communication paths, diverse labor supply, better education, and more cultural offerings. Cities grow because the socioeconomic factors scale in a superlinear fashion (a city of double the size offers more than double the socioeconomic benefits), while increases in infrastructure costs are sublinear (you don’t need twice as many roads for a city twice the size). But density and size also bring pollution, risk of epidemics, and congestion problems, which ultimately limit the size of cities. Still, cities grow larger and live longer than corporate organizations. One reason lies in the fact that organizations suffer more severely from the overhead introduced by processes and control structures that are required, or perceived to be required, to keep a large organization in check.
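The superlinear/sublinear contrast can be sketched numerically. The exponents 1.15 and 0.85 below are the approximate values reported in urban-scaling research, not figures from this chapter, so treat them as illustrative assumptions:

```python
# Urban scaling sketch: benefits grow superlinearly with population,
# infrastructure only sublinearly (power-law scaling, y = x**b).
def scaled_growth(population_factor: float, exponent: float) -> float:
    return population_factor ** exponent

# Doubling a city's population...
benefit_growth = scaled_growth(2.0, 1.15)  # socioeconomic benefits
infra_growth = scaled_growth(2.0, 0.85)    # roads, cables, pipes

print(f"benefits x{benefit_growth:.2f}, infrastructure x{infra_growth:.2f}")
# benefits grow by more than 2x, infrastructure by less than 2x
```

The gap between the two curves is what makes growing attractive for cities; organizational overhead, by contrast, tends to grow faster than the organization itself.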
Geoffrey West, past president of the Santa Fe Institute, summarized this dynamic in his fascinating video conversation “Why Cities Keep Growing, Corporations and People Always Die, and Life Gets Faster.”1

In corporations, economies of scale are generally driven by the desire for efficiency: resources such as machines and people must be used as efficiently as possible, avoiding downtime due to idling and retooling. This efficiency is often pursued by using large batch sizes: making 10,000 of the same widget in one production run costs less than making 10 different batches of 1,000 each. The bigger you are, the larger the batches you can make, and the more efficient you become. This view is overly simplistic, though, as it ignores, for example, the cost of storing intermediate products. Worse yet, it doesn’t consider the revenue lost by not being able to serve an urgent customer order because you are in the midst of a large production run: such an organization values resource efficiency over customer efficiency. The manufacturing business realized this about half a century ago, resulting in most things being manufactured in small batches or in one continuous batch of highly customized products. Think about today’s cars: the number of options you can order is mind-boggling, causing the traditional “batch” thinking to completely fall apart. Cars are essentially batches of one. With all the thinking about “Lean” and “Just-in-Time” manufacturing, it’s especially astonishing that the IT industry is often still chasing efficiency instead of speed. A software vendor once stated that, “Obviously, the license cost per unit goes down if you buy more licenses.” To me, this isn’t obvious at all, as there’s no distribution cost per unit of software, aside from that very salesperson sitting across the table from me.
Whether 10,000 customers download one license each or one customer buys 10,000 licenses should make no difference, as long as the software vendor doesn’t send humans to do a machine’s job (Chapter 13). Cloud computing finally broke the old model. It looks like enterprise software sales and enterprise procurement both have some transformations ahead of them. In their defense, though, you have to admit that their behavior is driven by enterprise customers still stuck in the old thought pattern: supersize it to get a better deal!

In the digital world, the limiting factor for an organization’s size becomes its ability to change. While in static environments being big is an advantage thanks to economies of scale, in times of rapid change economies of speed win out and allow startups and digital-native companies to disrupt much larger companies. Or, as Jack Welch famously stated: “If the rate of change on the outside exceeds the rate of change on the inside, the end is near.”

Behold the Flow!

The quest for efficiency focuses on the individual production steps, looking to optimize their utilization. What’s completely missing is awareness of the production flow, i.e., the flow of a piece of work through a series of production steps. Translated into organizations, individual task optimization results in every department requiring lengthy forms to be filled out before work can begin: I have been told that some organizations require firewall changes to be requested 10 days in advance. And all too often, the customer is subsequently told that some thing or another is missing from the request form and is sent back to the beginning of the line. After all, helping the customer fill out the form would be less efficient. If that reminds you of government agencies, you might get the hint that such processes aren’t designed for maximum speed and agility.
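The gap between busy workstations and waiting work can be made concrete with the standard lean metric of flow efficiency: the share of total elapsed time during which a work item actually receives value-adding work. The numbers below are invented for illustration:

```python
# Flow efficiency = value-adding time / total elapsed (lead) time.
# Illustrative assumption: a firewall change needing 2 hours of actual
# work, but requested 10 business days (80 working hours) in advance.
work_hours = 2
lead_time_hours = 80

flow_efficiency = work_hours / lead_time_hours
print(f"{flow_efficiency:.1%}")  # 2.5%: the item waits 97.5% of the time
```

A department can report near-100% utilization while the work itself sits idle almost all of the time, which is exactly the trap described above.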
Besides the inevitable frustration with such setups, they trade off flow efficiency for processing efficiency: the workstations are nicely efficient, but the customers (or products or widgets) chase from station to station, fill out a form, pick a number, and wait. And wait (Chapter 39). And wait some more, just to find out they are in the wrong line or their request cannot be processed. This is dead time that isn’t measured anywhere except in the customers’ blood pressure. Come to think of it, in most of these places the people going through the flow aren’t customers in the true sense, given that they don’t choose to visit this process but are forced to. That’s why you are bound to experience such setups at government offices, where you could at least argue that the misguided efficiency is driven by the desire to preserve taxpayer money. You’ll also commonly find it in IT departments that exert strong governance (Chapter 32). Cost of Delay For innovation and product development processes, this type of efficiency is pure poison. While digital companies do care about resource utilization (at Google, datacenter utilization was a CEO-level topic), their real driver is speed: time to market. Traditional organizations often misunderstand or underestimate the value of speed. In a joint business-IT workshop, a product owner once described his product as carrying a substantial revenue opportunity. At the same time, he asked for a specific feature that required significant development effort but would realize value only when the product rolled out in another country. Deferring that feature would speed up the initial launch, thus capturing the revenue opportunity sooner. Flow-based thinking calls this concept the cost of delay (see the excellent book The Principles of Product Development Flow2), which must be added to the cost of development. Launching a promising product later means that you lose the opportunity to gain revenue during the time of delay. 
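The cost-of-delay comparison from the workshop story can be sketched with a simple linear model. All figures here are made-up assumptions for illustration, not numbers from the actual workshop:

```python
# Simplest possible cost-of-delay model: revenue forgone while the
# launch waits for a feature. All figures are illustrative assumptions.

def cost_of_delay(monthly_revenue: float, months_delayed: float) -> float:
    """Revenue forgone by launching later (linear model)."""
    return monthly_revenue * months_delayed

monthly_revenue = 500_000   # assumed revenue once the product is live
months_delayed = 4          # assumed extra delay if the feature ships at launch
feature_dev_cost = 800_000  # assumed cost of the country-specific feature

delay_cost = cost_of_delay(monthly_revenue, months_delayed)
print(f"cost of delay:    {delay_cost:,.0f}")       # 2,000,000
print(f"feature dev cost: {feature_dev_cost:,.0f}")
```

Under these assumptions, waiting for the feature costs more than twice what building it costs: deferring the feature and launching now is the economically sound choice, even before counting what you learn from the early launch.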
For products with large revenue upside, the cost of delay can be higher than the cost of development, but it’s often ignored. On top of avoiding the cost of delay, deferring a feature and launching sooner also allows you to learn from the initial launch and adjust your requirements accordingly. The initial launch may be an utter failure, causing the product to never be launched in the second country. By deferring this feature you avoided wasting time building something that would have never been used. Gathering more information allows you to make a better decision (Chapter 6). A great example of a non-high-tech company that embraced economies of speed is the fashion brand Zara, part of the Inditex fashion empire. When the pursuit of efficiency drove most fashion retailers to outsource production to low-cost suppliers in Asia, Zara implemented a vertically integrated model and manufactured three-quarters of its clothing in Europe, which allowed it to bring new designs into stores in a matter of weeks as opposed to the industry average of three to six months. In the fast-moving fashion retail industry, speed is such a significant advantage that this strategy propelled Inditex’s founder to be one of the 10 richest people on the planet. However, the world of fashion is also one of constant change and even “fast fashion” retailers face stiff competition from online retailers such as boohoo, which works in small batch sizes coupled with extremely short product cycles. The Value and Cost of Predictability Why do intelligent people ignore basic economic arguments such as calculating the cost of delay? Because they are working in a system that favors predictability over speed. Adding a feature later or, worse yet, deciding later whether to add it or not may require going through lengthy budget approval processes. Those processes exist because the people who control the budget value predictability over agility. 
Predictability makes their lives easier because they plan the budget for the next 12 to 24 months, and sometimes for good reasons: they don’t want to disappoint shareholders with runaway costs that unexpectedly reduce the company profit. As these teams manage cost, not opportunity, they don’t benefit from an early product launch. Optimizing for predictability ignores the cost of delay. Chasing predictability causes another well-known phenomenon: sandbagging. Project and budget plans sandbag by overestimating timelines or cost in order to more easily achieve their target. Keep in mind that estimates aren’t single numbers but probability distributions: a project may have a 50% chance of being done in four weeks’ time. If “you are lucky and all goes well,” it may be done in three weeks, but with only a 20% likelihood. Sandbaggers pick a number far off on the other end of the probability spectrum and would estimate eight weeks for the project, giving them a greater than 95% chance of meeting the target. Even worse, if the project happens to be done in four weeks, the sandbaggers idle for another four weeks before release to avoid having their time or budget estimates cut the next time. If a deliverable depends on a series of activities, sandbagging compounds and can extend the time to delivery enormously. The Value and Cost of Avoiding Duplication On the list of inefficiencies, duplication of work must be high up: what could be more inefficient than doing the same thing twice? That’s sound reasoning, but you must also consider that avoiding duplication doesn’t come for free: you need to actively de-duplicate, i.e., detect duplicates and merge them. The primary cost involved in de-duplication is coordination: to avoid duplication you first need to detect it. In a large codebase this can be done efficiently through code search. 
In a large organization, it can require many “alignment” meetings—synchronization points—high up in the hierarchy, which we know don’t scale (Chapter 30), in computer systems and organizations alike. A story on duplication is attributed to Jeff Bezos, CEO of Amazon: when a manager pointed out that efforts might be duplicated, he walked to the board and wrote “2 > 0.” Evolving a widely reused resource also requires coordination because changes must be compatible with all existing systems or users. Such coordination can slow down innovation. On the flip side, modern development tools, such as automated testing, can reduce the traditional dangers of duplication. Some digital companies have even begun to explicitly favor duplication because their business environment rewards economies of speed. How to Make the Switch? Changing from efficiency-based thinking to speed-based thinking can be difficult for organizations: after all, it’s less efficient! In most people’s minds, being less efficient translates into wasting money. On top of that, people being idle is more visible than the damage done by missed market opportunities. Usually, this change in attitude happens only when IT is seen as driving business opportunity instead of being a cost center. As long as corporate IT remains stuck in a cycle of cutting cost and increasing efficiency, economies-of-scale thinking will prevail, giving the digital giants an ever-bigger lead over traditional companies that dream of becoming digital but cannot shed their old habits. 1 Geoffrey West, “Why Cities Keep Growing, Corporations and People Always Die, and Life Gets Faster,” Edge, May 23, 2011, https://oreil.ly/UAh5C. 2 Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development (Redondo Beach, CA: Celeritas Publishing, 2009). Chapter 36. The Infinite Loop Sometimes Running in Circles Can Be Productive The corporate innovation circuit. 
Best lap time: unknown In programming, an infinite loop is rarely a good thing (unless you are Apple, Inc., and your address is 1 Infinite Loop in Cupertino, California). But even Apple HQ appears to be moving off the infinite loop, which is a noteworthy feat in and of itself. In poorly run organizations (not Apple!), employees often make cynical remarks about how they run in circles, and when the desired results aren’t achieved, management tells them to run faster. You surely don’t want to be part of that infinite loop! Build-Measure-Learn There’s one loop, though, that’s a key element of most digital companies: the continuous learning loop. Because digital companies know well that control is an illusion (Chapter 27), they are addicted to rapid feedback. Eric Ries immortalized this concept in his book The Lean Startup1 as the Build-Measure-Learn cycle: a company builds a minimum viable product and launches it into production to measure user adoption and behavior. Based on the insights from live product usage, the company learns and refines the product. Jeff Sussna aptly describes the “learning” part of the cycle as “operate to learn”—the goal of operations isn’t to maintain the status quo but to deliver critical insights into making a better product. Digital RPMs The critical KPI for most digital companies is how much they can learn per dollar or unit of time spent, i.e., how many revolutions through the Build-Measure-Learn cycle they can make. The digital world has thus changed the nature of the game completely, and it would be foolish at best (fatal at worst) to ignore this change. Taking book authoring as an example: publishing Enterprise Integration Patterns took a year of writing, followed by some six months of editing and three months of production. While we had a feeling that the book might be a success, it wasn’t until another year later that we could measure the success in actual copies sold. 
So, making one-half revolution from Build to Measure took almost three years! Completing the cycle, i.e., publishing a second edition, would have taken another 6 to 12 months. In comparison, I wrote the original version of this book as an ebook that was published while it was still a work in progress. The book sold several hundred copies before it was even done, and I received reader feedback by email and Twitter almost in real time as I was writing. The same is true for many other industries: digital technology has made customer feedback immediate. This is a huge opportunity, but also a huge challenge, as customers have learned to expect rapid changes based on their feedback. If I don’t post an update to my book for two or three weeks, people may worry that I have given up on writing. Luckily, I find instant feedback (comments as well as purchases) hugely motivating, so I have been far more productive in writing this book than ever before. Adopting learning as an organization’s key metric is good news for another reason: while many tasks are being taken over by machines, learning how to build a product that excites users remains firmly in the hands of humans. Old-World Hurdles Unfortunately, traditional companies aren’t built for rapid feedback cycles. They often still separate run from change (Chapter 12) and assume a project is done by the time it reaches production. Launching a product is about the 120-degree mark in the innovation wheel of fortune, so making one-third of a single revolution counts for nothing if your competition is on its one-hundredth refinement. What keeps traditional organizations from completing rapid learning cycles? Their structure as a layered hierarchy: in a fairly static, slow-moving world, organizing into layers has distinct advantages; it allows a small group of people to steer a large organization without having to be involved in all the details. 
Information that travels up is aggregated and translated for easy consumption by upper management. Such a setup works very well in large organizations but has one fundamental disadvantage: it’s horribly slow to react to changes in the environment or to insights at the working level. It takes too much time for information to travel all the way up to where a decision can be made, because each “layer” in the organization adds communication overhead and requires a translation. Even if architects can ride the elevator (Chapter 1), it still takes time for decisions to trickle back down through a web of budgeting and steering processes. Once again, we aren’t talking about a difference of 10% but of factors in the hundreds or thousands: traditional organizations often run feedback cycles to the tune of 18 months, while digital companies can do it in days or weeks. Layered organizations benefit from separation of concerns, but in economies of speed this separation becomes a liability. In times when nearly every organization wants to become more “digital” and the technical platforms are readily available as open source or cloud services, building a fast-learning organization is a critical success factor. Looping in Externals With every revolution, the organization not only learns what features are most useful for the users, but the project team also learns how to build enticing user experiences, how to speed up development cycles, and how to scale the system to meet increasing demand. This learning cycle is critical for the organization’s digital transformation because it enables in-house innovation and rapid iteration. Conversely, if corporate IT depends heavily on the work of external providers, which is rather common, the ones benefiting from this learning are the external consultants. 
Organizations should therefore place their internal staff inside the learning cycle and use external support mainly to coach or teach them. Taking this logic a step further, digital transformation begins with transforming HR and recruiting practices to hire qualified staff and to educate existing employees so that they can become part of the learning cycle. Pivoting the Layer Cake To speed up the feedback engine you need to turn the organizational layer cake on its side by forming teams that carry full responsibility from product concept to technical implementation, operations, and refinement. Often such an approach carries the label of “tribes,” “feature teams,” or “DevOps,” which is associated with a “you build it, you run it” attitude. Doing so not only provides a direct feedback loop to the developers about the quality of their product (pagers going off in the middle of the night are a very immediate form of feedback), but it also scales the organization (Chapter 30) by removing unnecessary synchronization points: all relevant decisions can be made within the project team. Running in independent teams that focus on rapid feedback has one other fundamental advantage: it brings the customer back into the picture. In the traditional pyramid of layered command-and-control, the customer is nowhere to be found—at best somewhere interacting with the lowest layer of the organization, far from where decisions are made and strategies are set. In contrast, “vertical” teams draw feedback and their energy directly from the customer. The main challenge in assembling such teams is to get a complete range of skill sets into a compact team, ideally not exceeding the size of a “two-pizza team”; that is, one that can be fed by two large pizzas. This requires qualified staff, a willingness to collaborate across skill sets, and a low-friction environment. The Spotify team concepts2 of chapters and guilds are likely the most useful resource in this context. 
Maintaining Cohesion If all control rests in the vertically integrated teams, what ensures that these teams are still part of one company and, for example, use common branding and common infrastructure? It’s OK to have some pie crust on the vertical layer cake: for example, a layer at the top for branding and overall strategy and one at the bottom for common infrastructure that never sends a human to do a machine’s job (Chapter 13). Once you have perfected the rapid Build-Measure-Learn feedback cycle, you may wonder how many revolutions you will need to make. In digital companies, the feedback engine stops spinning only when the product is dead. That’s why, for once, it’s good to be part of an infinite loop. 1 Eric Ries, The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses (New York: Crown Business, 2011). 2 Henrik Kniberg, “Spotify Engineering Culture (Part 1).” Chapter 37. You Can’t Fake IT To Be Digital on the Outside, You First Need to Be Digital on the Inside Who can spot the dinosaur programmer? Rapid feedback cycles (Chapter 36) help digital companies understand customer demand and improve the product or service offered. Naturally, this feedback loop works best when the product or service has direct exposure to the end customer or consumer. Corporate IT, in contrast, is relatively far removed from the end customer because it supplies IT services to the business, which in turn is in contact with the customer. Does this imply that corporate IT shouldn’t be the focal point for digital transformation, as it’s too far removed from digital customers? Many digital transformation initiatives that are driven “from the top” appear to support this notion: they have special teams engage with customers in focus groups before handing the specs down to IT for implementation. 
Laying the Foundation But just as you cannot build a fancy new house on an old, fragile foundation, you cannot be digital on the outside without transforming the IT engine room: IT must deliver the capabilities the business needs to become agile and to compete in the digital marketplace. If it takes eight weeks to procure a virtual server based on an email request, the business cannot scale up with demand, unless it stockpiles a huge number of idling servers, which would be the exact opposite of what cloud computing promises. Worse yet, if these servers are set up with an old version of the OS, modern applications may not run on them. On top of all this, necessary manual network changes are guaranteed to break things or slow them down. Feedback Cycles Rapidly deploying servers can be achieved with private cloud technologies, but that alone doesn’t make IT “digital.” For corporate IT to credibly offer services to businesses competing in a digital world, it must itself be ready to compete in the digital world of IT service providers, not only from a cost and quality perspective but also from an engagement-model point of view: corporate IT must become customer centric and learn from customers using its products in an infinite loop (Chapter 36). If the servers being provisioned aren’t the ones the customer needs, provisioning them faster accomplishes nothing. Moreover, the customer may not want to order servers at all, preferring instead to deploy applications on a so-called “serverless” architecture. To understand these trends, IT must engage with its internal customers—the business units—in a rapid feedback loop, just as the business units do with their end customers. Delivering on Your Promises Engaging with customers is helpful only if you can deliver on their demands. In the case of IT delivering services to its customers, the business units, it must have the capability and the attitude to deliver digital services rapidly and at high quality. 
An MIT study1 showed that those companies that aligned business and IT without first improving their IT delivery capability actually spent more money on IT but suffered from below-average revenue growth. You can’t fake being digital. Customer Centricity Customer centricity is a common phrase incorporated into many a company’s motto or “value statement.” What company wouldn’t want to be customer centric after all? Even institutions whose customers are decreed by law, such as the Internal Revenue Service, have exhibited a good dose of customer awareness in recent years. For many organizations, though, it’s difficult to move beyond the simple slogan and truly become customer centric because it requires fundamental changes to the organizational culture and setup: hierarchical organizations are CEO centered, not customer centered. Operational teams following ITIL processes are process centered, not customer centered. IT run as a cost center is likely cost centered as opposed to customer centered. Running a customer-centric business on top of a process- or CEO-centric IT is bound to generate enormous friction. Cocreating IT Services To support a business in digital transformation, it’s no longer enough for IT to develop and push commodity services to their customers, the business units, via governance (Chapter 32). IT must start to behave like a digital business, generating “pull” demand instead of pushing product. This can be done well by developing products jointly with customers, which goes under the fancy moniker of “cocreation.” While many internal customers will welcome the change in mindset and the opportunity to influence a service being built, others may not want to engage unless you present a firm price and service-level agreement. Being digital works only if your customers are digital. Eat Your Own Dog Food Some IT departments are relatively far from the end customer, so they wonder how they can get feedback cycles started. 
They tend to ignore a large, readily available pool of customers that’s very close by: their own employees. Employees are friendly and motivated customers that are usually eager to try out new stuff. Ironically, the common name for this clever practice is dogfooding, assuming people will eat their own dog food. I’d side with an old friend here who determined that it’s unfair that his dog eats dog food while he’s having a tasty dinner. So he decided to share his meal with his dog instead—the vet confirmed the dog is perfectly healthy doing this. Google is famous for dogfooding its products, meaning its employees get to try alpha or beta versions of new products. While the name doesn’t make it sound too appealing, Google’s “food” includes pretty exciting products, some of which never reach the eyes of the consumer. Dogfooding is effective because it enables an extremely rapid feedback and learning cycle in a safe and controlled environment. I start all my IT services by offering them first as an internal beta release. Once we better understand customer expectations and work out the kinks, we offer them to external customers. Google took things a step further and merged employee and customer accounts into a single user-management system, making customers and employees appear identical to most applications, differentiated only by their domain name (google.com) and their access from the corporate network. Merging the previously disparate systems was rather painful, but the effect was hugely liberating as employees were treated as customers. In contrast, traditional organizations can look at employees and customers very differently, as illustrated by this example: At a large financial services company, employees weren’t supposed to use Android phones. Without even debating the technical merit, I couldn’t help but wonder how this company can then support customers using Android devices, which make up some 80% of the market. 
If Android isn’t considered secure enough for the company’s financial services employees, how can it be considered secure enough for its customers? Rather than trying to control the user base, it’d be more helpful to understand and address potential weaknesses, for example through two-factor authentication, mobile device management, fraud monitoring, or disallowing old versions of the OS, both for customers and employees. Digital Mindset Besides starting to use their own products and learning to iterate, one of the biggest hurdles in making corporate IT more digital can be the employees’ mindset. When employees use previous-generation BlackBerry phones and internal processes are handled by emailing spreadsheets based on rules documented in a slide deck, it’s difficult to believe that an organization can act digitally. While it’s a touchy subject, the age distribution in traditional IT can be an additional challenge: the average age in corporate IT is often in the 40s or early 50s, far removed from the digital natives being courted as the new digital customer segment. Bringing younger employees into the mix can help companies become digital as it brings some of your target customer segment in-house. The good news is that change can happen gradually, starting with small steps. When employees start using LinkedIn to pull photos or resumes instead of emailing resume templates, it’s a step toward becoming digital. Checking Google Maps to find convenient hotels instead of the clunky travel portal is another. Building small internal applications to automate approval processes is a small but very important step: it gets people into a “maker mindset” that motivates them to tackle problems by building solutions, not by referring to outdated rule books. The digital feedback cycle can work only if people can build solutions. This may be the biggest hurdle for corporate IT departments, because they are too afraid of code (Chapter 11). 
Code is what software innovation is made of, so if you want to be digital, you’d better learn to code! Opportunities for making small steps toward becoming digital are plentiful. I tend to look for little problems to solve or small things to speed up and automate. At Google, getting a USB charger cable was a matter of 2.5 minutes: 1 minute to walk to the nearest Tech Stop, 30 seconds to swipe your badge and scan the cable at the self-checkout, and 1 minute to walk back to your desk. To do this in corporate IT, I had to mail someone, who mailed someone, who asked me the type of phone I use and then entered an order, which I had to approve. Elapsed time: about 2 weeks. Speed factor: 14 days × 24 hours/day × 60 minutes/hour / 2.5 minutes = 8,064, in the same league as setting up a source code repository (Chapter 35). Fixing this would make a great mini-project. You don’t see a positive business case? That’s probably because your company isn’t yet set up to develop solutions rapidly. A digital company could likely build this solution in an afternoon, including database and web user interface, and host it in its private cloud basically for free. If you never start building small, rapid solutions, your IT will be paralyzed and likely unable to act in a digital environment. The Stack Fallacy As much of corporate IT is focused on infrastructure and operations, becoming software minded (Chapter 14) requires a huge shift. For example, my idea to build an on-air sign (Chapter 30) that illuminates when my IP desk phone is off the hook never materialized because the team rolling out the devices didn’t code or deal with software APIs. 
The challenge an organization faces when “moving up the stack,” e.g., from infrastructure to application software platform or from software platform to end-user application, is well known and has aptly been labeled the stack fallacy.2 Even successful companies underestimate the challenge and fall victim to the fallacy: VMware missed the shift from virtualization software to Docker containers for many years, Cisco has been spending billions on acquisitions to get closer to application delivery, and even mighty Google failed to move from utility software like search and mail to an engaging social network, a market dominated by Facebook. For most of corporate IT, this means an uphill climb from a focus on operating infrastructure to engaging users with rapidly evolving applications and services. Though challenging, it is doable: internal IT doesn’t need to compete in the open market, giving it the chance to change in small increments. 1 David Shpilberg et al., “Avoiding the Alignment Trap in IT,” MIT Sloan Management Review, October 1, 2007, https://oreil.ly/nK9ph. 2 Anshu Sharma, “Why Big Companies Keep Failing: The Stack Fallacy,” TechCrunch, January 18, 2016, https://oreil.ly/OYCi-. Chapter 38. Money Can’t Buy Love Or a Culture Change I need that feature by Tuesday After transitioning from a Silicon Valley company to a traditional business, my new coworkers frequently reminded me that “we’re a large corporation,” implying that what works for Google wouldn’t apply here. My routine retort was that, by the standard measure of market capitalization, I had joined a corporation 10 times smaller. More interestingly, my coworkers also pointed out that Google can do pretty much whatever it wants thanks to all the money it has. My view, in contrast, was that many successful traditional businesses suffer from exactly this problem of having too much money. Innovator’s Dilemma How can organizations have too much money? 
After all, their goal is to maximize profits and shareholder returns. To do so, companies use stringent budgeting processes that control spending. For example, proposed projects are assessed by their expected internal rate of return (IRR) against a benchmark typically set by existing investments. Such processes can hurt innovation, though, when new ideas must compete with existing, highly profitable “cash cows.” Most innovative products can’t match established products’ performance or profitability in their early stages. Traditional budgeting processes may therefore reject new and promising ideas, a phenomenon that Clayton Christensen dubbed the Innovator’s Dilemma.1 However, when these new innovations later surpass the sustaining technologies, they threaten organizations that didn’t invest early on and now lag behind. Rich companies tend to have a high IRR benchmark and are therefore especially likely to reject new ideas. Also, they perceive the risk of not changing as low—after all, things are going great. This dampens their appetite for change (Chapter 33) and increases the danger of disruption. Beware of the HiPPO Despite its downsides, companies making investment decisions based on expected return at least use a consistent decision metric. Many rich companies have a different decision process: the highest paid person’s opinion, or HiPPO. This approach isn’t just highly subjective but also susceptible to shiny, HiPPO-targeted vendor demos, which peddle incremental “enterprise” solutions as opposed to real innovation. Because those decision makers are far removed from actual technology and software delivery, they don’t realize how fast new solutions can be built on a shoestring budget. To make matters worse, internal “salespeople” exploit management’s limited understanding to push their own pet projects, often at a cost orders of magnitude higher than what digital companies would spend. 
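The budgeting trap behind the Innovator’s Dilemma can be illustrated with a toy net-present-value calculation. All cash flows and rates below are made-up assumptions for illustration:

```python
# Toy NPV model: a high hurdle rate, calibrated on the existing cash
# cow, rejects a slow-ramping innovation that would pass at a more
# modest benchmark. All cash flows are illustrative assumptions.

def npv(rate: float, cashflows: list[float]) -> float:
    """Net present value of yearly cash flows, starting at year 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

cash_cow   = [-100, 60, 60, 60]              # fast, predictable payback
innovation = [-100, 0, 5, 15, 40, 90, 150]   # slow ramp, bigger upside

hurdle = 0.25  # benchmark set by the profitable existing business
print(npv(hurdle, cash_cow) > 0)      # True: approved
print(npv(hurdle, innovation) > 0)    # False: rejected by the rich company
print(npv(0.10, innovation) > 0)      # True: attractive at a modest hurdle
```

The richer the existing business, the higher the hurdle, and the more systematically the slow-ramping idea with the bigger upside gets filtered out.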
I have seen someone make it to board level with the idea of exposing functionality as an API, at a cost of many millions of euros. It’s easy to sell a wheel to people in the Stone Age. Overhead and Tolerated Inefficiency Many established companies with a profitable business model carry significant overhead: fancy corporate offices, old labor contracts with overly generous retirement provisions, overemployment in roles that are no longer needed, an army of administrative staff for executives, company cars, drivers, car washes, private dining rooms, coffee and cake being served in boardrooms—the list is long. This overhead cost is generally distributed across all cost centers, placing an enormous financial burden on small and innovative teams working on disruptive technologies. My small team of architects was loaded with enormous overhead costs ranging from office space and cafeteria subsidies to workplace charges (computers, phones), which I couldn’t influence. In comparison, the free meals offered by digital companies are a trivial expense. Overhead costs also result from inefficiencies that are tolerated in wealthy organizations because there’s little pressure to remove them. Examples are manifold: labor-intensive manual processes (I have seen people manually preparing spreadsheets from SAP data every month), lengthy meetings with 20 executives, half of whom have little to contribute, ordering processes with long paper trails, people printing reams of paper as handouts for meetings on digital strategy. All these line items add up and make it difficult for large companies to compete in new segments where margins aren’t yet rich enough to support such overhead. Hollowed-Out IT A particularly dangerous pitfall for wealthy organizations looking to transform is the belief that any required skill can be bought at will. Years ago, many companies considered IT a commodity: a necessity, but not one that created a competitive advantage. 
That’s why they didn’t perceive any risk in keeping IT skills outside of the company. Instead, they valued the flexibility in ramping external IT staff up and down as needed, just as they would with administrative or cleaning staff. They perceived this model as more efficient (Chapter 35). In the late 1990s, the telecom business was very profitable thanks to a fast-growing broadband internet market. These companies outsourced virtually all technical work to external contractors and system integrators (where I was employed). Solid profits allowed them to digest the high consulting fees, high administrative overhead for contract management, and more than occasional project cost overruns. However, outsourcing software delivery has severe drawbacks in the digital age: First, it prevents the organization from effectively participating in the Build-Measure-Learn cycle (Chapter 36) because externals typically work on a prenegotiated scope of work and therefore have little incentive to keep iterating on products or to shorten release cycles. Second, the organization won’t be able to develop a deep understanding of new technologies and their potential, thus stifling innovation. Worse yet, in many cases knowledge of a company’s existing system landscape rests with external contractors, rendering the organization unable to make rational decisions based on the status quo. If you don’t know your starting point, it’s difficult to get on the road to change. Outsourcing IT has severe drawbacks in the digital age because it excludes the organization from the critical innovation cycle. These companies’ IT departments degenerated into mere budget administration structures with hardly any technology skill. The main skill needed was securing budget and spending it. Those companies couldn’t attract much real IT talent because qualified candidates realized that their skills weren’t valued. Nevertheless, all was perceived as working well while the money flowed freely.
Excessive Dependencies But then everything changed: hardly any industry was overrun by internet companies as spectacularly as telecommunications. Telecoms used to “own” communication but completely failed to see the potential of the smartphone and digital consumer services. Telecoms used to generate billions of dollars in profits from short message service (SMS) products, a market that dropped significantly in just a few years thanks to WhatsApp, Facebook Messenger, and others. Existing IT contracts focused on improving efficiency (Chapter 35) in backend processing, such as billing; no internal skill was available to design and deliver new services to customers; and existing organizational structures and processes squashed any innovation that was trying to happen. Eventually, telecoms were left with providing “dumb data pipes” in a downward price spiral while digital companies enjoyed almost-trillion-dollar valuations and rich profit margins. Experienced software architects know that too many external dependencies get you in trouble. The same is true for organizations. Paying More May Get You Less Other factors surely played a role in telecoms missing the “digital boat,” but believing that technology skills can be acquired as needed is particularly dangerous. Just like you cannot buy friends, a company cannot buy motivated employees. Candidates with highly marketable skill sets, such as cloud architecture or machine learning, are attracted to teams with strong, like-minded people. This presents traditional companies with a chicken-and-egg problem. Many companies try to overcome this hurdle by paying higher salaries. However, compensation is often not the main motivator for top candidates; they are looking for an employer where they can learn from their peers and have the freedom to implement projects rapidly. That’s why it’s difficult for companies to “buy” skilled employees.
Worse yet, trying to attract talent by offering higher salaries can backfire because it will attract “mercenary” developers who work for the money alone. My experience is that people who come for money leave for more money. It won’t attract passionate developers who want to be part of a high-performing team to change the world. I compare this pitfall to the unpopular kid handing out candy at school: the kid won’t make friends, but will be surrounded by children who are willing to pretend to be a friend in exchange for candy. Changing Culture from Within Top consultants can surely help you implement new and exciting technology projects, but they won’t significantly change the organization’s culture; the cultural change must come from within. Roberts2 classifies the defining characteristics of an organization as PARC: people, architecture (structures), routines (processes), and culture. Restructurings and process reengineering can change the organization’s architecture and routines, but cultural changes must be instilled by the company leadership. This takes time, lots of energy, and sometimes a leadership change: “to do change management, sometimes you need to change management.” Because digital transformation requires changing both technology and culture, I opted to drive a large-scale IT transformation from the inside. It’s the hard way, but the only sustainable one. 1 Clayton M. Christensen, The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail, reprint ed. (New York: HarperBusiness, 2011). 2 John Roberts, The Modern Firm: Organizational Design for Performance and Growth (Oxford, England: Oxford University Press, 2007). Chapter 39. Who Likes Standing in Line? Good Things Don’t Come to Those Who Wait 100% utilization When in university, we often wonder whether and how what we learn will help us in our future careers and lives.
While I am still waiting for the Ackermann function to accelerate my professional advancement (our first semester in computer science blessed us with a lecture on computability), the class on queuing theory was actually helpful: not only can you talk to the people in front of you in the supermarket checkout line about M/M/1 systems and the benefits of single-queue, multiple-server systems (which most supermarkets don’t use), but it also gives you an important foundation to reason about economies of speed (Chapter 35). Looking Between the Activities When looking to speed things up in enterprises, most people look at how work is done: are all machines and people utilized, and are they working efficiently? Ironically, when looking for speed, you mustn’t look at the activities, but between them. By looking at activities you may find inefficient activity, but between the activities is where you find inactivity: things sitting around and waiting to be worked on. Inactivity can have a much more detrimental effect on speed than inefficient activity. If a machine is working well and almost 100% utilized but a widget must wait three months to be processed by that machine, you may have replicated the public healthcare system, which is guided by efficiency but certainly not speed. Many statistics show that wait times in typical IT processes, such as ordering a server, make up more than 90% of the total elapsed time. Instead of working more, we should wait less! A Little Bit of Queuing Theory When you look between activities, you are bound to find queues, just like the lines at your local department of motor vehicles or city office. To better understand how they work and what they do to a system, let’s indulge in a bit of queuing theory. My university textbook on queuing theory, Kleinrock’s Queueing Systems,1 appears to be out of print, but is available used. But don’t worry: you don’t need to digest 400 pages of queuing theory to understand enterprise transformation.
My university professor reminded us that if we remember only one thing from his class, it should be Little’s Result. This equation states that in a stable system, the total processing time T, which includes wait time, is equal to N, the number of items in the system (the ones in the queue plus the ones being processed), divided by the processing rate λ; in short, T = N/λ. This makes intuitive sense: the longer the queue, the longer it takes for new items to be processed. If you are processing two items per second and there are 10 items on average in the system, a newly arriving item will spend five seconds in the system. You might correctly deduce that most of those five seconds are spent in the queue, not actually processing the item. The noteworthy aspect of Little’s Result is that the relationship holds for most arrival and departure distributions. To build a bridge between speed and efficiency, let’s look at the relationship between utilization and wait time. The system is utilized whenever an item is being processed, meaning one or more items are in the system. If you sum up the probability that a given number of items are in the system, for instance, 0 items (the system is idle), 1 (one item being processed), 2 (one item being processed plus one in the queue), etc., you find that the average number of items in the system is equal to ρ / (1 – ρ), where ρ designates the utilization rate, or the fraction of time the server is busy (we assume that arrivals are independent, which is described as a memoryless system). From the equation you can quickly gather that high levels of utilization (ρ moving closer to 100%) lead to extreme queue sizes and therefore wait times. Increasing utilization from 60% to 80% almost triples the average queue length: 0.6/(1 – 0.6) = 1.5 versus 0.8/(1 – 0.8) = 4. Driving up utilization will drive away your customers because they get tired of standing in line!
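A few lines of code make these two relationships tangible. This is just an illustrative sketch: the function names are mine, and the formulas are exactly the two from this section, Little’s Result and the average occupancy of a stable, memoryless single-server system:

```python
def avg_items_in_system(utilization):
    """Average number of items (queued plus in service) in a stable,
    memoryless single-server (M/M/1) system: N = rho / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable system")
    return utilization / (1 - utilization)

def time_in_system(avg_items, processing_rate):
    """Little's Result: T = N / lambda, total time including waiting."""
    return avg_items / processing_rate

# Ten items in the system, processing two items per second:
print(time_in_system(10, 2))        # 5.0 seconds

# Raising utilization from 60% to 80% almost triples the
# average number of items in the system:
print(avg_items_in_system(0.6))     # ~1.5
print(avg_items_in_system(0.8))     # ~4.0
```

Plugging in ρ = 0.9 yields an average of nine items in the system; at ρ = 0.99 it is 99, which is a vivid argument against chasing 100% utilization.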
Finding Queues Queuing theory proves that driving up utilization increases processing times: if you live in a world in which speed counts, you have to stop chasing task efficiency. Instead, you need to look at your queues. Sometimes these queues are visible, like the lines at government offices where you take a number and wonder whether you’ll be served before closing time. In corporate IT the queues are generally less visible—that’s why so little attention is paid to them. By looking a little harder, though, you can find them almost everywhere:

Busy calendars
When everyone’s calendar is 90% “utilized,” important decisions queue for people to meet and discuss them. I waited for meetings with senior executives for multiple months.

Steering meetings
Such regular meetings tend to occur once every month or quarter. Topics will be queued up for them, again holding up decisions or project progress.

Email
Inboxes fill up with items that would take you a mere three minutes to take care of, but that you don’t get to for several days because you are highly “utilized” in meetings all day. Stuff often rots in my inbox queue for weeks.

Software releases
Code that is written and tested but waiting for a release is sitting in a queue, sometimes for six months.

Workflow
Many processes, ranging from getting an invoice paid to requesting a raise for an employee, have excessive wait times built in. For example, ordering a book takes large companies multiple weeks, as opposed to having it delivered the next day from Amazon.

To get a feeling for the damage done by queues, consider that ordering a server often takes four weeks or more. The infrastructure team won’t actually bend metal to build a brand-new server just for you: most servers are provisioned as VMs these days (thanks to software eating the world; see Chapter 14).
If you reasonably assume that there are four hours of actual work in setting up a server (assigning an IP address, loading an operating system image, and doing some nonautomated installations and configurations), the time spent in the queue makes up 99.4% of the total time! That’s why we should look at the queues. Reducing the four hours of effort to two won’t make any difference unless you reduce the wait times. Cutting the Line Standing in line is hardly productive, but occasionally entertaining. When waiting in line at the San Francisco Marina post office, I observed the highly utilized and actually quite friendly postal workers. To give myself a bit of utilization, I stepped over to grab Priority Mail envelopes for my next urgent mailing (back then I didn’t know what cool things the Graffiti Research Lab guys made from postal supplies). When I returned to my spot in the line, the guy behind me complained, and after a brief argument he claimed, “You are out of line.” I think the irony of his statement escaped him, as I was the only one who was amused. Digital companies understand the danger of queues quite well. The infamously tasty and free Google cafés have signs posted stating that “Cutting the line is encouraged.” Google doesn’t like to bear the opportunity cost of 20 people politely waiting behind a person who transports salad leaves to their plate one by one. Making Queues Visible “You can’t manage what you can’t measure,” goes the old saying, apparently falsely attributed to W. Edwards Deming. In the case of queues, making them visible can be a major step toward managing them. For example, metrics extracted from the ticketing system can show the time spent in each step or the ratio of effort over elapsed time (you will be shocked!). Showing that most time is simply spent waiting could also help the organization think in new dimensions (Chapter 40); for example, to realize that more elapsed time doesn’t equate to higher quality.
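That effort-to-elapsed-time ratio takes only a few lines to compute. This back-of-the-envelope sketch uses the server-ordering figures assumed earlier in this chapter (four weeks of elapsed time, four hours of hands-on work):

```python
elapsed_hours = 4 * 7 * 24   # four weeks of elapsed time, in hours
effort_hours = 4             # actual hands-on work on the server

# Fraction of the total elapsed time spent waiting in queues:
wait_fraction = 1 - effort_hours / elapsed_hours
print(f"Time spent waiting: {wait_fraction:.1%}")   # 99.4%

# Cutting the effort from four hours to two saves a negligible
# share of the total elapsed time:
savings = 2 / elapsed_hours
print(f"Elapsed time saved: {savings:.2%}")         # 0.30%
```

The numbers match the chapter’s claim: waiting dominates, so halving the effort barely moves the needle unless the queues shrink too.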
For critical business processes such as insurance claims handling, queue metrics are often managed under the umbrella of business activity monitoring (BAM). Corporate IT should use BAM to measure its own business, such as provisioning software and hardware, and reduce lag times. Slow IT these days means slow business. Why are single-queue, multiple-server systems more efficient, and why don’t more supermarkets use them? Lining customers up in a single queue reduces the chances that a server (i.e., cashier) is idling due to an uneven distribution of customers across the queues. It also allows smooth increases or reductions in the number of cashiers without everyone running to the newly opened lane or being ticked off at a lane closing. Most important, it eliminates the frustration that the other line is always moving faster! However, a single queue requires a bit more floor space and a single entry point for customers. You will see single-queue, multiple-server systems in many post offices and some large electronics stores like Fry’s Electronics. Apparently, they understand queuing theory! Message Queues Aren’t All Bad So how can the coauthor of a book on asynchronous message queues conclude that queues are trouble? Queues are a great tool for building high-throughput and resilient systems. They buffer load spikes to allow resources to work at optimum rates. Just imagine each person who wants to check out of the supermarket simply piling their items onto the checkout counter the moment they reach it. Hardly a useful scenario. Many businesses, such as Starbucks, use queues (Chapter 17) to optimize throughput. Queues become troublesome when they get long due to excessive utilization rates. High utilization and short response times don’t mix. Don’t blame the queue for it. 1 Leonard Kleinrock, Queueing Systems. Volume 1: Theory (New York: Wiley-Interscience, 1975). Chapter 40.
Thinking in Four Dimensions More Degrees of Freedom Can Make Your Head Hurt Stuck in two dimensions A university class on coding theory taught us about spheres in an n-dimensional space. Though the math behind it made a good bit of sense (the spheres represent the “error radius” for encoding, while the space between the spheres is “waste” in the coding scheme), trying to visualize four-dimensional spheres can make your head hurt a good bit. However, thinking in more dimensions can be the key to transforming the way you think about your IT and your business. Living Along a Line IT architecture is a profession of trade-offs: flexibility brings complexity; decoupling increases latency; distributing components introduces communication overhead. The architect’s role is often to determine the “best” spot on such a continuum, based on experience and an understanding of the system context and requirements. A system’s architecture is essentially defined by the combination of trade-offs made across multiple continua. Quality Versus Speed When looking at development methods, one well-known trade-off is between quality and speed: if you have more time, you can achieve better quality because you have time to build things properly and to test more extensively to eliminate remaining defects. If you count how many times you have heard the argument “We would like to have a better (more reusable, scalable, standardized) architecture, but we just don’t have time,” you start to believe that this God-given trade-off is taught in the first lecture of “IT Project Management 101.” The ubiquitous slogan “quick-and-dirty” further underlines this belief (Chapter 26). The folks bringing this argument often also like to portray companies or teams that are moving fast as undisciplined “cowboys” or as building software where quality doesn’t matter as much as in their “serious” business, because they cannot distinguish fast discipline from slow chaos (Chapter 31).
The term banana product is sometimes used in this context—a product that supposedly ripens in the hands of the customer. Again, speed is equated with a disregard for quality. Ironically, the cause for the “we don’t have time” argument is often self-initiated, as the project teams tend to spend many months documenting and reviewing requirements or getting approval, until finally upper management puts its fist on the table and demands some progress. During all these preparation phases, the team “forgot” to talk to the architecture team until someone in budgeting catches them and sends them over for an architecture review that invariably begins with, “I’d love to do it better, but…” The consequence is a fragmented IT landscape consisting of a haphazard collection of ad hoc decisions, because there was never enough time to “do it right” and no business case to fix it later. The old saying “nothing lasts as long as the temporary solution” certainly holds in corporate IT. Most of these solutions last until the software they are built on goes out of vendor support and becomes a security risk. More Degrees of Freedom So what if we add a dimension to the seemingly linear trade-off between quality and speed? Luckily, we are moving only from one to two dimensions, so our heads shouldn’t hurt as much as with the n-dimensional spheres. We’d simply have to plot speed and quality on two separate axes of a coordinate system instead of on a single line, as illustrated in Figure 40-1. Now we can portray the trade-off between the two parameters as a curve whose shape depicts how much speed we have to give up to achieve how much better quality. Figure 40-1. Moving from one to two dimensions For simplicity’s sake, you could assume that the relationship is linear, depicted by a straight line.
This probably isn’t quite true, though: as we aim to approach zero defects, the time we need to spend in testing probably goes up a lot, and as we know, testing can prove only the presence of defects, not their absence. Developing software for life- and safety-critical systems or things that are shot into space is probably positioned on this end of the spectrum, and rightly so. That even such systems rarely achieve zero defects can be seen in the example of the Mars Climate Orbiter, which disintegrated due to a unit error between metric and US measures. At the other end of the continuum, in the “now or never zone,” you may simply reach the limits of how fast you can go. You’d need to slow down a good bit and spend at least some time on proper design and testing to improve quality. So, the relationship likely looks more like a concave curve that asymptotically approaches the extremes at the two axes. The trade-off between time (speed) and quality still holds in this two-dimensional view, but you can reason much more rationally about the relationship between the two. This is a classic example of how even a simple model can sharpen your thinking (Chapter 6). Changing the Rules of the Game When you move into the two-dimensional space, you can ask a much more profound question: “Can we shift the curve?” And: “If so, what would it take to shift it?” Shifting the curve to the upper right would give you better quality at the same speed or faster speed without sacrificing quality. Changing the shape or position of the curve means we no longer need to move along a fixed continuum between speed and quality. Heresy? Or a doorstep to a hidden world of productivity? Because digital companies see speed and quality as two dimensions, they can think about how to shift the curve.
Probably both, but that’s exactly what digital companies have achieved: they have shifted the curve significantly to achieve never-before-seen speeds in IT delivery while maintaining feature quality and system stability. How do they do it? A big factor is following processes that are optimized for speed (Chapter 35), as opposed to optimizing for resource utilization under the guise of efficiency (Chapter 39). Digital companies can shift the curve because:

They understand that software runs fast and predictably, so they never send a human to do a machine’s job (Chapter 13).

They optimize end-to-end instead of optimizing locally.

They turn as many problems as possible into software problems so they can automate them and hence move faster and often more predictably. If something does go wrong, they can react quickly, often with the users barely noticing. This is possible because everything is automated and they use version control (Chapter 14).

They build resilient systems, ones that can absorb disturbance and self-heal, instead of trying to predict and eliminate all failure scenarios.

None of these techniques are rocket science. However, they require an organization to change the way it thinks. And that’s not easy to do. Inverting the Curve If adding a new dimension doesn’t make folks’ heads hurt enough, tell them that modern software delivery can even invert the curve: faster software often means better software! Much delay in software delivery is caused by manual tasks: long wait times for servers or environments to be set up by hand, manual regression testing, and so on. Removing this friction, usually by automating things, not only speeds up software development but also increases quality, because manual tasks are often the biggest source of errors (Chapter 13). As a result, you can use speed as a lever to increase quality.
For example, you can demand shorter provisioning times for servers in order to increase the level of automation and reduce defects due to human error. What Quality? When speaking about speed and quality, we should take a moment to consider what quality really means. Most traditional IT folks would define it as the software’s conformance to specification and possibly adherence to a schedule. System uptime and reliability are surely also part of quality. These facets of quality have the essence of predictability: we got what we asked or wished for at the time we were promised it. But how do we know whether we asked for the right thing? Probably someone asked the users, so the requirements reflect what they wanted the system to do. But do they know what they really want, especially if you are building a system the users have never seen before? One of Kent Beck’s great sayings is, “I want to build a system the users wish they asked for.” The traditional definition of quality is a proxy metric: we presume to know what the customers want, or at least that they know what they want. What if this proxy isn’t a very reliable indicator? Companies living in the digital world don’t pretend to know exactly what their customers want, because they are building brand-new solutions. Instead of asking their customers what they want, they observe customer behavior (Chapter 36). Based on the observed behavior, they quickly adjust and improve their product, often trying out new things using A/B testing. You could argue that this results in a product of much higher quality, one that the customers wish they could have asked for. So, not only can you shift the curve of how much quality you can get for how much speed, you can also change what quality you are aiming for. Maybe this is yet another dimension?
Losing a Dimension What happens when a person who is used to working in a world with more degrees of freedom enters a world with fewer, such as an IT organization still holding the belief that quality and speed are opposites? This can lead to a lot of surprises and some headaches, almost like moving from our three-dimensional world to the Planiverse.1 The best way out is reverse engineering the organization’s beliefs (Chapter 26) and then leading change (Chapter 34). 1 Wikipedia, "The Planiverse,” https://oreil.ly/RncTp. Part VI. Epilogue: Architecting IT Transformation This book’s main purpose is to encourage IT architects to take an active role in transforming traditional IT organizations that must compete with digital disruptors. “Why are technical architects supposed to take on this enormous task?” you may ask, and rightly so: many managers or IT leaders may have strong communication and leadership abilities that are needed to change organizations. However, today’s digital revolution isn’t just any organizational restructuring, but one that is driven by IT innovation: mobile devices, cloud computing, data analytics, wireless networking, and the Internet of Things, to name a few. Leading an organization into the digital future therefore necessitates a thorough understanding of the underlying technologies along with their application for competitive advantage. Game On Due to network effects, many digital business models follow a winner-takes-all dynamic: Google owns search, Facebook owns social, Amazon owns fulfillment and cloud, Netflix and Amazon jointly own content. Apple and Google’s Android own mobile. Google tried to get into social and floundered. Microsoft struggles in search and essentially withdrew from mobile. Amazon also struggled in mobile just like Google repeatedly dabbled in fulfillment without seeing a lot of traction. In cloud computing even almighty Google is at best a runner-up with Amazon holding on to a significant lead.
Following this battle of the titans from the sidelines of a traditional organization resembles watching world-class athletes compete from the bleachers while eating popcorn: these organizations sport valuations close to a trillion dollars (Netflix being the “baby” with roughly $150 billion market capitalization in 2020), have access to the world’s top IT talent, and are run by extremely talented and skilled management teams. How would one even hope to compete? There are several effects that play into the hands of incumbent companies. First, the digital world is one of constant evolution, and every round brings new opportunities. Uber disrupted the taxi industry by realizing that taxis aren’t the only cars on the road and that taxi drivers aren’t the only ones who can give others a ride. However, automotive manufacturers may have an ace up their sleeve in the next round when they launch self-driving cars. Second, traditional enterprises can utilize existing assets. For example, Fast Retailing, Uniqlo’s parent company, rather than emulate an online business model, uses the physical store as its key asset and is hugely successful at it. Target, a major US retailer, sees huge uplift in ecommerce sales with its curbside pick-up model—you just drive up and your order is loaded into your car. The digital world is one of many opportunities for those companies that can question existing assumptions and turn IT into a major innovation driver. Transforming from the Bottom Up It’s hard to imagine that instigating a digital transformation purely from the top down can be successful. Non-tech-savvy management can at best limp along based on input from external consultants or trade journals. That’s not going to cut it, though: competition in the digital world is fierce, and customer expectations are increasing every day.
When we hear of a successful startup company that went public or was acquired for a huge sum of money, we usually forget the dozens or even hundreds of startups in the same space that didn’t make it despite a great idea and a bunch of smart people working extremely hard on it. Architects, who are rooted in technology but can ride the elevator to the penthouse, are needed to make such a transformation successful. Transforming from the Inside Out Watching vendor demos and purchasing a few new products aren’t going to make an organization competitive against digital behemoths. As the overall direction of the digital revolution has become fairly clear and technology has been democratized to the point where every individual with a credit card can procure servers and big data analytics engines within minutes, the main competitive asset for an organization is its ability to learn fast. External consultants and vendors can give a boost, but they cannot substitute for an organization’s ability to learn (Chapter 36). Architects are therefore needed to drive, or at least support, the transformation from within the organization. From Ivory Tower Resident to Corporate Savior If you aren’t yet convinced that transforming the organization is part of your job as an architect, you may not have much of a choice: recent technology advances can be successfully implemented only if the organizational structure, processes, and often the culture also change. For example, DevOps-style development is enabled through the advent of automation technologies but relies on breaking down the silos between “change” and “run” organizations. Cloud computing can reduce time-to-market and IT cost dramatically, but only if the organization and its processes empower developers to actually provision servers and make necessary network changes. Lastly, being successful with data analytics requires the organization to stop making decisions based on management slide sets and to start basing them on hard data. All these are major organizational transformations.
Technology evolution has become inseparable from organizational evolution. Correspondingly, the job of the architect has broadened from designing new IT systems to also designing a matching organization and culture. In times of digital disruption, the job of the IT architect has surely become more challenging: keeping pace with ever-faster technology evolution, but also being well versed in organizational engineering, understanding corporate strategy, and communicating to upper management are now part of being an architect. But the architect’s job has also become more meaningful and rewarding for those who take up the challenge. In a prior job, I often jested that I was the chief organizational engineer disguised as the chief architect. The new world doesn’t reward architects who draw diagrams while sitting in the ivory tower. It has a lot in store, though, for hands-on innovation drivers and change agents. I hope this book encourages you to take the challenge and equips you with useful guidance, some clever slogans, and a little wisdom along your journey. Chapter 41. All I Have to Offer Is the Truth Giving Folks the Red Pill It’s so much more comfortable up here Embarking on a transformation journey can be quite a dramatic, sometimes even traumatic, undertaking for many people working for traditional enterprises. Digital companies are run, or at least perceived to be run, by highly educated, 20-something digital natives who aren’t distracted by family or social life and require little to no sleep. Their employers have hardly any legacy to deal with and billions in the bank, despite offering most services to consumers for free. For IT staff who have been working in the same, traditional enterprise, following the same processes for decades, this is likely to cause a mix of fear, denial, and resentment. Getting these folks on board for a transformation agenda is thus a delicate affair: if you are too gentle, people may not see a need to change. 
If you are too direct, people may panic or resent you.

Nothing But the Truth

Extracting a final reference from the movie The Matrix: when Morpheus asks Neo to choose between the red pill, which will eject him into reality, and the blue pill, which will keep him inside the illusion of the Matrix, he doesn't describe what "reality" looks like. Morpheus merely states:

Remember: all I'm offering is the truth. Nothing more.

If he had told Neo that the truth translates into living in the confines of a bare-bones hovercraft patrolling sewers in the middle of a war against machines that perpetually hunt the ship to chop it up with their powerful laser beams, Neo might have taken the blue pill. But Neo had already understood that there was something wrong with the current state, the illusion of the Matrix, and felt a strong desire to change the system. While you may also sense that something's not quite right with the existing system, most of your corporate peers will be quite content with their current environment and position. Sadly, it's not enough to take the pill yourself; you need to nudge your peers a little to come along for the ride. Just as in the movie, though, the new digital reality that awaits the red-pill takers may not be exactly what they expected. In a meeting, a fellow architect once proudly proclaimed that for the transformation to succeed, the architect's life needs to be made easier. He was bound to be disappointed: aiming to make one's life easier is unlikely to lead into the digital future. Technological advances and new ways of working make IT more interesting and more valuable to the business, but they don't make it easier: new technologies must be learned, and the environment generally becomes more complex, all while the pace speeds up. Digital transformation isn't a matter of convenience but of corporate survival.

Digital Paradise?
Looking from the outside, working at digital companies appears to consist largely of free lunches, massages, and riding Segways. While digital companies do court their employees with an unheard-of list of perks, they are also hugely competitive, both internally and externally. They firmly embrace a culture of constant change and speed to remain competitive and drive innovation. This means that employees rarely get to rest on the laurels of their work but need to keep pushing on. Engineers don't join digital companies to relax, but to push the envelope, innovate, and change the world. The rewards match the challenge, though, not just financially but, most importantly, in enabling engineers to really make a difference and accomplish things they couldn't accomplish on their own. More than a decade ago at Google, you could scale an application you wrote to 100,000 servers and run analytics against petabytes of logs in a second or two. Most traditional companies still dream of these capabilities a decade later. Such are the rewards of the digital IT life. These examples also show traditional companies why they should be scared.

Don't Try This at Home

When looking to transform, traditional companies often identify practices employed by digital disruptors and try to import them into their traditional way of working. While it's important to understand how your competitors think and work, adopting their practices requires careful consideration. Digital companies are known to do things like storing all their source code in a single repository, not having any architects, or letting employees work on whatever they like. When admiring these techniques, traditional companies must realize that they are watching world-class superstars pulling off amazing stunts. Yes, there are people who walk a tightrope between skyscrapers or jump off a tower to glide into the rooftop pool of a nearby building. That doesn't mean you should try the same at home.
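One of those stunts, keeping all source code in a single repository, works only because the build system rebuilds and retests just what a change actually affects. The core idea can be sketched in a few lines: given a dependency graph of build targets, a change triggers rebuilding of the changed targets plus everything that transitively depends on them. The target names and graph below are hypothetical, and this is only an illustration of the principle; real systems apply it across thousands of machines.

```python
from collections import defaultdict

# Hypothetical monorepo build graph: target -> targets it depends on.
DEPS = {
    "app": ["lib_ui", "lib_net"],
    "lib_ui": ["lib_core"],
    "lib_net": ["lib_core"],
    "lib_core": [],
    "tools": ["lib_core"],
}

def affected_targets(changed):
    """Return every target that must be rebuilt and retested when the
    given targets change: the changed targets plus all (transitive)
    dependents, found by walking the reversed dependency edges."""
    rdeps = defaultdict(set)  # reverse edges: dependency -> dependents
    for target, deps in DEPS.items():
        for dep in deps:
            rdeps[dep].add(target)
    result, stack = set(changed), list(changed)
    while stack:
        for dependent in rdeps[stack.pop()]:
            if dependent not in result:
                result.add(dependent)
                stack.append(dependent)
    return result

# A change to the core library ripples to everything above it...
print(sorted(affected_targets({"lib_core"})))
# → ['app', 'lib_core', 'lib_net', 'lib_ui', 'tools']
# ...while a change to the network library touches only itself and app.
print(sorted(affected_targets({"lib_net"})))
# → ['app', 'lib_net']
```

Everything outside the affected set can be served from cache, which is what makes a single repository with thousands of developers feasible at all; without that machinery, every change would mean rebuilding the world.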
When adopting "digital" practices, an organization must understand the interdependencies between these practices. A single code repository requires a world-class build system that can scale to thousands of machines and execute incremental build and test cycles. Sticking all your code into a single repository without having such a system in place, and a team to maintain it, is like jumping off a building without a parachute: it's unlikely you'll land softly in the nearby rooftop pool.

Abandon Ship

For most organizations, sailing to the digital future is a matter of survival. Imagine that you are an officer on the ocean liner Titanic and have just been informed that the ship is slowly but surely sinking. Most of the passengers are completely unaware of the severity of the situation and are comfortably sipping champagne on the upper decks. If you walk up to the passengers and inform them individually:

Sir, excuse me if you wouldn't mind. Could you be so kind as to consider relocating to the main deck so we may transfer you to a safer vessel? After you finish your drink, obviously. Please kindly excuse the terrible inconvenience. Your well-being is our primary concern.

you may not get much of a response, maybe just a doubtful stare. People may order another champagne and then have a peek at the vessel you are suggesting, the lifeboat, just to conclude that it appears much less safe and convenient than staying on the world's most modern and unsinkable ocean liner. If, on the other hand, you address the passengers as follows:

This ship is sinking! Most of you will drown in the icy ocean because there aren't enough lifeboats!

you will cause widespread panic and a rush for the lifeboats that's likely to leave many passengers dead or injured before the ship even takes on water. Motivating corporate IT staff to start changing the way they work, and to leave behind the comfort of their current position, is not dissimilar.
They are also unlikely to realize that their ship is sinking. Where on this spectrum of communication styles you should land depends on the organization and the individual. I tend to start gently and "ratchet up" the rhetoric when I observe perpetual inaction.

Looks Are Deceiving

Just as it seems unlikely that a simple block of ice could sink a marvel of engineering that was modern for its time, small digital companies may not appear threatening to a traditional enterprise. Many startups are run by relatively inexperienced, sometimes even naive, people who believe they can revolutionize an industry while sitting on a beanbag because their office space hasn't been fully set up yet. They are often understaffed and need to secure multiple rounds of external funding before turning profitable, if ever. However, just as 90% of an iceberg's volume lies under water, a digital company's enormous strength is hidden: it lies in the ability to learn much faster, often orders of magnitude faster, than traditional organizations. Dismissing or trivializing startups' initial attempts to enter an established market can therefore be a fatal mistake. "They don't understand our business" is a common observation from traditional businesses. However, what took a business 50 years to learn may take a disruptor only a year or less, because the disruptor is set up for economies of speed (Chapter 35) and has amazing technology at its disposal. Digital disruptors also don't have to unlearn bad habits. Learning new things is difficult, but unlearning existing processes, thought patterns, and assumptions is disproportionately more difficult. Unlearning and abandoning what made them successful in the past is one of the biggest transformation hurdles for traditional companies (Chapter 26). Some traditional businesses may feel safe from disruption because their industry is regulated.
To demonstrate how thin a safety net regulation provides, I routinely remind business leaders that if the digitals have managed to put electric and self-driving cars on the road and rockets into space, they are surely capable of obtaining a banking or insurance license. For example, they could simply acquire a licensed company. The fintechs Lemonade (insurance) and N26 (banking) are vivid examples of successful challengers in regulated industries. Lastly, digital disruptors aren't out to replicate existing business models, and they don't tend to attack from the front. Rather, they choose weak spots in existing business models that are highly inefficient or cause unhappy customers, but that aren't significant enough for large, traditional enterprises to pay attention to. Airbnb didn't build a better hotel, and fintech companies aren't interested in rebuilding a complete bank or insurance company. Instead, they attack the distribution channels, where inefficiency, high commissions, and unhappy customers allow new business models to scale rapidly with minimal capital investment. Some researchers claim that had the Titanic hit the iceberg head-on, it might not have sunk. Instead, it went down because the iceberg tore open a large portion of the relatively weak side of the hull. That's where the digitals hit.

Distress Signals

While transformation can be a scary endeavor, you aren't the only architect accepting the challenge. Like a ship in distress, it's good to call for help when things look dire. You shouldn't be shy about sending a digital SOS; no one has a proven recipe for transformation, so exchanging experiences and anecdotes with your peers is highly beneficial. You may even opt to share your experiences in a book. I'll be one of your first readers.
Index A abstractions, levels of, Levels of Abstraction: Simplicity Versus Flexibility ACID transactions, Your Coffee Shop Doesn’t Use Two-Phase Commit Agile methods, Agile and Architecture, Slow Chaos Is Not Order-The Way Out alignment defined, The Dangers of Riding the Elevator excessive, Poorly Set Domain Boundaries—Excessive Alignment amber organizations, Systems Resist Change analysis paralysis, Reversing Irreversible Decision Making anarchy, Actual Control: Autonomy Application Delivery Controller (ADC), Charting Territory architect elevator, The Architect Elevator-Flattening the Building architectural decisions, Architectural Decisions-Passing the Test architecture benefits of automation, Never Send a Human to Do a Machine’s Job-A Place for Humans, Self-Service Is Better Service benefits of including built-in options, Architecture Is Selling OptionsAmplifying Metaphors benefits of layers, Why IT Architects Love Pyramids defined, Architecture, Defining Software Architecture evolutionary architecture, Evolutionary Architecture gaining insights from the real world into, Your Coffee Shop Doesn’t Use Two-Phase Commit-Welcome to the Real World! identifying, Is This Architecture?-Passing the Test multispeed, Multispeed Architectures programming versus configuration, Code Fear Not!-Configuration Hiding as Code? skills needed to create sound, Architecting the Real World value of, The Value of Architecture architecture analysis, Viewpoints architecture diagrams, Diagrams Are Models-Nothing Is Confusing in and of Itself, Architecture Diagrams architecture reviews, Whys Reveal Decisions and Assumptions, Is This Architecture? architecture sketches, Sketching Bank Robbers-That’s Wrong! Do It Again! 
architecture without architects, Vanishing Point: The Guide assumptions overcoming, Reprogramming the Organization uncovering, Whys Reveal Decisions and Assumptions, Dissecting IT Slogans-The Unexpected Is Undesired asynchronous communication, Asynchronous Communication—Email, Chat, and More asynchronous processing model, Hotto Cocoa o Kudasai Auftragstaktik, Saupreiß, ned so Damischer automation benefits to architecture, Never Send a Human to Do a Machine’s JobA Place for Humans configuration changes, A Software System’s First Derivative, Beyond Self-Service versus humans, A Place for Humans, Staying Human levels of, Pirate Ship Leads to Better Decisions repeatability and resilience gained by, It’s Not Only About Efficiency scaling through, Self-Service Is Better Service self-service portals, Self-Service tacit versus explicit knowledge, Explicit Knowledge Is Good Knowledge understanding current system state, Automation Is Not a One-Way Street autonomy, Actual Control: Autonomy-Actual Control: Autonomy B backpressure, Backpressure banana products, Quality Versus Speed bi-modal IT, Multispeed Architectures bias, Bias Big Ball of Mud architecture, There Always Is an Architecture, Fit for Purpose, Why IT Architects Love Pyramids binary values, Your Coffee Shop Doesn’t Use Two-Phase Commit black markets, Flattening the Building, Black Markets Are Not EfficientFeedback and Transparency decision makers and, Living in Perfect Harmony Black-Scholes formula for computing option values, Options Have Value, Strike Prices bottleneck databases as, A Chair Can’t Stand on Two Legs in IT systems, The Loomers’ Riot? 
in organizations, Scaling an Organization bounded rationality, System Effects branching, Trunk-Based Development breadth-first writing, The Curse of Writing: Linearity browser-based document editing, Single Source of Truth Build-Measure-Learn cycle, The Infinite Loop-Maintaining Cohesion impediments to using, Hollowed-Out IT business activity monitoring (BAM), Making Queues Visible business architecture, Connecting Business and IT business models, Looks Are Deceiving buy-over-build strategy, Good Intentions Don’t Lead to Good Results C canonical data models, Canonical Data Model cargo cult, Offshore Platforms change, culture of, Culture of Change, Offshore Platforms, Changing Culture from Within (see also resistance to change; transformation) chaos monkey, Runtime Governance, Setting Course checkpoints, Feedback and Transparency chief architects asking the right questions, Question Everything-No Free Pass becoming successful, A Chief Architect’s Life: It’s Not That Lonely at the Top-Is It Proven to Work?, Architects as Change Agents role of, What Architects Are Not chief engineers, What Architects Are Not Clausewitz, Carl von, Saupreiß, ned so Damischer code, fear of, Fear of Code-Configuration Hiding as Code?, Digital Mindset coffee shop analogy, Your Coffee Shop Doesn’t Use Two-Phase CommitWelcome to the Real World! 
collaboration benefits of pairing, Pairing benefits of version control, Version Control coordinating among team members, Single Source of Truth dealing with resistance, Resistance progress transparency, Transparency style versus substance, Style Versus Substance trunk-based development, Trunk-Based Development working iteratively, Always Be Ready to Ship comments and questions, Staying Up-to-Date commercial off-the-shelf (COTS) solutions, Good Intentions Don’t Lead to Good Results communication asynchronous, Asynchronous Communication—Email, Chat, and More capturing audience interest, Show the Kids the Pirate Ship!-Play Is Work collaborating using version control, Software Is CollaborationResistance conveying technical topics effectively to upper management, Communication-Communication Tools corporate politics and, The Pen Is Mightier Than the Sword, but Not Mightier Than Corporate Politics creating architecture sketches, Sketching Bank Robbers-That’s Wrong! Do It Again! diagram-driven design, Diagram-Driven Design-No Silver Bullet (Point) enhancing focus by placing emphasis, Emphasis Over CompletenessNothing Is Confusing in and of Itself expressing component relationships, Drawing the Line-Beware of Extremes presenting complex technical material, Explaining Stuff-I Wanted to Have Liked To, but Didn’t Dare Be Allowed software architects as connectors and translators, The Architect Elevator writing for busy people, Writing for Busy People-The Pen Is Mightier Than the Sword, but Not Mightier Than Corporate Politics compatibility standards, Product Standards Restrict, Interface Standards Enable, The Value of Standards compensating actions, Compensating Action complex systems, Every System Is Perfect… complexity management feedback loops, Feedback Loops heater analogy, Heater as a System influencing system behavior, Influencing System Behavior organized complexity, Organized Complexity recurring system effects/patterns, System Effects resistance to change, Systems 
Resist Change systems thinking, Every System Is Perfect… understanding system behavior, Understanding System Behavior, Elements—Relationship—Behavior conceptual integrity, Principles Drive Decisions confidence, Fast and Good configuration applying design best practices to, Configuration Hiding as Code? code versus data, Code or Data? Or Both? versus coding, When Are We Configuring? configuration programming, Configuration Programming design-time versus runtime deployment, Deployment at Design-Time Versus Runtime versus higher-level programming, Higher-Level Programming model versus representation, Model Versus Representation configuration automation, A Software System’s First Derivative configuration changes, Beyond Self-Service confirmation bias, Bias container, Strike Prices, Avoid the Skipping Stones containment, The Metamodel Continuous Deployment (CD), A Software System’s First Derivative Continuous Integration (CI), A Software System’s First Derivative continuous learning loop, Build-Measure-Learn control flow, Organizational Architecture: The Dynamic View control theory and accurate reports, Problems on the Way Up autonomy, Actual Control: Autonomy-Actual Control: Autonomy closing gaps between plans, actions, and results, Saupreiß, ned so Damischer control circuits, Control Circuits controlling the control loop, Controlling the Control Loop feedback loops, A Two-Way Street illusion of control, The Illusion smart control, Smart Control control, actual, The Illusion, Actual Control: Autonomy Conversation Patterns, Conversations correlation identifiers, Correlation cost of delay, Cost of Delay culture of change, Culture of Change, Offshore Platforms, Changing Culture from Within curse of knowledge, Mind the Gap customer centricity, Layers Versus Platforms, Customer Centricity customer feedback, Digital RPMs, Feedback Cycles cybernetics, A Two-Way Street D data flow, Organizational Architecture: The Dynamic View deadlocks, Avoid Sync Points—Meetings Don’t Scale 
decision making architectural decisions, Architectural Decisions-Passing the Test avoiding decisions, Avoiding Decisions benefits of seeing the big picture, Pirate Ship Leads to Better Decisions-Pack Some Pathos bias and, Bias decision analysis, Micromort decision models, Model Thinking deferring decisions with options, Deferring Decisions with OptionsDeferring Decisions with Options five whys approach to, Five Whys importance of communication in, You Can’t Manage What You Can’t Understand IT decisions, IT Decisions judging the quality of decisions, Making Decisions minimizing irreversible, Reversing Irreversible Decision Making poor decision-making discipline, The Law of Small Numbers priming, Priming, Common IT Beliefs revealing assumptions and principles leading to, Whys Reveal Decisions and Assumptions, Reverse-Engineering OrganizationsHanded-Down Beliefs decision trees, Model Thinking delay, cost of, Cost of Delay depth-first writing, The Curse of Writing: Linearity designing systems, Your Coffee Shop Doesn’t Use Two-Phase CommitWelcome to the Real World! details, in technical presentations, Consistent Level of Detail DevOps, Pivoting the Layer Cake diagram-driven design, Diagram-Driven Design-No Silver Bullet (Point) diagrams, creating clear, A Good Paper Is Like the Movie Shrek, Avoid the Ant Font-The Style of Elements, Diagramming as Design Technique- Indicate Degrees of Uncertainty, Drawing the Line-Beware of Extremes digital mindset, Digital Mindset discipline, achieving speed with, Slow Chaos Is Not Order-The Way Out discussion group, joining, Staying Up-to-Date distributed system design, Your Coffee Shop Doesn’t Use Two-Phase Commit-Welcome to the Real World! 
documentation, corporate politics and, The Pen Is Mightier Than the Sword, but Not Mightier Than Corporate Politics dogfooding, Eat Your Own Dog Food duplication of work, value and cost of, The Value and Cost of Avoiding Duplication dynamic models, Living in Pyramids E economies of scale, Old Economies of Scale economies of speed, High-Speed Elevators, The Second Derivative, It’s Not Only About Efficiency, Economies of Speed-How to Make the Switch? efficiency, resource versus customer, Old Economies of Scale efficiency-based thinking, How to Make the Switch?, A Little Bit of Queuing Theory elastic, Fast and Good elasticity, An Architecture Option: Elasticity engine room defined, The Architect Elevator disconnect from penthouse, The Dangers of Riding the Elevator riding the architect elevator to and from, Not a One-Way Street enterprise architects developing an undistorted IT worldview, Plotting Your World Map role of, Enterprise Architect or Architect in the Enterprise? versus technical architects, Some Organizations Have More Floors Than Others, Visit All Floors understanding organization structures and systems, OrganizationsNavigating Large Organizations enterprise architecture (EA) versus business architecture, Connecting Business and IT-IT Is from Mars, Business Is from Venus challenges of, Value-Driven Architecture defined, Enterprise Architecture ethos, Tell Me a Story, Pack Some Pathos evolutionary architecture, Evolutionary Architecture exception handling, Exception Handling-Compensating Action external consultants, High-Speed Elevators, Hollowed-Out IT F face-to-face time, Staying Human feature teams, Pivoting the Layer Cake feature toggles, Trunk-Based Development feedback, Fast and Good feedback loops, Control Circuits, Digital RPMs, Feedback Cycles first derivative defined, Rate of Change Defines Architecture designing for, Designing for the First Derivative of software systems, A Software System’s First Derivative fitness function, Evolutionary 
Architecture five whys technique, Five Whys five-second test, The Five-Second Test flow efficiency, Behold the Flow!-Cost of Delay flow-based thinking, Behold the Flow! G Git, Version Control economies of speed and, 30,000 Times Faster learning curve and, Resistance goal setting, Setting Course governance by decree, Governance by Decree by inception, Inception by infrastructure, Governance Through Infrastructure defined, Living in Perfect Harmony necessity of, Governance Through Necessity value of standards for, The Value of Standards-Mapping Standards H harmonization, Living in Perfect Harmony headings, in writing, A Good Paper Is Like the Movie Shrek heating system analogy, Heater as a System highest paid person’s opinion (HiPPO), Beware of the HiPPO horizontal scaling, An Architecture Option: Elasticity I IBM PC development, as significant innovation, Skunkworks That Works impact architects and, Impact-Architect as Last Stop? inactivity, effects of, Looking Between the Activities inception, governance through, Governance Through Inception-Governance Through Necessity inefficiency, Overhead and Tolerated Inefficiency infinite learning loops, The Infinite Loop-Maintaining Cohesion infrastructure as code (IaC), Configuration Hiding as Code?, SDX: Software-Defined Anything innovation, IBM PC and, Skunkworks That Works integration architecture, Avoid Sync Points—Meetings Don’t Scale interface standards, Product Standards Restrict, Interface Standards Enable, The Value of Standards internal rate of return (IRR), Innovator’s Dilemma IT beliefs agility opposes discipline, Agility Opposes Discipline all problems can be solved with more people or more money, All Problems Can Be Solved with More People or Money following a proven process leads to proven good results, Following a Proven Process Leads to Proven Good Results late changes are expensive or impossible, Late Changes Are Expensive or Impossible quality can be added later, Quality Can Be Added Later speed and 
quality are opposed, Speed and Quality Are Opposed (“Quick and Dirty”) the unexpected is undesired, The Unexpected Is Undesired IT components, standardizing, A4 Paper Doesn’t Stifle Creativity-One Size Might Not Fit All Tastes IT pyramids, They Don’t Build ’Em Quite Like That Anymore-Building Modern Structures IT worldview challenges of establishing, The IT World Is Flat charting territory, Charting Territory defining borders, Defining Borders developing, High-Speed Elevators developing an undistorted map, Plotting Your World Map product philosophy compatibility check, Product Philosophy Compatibility Check shifting territory, Shifting Territory vendors' perspective, Vendors’ Middle Kingdoms ITIL, ITIL to the Rescue? K key performance indicators (KPIs) amount of learning per dollar spent, Digital RPMs cost, High-Speed Elevators decision making, Architecture Is Selling Options knowledge gaps, Mind the Gap L law of small numbers, The Law of Small Numbers layers base layer as proxy metric, Celebrating the Base Layer benefits of, Why IT Architects Love Pyramids building from the top, Building Pyramids from the Top challenges of building, No One Lives in a Foundation drawbacks of, No Pyramid Without Pharaoh efficiency versus speed, Living in Pyramids inverse pyramids, It Always Can Get Worse organizational pyramids, Organizational Pyramids versus platforms, Layers Versus Platforms leadership architects and, Leadership-Architect as Last Stop? 
learning loops, Not a One-Way Street legacy systems culture of change, Culture of Change drawbacks of, If You Never Kill Anything, You Will Live Among Zombies embracing change, If It Hurts, Do It More Often fear of change and, Fear of Change MTBF versus MTTR, Hoping for the Best Isn’t a Strategy planned obsolescence, Planned Obsolescence separation of operating code versus development code, Run Versus Change version upgrades, Version Upgrades lift boys, Other Passengers linearity, curse of, The Curse of Writing: Linearity lines, expressing component relationships with, Behold the Line!-Beware of Extremes logos, Tell Me a Story, Pack Some Pathos M maker mindset, Digital Mindset mapping standards, Mapping Standards marchitecture diagrams, No Silver Bullet (Point) matrix organization, The Matrix (Not the Movie) mean time between failures (MTBF), Hoping for the Best Isn’t a Strategy mean time to recovery (MTTR), Hoping for the Best Isn’t a Strategy meetings (synchronization points), Scaling an Organization impact on performance, Avoid Sync Points—Meetings Don’t ScaleAvoid Sync Points—Meetings Don’t Scale mental model, dangers of, Influencing System Behavior mentoring, The Virtuous Cycle metamodels, The Metamodel, Beware of Extremes microfail value, IT Decisions micromort value, Micromort models business, Enterprise Architecture, Looks Are Deceiving diagrams as, Diagrams Are Models dynamic, Living in Pyramids for asynchronous processing, Hotto Cocoa o Kudasai for decision making, Model Thinking using Wardley maps as, Platform Standards versus representations, Model Versus Representation-Model Versus Representation motivation, Setting Course, Paying More May Get You Less N nonrequirements, Architects Deal with Nonrequirements O obsolescence, planned, Planned Obsolescence open source tools, Good Intentions Don’t Lead to Good Results options Agile methods and, Agile and Architecture dealing with lack of, Evolutionary Architecture elasticity, An Architecture Option: 
Elasticity horizontal scaling, An Architecture Option: Elasticity no risk options, Arbitrage real options, Real Options strike prices, Strike Prices time of exercise, Time Is Fleeting uncertainty and, Uncertainty Increases an Option’s Value value of, Options Have Value options model, Time Is Fleeting organizational charts, Organizational Architecture: The Static View organizations achieving speed with discipline, Slow Chaos Is Not Order-The Way Out amber organizations, Systems Resist Change command-and-control structures, Control Is an Illusion-Controlling the Control Loop dangers of black markets, Black Markets Are Not Efficient-Feedback and Transparency flat versus classic, Some Organizations Have More Floors Than Others-Flattening the Building governance through inception, Governance Through InceptionGovernance Through Necessity identifying and overcoming outdated assumptions, ReverseEngineering Organizations-Handed-Down Beliefs layering to reduce complexity and achieve reuse, They Don’t Build ’Em Quite Like That Anymore-Building Modern Structures navigating organizations to lead change, Leading Change-The Country of the Blind resistance to change in, The Dangers of Riding the Elevator, Systems Resist Change scaling, Scaling an Organization-Staying Human understanding organization structures and systems, OrganizationsNavigating Large Organizations organized complexity, Organized Complexity outsourcing, Hollowed-Out IT overhead costs, Overhead and Tolerated Inefficiency P pair programming and collaboration, Pairing parallelism, in writing, Making It Easy for the Reader pathos, Tell Me a Story, Pack Some Pathos personal productivity, Component Design—Personal Productivity pessimistic resource allocation, Avoid Sync Points—Meetings Don’t Scale phone calls, Interrupts Interrupt—Phone Calls planned obsolescence, Planned Obsolescence platforms adhering to standards, Platform Standards, Digital Discipline building solid, yet flexible, Avoid the Skipping Stones-Avoid 
the Skipping Stones criteria for choosing, Avoid the Skipping Stones layers versus platforms, Layers Versus Platforms predictability, value and cost of, The Value and Cost of Predictability presentations (see also writing skills) collaborating using version control, Software Is CollaborationResistance creating useful models, Diagrams Are Models degrees of trust placed in, Problems on the Way Up diagram-driven design, Diagram-Driven Design-No Silver Bullet (Point) five-second test, The Five-Second Test presenting clear and useful diagrams, Diagramming Basics-The Style of Elements recapping key points, A Pop Quiz structuring, Twenty Slides, One Story titles, Making a Statement using simple language, Simple Language priming, Priming, Common IT Beliefs processing efficiency, Behold the Flow! product box, The Product Box product development flow, Behold the Flow! product fit, Charting Territory product philosophy compatibility checks, Product Philosophy Compatibility Check product standards, Product Standards Restrict, Interface Standards Enable productivity asynchronous communication, Asynchronous Communication— Email, Chat, and More automation and, Self-Service Is Better Service excessive alignment and, Poorly Set Domain Boundaries—Excessive Alignment personal, Component Design—Personal Productivity phone calls and, Interrupts Interrupt—Phone Calls system, Avoid Sync Points—Meetings Don’t Scale taking advantage of search, Asking Doesn’t Scale—Build a Cache! 
project managers, What Architects Are Not
prospect theory, Bias
proximity, The Metamodel
Prussian army, Saupreiß, ned so Damischer

Q

quality
  fallacy that quality can be added later, Quality Can Be Added Later
  fallacy that speed and quality are opposed, Speed and Quality Are Opposed (“Quick and Dirty”)
  faster delivery of functionality at high quality, Thinking in Four Dimensions-Losing a Dimension
quality gates, Feedback and Transparency
questions and comments, Staying Up-to-Date
queuing theory, A Little Bit of Queuing Theory-Message Queues Aren’t All Bad

R

rapid feedback cycles, The Infinite Loop-Maintaining Cohesion
rate of change, Architects Live in the First Derivative-Rate of Change for Architects
reference architecture, Defining Borders
repeatability, It’s Not Only About Efficiency
repeatable, Fast and Good
resilience, It’s Not Only About Efficiency, Setting Course
resistance to change
  challenges of, Systems Resist Change
  culture of change, Culture of Change
  dealing with, The Dangers of Riding the Elevator, A Tractor Passing the Race Car
  embracing change, If It Hurts, Do It More Often
  legacy systems, If You Never Kill Anything, You Will Live Among Zombies-Culture of Change
  persistence of beliefs, Beliefs Are Proven Until Disproven
retry error handling strategy, Retry
reverse engineering
  common IT beliefs, Common IT Beliefs-The Unexpected Is Undesired
  discovering unknown beliefs, Unknown Beliefs
  dissecting IT slogans, Dissecting IT Slogans
  need for, Reverse-Engineering Organizations
  reluctance to change, Beliefs Are Proven Until Disproven
  unlearning old habits, Unlearning Old Habits
reverse mentoring, The Virtuous Cycle
root-cause analysis, Five Whys
run versus change, Run Versus Change

S

sandbagging, The Value and Cost of Predictability
scaling, An Architecture Option: Elasticity, Scaling an Organization-Staying Human
SDX (software-defined anything), If Software Eats the World, Better Use Version Control!-Software Eats the World, One Revision at a Time
second derivative, The Second Derivative
secure, Fast and Good
self-service portals, Self-Service
self-service systems, Beating the Black Market, Self-Service Is Better Service
selling options analogy, Architecture Is Selling Options-Amplifying Metaphors
semantics, The Semantics of Semantics
semblance of control, The Illusion
senior developers, What Architects Are Not
server sizing, An Architecture Option: Elasticity
serverless architectures, If Software Eats the World, Better Use Version Control!, Feedback Cycles
shantytown architecture, There Always Is an Architecture
sidebars, in writing, A Good Paper Is Like the Movie Shrek
skill, architects and, Skill-Architect as Last Stop?
skunkworks, Skunkworks That Works
slow chaos, Slow-Moving Chaos
software architect elevator, The Architect Elevator-Flattening the Building
software architects
  ability to question everything, Question Everything-No Free Pass
  becoming a successful chief architect, A Chief Architect’s Life: It’s Not That Lonely at the Top-Is It Proven to Work?, Architects as Change Agents
  as change agents, Architects as Change Agents
  changing roles of, About This Book, Architects-Architects Deal with Nonrequirements
  communication tips, Communication-Communication Tools
  as connectors and translators, The Architect Elevator
  enterprise versus technical architects, Some Organizations Have More Floors Than Others
  implicit requirements and, Architects Deal with Nonrequirements, Fit for Purpose
  measuring value of, Measuring an Architect’s Value
  prototypical stereotypes of, Movie-Star Architects-Making the Call
  rate of change and, Architects Live in the First Derivative-Rate of Change for Architects
  rational and disciplined decision making by, Making Decisions-Avoiding Decisions
  resistance encountered by, The Dangers of Riding the Elevator
  skills needed to create sound architectures, Architecting the Real World
  specialization areas, Many Kinds of Architects
  three facets of being a good architect, An Architect Stands on Three Legs-Architect as Last Stop?
software architecture, defined, Defining Software Architecture (see also architecture)
software development and deployment
  attributes required for fast, Fast and Good
  product development flow, Behold the Flow!
  trunk-based development, Trunk-Based Development
software development life cycle (SDLC), Software Eats the World, One Revision at a Time
software-defined networks (SDNs), SDX: Software-Defined Anything-Software Eats the World, One Revision at a Time
speed
  achieving with discipline, Speed and Discipline-The Way Out
  achieving with queuing theory, Who Likes Standing in Line?-Message Queues Aren’t All Bad
  economies of speed, High-Speed Elevators, The Second Derivative, Economies of Speed-How to Make the Switch?
  efficiency versus speed in layers, Living in Pyramids
  fallacy that speed and quality are opposed, Speed and Quality Are Opposed (“Quick and Dirty”)
  faster delivery of functionality at high quality, Thinking in Four Dimensions-Losing a Dimension
  multispeed architectures, Multispeed Architectures
  two-speed architectures, Multispeed Architectures
speed-based thinking, How to Make the Switch?
stack fallacy, The Stack Fallacy
standardization
  benefits of, The Value of Standards
  creativity and, A4 Paper
  drawbacks of, Living in Perfect Harmony
  establishing global standards, One Size Might Not Fit All Tastes
  interface and product standards, Product Standards Restrict, Interface Standards Enable
  layers versus platforms, Layers Versus Platforms
  maps for, Mapping Standards
  platform standards, Platform Standards-One Size Might Not Fit All Tastes
  rationale for, Interface Standards
Starbucks analogy, Your Coffee Shop Doesn’t Use Two-Phase Commit-Welcome to the Real World!
strike prices, Options Have Value, Strike Prices
synchronization points (meetings), Scaling an Organization
  impact on performance, Avoid Sync Points—Meetings Don’t Scale
system context diagrams, Show Context
system design, Your Coffee Shop Doesn’t Use Two-Phase Commit-Welcome to the Real World!
systems theory and change, Tuning the Engine
systems thinking, Every System Is Perfect…-Influencing System Behavior

T

technical decision papers, Pushing (Less) Paper
technical memos, Unit Testing Technical Papers
Titanic, Problems on the Way Up, Abandon Ship, Looks Are Deceiving
tolerated inefficiency, Overhead and Tolerated Inefficiency
tragedy of the commons, System Effects
transaction monitors, Avoid Sync Points—Meetings Don’t Scale
transformation
  achieving speed with queuing theory, Who Likes Standing in Line?-Message Queues Aren’t All Bad
  achieving truly digital internal IT, You Can’t Fake IT-The Stack Fallacy
  architecting IT transformation, Epilogue: Architecting IT Transformation-From Ivory Tower Resident to Corporate Savior
  dealing with complacency, Getting Over the Hump
  economies of speed, Economies of Speed-How to Make the Switch?
  effecting lasting change, Transformation-Why Me?
  faster delivery of functionality at high quality, Thinking in Four Dimensions-Losing a Dimension
  infinite learning loops and, The Infinite Loop-Maintaining Cohesion
  key to successful, Tell Me a Story
  motivating corporate IT staff, All I Have to Offer Is the Truth-Distress Signals
  navigating organizations to lead change, Leading Change-The Country of the Blind
  pain of not changing, The Pain of Not Changing
  perils of large budgets, Money Can’t Buy Love-Changing Culture from Within
  reprogramming organizations, Reprogramming the Organization
  stages of, Stages of Transformation-Help Along the Way
tribes, Pivoting the Layer Cake
trunk-based development, Trunk-Based Development
two-phase commit approach, Transactions
two-speed architectures, Multispeed Architectures

U

UML diagrams, Diagramming as Design Technique, UML
upgrades, Version Upgrades, Software Developers Don’t Undo, They Re-Create

V

velocity, Fast and Good
vendors, selecting, Vendors’ Middle Kingdoms-Shifting Territory
version control, Version Control-Single Source of Truth
  learning curve and, Resistance
version upgrades, Version Upgrades
vertical cohesion, Vertical Cohesion
virtualization, SDX: Software-Defined Anything
virtuous cycle, The Virtuous Cycle
visuals
  and configuration, Model Versus Representation
  and sketching architectures, Visuals
  in writing, A Good Paper Is Like the Movie Shrek

W

Wardley maps, Platform Standards
watermelon status report, Problems on the Way Up
white markets, Beating the Black Market
why, why analysis, Five Whys
working iteratively versus working incrementally, Always Be Ready to Ship
write-off error handling strategy, Write Off
writing skills
  addressing diverse audiences, A Good Paper Is Like the Movie Shrek
  brevity, In der Kürze liegt die Würze
  capturing audience interest, Show the Kids the Pirate Ship!-Play Is Work
  corporate politics and, The Pen Is Mightier Than the Sword, but Not Mightier Than Corporate Politics
  enhancing focus by placing emphasis, Emphasis Over Completeness-Nothing Is Confusing in and of Itself
  importance of first impressions, “In the Hand”—First Impressions Count
  mapping complex topics into linear storylines, The Curse of Writing: Linearity
  non-referential statements and unsubstantiated claims, Lists, Sets, Null Pointers, and Symbol Tables
  parallelism and paragraph structure, Making It Easy for the Reader
  presenting complex technical material, Explaining Stuff-I Wanted to Have Liked To, but Didn’t Dare Be Allowed
  quality versus impact, Quality Versus Impact
  technical memos, Technical Memos
  writers’ workshops, Unit Testing Technical Papers
  written versus spoken presentations, Writing Scales

About the Author

Gregor Hohpe helps business and technology leaders transform not only their technology platform but also their organization. Riding the Architect Elevator from the engine room to the penthouse, he ensures that corporate strategy lines up with the technical implementation, and vice versa. He has served as Smart Nation Fellow to the Singapore government, as technical director in Google Cloud’s Office of the CTO, and as chief architect at Allianz SE, where he oversaw the architecture of a global data center consolidation and deployed the first private-cloud software delivery platform. Having worked for both digital-native companies and traditional enterprise IT allows him to reveal the many misconceptions these organizations have about each other, in the form of pointed anecdotes harvested from the daily grind of IT transformation.

Gregor is known as coauthor of the seminal book Enterprise Integration Patterns (Addison-Wesley), which is widely cited as the reference vocabulary for asynchronous messaging solutions. His articles have been featured in numerous publications, including The Best Software Writing I (Apress), selected and introduced by Joel Spolsky, and 97 Things Every Software Architect Should Know (O’Reilly), edited by Richard Monson-Haefel.

Colophon

The cover photo, Modern Elevator, is by Auris.
The cover font is Guardian Sans. The text font is Scala Pro; the heading font is Benton Sans; and the code font is Dalton Maag’s Ubuntu Mono.