Building the Internet of Things - Center

Building the Internet of Things
Early learnings from architecting solutions focused on predictive maintenance

Authors
Martijn Hoogendoorn, Architect, Applied Incubation, Microsoft
Mark Kottke, Architect, Applied Incubation, Microsoft

Intended audience
This white paper is aimed at technical decision makers, solution architects, and developers.

Abstract
This white paper provides a comprehensive overview of lessons learned from the authors' experiences in implementing large-scale customer projects that target predictive maintenance as a space in IoT. It frames the elements and considerations that matter within the Internet of Things, highlights tradeoffs and opportunities, and grounds the implementation activities in a reference architecture and an associated comprehensive cost model.

Acknowledgments
The authors would like to thank the following people, who contributed to, reviewed, and helped improve this white paper.

Contributors
Marc Mercuri, Principal Program Manager, Azure Customer Advisory Team, Microsoft
Clemens Vasters, Principal Program Manager, Azure Application Platform, Microsoft

Reviewers
Arno Harteveld, Architect, Client Solutions, Microsoft
Carolina Piavis, Director Business Programs, Applied Incubation, Microsoft
Ray Stephenson, Director, Applied Incubation, Microsoft
Mani Subramanian, Senior SDET, Patterns & Practices, Microsoft

Version 1.2

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT. Complying with all applicable copyright laws is the responsibility of the user.
Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

The descriptions of other companies’ products in this document, if any, are provided only as a convenience to you. Any such references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee their accuracy, and the products may change over time. Also, the descriptions are intended as brief highlights to aid understanding, rather than as thorough coverage. For authoritative descriptions of these products, please consult their respective manufacturers.

© 2014 Microsoft Corporation. All rights reserved. Any use or distribution of these materials without express authorization of Microsoft Corp. is strictly prohibited. Microsoft and Windows are either registered trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Table of contents
Executive summary
IoT and predictive maintenance
    The Internet of Things
        Business value
        Megatrends
        Technology enablers
        Standardization efforts
    Predictive maintenance
Predictive maintenance scenarios
    Healthcare
    Automotive
    Manufacturing
Architectural considerations
    Connectivity
        Interaction patterns
        Connectivity pathways
        Connectivity network types
    Protocol choices
        Transport-layer protocol choices
        Transport-layer protocol security
        Application-layer protocol choices
    Security
        Virtual Private Networks
        Compliance
    Device communication patterns
        NAT-based device network
        IPv6 direct-addressing device network
        NAT-based, PAN device network
        Generic concerns with direct addressing
        Service-assisted communication
    Designing for scale
        Communication and ingestion
        Data storage scalability
    Device registration
    Acquiring data
        Message size and format
        Message types
        Message priority
        Conditional messaging
        Contextual messaging
        Message batching
        Bandwidth and scale
    Storing information
        Storing data on the device
        Transforming data
        Location
        Longevity, format, and cost
    Processing information
        Alarm processing
        Complex-event processing
        Big Data analysis
        Machine learning
        Data enhancement
    Publishing insights
        Audience
        Publishing format
Cost modeling and estimation
    Common architecture overview
    Capacity modeling
    Cost estimation
        Ingress path cost
        Egress path cost
        Management cost
    System processing cost
    Cost estimate calculation
Strategic choices
    Buy, build, or hybrid
Important topics not yet covered
    Networks with automatic handover and fallbacks
    The need for the commoditization of devices
    The creation and use of information marketplaces
    Management solutions
    The redefinition of SLAs
    Integration simplicity
Conclusions
How Microsoft can help you succeed

Executive summary
For decades, technology experts have anticipated the Internet of Things (IoT): the proliferation of tens of billions of connected devices that contain embedded microchips, and the rise of machine-to-machine and service-to-service communications. IoT will make inanimate objects, networks, and processes “smart”: everything from tiny components, appliances, machines, homes, buildings, and factories to energy grids, transportation networks, and logistics systems.
It’s a game-changing opportunity in IT. By analyzing the vast new streams of data, and by harnessing the precise control that IoT provides, your organization can reduce costs, create new revenue streams, increase customer satisfaction and retention, spot trends faster, gain from opportunities more easily, and innovate with agility. IoT will be especially beneficial in predictive maintenance: performing maintenance at the right time to predict and prevent failures.

To take full advantage of IoT opportunities in predictive maintenance, you need to think strategically about the many elements of IoT. For example, you should consider connectivity pathways and types, transport-layer and application-layer protocol choices, device interaction and communication patterns, and how to design for the vast scale of IoT. It is especially critical to understand the complex issues of data security and regulatory compliance, which can expose the enterprise to legal difficulties if they are not handled properly. You should also think about how the enterprise’s communications systems will ingest data, including message types, sizes, formats, and priorities, conditional and contextual messaging, message batching, bandwidth, and how to scale a messaging system.

Another pivotal set of questions relates to the data itself. Where will data be stored, and how will it be distributed or potentially sold? How long must the data be kept, in what format, and at what cost? What is the most efficient way to analyze Big Data, and how can you best take advantage of possibilities such as alarm processing, complex-event processing, Big Data analysis, machine learning, and data enhancement? Because data that at first seems uninteresting can be very valuable to the right audience, how do you find that audience and monetize the insights gained from processing the data?
The elements that are needed for security, communication, and scale in an IoT solution make it very challenging to build one from scratch. Succeeding with any IoT solution will very likely require implementing a reference architecture that can help accelerate the use of massive data from millions or even billions of devices. Modeling the system’s capacity to scale, and calculating the associated costs for aspects such as ingress (device to cloud) paths, egress (cloud to device, cloud to system) paths, and system processing, is paramount. Depending on the company background, a classic buy vs. build vs. hybrid decision should be made, based on what you are already using, what is available, and what will be available in the near future at a price that is acceptable to your business.

This white paper introduces and describes all of these considerations and provides you with the tools necessary to estimate the operational cost of an implemented reference architecture in production. With the Microsoft Azure platform, Microsoft offers a broad set of building blocks to help you get an IoT solution up and running quickly.

IoT and predictive maintenance
At Microsoft, we hear constantly from customers who say that the Internet of Things (IoT) is one of the most exciting trends in IT.[1] Many of our customers are interested in deploying sensors and devices in every part of their businesses in order to capture information from the physical world and act upon the knowledge gained from refining it. They will buy or build systems that can deliver these capabilities in order to optimize their bottom line, keep customers satisfied, and explore new revenue potential.

Predictive maintenance is an IoT scenario in which a device provides data that leads to insightful, proactive maintenance before a likely failure takes place.
Predictive maintenance offers a new revenue stream for device manufacturers, and it is very interesting to their customers because it enables better business continuity, which usually generates extra revenue. In this way, the cost of a new service from a device manufacturer is justifiable to customers, given the cost and impact of unplanned device downtime.

The Internet of Things
An expert in radio-frequency identification named Kevin Ashton first used the term “the Internet of Things” in 1999,[2] though the idea had been around at least a decade earlier. As with many terms in technology, IoT is a loaded term that people interpret differently depending on their viewpoint and purpose. For example, Gartner defines it as “The network of physical objects that contain embedded technology to communicate and interact with their internal states or the external environment.”[3]

Formulating this differently: The Internet of Things is a metaphor for a set of systems in which direct human intermediation is dramatically reduced by equipping distributed systems with sensors that let us acquire information, make decisions, and control things in the physical world.

Based on this definition, IoT consists of a set of four composable activities:
• Acquiring data. Using sensors to record information about the physical world. Examples include measuring location, humidity, temperature, light, heart rate, blood pressure, brain waves, and current, and detecting gas.
• Processing information. Taking action based on data captured and on contextual information retrieved previously or sourced from other systems. This processing could involve using actuators that can alter the state of the physical world, such as opening valves, switching machines on and off, sounding alarms, controlling servos, closing doors, and many other things.
• Storing information. To enable trend analysis, forecasting, and insight-driven decision making, historical information and context are needed. Storing the information retrieved in its contextual form (for example, including the location where it was captured, the date and time it was captured, and the state of the system at the time it was captured) is critical for this process.
• Publishing insights. When embedded sensor data is combined with both internal and external data from other systems, additional insight from analyzing the data can be learned and acted upon. Exposing that insight can also drive additional value for other stakeholders outside the immediate needs of the current system, allowing for the monetization of this knowledge.

Figure 1. Foundational activities, composable within and between devices and systems

[1] Microsoft, “What Our Customers Are Saying: Top Enterprise Trends of 2014,” Susan Houser
[2] Wikipedia, Internet of Things
[3] Gartner, IT Glossary, Internet of Things

On top of familiar devices, such as phones for input and presentation, a set of core components is needed to support those activities, though business goals and technical constraints will determine which of them are required. Core components may include:
• Sensors: the components that translate a value from the physical world into bits. Examples include sensors that measure pressure, humidity, heart rate, gas levels, and acceleration.
• Devices: networked, physical, special-purpose systems that emit telemetry data, accept external information, request external information, and execute remotely issued commands. Examples include factory floor equipment, environmental pollution sensors, and control modules in vehicles.
• Bridges: systems that act as communication brokers between a device and a gateway, typically by translating data traffic between different link protocols or methods, for instance between short-range and long-range wireless protocols.
A bridge can also be a connectivity infrastructure that manages a nationwide or worldwide wireless network on one side, and a bridge to a cloud system on the other. A bridge might also perform intelligent preprocessing of data, or act as an autonomous local communications hub in addition to its bridging function relative to a cloud system.[4] Bridges are often also referred to as gateways, but we reserve the term gateway for a network-based service with which a bridge communicates.
• Gateways: network-based services that manage connectivity and connections with devices either directly or through bridges. The service establishes a trusted communication relationship with a device, deals with ingestion and routing of telemetry data, and provides access to command and notification data destined for the device. On top of these services, it provides data pipeline processing, possibly including transformation, complex-event processing capabilities, data analytics components, machine learning, and so on.
• Machine learning: computational algorithms that can analyze large amounts of data and extract patterns from it, helping a system act on and “learn” from that data to drive more intelligent system responses in the physical world.
• Interconnections: different systems sharing learnings and data that in turn form composite systems.

We have read thought-provoking papers about IoT. Two that we found especially valuable in providing context to the concepts and opportunities of IoT are:
• “Recommendations for the Strategic Initiative INDUSTRIE 4.0.”[5]
• “Industrial Internet: Pushing the Boundaries of Minds and Machines, a European Perspective.”[6]

IoT enables you to build, enhance, or extend a business model based on data-driven insights from pervasive sensors that help you optimize resource use and reduce cost and environmental impact. IoT also helps you maintain a closer relationship with customers beyond the point of sale of physical products by enabling contextual, remote actions automatically and intelligently. Examples include remote servicing, proactive sales, best-practices guidance, and more.

[4] Microsoft, “How Microsoft tech is helping affordable housing tenants save money” (section on “Captain”)
[5] Deutsche Akademie der Technikwissenschaften, Final report of the Industrie 4.0 Working Group
[6] General Electric, “Industrial Internet: Pushing the Boundaries of Minds and Machines, a European Perspective”

Business value
Gartner projects that 26 billion devices will be connected to the Internet by 2020, and organizations in every sector will use them.[7] Billions of connected devices will help businesses to:
• Reduce cost. Businesses can use the increased insight into manufacturing and delivery processes to optimize those processes and reduce cost. For example, a business can reduce the number of visits a technician must make by scheduling service visits based on duty cycles and expected product lifespans informed by actual usage.
• Create new revenue streams. The ability to sense from and actuate in the physical world is giving rise to new business models. Businesses can capitalize on these opportunities and create innovative new revenue streams. Examples include monetizing newly collected datasets, offering APIs to create new business partnerships, increasing service revenue by notifying and offering improved convenience to customers, offering differentiating SKUs based on usage patterns, supplying optimized configuration services, and so on.
• Increase customer satisfaction and retention. By knowing how customers of physical products use them, opportunities exist to extend the customer experience into scenarios of higher value, and retain and extend the customer base.
Capturing data on how customers actually use products, and ensuring that they do not experience frustrating service issues, helps companies retain customers.

In the blog post “10 reasons businesses need a strategy for the Internet of Things now,”[8] the author identified a concise set of benefits that a company can realize by adopting an IoT strategy.

Megatrends
The world faces many challenges, such as changes in wealth distribution, resource scarcity, and an aging population in developed countries. The authors of the book “From Machine-to-Machine to the Internet of Things: Introduction to a New Age of Intelligence” analyzed these megatrends and capabilities in detail.[9] They found that these megatrends are driving a proliferation of embedded devices with sensors, which in turn require new capabilities for new market scenarios, as the graphic below shows.

Figure 2. "Megatrends." From Machine-To-Machine to the Internet of Things: Introduction to a New Age of Intelligence. Amsterdam, Netherlands, Elsevier, January 2014.

[7] Gartner, “Gartner says the Internet of Things Installed Base Will Grow to 26 billion units by 2020”, December 2013
[8] Microsoft, 10 reasons businesses need a strategy for the Internet of Things now
[9] “From Machine-to-Machine to the Internet of Things: Introduction to a New Age of Intelligence”, ISBN 978-0124076846

Of the megatrends listed in the previous figure, we explain here how some relate to the Internet of Things:
• Natural resource constraints. The world population is growing at a high rate, with a projected peak population of 9.22 billion in 2075.[10] Given this growth and the impact it has on the growth of the worldwide economy, the world will increasingly have to do more with less, and optimize the way that we produce. IoT can support the optimization of production, loss reduction, and the efficiency of the necessary supply chain.
• Economic shifts.
Much like the shift in IT from packaged products to as-a-service solutions, the global economy is moving from a product-oriented to a service-oriented perspective.[11] For a viable service-oriented economy to come into existence, it needs to be supported by a large set of devices that give the system context about the customer environment, so that it can offer the right service, at the right price, and at the right time.
• Changing demographics. With the world population increasingly aging, especially in more-developed countries, the change in demographics will call for smart solutions that can help elderly people remain self-supporting.
• Climate change. The impact of human activities on the environment, although debated at length, is detrimental to the sustainability of the world. In recent years, there has been a growing movement toward “green” technologies and services, ranging from electric cars to corporate and government policy changes. IoT can support both insight into an environmental footprint and its reduction.

[10] United Nations, Economic & Social Affairs, World Population to 2300
[11] Wikipedia, Service economy

Technology enablers
The ever-decreasing cost and size of components, such as accelerometers, Wi-Fi radios,[12] GPS, microcontrollers, and Bluetooth radios, is also enabling the Internet of Things (IoT). It allows components and devices to be used in new settings, such as wearables, on-person devices, and even smaller equipment.

As shown in Figure 2, IoT depends on several other major technologies and trends. Some of these technology enablers, as well as others, warrant clarification:
• Ubiquitous connectivity. Low-powered wireless networking enables devices to talk to a gateway, among each other, or directly to the outside world. A foundation for IoT implementations, connectivity must be managed carefully. To learn more, see the Connectivity section in this paper.
• Cloud computing.
For systems that connect hundreds of millions of devices, cloud computing is the technology that allows for vast scale and acceptable costs, providing the ability to store large amounts of machine-generated data at low cost and perform Big Data analytics and machine learning.
• Small, low-power, low-cost microcontrollers. Microcontrollers today can perform tasks at very low power and have a battery life of many years.[13] For example, the Texas Instruments MSP430 runs at less than 100 µA/MHz and can operate on a single coin battery for more than 20 years.[14] (Device battery life always depends on components and application cycle use.) The memory embedded in this microcontroller is ferroelectric random-access memory (FRAM), an improvement on flash memory that sports very high data throughput at a power consumption three times lower than flash memory and 99 percent lower than comparable dynamic random-access memory (DRAM).
• Power supply and storage technologies. Given the tiny size of many new devices, their deployment location, and the vast number of them that will be deployed, changing batteries is often impractical or impossible. Besides optimizing hardware design for these scenarios,[15] enhancing circuitry by limiting its quiescent current (Iq) will further improve battery life. Also, with energy harvesting techniques, such as solar power supplies, devices can recharge their built-in batteries as long as there is a minimal charge left.
• Embedded operating system platforms. With the vast number of devices that will be installed, cost and energy consumption per device become decisive. Engineers will create devices that cost less and that are more energy-efficient, even if they have limited processing capabilities and memory. CPU cycles spent and memory allocated will become important factors for choosing operating system platforms, installed components, and security configurations.
There is a plethora of good general-purpose operating systems, ranging from Windows Embedded and Embedded Linux to real-time operating systems, such as FreeRTOS, ThreadX, Integrity, Nucleus, QNX, Atomthreads, AVIXRT, ChibiOS/RT, ERIKA Enterprise, TinyOS, Thingsquare Mist/Contiki, and others.[16]

In sum, IoT is gaining momentum because growing customer and enterprise needs are meeting technology enablers at the right cost.

[12] For example, a network chip for less than $10 for 1,000 units. Texas Instruments, SimpleLink™ Wi-Fi Module CC3000
[13] maxEmbedded, What is a microcontroller? And how does it differ from a microprocessor?
[14] Texas Instruments, MSP430 documentation
[15] Texas Instruments, Using power solutions to extend battery life in MSP430 applications
[16] For a comprehensive list, see http://en.wikipedia.org/wiki/List_of_real-time_operating_systems

Standardization efforts
Throughout the world, many organizations are working on the standardization of IoT, either for a specific technology or holistically through reference architectures. Examples of this work include:
• ITU-Telecom (ITU-T), Internet of Things Global Standards Initiative (IoT-GSI).[17]
• European Union, Internet of Things Architecture (IoT-A).[18]
In addition to these efforts, detailed work is under way in many individual technology areas, such as the standardization of protocols. Protocol choices, at both the transport layer and the application layer, are discussed later in this document.

Predictive maintenance
This white paper focuses on a common scenario that IoT enables and that we call predictive maintenance: performing maintenance with a focus on timeliness, acting exactly when needed instead of at regular intervals, and predicting and preventing failures before they happen, based on learning from historical data. Predictive maintenance, or just-in-time maintenance, will massively transform how organizations and consumers manage equipment as well as people.
Predictive maintenance also informs more traditional preventive maintenance patterns, optimizing routine maintenance activities.

[17] ITU-T, Internet of Things Global Standards Initiative
[18] European Union, Internet of Things Architecture

Predictive maintenance scenarios
The potential for useful applications in the Internet of Things (IoT) is endless. This section focuses on scenarios that illustrate concrete benefits based on predictive maintenance, where maintenance can be performed on both inanimate and living things. The scenarios that follow provide examples of the enormous potential that IoT holds for enterprises.

Healthcare
With the previously described change in world demographics, there is an increasing need for “remote patient management,” which allows elderly citizens to come to the doctor or the hospital only when the need arises, based on telemetry captured by smart devices. Some early innovation in this space, geared more toward health self-management and consumer devices, can be seen in watches with sensors that collect a variety of data, such as blood pressure and heart rate. When body temperature, oxygen levels, and CO2 levels are combined with the ability to display this data to the patient and physician in real time, this alleviates the stress of full waiting rooms and reduces the cost per patient.[19] Another example is an in-home glucose monitor that uploads a patient’s vital signs to a cloud-based health platform, where the data is analyzed and presented back to the patient in an easy-to-understand format on a mobile device, and in a more complex format on a touchscreen to the doctor.
The doctor can review the patient’s information and then use the touchscreen to send feedback to the patient and write a prescription.[20] Powerful, specialized, cloud-connected devices like these, which enable doctors and patients to work together to remotely monitor vital signs, exchange information, communicate, and alert relatives, all in real time, are either becoming available or in development. By actively monitoring patients at home[21][22] or while they are mobile, healthcare professionals can provide a higher level of care, reduce in-hospital waiting time and costs, and reduce stress for everyone involved, which leads to better patient outcomes. Using technology to accurately predict conditions that need attention and to signal medical staff about them enables healthcare professionals to anticipate patient issues instead of reacting to them, and to remedy them before they become critical, all while maintaining the security and privacy of the data collected.[23] As a positive side effect, the collected evidence of provided care could also help alleviate the issue that doctors in the U.S. are sometimes reluctant to provide prescriptions or diagnoses over the phone because of billing restrictions,[24] which forces patients to visit the office of the healthcare provider and, as a result, wastes a lot of everyone’s time on the treatment of common or recurring ailments.
[19] “Samsung Simband aims to take a big step in wearable health,” www.cnet.com/products/samsung-simband/
[20] Microsoft Healthvault Medical Intelligent System, www.youtube.com/watch?v=j8Y4ukdNM60
[21] Medical Design Technology Magazine, The Internet of Things and Medical Device Product Development: Practical Strategy Suggestions, March, 2014
[22] YouTube, Medical Intelligent System, Proof of Concept
[23] Deloitte, Networked medical device cybersecurity and patient safety: Perspectives of health care information cybersecurity executives
[24] Texas Medical Association, Coding for Telephone Consultations

Automotive
Vehicles generate telemetry about their operation, and about the service activities and faults that occur on them. They travel through different locations, different weather conditions, and different usage scenarios: a four-wheel drive vehicle climbing trails, a sports car in the mountains, or a family van loaded with children. Each of these factors can affect how the vehicle operates, as well as its reliability, comfort, safety, and performance. If the vehicle manufacturer or a vendor-agnostic data aggregator or analyst can collect this data and analyze it over time, trends can be identified that point to new, timelier, more cost-effective, and more impactful actions. These can include maintaining or reconfiguring the vehicle, which in turn can help to prevent recalls or, conversely, trigger recalls, to keep the vehicle safe, fun, useful, and cost-effective for everyone involved, including the owner, the operator, and the passengers.

Manufacturing
A service technician is dispatched to analyze an elevator after someone reports that its doors will not close. The building owner is hearing from people who are unhappy that they have to walk up the stairs. It takes the technician an hour to drive to the building and find the elevator.
After arriving, he works through a standard checklist for another hour, only to conclude that the elevator works as expected. As so often happens, a fleeting obstruction, such as a coffee cup between the doors of the elevator or accumulated dust and dirt in the sliding rail, might have caused the problem. The service technician drives back to his office, having spent a total of three hours on a phantom problem. At $150 USD per hour and with more than one million elevators in service, incidents like this one, where equipment is found to be operating normally upon inspection, can have a big impact on the profitability of an elevator company, depending on the type of maintenance contract.
Moving beyond this reactive maintenance illustration, capturing telemetry about the motors that operate the elevator or the speed at which the elevator doors close allows the technician to take a more predictive approach. For example, an increase in energy consumption or a decrease in door-closing speed might signal the need for service and trigger a maintenance crew to provide it before the elevator breaks down and customers call support, thus saving money, reducing downtime, and increasing customer satisfaction.

Architectural considerations
Designing any system reveals concerns that transcend the individual components of the system. In this section, we discuss various considerations and architectural approaches that we have encountered while helping our customers design solutions in the realm of predictive maintenance.

Connectivity
Figure 3. An overview of network layers and mapped logical protocols
A key technical enabler of the Internet of Things (IoT) is ubiquitous connectivity. Let’s first look at the Open Systems Interconnection (OSI) model.[25] Even though the Internet model uses a simplified abstraction, the models in Figure 3 and the associated well-known logical protocols are comparable.
Application-layer protocols are not concerned with the lower layers in the stack beyond being aware of key attributes of those layers, such as IP addresses and ports. The right side of the figure shows the logical protocol breakdown transposed over the OSI model and the TCP/IP model.

[25] Wikipedia, OSI Model

Interaction patterns
Because of their role as peripherals, special-purpose devices differ from information-centric devices not only in the depth of their relationship with back-end services, but also in the patterns in which they interact with those services. They are not the origin of command-and-control gestures; instead, they typically contribute information to decisions, and receive commands as a result of decisions. The decision-maker does not interface with them locally, and the device acts as an immediate proxy; the decision-maker is remotely connected and might be a machine. We usually classify interaction patterns for special-purpose devices into the four categories indicated in the following figure.
Figure 4. Device communication patterns
• Telemetry is information flowing in one direction that a device volunteers to a collecting service, either on a schedule or based on circumstances. That information represents the current or temporally aggregated state of the device or the state of its environment, such as readings from sensors that are associated with it.
• Notifications are one-way, service-initiated messages that inform a device or a group of devices about some environmental state that they would otherwise not be aware of. For example, wind parks can be fed weather forecast information, cities can broadcast information about air pollution, suggesting that fossil-fueled systems throttle their CO2 output, and vehicles may want to show weather or news alerts or text messages to drivers.
• Inquiries occur when a device solicits information about the state of the world beyond its own reach based on its current needs; an inquiry can be a singular request, but it might also ask a service to supply ongoing updates about a particular information scope. For example, a vehicle might supply a set of geo-coordinates for a route, and then ask for continuous traffic alert updates about that route until it arrives at the destination.
• Commands are service-initiated instructions sent to either a single device or a group of devices. Commands can tell a device to provide information about its state, or to change the state of the device, including activities with effects on the physical world. That includes, for instance, sending a command from a smartphone app to unlock the doors of your vehicle, whereby the command first flows to an intermediating service and from there is routed to the vehicle's onboard control system.
Telemetry and inquiries are device-initiated, and their counterparts, commands and notifications, are service-initiated. This means that there must be a network path for messages to flow from the service to the device, which raises a set of important technical questions. How do you:
• Address a device on a network when it is roaming or if it is power-constrained and duty cycling the radio to conserve energy?[26][27]
• Send commands or notifications with acceptable latency for a given scenario?
• Ensure that the device only accepts legitimate commands and trustworthy notifications?
• Ensure that the device is not easily susceptible to denial-of-service (DoS) attacks that render it inoperable?
• Perform this with millions of devices attached to a telemetry-and-control system?

[26] Wikipedia, Duty Cycle
[27] Georgia State University, ActSee: Activity-Aware Radio Duty Cycling for Sensor Networks in Smart Environments
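The split between device-initiated and service-initiated patterns described above can be captured in a few lines. The following Python sketch is illustrative only: the names `Pattern`, `Initiator`, and `needs_unsolicited_path_to_device` are hypothetical and do not come from any library or from the reference architecture.

```python
from enum import Enum


class Initiator(Enum):
    DEVICE = "device"
    SERVICE = "service"


class Pattern(Enum):
    TELEMETRY = "telemetry"
    INQUIRY = "inquiry"
    COMMAND = "command"
    NOTIFICATION = "notification"


# Which party starts each exchange, following the classification in the text.
INITIATED_BY = {
    Pattern.TELEMETRY: Initiator.DEVICE,
    Pattern.INQUIRY: Initiator.DEVICE,
    Pattern.COMMAND: Initiator.SERVICE,
    Pattern.NOTIFICATION: Initiator.SERVICE,
}


def needs_unsolicited_path_to_device(pattern: Pattern) -> bool:
    """True when the service must reach the device without a prior request.

    Inquiries also carry data back to the device, but only as a reply to
    something the device started, so they are not counted here.
    """
    return INITIATED_BY[pattern] is Initiator.SERVICE


for p in Pattern:
    print(p.value, INITIATED_BY[p].value, needs_unsolicited_path_to_device(p))
```

Only the two service-initiated patterns require the addressing, latency, and trust questions listed above to be answered for traffic that the device did not request.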
Connectivity pathways
In the architectures that we have worked on, there are four common connectivity pathways:
• Peer-to-peer: A method of communication between devices of a system without the use of a centralized administrative system. The peers in the network can exchange information and communicate only the necessary information back to the system. Besides providing the ability to create case-specific and self-organizing networks of devices, this method of communication enhances the capabilities of the system, because nodes can work together to become smarter. The disadvantages of this type of inter-device communication for smart systems are the lack of centralized control and the impact that this has on the security of the system. Peer-to-peer communication also requires a higher level of logic (“intelligence”) in some or all peers.
• Device-to-service: A device that communicates with a supporting back end in the system, often called the service.
• Service-to-device: A service that communicates with a device; the opposite of the previous connectivity pathway.
• Service-to-service: Communication between two separate systems, exchanging data to augment knowledge in the system.
Figure 5. Communication styles
From the work that we have done, we have learned that for predictive maintenance implementations, a bi-directional communication pattern is key to a manageable solution. Bi-directional communication ensures that the system can tell devices to change the way that they capture telemetry, for example, the rate at which it is captured or the fidelity of the readings. We have not come across a case where the requirements were simply to capture data from devices in a one-way communication flow. Because most systems will need a method of telling devices to capture data at a different frequency or with increased fidelity, we consider a one-way communication flow a subset of the more common pattern.
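As a minimal sketch of why the return path matters, the following Python example shows a device that changes its capture rate and reading fidelity when the service tells it to. The command name `set_telemetry`, its fields, and the `Device` class are assumptions made for illustration; a real device would also authenticate and authorize the command before acting on it, which is omitted here.

```python
from dataclasses import dataclass


@dataclass
class TelemetryConfig:
    interval_s: float = 60.0  # seconds between readings
    decimals: int = 1         # fidelity: decimal places kept per reading


class Device:
    def __init__(self) -> None:
        self.config = TelemetryConfig()

    def handle_command(self, command: dict) -> bool:
        """Apply a service-initiated command; return True if it was accepted."""
        if command.get("name") != "set_telemetry":
            return False  # unknown command: ignore it
        interval = command.get("interval_s", self.config.interval_s)
        decimals = command.get("decimals", self.config.decimals)
        if interval <= 0 or not 0 <= decimals <= 6:
            return False  # reject out-of-range settings
        self.config = TelemetryConfig(float(interval), int(decimals))
        return True

    def reading(self, raw_value: float) -> float:
        """Round a raw sensor value to the currently configured fidelity."""
        return round(raw_value, self.config.decimals)


device = Device()
before = device.reading(21.4567)
accepted = device.handle_command(
    {"name": "set_telemetry", "interval_s": 5, "decimals": 3}
)
after = device.reading(21.4567)
print(before, accepted, after, device.config.interval_s)
```

Without a service-to-device path, the only way to change `interval_s` or `decimals` in this sketch would be to touch the device physically, which is what the bi-directional pattern avoids.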
Connectivity network types
The connectivity type describes how a device and a service communicate. The type of connectivity chosen for a system has broad implications for its architecture. We commonly see three types of connectivity with different implementations and implications:
Figure 6. The increasing geographical reach of varying network types
• Wide area network (WAN). A good example of a WAN is a cellular network. This network type is a wireless network that is distributed over land areas called cells, each served by at least one fixed-location transceiver, known as a cell site or base station. In a cellular network, each cell uses a different set of frequencies from neighboring cells, to avoid interference and provide guaranteed bandwidth within each cell. When joined together, these cells provide radio coverage over a wide geographic area. This enables a large number of portable transceivers (for example, mobile phones and pagers) to communicate with each other and with fixed transceivers and telephones anywhere in the network, via base stations, even if some of the transceivers are moving through more than one cell during transmission.[28] The most common cellular network is the type that cellphones use. Cellphones and many integrated components for devices support network technologies such as Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), and its evolution, Long Term Evolution (LTE), as well as others. As with many technologies in the IoT ecosystem, there is still a large opportunity to optimize the resource usage and cost of these technologies.[29]
• Local area network/wireless local area network (LAN/WLAN). A LAN uses networking technology to connect computers and devices in a limited area, such as a home, school, computer laboratory, or office building. Unlike WANs, LANs cover a limited geographic area, and do not include leased telecommunication lines.
Ethernet over twisted-pair cables and Wi-Fi are the two most common technologies used in LANs today. Though Ethernet 10/100Base-T structured cabling is the basis for many commercial LANs, fiber-optic cabling is increasingly used in commercial applications. However, cabling is often inconvenient or impossible to use. With the increasing capability of WLAN devices that use radio waves based on the Wi-Fi industry standard, WLAN is now the standard for wireless connectivity. Wi-Fi has a maximum range of about 250 meters outdoors.[30] A set of connected LANs, often owned by a single entity and extending its range across a metropolitan area, is referred to as a metropolitan area network (MAN).
• Personal area network (PAN). One of the most interesting developments in networking is the use of a PAN to transmit data among devices, such as computers, telephones, and personal digital assistants. PANs can be used to communicate among the devices themselves (intrapersonal communication), or to connect to the Internet. Until recently, PAN devices could not communicate over IP, so they needed a bridge to translate between their proprietary protocol and IP. With the introduction and adoption of IPv6 over low-power wireless personal area networks (6LoWPAN), these devices, which use developing standards such as Bluetooth LE,[31] will communicate via IP directly and take a more active role in IoT. A wireless personal area network (WPAN) is a PAN carried over wireless network technologies such as the following:

[28] Wikipedia, Cellular network
[29] Ericsson Labs, “4G for IoT”

− Bluetooth and Bluetooth Low Energy (LE): A wireless technology standard for exchanging data over short distances between fixed and mobile devices (using short-wavelength UHF radio waves in the ISM band from 2.4 to 2.485 GHz), and for building personal area networks (PANs). Invented by telecom vendor Ericsson in 1994, it was originally conceived as a wireless alternative to RS-232 data cables.
It can connect several devices, overcoming problems of synchronization. Bluetooth LE[32] uses 5 to 10 times less power than older Bluetooth,[33] making it a good fit for certain IoT applications.
− Z-Wave: A wireless communications protocol designed for home automation to remotely control applications in residential and light commercial environments. Z-Wave is licensed through the Z-Wave Alliance.[34]
− ZigBee(-IP): Built on IEEE 802.15.4, the physical layer for low-rate WPANs, ZigBee[35] is often used to transmit low-powered periodic or intermittent data or a single signal from a sensor or input device, wireless light switch, electrical meter with in-home display, traffic management system, or other consumer and industrial equipment that requires short-range wireless data transfer at relatively low rates. The new ZigBee IP,[36] based on 6LoWPAN, lets ZigBee devices communicate without a bridge.
For an interesting comparison of power consumption between ZigBee and Bluetooth LE, see “Power Consumption Analysis of Bluetooth Low Energy, ZigBee and ANT Sensor Nodes in a Cyclic Sleep Scenario”.[37] Here are some of the observations from this study:
− BLE would appear to have an intrinsic disadvantage in a cyclic sleep scenario because the frequency hopping scheme it uses inherently takes longer to connect compared to the fixed RF channel used in ZigBee and ANT.
− BLE took longer for one connection (1.15 s) than ANT (0.93 s) and ZigBee (0.25 s). This is because the BLE node was able to sleep for longer between individual RF packets, improving its duty cycle significantly.
− We found that BLE achieved the lowest power consumption, followed by ZigBee and ANT. The parameters that dominated power consumption were not the active or sleep currents but rather the time required to reconnect after a sleep cycle and to what extent the RF module slept between individual RF packets.

[30] Wikipedia, IEEE 802.11, Protocols
[31] IEEE, Transmission of IPv6 Packets over BLUETOOTH Low Energy
[32] Wikipedia, Bluetooth Low Energy
[33] Bluetooth SIG, A look at the Basics of Bluetooth Technology
[34] Z-Wave alliance, Z-Wave For Developers And OEMs: How To Get Started
[35] ZigBee Alliance, ZigBee Specification Overview
[36] ZigBee Alliance, ZigBee IP Specification Overview
[37] Artem Dementyev, Steve Hodges, Stuart Taylor and Joshua Smith, Power Consumption Analysis of Bluetooth Low Energy, ZigBee and ANT Sensor Nodes in a Cyclic Sleep Scenario

Protocol choices
After you have chosen a connectivity type, you need to determine which protocols suit the purpose of your IoT solution. As you can see in the overview of logical protocols, the term protocol applies to different layers in the stack, and there are different protocols to choose from for each layer.

Transport-layer protocol choices
The transport layer provides communication services in the layered architecture of a network. In the Internet era, two such protocols have emerged as favorites: the connection-oriented Transmission Control Protocol (TCP) and the connectionless User Datagram Protocol (UDP). Depending on the environment that an IoT system must function in, the capabilities of its devices, and how much it must guarantee message delivery, you can choose to support either one of these protocols or both of them.
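To make "connectionless" concrete, the following Python sketch sends one UDP datagram over the loopback interface using the standard `socket` module. The payload `temp=21.5` and the use of an OS-assigned port are assumptions for illustration. With TCP (`SOCK_STREAM`), the receiver would additionally have to call `listen()` and `accept()`, and the sender `connect()`, before any data could flow.

```python
import socket

# Receiver: bind a UDP socket to a loopback port chosen by the OS (port 0).
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)  # do not block forever if the datagram is lost
address = receiver.getsockname()

# Sender: no connection setup is required before sending a datagram.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"temp=21.5", address)

# The receiver gets the whole datagram plus the (unverified) source address.
payload, source = receiver.recvfrom(1024)
print(payload, source)

sender.close()
receiver.close()
```

Nothing in this exchange confirms delivery to the sender; if the datagram were dropped, `recvfrom` would simply time out.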
The following figure and table provide an overview of the packet structure and a lightweight comparison of TCP and UDP, as well as factors to consider before choosing to use either protocol.
Figure 7. TCP and UDP basic packet structures
Table 1. Factors for using TCP vs. UDP

Capability                      TCP                   UDP
Connection type                 Connection-oriented   Connectionless
Reliability                     Full                  None
Protocol overhead               +++                   +
Resource usage                  ++                    +
Broadcast transmission support  No                    Yes
Ordering of packets             Active                None
Header size                     20 bytes              8 bytes
Error checking                  Yes, retransmit       Yes, no recovery possible
Acknowledgement                 Yes                   No
Special features                -                     Broadcasts, multicast

UDP is a good candidate to transmit data from constrained devices over constrained networks in close proximity, such as LANs or PANs where congestion and packet loss can be low. The following factors contribute to this:
• UDP has very little overhead compared to TCP.
• UDP is connectionless, so with no state to maintain, it uses less memory.
• UDP transactions require only two datagrams, which reduces network pressure.
• UDP has no retransmission delays.
On networks with a higher probability of packet loss, TCP, being more reliable and secure, is a viable candidate. The following factors contribute to this:
• TCP supplies reliability, which is especially important in long-haul communications where there is a high chance of packet loss.
• Because TCP is connection oriented, a device that uses TCP can better defend itself because it can ignore communications unrelated to current connections, whereas a device that uses UDP must accept every packet it receives on the listening port.
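The header sizes in Table 1 translate directly into per-message overhead for small telemetry payloads. The following Python sketch works through that arithmetic; the 12-byte payload is a hypothetical reading size, and the calculation deliberately ignores the IP header, TCP options, the TCP handshake, and acknowledgements, all of which add further cost.

```python
TCP_HEADER_BYTES = 20  # minimum TCP header, as listed in Table 1
UDP_HEADER_BYTES = 8   # fixed UDP header, as listed in Table 1


def header_share(payload_bytes: int, header_bytes: int) -> float:
    """Fraction of a segment or datagram taken up by the transport header."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    return header_bytes / (header_bytes + payload_bytes)


payload = 12  # hypothetical size of one telemetry reading, in bytes
udp_share = header_share(payload, UDP_HEADER_BYTES)  # 8 / (8 + 12)
tcp_share = header_share(payload, TCP_HEADER_BYTES)  # 20 / (20 + 12)
print(f"UDP header share: {udp_share:.1%}, TCP header share: {tcp_share:.1%}")
# prints "UDP header share: 40.0%, TCP header share: 62.5%"
```

For larger payloads the difference shrinks quickly, so header size matters most for devices that send many very small readings.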
In scenarios that use streaming video or audio, where high throughput is more important than guaranteed packet delivery, and in telemetry solutions that can tolerate missing segments, architects often choose UDP because packet loss is often a better tradeoff than the delays caused by TCP retransmission.[38] There are also scenarios where an occasional lost packet on the telemetry channel is acceptable but commands require guaranteed delivery, which makes the case for a composite model that uses both UDP and TCP.

Transport-layer protocol security
Another perspective on the choice between UDP and TCP is security. Because UDP is a connectionless protocol, it lacks the header values that TCP uses for connection management, such as those that keep track of packet ordering (sequence numbers) and packet delivery (acknowledgment numbers), shown in Figure 7. UDP is thus a more lightweight protocol, but its lack of these header values also lowers the barrier that an attacker has to overcome to send false information to the system. An attacker can “just” spoof the sender address on the IP layer, without also having to forge connection management information. In addition to spoofing, UDP is more susceptible to “flooding attacks,” where the attacker floods the system with requests, because it lacks flow control[39] and the resulting throttling behavior. TCP is also vulnerable to flooding attacks, but TCP systems can be fairly well secured by using SYN cookies.
In our work with customers, we have seen many who use devices with limited resources (for example, a 120 MHz microcontroller with 256 KB SRAM and 2 MB flash) and successfully use TCP as a transport protocol, although the stack in embedded systems often needs modification.[40] These customers needed reliability for long-haul (direct) communication.
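Because UDP carries no sequence numbers of its own, a telemetry channel that tolerates loss can still measure it by adding a counter at the application level. The following Python sketch assumes a hypothetical wire format (a 4-byte sequence number followed by a 4-byte float) and a hypothetical `GapTracker` class. It only detects gaps: a sequence number does not authenticate the sender, so it does not address the spoofing risk described above, and counter wraparound is not handled.

```python
import struct
from typing import Optional, Tuple

# Hypothetical wire format: network byte order, uint32 sequence, float32 value.
WIRE_FORMAT = "!If"


def encode(seq: int, value: float) -> bytes:
    """Build one telemetry datagram payload."""
    return struct.pack(WIRE_FORMAT, seq, value)


def decode(datagram: bytes) -> Tuple[int, float]:
    """Split a datagram payload into (sequence number, value)."""
    seq, value = struct.unpack(WIRE_FORMAT, datagram)
    return seq, value


class GapTracker:
    """Counts telemetry datagrams that never arrived, using sequence numbers."""

    def __init__(self) -> None:
        self.expected = 0  # next sequence number we expect to see
        self.lost = 0      # datagrams skipped so far

    def accept(self, datagram: bytes) -> Optional[float]:
        seq, value = decode(datagram)
        if seq < self.expected:
            return None  # duplicate or late datagram: ignore it
        self.lost += seq - self.expected
        self.expected = seq + 1
        return value


tracker = GapTracker()
first = tracker.accept(encode(0, 1.5))
second = tracker.accept(encode(3, 2.0))  # sequence numbers 1 and 2 are missing
late = tracker.accept(encode(1, 9.0))    # arrives after the gap was counted
print(first, second, late, tracker.lost)
```

In a composite model, a service could watch `lost` on the UDP telemetry channel and use the reliable TCP command channel to ask the device to resend or to lower its send rate.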
[38] Wireshark, Packet loss
[39] Wikipedia, Transmission Control Protocol, Flow control
[40] “Embedded”, Reworking the TCP/IP stack for use on embedded IoT devices

Application-layer protocol choices
From our experience, we have seen three dominant protocols on the rise in this space:
• Advanced Message Queuing Protocol (AMQP). AMQP is an open protocol for message-oriented middleware that originated at JP Morgan Chase. There, the same problems of connecting systems together would crop up regularly. Each time, the same discussions about which products to use would happen, and each time the architecture of some system would be curtailed to allow for the fact that the chosen middleware was reassuringly expensive.[41] The first implementation of AMQP was iMatix OpenAMQ,[42] but others have emerged as well, notably Apache Qpid,[43] Microsoft Azure Service Bus,[44] and RabbitMQ.[45] AMQP is a binary wire protocol with client support for programming languages such as C#, C, Java, Perl, Python, Ruby, PHP, and Lisp. Where many traditional queuing mechanisms have failed, AMQP seems to be thriving and is currently used in many systems, such as:[46]
− Aadhaar,[47] a large-scale identity system with 1.2 billion identities and about 100 million authentications per day.[48]
− The National Science Foundation’s Ocean Observatories Initiative, processing 3 petabytes per year.[49]
For more information, see the AMQP site to read the specifications on the protocol or try a free implementation.
• Constrained Application Protocol (CoAP).[50] Targeted mostly at resource-constrained sensors and actuators (devices) such as valves and switches, this protocol fits the bill for special-purpose networks, such as wireless sensor networks (WSNs),[51] with applications such as forest fire detection[52] and structural health monitoring.[53] CoAP is by default bound to UDP and optionally to Datagram Transport Layer Security (DTLS), which provides communications privacy.
With the default binding to UDP, CoAP supports multicast messaging, allowing for the addressing of a group of destinations at once. CoAP over TCP transport is currently in draft.
Figure 8. Multicast
• MQ Telemetry Transport (MQTT).[54][55] From the documentation, “MQTT is a Client Server publish/subscribe messaging transport protocol. The protocol runs over TCP/IP, or over other network protocols that provide ordered, lossless, bi-directional connections.” MQTT is a publish-subscribe protocol developed for machine-to-machine (M2M) communications, initially created by IBM, and currently undergoing standardization at OASIS.
In green-field projects that we have done, AMQP has been the best fit because, in addition to being efficient, reliable, flexible, and broker independent, AMQP is native to Microsoft Azure Service Bus, the key technology component for all these customer projects.

[41] Association for Computing Machinery, Toward a Commodity Enterprise Middleware
[42] See http://www.openamq.org
[43] Apache, Apache Qpid™
[44] Microsoft, AMQP 1.0 support in Service Bus
[45] See http://www.rabbitmq.com/
[46] Amqp.org, Products and success stories, Notable AMQP Users
[47] Unique Identification Authority of India, Aadhaar technology
[48] Slideshare, Big Data at Aadhaar (slide 9)
[49] OOI, CIAD COI TV RabbitMQ
[50] Wikipedia, Constrained Application Protocol
[51] Wikipedia, Wireless sensor network
[52] Wikipedia, Wireless sensor network, Forest fire detection
[53] Wikipedia, Structural health monitoring, Examples
[54] See MQTT.org
[55] OASIS, MQTT 3.1.1 draft 01 / public review draft 01

Security
With devices communicating sensitive information and acting on our behalf, we clearly need to ensure that the system and the information it captures, processes, and stores are secure. With any system, security is a tradeoff with other requirements, such as user friendliness, performance, cost, and so on.
In this section, we cover some important security aspects we have come across while working in this field.

“This is the weather forecast for the week of June 16, 2024 for Texas,” the weatherman says. “Last week was hot, but this week will be sizzling, with temperatures reaching in excess of 110 degrees, with no rain expected.” In hot weather, irrigation is the key to crop and cattle survival. Because most of the state’s farmers are using a new irrigation system that depends on thousands of sensors to determine the best time to irrigate, few of them worry. What they don’t know is that the system is sending faulty telemetry information that indicates that it rained every day last week. This keeps the system from irrigating, and now crops and cattle start to die.

When distributed systems directly influence the physical world by turning valves, controlling servos, and much more, there is a clear need to ensure that compromised systems do not kill crops, cattle, and people, burn buildings, or crash cars. The security bar for commands and data that make things move must be much higher than in e-commerce or finance.
Let’s start with a short list of questions about security for the kinds of systems that we have come across in our work on predictive maintenance, a list of factors to think about as you architect an IoT system. On top of normal security precautions, you also need to know how to:
• Securely onboard new devices. You must ensure that only devices that the system can register are allowed into the system.
• Prevent devices from being duplicated or substituted. Because devices provide data that the system will directly or indirectly act upon, you must be able to trust data from devices. Peripherals that can be duplicated or substituted might allow a rogue entity to flood the system with false but trusted data. Also, in the past, a pirated copy of a device used to cost money in terms of a lost sale.
If it is a connected device, it now also incurs actual costs for the connectivity and cloud compute needed to support and interact with the device.
• Ensure that device data can be trusted. As devices communicate, you need to ensure that the data that they transmit is received unaltered and from verified sources, so that the data logged in the service by the device is trustworthy and represents a point-in-time observation. In information-security terms, this requires integrity and authenticity of data.[56]
• Ensure the confidentiality of messages in transit and at rest. Because IoT systems span multiple physical networks and transport information over public and unknown networks through dynamic routes, information in transit must be secured against observation by non-authorized third parties.
• Prevent devices from denying service. In modern software architecture, the level of interdependencies is high and increasing. Dependencies within the system, such as devices measuring data potentially critical to effective decision-making, need to be available and accessible.
• Accept only authorized commands on devices. In any system that acts on external commands, and especially one that interacts with the physical world, it is imperative to ensure that those commands are only acted on if they are properly authenticated and authorized.
• Remove rogue devices from the system. If you find a bad actor such as a compromised device in the system, you must be able to remove it quickly.
• Authenticate peers. If a system supports peer-to-peer communication among devices, for example, to enrich information or to enable intelligent decision-making at the edge without service intervention (autonomous system operation), you must have an authentication mechanism in place to ensure that peers in the system are talking to trusted neighbors.

[56] Wikipedia, Information Security, Authenticity

• Ensure that devices are always connected to a particular service.
A powerful part of how modern communication works is the use of hyperlinks to let clients dynamically reroute traffic. Devices will blindly follow these hyperlink redirects without thinking twice (or once, for that matter). Besides offering flexibility, redirects pose a substantial risk if someone redirects the dataflow into an intermediate system to alter system behavior, copy the data, or modify the data stream.
• In combinatory devices, ensure fine-grained security is possible. When a customer's component is embedded inside a larger system, such as smart brakes inside a train or components inside machines, ensure each interested party has access to the right information and commands, and that when a component is replaced, it is no longer authorized to act as part of the larger device.

Virtual Private Networks

A common way to connect networks over an untrusted network is to use a virtual private network (VPN).[57] VPNs act as a virtual network card on both ends of the connection, combining two networks as if they were a single entity. The issue with this approach is that a VPN merely provides secure virtual network cables; it is the two networks, and therefore everything in them, that are connected.

Figure 9. VPN connecting two networks at the link-layer

After the connection is established, the VPN provides access to all layers above the link-layer from any device on either network. A VPN does not help establish any notion of authentication and authorization beyond its immediate scope. A network application that sits on the other end of a TCP socket, where a portion of the route is facilitated by the VPN, is oblivious to the VPN's existence because it acts on the transport and application layers of the network model.
What matters for the trustworthiness of the information that travels from the logic on the device to a remote control system that does not reside on the same network, as well as for commands that travel back to the device, is solely a fully protected end-to-end communication path spanning networks, where the identity of the parties is established at the application layer. Protecting the route at the transport layer by signature and encryption is done as a service for the application layer either after the application has given its permission (for example, via certificate validation hooks) or just before the application layer performs an authorization handshake, before entering into any conversations. Establishing end-to-end trust is the job of application infrastructure and services, not of networks.

[57] Wikipedia, Virtual private network

Compliance

For vertical sectors such as government and healthcare, compliance is a key consideration as you architect an IoT solution. National and local governments and industry groups have mandates that affect what a company can share and with whom. Conversely, some regulations require the sharing of data among government entities or businesses that work on government programs. The EU has model clause regulations that dictate the storage and exposure of personal data.[58] The U.S. has similar regulations, such as the Health Insurance Portability and Accountability Act (HIPAA)[59] and the Privacy Act.[60] Other countries and entities also have privacy mandates that consider the location of stored data, its origin, the location and nationality of the users, and the location, nationality, and use of the data consumers. If ingested, processed, or published data offers no way to discern details about specific people, it is less likely to be affected by regulation.
But all data that is made available to the public or even a controlled set of partners must be reviewed to adhere to all applicable mandates because violations present high legal[61] and reputational risks.[62]

Healthcare

The HIPAA and HITECH laws in the U.S. apply to healthcare and partner organizations that have access to sensitive patient information, called electronic protected health information (ePHI). Service providers that work with these entities usually must agree in writing to adhere to security and privacy provisions set forth in HIPAA and the HITECH Act. If an IoT system that supports applications such as the one we described in the Healthcare scenario captures ePHI, it must adhere to these laws. Microsoft provides a Business Associate Agreement as a contract addendum to its cloud platform, Microsoft Azure.[63] We also provide information on some of the best practices for HIPAA-compliant applications, and we detail Microsoft Azure provisions for handling security breaches.[64]

[58] European Commission, Protection of Personal Data
[59] US HHS, Health Information Privacy
[60] US HHS, Privacy Act
[61] TechRepublic, Data security laws and penalties: Pay IT now or pay out later
[62] Experian, Reputation Impact of a Data Breach
[63] U.S. Department of Health & Human Services, Health Information Privacy, Business Associates
[64] Microsoft, Azure HIPAA Implementation Guidance

Device communication patterns

Many current IoT communication approaches try to answer the basic addressing question with traditional network techniques. That means that the device either gets a public network address or it becomes part of a virtual network and then listens for incoming traffic using that address, acting like a server. In this section, we document various architectural approaches that we have seen, highlight their characteristics, and then propose an alternative that is suitable for many IoT scenarios.
NAT-based device network

This architectural design approach uses network address translation (NAT)[65] to expose internal devices, which usually use a private IP address, to the outside world by reserving a port on the edge device and mapping this port to the private IP address. The following diagram illustrates this approach.

Figure 10. NAT-based device network

The previous figure shows a device that uses an internal IPv4 IP address (192.168.1.112) that is listening on port 8088. The device is exposed to the outside world on IP address 127.x.x.x, using port 721. The DNS entry associated with this IP address is device.mynetwork.com. Clients accessing device.mynetwork.com on port 721 will be routed directly to the internal device. This approach has been used in many traditional networks, and depending on the scenario, it can still work today. However, we have found this scenario to be typically limited by the number of devices that it can support (about 65,000, due to the number of available ports), the need for devices to be statically located (not moving), and the fact that every exposed device needs to act like a server (receiving, parsing, and answering arbitrary requests from clients), which increases its attack surface for malicious abuse.

[65] Wikipedia, Network Address Translation

IPv6 direct-addressing device network

With the rollout of IPv6, it is natural to think about giving every device in an IoT solution its own publicly routable IP address to let it connect to peers, services in the system, or other systems. The following diagram conceptually depicts this model, which we have seen many times.

Figure 11. IPv6 direct-addressing device network

We mentioned the drawbacks with this approach at the start of this section. Many current IoT communication approaches try to answer the basic addressing question with traditional network techniques.
That means that the device either gets a public network address or it becomes part of a virtual network and then listens for incoming traffic using that address, acting like a server. For NAT-based device networks that use either of these protocols, a device needs to act like a server, and with the implicit direct-connectivity model, it must be stationary to avoid connection loss, or it must employ application-layer measures that can handle this scenario.

NAT-based, PAN device network

For power-constrained, mostly wirelessly connected PAN devices that are often not IP-based, a common approach to bridging the last few feet of connectivity is to use a hub device wired to the main network that can bridge to the devices on the local network. The following figure illustrates this approach.

Figure 12. NAT-based, PAN devices network

Even though a hub translates between IP and the various PAN protocols, the problem space is the same as with other NAT-based device networks that we described.

Generic concerns with direct addressing

All of the preceding architectures that provide direct addressability for devices share common concerns. As each device is publicly addressable, it needs to handle inbound commands itself, taking care of all application layer responsibilities, such as hosting the server accepting inbound connections, interpreting commands, queuing requests, and so on. Many devices in large-scale deployments will have limited resources, which constrains the number of socket connections that they can handle and leaves them open to simple denial-of-service (DoS) attacks.[66] In this approach, the devices would also have to handle the authentication of users for command and control, using the already scarce sockets, memory, and compute power to call out to a service or connect to a database and handle its responses and I/O.
[66] Wikipedia, Denial-of-service Attack

Service-assisted communication

Another approach to connecting a large number of devices to the central service within a system is to have the device connect to a well-known service (called a gateway) and then use that service to tunnel commands to the device. The goal of this approach is to establish trustworthy and bi-directional communication paths between control systems and special-purpose devices that are deployed in untrusted physical space. To that end, the following principles are established:

• Security trumps all other capabilities. If a capability cannot be implemented securely, it must not be implemented. Threats are identified and either mitigated or accepted.
• Devices do not accept unsolicited network information. All connections and routes are established in an outbound-only fashion.
• Devices are peered with a gateway and only connect or establish routes to well-known services. If devices need to feed information to or receive commands from a multitude of services, they are peered with a gateway that takes care of routing information downstream. The gateway ensures that commands are accepted only from authorized parties before it routes them to the devices.
• The communication path between device and service or device and gateway is secured at the application protocol layer. This mutually authenticates the device to the service or gateway and vice versa. Because the application does not normally concern itself with lower-level layers in the network stack, as we discussed earlier in Connectivity, device applications do not trust the link-layer below.
• System-level authorization and authentication must be based on per-device identities. One identity per device ensures that you have granular control over which devices can access the system, provide data, and receive commands.
• Access credentials and permissions must be revocable.
In case of device abuse, the system must be able to quickly respond by removing the device as an authorized part of the system.
• Bi-directional communication for devices may be facilitated by an intermediate store. Devices that are connected sporadically due to power or connectivity concerns can be supported by holding commands and notifications for them in a queue or mailbox structure until they connect to retrieve them.
• Application payload data may be separately secured. This is to protect transit through gateways to any particular service.

Figure 13. Service-assisted communication pattern

From the previous illustration, we can derive the following set of attributes:

• Device. The device acts like a client; it connects to the gateway and does not listen for unsolicited traffic. The device connects to an external gateway by creating and maintaining an outbound TCP socket across a NAT boundary or by establishing a bi-directional UDP route, potentially using mechanisms such as Session Traversal Utilities for NAT (STUN) or, with larger NATs, Traversal Using Relays around NAT (TURN). These facilitate the detection of a NAT and the discovery of the public IP address of the network for binding.
• Connection. The connection is routed through the edge device, usually a router. Because the connection is outbound, the port mapping is performed automatically. By only relying on outbound connectivity, the NAT/Firewall device at the edge of the local network will never have to be opened up for any unsolicited inbound traffic. The outbound connection or route is maintained by either client or gateway in a fashion that intermediaries such as NATs will not drop due to inactivity. That means that either side might send some form of a keep-alive packet periodically, or send a payload packet periodically that then doubles as a keep-alive packet.
Under most circumstances it will be preferable for the device to send keep-alive traffic as it is the originator of the connection or route, and it can and should react to a failure by establishing a new one. As TCP connections are endpoint concepts, a connection will only be declared dead if the route is considered collapsed, and the detection of this fact requires packet flow. A device and its gateway may therefore sit idle for quite a while believing that the route and connection are still intact before the lack of acknowledgement of the next packet reveals that this assumption is incorrect. This conflict in behavior calls for a tradeoff decision to be made.

Carrier-grade NATs (CGNs) employed by mobile network operators permit very long periods of connection inactivity, and mobile devices that get direct IPv6 address allocations are not forced through a NAT at all. The push notification mechanisms employed by all popular smartphone platforms use this to dramatically reduce the power consumption of the devices by maintaining the route very infrequently (every 20 minutes or more) so the devices can remain in sleep mode with most systems turned off while idly waiting for payload traffic. The downside of infrequent keep-alive traffic is that the time it takes to detect a bad route is, at worst, as long as the keep-alive interval. Ultimately, it is a tradeoff between battery power and traffic-volume cost (on metered subscriptions) on one side and acceptable latency for commands and notifications in case of failures on the other. The device can actively detect potential issues and abandon the connection and create a new one when, for instance, it hops to a different network or when it recovers from signal loss.

The connection from the device to the gateway is protected end-to-end and ignores any underlying link-level protection measures. The gateway authenticates with the device and the device authenticates with the gateway, so neither is anonymous to the other.
In the simplest case, this can be done by exchanging a previously shared key. As we see quite often in more capable devices, it can also be done via an X.509 certificate exchange as performed by Transport Layer Security (TLS), or a combination of a TLS handshake with server authentication where the device later supplies credentials or an authorization token at the application level. The privacy and integrity protection of the route is also established end-to-end, ideally as a byproduct of the authentication handshake so that a potential attacker cannot waste cryptographic resources on either side without producing proof of authorization.

Today, TLS/DTLS and Secure Shell (SSH) dominate as application-level connection security protocols. SSH is popular, but it lacks a standard session-resumption gesture. TLS supports both the X.509 certificate-exchange model and a simplified model (TLS-PSK) that uses previously shared keys. Removing support for X.509 certificate handling and wire-level exchange reduces the footprint of the TLS library, and by reducing the supported algorithms (for example, supporting only AES-256 and SHA-256), it’s feasible to use this protocol on compute- and memory-constrained devices while remaining compatible with other application layer protocols that rely on TLS. The result of all this is a secure peer connection between the device and a gateway that only the gateway can feed.
• Edge security. Because there are no ports open to listen on the edge device, the attack surface on the local network and its devices is minimized.
• Gateway. The connection is accepted by a hosted process called a gateway, a system hosted in an environment that is defendable against external threats, either at the edge of the internal network or based in the cloud. It provides a well-defined endpoint and API for clients to connect to and communicate with, effectively acting as a proxy for the device.
Eventual peer-to-peer connections inside the network are acceptable, but only if the gateway permits them and facilitates a secure handshake between the peers. If an authorized client wishes to send a command (or a reply to a previous request) to a device, it can do so by sending the command to the gateway, which provides one or even several different APIs and protocol surfaces that can be translated to the primary bi-directional protocol used by the device.

As the gateway is a layer of abstraction, it provides the device with a stable address, location transparency, and location hiding. Because of this abstraction, the device could be limited to speaking AMQP, MQTT, or some proprietary protocol, and yet have a full HTTP/REST interface projection at the gateway, with the gateway taking care of the required translation and also the enrichment where responses from the device can be augmented with reference data. The device can connect from any context and it can even switch contexts, yet its projection into the gateway and its address remain completely stable. The gateway can also be federated with external identity and authorization services, so that only callers acting on behalf of particular users or systems can invoke particular device functions. The gateway therefore provides basic network defense, API virtualization, and authorization services, all combined into one.

This approach gets even better when it includes or is based on an intermediary messaging infrastructure that provides a scalable queuing model for both ingress (device to cloud) and egress (cloud to device) traffic. Without this intermediary infrastructure, this approach would still suffer from the issue that devices must be online and available to receive commands and notifications when the control system sends them.
With a per-device queue or per-device subscription on a publish/subscribe infrastructure, the control system can drop a command at any time, and the device can pick it up whenever it is online. If the queue provides time-to-live expiration alongside a dead-lettering mechanism for such expired messages, the control system can also know immediately when a message has not been picked up and processed by the device in the allotted time.

The queue also ensures that the device can never be overtaxed with commands or notifications. The device maintains one connection into the gateway and it fetches commands and notifications on its own schedule. Any backlog forms in the gateway and can be handled there accordingly. The gateway can start rejecting commands on the device’s behalf if the backlog grows beyond a threshold or the cited expiration mechanism kicks in and the control system gets notified that the command cannot be processed at this time.

On the ingress side (from the gateway perspective), using a queue has the same kind of advantages for the back-end systems. If devices are connected at scale and input from the devices comes in bursts or has significant spikes around certain hours of the day, such as with telematics systems in passenger cars during rush hour, having the gateway deal with the traffic spikes keeps the back-end system robust. The ingestion queue also allows telemetry and other data to be held temporarily when the back-end systems or their dependencies are taken down for service or suffer from service degradation of any kind.

Designing for scale

The opportunity for IoT is in the ubiquity of connected devices, the volume of data that those devices will supply, the intelligence to be gained from that data, and the command/control that we can exert on the devices. All of these aspects mean that the solution must be designed to scale at all levels.
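As a brief aside that looks back at the per-device command queue described for service-assisted communication, the following Python sketch illustrates time-to-live expiration, dead-lettering, and backlog-based rejection. It is a minimal in-memory illustration under stated assumptions, not the Service Bus API: the names DeviceMailbox, Command, send, fetch, and max_backlog are hypothetical and do not come from this paper or from any real library.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Command:
    body: str
    expires_at: float  # absolute expiry time, in seconds


class DeviceMailbox:
    """Hypothetical per-device command queue held at the gateway."""

    def __init__(self, max_backlog: int = 100) -> None:
        self.max_backlog = max_backlog
        self.pending: Deque[Command] = deque()
        self.dead_letters: List[Command] = []

    def _expire(self, now: float) -> None:
        # Move commands whose time-to-live has elapsed to the dead-letter list,
        # where the control system can inspect them.
        live: Deque[Command] = deque()
        for cmd in self.pending:
            if cmd.expires_at <= now:
                self.dead_letters.append(cmd)
            else:
                live.append(cmd)
        self.pending = live

    def send(self, body: str, ttl_seconds: float,
             now: Optional[float] = None) -> bool:
        # Called by the control system. Returns False when the gateway
        # rejects the command on the device's behalf (backlog too large).
        now = time.monotonic() if now is None else now
        self._expire(now)
        if len(self.pending) >= self.max_backlog:
            return False
        self.pending.append(Command(body, now + ttl_seconds))
        return True

    def fetch(self, now: Optional[float] = None) -> Optional[Command]:
        # Called by the device, on its own schedule, over its outbound connection.
        now = time.monotonic() if now is None else now
        self._expire(now)
        return self.pending.popleft() if self.pending else None
```

The explicit now parameter exists only to make the behavior easy to exercise deterministically; a production gateway would rely on a durable messaging service rather than process memory.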
In many respects, designing an IoT solution to scale effectively involves the same considerations as any large-scale solution. While IoT does not require a cloud-based deployment, in most cases, taking advantage of the cloud makes sense because of usage-based pricing, a simple model that scales, geographic availability, and infrastructure support provided by the cloud vendor. Many documents and articles have been written about cloud application scalability and availability. For a good overview of this topic, see “Failsafe: Guidance for Resilient Cloud Architectures.”[67] The Microsoft patterns & practices team also has a large body of work on Cloud Development that provides guidance on building scalable cloud systems.[68]

However, there are specific scalability areas that come up more frequently in IoT scenarios than in other IT solutions. One area is identity. For web properties, the concept of identity federation has taken hold, and most modern consumer web properties now allow a user to use their identity from other well-known identity stores, such as an account registered with Microsoft, Facebook, Google, Yahoo, and so on. Additionally, corporate accounts can be federated with platform as a service (PaaS) vendors and partners. But with the addition of devices, there will often be identities associated with those devices, relationships between those devices and human identities, and relationships between multiple humans and devices. This potentially complex set of relationships should be considered early in an IoT project, and the solution should strive to simplify these relationships as much as possible. In our project experience we have not yet seen a pattern that handles this level of complexity with satisfactory results. The initial projects have used Azure Active Directory for human identities, and external data stores for device identity and the associations with Azure Active Directory users.
Design, prototyping, and testing are ongoing to find more scalable, resilient, and feature-complete solutions.

Communication and ingestion

Another scalability area that is tested is the set of communication paths for ingestion. Most solutions will require secure, authenticated communication between devices and the collection point. Additionally, any implementation choice for messaging technology will have scalability limits in certain properties, such as messages per unit (for example, per queue or topic) and bandwidth per implementation unit (for example, per subscription or instance). These parameters must be well understood, planned for from the beginning of the project, and tested and verified as the architecture and solution progress.

Our projects have used the Microsoft Azure Service Bus[69], with Azure Active Directory Access Control (ACS)[70] keys granted for each device. In a generic solution, some type of secure key must be generated that makes a device unique and that only that device knows. The system it connects to must know about the device and its key, and then verify that they match when messages arrive. The Service Bus and ACS provide these capabilities, making them a good fit. The solutions use Topics and Subscriptions[71], and they are designed to take into account the scalability parameters of the Service Bus[72], and use as many topics as needed to comfortably support the number of devices in the system and scale to additional topics if and when additional devices are added to the system.

[67] MSDN, Failsafe: Guidance for Resilient Cloud Architectures
[68] MSDN, Cloud Development
[69] MSDN, Service Bus
[70] MSDN, Access Control Services 2.0
[71] MSDN, Service Bus Queues, Topics, and Subscriptions

Data storage scalability

Scalable data storage is another area that will be important in these projects. Because of the expected volume of data, blob storage will normally be the preferred choice.
The reason for this is that blob storage is the lowest cost storage option, and Big Data analysis tools are built to work against blob storage. Depending on the volume and the geographic dispersion of the devices, the solution may need to use multiple storage accounts, and it may also need to move data from collection data centers into a single data center in order to perform analysis on that data. For additional guidance on managing the data, see “Data Management Patterns and Guidance”[73] on the Microsoft Developer Network.

Device registration

Registering devices is the critical first step to take to ensure that the system is secure and remains secure, that it only allows data to be ingested from trusted endpoints, and that devices only accept commands from trusted systems. A device must be uniquely identified, the system must authenticate its identity, and the device must know that it is communicating securely with only the correct collection endpoint.

Often a device will be created with the knowledge of the expected endpoints, or at least have some influence over the collection point. An example of this is a vehicle whose manufacturer is selling a connected vehicle experience. In this scenario, when the device is manufactured, a unique key will be stored on the device. Either that key or a public key associated with it will be stored in a database, and when the device is enabled, the service can check the database and verify that the device is an approved device. These keys may be service-generated, such as by Azure Active Directory Access Control Services (ACS), or keys created to support the TLS-PSK pattern as described earlier in this paper, or keys intended for service-specific authentication. Typically, even when the device carries a key out of the factory, the device will become “active” in another step; for example, when a customer purchases, installs, and configures the device.
Configuration will associate a user with the device, which transforms it into an active device. The device may be issued a new key at this time.

In other cases, the set of potentially connected devices will not be known at manufacturing time, so keys cannot be installed on the device prior to its release. In this case, device registration must happen when the device is installed or activated. An example of this might be a traffic service that will collect GPS and movement telemetry from a smartphone, and in turn provide free traffic information for users who opt in to share data. In this case, there would be a registration step where a user must identify the device to the service, the service then sends a key to the device, and then that key is used to manage communication.

Equally important to device registration is the ability to unregister the device, or disable it. This is critical because even though the communication with the device is secure, the device itself can become compromised. Being able to unregister the device and refuse communication is a critical aspect of the system. With device-specific keys, the keys can be revoked and the system can quickly stop accepting telemetry from the device.

[72] Microsoft, Service Bus Scalability
[73] MSDN, Data Management Patterns and Guidance

Acquiring data

IoT data acquisition is frequently referred to as data ingestion. In literature about Big Data, the three Vs (volume, variety, and velocity) are often cited.[74] There are other aspects to consider as well. In our initial engagements in IoT, we have seen that device bandwidth, connection speed, reliability, and cost have been major influencers in the solution choices made. But each item in this section is important, and the relative importance of each will vary depending on a project’s requirements. The following sections discuss many aspects of data ingestion.

Message size and format

Messages from devices are the lifeblood of IoT.
In a world with no boundaries, we might collect all telemetry data and analyze it extensively, or simply save it in case we need it later. In the real world, we need to consider the size of the message, which will be affected by its number of attributes, the data types, the message formats, the message overhead, and the security overhead.

Many common message formats are in use today. Extensible Markup Language (XML) and JavaScript Object Notation (JSON) are common. Binary JSON (BSON), Protocol Buffers, and Avro are more compact formats that are often used when message size and bandwidth are constrained. XML is supported by all development tools and is easy to understand, but its tags can often cause message-size bloat. JSON is quickly becoming as ubiquitous as XML; it is more compact than XML while retaining XML's readability.

In IoT there is often a premium on memory, bandwidth, and connection cost, so compact message formats can be useful. BSON is a binary encoded version of JSON. It allows you to encode binary data in the message, and it enables storing data as raw bytes versus text. Protocol Buffers define a method of serializing structured data. They were developed at Google, and then given to the open source community. Protocol Buffers are compact, but not self-describing like XML and JSON, so sender and receiver must understand the message being transmitted. Avro is another option for compact formats. Its encoded data is not self-describing on its own, but it is always accompanied by a schema, so no code generation or prior knowledge of the schema is required for processing on the message-receiving end.

Ultimately, choosing one of these formats comes down to how to balance development environment support, device support, the need for compactness, and storage and processing requirements on the message-receiving side.
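To make the size tradeoff concrete, the following Python sketch encodes the same telemetry reading as XML, as compact JSON, and as a fixed binary layout, and compares the byte counts. It is an illustration under stated assumptions: the reading and its field names are hypothetical, and the binary layout uses the standard library's struct module as a stand-in for a compact, non-self-describing format. It is not Protocol Buffers, BSON, or Avro.

```python
import json
import struct

# Hypothetical telemetry reading; the field names are illustrative only.
reading = {
    "device_id": 1042,
    "timestamp": 1718530000,
    "temperature_c": 71.5,
    "rpm": 1480,
}


def to_xml(r: dict) -> bytes:
    # Hand-built XML: every value is wrapped in an opening and closing tag.
    fields = "".join(f"<{k}>{v}</{k}>" for k, v in r.items())
    return f"<reading>{fields}</reading>".encode("utf-8")


def to_json(r: dict) -> bytes:
    # Compact separators remove optional whitespace.
    return json.dumps(r, separators=(",", ":")).encode("utf-8")


# Fixed layout that sender and receiver must both know in advance:
# little-endian uint32, uint32, float32, uint16 (no field names on the wire).
LAYOUT = struct.Struct("<IIfH")


def to_binary(r: dict) -> bytes:
    return LAYOUT.pack(r["device_id"], r["timestamp"],
                       r["temperature_c"], r["rpm"])


def from_binary(payload: bytes) -> dict:
    device_id, timestamp, temperature_c, rpm = LAYOUT.unpack(payload)
    return {"device_id": device_id, "timestamp": timestamp,
            "temperature_c": temperature_c, "rpm": rpm}


sizes = {
    "xml": len(to_xml(reading)),
    "json": len(to_json(reading)),
    "binary": len(to_binary(reading)),
}
print(sizes)
```

The binary payload is 14 bytes because it carries no field names, which is also why the receiver cannot interpret it without knowing the layout. That is the same tradeoff the text describes for formats that are compact but not self-describing.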
Message types

Your system may require different message types that can differ in schema, data type, or both of these. A real-world example of this is a connected vehicle system that predominantly sends telemetry information for predictive maintenance. This system might also be used to send audio or video clips for emergency management, accident recording, and so on. In these cases, the media files are often enhanced with metadata related to the collection of the media file. Additionally, the media messages may be of lower or higher priority and they may require splitting, compression, resumption on error, and temporary local storage. If different device types are involved, they may provide media files in formats or encoding levels that are optimized or specific to those devices, which could require normalization at the storage point.

[74] See The 3 Vs of BIG data

Message priority

Different message types will often have different priorities in an IoT system. A message can be a standard telemetry message that is intended specifically for cataloging, and used for machine learning algorithms downstream. There can be other message types that are considered events and alarms. An event could be an elevator door opening, a car starting, or the temperature being increased in a home, whereas an alarm might be a broken window, a car crash, or a full engine failure.

Message priority will be handled either by providing a separate endpoint for priority messages, or by detecting attributes in the message itself to assign priority. Using a separate endpoint for priority messages can reduce the chance of a high priority message delivery being slowed by a flood of the standard flow messages. If the throughput of the initial point of ingestion is considered adequate, then downstream detection is an option, for instance creating a standard subscription and a high priority subscription on an Azure Service Bus Topic.

There are also cases where device priority may be required.
In a connected vehicle scenario, there may be a premium service that has priority, or there may be sensors in a building with relative priority, such as one that detects a broken window on the first floor that has higher priority than one on the fifth floor of the building. In this case, the priority may be handled similarly to message priority. Another approach is to use a separate service that handles the higher priority devices.

Conditional messaging

In some of our projects, the solution required the message pattern to change based on conditions. In this case, if a service technician received an alert that an elevator needed attention, the technician could send a message to the device asking for it to increase the detail and frequency of messaging. This would continue for a configurable timeframe. This type of requirement means that the solution must be scaled to handle the conditional events. For instance, if the devices could automatically increase the size and frequency of messages, they could cause a dramatic increase in traffic to the system. Safeguards and throttling should be considered to protect against unplanned data floods in such situations.

Contextual messaging

Similar to conditional messaging, there are use cases that require contextual messaging, which can follow multiple patterns. There may be situations where the device includes contextual information in the messages that it sends. The data may include GPS coordinates, and a vehicle may need to send additional telemetry when it travels above a certain altitude, or if the ambient temperature rises above a trigger level. The context may require more data in messages, the collection of data from other sensors on the device, or it may require more or less frequent message transmission.

Message batching

The natural inclination may be to send messages immediately when data is generated, but there are several reasons why messages may be batched.
A device may be power constrained, so the connectivity may only be turned on for a limited amount of time. The connection may be unreliable, so it could make sense to batch the collected messages for a single transmission once connectivity is available. The device may move in and out of connectivity, or connectivity may be congested or less expensive at certain times of the day. If you allow batched messages, the message receiver must be designed to accept them as well as single messages. In this case, a message envelope that can contain multiple messages or a single message can simplify the solution.

Bandwidth and scale

Previous topics in this paper discuss bandwidth from the device. The bandwidth and scale of the collection points must also be considered. The size of the network pipe out of the device environment may be constrained. For example, if the solution is collecting building telemetry, and there are devices that are connected to an internal network and sent to an external collection point, the effect on the capacity of the building network should be evaluated.

The collection points will also have an upper bound. For example, Microsoft Azure Storage and Service Bus have capacity targets. If your solution needs to extend beyond the targets of the enabling technology, then a scale-out approach should be designed for the project. This approach should include plenty of excess capacity for growth and unplanned spikes. In our projects, we typically plan for no more than 50 percent capacity at steady state.

If the connected devices are geographically distributed, consider scaling out the solution to multiple data centers. This can introduce the complexity of directing device traffic to the right collection points. In our projects, we have found success in assigning devices to data centers so that no single device traffic needs to “find” where its data should go. If the device moves geographically, then it may need to be reassigned.
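Returning to the message envelope suggested under message batching, one possible shape is sketched below. The envelope fields (`deviceId`, `messages`) and the helper names `wrap` and `unwrap` are assumptions for illustration, not a format prescribed by this paper. The idea is that a single message is simply a batch of one, so the receiver needs only one code path.

```python
import json
from typing import Any, Dict, List


def wrap(device_id: str, readings: List[Dict[str, Any]]) -> str:
    """Device side: put one or many readings into the same envelope shape."""
    return json.dumps({"deviceId": device_id, "messages": readings})


def unwrap(body: str) -> List[Dict[str, Any]]:
    """Receiver side: return every reading, tagged with its device id."""
    envelope = json.loads(body)
    return [
        dict(reading, deviceId=envelope["deviceId"])
        for reading in envelope["messages"]
    ]


# A single message and a batch collected while offline use the same envelope.
single = wrap("elevator-17", [{"doorCycles": 3}])
batch = wrap("elevator-17", [{"doorCycles": 3}, {"doorCycles": 5}, {"doorCycles": 4}])

print(len(unwrap(single)), len(unwrap(batch)))
```

A real envelope would likely also carry per-reading timestamps, because batched readings arrive long after they were generated.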
It is important to understand how the data will be used, and whether it needs to be aggregated before use or can be used autonomously in the data center where it was collected.

Storing information

In an IoT solution, there are also several aspects to consider for data storage. The following sections discuss many aspects of this topic.

Storing data on the device

The critical telemetry data is generated on the device, or prior to getting to the device in the case of a gateway. The data may be cached and preprocessed on the device. The reasons for doing this include the desire to optimize the amount of data sent, to minimize “noise” data from analysis, to save on storage costs at the central storage location, to minimize transmission time or cost, to account for unreliable connectivity, and so on. If data will be stored on the device either temporarily or permanently, there are several local storage considerations, such as those on security, reliability, and capacity. If data is stored on the device, the solution architect needs to consider the implications of losing the data, whether the data will expire on the device if it cannot be sent to external storage, and how the system will detect and recover from missing data, should a local outage occur.

Transforming data

Generally the data will go through multiple transformation steps as it is generated, sent, stored, and processed. As stated in the previous section, there may be data transformation happening on the device itself, such as format conversion, aggregation, and so on. This will rely on local processing capabilities. Other than the local preprocessing, any other transformation would happen at the collection point. For years, data processing has been thought of in terms of Extract, Transform, and Load (ETL). With the advent of Big Data, much of the discussion has changed to Extract, Load, and then Transform (ELT).
The key concept in this transition is that your system is ingesting a huge amount of data, and the transformation process costs significant compute power. Additionally, while this transformation is happening, the data is at risk. If it has not yet been serialized, and the server crashes, then the data is lost. With ELT, the system ingests the data and immediately stores it. This minimizes the exposure of the data during ingestion, and provides new opportunities for data transformation and analysis. First, the data can be transformed asynchronously from ingestion. This helps reduce compute demand. Then the data can be transformed multiple times, for multiple purposes, and this process also supports the idea of collecting all data for extended periods of time. This is often referred to as a “data lake”[75], and this strategy suggests keeping “all” data for later analysis. The rationale for this is that machine learning algorithms may find interesting patterns or trends that would not be expected, and that these would warrant studying other seemingly unneeded data.

Location

Most IoT solutions will send data to a public or private cloud. If connected devices are geographically distributed, there may be a case for storing the data across several locations around the globe, in order to store the data closest to where it was generated. There may also be government mandates that require an individual’s data to remain in that person's home country, or the data may only be interesting within the region within which it was collected. However, in a large percentage of projects, the value is in the large body of data, so data must be brought together into a single location for the most insightful analysis. In this case, the considerations will center on the time constraints of the analysis (how often are the algorithms run?), the physical limitations of the data centers, bandwidth, and the cost of moving data.
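The ELT idea described earlier in this section, storing first and transforming later, can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory list `raw_store` stands in for durable storage, and the payload fields are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List

# Stand-in for durable storage; a real system would write to a persistent store.
raw_store: List[str] = []


def ingest(payload: str) -> None:
    """Load step: persist the payload exactly as received, without parsing it."""
    raw_store.append(payload)


def transform(fn: Callable[[Dict[str, Any]], Any]) -> List[Any]:
    """Transform step: run later, and as often as needed, over the stored data."""
    return [fn(json.loads(payload)) for payload in raw_store]


ingest('{"deviceId": "car-1", "tempC": 90, "rpm": 2100}')
ingest('{"deviceId": "car-2", "tempC": 104, "rpm": 3300}')

# Two independent transformations over the same raw data, for different purposes.
overheated = [
    device
    for device in transform(lambda m: m["deviceId"] if m["tempC"] > 100 else None)
    if device is not None
]
average_rpm = sum(transform(lambda m: m["rpm"])) / len(raw_store)

print(overheated, average_rpm)
```

Because ingestion does no parsing, a malformed payload cannot fail the ingest path; it only surfaces later, when a transformation reads it.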
Longevity, format, and cost

After the data reaches its long-term storage point, there are decisions to be made about how to govern it. A data retention policy must be defined. The arguments for long data retention periods are that cloud storage is inexpensive and getting cheaper all the time, and that data scientists want data saved in case a new insight is discovered that warrants looking at data that was previously uninteresting. Even with those benefits, the costs for large-volume data storage can add up, and the data could become unmanageable if you do not have a basic plan for how to store, access, and retrieve it.

The terms Data Temperature and Hot and Cold storage[76] also come up in this context. The concept centers on how frequently the data is accessed, and how quickly the users or systems expect to be able to use the data. Hot data is frequently accessed and users expect good response time. Cold data is less frequently accessed, and slower response times are acceptable. Classifying data in this manner allows the architects to choose faster and potentially more expensive storage for hot data and select lower cost options for cold data.

The format for long-term data storage also needs to be carefully considered. Should it be optimized for Hive queries, or should it be as compact as possible? Or should there be a “fresh” data store with more recent data that is easy to access, process, and query, and an archive that is compressed and stored in a way that minimizes cost, but that requires overhead if and when it needs to be accessed? All of these considerations factor into decisions on how best to store the data.

[75] Forbes, The Data Lake Dream
[76] Teradata, Hot and Cold Running Data

Processing information

After the data is ingested, it must be processed. Processing types range from very simple to long-running and complex. The following sections discuss common IoT data-processing types.
Alarm processing

A common use case is to watch for specific data items on ingestion and then take action based on that data. These could be alarms from devices, or any kind of simple event processing. The characteristic of this type of processing is that there is a specific set of values that are to be monitored on specific attributes of the incoming data that can trigger predetermined responses. While this type of event processing is logically straightforward, the implementation still requires consideration due to the expected high volume of data being ingested, and the likelihood that the events that must be responded to are of relative importance.

In alarm processing, the solution must also account for the potential of alarm floods. If a systemic failure happens, for instance if a home alarm system sends an alarm to the event processing system when the power goes out, there may be a flood of alarms. Alternatively, if there is no battery backup, messages may be cached on the device, and when the power returns, all the devices send their entire set of messages at once. To handle these situations, the devices may be designed to have a random offset for message delays, or the message-receiving service can implement a circuit breaker pattern[77] to circumvent failure when an abnormal event pattern happens.

Complex-event processing

Complex-event processing is used to detect conditions or states on data in motion that may not be directly deduced from simple data evaluation. This might include the detection of a certain set of events that arrive in a particular order or frequency, such as an event that is innocuous if it appears once, but that indicates a problem if it occurs a certain number of times in a certain timeframe, or if the same event is transmitted from a set of devices or sensors. Imagine that your car sends telemetry to the manufacturer, and one of the items that it reports is failed starts. By itself, this would mean very little to the manufacturer.
However, if the weather got very cold last night, and none of the SuperCar Model 8s in that area started in the morning, that could tell the manufacturer that there is a systemic problem with the car's battery or something related to the starting system.

The industry sees complex event processing as one of the keys to monetizing the vast opportunity of IoT.[78] When envisioning the solution, ensure that initial requirements are discussed early in the project. This is an area where businesses will learn and improve over time, but one which should be prototyped early in the process to prove out the concepts, and to begin to develop the right mindset for capitalizing on the opportunities.

This is a rich area of development within Microsoft, our competitors, and the open source community. Microsoft has developed StreamInsight,[79] which can be deployed in the cloud. A popular open source project is Apache Storm[80] for real-time stream processing, and Amazon is offering Kinesis for their cloud solutions, which includes stream processing.

[77] MSDN, Circuit Breaker Pattern
[78] Venture Beat, Without stream processing, there’s no big data and no Internet of things
[79] Microsoft, StreamInsight

Big Data analysis

One of the main drivers for IoT is the ability to economically collect and store large amounts of data. After the data is collected, it must be processed, aggregated, and analyzed to create datasets that can be visualized and used for business analysis, for informing business decisions and strategy, for feedback into product engineering to improve products, or for providing views of the data that can be shared with partners for monetization or for adding value to the business relationship. The most common approach for this is to use the Map/Reduce[81] pattern to batch process collected data. Apache Hadoop is the predominant implementation of that pattern, and Microsoft provides HDInsight, which is a cloud platform service implementation of Hadoop.
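As a rough illustration of the Map/Reduce pattern mentioned above, the sketch below summarizes telemetry per device in three phases. It runs in a single process and only mirrors the shape of the pattern; it does not use Hadoop or HDInsight, and the record fields and function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[dict]) -> Iterator[Tuple[str, float]]:
    """Map: emit one (key, value) pair per input record."""
    for record in records:
        yield record["deviceId"], record["tempC"]


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle: group all values that share a key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(device_id: str, values: List[float]) -> Tuple[str, float]:
    """Reduce: collapse the values for one key into a summary (here, the mean)."""
    return device_id, sum(values) / len(values)


records = [
    {"deviceId": "car-1", "tempC": 90.0},
    {"deviceId": "car-2", "tempC": 100.0},
    {"deviceId": "car-1", "tempC": 94.0},
]

summary = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(records)).items())
print(summary)
```

In a distributed implementation the map and reduce functions run in parallel across many nodes, and the framework performs the shuffle; the per-key independence of the reduce step is what makes that possible.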
The approach may be as simple as aggregating and summarizing data for simpler reuse, or it may be complex, multi-step processing that generates insights across the recently collected and historical data. Hadoop includes many tools within its ecosystem that help with searching, querying, and cataloging the data. In solutions today, Hadoop will frequently be used to preprocess data, such that Hadoop jobs will run and create summarized datasets that can be used for querying, reporting, and as input to machine learning activities, or as reference datasets in Complex Event Processing solutions.

Machine learning

Machine learning refers to the concept of studying data and deriving insights from the data. The results will be a model that can be used to predict future outcomes from similar data sets. The first step is to train the model. This is normally an iterative step performed by a data scientist where a training set of data is used to infer a function, or model, from that data. That model will be used to make decisions on incoming data. The model is typically retrained periodically, so that the model can improve over time, learning from additional new data and patterns.

Machine learning falls into two broad categories: supervised learning and unsupervised learning. Supervised learning studies the data looking for a known set of desired outcomes. In other words, in the vehicle scenario, I may want to minimize the number of times that a car needs its oil changed. So I would run studies against the data looking for patterns that give me information about the consequences of delaying oil changes, conditions, and so on. In unsupervised learning, the concept is to naturally find patterns and relationships of any kind in the data. After something interesting is observed, then these data points will be further investigated until they are found to be either useful or not useful. Common tools for machine learning include MATLAB[82], Mahout[83] and R[84].
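The train-then-predict cycle described above can be illustrated with a deliberately tiny supervised model: a straight line fitted by ordinary least squares. The training values and the helper names `train` and `predict` are hypothetical, and a real predictive maintenance model would be considerably richer than a single-variable line.

```python
from typing import List, Tuple


def train(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Infer a line y = slope * x + intercept from a labeled training set."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def predict(model: Tuple[float, float], x: float) -> float:
    """Apply the trained model to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training set: an input measurement and the observed outcome.
training_inputs = [1.0, 2.0, 3.0, 4.0]
training_outcomes = [3.0, 5.0, 7.0, 9.0]

model = train(training_inputs, training_outcomes)
print(model, predict(model, 5.0))
```

Retraining, as described above, amounts to calling `train` again once additional labeled data has been collected and replacing the model used by `predict`.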
Microsoft introduced its ML tooling, called Azure ML, in June 2014.[85] Azure ML is a machine learning service that democratizes the practice of machine learning. It provides a visual experience for constructing data experiments, and easy-to-use implementations of many commonly used machine learning algorithms, relieving the data scientist of implementing them in a programming language. Azure ML integrates easily with Azure Storage, HDInsight, and Windows Azure SQL Database, and it can expose the models as web services so that they are simple to integrate into the runtime data flow or applications.

[80] Apache, Storm, distributed and fault-tolerant realtime computation
[81] Wikipedia, MapReduce
[82] Wikipedia, MATLAB
[83] Wikipedia, Apache Mahout
[84] Wikipedia, R (programming language)
[85] Techcrunch, Microsoft announces Azure ML, Cloud-based Machine Learning Platform That Can Predict Future Events

Data enhancement

Another core piece of the IoT architecture is data enhancement. The data collected from the devices, the volume of it, and the hidden patterns within it provide tremendous value, but often combining the device data with other data is either critical in order for it to make sense to the business, or there is even more significant value to be gained by adding other data sets to analyze with the device data. Enterprise data may be used for simple things, such as relating device data to customer data. Other areas of opportunity include data markets that publish datasets that are either sold or available for free. Microsoft offers the Azure DataMarket[86], which offers datasets from governments, research institutions, historical, environmental, business organizations, and more. One of the most frequent datasets that gets combined with device data is weather.
Devices often exist all over the globe in different conditions, so predictive maintenance will frequently factor in weather data, which is normally sourced from weather data providers as opposed to collecting it with the device itself.

Publishing insights

After data stored in the system has been processed into information of value to others, the question becomes how to approach this exposure in a secure and compatible manner that is easy to discover and consume. Some organizations want to make their data available to partners both up and down the supply chain to realize efficiencies that result in lower costs and improve margins. Others are realizing the data they have can be directly monetized as services available for consumption by individuals, corporations and governments around the world. In addition to the stand-alone value of the data, it may also be seen as valuable to augment other data services. Data that may seem uninteresting to those within the organization could in reality be a key ingredient used in a number of potential external applications or analytical recipes. For an in-depth discussion of data-publishing considerations, see the paper “Making Public Data Public” from Microsoft.[87] The following sections discuss many aspects of this topic.

Audience

The target audience for the data will have a significant impact on how it is published. Will it be used to enhance analysis of other data? Will it be used through data visualization tools, such as PowerBI or Tableau? Will it be metered and have a price associated with it? Or will there be different views and price points of the data for different partners?

Publishing format

The choice of publishing format will be influenced by the targeted audience and the type of information being published. Similar to the discussion earlier in this paper about the incoming message format, the most likely choices for publishing data are XML, JSON, and AtomPub.
OData[88] is a standardized protocol for creating and consuming data APIs. OData originated at Microsoft, but it has become well-accepted in the industry. OData supports both JSON and AtomPub, so it is widely consumable by nearly all current tools and programming languages.

There are tools that can help scale, secure, and normalize the data publishing task. The Microsoft Azure DataMarket[89] is a global marketplace for data and applications that provides discoverability, interface normalization, and a monetization approach. Microsoft Azure API Management[90] is a service that facilitates publishing APIs. It includes features for API translation, versioning, aggregation, discovery, authorization, caching, and quotas. Both Azure DataMarket and Azure API Management can be part of the publishing strategy, using DataMarket for the broad exposure of large datasets, and API Management to expose APIs securely with usage metrics and management capabilities.

[86] See https://datamarket.azure.com/
[87] Microsoft, Making Public Data Public
[88] OData, OData Home page
[89] Microsoft, Microsoft Azure Marketplace Publishing
[90] Microsoft, Microsoft Azure API Management

Cost modeling and estimation

Determining the cost of an Internet of Things (IoT) solution focused on predictive maintenance is generally a complex problem. This section outlines an initial approach that we have used with our customers to estimate the cost of the architecture to support their predictive maintenance solutions. As with any calculation, it is very specific to a scenario, and this model will not be applicable to all situations or be totally complete.

Before we go into the specifics of determining the cost for a solution, we want to stress that cost modeling, like capacity planning, is an iterative exercise.
The process repeats itself, and performance testing and other data gathered will change capacity distribution (for example, different workloads could be combined in a single unit to save cost because these workloads are compatible in load profile) and tune the model over time. In other words, the first cost estimate will not be perfect, and it provides only an indicator of the cost of the solution.

A common architecture for IoT

Although you need to verify whether it satisfies your specific requirements, from our work with customers, a reference architecture surfaced which helps in implementing the Service Assisted Connectivity pattern by acting as the mentioned gateway. This architecture is built on top of Microsoft Azure Service Bus. Within Service Bus, it utilizes Event Hubs for the ingress (device to cloud) of data and topics for sending Command & Control messages as well as replies.

Event Hubs

Event Hubs is a new feature of Microsoft Azure Service Bus. It stands next to topics and queues as a Service Bus entity, and provides a different type of queue, offering time-based retention, client-side cursors, publish-subscribe support, and high-scale stream ingestion. Although it could be argued that the use of topics could satisfy the technical requirement for receiving data from devices, Event Hubs supports higher throughput and has an increased horizontal capacity.

Architectural details

Starting at the logical architecture level, the main architectural components are depicted in the following figure.

Figure 14. Reference architecture conceptual overview

The previous conceptual architecture figure includes four important components within the system:

1. The provisioning service that takes in information on authorized devices, creates its configuration, and stores access keys.
2. Devices that interact using either AMQP or HTTP towards Service Bus directly, or a component called the Custom Protocol Gateway Host, which hosts adapters for other protocols, such as MQTT and CoAP.
3. Telemetry requests that are distributed by the router, using adapters to communicate with downstream storage and processing engines.
4. Commands sent to devices through the use of the notification/command router that is internally surfaced through the Command API host.

To ensure the architecture is able to support a large number of devices, a partitioned model where the device population is divided into manageable groups is used. This partition model can be seen in the following figure.

Figure 15. Reference architecture details and partition overview

The figure details some important aspects of the reference architecture:

• Master. Part of the requirements assumption for the architecture is that solutions built on top of it will aim for a unified global or at least regional management model, independent from technical scale limitations that might inform how large a particular partition may grow. This motivates an overarching architectural model with a common “Master” service, shown on the far left of the figure, that takes care of shared management and deployment tasks, as well as of device provisioning and placement, and several parallel and independent deployments of “Partition” services that each take ownership of one or more logical system partitions.

• Partition. Instead of looking at a population of millions of connected devices as a whole, the system divides the device population into smaller, more manageable partitions of large numbers of devices each.
Each resource in the distributed system has a throughput- and storage-capacity ceiling, limiting the number of devices associated with any single Service Bus ingress entity so that the events sent by the devices will not exceed that entity’s ingestion throughput capacity, and any message backlog that might temporarily build up does not exceed the entity’s storage capacity.

In order to allocate appropriate compute resources and not overload the storage backend with too many concurrent write operations, a relatively small set of resources with reasonably well-known performance characteristics is bundled into an autonomous, and mostly isolated “scale-unit.” Each scale-unit supports a maximum and tested number of devices, which is also important for limiting risks in a scalability ramp-up. The principle behind this is that a production system can only be scaled up as much as it can be scaled up in testing on a regular basis.

A benefit of introducing scale-units is that they significantly reduce the risk of full system outages. If a system depends on a single data store and that store has availability issues, the whole system is affected. However, if the system consists of 10 scale-units that each maintain an independent store, issues in one store only affect 10 percent of the system.

The principle of running all traffic ingestion through asynchronous Service Bus messaging entities, instead of into a service edge that writes data straight to the database, is that Service Bus already provides a scaled-out and secure network service gateway for messaging, and it is specifically designed to deal with bad network conditions, traffic bursts, and even sustained traffic peaks. A backend datastore that is the target of the ingested data should not be dimensioned to handle specific bursts, such as vehicle telemetry during core European or U.S. East Coast rush hours.
The group called “partition” is a set of resources focused on handling data from a well-defined and known device population that has been assigned to and configured into the partition through provisioning. Cross-partition distribution of devices will be based on your solution-specific logic, and allocation within the partition is handled by provisioning.

The “partition” group is the unit of scale. Through testing, the load specifications for the partition have to be determined and a so-called scale-unit can be defined. A scale-unit is a group of resources that can effectively support a well-known load profile for the system, allowing replication of the scale-unit to provide support for an extrapolation of this load profile.

Within the “partition” group, there are two basic paths, ingestion (sending data from the device to the cloud) and egress (sending data from the cloud to the device). These paths accomplish the following:

- Ingestion. Ingestion has a given device connect through its supported protocol, delivering messages to its specific Event Hub, using its assigned credentials.
- Egress. Egress routes messages (replies, Command & Control) to their device destination.

• Device Repo. The device repository contains configuration information about the registered devices for a given partition.

Capacity modeling

Before cost can be modeled, the way that the system will scale needs to be considered and the characteristics of the architecture need to be determined. Essentially, the attributes of the previously mentioned scale-unit need to be defined.

There is a throughput ceiling for each of the components in the architecture, including each of the Service Bus entities. The reason to be cautious when evaluating throughput is that when dealing with distributed devices that send messages periodically, we cannot assume perfect, random distribution of event submissions across any given period.
There will be bursts and we need to allow for ample capacity reserve to handle such bursts. Assuming a scenario of a 10-minute event interval with one extra control interaction feedback message per device per hour, seven messages per hour from each device can be expected, and roughly 50,000 devices can be associated with each entity with a 100 messages per second average throughput capacity.

Having covered the flow rate, we can conclude that storage throughput is of little concern. However, storage capacity and the manageability of the event store are concerns. The per-device event data at a resolution of one hour for 50,000 devices amounts to some 438 million event records per year. Even if these event records are limited in size to only 50 bytes, the payload data still amounts to 22 GB per year for each of the scale-units. This underlines the need to keep an eye on the storage capacity and storage growth when thinking about sizing scale-units.

These considerations manifest in a capacity model in the deployment model, which informs how many entities must be created in the Service Bus namespace backing a partition for a given device population size like 50,000 devices and for a given load profile.

The load profile is currently informed by how many (telemetry) messages a device is generally expected to send, how many commands or notifications the device is expected to receive per hour, and what the average size of these messages is. The inputs should be well-informed but generous estimates, because while changing the shape of a scale-unit layout at a later time is possible, doing so may require reprovisioning the devices.

Determining partitions is not only motivated by capacity concerns, however. Because a partition also forms a configuration scope, it provides a suitable mechanism to segregate device populations by region, country, owner, operator, product, or other concerns. As an example, one deployment can have up to 1,024 partitions.
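The arithmetic behind the figures above can be checked with a short calculation. The function names are assumptions for illustration; the inputs (seven messages per device per hour, 100 messages per second per entity, hourly records, 50-byte records, 50,000 devices) come from the scenario in the text, and a decimal gigabyte (10^9 bytes) is assumed.

```python
import math

HOURS_PER_YEAR = 24 * 365


def devices_per_entity(entity_msgs_per_second: float,
                       msgs_per_device_per_hour: float) -> int:
    """Devices one ingress entity can carry at its average throughput."""
    return math.floor(entity_msgs_per_second * 3600 / msgs_per_device_per_hour)


def yearly_records(devices: int, records_per_device_per_hour: float) -> int:
    """Event records stored per year for one scale-unit."""
    return int(devices * records_per_device_per_hour * HOURS_PER_YEAR)


def yearly_storage_gb(records: int, bytes_per_record: int) -> float:
    """Payload size per year in decimal gigabytes (10**9 bytes)."""
    return records * bytes_per_record / 1e9


# Six telemetry messages per hour (10-minute interval) plus one control message.
msgs_per_hour = 60 // 10 + 1

capacity = devices_per_entity(100, msgs_per_hour)
records = yearly_records(50_000, 1)
storage = yearly_storage_gb(records, 50)

print(msgs_per_hour, capacity, records, storage)
```

The computed ceiling of about 51,000 devices is rounded down to 50,000 in the text, and that figure reflects average throughput only; the burst reserve discussed above would lower the number of devices planned per entity.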
Each partition corresponds to exactly one Service Bus namespace. Because there can only be 50 namespaces per Azure subscription, and other dependent services have similar quotas, a fully built-out architecture will therefore most likely span multiple subscriptions.

In summary, the attributes that we have found to determine the capacity model are:

• Number of devices. This is the number of sensors supplying telemetry information to the scale-unit.
• Average message interval ingress / egress. This represents the average number of messages that a given device emits per hour (ingress) and that the system emits per hour (egress).
• Average message size ingress / egress. This is the average size of the messages that a device emits (ingress) or the system sends (egress), in bytes.

Figure 16. Scale Units in the reference architecture

Cost estimation

With the estimation of cost for a solution built on top of this architecture, there are many factors to consider. We will work through the list from the ingress of device data to sending commands. Cost is estimated based on architectural design and necessary scale for success. As such, cost estimation has variables for the scale that is needed applied to the formula for calculation.

Before we dig into the details, we feel the need to underscore the fact that cost modeling, like capacity modeling, is an input for architectural decision making and business case modeling, where the combination of all inputs should always be considered as a whole. As an example, you might find using HTTP for communications will be somewhat less expensive from a cost modeling perspective. However, choosing HTTP over AMQP will inherently impact performance.

For all pricing-related information in the cost estimation formulas outlined in this section, it is important to state that prices will vary over time and the examples are aimed only at explaining the formula itself.
The latest pricing information can always be found at http://azure.microsoft.com/en-us/pricing/overview/.

Ingress path cost using Event Hubs

Events consumed from an Event Hub, as well as management operations and “control calls” such as checkpoints, are not counted as billable ingress events. The formula for estimating the cost of the architecture when using Event Hubs is therefore a combination of:

\[
\begin{aligned}
Cost_{\text{monthly ingress}} ={}& Cost_{\text{base charge}} + Cost_{\text{brokered connections}} + Cost_{\text{throughput units}} + Cost_{\text{messages}} \\
&+ Cost_{\text{protocol gateway}} + Cost_{\text{telemetry pump}} + Cost_{\text{egress traffic}}
\end{aligned}
\]

Figure 17. The ingress path of the reference architecture

This expands into a more detailed formula in which we can fill in the appropriate variables:

\[
\begin{aligned}
Cost_{\text{monthly ingress}} ={}& Cost_{\text{base charge}}
+ \left(\frac{T_{\text{brokered connections}}}{744} - 1000\right)
\begin{cases}
\sum a \; \$0.03 & a < 100k \\
\sum a \; \$0.025 & 100k \le a \le 500k \\
\sum a \; \$0.015 & a > 500k
\end{cases} \\
&+ 744 \, N_{\text{throughput units}} \; \$0.03 \\
&+ \left(\frac{N_{\text{devices}} \, A_{\text{msg per month}} \, \operatorname{ceil}\!\left(\frac{A_{\text{msg size in KB}}}{64}\right)}{1{,}000{,}000} - 12.5\right) \$0.028 \\
&+ \left(744 \, N_{\text{supported scale}} \, Cost_{\text{worker}}\right)_{\text{worker roles}}
+ A_{\text{egressGB}} \times Cost_{\text{egressGB}}
\end{aligned}
\]

Equation 1 - The cost estimation formula for the ingress path

In the tiered term, \(a\) is the number of brokered connections that fall within each pricing tier.

It should be noted that this formula uses the “Standard” tier offering of Event Hubs [91], which offers additional brokered connections, filters, and additional storage capacity. The fixed pricing elements in the formula use pricing from a point in time and are susceptible to change. Also, the formula assumes a flat use of brokered connections, while actual billing is based on peak use prorated per hour; the dynamics of your system will likely deviate.

[91] See http://azure.microsoft.com/en-us/pricing/details/event-hubs/.

The variables in this equation are:

Variable: Description
\(Cost_{\text{monthly ingress}}\): The cost of the ingestion of events, per month.
\(T_{\text{brokered connections}}\): The total number of hours that connections to the system are held, summed over all simultaneous connections.
\(N_{\text{throughput units}}\): The number of throughput units [92] needed to support the ingress of data into the system. A throughput unit is the combination of inbound bandwidth, temporary storage, and outbound bandwidth, as described in the reference.
\(N_{\text{devices}}\): The number of deployed devices sending data to the system.
\(A_{\text{msg per month}}\): The average number of messages sent into the system, per device, per month.
\(A_{\text{msg size}}\): The average size of each message sent into the system, in kilobytes.
\(N_{\text{supported scale}}\): The number of worker roles necessary to support the projected scale of the system. Normally, at least two (2) are needed to fall within SLA support of Microsoft Azure.
\(Cost_{\text{worker}}\): The cost per worker role for the ingress path when using custom protocols and for the telemetry pump, per hour.
\(A_{\text{egressGB}}\): The average amount of egress traffic per month, in gigabytes.
\(Cost_{\text{egressGB}}\): The cost of egress traffic [93], per gigabyte.

Example calculation

An example calculation in which 1,000,000 deployed devices each send a message averaging 128 bytes every 60 seconds, with an average of 100,000 devices simultaneously connected during the entire month, would yield the following results:

Variable: Value
\(T_{\text{brokered connections}}\): 74,400,000 connection hours (100,000 simultaneous connections for the full 744-hour month, so that \(T_{\text{brokered connections}} / 744\) equals 100,000).
\(N_{\text{throughput units}}\): 17 (44,640 messages per device per month gives 44,640,000,000 messages per month, or about 16,666.67 per second. Given that a single throughput unit supports up to 1,000 messages per second, rounding up 16,666.67 / 1,000 equals 17).
\(N_{\text{devices}}\): 1,000,000
\(A_{\text{msg per month}}\): 44,640 (744 hours * 3,600 equals 2,678,400 seconds per month; one message every 60 seconds equals 44,640 messages per month).
\(A_{\text{msg size}}\): 1 KB (128 bytes is 128 / 1,024 = 0.125 KB, rounded up to 1 KB).
\(N_{\text{supported scale}}\): 50 (assuming a rough estimate of 20,000 devices supported per worker role).
Note again, this is not a capacity modeling exercise; these numbers should come from performance tests on your specific scenario.
\(Cost_{\text{worker}}\): $0.08 per hour (assuming A1 worker role size).
\(A_{\text{egressGB}}\): 0 (assuming all downstream processing happens inside the same regional datacenter).
\(Cost_{\text{egressGB}}\): Not applicable.

[92] Microsoft, Microsoft Azure, Event Hubs pricing, FAQ “What are throughput units and how are they billed?”
[93] Microsoft, Microsoft Azure, Data Transfer Pricing Details

\[
\begin{aligned}
Cost_{\text{monthly ingress}} ={}& Cost_{\text{base charge}} + 99{,}000 \times \$0.03 + 744 \times 17 \times \$0.03 \\
&+ \left(\frac{1{,}000{,}000 \times 44{,}640 \times \operatorname{ceil}\!\left(\frac{1}{64}\right)}{1{,}000{,}000} - 12.5\right) \$0.028 + (50 \times 744 \times \$0.08) \\
={}& Cost_{\text{base charge}} + \$2{,}970 + \$379.44 + \$1{,}249.57 + \$2{,}976.00 \\
={}& Cost_{\text{base charge}} + \$7{,}575.01
\end{aligned}
\]

That is, the example amounts to $7,575.01 per month before the base charge, for which no value is given here.

Egress path cost

As with ingress, the egress path also has multiple components that incur cost. Because sizes often differ between ingestion data and command and control messages, the message size is not the same value as the one used in the ingress path. The components involved in egress are:

• Command API Host. The process in charge of sending notifications and commands to devices and groups of devices. It encapsulates the notification/command router, and routes egress messages to the appropriate topic on Microsoft Azure Service Bus, depending on the type of request. It is hosted inside a worker role.
• Subscriptions. The Command API supports two different types of messages: notifications and commands. A command can yield either a single response message or multiple response messages. Notifications and commands can also target groups of devices. All of these messages incur cost. Response messages have not been accounted for in the egress calculation and should be estimated here. Command replies are not routed through the telemetry adapters.
• Egress traffic. Each egress message will incur cost.

Figure 18. The egress path for the reference architecture
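Looking back at the ingress path, Equation 1 and its example calculation can be sketched in Python as follows. This is a sketch under stated assumptions rather than a billing implementation: the prices are the point-in-time figures quoted in the paper, the tiered connection price is assumed to be graduated (each tier's price applies only to the connections inside that tier), the message term is clamped at zero when usage stays within the included 12.5 million, and all function and parameter names are illustrative.

```python
import math

HOURS_PER_MONTH = 744            # the paper's 31-day month
INCLUDED_CONNECTIONS = 1_000     # connections not billed, per Equation 1
# (upper bound on connection count, price per connection per month).
# Graduated billing across tiers is an assumption of this sketch.
CONNECTION_TIERS = [(100_000, 0.03), (500_000, 0.025), (math.inf, 0.015)]
THROUGHPUT_UNIT_PER_HOUR = 0.03
PRICE_PER_MILLION_MESSAGES = 0.028
INCLUDED_MILLION_MESSAGES = 12.5
BILLING_UNIT_KB = 64             # messages are billed in 64 KB increments


def connection_cost(connection_hours: float) -> float:
    """Tiered cost of brokered connections from total connection hours."""
    average = connection_hours / HOURS_PER_MONTH
    cost, lower = 0.0, INCLUDED_CONNECTIONS
    for upper, price in CONNECTION_TIERS:
        if average > lower:
            cost += (min(average, upper) - lower) * price
        lower = upper
    return cost


def throughput_units(n_devices: int, msgs_per_device_month: float,
                     msgs_per_sec_per_unit: int = 1_000) -> int:
    """Units needed for the average message rate (no burst reserve)."""
    per_second = n_devices * msgs_per_device_month / (HOURS_PER_MONTH * 3600)
    return math.ceil(per_second / msgs_per_sec_per_unit)


def message_cost(n_devices: int, msgs_per_device_month: float,
                 msg_size_kb: float) -> float:
    """Cost of billable messages beyond the included allowance."""
    millions = (n_devices * msgs_per_device_month
                * math.ceil(msg_size_kb / BILLING_UNIT_KB) / 1_000_000)
    return max(0.0, millions - INCLUDED_MILLION_MESSAGES) * PRICE_PER_MILLION_MESSAGES


def ingress_monthly_cost(connection_hours, n_devices, msgs_per_device_month,
                         msg_size_kb, n_workers, worker_cost_per_hour,
                         egress_gb=0.0, cost_egress_gb=0.0, base_charge=0.0):
    """Sum of the terms of Equation 1; base_charge defaults to 0 here."""
    units = throughput_units(n_devices, msgs_per_device_month)
    return (base_charge
            + connection_cost(connection_hours)
            + HOURS_PER_MONTH * units * THROUGHPUT_UNIT_PER_HOUR
            + message_cost(n_devices, msgs_per_device_month, msg_size_kb)
            + HOURS_PER_MONTH * n_workers * worker_cost_per_hour
            + egress_gb * cost_egress_gb)


# Inputs of the ingress example: 1,000,000 devices, one 128-byte message
# per minute, 100,000 devices connected for the whole month, 50 A1 workers.
total = ingress_monthly_cost(connection_hours=100_000 * HOURS_PER_MONTH,
                             n_devices=1_000_000,
                             msgs_per_device_month=44_640,
                             msg_size_kb=1,
                             n_workers=50,
                             worker_cost_per_hour=0.08)
print(f"{total:.2f}")
```

Keeping each term in its own function makes it easy to see which term dominates as the inputs change; with the example inputs, the brokered connections and the worker roles account for most of the cost.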
Given the components listed above, the egress path cost can be calculated using the following formula:

\[
Cost_{\text{monthly egress}} = Cost_{\text{worker roles commandAPI}} + Cost_{\text{notifications, single command messages, multi command messages, command response messages}} + Cost_{\text{egress traffic}}
\]

This also expands into a more detailed formula in which we can fill in the appropriate variables:

\[
\begin{aligned}
Cost_{\text{monthly egress}} ={}& 744 \, N_{\text{supported scale}} \, Cost_{\text{commandAPI}} \\
&+ \left(\frac{\sum_{A \in \{A_{nm},\, A_{csm},\, A_{cmm},\, A_{rm}\}} A \, \operatorname{ceil}\!\left(\frac{size_a}{64}\right)}{1{,}000{,}000} - 12.5\right)
\begin{cases}
\sum a \; \$0.80 & a < 100 \\
\sum a \; \$0.50 & 100 \le a \le 2{,}500 \\
\sum a \; \$0.20 & a > 2{,}500
\end{cases} \\
&+ \left(\frac{\sum_{A \in \{A_{nm},\, A_{csm},\, A_{cmm}\}} A \; size_a}{1{,}048{,}576}\right) Cost_{\text{egressGB}}
\end{aligned}
\]

Equation 2 - The cost estimation formula for the reference architecture egress path

In the tiered term, \(a\) is the number of billed operations, in millions, that fall within each pricing tier.

This calculation combines single-device notifications and commands with group broadcast messaging. Determining the magnitude and distribution needed to work out the averages within the formula is left to the reader as part of the capacity modeling for the system architecture.

The variables in this equation are:

Variable: Description
\(N_{\text{supported scale}}\): The number of roles necessary to support the projected scale of the system. Normally, at least two (2) are needed to fall within SLA support of Microsoft Azure.
\(Cost_{\text{commandAPI}}\): The cost per worker role for the command API host, per hour.
\(A_{nm}\): The average number of notifications per month.
\(A_{csm}\): The average number of single response command messages per month.
\(A_{cmm}\): The average number of multiple response command messages per month.
\(A_{rm}\): The average number of response messages to commands, per month.
\(Size_a\): The average response size, in kilobytes, averaged over all outbound message types.
\(Cost_{\text{egressGB}}\): The cost of egress traffic [94], per gigabyte.
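The structure of Equation 2 can be sketched in Python as follows. The message volumes at the bottom are hypothetical and chosen only to exercise the pricing tiers; the prices are the point-in-time figures from the formula. Graduated billing across tiers and clamping the message term at zero when usage stays within the included 12.5 million operations are assumptions of this sketch, and all names are illustrative.

```python
import math

HOURS_PER_MONTH = 744
BILLING_UNIT_KB = 64          # operations are billed in 64 KB increments
INCLUDED_MILLIONS = 12.5      # operations (in millions) not billed
# (upper bound in millions of operations, price per million operations).
MESSAGE_TIERS = [(100.0, 0.80), (2_500.0, 0.50), (math.inf, 0.20)]
KB_PER_GB = 1_048_576


def billed_operations_millions(messages) -> float:
    """messages: iterable of (count per month, average size in KB)."""
    return sum(count * math.ceil(size_kb / BILLING_UNIT_KB)
               for count, size_kb in messages) / 1_000_000


def messaging_cost(ops_millions: float) -> float:
    """Graduated cost of operations beyond the included allowance."""
    cost, lower = 0.0, INCLUDED_MILLIONS
    for upper, price in MESSAGE_TIERS:
        if ops_millions > lower:
            cost += (min(ops_millions, upper) - lower) * price
        lower = upper
    return cost


def egress_traffic_gb(outbound) -> float:
    """Outbound volume in GB from (count, size in KB) pairs."""
    return sum(count * size_kb for count, size_kb in outbound) / KB_PER_GB


def egress_monthly_cost(n_roles, role_cost_per_hour, billed, outbound,
                        cost_egress_gb) -> float:
    """Sum of the three terms of Equation 2."""
    return (HOURS_PER_MONTH * n_roles * role_cost_per_hour
            + messaging_cost(billed_operations_millions(billed))
            + egress_traffic_gb(outbound) * cost_egress_gb)


# Hypothetical monthly volumes: (count, average size in KB).
notifications = (40_000_000, 20)
commands = (10_000_000, 100)
responses = (10_000_000, 80)

total = egress_monthly_cost(n_roles=2, role_cost_per_hour=0.08,
                            billed=[notifications, commands, responses],
                            outbound=[notifications, commands],
                            cost_egress_gb=0.138)
print(f"{total:.2f}")
```

Responses are included in the billed operations but not in the outbound traffic, mirroring the two sums in Equation 2, where \(A_{rm}\) appears only in the message term.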
Example calculation

An example calculation using 100,000 notifications per month of 20 KB each, 130,000 commands of 35 KB each with single replies of 80 KB each, and 20,000 commands of 20 KB each with on average three (3) replies of 70 KB each would yield the following results:

Variable: Value
\(N_{\text{supported scale}}\): 2
\(Cost_{\text{commandAPI}}\): $0.08 (A1)
\(A_{nm}\): 100,000
\(A_{csm}\): 150,000
\(A_{cmm}\): 20,000
\(A_{rm}\): 190,000 (130,000 + 3 * 20,000 equals 190,000)
\(Cost_{\text{egressGB}}\): $0.138

[94] Microsoft, Microsoft Azure, Data Transfer Pricing Details

\[
\begin{aligned}
Cost_{\text{monthly egress}} ={}& 744 \times 2 \times \$0.08 \\
&+ \left(\frac{100{,}000 \operatorname{ceil}\!\left(\frac{20}{64}\right) + 150{,}000 \operatorname{ceil}\!\left(\frac{35}{64}\right) + 20{,}000 \operatorname{ceil}\!\left(\frac{210}{64}\right) + 190{,}000 \operatorname{ceil}\!\left(\frac{121}{64}\right)}{1{,}000{,}000} - 12.5\right)
\begin{cases}
\sum a \; \$0.80 & a < 100 \\
\sum a \; \$0.50 & 100 \le a \le 2{,}500 \\
\sum a \; \$0.20 & a > 2{,}500
\end{cases} \\
&+ \left(\frac{100{,}000 \times 20 + 150{,}000 \times 35 + 20{,}000 \times 210}{1{,}048{,}576}\right) \$0.138 \\
={}& \$119.04 + \$0.80 \times \frac{100{,}000 + 150{,}000 + 80{,}000 + 380{,}000}{1{,}000{,}000} + \$0.138 \times \frac{11{,}450{,}000}{1{,}048{,}576} \\
={}& \$119.04 + \$0.57 + \$1.51 = \$121.12
\end{aligned}
\]

Note that the 12.5 million included operations are not deducted in this example; with only 0.71 million billed operations, deducting them would reduce the message charge to zero.

Management cost

Besides the messaging-related components in the reference architecture, there is also the concept of one or more masters for managing the system, as discussed previously in this paper. The master is tasked with provisioning devices, creating appropriate queues and topics, storing device information, provisioning security, and so on.

The master contains the following cost components:

• Provisioning Runtime. The component called by tooling to provision a device or a set of devices into the system, creating the necessary Service Bus, compute, and storage artifacts. It is hosted inside a worker role.
• Device Repo. The datastore collecting the registered devices per partition.
• Partition Repo. The datastore collecting partition registration information.
Given these components, the management cost can be calculated using the following formula:

Figure 19 - The "master" component within the reference architecture

\[
Cost_{\text{monthly mgmt}} = 744 \, N_{\text{supported scale}} \, Cost_{\text{master}} + \left(Size_{\text{partition repo in GB}} + Size_{\text{device repo in GB}} \, N_{\text{partitions}}\right) Cost_{\text{tsGRS}} + \Delta_{di} \, Cost_{tx}
\]

Equation 3 - The cost estimation formula for management of the reference architecture

The variables in this equation are:

Variable: Description
\(Cost_{\text{monthly mgmt}}\): The cost of the management for the architecture, per month.
\(N_{\text{supported scale}}\): The number of roles necessary to support the projected scale of the system. Normally, at least two (2) are needed to fall within SLA support of Microsoft Azure.
\(Cost_{\text{master}}\): The cost per worker role for the management host, per hour.
\(Size_{\text{partition repo in GB}}\): The number of gigabytes used in the partition repository for administrative purposes.
\(Size_{\text{device repo in GB}}\): The number of gigabytes used in the device repository.
\(N_{\text{partitions}}\): The number of partitions to allow for appropriate scale.
\(Cost_{\text{tsGRS}}\): The cost for Geo Redundant Storage (GRS) table storage ($0.095 / GB at the time of writing).
\(\Delta_{di}\): The change in device information. Any change to the device information stored in the system, and subsequently in a device repository inside a partition, will account for at least two operations on table storage.
\(Cost_{tx}\): The cost for storage transactions ($0.0036 / 100k transactions at the time of writing).
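Equation 3 can be sketched in Python as follows, using the inputs of the example calculation that follows in the paper. The function and parameter names are illustrative assumptions. The sketch takes the number of storage transactions directly and, like the example, counts 10,000 of them; if each change is counted as two table operations, as the description of \(\Delta_{di}\) suggests, pass twice the number of changes.

```python
HOURS_PER_MONTH = 744


def management_monthly_cost(n_roles: int, master_cost_per_hour: float,
                            partition_repo_gb: float, device_repo_gb: float,
                            n_partitions: int, cost_grs_per_gb: float,
                            storage_transactions: int,
                            cost_per_100k_tx: float) -> float:
    """Sum of the compute, storage, and transaction terms of Equation 3."""
    compute = HOURS_PER_MONTH * n_roles * master_cost_per_hour
    storage = (partition_repo_gb + device_repo_gb * n_partitions) * cost_grs_per_gb
    transactions = storage_transactions / 100_000 * cost_per_100k_tx
    return compute + storage + transactions


total = management_monthly_cost(n_roles=2, master_cost_per_hour=0.16,
                                partition_repo_gb=0.25, device_repo_gb=0.125,
                                n_partitions=10, cost_grs_per_gb=0.095,
                                storage_transactions=10_000,
                                cost_per_100k_tx=0.0036)
print(f"{total:.2f}")
```

Storage and transactions contribute well under a dollar here, so the result is driven almost entirely by the two worker roles.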
Example calculation

An example calculation using 10,000 changes to device registration per month (new devices, changes in activation, or removed devices), leading to a total partition repository size of 256 MB (assuming a single master instance is used) and a 128 MB device repository per partition, using 10 partitions, would yield the following results:

Variable: Value
\(N_{\text{supported scale}}\): 2
\(Cost_{\text{master}}\): $0.16 (medium)
\(Size_{\text{partition repo in GB}}\): 0.25
\(Size_{\text{device repo in GB}}\): 0.125
\(N_{\text{partitions}}\): 10
\(Cost_{\text{tsGRS}}\): $0.095 / GB
\(\Delta_{di}\): 10,000
\(Cost_{tx}\): $0.0036 / 100k

\[
\begin{aligned}
Cost_{\text{monthly mgmt}} &= 744 \times 2 \times \$0.16 + (0.25 + 0.125 \times 10) \, \$0.095 + 0.1 \times \$0.0036 \\
&= \$238.08 + \$0.1425 + \$0.00036 = \$238.22
\end{aligned}
\]

As can be observed from the outcome of the formula, the cost of management for the reference architecture depends mostly on the worker roles running to support it.

System processing cost

An IoT system with only the ability to ingest and offload data, combined with the ability to send commands, is not complete. It is just the communication interface for connecting devices to a central system. Although it is not included in this example, completing an IoT system requires data analysis, either in flight by using an event processing engine, or at rest by using solutions for machine learning. You will almost certainly also need components that take advantage of key parts of this underlying technology to surface management and control mechanisms to users through one or more portals, expose the gathered knowledge from machine learning to other parties through web services, and so on.

Cost estimate calculation

In the previous sections of this paper, we discussed the various components that make up the cost for the data ingestion and communication platform inside the reference architecture.
When we combine these, we can calculate the total estimated cost for a partition, and extrapolate the total estimated OPEX cost for the system based on the number of needed partitions, using the following formula:

\[
Cost_{\text{total per month}} = \left(Cost_{\text{ingress}} + Cost_{\text{egress}}\right) N_{\text{partitions}} + Cost_{\text{management}}
\]

Important topics not yet covered

In this paper, we have strived to capture many of our learnings from implementing predictive maintenance solutions in the Internet of Things (IoT) space. However, beyond the topics discussed, there is much detail to add and more to think about when architecting for IoT. This final section touches on some of these topics.

Networks with automatic handover and fallbacks

In IoT scenarios, there is an emerging need for networks that work together seamlessly, so that frequently roaming users can perform command and control on the partially or fully closed IoT systems that they can access. This capability would require working across vendors and standards to ensure that the right connectivity type is available at the right time, and at the right price.

The need for the commoditization of devices

Many solutions today use their own proprietary hub for connecting their point solution to the Internet. This approach needs to change, with vendors selling connectivity bridges that work much like today's home Internet routers. In fact, such Internet routers could prove to be a great point of integration with standardized PAN/LAN devices, and support autonomous operations when connectivity is not available. Ideally, these bridges would support current legacy non-IP PAN device protocols, such as Z-Wave, traditional ZigBee, and so on.

The creation and use of information marketplaces

As IoT systems evolve, especially those capturing telemetry for intelligent decision making, there is a clear need for data augmentation to provide context for machine learning.
Information marketplaces, such as Microsoft Azure DataMarket, need to expand their offerings, providing new opportunities for data providers.

Management solutions

There are standards put forth for managing devices [95], such as OMA Device Management [96] (of which Microsoft implemented a subset, called Mobile Device Management [97]), CPE WAN Management Protocol [98], Lightweight M2M [99], and UPnP-DM [100].

[95] Blackberry, A Comparison of Protocols for Device Management and Software Updates
[96] Wikipedia, OMA Device Management
[97] Microsoft, MS-MDM: Mobile Device Management Protocol
[98] Wikipedia, TR-069
[99] Ericsson, “Lightweight M2M”: Enabling Device Management and Applications for the Internet of Things
[100] See Introduction to UPnP Device Management

As millions of devices become part of IoT systems, there is a clear need for IoT solutions that can monitor and manage incidents in the systems, visualize information and effectively control the environment, span the various connectivity options, and support legacy systems.

The redefinition of SLAs

Although this is a very hard problem to solve, customers will ask for different types of Service Level Agreements (SLAs) in this space. Where current SLAs provide a system availability guarantee, this definition has to evolve to provide concrete answers to questions such as how much bandwidth is available, what maximum and average latency to expect, how many I/O operations per second (IOPS [101]) the storage system can provide, and so on. Moving beyond those basic guarantees, customers will seek answers from SaaS solutions for IoT based simply on the number of devices that they can support.

Integration simplicity

As IoT promises to extend vertical solutions across horizontal markets, and to connect systems in ways never seen before to add value to businesses and people’s lives, the integration between these systems, and the way they are secured, needs to be standardized.
AMQP provides an example of this in regard to transport-layer integration.

[101] Wikipedia, IOPS

Conclusions

This paper has gone into great detail about the particulars of building IoT solutions, based on our experience in working with enterprise customers. As you can see, IoT solutions can be complex, but they also offer massive promise for increasing revenue, cutting cost, and finding new business models based on innovative use of technology.

An enterprise might believe that its requirements are so unique that only a custom IoT solution can meet its needs. But the unusual requirements of IoT solutions in security, communication, and scale make them complex and expensive to build as custom solutions from the ground up. The Microsoft Azure platform, on the other hand, has a comprehensive set of the building blocks that you need to build an IoT solution relatively quickly and painlessly by using the reference architecture described in this paper.

How Microsoft can help you succeed

Microsoft Services can help establish an effective strategy for your Predictive Maintenance scenario and provide direction, implementation guidance, delivery, and support to help you realize your Internet of Things strategy. We offer:

▪ Customer value discovery and ideation workshops
▪ Strategy workshops
▪ Implementation guidance
▪ Microsoft Services Subject Matter Expertise, both in your vertical industry and on the topic of general IoT and Predictive Maintenance.

For more information about Consulting and Support solutions from Microsoft, contact your Microsoft Services representative or visit www.microsoft.com/services.
