Building the Internet of Things
Early learnings from architecting solutions focused
on predictive maintenance
Authors
Martijn Hoogendoorn, Architect, Applied Incubation, Microsoft
Mark Kottke, Architect, Applied Incubation, Microsoft
Intended audience
This white paper is aimed at technical decision makers, solution architects, and developers.
Abstract
This white paper provides a comprehensive overview of lessons learned from the authors'
experiences in implementing large scale customer projects that target predictive maintenance as
a space in IoT. It frames various elements and considerations of importance within the Internet
of Things, highlighting tradeoffs, opportunities and grounding the implementation activities using
a reference architecture and an associated comprehensive cost model.
Acknowledgments
The authors would like to thank the following people, who contributed to, reviewed, and helped improve
this white paper.
Contributors
Marc Mercuri, Principal Program Manager, Azure Customer Advisory Team, Microsoft
Clemens Vasters, Principal Program Manager, Azure Application Platform, Microsoft
Reviewers
Arno Harteveld, Architect, Client Solutions, Microsoft
Carolina Piavis, Director Business Programs, Applied Incubation, Microsoft
Ray Stephenson, Director, Applied Incubation, Microsoft
Mani Subramanian, Senior SDET, Patterns & Practices, Microsoft
Version
1.2
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the
date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment
on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS
DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of
this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter
in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document
does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
The descriptions of other companies’ products in this document, if any, are provided only as a convenience to you. Any such
references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee their accuracy, and the
products may change over time. Also, the descriptions are intended as brief highlights to aid understanding, rather than as thorough
coverage. For authoritative descriptions of these products, please consult their respective manufacturers.
© 2014 Microsoft Corporation. All rights reserved. Any use or distribution of these materials without express authorization of
Microsoft Corp. is strictly prohibited.
Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Table of contents
Executive summary
IoT and predictive maintenance
  The Internet of Things
    Business value
    Megatrends
    Technology enablers
    Standardization efforts
  Predictive maintenance
Predictive maintenance scenarios
  Healthcare
  Automotive
  Manufacturing
Architectural considerations
  Connectivity
    Interaction patterns
    Connectivity pathways
    Connectivity network types
  Protocol choices
    Transport-layer protocol choices
    Transport-layer protocol security
    Application-layer protocol choices
  Security
    Virtual Private Networks
    Compliance
  Device communication patterns
    NAT-based device network
    IPv6 direct-addressing device network
    NAT-based, PAN device network
    Generic concerns with direct addressing
    Service-assisted communication
  Designing for scale
    Communication and ingestion
    Data storage scalability
  Device registration
  Acquiring data
    Message size and format
    Message types
    Message priority
    Conditional messaging
    Contextual messaging
    Message batching
    Bandwidth and scale
  Storing information
    Storing data on the device
    Transforming data
    Location
    Longevity, format, and cost
  Processing information
    Alarm processing
    Complex-event processing
    Big Data analysis
    Machine learning
    Data enhancement
  Publishing insights
    Audience
    Publishing format
Cost modeling and estimation
  Common architecture overview
  Capacity modeling
  Cost estimation
    Ingress path cost
    Egress path cost
    Management cost
  System processing cost
  Cost estimate calculation
Strategic choices
  Buy, build, or hybrid
Important topics not yet covered
  Networks with automatic handover and fallbacks
  The need for the commoditization of devices
  The creation and use of information marketplaces
  Management solutions
  The redefinition of SLAs
  Integration simplicity
Conclusions
How Microsoft can help you succeed
Executive summary
For decades, technology experts have anticipated the Internet of Things (IoT): the proliferation of tens of
billions of connected devices that contain embedded microchips, and the rise of machine-to-machine and
service-to-service communications. IoT will make inanimate objects, networks, and processes “smart”—
everything from tiny components, appliances, machines, homes, buildings, and factories to energy grids,
transportation networks, and logistics systems. It’s a game-changing opportunity in IT. By analyzing the
vast new streams of data, and by harnessing the precise control that IoT provides, your organization can
reduce costs, create new revenue streams, increase customer satisfaction and retention, spot trends
faster, gain from opportunities more easily, and innovate with agility. IoT will be especially beneficial in
predictive maintenance: performing maintenance at the right time to predict and prevent failures.
To take full advantage of IoT opportunities in predictive maintenance, you need to think strategically
about the many elements of IoT. For example, you should consider connectivity pathways and types,
transport-layer and application-layer protocol choices, device interaction and communication patterns,
and how to design for the vast scale of IoT. It is especially critical to understand the complex issues of
data security and regulatory compliance, which can expose the enterprise to legal difficulties if they are
not handled properly. You also should think about how the enterprise’s communications systems will
ingest data, including message types, sizes, formats, and priorities, conditional and contextual
messaging, message batching, bandwidth, and how to scale a messaging system.
Another pivotal set of questions relates to the data: where will data be stored, how will it be
distributed or potentially sold, how long must it be kept, in what format, and at what
cost? What is the most efficient way to analyze Big Data, and how can you best take advantage of
possibilities such as alarm processing, complex-event processing, Big Data analysis, machine learning,
and data enhancement? Because data that seems at first uninteresting can be very valuable to the right
audience, how do you find that audience to monetize the insights gained from processing it?
The elements that are needed for security, communication, and scale in an IoT solution make it very
challenging to build one from scratch. Succeeding with any IoT solution will very likely require the
implementation of a reference architecture that can help accelerate the use of massive data from millions
or even billions of devices. Modeling the system’s capacity to scale, and calculating the costs to do so for
related aspects, such as ingress (device to cloud) and egress (cloud to device, cloud to system) paths
and system processing, is paramount. Depending on the company background, a classic buy vs. build vs.
hybrid decision should be made, based on what you are already using, what is available, and what will be
available in the near future at a price that is acceptable to your business. This white paper introduces and
describes all of these considerations and provides you with the tools necessary to estimate the
operational cost of an implemented reference architecture in production.
With the Microsoft Azure platform, Microsoft offers a broad set of building blocks to help you get an IoT
solution up and running quickly.
IoT and predictive maintenance
At Microsoft, we hear constantly from customers who say that the Internet of Things (IoT) is one of the
most exciting trends in IT.1 Many of our customers are interested in deploying sensors and devices in
every part of their businesses in order to capture information from the physical world and act upon the
knowledge gained from refining it. They will buy or build systems that can deliver these capabilities in
order to optimize their bottom line, keep customers satisfied, and explore new revenue potential.
Predictive maintenance is an IoT scenario where a device can provide data that leads to insightful,
proactive maintenance before the likely failure can take place. Predictive maintenance offers a new
revenue stream for device manufacturers, and it is very interesting to their customers because it enables
better business continuity, which usually generates extra revenue. In this way, the cost of a new service
from a device manufacturer is justifiable to customers, given the cost and impact on them of unplanned
downtime of the device.
The Internet of Things
An expert in radio-frequency identification named Kevin Ashton first used the term “the Internet of Things”
in 1999,2 though the idea had been around at least a decade earlier. As with many terms in technology,
IoT is a loaded term that people interpret differently depending on their viewpoint and purpose. For
example, Gartner defines it as “The network of physical objects that contain embedded technology to
communicate and interact with their internal states or the external environment.” 3 Formulating this
differently:
The Internet of Things is a metaphor for a set of systems in which direct human intermediation is
dramatically reduced by equipping distributed systems with sensors that let us acquire information,
make decisions, and control things in the physical world.
Based on this definition, IoT consists of a set of four composable
activities:

Acquiring data. Using sensors to record information about
the physical world. Examples include measuring location,
humidity, temperature, light, heart rate, blood pressure, brain
waves, current, and gas detection.

Processing information. Taking action based on data
captured and on contextual information retrieved previously
or sourced from other systems. This processing could
involve using actuators that can alter the state of the
physical world, such as opening valves, switching machines
on and off, sounding alarms, controlling servos, closing
doors, and many other things.

Storing information. To enable trend analysis, forecasting
and insight-driven decision making, historical information
and context is needed. Storing the information retrieved in its contextual form (for example, including
the location where it was captured, the date and time it was captured, the state of the system at the
time it was captured, and so on) is critical for this process.

Publishing insights. When embedded sensor data is combined with both internal and external data
from other systems, additional insight from analyzing the data can be learned and acted upon.
Exposing that insight can also drive additional value for other stakeholders outside the immediate
needs of the current system, allowing for the monetization of this knowledge.
Figure 1. Foundational activities, composable within and between devices and systems
1 Microsoft, “What Our Customers Are Saying: Top Enterprise Trends of 2014,” Susan Houser
2 Wikipedia, Internet of Things
3 Gartner, IT Glossary, Internet of Things
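As a minimal sketch of how the four activities compose, consider the following Python example. It is illustrative only: the names Reading, acquire, process, store, and publish, the 80 °C alarm limit, the simulated sensor values, and the in-memory list that stands in for storage are all assumptions made for this sketch and are not part of any architecture described in this paper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from statistics import mean
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Reading:
    """A sensor value kept together with the context in which it was captured."""
    device_id: str
    location: str
    captured_at: datetime
    temperature_c: float


def acquire(device_id: str, location: str, sensor: Callable[[], float]) -> Reading:
    """Acquiring data: record a value from the physical world, with context."""
    return Reading(device_id, location, datetime.now(timezone.utc), sensor())


def process(reading: Reading, limit_c: float = 80.0) -> Optional[str]:
    """Processing information: decide on an action (hypothetical threshold rule)."""
    return "sound_alarm" if reading.temperature_c > limit_c else None


def store(history: List[Reading], reading: Reading) -> None:
    """Storing information: keep the reading in its contextual form."""
    history.append(reading)


def publish(history: List[Reading]) -> dict:
    """Publishing insights: expose an aggregate that other stakeholders can use."""
    return {
        "samples": len(history),
        "mean_temperature_c": mean(r.temperature_c for r in history),
    }


history: List[Reading] = []
actions = []
for value in (71.5, 79.0, 84.5):  # simulated sensor values, not real measurements
    reading = acquire("pump-01", "plant-a", lambda v=value: v)
    store(history, reading)
    actions.append(process(reading))

insight = publish(history)
```

Because each activity is a separate function over the same Reading type, the steps can run on one device or be split between a device and a back-end system, which is the sense in which the activities are composable.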
On top of familiar devices, such as phones for input and presentation, a set of core components to
support those activities is needed, though business goals and technical constraints will drive those that
are required. Core components may include:

Sensors: the components that translate a value from the physical world into bits. Examples include
sensors that measure pressure, humidity, heart rate, gas levels, and acceleration.

Devices: networked, physical, special-purpose systems that emit telemetry data, accept external
information, request external information, and execute remotely-issued commands. Examples include
factory floor equipment, environmental pollution sensors, and control modules in vehicles.

Bridges: systems that act as communication brokers between a device and a gateway, typically by
translating data traffic between different link protocols or methods, for instance between short-range
and long-range wireless protocols. A bridge can also be a connectivity infrastructure that manages a
nationwide or world-wide wireless network on one side, and a bridge to a cloud system on the other.
A bridge might also perform intelligent preprocessing of data, or act as an autonomous local
communications hub in addition to its bridging function relative to a cloud system4. Bridges are often
also referred to as gateways, but we reserve the term gateway for a network-based service with
which a bridge communicates.

Gateways: network-based services that manage connectivity and connections with devices either
directly or through bridges. The service establishes a trusted communication relationship with a
device, deals with ingestion and routing of telemetry data, and provides access to command and
notification data destined for the device. On top of these services, it provides data pipeline
processing, possibly including transformation, complex-event processing capabilities, data analytics
components, machine learning, and so on.

Machine learning: computational algorithms that can analyze large amounts of data and extract
patterns from it to help a system act and “learn” from that data to drive more intelligent system
responses in the physical world.

Interconnections: different systems sharing learnings and data that in turn form composite systems.
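The division of labor between device, bridge, and gateway can be sketched as follows. This is a simplified illustration under stated assumptions, not a real protocol: the 7-byte frame layout, the SENSOR_NAMES table, the function names, and the Gateway class are hypothetical, and the set of trusted device IDs merely stands in for the trusted communication relationship that a real gateway would establish through authentication.

```python
import json
import struct
from typing import Dict, Iterable, List

# Hypothetical short-range frame: device id (uint16), sensor type (uint8),
# value (float32), big-endian. Real link protocols differ.
FRAME = struct.Struct(">HBf")
SENSOR_NAMES = {1: "pressure", 2: "humidity"}


def device_emit(device_id: int, sensor_type: int, value: float) -> bytes:
    """Device: emit one telemetry value as a compact binary frame."""
    return FRAME.pack(device_id, sensor_type, value)


def bridge_translate(frame: bytes) -> str:
    """Bridge: translate the link-level frame into a message for the gateway."""
    device_id, sensor_type, value = FRAME.unpack(frame)
    return json.dumps({
        "device": device_id,
        "sensor": SENSOR_NAMES.get(sensor_type, "unknown"),
        "value": round(value, 2),
    })


class Gateway:
    """Gateway: accept telemetry only from known devices and route it per device."""

    def __init__(self, trusted_devices: Iterable[int]) -> None:
        self.trusted = set(trusted_devices)  # stand-in for real authentication
        self.telemetry: Dict[int, List[dict]] = {}
        self.rejected = 0

    def ingest(self, message: str) -> bool:
        payload = json.loads(message)
        if payload["device"] not in self.trusted:
            self.rejected += 1
            return False
        self.telemetry.setdefault(payload["device"], []).append(payload)
        return True


gateway = Gateway(trusted_devices=[17])
accepted = gateway.ingest(bridge_translate(device_emit(17, 1, 101.325)))
blocked = gateway.ingest(bridge_translate(device_emit(99, 2, 40.0)))
```

The device never needs to know the gateway's message format, and the gateway never sees the link-level frame; only the bridge has to understand both sides.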
We have read thought-provoking papers about IoT. Two that we found especially valuable in providing
context to the concepts and opportunities of IoT are:

“Recommendations for the Strategic Initiative INDUSTRIE 4.0.” 5

“Industrial Internet: Pushing the Boundaries of Minds and Machines, a European Perspective.”6
IoT enables you to build, enhance or extend a business model based on data-driven insights from
pervasive sensors that help you optimize resource use and reduce cost and environmental impact. IoT
also helps you maintain a closer relationship with customers beyond the point of sale of physical products
by enabling contextual, remote actions automatically and intelligently. Examples include remote servicing,
proactive sales, best-practices guidance, and more.
4 Microsoft, “How Microsoft tech is helping affordable housing tenants save money” (section on “Captain”)
5 Deutsche Akademie der Technikwissenschaften, Final report of the Industrie 4.0 Working Group
6 General Electric, “Industrial Internet: Pushing the Boundaries of Minds and Machines, a European Perspective”
Business value
At least 26 billion devices will be connected to the Internet by 2020, and organizations in every sector will
use them.7 Billions of connected devices will help businesses to:

Reduce cost. Businesses can use the increased insight into manufacturing and delivery processes to
optimize those processes and reduce cost. For example, a business can reduce the number of scheduled
visits a technician must make by scheduling service visits based on duty cycles and expected product
lifespans informed by actual usage.

Create new revenue streams. The ability to sense and actuate in the physical world is giving rise to
new business models. Businesses can capitalize on these new opportunities and create
innovative revenue streams. Some examples would be monetizing newly collected datasets, offering
APIs to create new business partnerships, increasing service revenue by notifying and offering
improved convenience to customers, offering differentiating SKUs based on usage patterns,
supplying optimized configuration services, and so on.

Increase customer satisfaction and retention. By knowing how customers of physical products use
them, opportunities exist to extend the customer experience into scenarios of higher value, and retain
and extend the customer base. Capturing data on how customers actually use products, and ensuring
that they do not experience frustrating service issues, helps companies retain customers.
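The cost-reduction example above, scheduling service visits from duty cycles and actual usage rather than from a fixed calendar, can be sketched in a few lines of Python. The UsageReport fields, the 90 percent service threshold, and all of the numbers are hypothetical values chosen for illustration; a real solution would derive them from device telemetry and product data.

```python
import math
from dataclasses import dataclass


@dataclass
class UsageReport:
    rated_cycles: int       # expected lifespan in duty cycles (hypothetical)
    cycles_used: int        # duty cycles reported by the device so far
    cycles_per_day: float   # recent average derived from telemetry


def days_until_service(report: UsageReport, service_at_pct: int = 90) -> int:
    """Days until the part reaches service_at_pct percent of its rated cycles."""
    if report.cycles_per_day <= 0:
        raise ValueError("cycles_per_day must be positive")
    service_point = report.rated_cycles * service_at_pct // 100
    remaining = service_point - report.cycles_used
    return max(0, math.floor(remaining / report.cycles_per_day))


light_use = UsageReport(rated_cycles=100_000, cycles_used=40_000, cycles_per_day=100)
heavy_use = UsageReport(rated_cycles=100_000, cycles_used=85_000, cycles_per_day=500)

light_days = days_until_service(light_use)   # 500
heavy_days = days_until_service(heavy_use)   # 10
```

With these assumed figures, a fixed schedule would send a technician to both machines at the same interval, whereas the usage-based estimate defers the lightly used machine and prioritizes the heavily used one.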
In the blog post “10 reasons businesses need a strategy for the Internet of Things now,”8 the author
identified a concise set of benefits that a company can realize by adopting an IoT strategy.
Megatrends
The world faces many challenges, such as changes in wealth distribution, resource scarcity, and an aging
population in developed countries. The authors of the book “From Machine-to-Machine to the Internet of
Things: Introduction to a New Age of Intelligence” analyzed these megatrends and capabilities in detail.9
They found that these megatrends are driving a proliferation of embedded devices with sensors, which in
turn require new capabilities for new market scenarios, as the graphic below shows.
7 Gartner, “Gartner says the Internet of Things Installed Base Will Grow to 26 billion units by 2020”, December 2013
8 Microsoft, 10 reasons businesses need a strategy for the Internet of Things now
9 “From Machine-to-Machine to the Internet of Things: Introduction to a New Age of Intelligence”, ISBN 978-0124076846
Figure 2. "Megatrends." From Machine-To-Machine to the Internet of Things: Introduction to a New Age of
Intelligence. Amsterdam, Netherlands, Elsevier, January 2014.
Of the megatrends listed in the previous figure, the following relate most directly to the Internet of
Things:

Natural resource constraints. The world population is growing at a high rate, with a projected peak
population of 9.22 billion in 2075.10 Given this growth and the impact it has on the growth of the
worldwide economy, the world will increasingly have to do more with less, and optimize the way that
we produce. IoT can support the optimization of production, loss reduction, and the efficiency of the
necessary supply chain.

Economic shifts. Much like the shift in IT, going from packaged products to as-a-service solutions,
the global economy is moving from a product-oriented to a service-oriented perspective.11 For a
viable service-oriented economy to come into existence, it needs to be supported by a large set of
devices that provide context to the customer environment for the system in order to offer the right
service, at the right price, and at the right time.

Changing demographics. With the world population, especially in more-developed countries,
increasingly aging, the change in demographics will need smart solutions that can help elderly people
remain self-supporting.

Climate change. The impact of human activities on the environment, although debated at length, is
detrimental to the sustainability of the world. In recent years, there has been a growing movement of
“green” technologies and services, ranging from electric cars to corporate and government policy
changes. IoT can support both insight into an environmental footprint and its reduction.
10 United Nations, Economic & Social Affairs, World Population to 2300
11 Wikipedia, Service economy
Technology enablers
The ever-decreasing cost and size of components, such as accelerometers, Wi-Fi radios,12 GPS,
microcontrollers, and Bluetooth radios is also enabling the Internet of Things (IoT). It allows components
and devices to be used in new settings, such as wearables, on-person devices, and even smaller
equipment.
As shown in Figure 2, IoT depends on several other major technologies and trends. Some of these
technology enablers as well as others warrant clarification:

Ubiquitous connectivity. Low-powered wireless networking enables devices to talk to a gateway,
among each other, or directly to the outside world. A foundation for IoT implementations, connectivity
must be managed carefully. To learn more, see the Connectivity section in this paper.

Cloud computing. For systems that connect hundreds of millions of devices, cloud computing is the
technology that allows for vast scale and acceptable costs, providing the ability to store large
amounts of machine generated data at low cost and perform Big Data analytics and machine
learning.

Small, low-power, low-cost microcontrollers. Microcontrollers today can perform tasks at very low
power and have a battery life of many years.13 For example, the Texas Instruments MSP430 runs at
less than 100µA/MHz and can operate on a single coin battery for more than 20 years.14 (Device
battery life always depends on components and application cycle use.) The memory embedded in
this microcontroller is ferroelectric random-access memory (FRAM), an improvement on flash memory that
sports very high data throughput at a power consumption three times lower than flash memory and 99
percent lower than comparable dynamic random-access memory (DRAM).

Power supply and storage technologies. Given the tiny size of many new devices, their
deployment location, and the vast number of them that will be deployed, changing batteries is often
impractical or impossible. Besides optimizing hardware design for these scenarios, 15 limiting the
quiescent current (Iq) of the circuitry will further improve battery life. Also, with energy
harvesting techniques, such as solar power supplies, devices can recharge their built-in batteries as
long as there is a minimal charge left.

Embedded operating system platforms. With the vast number of devices that will be installed, cost
and energy consumption per device become decisive. Engineers will create devices that cost less
and that are more energy-efficient, even if they have limited processing capabilities and memory.
CPU cycles spent and memory allocated will become important factors for choosing operating
system platforms, installed components, and security configurations. There is a plethora of good
general-purpose operating systems, ranging from Windows Embedded and Embedded Linux to
real-time operating systems, such as FreeRTOS, ThreadX, Integrity, Nucleus, QNX, Atomthreads,
AVIX-RT, ChibiOS/RT, ERIKA Enterprise, TinyOS, Thingsquare Mist/Contiki, and others.16
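The battery-life claims in the microcontroller and power-supply items above come down to average current draw. The following is a minimal sketch of that arithmetic, assuming an idealized battery (no self-discharge, no temperature effects) and a device that alternates between one active state and one sleep state. The function names and all numeric values are illustrative assumptions, not figures from a datasheet.

```python
def average_current_ma(active_ma: float, sleep_ma: float, duty_cycle: float) -> float:
    """Time-weighted average current for a device that is active for a
    fraction `duty_cycle` of the time and asleep for the rest."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * active_ma + (1.0 - duty_cycle) * sleep_ma


def battery_life_hours(capacity_mah: float, avg_current_ma: float) -> float:
    """Idealized battery life: capacity divided by average current draw."""
    if avg_current_ma <= 0:
        raise ValueError("average current must be positive")
    return capacity_mah / avg_current_ma


if __name__ == "__main__":
    # Hypothetical device: 2 mA while active, 1 microamp asleep, active 0.1%
    # of the time, powered by a coin cell assumed to hold 225 mAh.
    avg = average_current_ma(active_ma=2.0, sleep_ma=0.001, duty_cycle=0.001)
    hours = battery_life_hours(capacity_mah=225.0, avg_current_ma=avg)
    print(f"average current: {avg:.6f} mA, estimated life: {hours / 8760:.1f} years")
```

Because the sleep term is weighted by almost the entire duty cycle, the sketch also shows why limiting quiescent current matters as much as reducing active current for devices that sleep most of the time.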
In sum, IoT is gaining momentum because of growing customer and enterprise needs meeting
technology enablers at the right cost.
12 For example, a network chip for less than $10 for 1,000 units. Texas Instruments, SimpleLink™ Wi-Fi Module CC3000
13 maxEmbedded, What is a microcontroller? And how does it differ from a microprocessor?
14 Texas Instruments, MSP430 documentation
15 Texas Instruments, Using power solutions to extend battery life in MSP430 applications
16 For a comprehensive list, see http://en.wikipedia.org/wiki/List_of_real-time_operating_systems
Standardization efforts
Throughout the world, many organizations are working on the standardization of IoT, based on specific
technology or holistically on reference architectures. Examples of this work include:

ITU-Telecom (ITU-T), Internet of Things Global Standards Initiative (IoT-GSI).17

European Union, Internet of Things Architecture (IoT-A).18
In addition to these efforts, a lot of in-depth work is going on in many different technology areas,
such as the standardization of protocols. Protocol choices, at both the transport and the application
layers, are discussed later in this document.
Predictive maintenance
This white paper focuses on a common scenario IoT enables that we call predictive maintenance:
performing maintenance with a focus on timeliness, acting exactly when needed instead of at regular
intervals, and predicting and preventing failures before they happen, based on learning from historical
data. Predictive maintenance—just-in-time maintenance—will massively transform how organizations and
consumers manage equipment as well as people. Predictive maintenance also informs more traditional
preventative maintenance patterns, optimizing routine maintenance activities.
17 ITU-T, Internet of Things Global Standards Initiative
18 European Union, Internet of Things Architecture
Predictive maintenance scenarios
The potential for useful applications in the Internet of Things (IoT) is endless. This section focuses on
scenarios that illustrate concrete benefits based on predictive maintenance, where maintenance can be
performed on both inanimate and living things. The scenarios that follow provide
examples of the enormous potential that IoT holds for enterprises.
Healthcare
With the previously described change in world demographics, there is an increasing
need for “remote patient management,” allowing elderly citizens to only come to the
doctor or the hospital when the need arises, based on telemetry captured by smart
devices. Some early innovation in this space, more geared toward health self-management
and consumer devices, can be seen in watches with sensors that collect
a variety of data, such as blood pressure and heart rate. When body temperature,
oxygen levels, and CO2 levels are combined with the ability to display this data to the patient and
physician in real time, this alleviates the stress of full waiting rooms and reduces the cost per patient.19
Another example is an in-home glucose monitor that uploads a patient’s vital signs to a cloud-based
health platform, where the data is analyzed and presented back to the patient in an easy-to-understand
format on a mobile device, and in a more complex format on a touchscreen to the doctor. The doctor can
review the patient’s information and then use the touchscreen to send feedback to the patient and write a
prescription.20
Powerful, specialized, cloud-connected devices like these that enable doctors and patients to work
together to remotely monitor vital signs, exchange information, communicate, and alert relatives, all in
real time, are either becoming available or in development. By actively monitoring patients at home21,22 or
while they are mobile, healthcare professionals can provide a higher level of care, reduce in-hospital
waiting time and costs, and reduce stress for everyone involved, which leads to better patient outcomes.
Using technology to accurately predict conditions that need attention, and to signal medical staff about
them, enables healthcare professionals to anticipate patient issues instead of reacting to them, and to remedy
them before they become critical, all while maintaining the security and privacy of the data collected from
such technologies.23 As a positive side effect, the collected evidence of provided care could also help
alleviate the issue where doctors in the U.S. are sometimes reluctant to provide prescriptions or diagnoses
over the phone because of billing restrictions,24 which forces patients to visit the office of the healthcare
provider and, as a result, wastes a lot of everyone’s time on the treatment of common or recurring ailments.
19 “Samsung Simband aims to take a big step in wearable health,” www.cnet.com/products/samsung-simband/
20 Microsoft Healthvault Medical Intelligent System, www.youtube.com/watch?v=j8Y4ukdNM60
21 Medical Design Technology Magazine, The Internet of Things and Medical Device Product Development: Practical Strategy Suggestions, March, 2014
22 YouTube, Medical Intelligent System, Proof of Concept
23 Deloitte, Networked medical device cybersecurity and patient safety: Perspectives of health care information cybersecurity executives
24 Texas Medical Association, Coding for Telephone Consultations
Automotive
Vehicles generate telemetry about their operation, and about the service activities and
faults that occur on them. They travel through different locations, different weather
conditions, and different usage scenarios—a four-wheel drive vehicle climbing trails,
a sports car in the mountains, or a family van loaded with children. Each of these
factors can have an effect on how the vehicle operates, as well as its reliability,
comfort, safety, and performance. If the vehicle manufacturer or a vendor-agnostic
data aggregator/analyst can collect this data, and analyze it over time, trends can be identified to find
new, timelier, and more cost-effective and impactful actions to take. These can include maintaining or
reconfiguring the vehicle, which in turn can help to prevent recalls or, conversely, trigger recalls, to keep
the vehicle safe, and more fun, useful, and cost-effective for everyone involved, including the owner, the
operator, and the passengers.
Manufacturing
A service technician is dispatched to analyze an elevator after someone reports that
its doors will not close. The building owner is hearing from people who are unhappy
that they have to walk up the stairs. It takes the technician an hour to drive to the
building and find the elevator. After arriving, he works through a standard checklist for
another hour, only to conclude that the elevator works as expected. As so often
happens, a fleeting obstruction, such as a coffee cup between the doors of the
elevator or accumulated dust and dirt in the sliding rail might have caused the problem.
The service technician drives back to his office, having spent a total of three hours on a phantom
problem. At $150 (USD) per hour, that is $450 for a single visit. With more than one million elevators in
service, incidents where equipment is evaluated as operating normally upon inspection, such as in this
scenario, can have a big impact on the profitability of an elevator company, depending on the type of
maintenance contract.
Moving beyond this reactive maintenance illustration, capturing telemetry about the motors that operate
the elevator or the speed at which its doors close allows the technician to take a more
predictive approach. For example, an increase in the consumption of energy or a decrease in the door
closing speed might signal a service request, and trigger a maintenance crew to provide the service
before the elevator breaks down and customers call support, thus saving money, reducing downtime, and
increasing customer satisfaction.
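The elevator example above can be sketched as a simple drift check: compare recent readings against a baseline and flag a service request when the change exceeds a tolerance. This is only an illustration of the idea, not the analytics used in any customer project. The function name, the 10 percent tolerance, and the sample readings are all assumptions.

```python
from statistics import mean


def needs_service(baseline, recent, *, higher_is_worse, tolerance=0.10):
    """Return True when the mean of `recent` readings has drifted from the
    mean of `baseline` readings by more than `tolerance` (a fraction) in
    the direction that indicates wear."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean must be non-zero")
    drift = (mean(recent) - base) / base
    return drift > tolerance if higher_is_worse else drift < -tolerance


if __name__ == "__main__":
    # Hypothetical readings: motor energy per trip (Wh) and door closing speed (m/s).
    energy_alert = needs_service([50, 51, 49, 50], [56, 57, 58], higher_is_worse=True)
    speed_alert = needs_service([0.50, 0.51, 0.49], [0.49, 0.50, 0.50], higher_is_worse=False)
    print("energy alert:", energy_alert, "door speed alert:", speed_alert)
```

A production system would more likely learn thresholds from historical data than hard-code them, but the trigger logic has the same shape.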
Architectural considerations
Designing any system reveals concerns that transcend the individual components of the system. In this
section, we discuss various considerations and architectural approaches that we have encountered while
helping our customers design solutions in the realm of predictive maintenance.
Connectivity
Figure 3. An overview of network layers and mapped logical protocols
A key technical enabler of the Internet of Things (IoT) is ubiquitous connectivity. Let’s first look at the
Open Systems Interconnection (OSI) model.25 Even though the Internet model uses a simplified
abstraction, the models in the previous figure and the associated well-known logical protocols are
comparable.
Application-layer protocols are not concerned with the lower-level layers in the stack other than being
aware of the key attributes of those layers, such as IP addresses and ports. The right side of the figure
shows the logical protocol breakdown transposed over the OSI model and the TCP/IP model.
Interaction patterns
Special-purpose devices differ not only in the depth of their relationship with back-end services, but in the
interaction patterns of these services when compared to information-centric devices because of their role
as peripherals. They are not the origin of command-and-control gestures; instead, they typically
contribute information to decisions, and receive commands as a result of decisions. The decision-maker
does not interface with them locally, and the device acts as an immediate proxy; the decision-maker is
remotely connected and might be a machine. We usually classify interaction patterns for special-purpose
devices into the four categories indicated in the following figure.
25 Wikipedia, OSI Model
Figure 4. Device communication patterns

Telemetry is information flowing in one direction that a device volunteers to a collecting service,
either on a schedule or based on circumstances. That information represents the current or
temporally aggregated state of the device or the state of its environment, such as readings from
sensors that are associated with it.

Notifications are one-way, service-initiated messages that inform a device or a group of devices
about some environmental state that they would otherwise not be aware of. For example, wind parks
can be fed weather forecast information, and cities can broadcast information about air pollution,
suggesting that fossil-fueled systems throttle their CO2 output. Similarly, vehicles may want to show
weather or news alerts or text messages to drivers.

Inquiries occur when a device solicits information about the state of the world beyond its own reach
based on its current needs; an inquiry can be a singular request, but it might also ask a service to
supply ongoing updates about a particular information scope. For example, a vehicle might supply a
set of geo-coordinates for a route, and then ask for continuous traffic alert updates about that
route until it arrives at the destination.

Commands are service-initiated instructions sent to either a single device or a group of devices.
Commands can tell a device to provide information about its state, or to change the state of the
device, including activities with effects on the physical world. That includes, for instance, sending a
command from a smartphone app to unlock the doors of your vehicle, whereby the command first
flows to an intermediating service and then from there is routed to the vehicle's onboard control
system.
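The four patterns above can be captured in a small data structure that records which side starts each exchange. This is a sketch for illustration only; the type and function names are hypothetical and do not come from any particular SDK.

```python
from enum import Enum


class Initiator(Enum):
    DEVICE = "device"
    SERVICE = "service"


class Pattern(Enum):
    TELEMETRY = "telemetry"        # device volunteers information
    INQUIRY = "inquiry"            # device solicits information
    NOTIFICATION = "notification"  # service informs devices
    COMMAND = "command"            # service instructs devices


# Which side starts the exchange for each pattern, as described in the list above.
INITIATED_BY = {
    Pattern.TELEMETRY: Initiator.DEVICE,
    Pattern.INQUIRY: Initiator.DEVICE,
    Pattern.NOTIFICATION: Initiator.SERVICE,
    Pattern.COMMAND: Initiator.SERVICE,
}


def is_service_initiated(pattern: Pattern) -> bool:
    """True when the service must be able to reach the device unprompted."""
    return INITIATED_BY[pattern] is Initiator.SERVICE
```

Making the initiator explicit helps when deciding which message types require an always-reachable device and which can be sent whenever the device chooses to wake up.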
Telemetry and inquiries are device-initiated, and their counterparts, commands and notifications, are
service-initiated. This means that there must be a network path for messages to flow from the service to
the device, which bubbles up a set of important technical questions. How do you:

Address a device on a network when it is roaming or if it is power-constrained and duty cycling the
radio to conserve energy?26, 27

Send commands or notifications with acceptable latency for a given scenario?

Ensure that the device only accepts legitimate commands and trustworthy notifications?
26 Wikipedia, Duty Cycle
27 Georgia State University, ActSee: Activity-Aware Radio Duty Cycling for Sensor Networks in Smart Environments

Ensure that the device is not easily susceptible to denial-of-service (DoS) attacks that render it
inoperable?

Perform this with millions of devices attached to a telemetry-and-control system?
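One of the questions above, accepting only legitimate commands, is often approached with message authentication. The following is a minimal sketch using an HMAC over the command payload, assuming the device and service already share a secret key (how that key is provisioned is out of scope here). The function names and message layout are hypothetical. The sketch does not address replay protection or confidentiality, which a real design would also need.

```python
import hashlib
import hmac
import json

TAG_LENGTH = 32  # size of an HMAC-SHA256 tag in bytes


def sign_command(key: bytes, command: dict) -> bytes:
    """Serialize a command and prefix it with an HMAC-SHA256 tag."""
    payload = json.dumps(command, sort_keys=True, separators=(",", ":")).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def verify_command(key: bytes, message: bytes):
    """Return the command dict if the tag is valid, otherwise None."""
    tag, payload = message[:TAG_LENGTH], message[TAG_LENGTH:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(tag, expected):
        return None
    return json.loads(payload)


if __name__ == "__main__":
    key = b"k" * 32  # placeholder key for illustration only
    message = sign_command(key, {"action": "unlock"})
    print(verify_command(key, message))
    print(verify_command(key, message[:-1] + b"X"))  # tampered payload is rejected
```

On very constrained devices, a symmetric scheme like this is usually cheaper than public-key signatures, at the cost of having to protect the shared key on every device.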
Connectivity pathways
In the architectures that we have worked on, there are four common
connectivity pathways:

Peer-to-peer: A method of communication between devices of a system
without the use of a centralized administrative system. The peers in the
network can exchange information and communicate only the necessary
information back to the system. Besides providing the ability to create
specific case and self-organizing networks of devices, this method of
communication enhances the capabilities of the system—nodes can
work together to become smarter. The disadvantage for smart systems
of this type of inter-device communication is the lack of centralized
control, and the impact that this has on the security of the system. It also
requires a higher level of logic (“intelligence”) in some or all peers to
use peer-to-peer communication.

Device-to-service: A device that communicates to a supporting backend in the system, often called the service.

Service-to-device: A service that communicates to a device; the
opposite of the previous connectivity pathway.

Service-to-service: Communication between two separate systems,
exchanging data to augment knowledge in the system.
Figure 5. Communication styles
From the work that the authors have done, we have learned that for predictive maintenance
implementations, a bi-directional communication pattern is key to a manageable solution. The reason for
this bi-directional communication ability is to ensure that the system can tell devices to change the way
that they capture telemetry, for example, the rate at which it is captured or the fidelity of the readings. We
have not come across a case where the requirements were simply to capture data from devices in a
one-way communication flow. Because most systems will need a method of telling devices to capture data at
differing frequency or with increased fidelity, we consider a one-way communication flow a subset of the
more common pattern.
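As a sketch of why the return path matters, the following shows a device applying a service-initiated command that changes its telemetry rate and fidelity. The command name `set_sampling`, its fields, and the accepted ranges are assumptions made for this example, not part of any specific protocol.

```python
from dataclasses import dataclass


@dataclass
class TelemetryConfig:
    interval_s: float = 60.0  # seconds between readings
    precision: int = 1        # decimal places reported per reading


def apply_command(config: TelemetryConfig, command: dict) -> TelemetryConfig:
    """Return a new config reflecting a 'set_sampling' command.

    Unknown command types leave the config unchanged; out-of-range
    values are rejected so a bad command cannot stall the device."""
    if command.get("type") != "set_sampling":
        return config
    interval = float(command.get("interval_s", config.interval_s))
    precision = int(command.get("precision", config.precision))
    if interval <= 0 or not 0 <= precision <= 6:
        raise ValueError("sampling parameters out of range")
    return TelemetryConfig(interval_s=interval, precision=precision)


if __name__ == "__main__":
    config = TelemetryConfig()
    # The service asks for faster, more precise readings while it investigates.
    config = apply_command(config, {"type": "set_sampling", "interval_s": 5, "precision": 3})
    print(config)
```

Without a service-to-device path, changing these parameters would require a firmware update or a site visit.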
Connectivity network types
The connectivity type describes how a device and service communicate. The type of connectivity
chosen for a system has broad implications for its architecture. We commonly see three types of
connectivity with different implementations and implications:
Figure 6. The increasing geographical reach of varying network types

Wide area network (WAN). A good example of a WAN is a cellular network. This network type is a
wireless network that is distributed over land areas called cells, each served by at least one
fixed-location transceiver, known as a cell site or base station. In a cellular network, each cell uses a
different set of frequencies from neighboring cells, to avoid interference and provide guaranteed
bandwidth within each cell. When joined together, these cells provide radio coverage over a wide
geographic area. This enables a large number of portable transceivers (for example, mobile phones,
pagers, and so on) to communicate with each other and with fixed transceivers and telephones
anywhere in the network, via base stations, even if some of the transceivers are moving through more
than one cell during transmission.28 The most common cellular network is the type that cellphones
use. Cellphones and many integrated components for devices support network technologies, such as
Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System
(UMTS) and as an evolution technology, Long Term Evolution (LTE), as well as others. As with many
technologies in the ecosystem of IoT, there is still a large opportunity for optimization of resource
usage and cost for these technologies.29

Local area network/wireless local area network (LAN/WLAN). A LAN uses networking technology
to connect computers and devices in a limited area, such as a home, school, computer laboratory, or
office building. Unlike WANs, LANs cover a limited geographic area, and do not include leased
telecommunication lines. Ethernet over twisted-pair cables and Wi-Fi are the two most common
technologies used in LANs today. Though Ethernet 10/100Base-T structured cabling is the basis for
many commercial LANs, fiber-optic cabling is increasingly used in commercial applications. Cabling is
often inconvenient or impossible to use. With the increasing capability of WLAN devices that use
radio waves based on the Wi-Fi industry standard, WLAN is now the standard for wireless
connectivity. Wi-Fi has a maximum range of about 250 meters outdoors.30 The connection between
different LANs, often owned by a single entity and extending its range inside a metropolitan
area, is referred to as a Metropolitan Area Network (MAN).
28 Wikipedia, Cellular network
29 Ericsson Labs, “4G for IoT”

Personal area network (PAN). One of the most interesting developments in networking is the use of
a PAN to transmit data among devices, such as computers, telephones, and personal digital
assistants. PANs can be used to communicate among the devices themselves (intrapersonal
communication), or to connect to the Internet. Until recently, PAN devices could not communicate
over IP, so they needed a bridge to translate between their proprietary protocol and IP. With the
introduction and adoption of IPv6 over low-power wireless personal area networks (6LoWPAN), these
devices, which use developing standards, such as Bluetooth LE,31 will communicate via IP directly
and take a more active role in IoT.
A wireless personal area network (WPAN) is a PAN carried over wireless network technologies such
as the following:
− Bluetooth and Bluetooth Low Energy (LE): A wireless technology standard for exchanging data
over short distances (using short-wavelength UHF radio waves in the ISM band from 2.4 to 2.485
GHz) from fixed and mobile devices, and for building personal area networks (PANs). Invented by
telecom vendor Ericsson in 1994, it was originally conceived as a wireless alternative to RS-232 data
cables. It can connect several devices, overcoming problems of synchronization. Bluetooth LE 32
uses 5 to 10 times less power than older Bluetooth,33 making it a good fit for certain IoT
applications.
− Z-Wave: A wireless communications protocol designed for home automation to remotely control
applications in residential and light commercial environments. Z-Wave is licensed through the
Z-Wave Alliance.34
− ZigBee(-IP): Built on IEEE 802.15.4, the physical layer for low-rate WPANs, ZigBee35 is often used
to transmit low-powered periodic or intermittent data or a single signal from a sensor or input
device, wireless light switch, electrical meter with in-home-display, traffic management system, or
other consumer and industrial equipment that requires short-range wireless data transfer at
relatively low rates. The new ZigBee IP,36 based on 6LoWPAN, lets ZigBee devices communicate
without a bridge.
For an interesting comparison of power consumption between ZigBee and Bluetooth LE, see “Power
Consumption Analysis of Bluetooth Low Energy, ZigBee and ANT Sensor Nodes in a Cyclic Sleep
Scenario”.37 Here are some of our observations from this study:
− BLE would appear to have an intrinsic disadvantage in a cyclic sleep scenario because the
frequency hopping scheme it uses inherently takes longer to connect compared to the fixed RF
channel used in ZigBee and ANT.
30 Wikipedia, IEEE 802.11, Protocols
31 IEEE, Transmission of IPv6 Packets over BLUETOOTH Low Energy
32 Wikipedia, Bluetooth Low Energy
33 Bluetooth SIG, A look at the Basics of Bluetooth Technology
34 Z-Wave alliance, Z-Wave For Developers And OEMs: How To Get Started
35 ZigBee Alliance, ZigBee Specification Overview
36 ZigBee Alliance, ZigBee IP Specification Overview
37 Artem Dementyev, Steve Hodges, Stuart Taylor and Joshua Smith, Power Consumption Analysis of Bluetooth Low Energy, ZigBee and ANT Sensor Nodes in a Cyclic Sleep Scenario
− BLE took longer for one connection (1.15 s) than ANT (0.93 s) and ZigBee (0.25 s). This is
because the BLE node was able to sleep for longer between individual RF packets, improving its
duty cycle significantly.
− We found that BLE achieved the lowest power consumption, followed by ZigBee and ANT. The
parameters that dominated power consumption were not the active or sleep currents but rather the
time required to reconnect after a sleep cycle and to what extent the RF module slept between
individual RF packets.
Protocol choices
After you have chosen a connectivity type, you need to determine which protocols suit the purpose of
your IoT solution. As you can see in the overview of logical protocols, the term protocol applies to
different layers in the stack, and there are different protocols to choose from for each layer.
Transport-layer protocol choices
The transport layer provides communication services in the layered architecture of a network. In the
Internet era, two such protocols have emerged as favorites: the connection-oriented Transmission Control
Protocol (TCP) and the connectionless User Datagram Protocol (UDP). Depending on the environment
that an IoT system must function in, the capabilities of its devices, and how much it must guarantee
message delivery, you can choose to support either one of these protocols or both of them. The following
figure and table provide an overview of the packet structure and a lightweight comparison of TCP and
UDP, as well as factors to consider before choosing to use either protocol.
Figure 7. TCP and UDP basic packet structures
Table 1. Factors for using TCP vs. UDP
Capability                      TCP                   UDP
Connection type                 Connection-oriented   Connectionless
Reliability                     Full                  None
Protocol overhead               +++                   +
Resource usage                  ++                    +
Broadcast transmission support  no                    yes
Ordering of packets             active                none
Header size                     20 bytes              8 bytes
Error checking                  Yes, retransmit       Yes, no recovery possible
Acknowledgement                 Yes                   No
Special features                -                     Broadcasts, Multicast
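The header sizes in Table 1 follow directly from the fixed fields of each protocol. As a sketch, the layouts can be written as `struct` format strings; the TCP layout shown is the header without options, which is where the 20-byte figure comes from. The helper `udp_header` is a hypothetical name, and the checksum is left at zero for simplicity rather than computed.

```python
import struct

# UDP header: source port, destination port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")

# TCP header without options: source port, destination port, sequence number,
# acknowledgment number, data offset/reserved, flags, window, checksum,
# urgent pointer.
TCP_HEADER = struct.Struct("!HHIIBBHHH")


def udp_header(src_port: int, dst_port: int, payload: bytes, checksum: int = 0) -> bytes:
    """Pack a UDP header for `payload`; the length field covers header plus payload."""
    return UDP_HEADER.pack(src_port, dst_port, UDP_HEADER.size + len(payload), checksum)


if __name__ == "__main__":
    print("UDP header bytes:", UDP_HEADER.size)
    print("TCP header bytes (no options):", TCP_HEADER.size)
    print(udp_header(5000, 5683, b"hi").hex())
```

The extra TCP fields (sequence number, acknowledgment number, window) are what provide ordering, acknowledgement, and flow control, and they are also the per-packet cost that constrained devices pay for those guarantees.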
UDP is a good candidate to transmit data from constrained devices over constrained networks in close
proximity, such as LANs or PANs where congestion and packet loss can be low. The following factors
contribute to this:

UDP has very little overhead compared to TCP.

UDP is connectionless, so with no state to maintain, it uses less memory.

UDP transactions require only two datagrams, which reduces network pressure.

UDP has no retransmission delays.
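The connectionless nature of UDP described in the list above can be seen in a few lines of socket code: the sender transmits a datagram without any handshake, and the receiver keeps no per-sender state. This sketch runs both ends on the local loopback interface so that it is self-contained; the function name and payload format are hypothetical.

```python
import socket


def send_reading_over_udp(reading: bytes, timeout_s: float = 2.0) -> bytes:
    """Send one datagram to a local receiver and return what it received."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        # Port 0 asks the operating system to pick any free port.
        receiver.bind(("127.0.0.1", 0))
        # Without a timeout, a lost datagram would block the receiver forever.
        receiver.settimeout(timeout_s)
        # No connect() and no handshake: the datagram is simply sent.
        sender.sendto(reading, receiver.getsockname())
        data, _sender_address = receiver.recvfrom(1024)
        return data


if __name__ == "__main__":
    print(send_reading_over_udp(b"temp=21.5"))
```

On a real network the datagram may be lost, duplicated, or reordered, so an application that uses UDP has to decide for itself whether and how to detect that.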
On networks with a higher probability of packet loss, TCP, being more reliable and secure, is a viable
candidate. The following factors contribute to this:

TCP supplies reliability, which is especially important in long-haul communications where there is a
high chance of packet loss.

Because TCP is connection oriented, a device that uses TCP can better defend itself because it can
ignore communications unrelated to current connections, whereas a device that uses UDP must
accept every packet it receives on the listening port.
In scenarios that use streaming video or audio, where high throughput is more important than guaranteed
packet delivery, and in telemetry solutions that can tolerate missing segments, architects often choose UDP
because packet loss is often a better tradeoff than experiencing delays caused by TCP retransmission.38
There are also scenarios where the occasional lost packet on the telemetry channel would be acceptable,
but requirements exist for guaranteed delivery of commands, making the case for a composite
model that uses both UDP and TCP to address these needs.
Transport-layer protocol security
Another perspective on the choice between UDP and TCP is security. Because UDP is a connectionless
protocol, it lacks the header values that TCP uses for connection management, such as keeping track of
packet ordering (sequence numbers) and packet delivery (acknowledgment number), shown in Figure 7.
UDP is thus a more lightweight protocol, but its lack of header values also lowers the barrier that an
attacker has to overcome to send false information to the system. The ability of an attacker to “just” spoof
the sender address on the IP layer instead of also accounting for altering connection management
information demonstrates this vulnerability. In addition to spoofing, UDP is more susceptible to “flooding
attacks,” where the attacker floods the system with requests, because of missing flow control 39 and
subsequent throttling behavior. TCP is also vulnerable to flooding attacks, but TCP systems can be fairly
well secured by using SYN cookies.
In our work with customers, we have seen many who use devices with limited resources (for example, a
120 MHz microcontroller with 256 KB SRAM and 2 MB flash) and successfully use TCP as a transport
protocol, although the stack in embedded systems often needs modification.40 These customers needed
reliability for long-haul (direct) communication.
38 Wireshark, Packet loss
39 Wikipedia, Transmission Control Protocol, Flow control
40 “Embedded”, Reworking the TCP/IP stack for use on embedded IoT devices
Application-layer protocol choices
From our experience, we have seen three dominant protocols on the rise in this space:

Advanced Message Queuing Protocol (AMQP). AMQP is an open protocol for message-oriented
middleware that originated at JP Morgan Chase, where the same problems of connecting systems together
would crop up regularly. Each time, the same discussions about which products to use would happen,
and each time the architecture of some system would be curtailed to allow for the fact that the chosen
middleware was “reassuringly expensive.”41
The first implementation of AMQP was iMatix OpenAMQ,42 but others have emerged as well, notably
Apache Qpid,43 Microsoft Azure Service Bus,44 and RabbitMQ.45
AMQP is a binary wire protocol with client libraries for programming languages such as C#, C, Java, Perl,
Python, Ruby, PHP, and Lisp.
Where many traditional queuing mechanisms have failed, AMQP seems to be thriving and is
currently used in many systems, such as:46
− Aadhaar,47 a large-scale identity system with 1.2 billion identities and about 100 million
authentications per day.48
− The National Science Foundation’s Oceans Observatory Initiative, processing 3 petabytes per
year49
For more information, see the AMQP site to read the specifications on the protocol or try a
free implementation.

Constrained Application Protocol (CoAP).50 Targeted mostly at resource
constrained sensors and actuators (devices) such as valves and switches, this
protocol fits the bill for specific purpose networks, such as Wireless Sensor
Networks (WSNs),51 with applications such as forest fire detection,52 and
structural health monitoring.53 CoAP is by default bound to UDP and optionally
Datagram Transport Layer Security (DTLS), providing communications privacy.
With the default binding to UDP, CoAP supports multicast messaging, allowing
for the addressing of a group of destinations at once. CoAP over TCP transport
is currently in draft.
Figure 8. Multicast

MQ Telemetry Transport (MQTT).54, 55 From the documentation, “MQTT is a Client Server
publish/subscribe messaging transport protocol. The protocol runs over TCP/IP, or over other network
protocols that provide ordered, lossless, bi-directional connections.” MQTT is a publish-subscribe
protocol developed for machine-to-machine (M2M) communications, initially created by IBM, and
currently undergoing standardization at OASIS.
41 Association for Computing Machinery, Toward a Commodity Enterprise Middleware
42 See http://www.openamq.org
43 Apache, Apache Qpid™
44 Microsoft, AMQP 1.0 support in Service Bus
45 See http://www.rabbitmq.com/
46 Amqp.org, Products and success stories, Notable AMQP Users
47 Unique Identification Authority of India, Aadhaar technology
48 Slideshare, Big Data at Aadhaar (slide 9)
49 OOI, CIAD COI TV RabbitMQ
50 Wikipedia, Constrained Application Protocol
51 Wikipedia, Wireless sensor network
52 Wikipedia, Wireless sensor network, Forest fire detection
53 Wikipedia, Structural health monitoring, Examples
54 See MQTT.org
55 OASIS, MQTT 3.1.1 draft 01 / public review draft 01
In projects that we have done where there was a green field for implementation, AMQP has been
the best fit because in addition to it being efficient, reliable, flexible, and broker independent,
AMQP is native to Microsoft Azure Service Bus, the key technology component for all these
customer projects.
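The publish-subscribe style that the protocols above build on can be illustrated with a small in-memory sketch. This is not an implementation of MQTT or AMQP: there is no network, no wire format, no delivery guarantee, and topics are matched by exact string only. The `Broker` class and the topic name are hypothetical.

```python
class Broker:
    """Minimal in-memory publish-subscribe broker with exact-match topics."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, callback):
        """Register `callback(topic, payload)` for messages on `topic`."""
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        """Deliver `payload` to every subscriber of `topic`; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)


if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("elevator/42/door-speed", lambda t, p: print(t, p))
    broker.publish("elevator/42/door-speed", b"0.48")
```

The point of the pattern is that the publishing device does not need to know who consumes its data, which lets new consumers, such as an analytics service, be added without changing the device.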
Security
With devices communicating sensitive information and acting on our behalf, we clearly need to ensure
that the system and the information it captures, processes, and stores, is secure. With any system,
security is a tradeoff with other requirements, such as user friendliness, performance, cost, and so on. In
this section, we cover some important security aspects we have come across while working in this field.
“This is the weather forecast for the week of June 16, 2024 for Texas,” the weatherman says.
“Last week was hot, but this week will be sizzling, with temperatures reaching in excess of 110
degrees, with no rain expected.” In hot weather, irrigation is the key to crop and cattle survival.
Because most of the state’s farmers are using a new irrigation system that depends on thousands
of sensors to determine the best time to irrigate, few of them worry. What they don’t know is that
the system is sending faulty telemetry information that indicates that it rained every day last week.
This keeps the system from irrigating, and now, crops and cattle start to die.
When distributed systems directly influence the physical world by turning valves, controlling servos, and
much more, there is a clear need to ensure that compromised systems do not kill crops, cattle, and
people, burn buildings, or crash cars. The security bar for commands and data that make things move
must be much higher than in e-commerce or finance.
Let’s start with a short list of questions about security for the kinds of systems that we have come across
in our work on predictive maintenance—a list of factors to think about as you architect an IoT system. On
top of normal security precautions, you also need to know how to:

Securely onboard new devices. You must ensure that only devices that the system can register are
allowed into the system.

Prevent devices from being duplicated or substituted. Because devices provide data that the
system will directly or indirectly act upon, you must be able to trust data from devices. Peripherals
that can be duplicated or substituted might allow a rogue entity to flood the system with false but
trusted data. Also, in the past, a pirated copy of a device cost money only in terms of a lost sale. If
the pirated device is connected, it now also incurs actual costs for the connectivity and
cloud compute needed to support and interact with it.

Ensure that device data can be trusted. As devices communicate, you need to ensure that the data
that they transmit is received unaltered and from verified sources, and that the data logged in the service
by the device is trustworthy, representing a point-in-time observation. This requires integrity
and authenticity of data in information-security terms.56

Ensure the confidentiality of messages in transit and at rest. Because IoT systems span multiple
physical networks and transport information over public and unknown networks through dynamic
routes, information in transit must be secured against observation by non-authorized third parties.

Prevent devices from denying service. In modern software architecture, the level of
interdependencies is high and increasing. Dependencies within the system—such as devices
measuring data potentially critical to effective decision-making—need to be available and accessible.

Accept only authorized commands on devices. In any system that acts on external commands
and especially one that interacts with the physical world, it is imperative to ensure that those
commands are only acted on if they are properly authenticated and authorized.
56 Wikipedia, Information Security, Authenticity

Remove rogue devices from the system. If you find a bad actor such as a compromised device in
the system, you must be able to remove it quickly.

Authenticate peers. If a system supports peer-to-peer communication among devices, for example,
to enrich information or to make intelligent decisions at the edge without service intervention
(autonomous system operation), you must have an authentication mechanism in place to ensure that
peers in the system are talking to trusted neighbors.

Ensure that devices are always connected to a particular service. A powerful part of how modern
communication works is by using hyperlinks to let clients dynamically reroute traffic. Devices will
blindly follow these hyperlink redirects without thinking twice (or once, for that matter). Besides
offering flexibility, redirects pose a substantial risk if someone redirects the dataflow into an
intermediate system to alter system behavior, copy the data, or modify the data stream.

In combinatory devices, ensure fine-grained security is possible. When a customer's component is
embedded inside a larger system, such as smart brakes inside a train or components
inside machines, ensure each interested party has access to the right information and commands,
and that when a component is replaced, it is no longer authorized to act as part of the larger
device.
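The integrity and authenticity requirement in the list above can be sketched as follows. This is a minimal illustration, not a prescribed design: it assumes each device holds a secret key that it shares with the service, and the function names `sign_reading` and `verify_reading`, the message fields, and the key value are all hypothetical. The device attaches a keyed hash (HMAC-SHA256) to each reading, and the service rejects any message whose tag does not verify.

```python
import hashlib
import hmac
import json


def sign_reading(device_key: bytes, reading: dict) -> dict:
    """Wrap a reading with an HMAC-SHA256 tag (hypothetical message format)."""
    # Serialize deterministically so that device and service hash the same bytes.
    body = json.dumps(reading, sort_keys=True, separators=(",", ":"))
    tag = hmac.new(device_key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify_reading(device_key: bytes, message: dict) -> bool:
    """Return True only if the tag matches the body under the device's key."""
    expected = hmac.new(
        device_key, message["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, message["tag"])


if __name__ == "__main__":
    key = b"example-device-key"  # hypothetical; real keys come from provisioning
    message = sign_reading(
        key, {"device_id": "sensor-17", "rain_mm": 0.0, "ts": 1700000000}
    )
    print(verify_reading(key, message))

    # An attacker who alters the rainfall value cannot produce a valid tag.
    tampered = dict(message, body=message["body"].replace("0.0", "12.5"))
    print(verify_reading(key, tampered))
```

A tag of this kind does not by itself prevent an attacker from replaying an old, validly signed message; the timestamp field is included so that a service could additionally reject stale readings.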
Virtual Private Networks
A common way to connect networks over an
untrusted network is to use a virtual private
network (VPN)57. VPNs act as a virtual network
card on both ends of the connection, combining
two networks as if they were a single entity.
The issue with this approach is that a VPN merely provides secure virtual network cables; it is the two
networks, and therefore everything in them, that are connected. After the connection is established, the
VPN provides access to all layers above the link-layer from any device on either network.
Figure 9. VPN connecting two networks at the link-layer
A VPN does not help establish any notion of authentication and authorization beyond its immediate
scope. A network application that sits on the other end of a TCP socket, where a portion of the route is
facilitated by the VPN, is oblivious to the VPN's existence because it acts on the transport and application
layers of the network model. What matters for the trustworthiness of the information that travels from the
logic on the device to a remote control system that does not reside on the same network, as well as for
commands that travel back up to the device, is solely a fully protected end-to-end communication path
spanning networks, where the identity of the parties is established at the application layer. Protecting the
route at the transport layer by signature and encryption is done as a service for the application layer
either after the application has given its permission (for example, via certificate validation hooks) or just
before the application layer performs an authorization handshake, before entering into any conversations.
Establishing end-to-end trust is the job of application infrastructure and services, not of networks.
57 Wikipedia, Virtual private network
Compliance
For vertical sectors such as government and healthcare, compliance is a key consideration as you
architect an IoT solution. National and local governments and industry groups have mandates that affect
what a company can share and with whom. Conversely, some regulations require the sharing of data
among government entities or businesses that work on government programs. The EU has model clause
regulations that dictate the storage and exposure of personal data.58 The U.S. has similar regulations,
such as the Health Insurance Portability and Accountability Act (HIPAA)59 and the Privacy Act.60 Other
countries and entities also have privacy mandates that consider the location of stored data, its origin, the
location and nationality of the users, and the location, nationality, and use of the data consumers.
If ingested, processed, or published data offers no way to discern details about specific people, it is less
likely to be affected by regulation. But all data that is made available to the public or even to a controlled
set of partners must be reviewed for adherence to all applicable mandates because violations present high
legal61 and reputational risks.62
Healthcare
The HIPAA and HITECH laws in the U.S. apply to healthcare and partner organizations that have access
to sensitive patient information, called electronic protected health information (ePHI). Service providers
that work with these entities usually must agree in writing to adhere to security and privacy provisions set
forth in HIPAA and the HITECH act. If an IoT system that supports applications such as the one we
described in the Healthcare scenario captures ePHI, it must adhere to these laws. Microsoft provides a
Business Associate Agreement as a contract addendum to its cloud platform, Microsoft Azure.63 We also
provide information on some of the best practices for HIPAA-compliant applications, and we detail
Microsoft Azure provisions for handling security breaches.64
58 European Commission, Protection of Personal Data
59 US HHS, Health Information Privacy
60 US HHS, Privacy Act
61 TechRepublic, Data security laws and penalties: Pay IT now or pay out later
62 Experian, Reputation Impact of a Data Breach
63 U.S. Department of Health & Human Services, Health Information Privacy, Business Associates
64 Microsoft, Azure HIPAA Implementation Guidance
Device communication patterns
Many current IoT communication approaches try to answer the basic addressing question with traditional
network techniques. That means that the device either gets a public network address or it becomes part
of a virtual network and then listens for incoming traffic using that address, acting like a server. In this
section, we document various architectural approaches that we have seen, highlight their characteristics,
and then propose an alternative that is suitable for many IoT scenarios.
NAT-based device network
This architectural design approach uses network address translation (NAT)65 to expose internal devices in
the network that usually use a private IP address, to the outside world by reserving a port on the edge
device and mapping this port to the private IP address. The following diagram illustrates this approach.
Figure 10. NAT-based device network
The previous figure shows a device that uses an internal IPv4 IP address (192.168.1.112) that is listening
on port 8088. The device is exposed to the outside world on IP address 127.x.x.x, using port 721. The
DNS entry associated with this IP address is device.mynetwork.com. Clients accessing
device.mynetwork.com on port 721 will be routed directly to the internal device.
This approach has been used in many traditional networks, and depending on the scenario, it can still
work today. However, we have found this approach to be typically limited by the number of devices that it
can support (about 65,000 per public IP address, due to the number of available ports), by the need for
devices to be statically located (not moving), and by the fact that every exposed device needs to act like a
server (receiving, parsing, and answering arbitrary requests from clients), which increases its attack surface.
65 Wikipedia, Network Address Translation
IPv6 direct-addressing device network
With the rollout of IPv6, it is natural to think about giving every device in an IoT solution its own publicly
routable IP address to let it connect to peers, services in the system, or other systems. The following
diagram conceptually depicts this model, which we have seen many times.
Figure 11. IPv6 direct-addressing device network
We mentioned the drawbacks of this approach at the start of this section: the device gets a public
network address and then listens for incoming traffic on that address, acting like a server. Whether a
device network uses NAT or direct IPv6 addressing, the device needs to act like a server, and with the
implicit direct-connectivity model, it must be stationary to avoid connection loss, or it must employ
application-layer measures that can handle this scenario.
NAT-based, PAN device network
For personal area network (PAN) devices that are power-constrained, mostly wirelessly connected, and
often not IP-based, a common approach to bridging the last few feet of connectivity is to use a hub device
wired to the main network that can bridge to the devices on the local network. The following figure
illustrates this approach.
Figure 12. NAT-based, PAN devices network
Even though a hub translates between IP and the various PAN protocols, the problem space is the same
as with other NAT-based device networks that we described.
Generic concerns with direct addressing
All previous architectures that provide direct addressability for devices share common concerns. Because
each device is publicly addressable, it needs to handle inbound commands itself, taking care of all
application-layer responsibilities, such as hosting the server that accepts inbound connections, interpreting
commands, queuing requests, and so on. Many devices in large-scale deployments will have limited
resources, which constrains the number of socket connections that they can handle and leaves them open to
simple denial-of-service (DoS) attacks.66 In this approach, the devices would also have to handle the
authentication of users for command and control, using the already scarce sockets, memory, and compute
power to call out to a service or connect to a database and handle its responses and I/O.
Service-assisted communication
Another approach to connecting a large number of devices to the central service within a system is to
have the device connect to a well-known service (called a gateway) and then use that service to tunnel
commands to the device. The goal of this approach is to establish trustworthy and bi-directional
communication paths between control systems and special-purpose devices that are deployed in
untrusted physical space. To that end, the following principles are established:

Security trumps all other capabilities. If a capability cannot be implemented securely, it must not
be implemented. Threats are identified and either mitigated or accepted.
66 Wikipedia, Denial-of-service Attack

Devices do not accept unsolicited network information. All connections and routes are
established in an outbound-only fashion.

Devices are peered with a gateway to only connect or establish routes to well-known services.
If devices need to feed information to or receive commands from a multitude of services, they are
peered with a gateway that takes care of routing information downstream. This ensures that
commands are only accepted from authorized parties before routing them to the devices.

The communication path between device and service or device and gateway is secured at the
application protocol layer. This mutually authenticates the device to the service or gateway and
vice versa. Because the application does not normally concern itself with lower-level layers in the
network stack as we discussed earlier in Connectivity, device applications do not trust the link-layer
below.

System-level authorization and authentication must be based on per-device identities. One
device, one identity ensures that you have granular control over which devices can access the
system, provide data, and receive commands.

Access credentials and permissions must be revocable. In case of device abuse, the system
must be able to quickly respond by removing the device as an authorized part of the system.

Bi-directional communication for devices may be facilitated by an intermediate store. Devices
that are connected sporadically due to power or connectivity concerns can be supported by
holding commands and notifications for them in a queue or mailbox structure until they
connect to retrieve them.

Application payload data may be separately secured. This is to protect transit through gateways
to any particular service.
Figure 13. Service-assisted communication pattern
From the previous illustration, we can derive the following set of attributes:

Device. The device acts like a client; it connects to the gateway and does not listen for unsolicited
traffic. The device connects to an external gateway by creating and maintaining an outbound TCP
socket across a NAT boundary or by establishing a bi-directional UDP route, potentially using
mechanisms such as Session Traversal Utilities for NAT (STUN) or, with larger NATs, Traversal Using
Relays around NAT (TURN). These facilitate the detection of a NAT and the discovery of the
public IP address of the network for binding.

Connection. The connection is routed through the edge device, usually a router. Because the
connection is outbound, the port mapping is performed automatically. By only relying on outbound
connectivity, the NAT/Firewall device at the edge of the local network will never have to be opened up
for any unsolicited inbound traffic.
The outbound connection or route is maintained by either client or gateway in a fashion that
intermediaries such as NATs will not drop due to inactivity. That means that either side might send
some form of a keep-alive packet periodically, or send a payload packet periodically that then doubles
as a keep-alive packet. Under most circumstances it will be preferable for the device to send
keep-alive traffic because it is the originator of the connection or route, and it can and should react to a
failure by establishing a new one.
As TCP connections are endpoint concepts, a connection is declared dead only when the route is
considered collapsed, and detecting this fact requires packet flow. A device and its gateway
may therefore sit idle for quite a while, believing that the route and connection are still intact, before the
lack of acknowledgement of the next packet shows that this assumption is incorrect. This conflict in
behavior calls for a tradeoff decision to be made.
Carrier-grade NATs (CGNs) employed by mobile network operators permit very long periods of
connection inactivity and mobile devices that get direct IPv6 address allocations are not forced
through a NAT at all. The push notification mechanisms employed by all popular smartphone
platforms use this to dramatically reduce the power consumption of the devices by maintaining the
route very infrequently—every 20 minutes or more—so the devices can remain in sleep mode with
most systems turned off while idly waiting for payload traffic. The downside of infrequent keep-alive
traffic is that the time it takes to detect a bad route is, at worst, as long as the keep-alive interval.
Ultimately, it is a tradeoff between battery-power and traffic-volume cost (on metered subscriptions)
and acceptable latency for commands and notifications in case of failures. The device can actively
detect potential issues and abandon the connection and create a new one when, for instance, it hops
to a different network or when it recovers from signal loss.
The connection from the device to the gateway is protected end-to-end and ignores any underlying
link-level protection measures. The gateway authenticates with the device and the device
authenticates with the gateway, so neither is anonymous to the other. In the simplest case, this can
be done by exchanging a previously shared key. As we see quite often in more capable devices, it
can also be done via an X.509 certificate exchange as performed by Transport Layer Security (TLS),
or a combination of a TLS handshake with server authentication where the device later supplies
credentials or an authorization token at the application level. The privacy and integrity protection of
the route is also established end-to-end, ideally as a byproduct of the authentication handshake so
that a potential attacker cannot waste cryptographic resources on either side without producing proof
of authorization.
Today, TLS/DTLS and Secure Shell (SSH) dominate as application-level connection security
protocols. SSH is popular, but it lacks a standard session-resumption gesture. TLS supports both the
X.509 certificate-exchange model and a simplified model (TLS-PSK) that uses previously shared
keys. Removing support for X.509 certificate handling and wire-level exchange reduces the footprint
of the TLS library, and by reducing the supported algorithms (for example, supporting only AES-256
and SHA-256), it’s feasible to use this protocol on compute- and memory-constrained devices while
remaining compatible with other application layer protocols that rely on TLS. The result of all this is a
secure peer connection between the device and a gateway that only the gateway can feed.
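The keep-alive tradeoff discussed under Connection can be made concrete with a small calculation. The sketch below is illustrative only: the function name `keepalive_tradeoff` and the 64-byte packet size are assumptions, not measurements from any real network. It estimates how many keep-alive packets and bytes a device sends over a period for a given interval, alongside the worst-case time to notice a dead route, which equals the interval.

```python
def keepalive_tradeoff(interval_s: float, packet_bytes: int, hours: float = 24.0) -> dict:
    """Estimate keep-alive traffic and worst-case failure detection delay.

    interval_s   -- seconds between keep-alive packets
    packet_bytes -- assumed on-the-wire size of one keep-alive (hypothetical)
    hours        -- length of the period to estimate
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    packets = int(hours * 3600 // interval_s)
    return {
        "packets": packets,
        "bytes": packets * packet_bytes,
        # A bad route is noticed, at worst, one full interval after it fails.
        "worst_case_detection_s": interval_s,
    }


if __name__ == "__main__":
    # Compare an aggressive 30-second interval with a 20-minute interval.
    for interval in (30, 20 * 60):
        print(interval, keepalive_tradeoff(interval, packet_bytes=64))
```

Under these assumptions, moving from a 30-second to a 20-minute interval cuts daily keep-alive traffic by a factor of 40, at the price of a worst-case detection delay that is 40 times longer.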

Edge security. Because there are no ports open to listen on the edge device, the attack surface on
the local network and its devices is minimized.

Gateway. The connection is accepted by a hosted process called a gateway, a system hosted in an
environment that is defendable against external threats, either at the edge of the internal network or
based in the cloud. It provides a well-defined endpoint and API for clients to connect to and
communicate with, effectively acting as a proxy for the device. Any peer-to-peer connections
inside the network are acceptable, but only if the gateway permits them and facilitates a secure
handshake between the peers.
In case any authorized client wishes to send a command (or a reply to a previous request) to a
device, it can do so by sending the command to the gateway, which provides one or even several
different APIs and protocol surfaces that can be translated to the primary bi-directional protocol used
by the device. Because the gateway is a layer of abstraction, it provides the device with a stable
address, location transparency, and location hiding.
As this gateway forms an abstraction toward the device, the device could be limited to speak AMQP,
MQTT or some proprietary protocol, and yet have a full HTTP/REST interface projection at the
gateway, with the gateway taking care of the required translation and also the enrichment where
responses from the device can be augmented with reference data. The device can connect from any
context and it can even switch contexts, yet its projection into the gateway and its address remains
completely stable. The gateway can also be federated with external identity and authorization
services, so that only callers acting on behalf of particular users or systems can invoke particular
device functions. The gateway therefore provides basic network defense, API virtualization, and
authorization services all combined into one. This approach gets even better when it includes or is
based on an intermediary messaging infrastructure that provides a scalable queuing model for both
ingress (device to cloud) and egress (cloud to device) traffic.
Without this intermediary infrastructure, this approach would still suffer from the issue that
devices must be online and available to receive commands and notifications when the control
system sends them. With a per-device queue or per-device subscription on a
publish/subscribe infrastructure, the control system can drop a command at any time, and the
device can pick it up whenever it is online. If the queue provides time-to-live expiration
alongside a dead-lettering mechanism for such expired messages, the control system can
also know immediately when a message has not been picked up and processed by the
device in the allotted time.
The queue also ensures that the device can never be overtaxed with commands or
notifications. The device maintains one connection into the gateway and it fetches commands
and notifications on its own schedule. Any backlog forms in the gateway and can be handled
there accordingly. The gateway can start rejecting commands on the device’s behalf if the
backlog grows beyond a threshold or the cited expiration mechanism kicks in and the control
system gets notified that the command cannot be processed at this time.
On the ingress-side (from the gateway perspective) using a queue has the same kind of
advantages for the back-end systems. If devices are connected at scale and input from the
devices comes in bursts or has significant spikes around certain hours of the day, such as
with telematics systems in passenger cars during rush-hour, having the gateway deal with the
traffic spikes keeps the back-end system robust. The ingestion queue also allows telemetry
and other data to be held temporarily when the back-end systems or their dependencies are
taken down for service or suffer from service degradation of any kind.
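The per-device queue behavior described above (time-to-live expiration, dead-lettering, and rejecting commands when the backlog grows too large) can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the API of any messaging service: the class names `Command` and `DeviceMailbox`, the `max_backlog` threshold, and the explicit `now` parameter (used instead of a real clock so that the behavior is easy to follow) are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Command:
    name: str
    enqueued_at: float  # seconds, supplied by the caller
    ttl_s: float        # time-to-live in seconds


@dataclass
class DeviceMailbox:
    """Holds commands for one device until it connects to fetch them."""
    max_backlog: int = 100
    pending: deque = field(default_factory=deque)
    dead_letters: list = field(default_factory=list)

    def send(self, name: str, now: float, ttl_s: float) -> bool:
        """Queue a command; return False if the backlog threshold is reached."""
        self._expire(now)
        if len(self.pending) >= self.max_backlog:
            return False  # the gateway rejects on the device's behalf
        self.pending.append(Command(name, now, ttl_s))
        return True

    def fetch(self, now: float):
        """Called when the device is online; returns the oldest live command."""
        self._expire(now)
        return self.pending.popleft() if self.pending else None

    def _expire(self, now: float) -> None:
        # Move expired commands to the dead-letter list so that the control
        # system can learn that they were never picked up.
        live = deque()
        for command in self.pending:
            if now - command.enqueued_at > command.ttl_s:
                self.dead_letters.append(command)
            else:
                live.append(command)
        self.pending = live


if __name__ == "__main__":
    box = DeviceMailbox(max_backlog=2)
    box.send("open_valve", now=0.0, ttl_s=10.0)
    box.send("close_valve", now=1.0, ttl_s=100.0)
    print(box.send("reboot", now=2.0, ttl_s=10.0))  # backlog is full
    print(box.fetch(now=20.0))                      # device comes online late
    print(box.dead_letters)
```

In this model, the device never receives more than one command per fetch, so any backlog forms in the mailbox rather than on the device.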
Designing for scale
The opportunity for IoT is in the ubiquity of connected devices, the volume of data that those devices will
supply, the intelligence to be gained from that data, and the command/control that we can exert on the
devices. All of these aspects mean that the solution must be designed to scale at all levels.
In many respects, designing an IoT solution to scale effectively raises the same concerns as any
large-scale solution. While IoT does not require a cloud-based deployment, in most cases, taking advantage of
the cloud makes sense because of usage-based pricing, a simple model that scales, geographic
availability, and infrastructure support provided by the cloud vendor. Many documents and articles have
been written about cloud application scalability and availability. For a good overview of this topic, see
“Failsafe: Guidance for Resilient Cloud Architectures.”67 The Microsoft patterns & practices team also has
a large body of work on Cloud Development that provides guidance on building scalable cloud systems.68
There are specific scalability areas that come up more frequently in IoT scenarios that may not appear in
other IT solutions, however. One area is identity. For web properties, the concept of identity federation
has taken hold, and most modern consumer web properties now allow a user to use their identity from
other well-known identity stores, such as an account registered with Microsoft, Facebook, Google, Yahoo,
and so on. Additionally, corporate accounts can be federated with platform as a service (PaaS) vendors
and partners. But with the addition of devices, there will often be identities associated with those devices,
relationships between those devices and human identities, and relationships between multiple humans
and devices. This potentially complex set of relationships should be considered early in an IoT project,
and the solution should strive to simplify these relationships as much as possible.
In our project experience we have not yet seen a pattern that satisfies this level of complexity with
satisfactory results. The initial projects have used Azure Active Directory for human identities, and
external data stores for device identity and the associations with Azure Active Directory users. Design,
prototyping, and testing are ongoing, with the goal of finding more scalable, resilient, and
feature-complete solutions.
Communication and ingestion
Another area where scalability is tested is the communication paths for ingestion. Most solutions will
require secure, authenticated communication between devices and the collection point. Additionally, any
implementation choice for messaging technology will have scalability limits, such as messages per
messaging unit (for example, queue or topic) and bandwidth per implementation unit
(subscription, instance, and so on). These parameters must be well understood, planned for from the
beginning of the project, and tested and verified as the architecture and solution progress.
Our projects have used the Microsoft Azure Service Bus69, with Azure Active Directory Access Control
(ACS)70 keys granted for each device. In a generic solution, some type of secure key must be generated
that will make a device unique, and one that only that device knows about. The system it connects to
must know about the device and its key, and then verify that they match when messages arrive. The
Service Bus and ACS provide these capabilities, making them a good fit. The solutions use Topics and
Subscriptions71, and they are designed to take into account the scalability parameters of the Service
Bus72, and use as many topics as needed to comfortably support the number of devices in the system
and scale to additional topics if and when additional devices are added to the system.
67 MSDN, Failsafe: Guidance for Resilient Cloud Architectures
68 MSDN, Cloud Development
69 MSDN, Service Bus
70 MSDN, Access Control Services 2.0
71 MSDN, Service Bus Queues, Topics, and Subscriptions
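The idea of using as many topics as needed for the device population can be sketched as a simple planning calculation. The following is a hedged illustration: the per-topic device limit is a made-up planning number rather than a documented Service Bus limit, and the function names `topics_needed` and `topic_for_device` and the `telemetry-NNN` naming scheme are hypothetical.

```python
import hashlib
import math


def topics_needed(device_count: int, devices_per_topic: int) -> int:
    """Number of topics required so that no topic exceeds the planned load."""
    if devices_per_topic <= 0:
        raise ValueError("devices_per_topic must be positive")
    return max(1, math.ceil(device_count / devices_per_topic))


def topic_for_device(device_id: str, topic_count: int) -> str:
    """Map a device to a topic with a stable hash (hypothetical naming scheme)."""
    # hashlib is used because Python's built-in hash() is salted per process.
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % topic_count
    return f"telemetry-{index:03d}"


if __name__ == "__main__":
    count = topics_needed(device_count=25_000, devices_per_topic=10_000)
    print(count)
    print(topic_for_device("vehicle-0001", count))
```

One design caveat: with plain modulo hashing, adding topics later changes the topic that most existing devices map to. A system that must scale out without reassigning devices would likely need a stored device-to-topic assignment or a consistent-hashing scheme instead.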
Data storage scalability
Scalable data storage is another area that will be important in these projects. Because of the expected
volume of data, blob storage will normally be the preferred choice. The reason for this is that blob storage
is the lowest cost storage option, and Big Data analysis tools are built to work against blob storage.
Depending on the volume and the geographic dispersion of the devices, the solution may need to use
multiple storage accounts, and it may also need to move data from collection data centers into a single
data center in order to perform analysis on that data. For additional guidance on managing the data, see
“Data Management Patterns and Guidance” 73 on the Microsoft
Developer Network.
Device registration
Registering devices is the critical first step to take to ensure that
the system is secure and remains secure, only allows data to be
ingested from trusted endpoints, and devices only accept
commands from trusted systems. A device must be uniquely
identified, the system must authenticate its identity, and the
device must know that it is communicating securely with only the
correct collection endpoint.
Often a device will be created with the knowledge of the expected endpoints, or at least have some
influence over the collection point. An example of this is a vehicle whose manufacturer is selling a
connected vehicle experience. In this scenario, when the device is manufactured, a unique key will be
stored on the device. Either that key or a public key associated with it will be stored in a database, and
when the device is enabled, the service can check the database and verify that the device is an approved
device. These keys may be service-generated, such as by Azure Active Directory Access Control
Services (ACS), or keys created to support the TLS-PSK pattern as described earlier in this paper, or
keys intended for service-specific authentication. Typically, even when the device carries a key out of the
factory, the device will become “active” in another step; for example, when a customer purchases, installs,
and configures the device. Configuration will associate a user with the device, which transforms it to an
active device. The device may be issued a new key at this time.
In other cases, the set of potentially connected devices will not be known at manufacturing time, so keys
cannot be installed on the device prior to its release. In this case, device registration must happen when
the device is installed or activated. An example of this might be a traffic service that will collect GPS and
movement telemetry from a smartphone, and in turn provide free traffic information for users who opt in to
share data. In this case, there would be a registration step where a user must identify the device to the
service, the service then sends a key to the device, and then that key is used to manage communication.
Equally important to device registration is the ability to unregister the device, or disable it. This is critical
because even though the communication with the device is secure, the device itself can become
compromised. Being able to unregister the device and refuse communication is a critical aspect of the
system. With device specific keys, the keys can be revoked and the system can quickly stop accepting
telemetry from the device.
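The registration and unregistration steps above can be sketched as a small in-memory registry. This is a minimal illustration, not the mechanism used in the projects described here: the class name `DeviceRegistry` and its methods are hypothetical, a dictionary stands in for the database, and storing only a hash of the key is a design choice of this sketch rather than something the text prescribes.

```python
import hashlib
import hmac
import secrets


class DeviceRegistry:
    """Tracks per-device keys and supports revocation (illustrative only)."""

    def __init__(self) -> None:
        self._key_hashes = {}   # device_id -> SHA-256 hash of the issued key
        self._revoked = set()   # device_ids that are no longer trusted

    def register(self, device_id: str) -> str:
        """Issue a new random key for the device and remember its hash."""
        key = secrets.token_hex(32)
        self._key_hashes[device_id] = hashlib.sha256(key.encode("utf-8")).hexdigest()
        self._revoked.discard(device_id)
        return key  # installed on the device at manufacturing or activation

    def is_authorized(self, device_id: str, presented_key: str) -> bool:
        """Accept a message only from a known, non-revoked device with its key."""
        stored = self._key_hashes.get(device_id)
        if stored is None or device_id in self._revoked:
            return False
        presented = hashlib.sha256(presented_key.encode("utf-8")).hexdigest()
        return hmac.compare_digest(stored, presented)

    def revoke(self, device_id: str) -> None:
        """Stop accepting telemetry from a compromised device."""
        self._revoked.add(device_id)


if __name__ == "__main__":
    registry = DeviceRegistry()
    key = registry.register("vehicle-0001")
    print(registry.is_authorized("vehicle-0001", key))
    registry.revoke("vehicle-0001")
    print(registry.is_authorized("vehicle-0001", key))
```

Because each device has its own key, revoking one device leaves every other device's credentials untouched.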
72 Microsoft, Service Bus Scalability
73 MSDN, Data Management Patterns and Guidance
Acquiring data
IoT data acquisition is frequently referred to as data ingestion. In
literature about Big Data, the three Vs (volume, variety, and
velocity) are often cited.74 There are other aspects to consider as
well. In our initial engagements in IoT, we have seen that device
bandwidth, connection speed, reliability, and cost have been major
influencers in the solution choices made. But each item in this
section is important, and the relative importance of each will vary
depending on a project’s requirements. The following sections
discuss many aspects of data ingestion.
Message size and format
Messages from devices are the lifeblood of IoT. In a world with no boundaries, we might collect all
telemetry data and analyze it extensively, or simply save it in case we need it later. In the real world, we
need to consider the size of the message, which will be affected by its number of attributes, the data
types, the message formats, the message overhead, and the security overhead.
Many common message formats are in use today. Extensible Markup Language (XML) and JavaScript
Object Notation (JSON) are common. Binary JSON (BSON), Protocol Buffers, and Avro are more
compact formats that are often used when message size and bandwidth are constrained. XML is
supported by all development tools, and easy to understand, but its tags can often cause message-size
bloat. JSON is quickly becoming as ubiquitous as XML, and it is more compact than XML, but JSON
retains the readability of XML.
In IoT there is often a premium on memory, bandwidth, and connection cost, so compact message
formats can be useful. BSON is a binary encoded version of JSON. It allows you to encode binary data in
the message, and it enables storing data as raw bytes versus text. Protocol Buffers define a method of
serializing structured data. They were developed at Google, and then given to the open source
community. Protocol Buffers are compact, but not self-describing like XML and JSON, so sender and
receiver must understand the message being transmitted. Avro is another option for compact formats. It
differs from BSON and Protocol Buffers in that it is not self-describing, but it is always accompanied by a
schema, so no code generation or prior knowledge of the schema is required for processing on the
message-receiving end. Ultimately, choosing one of these formats comes down to how to balance
development environment support, device support, the need for compactness, and storage and
processing requirements on the message-receiving side.
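The size difference between text and compact binary formats can be illustrated with the Python standard library alone. The sketch below is a stand-in, not an implementation of BSON, Protocol Buffers, or Avro: it uses a fixed `struct` layout to play the role of a compact format whose schema is agreed on out of band, and the reading, its field names, and the helper functions are hypothetical.

```python
import json
import struct

# A hypothetical telemetry reading.
reading = {
    "device_id": 1042,
    "timestamp": 1700000000,
    "temperature_c": 21.5,
    "vibration_g": 0.02,
}


def to_xml(r: dict) -> bytes:
    """Naive XML encoding: every value is wrapped in an opening and closing tag."""
    fields = "".join(f"<{name}>{value}</{name}>" for name, value in r.items())
    return f"<reading>{fields}</reading>".encode("utf-8")


def to_json(r: dict) -> bytes:
    """Compact JSON encoding without optional whitespace."""
    return json.dumps(r, separators=(",", ":")).encode("utf-8")


# Fixed little-endian layout: two unsigned 32-bit integers and two 32-bit floats.
# Sender and receiver must both know this layout; it is not self-describing.
BINARY_LAYOUT = struct.Struct("<IIff")


def to_binary(r: dict) -> bytes:
    return BINARY_LAYOUT.pack(
        r["device_id"], r["timestamp"], r["temperature_c"], r["vibration_g"]
    )


def from_binary(payload: bytes) -> dict:
    device_id, timestamp, temperature_c, vibration_g = BINARY_LAYOUT.unpack(payload)
    return {
        "device_id": device_id,
        "timestamp": timestamp,
        "temperature_c": temperature_c,
        "vibration_g": vibration_g,
    }


if __name__ == "__main__":
    for name, encoder in (("xml", to_xml), ("json", to_json), ("binary", to_binary)):
        print(name, len(encoder(reading)), "bytes")
```

The binary form carries no field names, which is where most of the savings come from, and also why the receiver cannot interpret it without prior knowledge of the layout. Note that 32-bit floats round values such as 0.02 slightly, which is one more tradeoff of compactness.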
Message types
Your system may require different message types that can differ in schema, data type, or both of these. A
real-world example of this is a connected vehicle system that predominantly sends telemetry information
for predictive maintenance. This system might also be used to send audio or video clips for emergency
management, accident recording, and so on. In these cases, the media files are often enhanced with
metadata related to the collection of the media file. Additionally, the media messages may be of lower or
higher priority and they may require splitting, compression, resumption on error, and temporary local
storage. If different device types are involved, they may provide media files in formats or encoding levels
that are optimized or specific to those devices, which could require normalization at the storage point.
74 See The 3 Vs of BIG data
Message priority
Different message types will often have different priorities in an IoT system. A message can be a
standard telemetry message that is intended specifically for cataloging, and used for machine learning
algorithms downstream. There can be other message types that are considered events and alarms. An
event could be an elevator door opening, a car starting, or the temperature being increased in a home,
whereas an alarm might be a broken window, a car crash, or a full engine failure.
Message priority will be handled either by providing a separate endpoint for priority messages, or by
detecting attributes in the message itself to assign priority. Using a separate endpoint for priority
messages can reduce the chance of a high priority message delivery being slowed by a flood of the
standard flow messages. If the throughput of the initial point of ingestion is considered adequate, then
downstream detection is an option, for instance creating a standard subscription and a high priority
subscription on an Azure Service Bus Topic.
There are also cases where device priority may be required. In a connected vehicle scenario, there may
be a premium service that has priority, or there may be sensors in a building with relative priority, such as
one that detects a broken window on the first floor that has higher priority than one on the fifth floor of the
building. In this case, the priority may be handled similarly to message priority. Another approach is to
use a separate service that handles the higher priority devices.
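The attribute-detection approach can be sketched as follows. This is a simplified in-memory illustration, not Service Bus code: the message fields (`type`, `deviceId`), the set of alarm types, and the list of premium devices are all hypothetical.

```python
from collections import deque

# Assumptions for illustration: which message types and devices get priority.
HIGH_PRIORITY_TYPES = {"alarm"}
PREMIUM_DEVICES = {"vehicle-007"}


def is_high_priority(message):
    """Detect priority from attributes carried in the message itself."""
    return (message.get("type") in HIGH_PRIORITY_TYPES
            or message.get("deviceId") in PREMIUM_DEVICES)


class PriorityRouter:
    """Keeps high priority and standard messages in separate queues."""

    def __init__(self):
        self.high = deque()
        self.standard = deque()

    def route(self, message):
        target = self.high if is_high_priority(message) else self.standard
        target.append(message)

    def next_message(self):
        """Drain the high priority queue before the standard queue."""
        if self.high:
            return self.high.popleft()
        if self.standard:
            return self.standard.popleft()
        return None
```

Because the two queues are independent, a flood of standard telemetry delays only other standard messages. Message priority and device priority are handled by the same check here, which mirrors the observation that the two can be treated similarly.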
Conditional messaging
In some of our projects, the solution required the message pattern to change based on conditions. In one
case, if a service technician received an alert that an elevator needed attention, the technician could send
a message to the device asking for it to increase the detail and frequency of messaging. This would
continue for a configurable timeframe.
This type of requirement means that the solution must be scaled to handle the conditional events. For
instance, if the devices could automatically increase the size and frequency of messages, they could
cause a dramatic increase in traffic to the system. Safeguards and throttling should be considered to
protect against unplanned data floods in such situations.
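One common throttling safeguard is a token bucket, sketched below under stated assumptions. The rate and burst values are hypothetical, and the paper does not prescribe this particular algorithm. The clock is injectable so that the behavior can be checked deterministically.

```python
import time


class TokenBucket:
    """Allows short bursts but caps the sustained message rate."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second   # tokens added per second
        self.capacity = burst         # maximum number of stored tokens
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if one more message may be accepted right now."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A receiver could keep one bucket per device, so that a single device switching to detailed, high-frequency messaging cannot exceed its budget, and reject or defer messages for which `allow()` returns False.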
Contextual messaging
Similar to conditional messaging, there are use cases that require contextual messaging, which can
follow multiple patterns. There may be situations where the device includes contextual information in the
messages that it sends. The data may include GPS coordinates, and a vehicle may need to send
additional telemetry when it travels above a certain altitude, or if the ambient temperature rises above a
trigger level. The context may require more data in messages, the collection of data from other sensors
on the device, or it may require more or less frequent message transmission.
Message batching
The natural inclination may be to send messages immediately when data is generated, but there are
several reasons why messages may be batched. A device may be power constrained, so the connectivity
may only be turned on for a limited amount of time. The connection may be unreliable, so it could make
sense to batch the collected messages for a single transmission once connectivity is available. The
device may move in and out of connectivity, or connectivity may be congested or less expensive at
certain times of the day. If you allow batched messages, the message receiver must be designed to
accept them as well as single messages. In this case, a message envelope that can contain multiple
messages or a single message can simplify the solution.
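A minimal sketch of such an envelope follows. The JSON structure and the field names `deviceId` and `messages` are assumptions for illustration. A single message is simply a batch of one, so the receiver needs only one code path.

```python
import json


def make_envelope(device_id, messages):
    """Wrap one or more messages from a device in a single envelope."""
    return json.dumps({"deviceId": device_id, "messages": list(messages)})


def read_envelope(payload):
    """Yield (device_id, message) pairs, whether the batch holds one or many."""
    envelope = json.loads(payload)
    for message in envelope["messages"]:
        yield envelope["deviceId"], message


# A device that was offline sends three cached readings in one transmission.
batch = make_envelope("elevator-17", [{"t": 1, "temp": 20.5},
                                      {"t": 2, "temp": 20.7},
                                      {"t": 3, "temp": 21.0}])
for device_id, message in read_envelope(batch):
    print(device_id, message)
```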
Bandwidth and scale
Previous topics in this paper discuss bandwidth from the device. The bandwidth and scale of the
collection points must also be considered. The size of the network pipe out of the device environment
may be constrained. For example, if the solution is collecting building telemetry from devices
that are connected to an internal network and send data to an external collection point, the effect on the
capacity of the building network should be evaluated. The collection points will also have an upper bound.
For example, Microsoft Azure Storage and Service Bus have capacity targets. If your solution needs to
extend beyond the targets of the enabling technology, then a scale-out approach should be designed for
the project. This approach should include plenty of excess capacity for growth and unplanned spikes. In
our projects, we typically plan for no more than 50 percent capacity at steady state.
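The 50 percent guideline translates into a simple sizing calculation, sketched here. The function and parameter names are hypothetical, and the load and capacity figures in the example are placeholders rather than published service targets.

```python
import math


def units_required(steady_state_load, unit_capacity, target_utilization=0.5):
    """Number of collection units needed to stay at or below the target
    utilization at steady state, leaving the rest as headroom for spikes."""
    usable_capacity = unit_capacity * target_utilization
    return math.ceil(steady_state_load / usable_capacity)


# Example: 350 messages per second against units rated at 100 per second.
print(units_required(steady_state_load=350, unit_capacity=100))
```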
If the connected devices are geographically distributed, consider scaling out the solution to multiple data
centers. This can introduce the complexity of directing device traffic to the right collection points. In our
projects, we have found success in assigning devices to data centers so that no device's traffic
needs to “find” where its data should go. If the device moves geographically, then it may need to be
reassigned. It is important to understand how the data will be used, and if it needs to be aggregated
before use or if the data can be used autonomously in the data center where it was collected.
Storing information
In an IoT solution, there are also several aspects to consider for
data storage. The following sections discuss many aspects of
this topic.
Storing data on the device
The critical telemetry data is generated on the device, or prior to
getting to the device in the case of a gateway. The data may be
cached and preprocessed on the device. The reasons for doing
this include the desire to optimize the amount of data sent, to
minimize “noise” data from analysis, to save on storage costs at the central storage location, minimize
transmission time or cost, account for unreliable connectivity, and so on. If data will be stored on the
device either temporarily or permanently, there are several local storage considerations, such as those on
security, reliability, and capacity. If data is stored on the device, the solution architect needs to consider
the implications of losing the data, if the data will expire on the device if it cannot be sent to external
storage, and how the system will detect and recover from missing data, should a local outage occur.
Transforming data
Generally the data will go through multiple transformation steps that span its generation, transmission,
storage, and processing. As stated in the previous section, there may be data transformation
happening on the device itself, such as converting its format, aggregation, and so on. This will rely on
local processing capabilities. Other than the local preprocessing, any other transformation would happen
at the collection point.
For years, data processing has been thought of in terms of Extract, Transform, and Load (ETL). With the
advent of Big Data, much of the discussion has changed to Extract, Load, and then Transform (ELT). The
key concept in this transition is that your system is ingesting a huge amount of data, and the
transformation process costs significant compute power. Additionally, while this transformation is
happening, the data is at risk. If it has not yet been serialized, and the server crashes, then the data is
lost. With ELT, the system ingests the data and immediately stores it. This minimizes the exposure of the
data during ingestion, and provides new opportunities for data transformation and analysis. First, the data
can be transformed asynchronously from ingestion. This helps reduce compute demand. Then the data
can be transformed multiple times, for multiple purposes, and this process also supports the idea of
collecting all data for extended periods of time. This is often referred to as a “data lake”75, and this
strategy suggests keeping “all” data for later analysis. The rationale for this is that machine learning
algorithms may find interesting patterns or trends that would not be expected, and that these would
warrant studying other seemingly unneeded data.
Location
Most IoT solutions will send data to a public or private cloud. If connected devices are geographically
distributed, there may be a case for storing the data across several locations around the globe, in order to
store the data closest to where it was generated. There may also be government mandates that require
an individual’s data to remain in that person's home country, or the data may only be interesting within the
region within which it was collected. However, in a large percentage of projects, the value is in the large
body of data, so data must be brought together into a single location for the most insightful analysis. In
this case, the considerations will center on the time constraints of the analysis (how often are the
algorithms run?), the physical limitations of the data centers, bandwidth, and the cost of moving data.
Longevity, format, and cost
After the data reaches its long-term storage point there are decisions to be made about how to govern
it. A data retention policy must be defined. The arguments for long data retention periods are that cloud
storage is inexpensive and getting cheaper all the time, and that data scientists want data saved in case a
new insight is discovered that warrants looking at data that was previously uninteresting. Even with those
benefits, the costs for large volume data storage can add up, and the data could become unmanageable
if you do not have a basic plan for how to store, access, and retrieve it. The terms Data Temperature and
Hot and Cold storage76 also come up in this context. The concept centers around how frequently
accessed the data is, and how quickly the users or systems expect to be able to use the data. Hot data is
frequently accessed and users expect good response time. Cold data is data that is less frequently
accessed, and slower response times are acceptable. Classifying data in this manner allows the
architects to choose faster and potentially more expensive storage for hot storage and select lower cost
options for cold storage.
The format for long-term data storage also needs to be carefully considered. Should it be optimized for
Hive queries, or should it be as compact as possible? Or should there be a “fresh” data store with more
recent data that is easy to access, process and query, and an archive that is compressed and stored in a
way that minimizes cost, but that requires overhead if and when it needs to be accessed? All of these
considerations factor into decisions on how best to store the data.
75 Forbes, The Data Lake Dream
76 Teradata, Hot and Cold Running Data
Processing information
After the data is ingested, it must be processed.
Processing types range from very simple to long-running
and complex. The following sections discuss common
IoT data-processing types.
Alarm processing
A common use case is to watch for specific data items on ingestion and then take action based on that
data. These could be alarms from devices, or any kind of simple event processing. The characteristic of
this type of processing is that there is a specific set of values that are to be monitored on specific
attributes of the incoming data that can trigger predetermined responses. While this type of event
processing is logically straightforward, the implementation still requires consideration due to the expected
high volume of data being ingested, and the likelihood that the events that must be responded to are of
relative importance.
In alarm processing, the solution must also account for the potential of alarm floods. If a systemic failure
happens, for instance if a home alarm system sends an alarm to the event processing system when the
power goes out, there may be a flood of alarms, or if there is no battery backup, messages may be
cached on the device, and then when the power returns, all the devices send their entire set of messages
at once. To handle these situations, the devices may be designed to have a random offset for message
delays, or the message receiving service can implement a circuit breaker pattern77 to circumvent failure
when an abnormal event pattern happens.
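The random-offset idea can be illustrated with a short simulation. This is a sketch under assumptions: the device count, the 60-second maximum offset, and the function names are hypothetical. It shows how spreading resends over a window lowers the peak arrival rate compared with every device sending at the same instant.

```python
import random


def resend_offset(max_offset_seconds, rng=random):
    """Pick a random delay before a device flushes its cached messages."""
    return rng.uniform(0.0, max_offset_seconds)


def arrivals_per_second(device_count, max_offset_seconds, seed=0):
    """Simulate when each device resends after power returns and count
    how many devices land in each one-second slot."""
    rng = random.Random(seed)
    buckets = [0] * max_offset_seconds
    for _ in range(device_count):
        offset = resend_offset(max_offset_seconds, rng)
        slot = min(int(offset), max_offset_seconds - 1)
        buckets[slot] += 1
    return buckets


buckets = arrivals_per_second(device_count=10_000, max_offset_seconds=60)
print("peak arrivals in one second:", max(buckets))
```

Without the offset, all 10,000 devices would arrive in the same second. With it, the load is spread across the window, at the cost of delaying some messages by up to the maximum offset.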
Complex-event processing
Complex-event processing is used to detect conditions or states on data in motion that may not be
directly deduced from simple data evaluation. This might include the detection of a certain set of events
that arrive in a particular order or frequency, such as an event that is innocuous if it appears once, but
that indicates a problem if it occurs a certain number of times in a certain timeframe, or if the same event
is transmitted from a set of devices or sensors. Imagine that your car sends telemetry to the
manufacturer, and one of the items that it reports is failed starts. By itself, this would mean very little to
the manufacturer. However, if the weather got very cold last night, and none of the SuperCar Model 8s in
that area started in the morning, that could tell the manufacturer that there is a systemic problem with the
car's battery or something related to the starting system.
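The failed-start scenario can be sketched as a sliding-window check. This is a simplified illustration rather than a stream processing engine: the class name, the threshold, and the window length are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class FailedStartDetector:
    """Flags a region and model when enough distinct vehicles report a
    failed start within a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        # (region, model) -> deque of (timestamp, device_id), oldest first
        self.events = defaultdict(deque)

    def observe(self, region, model, device_id, timestamp):
        """Record one failed start; return True if the pattern is detected."""
        window = self.events[(region, model)]
        window.append((timestamp, device_id))
        # Drop events that have fallen out of the time window.
        while window and window[0][0] <= timestamp - self.window_seconds:
            window.popleft()
        distinct_devices = {device for _, device in window}
        return len(distinct_devices) >= self.threshold
```

A single failed start returns False, as does the same car failing repeatedly. Only the same event from several cars in the same area within the window indicates a systemic problem, which is the distinction drawn above.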
The industry sees complex event processing as one of the keys to monetizing the vast opportunity of
IoT.78 When envisioning the solution, ensure that initial requirements are discussed early in the project.
This is an area where businesses will learn and improve over time, but one which should be prototyped
early in the process to prove out the concepts, and to begin to develop the right mindset for capitalizing
on the opportunities. This is a rich area of development within Microsoft, our competitors, and the open
source community. Microsoft has developed StreamInsight,79 which can be deployed in the cloud. A
popular open source project is Apache Storm80 for real-time stream processing, and Amazon is offering
Kinesis for their cloud solutions, which includes stream processing.
77 MSDN, Circuit Breaker Pattern
78 Venture Beat, Without stream processing, there’s no big data and no Internet of things
79 Microsoft, StreamInsight
Big Data analysis
One of the main drivers for IoT is the ability to economically collect and store large amounts of data. After
the data is collected, it must be processed, aggregated, and analyzed to create datasets that can be
visualized and used for business analysis, to inform business decisions and strategy, as feedback into
product engineering to improve products, or to provide views of the data that can be shared with partners for
monetization or adding value to the business relationship.
The most common approach for this is to use the Map/Reduce81 pattern to batch process collected data.
Apache Hadoop is the predominant implementation of that pattern, and Microsoft provides HDInsight,
which is a cloud platform service implementation of Hadoop. The approach may be as simple as
aggregating and summarizing data for simpler reuse, or it may be complex, multi-step processing that
generates insights across the recently collected and historical data. Hadoop includes many tools within its
ecosystem that help with searching, querying, and cataloging the data. In solutions today, Hadoop will
frequently be used to preprocess data, such that Hadoop jobs will run and create summarized datasets
that can be used for querying, reporting, and as input to machine learning activities, or as reference
datasets in Complex Event Processing solutions.
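The pattern itself can be shown in a few lines of plain Python. This single-process sketch only illustrates the map, shuffle, and reduce phases. It does not use Hadoop or HDInsight, and the record fields are hypothetical.

```python
from collections import defaultdict


def map_reading(reading):
    """Map phase: emit (key, value) pairs, here temperature keyed by device."""
    yield reading["deviceId"], reading["temperature"]


def shuffle(pairs):
    """Shuffle phase: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_readings(device_id, temperatures):
    """Reduce phase: summarize all values for one key."""
    summary = {"count": len(temperatures),
               "mean": sum(temperatures) / len(temperatures),
               "max": max(temperatures)}
    return device_id, summary


def summarize(readings):
    pairs = (pair for reading in readings for pair in map_reading(reading))
    return dict(reduce_readings(key, values)
                for key, values in shuffle(pairs).items())


readings = [{"deviceId": "a", "temperature": 20.0},
            {"deviceId": "b", "temperature": 30.0},
            {"deviceId": "a", "temperature": 22.0}]
print(summarize(readings))
```

In a real cluster the map and reduce functions run in parallel across many nodes. The resulting summarized dataset is the kind of output that can feed querying, reporting, or machine learning.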
Machine learning
Machine learning refers to the concept of studying data and deriving insights from the data. The results
will be a model that can be used to predict future outcomes from similar data sets. The first step is to train
the model. This is normally an iterative step performed by a data scientist where a training set of data is
used to infer a function, or model, from that data. That model will be used to make decisions on incoming
data. The model is typically retrained periodically, so that the model can improve over time, learning from
additional new data and patterns.
Machine learning falls into two broad categories: supervised learning and unsupervised learning.
Supervised learning studies the data looking for a known set of desired outcomes. In other words, in the
vehicle scenario, I may want to minimize the number of times that a car needs its oil changed. So I would
run studies against the data looking for patterns that give me information about the consequences of
delaying oil changes, conditions, and so on. In unsupervised learning, the concept is to naturally find
patterns and relationships of any kind in the data. After something interesting is observed, then these
data points will be further investigated until they are found to be either useful or not useful.
Common tools for machine learning include MATLAB82, Mahout83 and R84. Microsoft introduced its ML
tooling in June 2014, called Azure ML.85 Azure ML is a machine learning service that democratizes the
practice of machine learning. It provides a visual experience for constructing data experiments, and easy
to use implementations of many commonly used machine learning algorithms, relieving the data scientist
of implementing them in a programming language. Azure ML integrates easily with Azure Storage,
HDInsight, and Windows Azure SQL Database, and it can expose the models as web services so that
they are simple to integrate into the runtime data flow or applications.
80 Apache, Storm, distributed and fault-tolerant realtime computation
81 Wikipedia, MapReduce
82 Wikipedia, MATLAB
83 Wikipedia, Apache Mahout
84 Wikipedia, R (programming language)
85 Techcrunch, Microsoft announces Azure ML, Cloud-based Machine Learning Platform That Can Predict Future Events
Data enhancement
Another core piece of the IoT architecture is data enhancement. The data collected from the devices, the
volume of it, and the hidden patterns within it provide tremendous value, but combining the device
data with other datasets is often either critical for it to make sense to the business, or a source of even more
significant value. Enterprise data may be used for
simple things, such as relating device data to customer data. Other areas of opportunity include data
markets that publish datasets that are either sold or available for free. Microsoft offers the Azure
DataMarket86, which offers datasets from governments, research institutions, and business
organizations, including historical and environmental data, and more. One of the most frequent datasets that gets combined with device
data is weather. Devices often exist all over the globe in different conditions, so predictive maintenance
will frequently factor in weather data, which is normally sourced from weather data providers as opposed
to collecting it with the device itself.
Publishing insights
After data stored in the system has been processed into
information of value to others, the question becomes how
to approach this exposure in a secure and compatible
manner that is easy to discover and consume. Some
organizations want to make their data available to
partners both up and down the supply chain to realize
efficiencies that result in lower costs and improve margins. Others are realizing the data they have can be
directly monetized as services available for consumption by individuals, corporations and governments
around the world. In addition to the stand-alone value of the data, it may also be seen as valuable to
augment other data services. Data that may seem uninteresting to those within the organization could in
reality be a key ingredient used in a number of potential external applications or analytical recipes. For an
in-depth discussion of data-publishing considerations, see the paper “Making Public Data Public” from
Microsoft.87 The following sections discuss many aspects of this topic.
Audience
The target audience for the data will have a significant impact on how it is published. Will it be used to
enhance analysis of other data? Will it be used through data visualization tools, such as PowerBI or
Tableau? Will it be metered and have a price associated with it? Or will there be different views and price
points of the data for different partners?
Publishing format
The choice of publishing format will be influenced by the targeted audience and the type of information
being published. Similar to the discussion earlier in this paper about the incoming message format, the
most likely choices for publishing data are XML, JSON, and AtomPub. OData88 is a standardized protocol
for creating and consuming data APIs. OData originated at Microsoft, but it has become well-accepted in
the industry. OData supports both JSON and AtomPub, so it is widely consumable by nearly all current
tools and programming languages.
86 See https://datamarket.azure.com/
87 Microsoft, Making Public Data Public
88 OData, OData Home page
There are tools that can help scale, secure, and normalize the data publishing task. The Microsoft Azure
DataMarket89 is a global marketplace for data and applications that provides discoverability, interface
normalization, and a monetization approach. Microsoft Azure API Management90 is a service that
facilitates publishing APIs. It includes features for API translation, versioning, aggregation, discovery,
authorization, caching, and quotas. Both Azure DataMarket and Azure API Management can be part of
the publishing strategy, using DataMarket for the broad exposure of large datasets, and API Management
to expose APIs securely with usage metrics and management capabilities.
89 Microsoft, Microsoft Azure Marketplace Publishing
90 Microsoft, Microsoft Azure API Management
Cost modeling and estimation
Determining the cost of an Internet of Things (IoT) solution focused on predictive maintenance is
generally a complex problem. This section outlines an initial approach that we have used with our
customers to estimate the cost of the architecture to support their predictive maintenance solutions. As with
any calculation, the result is specific to a scenario, and this model will not be applicable to all situations or be
totally complete.
Before we go into the specifics of determining the cost for a solution, we want to stress that cost
modeling, like capacity planning, is an iterative exercise. The process repeats itself, and performance
testing and other data gathered will change capacity distribution (for example, different workloads could
be combined in a single unit to save cost because these workloads are compatible in load profile) and
tune the model over time. In other words, the first cost estimate will not be perfect, and it provides only an
indicator of the cost of the solution.
A common architecture for IoT
Although you need to verify whether it satisfies your specific requirements, a reference architecture
surfaced from our work with customers that helps in implementing the Service Assisted Connectivity pattern
by acting as the mentioned gateway. This architecture is built on top of Microsoft Azure Service Bus.
Within Service Bus, it utilizes Event Hubs for the ingress (device to cloud) of data and topics for sending
Command & Control messages as well as replies.
Event Hubs
Event Hubs is a new feature of Microsoft Azure Service Bus. It stands next to topics and queues as a
Service Bus entity, and provides a different type of queue, offering time-based retention, client-side
cursors, publish-subscribe support, and high-scale stream ingestion. Although it could be argued that the use
of topics could satisfy the technical requirement for receiving data from devices, Event Hubs supports
higher throughput and has an increased horizontal capacity.
Architectural details
Starting at the logical architecture level, the main architectural components are depicted in the following
figure.
Figure 14. Reference architecture conceptual overview
The previous conceptual architecture figure includes four important components within the system:
1. The provisioning service that takes in information on authorized devices, creates its configuration,
and stores access keys.
2. Devices that interact using either AMQP or HTTP towards Service Bus directly, or a component
called the Custom Protocol Gateway Host, which hosts adapters for other protocols, such as MQTT
and CoAP.
3. Telemetry requests that are distributed by the router, using adapters to communicate with
downstream storage and processing engines.
4. Commands sent to devices through the use of the notification/command router that is internally
surfaced through the Command API host.
To ensure the architecture is able to support a large number of devices, a partitioned model where the
device population is divided into manageable groups is used. This partition model can be seen in the
following figure.
Figure 15. Reference architecture details and partition overview
The figure details some important aspects of the reference architecture:

Master. Part of the requirements assumption for the architecture is that solutions built on top of it will
aim for a unified global or at least regional management model, independent from technical scale
limitations that might inform how large a particular partition may grow.
This motivates an overarching architectural model with a common ‘‘Master’’ service, shown on the far
left of the figure, that takes care of shared management and deployment tasks, as well as of device
provisioning and placement, and several parallel and independent deployments of ‘‘Partition’’ services
that each take ownership of one or more logical system partitions.

Partition. Instead of looking at a population of millions of connected devices as a whole, the system
divides the device population into smaller, more manageable partitions of large numbers of devices
each.
Each resource in the distributed system has a throughput- and storage-capacity ceiling, limiting the
number of devices associated with any single Service Bus ingress entity so that the events sent by
the devices will not exceed that entity’s ingestion throughput capacity, and any message backlog that
might temporarily build up does not exceed the entity’s storage capacity.
In order to allocate appropriate compute resources and not overload the storage backend with too
many concurrent write operations, a relatively small set of resources with reasonably well-known
performance characteristics is bundled into an autonomous, and mostly isolated “scale-unit.”
Each scale-unit supports a maximum and tested number of devices, which is also important for
limiting risks in a scalability ramp-up. The principle behind this is that a production system can only be
scaled up as much as it can be scaled up in testing on a regular basis.
A benefit of introducing scale-units is that they significantly reduce the risk of full system outages. If a
system depends on a single data store and that store has availability issues, the whole system is
affected. However, if the system consists of 10 scale-units that each maintain an independent store,
issues in one store only affect 10 percent of the system.
The principle of running all traffic ingestion through asynchronous Service Bus messaging entities,
instead of into a service edge that writes data straight to the database, is that Service Bus already
provides a scaled-out and secure network service gateway for messaging, and it is specifically
designed to deal with bad network conditions, traffic bursts, and even sustained traffic peaks. A backend datastore that is the target of the ingested data should not be dimensioned to handle specific
bursts, such as vehicle telemetry during core European or U.S. East Coast rush hours.
The group called “partition” is a set of resources focused on handling data from a well-defined and
known device population that has been assigned to and configured into the partition through
provisioning. Cross-partition distribution of devices will be based on your solution-specific logic, and
allocation within the partition is handled by provisioning.
The “partition” group is the unit of scale. Through testing, the load specifications for the partition have
to be determined and a so-called scale-unit can be defined. A scale-unit is a group of resources that
can effectively support a well-known load profile for the system, allowing replication of the scale-unit
to provide support for an extrapolation of this load profile. Within the “partition” group, there are two
basic paths, ingestion (sending data from the device to the cloud) and egress (sending data from the
cloud to the device). These paths accomplish the following:
− Ingestion. Ingestion has a given device connect through its supported protocol, delivering
messages to its specific Event Hub, using its assigned credentials.
− Egress. Egress routes messages (replies, Command & Control) to their device destination.

Device Repo. The device repository contains configuration information about the registered devices
for a given partition.
Capacity modeling
Before cost can be modeled, the way that the system will scale needs to be considered and the
characteristics of the architecture need to be determined. Essentially, the attributes of the previously
mentioned scale-unit need to be defined.
There is a throughput ceiling for each of the components in the architecture, including each of the Service
Bus entities. The reason to be cautious when evaluating throughput is that when dealing with distributed
devices that send messages periodically, we cannot assume perfect, random distribution of event
submissions across any given period. There will be bursts and we need to allow for ample capacity
reserve to handle such bursts.
Assuming a scenario of a 10-minute event interval with one extra control interaction feedback message
per device per hour, seven messages per hour from each device can be expected, and roughly 50,000
devices can be associated with each entity with a 100 messages per second average throughput
capacity.
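The arithmetic behind these figures can be reproduced as follows. This is a sketch of the averages only: the function names are hypothetical, and the result ignores the bursts discussed above, which is why the text rounds down to roughly 50,000 devices.

```python
import math


def messages_per_device_per_hour(event_interval_minutes, control_messages_per_hour):
    """Telemetry events per hour plus control interaction messages per hour."""
    return 60 / event_interval_minutes + control_messages_per_hour


def devices_per_entity(entity_messages_per_second, per_device_per_hour):
    """Devices one entity can serve at its average throughput capacity."""
    entity_messages_per_hour = entity_messages_per_second * 3600
    return math.floor(entity_messages_per_hour / per_device_per_hour)


per_device = messages_per_device_per_hour(event_interval_minutes=10,
                                          control_messages_per_hour=1)
print(per_device)
print(devices_per_entity(entity_messages_per_second=100,
                         per_device_per_hour=per_device))
```

Six telemetry events plus one control message give seven messages per hour per device. At 100 messages per second, one entity handles 360,000 messages per hour, or about 51,000 devices on average.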
Having covered the flow rate, we can conclude that storage throughput is of little concern. However,
storage capacity and the manageability of the event store are concerns. The per-device event data at a
resolution of one hour for 50,000 devices amounts to some 438 million event records per year. Even if
these event records are limited in size to only 50 bytes, the payload data is still roughly 22 GB per year for
each of the scale-units. This underlines the need to keep an eye on the storage capacity and storage
growth when thinking about sizing scale-units.
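The storage figures follow from a short calculation, sketched here with hypothetical function names and decimal gigabytes (10^9 bytes). Payload bytes only are counted, so storage overhead such as keys and indexes would come on top.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_event_records(devices, records_per_device_per_hour=1):
    """Event records stored per year at the given per-device resolution."""
    return devices * records_per_device_per_hour * HOURS_PER_YEAR


def yearly_payload_gb(records, bytes_per_record):
    """Payload volume per year in decimal gigabytes."""
    return records * bytes_per_record / 1e9


records = yearly_event_records(devices=50_000)
print(records)
print(yearly_payload_gb(records, bytes_per_record=50))
```

For 50,000 devices this gives 438 million records and about 21.9 GB per year for each scale-unit.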
These considerations manifest in a capacity model in the deployment model, which informs how many
entities must be created in the Service Bus namespace backing a partition for a given device population
size like 50,000 devices and for a given load profile.
The load profile is currently informed by how many (telemetry-) messages a device is generally expected
to send, how many commands or notifications the device is expected to receive per hour, and what the
average size of these messages is. The inputs should be well-informed but generous estimates, because
while changing the shape of a scale-unit layout at a later time is possible, doing so may require reprovisioning the devices.
Determining partitions is not only motivated by capacity concerns, however. Because a partition also
forms a configuration scope, it provides a suitable mechanism to segregate device populations by region,
country, owner, operator, product, or other concerns. As an example, one deployment can have up to
1,024 partitions.
Each partition corresponds to exactly one Service Bus namespace. Because there can only be 50
namespaces per Azure subscription, and other dependent services have similar quotas, a fully built-out
architecture will therefore most likely span multiple subscriptions.
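As a quick illustration of the quota effect, the following sketch counts the subscriptions needed for a given number of partitions. It assumes the limit of 50 namespaces per subscription quoted above and considers no other quota; the function name is hypothetical.

```python
import math

NAMESPACES_PER_SUBSCRIPTION = 50  # quota quoted in the text; check current Azure limits


def subscriptions_needed(partitions, namespaces_per_subscription=NAMESPACES_PER_SUBSCRIPTION):
    """Each partition maps to one namespace, so the namespace quota bounds
    the number of partitions a single subscription can hold."""
    return math.ceil(partitions / namespaces_per_subscription)


print(subscriptions_needed(1_024))  # 21
```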
In summary, the attributes that we have found to determine the capacity model are:

▪ Number of devices. This is the number of sensors supplying telemetry information to the scale-unit.
▪ Average message interval ingress / egress. This represents the average number of messages that a
given device emits per hour (ingress) and that the system emits per hour (egress).
▪ Average message size ingress / egress. This is the average size, in bytes, of the messages that a
device emits (ingress) or the system sends (egress).
Figure 16. Scale Units in the reference architecture
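The three attributes can be captured in a small data structure from which the number of entities per scale-unit is derived. The sketch below is only one possible shape for such a capacity model. The class, the field names, the 50 percent burst reserve, and the choice to count ingress and egress messages against the same entities are all assumptions made for illustration; the 100 messages per second capacity is the average used earlier in this section.

```python
import math
from dataclasses import dataclass


@dataclass
class LoadProfile:
    """Capacity-model attributes for one scale-unit (hypothetical structure)."""
    devices: int
    ingress_msgs_per_device_per_hour: float
    egress_msgs_per_device_per_hour: float
    avg_ingress_msg_bytes: int
    avg_egress_msg_bytes: int

    def avg_msgs_per_second(self) -> float:
        per_device = (self.ingress_msgs_per_device_per_hour
                      + self.egress_msgs_per_device_per_hour)
        return self.devices * per_device / 3600

    def avg_bytes_per_second(self) -> float:
        per_device = (self.ingress_msgs_per_device_per_hour * self.avg_ingress_msg_bytes
                      + self.egress_msgs_per_device_per_hour * self.avg_egress_msg_bytes)
        return self.devices * per_device / 3600


def entities_needed(profile: LoadProfile, entity_msgs_per_second: float = 100.0,
                    burst_reserve: float = 0.5) -> int:
    # Assumed sizing rule: plan for only (1 - burst_reserve) of an entity's
    # average capacity, keeping the remainder as headroom for bursts.
    usable = entity_msgs_per_second * (1.0 - burst_reserve)
    return math.ceil(profile.avg_msgs_per_second() / usable)


# Seven messages per device per hour, as in the earlier scenario; sizes are illustrative.
profile = LoadProfile(devices=50_000, ingress_msgs_per_device_per_hour=7,
                      egress_msgs_per_device_per_hour=0,
                      avg_ingress_msg_bytes=50, avg_egress_msg_bytes=50)
print(entities_needed(profile))                     # 2 with a 50% reserve
print(entities_needed(profile, burst_reserve=0.0))  # 1 with no reserve
```

The reserve factor makes the burst allowance explicit, so it can be tuned from performance tests instead of being hidden inside a device count.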
Cost estimation
When estimating the cost of a solution built on top of this architecture, there are many factors to
consider. We will work through the list from the ingress of device data to sending commands. Cost is
estimated based on the architectural design and the scale necessary for success. As such, the cost
formulas take the required scale as input variables.
Before we dig into the details, we want to underscore that cost modeling, like capacity modeling, is an
input for architectural decision making and business case modeling, where the combination of all inputs
should always be considered as a whole. As an example, you might find that using HTTP for
communications is somewhat less expensive from a cost modeling perspective. However, choosing HTTP
over AMQP will inherently impact performance.
For all pricing related information in the cost estimation formulas outlined in this section, it is important to
state that prices will vary over time and the examples are aimed only at explaining the formula itself. The
latest pricing information can always be found at http://azure.microsoft.com/en-us/pricing/overview/.
Ingress path cost using Event Hubs
As events consumed from an Event Hub, as well as management operations and “control calls” such as
checkpoints, are not counted as billable ingress events, the formula for estimating cost for the
architecture when using Event Hubs is a combination of:
\[
\begin{aligned}
\mathit{Cost}_{\text{monthly ingress}} = {} & \mathit{Cost}_{\text{base charge}} + \mathit{Cost}_{\text{brokered connections}} + \mathit{Cost}_{\text{throughput units}} + \mathit{Cost}_{\text{messages}} \\
& + \mathit{Cost}_{\text{protocol gateway}} + \mathit{Cost}_{\text{telemetry pump}} + \mathit{Cost}_{\text{egress traffic}}
\end{aligned}
\]
Which expands into a more detailed formula we can work with to fill in the appropriate variables:
Figure 17. The ingress path of the reference architecture
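\[
\begin{aligned}
\mathit{Cost}_{\text{monthly ingress}} = {} & \mathit{Cost}_{\text{base charge}}
+ \left( \frac{T_{\text{brokered connections}}}{744} - 1000 \right) \times
\begin{cases}
\$0.03 & a < 100\text{k} \\
\$0.025 & 100\text{k} \le a \le 500\text{k} \\
\$0.015 & a > 500\text{k}
\end{cases} \\
& + 744\, N_{\text{throughput units}} \times \$0.03
+ \left( \frac{N_{\text{devices}}\, A_{\text{msg per month}}\, \mathrm{ceil}\left( \frac{A_{\text{msg size in KB}}}{64} \right)}{1{,}000{,}000} - 12.5 \right) \times \$0.028 \\
& + \left( \sum_{\text{worker roles}} 744\, N_{\text{supported scale}}\, \mathit{Cost}_{\text{worker}} \right)
+ A_{\text{egressGB}} \times \mathit{Cost}_{\text{egressGB}}
\end{aligned}
\]
Here a denotes the number of billable brokered connections, which selects the price tier.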
Equation 1 - The cost estimation formula for the ingress path
It should be noted that this formula uses the “Standard” tier offering of Event Hubs [91], which offers
additional brokered connections, filters, and additional storage capacity. The fixed pricing elements in
the formula use pricing from a point in time and are susceptible to change. Also, the formula assumes a
flat use of brokered connections, while actual billing is based on peak use prorated per hour; the
dynamics of your system will likely deviate.
The variables in this equation are:
Variable
Description
𝐶𝑜𝑠𝑡𝑚𝑜𝑛𝑡ℎ𝑙𝑦 𝑖𝑛𝑔𝑟𝑒𝑠𝑠
The cost of the ingestion of events, per month.
𝑇𝑏𝑟𝑜𝑘𝑒𝑟𝑒𝑑 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑖𝑜𝑛𝑠
The total number of hours that connections to the system are open during the month, summed over all
simultaneous connections.
𝑁𝑡ℎ𝑟𝑜𝑢𝑔ℎ𝑝𝑢𝑡 𝑢𝑛𝑖𝑡𝑠
The number of throughput units [92] needed to support the ingress of data into the system. A throughput
unit is the combination of inbound bandwidth, temporary storage and outbound bandwidth, as described
in the reference.
𝑁𝑑𝑒𝑣𝑖𝑐𝑒𝑠
The number of deployed devices sending data to the system.
𝐴𝑚𝑠𝑔 𝑝𝑒𝑟 𝑚𝑜𝑛𝑡ℎ
The average number of messages sent into the system, per device, per month.
𝐴𝑚𝑠𝑔 𝑠𝑖𝑧𝑒
The average size of each message sent into the system, in kilobytes.
𝑁𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑒𝑑 𝑠𝑐𝑎𝑙𝑒
The number of worker roles necessary to support the projected scale of the system. Normally, at least
two (2) are needed to fall within SLA support of Microsoft Azure.
𝐶𝑜𝑠𝑡𝑤𝑜𝑟𝑘𝑒𝑟
The cost per worker role for the ingress path when using custom protocols and for the telemetry pump,
per hour.
𝐴𝑒𝑔𝑟𝑒𝑠𝑠𝐺𝐵
The average amount of egress traffic per month, in gigabytes.
𝐶𝑜𝑠𝑡𝑒𝑔𝑟𝑒𝑠𝑠𝐺𝐵
The cost of egress traffic [93], per gigabyte.
[91] See http://azure.microsoft.com/en-us/pricing/details/event-hubs/.
Example calculation
An example calculation where 1,000,000 deployed devices send a message averaging 128 bytes every
60 seconds, having an average number of 100,000 simultaneously connected devices during the entire
month would yield the following results:
Variable
Value
𝑇𝑏𝑟𝑜𝑘𝑒𝑟𝑒𝑑 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑖𝑜𝑛𝑠
74,400,000 connection hours (100,000 simultaneous connections for the full month of 744 hours).
𝑁𝑡ℎ𝑟𝑜𝑢𝑔ℎ𝑝𝑢𝑡 𝑢𝑛𝑖𝑡𝑠
17 (44,640 messages per device per month gives 44,640,000,000 messages per month, or approximately
16,666.67 per second. Given that a single throughput unit supports up to 1,000 messages per second,
16,666.67 / 1,000 rounds up to 17).
𝑁𝑑𝑒𝑣𝑖𝑐𝑒𝑠
1,000,000
𝐴𝑚𝑠𝑔 𝑝𝑒𝑟 𝑚𝑜𝑛𝑡ℎ
44,640 (744 hours * 3,600 equals 2,678,400 seconds per month. 1 message every 60 seconds equals
44,640 messages per month).
𝐴𝑚𝑠𝑔 𝑠𝑖𝑧𝑒
1 KB (128 bytes is 128 / 1,024 = 0.125 KB, rounded up to 1 KB).
𝑁𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑒𝑑 𝑠𝑐𝑎𝑙𝑒
50 (assuming a rough estimate of 20,000 devices supported per worker role). Note again that this is not
a capacity modeling exercise; these numbers should come from performance tests on your specific
scenario.
𝐶𝑜𝑠𝑡𝑤𝑜𝑟𝑘𝑒𝑟
$0.08 per hour (assuming A1 worker role size).
𝐴𝑒𝑔𝑟𝑒𝑠𝑠𝐺𝐵
0 (assuming all downstream processing happens inside the same region's datacenter).
𝐶𝑜𝑠𝑡𝑒𝑔𝑟𝑒𝑠𝑠𝐺𝐵
Not applicable
[92] Microsoft, Microsoft Azure, Event Hubs pricing, FAQ “What are throughput units and how are they billed?”
[93] Microsoft, Microsoft Azure, Data Transfer Pricing Details
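\[
\begin{aligned}
\mathit{Cost}_{\text{monthly ingress}} = {} & \mathit{Cost}_{\text{base charge}} + 99{,}000 \times \$0.03 + 744 \times 17 \times \$0.03 \\
& + \left( \frac{1{,}000{,}000 \times 44{,}640 \times \mathrm{ceil}\left( \frac{1}{64} \right)}{1{,}000{,}000} - 12.5 \right) \times \$0.028 + (50 \times 744 \times \$0.08) \\
= {} & \mathit{Cost}_{\text{base charge}} + \$2{,}970 + \$379.44 + \$1{,}249.57 + \$2{,}976.00 \\
= {} & \mathit{Cost}_{\text{base charge}} + \mathbf{\$7{,}575.01}
\end{aligned}
\]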
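The ingress formula can also be expressed as a short function, which makes it easier to vary the inputs. The following is a sketch under stated assumptions, not an official pricing calculator: the function and parameter names are hypothetical, the rates are the point-in-time prices used in Equation 1, the brokered connection rate is applied as a single rate for the tier the billable count falls in (as the worked example does), negative billable quantities are clamped to zero, and the base charge defaults to zero because the example does not assign it a value. T is taken as connection hours, that is, 100,000 connections for 744 hours.

```python
import math

HOURS_PER_MONTH = 744  # 31-day month, as used throughout the formulas


def brokered_connection_rate(billable_connections):
    """Price per billable brokered connection for the tier the count falls in."""
    if billable_connections < 100_000:
        return 0.03
    if billable_connections <= 500_000:
        return 0.025
    return 0.015


def throughput_units_needed(devices, send_interval_seconds, msgs_per_second_per_unit=1_000):
    """Throughput units needed for the average message rate, rounded up."""
    return math.ceil(devices / send_interval_seconds / msgs_per_second_per_unit)


def monthly_ingress_cost(connection_hours, throughput_units, devices,
                         msgs_per_device_per_month, msg_size_kb,
                         worker_roles, worker_cost_per_hour,
                         egress_gb=0.0, egress_cost_per_gb=0.0, base_charge=0.0):
    # Equation 1 subtracts 1,000 connections before pricing; clamp at zero (assumption).
    billable_connections = max(connection_hours / HOURS_PER_MONTH - 1_000, 0)
    connections = billable_connections * brokered_connection_rate(billable_connections)
    throughput = HOURS_PER_MONTH * throughput_units * 0.03
    # Messages are counted in 64 KB units, per million, minus the 12.5 million
    # that Equation 1 subtracts; clamp at zero (assumption).
    billed_units = devices * msgs_per_device_per_month * math.ceil(msg_size_kb / 64)
    messages = max(billed_units / 1_000_000 - 12.5, 0) * 0.028
    workers = HOURS_PER_MONTH * worker_roles * worker_cost_per_hour
    egress = egress_gb * egress_cost_per_gb
    return base_charge + connections + throughput + messages + workers + egress


units = throughput_units_needed(devices=1_000_000, send_interval_seconds=60)
total = monthly_ingress_cost(connection_hours=100_000 * HOURS_PER_MONTH,
                             throughput_units=units, devices=1_000_000,
                             msgs_per_device_per_month=44_640, msg_size_kb=128 / 1024,
                             worker_roles=50, worker_cost_per_hour=0.08)
print(units)           # 17
print(f"{total:.2f}")  # 7575.01
```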
Egress path cost
As with ingress, the egress path also has multiple components that incur cost. As sizes often vary
between ingestion data and command & control, the message size is not the same value as used in the
ingress path.
The components involved in egress are:

▪ Command API Host. The process in charge of sending notifications and commands to devices and
groups of devices. It encapsulates the notification/command router, and routes egress messages to the
appropriate topic on Microsoft Azure Service Bus, depending on the type of request. It is hosted inside a
worker role.
▪ Subscriptions. There are two different types of messages that the Command API supports:
notifications and commands. A command can yield either a single response message or multiple
response messages. Notifications and commands can also target groups of devices. All of these
messages incur cost. Response messages have not been accounted for in the egress calculation and
should be estimated here. Command replies are not routed through the telemetry adapters.
▪ Egress traffic. Each egress message will incur cost.
Figure 18. The egress path for the reference architecture
Given these components, the egress path cost can be calculated using the following formula:
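\[
\mathit{Cost}_{\text{monthly egress}} = \mathit{Cost}_{\text{worker roles commandAPI}}
+ \mathit{Cost}_{\substack{\text{notifications} \\ \text{single command messages} \\ \text{multi command messages} \\ \text{command response messages}}}
+ \mathit{Cost}_{\text{egress traffic}}
\]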
Which also expands into a more detailed formula we can work with to fill in the appropriate variables:
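\[
\begin{aligned}
\mathit{Cost}_{\text{monthly egress}} = {} & 744\, N_{\text{supported scale}}\, \mathit{Cost}_{\text{commandAPI}} \\
& + \left( \frac{\sum_{A \in \{A_{nm},\, A_{csm},\, A_{cmm},\, A_{rm}\}} A \;\mathrm{ceil}\left( \frac{\mathit{size}_a}{64} \right)}{1{,}000{,}000} - 12.5 \right) \times
\begin{cases}
\$0.80 & a < 100 \\
\$0.50 & 100 \le a \le 2{,}500 \\
\$0.20 & a > 2{,}500
\end{cases} \\
& + \left( \frac{\sum_{A \in \{A_{nm},\, A_{csm},\, A_{cmm}\}} A \; \mathit{size}_a}{1{,}048{,}576} \right) \mathit{Cost}_{\text{egressGB}}
\end{aligned}
\]
Here a denotes the billable message volume in millions, which selects the price tier.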
Equation 2 - The cost estimation formula for the reference architecture egress path
This calculation combines both single device notifications and commands, as well as group broadcast
messaging. Determining the magnitude and distribution in order to figure out the averages within the
formula is left to the reader as part of the capacity modeling for the system architecture.
The variables in this equation are:
Variable
Description
𝑁𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑒𝑑 𝑠𝑐𝑎𝑙𝑒
The number of roles necessary to support the projected scale of the system.
Normally, at least two (2) are needed to fall within SLA support of Microsoft Azure.
𝐶𝑜𝑠𝑡𝑐𝑜𝑚𝑚𝑎𝑛𝑑𝐴𝑃𝐼
The cost per worker role for the command API host, per hour.
𝐴𝑛𝑚
The average number of notifications per month.
𝐴𝑐𝑠𝑚
The average number of single response command messages per month.
𝐴𝑐𝑚𝑚
The average number of multiple response command messages per month.
𝐴𝑟𝑚
The average number of response messages to commands, per month.
𝑆𝑖𝑧𝑒𝑎
The average response size, in kilobytes, averaged over all outbound message types.
𝐶𝑜𝑠𝑡𝑒𝑔𝑟𝑒𝑠𝑠𝐺𝐵
The cost of egress traffic [94], per gigabyte.
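To show how the terms of Equation 2 interact, the sketch below implements the formula as a function. It is an illustration under stated assumptions rather than a pricing tool: the names are hypothetical, the tier rate is applied as a single rate for the tier the billable volume falls in, volumes below the 12.5 million subtracted in the formula are clamped to zero cost, and the input values in the usage lines are invented purely to exercise the formula above that threshold.

```python
import math

HOURS_PER_MONTH = 744
KB_PER_GB = 1_048_576


def message_rate_per_million(billable_millions):
    """Price per million messages for the tier the billable volume falls in."""
    if billable_millions < 100:
        return 0.80
    if billable_millions <= 2_500:
        return 0.50
    return 0.20


def monthly_egress_cost(roles, role_cost_per_hour, outbound, responses, egress_cost_per_gb):
    """outbound and responses are lists of (messages per month, average size in KB).

    outbound holds notifications and commands; responses holds command replies.
    """
    workers = HOURS_PER_MONTH * roles * role_cost_per_hour
    # Every message, including responses, is counted in 64 KB billing units.
    billed_units = sum(count * math.ceil(size_kb / 64)
                       for count, size_kb in outbound + responses)
    billable_millions = max(billed_units / 1_000_000 - 12.5, 0)  # clamp is an assumption
    messages = billable_millions * message_rate_per_million(billable_millions)
    # Only notifications and commands count toward egress traffic in Equation 2.
    egress_gb = sum(count * size_kb for count, size_kb in outbound) / KB_PER_GB
    return workers + messages + egress_gb * egress_cost_per_gb


# Hypothetical volumes, chosen only so that the message term is non-zero.
total = monthly_egress_cost(roles=2, role_cost_per_hour=0.08,
                            outbound=[(20_000_000, 20), (10_000_000, 100)],
                            responses=[(10_000_000, 80)],
                            egress_cost_per_gb=0.138)
print(f"{total:.2f}")
```

Keeping outbound messages and responses in separate lists mirrors the formula, where responses add to the message count but not to the egress traffic term.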
Example calculation
An example calculation using 100,000 notifications per month of 20 KB each, 130,000 commands of 35
KB each with single replies of 80 KB each, and 20,000 commands of 20 KB each with on average three
(3) replies of 70 KB each would yield the following results:
Variable
Value
𝑁𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑒𝑑 𝑠𝑐𝑎𝑙𝑒
2
𝐶𝑜𝑠𝑡𝑐𝑜𝑚𝑚𝑎𝑛𝑑𝐴𝑃𝐼
$0.08 (A1)
𝐴𝑛𝑚
100,000
𝐴𝑐𝑠𝑚
150,000
𝐴𝑐𝑚𝑚
20,000
𝐴𝑟𝑚
190,000 (130,000 + 3 * 20,000 equals 190,000)
𝐶𝑜𝑠𝑡𝑒𝑔𝑟𝑒𝑠𝑠𝐺𝐵
$0.138
[94] Microsoft, Microsoft Azure, Data Transfer Pricing Details
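\[
\begin{aligned}
\mathit{Cost}_{\text{monthly egress}} = {} & 744 \times 2 \times \$0.08 \\
& + \left( \frac{100{,}000\,\mathrm{ceil}\left(\frac{20}{64}\right) + 150{,}000\,\mathrm{ceil}\left(\frac{35}{64}\right) + 20{,}000\,\mathrm{ceil}\left(\frac{210}{64}\right) + 190{,}000\,\mathrm{ceil}\left(\frac{121}{64}\right)}{1{,}000{,}000} - 12.5 \right) \times
\begin{cases}
\$0.80 & a < 100 \\
\$0.50 & 100 \le a \le 2{,}500 \\
\$0.20 & a > 2{,}500
\end{cases} \\
& + \left( \frac{100{,}000 \times 20 + 150{,}000 \times 35 + 20{,}000 \times 210}{1{,}048{,}576} \right) \times \$0.138 \\
= {} & \$119.04 + \frac{100{,}000 + 150{,}000 + 80{,}000 + 380{,}000}{1{,}000{,}000} \times \$0.80 + \frac{11{,}450{,}000}{1{,}048{,}576} \times \$0.138 \\
= {} & \$119.04 + \$0.80 + \$1.51 = \mathbf{\$121.35}
\end{aligned}
\]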
Management cost
Besides the messaging related components in the reference
architecture, there is also the concept of one or more masters for
managing the system, as discussed previously in this paper. The
master is tasked with provisioning devices, creating appropriate
queues and topics, storing device information, provisioning security,
and so on. The master contains the following cost components:

▪ Provisioning Runtime. The component called by tooling to provision a device or a set of devices into
the system, creating the necessary Service Bus, compute, and storage artifacts. It is hosted inside a
worker role.
▪ Device Repo. The datastore collecting the registered devices per partition.
▪ Partition Repo. The datastore collecting partition registration information.
Figure 19 - The "master" component within the reference architecture
Given these components, the management cost can be calculated using the following formula:
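\[
\begin{aligned}
\mathit{Cost}_{\text{monthly mgmt}} = {} & 744\, N_{\text{supported scale}}\, \mathit{Cost}_{\text{master}} \\
& + \left( \mathit{Size}_{\text{partition repo in GB}} + \mathit{Size}_{\text{device repo in GB}}\, N_{\text{partitions}} \right) \mathit{Cost}_{\text{tsGRS}} + \Delta_{di}\, \mathit{Cost}_{tx}
\end{aligned}
\]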
Equation 3 - The cost estimation formula for management of the reference architecture
The variables in this equation are:
Variable
Description
𝐶𝑜𝑠𝑡𝑚𝑜𝑛𝑡ℎ𝑙𝑦 𝑚𝑔𝑚𝑡
The cost of the management for the architecture, per month.
𝑁𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑒𝑑 𝑠𝑐𝑎𝑙𝑒
The number of roles necessary to support the projected scale of the system.
Normally, at least two (2) are needed to fall within SLA support of Microsoft Azure.
𝐶𝑜𝑠𝑡𝑚𝑎𝑠𝑡𝑒𝑟
The cost per worker role for the management host, per hour.
𝑆𝑖𝑧𝑒𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛 𝑟𝑒𝑝𝑜 𝑖𝑛 𝐺𝐵
The number of gigabytes used in the partition repository for administrative
purposes.
𝑆𝑖𝑧𝑒𝑑𝑒𝑣𝑖𝑐𝑒 𝑟𝑒𝑝𝑜 𝑖𝑛 𝐺𝐵
The number of gigabytes used in the device repository.
𝑁𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛𝑠
The number of partitions to allow for appropriate scale.
𝐶𝑜𝑠𝑡𝑡𝑠𝐺𝑅𝑆
The cost for Geo Redundant Storage (GRS) table storage ($0.095 / GB at the time
of writing).
∆𝑑𝑖
The change for device information. Any change to the device information stored in
the system and subsequently in a device repository inside a partition, will account
for at least two operations on table storage.
𝐶𝑜𝑠𝑡𝑡𝑥
The cost for storage transactions ($0.0036 / 100k transactions at the time of
writing).
Example calculation
An example calculation using 10,000 changes to device registration per month (either new devices,
changes in activation, or removed devices) leading to a total partition repo (assuming a single master
instance is used) size of 256 MB and 128 MB device repository per partition, using 10 partitions, would
yield the following results:
Variable
Value
𝑁𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑒𝑑 𝑠𝑐𝑎𝑙𝑒
2
𝐶𝑜𝑠𝑡𝑚𝑎𝑠𝑡𝑒𝑟
$0.16 (medium)
𝑆𝑖𝑧𝑒𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛 𝑟𝑒𝑝𝑜 𝑖𝑛 𝐺𝐵
0.25
𝑆𝑖𝑧𝑒𝑑𝑒𝑣𝑖𝑐𝑒 𝑟𝑒𝑝𝑜 𝑖𝑛 𝐺𝐵
0.125
𝑁𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛𝑠
10
𝐶𝑜𝑠𝑡𝑡𝑠𝐺𝑅𝑆
$0.095 / GB
∆𝑑𝑖
10,000
𝐶𝑜𝑠𝑡𝑡𝑥
$0.0036 / 100k
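\[
\begin{aligned}
\mathit{Cost}_{\text{monthly mgmt}} = {} & 744 \times 2 \times \$0.16 + (0.25 + 0.125 \times 10) \times \$0.095 + 0.1 \times \$0.0036 \\
= {} & \$238.08 + \$0.1425 + \$0.00036 = \mathbf{\$238.22}
\end{aligned}
\]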
As can be observed from the outcome of the formula, the cost of management for the reference
architecture is mostly dependent on the worker roles running to support it.
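The same calculation can be sketched in code, which makes the dominance of the worker role term easy to see. The function and parameter names are hypothetical. Following the worked example, each of the 10,000 device changes is counted as a single storage transaction; a change that causes two or more table operations would scale the last term accordingly.

```python
HOURS_PER_MONTH = 744


def monthly_management_cost(roles, role_cost_per_hour, partition_repo_gb,
                            device_repo_gb, partitions, storage_cost_per_gb,
                            device_changes, tx_cost_per_100k):
    """Equation 3: worker roles plus table storage plus storage transactions."""
    workers = HOURS_PER_MONTH * roles * role_cost_per_hour
    # One partition repository, plus one device repository per partition.
    storage = (partition_repo_gb + device_repo_gb * partitions) * storage_cost_per_gb
    transactions = device_changes / 100_000 * tx_cost_per_100k
    return workers + storage + transactions


total = monthly_management_cost(roles=2, role_cost_per_hour=0.16,
                                partition_repo_gb=0.25, device_repo_gb=0.125,
                                partitions=10, storage_cost_per_gb=0.095,
                                device_changes=10_000, tx_cost_per_100k=0.0036)
print(f"{total:.2f}")  # 238.22
```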
System processing cost
An IoT system with only the ability to ingest and offload data combined with the ability to send commands
is not complete. This is just the communication interface for connecting devices to a central system.
Although it is not included in this example, in order to complete an IoT system, there is a need to perform
data analysis, either in flight by using an event processing engine, or at rest by using solutions for
machine learning. With a high degree of certainty, you will also need components that take advantage of
key parts of this underlying technology to surface management and control mechanisms to users through
the use of one or more portals, expose the gathered knowledge from machine learning to other parties
through web services, and so on.
Cost estimate calculation
In the previous sections of this paper, we discussed the various components that make up the cost for the
data ingestion and communication platform inside the reference architecture. When we combine these,
we can calculate the total estimated cost for a partition, and extrapolate the total estimated OPEX cost for
the system based on the number of needed partitions using the following formula:
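\[
\mathit{Cost}_{\text{total per month}} = \left( \mathit{Cost}_{\text{ingress}} + \mathit{Cost}_{\text{egress}} \right) N_{\text{partitions}} + \mathit{Cost}_{\text{management}}
\]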
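As a minimal sketch, the extrapolation is a single line of arithmetic. The function name and the input figures below are hypothetical round numbers chosen for readability; they are not taken from the examples in this paper.

```python
def total_monthly_cost(ingress_per_partition, egress_per_partition, partitions, management):
    """Per-partition ingress and egress cost, scaled by the number of partitions,
    plus the management cost, which is counted once."""
    return (ingress_per_partition + egress_per_partition) * partitions + management


# Hypothetical inputs: $1,000 ingress and $100 egress per partition, 10 partitions,
# and $250 for management.
print(total_monthly_cost(1_000.0, 100.0, 10, 250.0))  # 11250.0
```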
Important topics not yet covered
In this paper, we have strived to capture many of our learnings from implementing predictive maintenance
solutions in the Internet of Things (IoT) space. However, in addition to the topics discussed, there is both
much detail to add and more things to think about when architecting for IoT. This final section touches on
some of these topics.
Networks with automatic handover and fallbacks
When we think about IoT scenarios, there seems to be an emerging need for networks working together
in a seamless manner in order to provide frequently roaming users with the ability to perform command
and control to either partially or fully closed IoT systems that they can access. This capability would
require working across vendors and standards to ensure that the right connectivity type is available at the
right time, and at the right price.
The need for the commoditization of devices
Many solutions today use their own proprietary hub for connecting their point solution to the Internet. This
approach needs to change, with vendors selling connectivity bridges that work much like today's home
Internet routers. In fact, such Internet routers could prove to be a great point of integration with
standardized PAN/LAN devices, and support autonomous operations when connectivity is not available.
Ideally, these bridges would support current legacy non-IP PAN device protocols, such as Z-Wave,
traditional ZigBee, and so on.
The creation and use of information marketplaces
As IoT systems evolve, especially those capturing telemetry for intelligent decision making, there is a
clear need for data augmentation to provide context for machine learning. Information marketplaces, such
as Microsoft Azure DataMarket, need to expand their offerings, providing new opportunities for data
providers.
Management solutions
There are standards put forth for managing devices [95], such as OMA Device Management [96] (of which
Microsoft implemented a subset, called Mobile Device Management [97]), CPE WAN Management
Protocol [98], Lightweight M2M [99], and UPnP-DM [100].
[95] Blackberry, A Comparison of Protocols for Device Management and Software Updates
[96] Wikipedia, OMA Device Management
[97] Microsoft, MS-MDM: Mobile Device Management Protocol
[98] Wikipedia, TR-069
[99] Ericsson, “Lightweight M2M”: Enabling Device Management and Applications for the Internet of Things
[100] See Introduction to UPnP Device Management
As millions of devices become part of IoT systems, there is a clear need for IoT solutions that can monitor
and manage incidents in the systems, visualize information and effectively control the environment,
span the various connectivity options, and support legacy systems.
The redefinition of SLAs
Although it is a very hard problem to solve, customers will ask for different types of Service Level
Agreements (SLAs) in this space. Where current SLAs provide a system availability guarantee, this
definition has to evolve to provide concrete answers to questions such as how much bandwidth is
available, what maximum and average latency to expect, how many I/O operations per second
(IOPS [101]) the storage system can provide, and so on. Moving beyond those basic guarantees,
customers will seek answers from SaaS solutions for IoT based simply on the number of devices that
they can support.
Integration simplicity
As IoT promises to extend vertical solutions across horizontal markets, and connect systems in ways
never seen before to add value to businesses and people’s lives, the integration between these systems
and how they are secured needs to happen in a way that standardizes the integration. AMQP provides an
example of this in regard to transport-layer integration.
[101] Wikipedia, IOPS
Conclusions
This paper has gone into great detail about the particulars of building IoT solutions, based on our
experience in working with enterprise customers. As you can see, IoT solutions can be complex but also
offer massive promise for increasing revenue, cutting cost, and finding new business models based on
innovative use of technology. An enterprise might believe that its requirements are so unique that only
a custom IoT solution can meet its needs. But the unusual requirements of IoT solutions in security,
communication, and scale make them complex and expensive to build as custom solutions from the
ground up.
The Microsoft Azure platform, on the other hand, has a comprehensive set of building blocks that you
need to build an IoT solution relatively quickly and painlessly by using the mentioned reference
architecture.
How Microsoft can help you succeed
Microsoft Services can help establish an effective strategy for your Predictive Maintenance scenario and
provide direction, implementation guidance, delivery, and support to help you realize your Internet of
Things strategy. We offer:
▪ Customer value discovery and ideation workshops
▪ Strategy workshops
▪ Implementation guidance
▪ Microsoft Services Subject Matter Expertise, both in your vertical industry and on the topic of general
IoT and Predictive Maintenance.
For more information about Consulting and Support solutions from Microsoft, contact your Microsoft
Services representative or visit www.microsoft.com/services.