12 Data Analytics and Machine Learning in the Cloud and Edge

The value of an IoT system is not a single sensor event, or a million sensor events archived away. A significant amount of the value of IoT is in the interpretation of data and decisions based on that data.

While a world of billions of things connected and communicating with each other and the cloud is well and good, the value lies in what is within the data, what is not in the data, and what the patterns of data tell us. These are the data science and data analytics portions of IoT, and probably the most valuable areas for the customer.

Analytics for the IoT segment deals with:

Structured data (for example, SQL storage): A predictable format of data
Unstructured data (for example, raw video data or signals): A high degree of randomness and variance
Semi-structured (for example, Twitter feeds): Some degree of variance and randomness in form

Data also may need to be interpreted and analyzed in real time as a streaming dataflow, or it may be archived and retrieved for deep analytics in the cloud. This is the data ingest phase. Depending on the use case, the data may need to be correlated with other sources in flight. In other cases, the data is simply logged and dumped to a data lake like a Hadoop database.

Next comes some type of staging, meaning a messaging system like Kafka will route data to a stream processor, or a batch processor, or perhaps both. Stream processing tolerates a continuous stream of data. Processing is typically constrained and very fast, as the data is processed in memory. Therefore, processing must be as fast, or faster, than the rate of data entering the system. While stream processing provides near-real-time processing in the cloud, when we consider industrial machinery and self-driving cars, stream processing does not provide hard real-time operating characteristics.

Batch processing, on the other hand, is efficient in dealing with high-volume data. It is particularly useful when IoT data needs to correlate against historical data.

After this phase, there may be a prediction and response phase where information may be presented on some form of dashboard and logged, or perhaps the system will respond back to the edge device, where corrective actions can be applied to resolve some issue.

This chapter will discuss various data analysis models from complex event processing to machine learning. Several use cases will be taught to help generalize where one model can work and others may fail.

Basic data analytics in IoT

Data analytics intends to find events, usually in a streaming series of data. There are multiple types of events and roles that a real-time streaming analysis machine must provide. The following is a superset of analytic functions based on the work of Srinath Perera and Sriskandarajah Suhothayan (Solution patterns for real-time streaming analytics. Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems (DEBS '15). ACM, New York, NY, USA, 247-255. The following is an enumerated listing of these analytic functions:

Preprocessing: This includes filtering out events of little interest, denaturing, feature extraction, segmentation, transforming data to a more suitable form (although data lakes prefer no immediate transformation), and adding attributes to data such as a tag (data lakes do need tags).

Alerting: Inspect data, and if it exceeds some boundary condition, then raise an alert. The simplest example is when the temperature rises above a set limit on a sensor.

Windowing: A sliding window of events is created that only draws rules upon that window. Windows can be based on time (for example, one hour), or length (2000 sensor samples).
They can be sliding windows (for example, inspect only the 10 latest sensor events and produce a result whenever a new event arises), or batch windows (for example, produce an event only at the end of the window). Windowing is good for rules and for counting events. For instance, you could look for the number of temperature spikes in the last hour and resolve that a defect will occur on some machine.

Joins: These combine multiple data streams into a new single stream. A scenario where this applies is a logistics example, say a shipping company tracks their shipments with asset-tracking beacons and their fleet of trucks, planes, and facilities have geolocation information streaming, as well. There are initially two streams of data: one for the package and one for a given truck. When a truck picks up a package, those two streams become joined.

Errors: Millions of sensors will generate missing data, garbled data, and data that is out of sequence. This is important in IoT cases with multiple streams of asynchronous and independent data. For example, data may be lost in a cellular WAN if a vehicle enters an underground parking garage. This analytic pattern correlates data within its own stream to attempt to find these error conditions.

Databases: The analytics package will need to interact with some data warehouse. For example, if data is streaming in from a number of sensors. An example of this could be Bluetooth asset tags tracking whether an item is stolen or lost. A database of missing tag IDs would be referenced from all the gateways streaming in Bluetooth tag IDs to the system.

Temporal events and patterns: This is most often used with the window pattern mentioned
previously. Here, a series or sequence of events constitutes a pattern of interest. You can think of this as a state machine. Say we are monitoring the health of a machine based on temperature, vibrations, and noise. A temporal event sequence could be as follows:
1. Detect whether the temperature exceeds 100° C.
2. Then detect whether vibrations exceed 1 m/s.
3. Next, detect whether the machine is emitting noise at 110 dB.
4. If those events take place in that sequence, only then raise an alert.

Tracking: Tracking involves when or where something exists or an event has occurred, or when something doesn't exist where it should. A very basic example is geolocation of service trucks where a company may need to know exactly where a truck has been and when it was last there. This has application in agriculture, human movement, tracking patients, tracking high-value assets, luggage systems, smart city garbage, snow removal, and so on.

Trends: This pattern is particularly useful for predictive maintenance. Here, a rule is designed to detect an event based on time-correlated series data. This is similar to temporal events but differs in the sense that temporal events have no notion of time, only sequence order. This model uses time as a dimension in the process. A running history of time-correlated data could be used to find patterns like a livestock sensor does in farming. Here, a head of cattle may wear a sensor that detects the animal movement and temperature. An event sequence can be constructed to see whether the cattle moved in the last day. If there was no movement, the cattle may be sick or dead.

Batch queries: Batch processing typically is more comprehensive and deeper than real-time stream processing. A well-designed streaming platform can fork analysis and call into a batch processing system. We will talk about this later in the form of lambda processing.

Deep analytics pathway: In real-time processing, we make a decision on the fly that some event has occurred. Whether that event really should signal an alarm may require further processing that will not operate in real time. An example is a video surveillance system. Say a smart city issues an amber alert for a lost child. The smart city can issue a simple feature extraction and classification model for the real-time streaming engines. The model would detect license plates for a vehicle the child may be in, or potentially a logo on the child's shirt. The first step would be to image-capture vehicles' license plate numbers or logos on pedestrians and send them to the cloud. The analysis package may identify a plate of interest or a logo out of millions of image samples as a first-level pass. That positively identified frame (and surrounding video frames) would be passed to a deeper analytics package that resolves the image with deeper object recognition algorithms (image fusion, super-resolution, machine learning) to eliminate false positives.

Models and training: The first-level model described previously may, in fact, be an inference engine for a machine learning system. These machine learning tools are built on trained models that can be used for in-flight, real-time analysis.

Signaling: It is often the case that an action needs to propagate back to the edge and sensor. A typical case is factory automation and safety. For example, if the temperature rises beyond a certain limit on a machine, log the event but also send a signal back to the edge device to slow the machine down. The system must be bidirectional in communication.

Control: Finally, we need a way to control these analysis tools. Whether that is starting, stopping, reporting, logging, or debugging, facilities need to be in place to manage this system.

Now, we will concentrate on how to build a cloud-based analytics architecture that must ingest unpredictable and unstoppable streams of data and deliver interpretations of that data as close to real time as possible.

Top-level cloud pipeline

The following diagram is a typical flow of data from a sensor to a dashboard. Data will transit through several mediums (WPAN links, broadband, cloud storage in the form of a data lake, and so on). When we consider the following architectures to build a cloud analytics solution, we have to consider the effects of scaling. Choices made early in the design that are suitable to 10 IoT nodes and a single cloud cluster may not scale effectively when the number of endpoint IoT devices grows to the thousands and are based in multiple geographies:

A screenshot of a cell phone Description automatically generated

Figure 1: A typical IoT pipeline from sensor to cloud

The analytics (predict-respond) portion of the cloud can take on several forms:

Rules engines: These simply define an action and produce an outcome.
Stream processing: These are where events like sensor readings are injected into the stream processor. The processing path is a graph where nodes in the graph represent operators and send events to other operators. The nodes contain the code for that portion of the processing and a path to connect to the next node in the graph. This graph can be replicated and executed in parallel on a cluster so it is amenable to scaling up to hundreds of machines.
Complex event processing: This is based on queries like SQL and written in that higher-level language. It is based on event processing and tuned for low latency.
Lambda architecture: This model attempts to balance throughput and latency by performing batch processing and stream processing in parallel on massive data sets.

The reason we talk about real-time analytics is that the data is streaming nonstop from millions of nodes simultaneously and asynchronously with various errors, format issues, and timings. New York City has 250,000 street lights (http://www.nyc.gov/html/dot/html/infrastructure/streetlights.shtml). Say each light is smart, meaning it monitors whether there is movement nearby, and if so, it brightens the light; otherwise, it remains dimmed to save power (2 bytes). Each light may also check whether there is a problem with the light that needs maintenance (1 byte). Additionally, each light is monitoring temperature (1 byte) and humidity (1 byte) to help generate microclimate weather predictions. Finally, the data also contains the light ID and a timestamp (8 bytes). The aggregate of all the lights nominally produces 250,000 messages a second and can peak at 325,000 due to periods of rush hour, crowds, tourist sites, holidays, and so on. All in all, say our cloud service can process 250,000 messages per second; that implies a backlog of up to 75,000 events/second. If rush hour is truly one hour, then we backlog 270,000,000 events/hour. Only if we provide more processing in the cluster or reduce the incoming stream will the system ever catch up. If the incoming stream drops to 200,000 messages/second during a quiet time, the cloud cluster will take 1.1 hours to resolve and consume 585 MB of memory (270 million backlogged messages at 13 bytes per message). Typically, you will have an autoscaling cloud backend to grow with demand and the length of the message queue.

To formalize the process and anticipate the demands you will place on a cloud backend, the following equations can help model the capacity:

Where:

R_Event = Event Rate

T_Burst = Time of Burst Events

T_c = Time to Complete Backlog

M_Backlog = Message Backlog (size)

M_size = Message Size

Rules engines

A rules engine is simply a software construct that executes actions on events. For example, if the humidity in a room exceeds 50%, send an SMS message to the owner. These are also called business rule management systems (BRMSs).

Rules engines may or may not have state and be called stateful. That is, they may have a history of the event and take different actions depending on the order, the amount, or the patterns of events as they occurred historically. Alternatively, they may not maintain state and only inspect the current event (stateless):

Figure 2: Simple rules engine example

In our example of a rules engine, we will look at Drools. It is a BRMS developed by Red Hat and licensed under the Apache 2.0 license. JBoss Enterprise is a production version of the software. All objects of interest reside in the Drools working memory. Think of the working memory as the set of IoT sensor events of interest to compare to satisfy a given rule. Drools can support two forms of chaining: forward and backward. Chaining is a method of inference taken from game theory.

Forward chaining takes in available data until a rule chain is satisfied. For example, a rule chain may be a series of if/then clauses, as shown in the preceding diagram. Forward chaining will continuously search to satisfy one of the if/then paths to infer from an action. Backward chaining is the converse. Rather than starting with the data to be inferred from, we start with the action and work backward. The following pseudocode demonstrates a simple rules engine:

Smoke Sensor = Smoke Detected Heat Sensor = Heat Detected

if (Smoke_Sensor == Smoke_Detected) && (Heat_Sensor == Heat_Detected) then Fire
if (Smoke_Sensor == !Smoke_Detected) && (Heat_Sensor == Heat_Detected) then Furnace_On
if (Smoke_Sensor == Smoke_Detected) && (Heat_Sensor == !Heat_Detected) then Smoking
if (Fire) then Alarm
if (Furnace_On) then Log_Temperature
if (Smoking) then SMS_No_Smoking_Allowed

Let us assume that:

Smoke_Sensor: Off
Heat_Sensor: On

Forward chaining would resolve the antecedent of the second clause and infer that temperatures are being logged.

Backward chaining tries to prove that the furnace is on and works backward in a series of steps:

Can we prove the temperatures are being logged? Take a look at this code:
```
if (Furnace_On) then Log_Temperature
```

Since the temperatures are being logged, the antecedent (Furnace_On) becomes the new goal:

if (Smoke_Sensor == !Smoke_Detected) && (Heat_Sensor == Heat_Detected) then Furnace_On

Since the furnace is proven to be on, the new antecedent comes in two parts: Smoke_Sensor and Heat_Sensor. The rules engine now breaks it up into two goals:
```
Smoke_Sensor off
Heat_Sensor on
```

The rules engine now attempts to satisfy both the subgoals. Upon doing so, the inference is complete.

Forward chaining has the advantage of responding to new data as it arrives, which can trigger new inferences.

Drools' semantic language is intentionally simple. Drools is composed of the following basic elements:

Sessions, which define the default rules
Entry points, which define the rules to use
when statements, the conditional clause
then statements, the action to take

A basic Drools rule is shown in the following pseudocode. The insert operation places a modification in the working memory. You normally make a change to working memory when a rule evaluates to true.

rule "Furnace_On" when
Smoke_Sensor(value > 0) && Heat_Sensor(value > 0) then
insert(Furnace_On()) end

After all the rules in Drools execute, the program can query working memory to see which rules evaluated to true using syntax like the following:

query "Check_Furnace_On"
$result: Furnace_On() end

A rule has two patterns:

Syntactic: It has the format of data, parity, hash, and range of values.

Semantic: Values must belong to a set in a list; the count of high-temperature values must not exceed 20 in 1 hour. Essentially, these are meaningful events.

Drools supports the creation of very complex and elaborate rules to the point that a database of rules may be needed to store them. The semantics of the language allows for patterns, range evaluation, salience, times when a rule is in effect, type matching, and work on collections of objects.

Ingestion – streaming, processing, and data lakes

An IoT device is usually associated with some sensor or a device whose purpose is to measure or monitor the physical world. It does so asynchronously with respect to the rest of the IoT technology stack. That is, a sensor is always attempting to broadcast data, whether or not a cloud or fog node is listening. This is important, because the value of a corporation is in the data.

Even if most of the data produced is redundant, there is always the opportunity that a significant event can occur. This is the data stream.

The IoT stream from a sensor to a cloud is assumed to be:

Constant and never-ending
Asynchronous
Unstructured, or structured
As close to real time as possible

We discussed the cloud latency problem earlier in Chapter 11, Cloud and Fog Topologies. We also learned about the need for fog computing to help resolve the latency issue, but even without fog computing nodes, efforts are taken to optimize the cloud architecture to support IoT real-time needs. To do this, clouds need to maintain a flow of data and keep it moving. Essentially, data moving from one service to another in the cloud must do so as a pipeline, without the need to poll for data. The alternative form of processing data is called batch processing. Most hardware architectures treat data flow the same way, moving data from one block to another, and the process of data arrival triggers the next function.

Additionally, careful use of storage and filesystem access is critical to reducing overall latency.

For this reason, most streaming frameworks will support in-memory operations and avoid the cost of temporary storage to a mass filesystem altogether. Michael Stonebraker called out the importance of data streaming in this fashion. (See Michael Stonebraker, Ugur Çetintemel, and Stan Zdonik. 2005. The 8 Requirements of Real-time Stream Processing." SIGMOD Rec. 34, 4, December 2005, 42-47.) A well-designed message queue assists with this pattern. To build a successful architecture in a cloud that scales from hundreds of nodes to millions needs consideration.

The data stream will also not be perfect. With hundreds to thousands of sensors streaming asynchronous data, more often than not, data will be missing (sensor lost communication), data will be poorly formed (error in transmission), or data will be out of sequence (data may flow to the cloud from multiple paths). At a minimum, a streaming system must:

Scale with event growth and spikes
Provide a publish/subscribe API to interface
Approach near-real-time latency
Provide scaling of processing of rules
Support data lakes and data warehousing

Apache provides several open source software projects (under the Apache 2 license) that assist with building a stream processing architecture. Apache Spark is a stream processing framework that processes data in small batches. It is particularly useful when memory size is constrained on a cluster in the cloud (for example, < 1TB). Spark is built on in-memory processing, which has the advantages of reducing filesystem dependency and latency, as mentioned previously. The other advantage of working on batch data is that it is particularly useful when dealing with machine learning models, which will be covered later in this chapter. Several models, such as convolutional neural networks (CNNs), can work on data in batches. An alternative from Apache is Storm. Storm attempts to process data as close to real time as possible in a cloud architecture. It has a low-level API versus Spark and processes data as large events rather than dividing them up into batches. This has the effect of being low latency (sub-second performance).

To feed the stream processing frameworks, we can use Apache Kafka or Flume. Apache Kafka is an MQTT on the ingest from various IoT sensors and clients, and it connects to Spark or Storm on the outbound side. MQTT doesn't buffer data. If thousands of clients are communicating to the cloud over MQTT, some system will be needed to react to an incoming stream and provide the buffering needed. This allows Kafka to scale on demand (another important cloud attribute) and can react well to spikes in events. A stream of 100,000 events per second can be supported with Kafka. Flume, on the other hand, is a distributed system to collect, aggregate, and move data from one source to another, and it is slightly easier to use out of the box. It is also tightly integrated with Hadoop. Flume is slightly less scalable than Kafka, since adding more consumers means changing the Flume architecture. Both investors could stream in memory without ever storing it. Generally, however, we don't want to do that; we want to take the raw sensor data and store it in as raw a form as possible with all the other sensors streaming in simultaneously.

When we think of IoT deployments in the thousands or millions of sensors and end nodes, a cloud environment may make use of a data lake. A data lake is essentially a massive storage facility holding raw unfiltered data from many sources. Data lakes are flat filesystems. A typical filesystem will be organized hierarchically, with volume, directories, files, and folders in a basic sense. A data lake organizes elements in its storage by attaching metadata element (tags) to each entry. The classic data lake model is Apache Hadoop, and nearly all cloud providers use some form of data lake underneath their services.

Data lake storage is particularly useful in IoT, as it will store any form of data whether it is structured or unstructured. A data lake also assumes that all data is valuable and will be kept permanently. This bulk persistent mass of data is optimal for data analytics engines. Many of those algorithms function better based on how much data they are fed, or how much data is used to train their models.

A conceptual architecture using traditional batch processing and stream processing is illustrated in the following diagram. In the architecture, the data lake is fed by a Kafka instance. Kafka could provide the interface to Spark in batches and send data to a data warehouse.

There are several ways to reconfigure the topology in the following diagram, as the connectors between components are standardized:

Figure 3: Basic diagram of cloud ingestion engine to a data warehouse. Spark acts as the stream channel service.

Complex event processing

Complex event processing (CEP) is another analytics engine that is often used for pattern detection. From its roots in discrete event simulation and stock market volatility trading in the 1990s, it is by nature a method capable of analyzing a live feed of streaming data in near real time. As hundreds and thousands of events enter the system, they are reduced and distilled into higher-level events. These are more abstract than raw sensor data. CEP engines have the advantage of a fast turnaround time in real-time analysis over a stream processor. A stream processor can resolve an event in the millisecond time frame. The downside is a CEP engine doesn't have the same level of redundancy, or dynamic scaling, as Apache Spark.

CEP systems use SQL-like queries, but rather than using a database backend, they search an incoming stream for the pattern or rule you suggest. A CEP system consists of the tuple: discrete data element with a timestamp. A CEP system makes use of the different analytics patterns described at the beginning of this chapter and works well with a sliding window of events. Since it is SQL-like in semantics, and it is designed to be appreciably faster than a regular database query, all the rules and data reside in memory (usually a multi-GB database). Additionally, they need to be fed from a modern stream messaging system such as Kafka.

CEP has operations like sliding windows, joins, and sequence detection. Additionally, CEP engines can be based on forwarding or backward chaining as rules engines are. An industry-standard CEP system is the Apache WSO2 CEP. WSO2 coupled with Apache Storm can process over 1 million events per second, with no storage events needed. WSO2 is a CEP system using an SQL language but can be scripted in JavaScript and Scala. The additional benefit is that it can be extended with a package called Siddhi to enable services such as:

Geolocation
Natural language processing
Machine learning
Time series correlation and regression
Mathematical operations
String and RegEx

Streams of data can be queried as in the following Siddhi QL code:

define stream SensorStream (time int, temperature single); @name('Filter Query')
from SensorStream[temperature &gt; 98.6' select *
insert into FeverStream;

This all operates as discrete events allowing for sophisticated rules to be applied to millions of events transpiring simultaneously.

Now that we have described CEP, it is time that you understood where a CEP engine and a rules engine should be used. If the evaluation is a simple state, such as two ranges of temperatures, then the system is stateless, and a simple rules engine should be used. If the system maintains a temporal notion or a series of states, then a CEP engine should be used.

Lambda architecture

A lambda architecture attempts to balance latency with throughput. Essentially, it mixes batch processing with stream processing. Similar to the general cloud topology of OpenStack or other cloud frameworks, lambda ingests and stores to an immutable data repository. There are three layers of the topology:

Batch layer: The batch layer is usually based on Hadoop clusters. The batch layer is significantly slower in processing than the stream layer. By sacrificing latency, it maximizes throughput and accuracy.

Speed layer: This is the real-time in-memory data stream. The data can be erroneous, missing, and out of order. Apache Spark, as we have seen, is very good at providing a stream processing engine.

Service layer: The service layer is where the recombination of batch and stream results are stored, analyzed, and visualized. Typical components of the service layer are Druid, which provides facilities for combining batch and speed layers; Apache Cassandra for scalable database management; and Apache Hive for data warehousing.

A close up of a device Description automatically generated

Figure 4: Complexities of a Lambda architecture. Here, a batch layer migrates data to the HDFS storage, and the speed layer is delivered directly to a real-time analysis package via Spark.

Lambda architectures are, by nature, more complex than the other analytics engines. They are hybrid and add additional complexity and resources to run successfully.

Sector use cases

We will now try to consider the typical use cases in a variety of industries adopting IoT and cloud analytics. When architecting the solution, we need to consider the scale, the bandwidth, real-time needs, and types of data to derive the correct cloud architecture, as well as the correct analytics architecture.

These are generalized examples – it is imperative to understand the entire flow and future scale/capacity when drawing a similar table:

Industry

Use cases

Cloud services

Typical bandwidth

Real time

Analytics

Manufacturing

Operational technology

Brownﬁeld

Asset tracking

Factory automation

Dashboards

Bulk storage

Data lakes

SDN

Low latency

500 GB/day/factory part produced

2 TB/minute mining operations

Less than 1s

RNN

Bayesian networks

Logistics and transport

Geolocation tracking

Asset tracking

Equipment sensing

Dashboards

Logging

Storage

Vehicles: 4 TB/day/vehicle (50 sensors)

Aircraft: 2.5 to 10 TB/day (6000 sensors)

Assets tracking: 1 MB/day/beacon

Less than 1s (real time)

Daily (batch)

Rules engines

Healthcare

Asset tracking

Patient tracking

Home health monitoring

Wireless health equipment

Reliability and HIPPA

Private cloud option

Storage and archival

Load balancing

1 MB/day/sensor

Less than 1s: life critical

Non-life critical: on each change

RNN

Decision trees

Rules engines

Agriculture

Livestock health and location tracking

Soil chemistry analysis

Bulk storage – archiving

Cloud-to-cloud provisioning

512 KB/day/livestock head

1000 to 10000 head of cattle per feedlot

1 second (real time)

10 minutes (batch)

Rules engines

Energy

Smart meters

Remote energy monitoring (solar, natural gas, oil)

Failure prediction

Dashboards

Data lakes

Bulk storage for historical rate prediction

SDN

Low latency

100-200 GB/day/wind turbine

1 to 2 TB/day/oil rig

100 MB/day/smart meter

Less than 1s: energy production

1 minute: smart meters

RNN

Bayesian networks

Rules engines

Consumer

Real-time health logging

Presence detection

Lighting and heating/AC

Security

Connected home

Dashboards

PaaS

Load balancing

Bulk storage

Security camera: 500 GB/day/camera

Smart device: 1-1000 KB/day/sensor-device

Smart home: 100 MB/day/home

Video: less than 1s

Smart home: 1s

CNN (image sensing)

Rules engines

Retail

Cold chain sensing

POS machines

Security systems

Beaconing

SDN

Micro-segmentation

Dashboards

Security: 500 GB/day/camera

General: 1-1000 MB/day/device

POS and credit transaction: 100ms

Beaconing: 1s

Rules engines

CNN for security

Smart city

Smart parking

Smart trash pickup

Environmental sensors

Dashboards

Data lakes

Cloud-to-cloud services

Energy monitors: 2.5 GB/day/city (70K sensors)

Parking spots: 300 MB/day (80,000 sensors)

Waste monitors: 350 MB/day (200,000 sensors)

Noise monitors: 650 MB/day (30,000 sensors)

Electric meters: 1 minute

Temperature: 15 minutes

Noise: 1 minute

Waste: 10 minutes

Parking spots: every change

Rules engine

Decision trees

Machine learning in IoT

Machine learning is not a new computer science development. On the contrary, mathematical models for data fitting and probability go back to the early 1800s, and Bayes' theorem and the least squares method of fitting data. Both are still widely used in machine learning models today, and we will briefly explore them later in the chapter.

A brief history of AI and machine learning milestones

It wasn't until Marvin Minsky (MIT) produced the first neural network devices called perceptrons in the early 1950s that computing machines and learning were unified. He later wrote a paper in 1969 that was interpreted as a critique of the limitations of neural networks. Certainly, during that period, computational horsepower was at a premium. The mathematics were beyond the reasonable resources of IBM S/360 and CDC computers. As we will see, the 1960s introduced much of the mathematics and foundations of artificial intelligence in areas such as neural nets, support vector machines, fuzzy logic, and so on.

Evolutionary computation such as genetic algorithms and swarm intelligence became a research focus in the late 1960s and 1970s, with work from Ingo Rechenberg, Evolutionsstrategie (1973). It gained some traction in solving complex engineering problems. Genetic algorithms are still used today in mechanical engineering, and even automatic software design.

The mid-1960s also introduced the concept of hidden Markov models as a form of probabilistic AI, like Bayesian models. It had been applied to research in gesture recognition and bioinformatics.

Artificial intelligence research lulled with government funding drying up until the 1980s and the advent of logic systems. This started the field of AI known as logic-based AI and supporting programming languages called Prolog and LISP, which allowed programmers to easily describe symbolic expressions. Researchers found limitations with this approach to AI: principally logic-based semantics didn't think like a human. Attempts at using anti-logic or scruffy models to try to describe objects didn't work well either. Essentially, one cannot describe an object precisely using loosely coupled concepts. Later in the 1980s, expert systems took root. Expert systems are another form of logic-based systems for a well-defined problem trained by experts in that particular domain. One could think of them as a rule-based engine for a control system. Expert systems proved successful in corporate and business settings and became the first commercially available AI systems sold. New industries started to form around expert systems. These types of AI grew, and IBM used the concept to build Deep Thought to defeat chess grandmaster Garry Kasparov in 1997.

Fuzzy logic first manifested itself in research by Lotfi A. Zadeh at UC Berkeley in 1965, but it wasn't until 1985 that researchers at Hitachi demonstrated how fuzzy logic could be applied successfully to control systems. That sparked significant interest in Japanese automotive and electronics firms to adopt fuzzy systems into actual products. Fuzzy logic has been used successfully in control systems, and we will discuss it formally later in this chapter.

While expert systems and fuzzy logic seemed to be the mainstay for AI, there was a growing and noticeable gap between what it could do and what it would never be able to do. Researchers in the early 1990s saw that expert systems, or logic-based systems, in general, could never emulate the mind. The 1990s brought on the advent of statistical AI in the form of hidden Markov models and Bayesian networks. Essentially, computer science adopted models commonly used in economics, trade, and operations research to make decisions.

Support vector machines were first proposed by Vladimir N. Vapnik and Alexey Chervonenkis in 1963, but became popular after the AI winter of the 1970s and early 1980s. Support vector machines (SVMs) became the foundation for linear and nonlinear classification by using a novel technique to find the best hyperplanes to categorize data sets. This technique became popular with handwriting analysis. Soon, this evolved into uses for neural networks.

Recurrent neural networks (RNNs) also became a topic of interest in the 1990s. This type of network was unique and different from deep learning neural networks such as convolutional neural networks, because it maintained state and could be applied to a problem involving the notion of time, such as audio and speech recognition. RNNs have a direct impact on IoT predictive models today, which we will discuss later in this chapter.

A seminal event occurred in 2012 in the field of image recognition. In a competition, teams around the globe competed on a computer science task of recognizing the object in a 50-pixel by 30-pixel thumbnail. Once the object was labeled, the next task was to draw a box around it. The task was to do this for 1 million images. A team from the University of Toronto built the first deep convolutional neural network to process images to win this competition. Other neural networks had attempted this machine vision exercise in the past, but the team developed an approach that identified images with more accuracy than any approach before, with an error rate of 16.4%. Google developed another neural net that brought the error rate down to 6.4%. It was also around this time that Alex Krizhevsky developed AlexNet, which introduced GPUs to the equation to greatly speed up training. All these models were built around convolutional neural networks and had processing requirements that were prohibitive until the advent of GPUs.

Today, we find AI everywhere, from self-driving cars, to speech recognition in Siri, to tools emulating humans in online customer service, to medical imaging, to retailers using machine learning models to identify consumer interest in shopping and fashion as they move about a store:

Figure 5: The spectrum of artiﬁcial intelligence algorithms

What does this have to do with IoT at all? Well, IoT opens up the spigot to a massive amount of constantly streaming data. The value of a system of sensors is not what one sensor measures, but what a collection of sensors measure and tell us about a much larger system. IoT, as mentioned earlier, will be the catalyst to generate a step function in the amount of data collected. Some of that data will be structured: time-correlated series. Other data will be unstructured: cameras, synthetic sensors, audio, and analog signals. The customer wants to create useful decisions for their business based on that data—for example, in a manufacturing plant that is planning to optimize operational expenses and potentially capital expenses by adopting IoT and machine learning (at least, that's what they were sold on). When we think about a factory IoT use case, the manufacturer will have many interdependent systems. They may have some assembly tool to produce a widget, a robot to cut parts out of metal or plastic, another machine to perform some type of injection molding, conveyor belts, lighting and heating systems, packaging machines, supply and inventory control systems, robots to move material around, and various levels of controls systems. In fact, this company may have many of these spaces spread across a campus or geography. A factory like this has adopted all the traditional models of efficiency, and its managers have read W. Edwards Deming's literature; however, the next industrial revolution will come in the form of IoT and machine intelligence.

Specialized individuals know what to do when an erratic event occurs. For example, a technician who has been operating one of the assembly machines for years knows when the machine needs service based on how that machine is behaving. It may start creaking in a certain way. Perhaps it's worn out its ability to pick and place parts and dropped a few in the last couple of days. These simple behavioral effects are things that machine learning can see and predict even before a human can. Sensors can surround such devices and monitor actions both perceived and inferred. An entire factory could be perceived in such a case to understand how that factory is performing at that very instant based on a collection of millions or billions of events from every machine and every worker in that system.

With that amount of data, only a machine learning appliance can sift through the noise and find what is relevant. These are not human-manageable problems, but the manageable problem of big data and machine learning.

Machine learning models

We will now focus on specific machine learning models that have applicability to IoT. There is no single winner for you to choose to sift through a collection of data. Each model has a particular strength and use cases it serves best. The goal of any machine learning tool is to arrive at a prediction or inference of what a set of data is telling you. You want to be better than the 50% outcome of flipping a coin.

There are two types of learning systems to consider, which are as follows:

Supervised learning: It simply implies that the training data provided to the model has an associated label with each entry. For example, a set may be a collection of pictures each labeled with the content of that image: for example, cat, dog, banana, car. Many machine learning models today are supervised. Supervised learning involves a human (or groups of humans). It can be a lengthy process to train a model to a high degree of accuracy.
Supervised learning allows for classification and regression problems to be solved. We will discuss classification and regression later in this chapter.

Unsupervised learning: It has no label for the training data. Obviously, this type of learning cannot resolve an image of a dog to the label dog. This type of learning model uses mathematical rules to reduce redundancy. A typical use case is to find clusters of similar things.

There also exists a hybrid of both models, called semi-supervised learning, which mixes labeled data and unlabeled data. The goal is to force the machine learning model to organize data as well as make inferences.

The three fundamental uses of machine learning are:

Classification
Regression
Anomaly detection

There are dozens of machine learning and AI constructs that could be talked about with application to IoT, but that would extend far beyond the scope of this book. We will concentrate on a small set of models to understand where they fit in relation to each other, what they target, and what their strengths are. We want to explore the uses and limitations of statistical, probabilistic, and deep learning, as they are the prevalent areas applicable for IoT artificial intelligence.

Within each of these large segments, we will generalize and dive into the following:

Random forests: Statistical models (fast models, good for systems with many attributes needed for anomaly detection)
Bayesian networks: Probabilistic models
Convolutional neural networks (CNNs): Deep learning model for unstructured image data
Recurrent neural networks (RNNs): Deep learning model for time series analysis

Some models are not applicable anymore in the artificial intelligence space, at least for the IoT use cases we consider. So, we will not focus on logic-based models, genetic algorithms, or fuzzy logic.

We will first talk through some initial nomenclature around classifiers and regression.

Classification

Classification is a form of supervised learning where the data is used to pick a name, value, or category—for example, using a neural network to scan images to find pictures of a shoe. In this field, there are two variants of classification:

Binary or binomial classification: When you are picking between one of two groups or categories. For example: coffee versus tea, black versus white.

Multiclass classification: When there are more than two groups or categories. For example, classifying fruit may include the set of oranges, bananas, grapes, and apples. A fruit may be an apple or a banana but not both.

We use the Stanford linear classifier tool to help understand the concept of hyperplanes (http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/). The following diagram shows a trained learning system's attempts to find the best hyperplane to divide colored balls. We can see that after several thousand iterations, the division is somewhat optimal, but there still are issues with the top-right region, where the corresponding hyperplane includes a ball that belongs to the top hyperplane. Shown below is an example of a less than optimal classification.

Here, hyperplanes are used to create artificial segments. The top right shows a single ball that should be classified with the other two balls at the top but was classified to belong to the bottom-right set. The top left also shows a red ball that should be classified with the red balls to the top right, but the hyperplane is incorrectly forcing it to the green cluster.

Figure 6: Less than optimal classiﬁcation

Notice in the preceding example from Stanford that the hyperplane is a straight line. This is called a linear classifier, and it includes such constructs as support vector machines (which attempt to maximize the linearity) and logistic regression (which can be used for binomial class and multiclass fitting). The following graph shows a binomial linear classification of two data sets: circles and diamonds.

Here, a line attempts to form a hyperplane to divide two distinct regions of variables. Notice the best linear relationship does include errors:

Figure 7: Linear classiﬁer

Nonlinear relationships are also common in machine learning when using a linear model would cause severe error rates. One issue with the nonlinear model is the tendency to overfit the test series. As we will see later, this has the propensity to make the machine learning tool accurate when executed on the training test data, but useless in the field. The following figure is a comparison of a linear versus nonlinear classifier:

Figure 8: Here, an nth order polynomial curve attempts to build a much more precise model of the set of data points. Highly precise models tend to ﬁt a known training set well but have failed when presented with real-world data.

Regression

Classification is concerned with predicting a discrete value (circle or diamonds), whereas regression models are used to predict a continuous value. For example, regression analysis would be used to predict the average selling price for your home based on the selling prices of all the homes in your neighborhood and surrounding neighborhoods.

Several techniques exist to form regression analysis: the least squares method, linear regression, and logistic regression.

The least squares method is the most frequently used method of standard regression and data fitting. Simply put, the method minimizes the sum of the square of all the errors in a set of data. For example, in a two-dimensional x,y plot, the curve fitting of a series of points attempts to minimize the error of all the points on the graph:

A screenshot of a computer Description automatically generated

Figure 9: Linear regression method. Here we attempt to reduce the error in a curve fitting equation by squaring and summing each error value.

Least squares methods are subject to outlier data that may skew the results incorrectly. It is recommended to scrub the data of outliers. In edge and IoT use cases, this can and most likely should be performed close to the sensors to avoid moving erroneous data in the first place.

Linear regression is a very common method used in data science and statistical analysis where the relationship between two variables is modeled by fitting a linear equation to them. One variable assumes the role of an explanatory variable (also known as an independent variable), and the other is a dependent variable. Linear regression attempts to find the best-fitting straight line through a set of points. The best-fitting straight line is called the regression line. To compute the regression line a simple slope equation can be used: y=mx + b.

However, we can use a statistical approach where M_x is the mean value of the x variable, M_y is the mean of the y variable, S_x is the standard deviation of x, S_y is the standard deviation of y, and r is the correlation between x and y. The slope then becomes:

The intercept (A) becomes:

Logistic regression, also known as a sigmoid function, is a form of linear algebra used to model the probability of a class or an event. For example, you could model the probability of a turbine failing depending on the heat surrounding the engine. We are essentially modeling the probability that an input X belongs to class Y=1. Written another way:

The probability represented here is binary (0 or 1). This is a key difference from linear regression. The value b₀ represents the intercept, and b₁ represents the coefficient to be learned. These coefficients must be found through estimation and minimization methods. The best coefficients for logistic regression would have a value close to 1 for the default class and 0 for all other classes.

Applying this as a probabilistic function:

An example of a logistic regression is to predict whether a cold storage refrigerator can keep food frozen based on outside air temperature. Assume we used a minimization estimation procedure to calculate the b coefficients and found b₀ to be -10 and b₁ to be 0.5. If the outside temperature is 24 degrees Celsius, the result will be 99.5%. Plugging the values into the equation:

Random forest

A random forest is a subset of another machine learning model called the decision tree. A decision tree, as the diagram at the start of this section shows, is a group of learning algorithms that are part of the statistical set. A decision tree simply takes several variables into consideration and produces a single output that classifies the elements. Each element evaluated is called the set. The decision tree produces a set of probabilities that a path has taken based on the input. One form of a decision tree is the Classification and Regression Tree (CART), developed by Leo Breiman in 1983.

We now introduce the notion of bootstrap aggregating or bagging. When you have a single decision tree being trained, it is susceptible to noise injected into it and can form a bias. If, on the other hand, you have many decision trees being trained, we can reduce the chance of biasing the result. Each tree will pick a random set of the training data or samples.

The output of the random forest training processes a decision tree based on a random selection of the training data and a random selection of variables:

Figure 10: Random forest model. Here, two forests are constructed to pick a random set, but not the whole set, of variables.

Random forests extend bagging by not only selecting a random sample set, but also a subset of the number of features being qualified. This can be seen in the preceding image. This is counterintuitive, since you want to train on as much data as possible. The rationale is that:

Most trees are accurate and provide correct predictions for most data
Errors in the decision tree may happen at different places, in different trees

This is the rule of group think, also called majority decisions. If the outcomes of several trees agree with each other even though they arrived at that decision through different paths and a single tree is an outlier, one will naturally side with the majority. This creates a model with low variance, compared to a single decision tree model, which can be extremely biased. We can see the following example with four trees in a random forest. Each has been trained on a different subset of data, and each has chosen random variables. The result of the flow is that three of the trees produce a result of 9, while the fourth tree produces a different result.

Regardless of what the fourth tree produced, the majority agreed by a different data set, different variable, and different tree structures that the result of the logic should be a 9:

Figure 11: Majority decision of a random forest. Here, several trees based on a random collection of variables arrive at 9 as a decision. Arriving at a similar answer based on different input generally reinforces the model.

Bayesian models

Bayesian models are based on Bayes' theorem from 1812. Bayes' theorem describes the probability that an event will occur based on prior knowledge of the system. For example, what is the probability that a machine will fail based on the temperature of the device?

Bayes' theorem is expressed as:

A and B are the events of interest. P(A|B) asks, what is the probability that event A will occur, given event B has occurred? They have no relation to each other and are mutually exclusive.

The equation can be rewritten using the theorem of total probability, which replaces P(B). We can also extend this to i number of events. P(B|A) is the probability that event B will occur, given event A has occurred. This is the formal definition of Bayes' theorem:

In this case, we are dealing with a single probability and its complement (pass/fail). The equation can be rewritten as:

An example follows. Supposing we have two machines producing identical parts for a widget. Say a machine can fail if its temperature exceeds a certain value. Machine A will fail 2% of the time if the temperature exceeds a certain temperature. Machine B will fail 4% of the time if it exceeds a certain temperature. Machine A produces 70% of the parts, and machine B produces the remaining 30%. If I pick up a random part and it fails, what is the probability it was produced by machine A and what is the probability it was produced by machine B?

In this case, A is an item produced by machine A, and B is an item produced by machine B. F represents the failed chosen part. We know:

P(A) = 0.7
P(B) = 0.3
P(F|A) = 0.02
P(F|B) = 0.04

Therefore, the probability you pick a defective part from machine A or B is:

Replacing the values:

Therefore, P(A | F) = 53%, and P(B | F) is the complement (1 - 0.53) = 47%.

A Bayesian network is an extension of Bayes' theorem in the form of a graphical probability model, specifically a directed acyclic graph. Notice the graph flows one way, and there are no loopbacks to previous states; this is a requirement of the Bayesian network:

A picture containing electronics Description automatically generated

Figure 12: Bayesian network model

Here, the various probabilities of each state come from expert knowledge, historical data, logs, trends, or combinations of these. This is the training process for a Bayesian network. These rules can be applied to a learning model in an IoT environment. As sensor data streams in, the model could predict machine failures. Additionally, the model could be used to make inferences. For example, if the sensors are reading an overheating condition, one could infer that there is a probability it may be related to the speed of the machine, or an obstruction.

There are variants of Bayesian networks that go beyond the scope of this book, but have benefits for certain types of data and problem sets:

Naive Bayes
Gaussian naive Bayes
Bayesian belief networks

A Bayesian network is good for environments in IoT that can't be completely observed. Additionally, in a situation where the data is unreliable, Bayesian networks have an advantage. Poor sample data, noisy data, and missing data have less of an effect on Bayesian networks than other forms of predictive analytics. The caveat is that the number of samples will need to be very large. Bayesian methods also avoid the overfitting problem, which we will discuss later when we look at neural networks. Additionally, Bayesian models fit well with streaming data, which is a typical use case in IoT. Bayesian networks have been deployed to find aberrations in signals and time-correlated series from sensors and also to find and filter malicious packets in networking.

Convolutional neural networks

A CNN is a form of artificial neural network in machine learning. We will first inspect the CNN and then move on to the RNN. A CNN has proven to be very reliable and accurate at image classification, and is used in IoT deployments for visual recognition, especially in security systems. It is a good starting point to understand the process and mathematics behind any artificial neural network. Any data that can be represented as a fixed bitmap (say, a 1024 x 768-pixel image in three planes). A CNN attempts to classify an image to a label (for example, cat, dog, fish, bird) based on an additive set of decomposable features. The primitive features that compose image content are built up from small sets of horizontal lines, vertical lines, curves, shades, gradient directions, and so on.

First layer and filters

This basic set of features in the first layer of a CNN would be feature identifiers such as small curves, tiny lines, color splotches, or small distinguishing features (in the case of image classifiers). The filters will convolve around the image looking for similarities. The convolution algorithm will take the filter and multiply-sum the resultant matrix values. Filters activate when a specific feature results in a high-activation value:

Figure 13: The first layer of a CNN. Here, large primitives are used to pattern match input.

Max pooling and subsampling

The next layer will typically be a pooling or max pooling layer. This layer will take as input all the values derived from the last layer. It then returns the maximum value for a set of neighbor neurons, which is used as input to a single neuron in the next convolution layer. This is essentially a form of subsampling. Typically, the pooling layer will be a 2x2 subregion matrix as the result:

Figure 14: Max pooling. Attempts to ﬁnd the maximum value in a sliding window across an image.

Pooling has several options: maximizing (as shown in the preceding diagram), averaging, and other sophisticated methods. The purpose of max pooling is to state that a particular feature was found within a region of the image. We don't need to know the exact position, just a general locality. This layer also recurs the dimensions we have to deal with, which ultimately affect the neural network performance, memory, and CPU usage. Max pooling also controls overfitting. Researchers have learned that if the neural network becomes finely tuned to images without this type of subsampling, it will work well on the training set of data it was programmed with but will fail miserably with real-world images.

The fundamental deep learning model

The second convolutional layers use the results of the first layer as input. Remember, the input to the first layer was the original bitmap. The output of the first layer actually represents the location in the 2D bitmap where a specific primitive was seen. The features of the second layer are more comprehensive than the first. The second layer would have composite structures such as splines and curves. Here, we will describe the role of the neuron and the computations necessary to force an output from a neuron.

The role of the neuron is to input the sum of all the weights entering it against the pixel values. In the following graph, we see the neuron accepting inputs from the previous layer in the form of weights and bitmap values.

The role of the neuron is to sum the weights and values and force them through an activation function as input to the next layer:

Figure 15: CNN basic element. Here, the neuron is a basic unit of computing with weights and other bitmap values taken as input. The neuron ﬁres (or not) based on the activation function.

The equation for the neuron function is:

This can be a very large matrix multiplication problem. The input image is flattened into a one-dimensional array. The bias provides a method to influence the output without interacting with the real data. In the following diagram, we see an example of a weight matrix multiplied by a flattened one-dimensional image and with the added bias. Note that in actual CNN devices you can add the bias to the weight matrix and add a single 1.0 value to the bottom of the bitmap vector as a form of optimization. Here, the second value in the result matrix of 29.6 is the value chosen:

A close up of a clock Description automatically generated

Figure 16: Matrix relationship of CNN. Here, the weights and bitmap are matrix multiplied and added to a bias.

The input values are multiplied by the weighting on each entering the neuron. This is a simple linear transform in matrix math. That value needs to pass through an activation function to determine whether the neuron should fire. A digital system built on transistors takes voltages as input, and if the voltage meets a threshold value, the transistor switches on.

The biological analog is the neuron that behaves nonlinearly to the input. Since we are modeling a neural network, we attempt to use nonlinear activation functions. Typical activation functions that can be chosen include:

Logistic (sigmoid)
tanH
Rectified linear unit (ReLU)
Exponential linear unit (ELU) sinusoidal

The sigmoid activation function is:

Without a sigmoid (or any type of activation function) layer, the system would be a linear transformation function and have substantially less accuracy for image or pattern recognition.

CNN examples

A close up of a screen Description automatically generated

Figure 17: Four-layer CNN. Here, an image is convolved to extract large features based on primitives and then use a max pool to scale the image down and feed it as input to the feature ﬁlters. The fully connected layer ends the CNN path and outputs the best guess.

An example of primitive features that compose the layers comes courtesy of TensorFlow (http://playground.tensorflow.org). The TensorFlow example of a system with six features on input layer 1, followed by 33 hidden layers of four neurons, followed by two neurons and ending with two more is shown in the following diagram. In this model, the features attempt to classify the color groupings of dots.

Here, we attempt to find the optimal set of features that describe a spiral of two colored balls. The initial feature's primitives are basically lines and stripes. These will combine and be strengthened through the trained weightings to describe the next layer of blobs and splotches. As you move to the right, more detailed and composite representations form.

This test ran several thousand epochs in an attempt to show the regions that describe a spiral on the right. You can see the output curve on the upper right, which indicates the amount of error in the training process. Errors actually spiked in the middle of the training run, as chaotic and random effects were seen during backpropagation. The system then healed and optimized to the final result. The lines between neurons indicate the strength of the weight in describing the spiral pattern:

A close up of a map Description automatically generated

Figure 18: Example CNN in TensorFlow Playground. Courtesy of Daniel Smilkov and TensorFlow Playground under Apache License 2.0.

In the preceding image, a CNN is modeled using a learning tool called TensorFlow Playground. Here, we see the training of a four-layer neural network whose goal is to classify a spiral of different colored balls. The features on the left are the initial primitives, such as horizontal color changes or vertical color changes. The hidden layers are trained through backpropagation. The weighting factor is illustrated by the thickness of a line to the next hidden layer. The result is shown at the right, after several minutes of training.

The last layer is a fully connected layer, so called because it is required that every node in the final layer is connected to every node in the preceding level. The role of the fully connected layer is to finally resolve the image to a label.

It does this by inspecting the output and features of the last layer and determining that the set of the features corresponds to a particular label, such as a car. A car will have wheels, glass windows, and so on; meanwhile a cat will have eyes, legs, fur, and so forth.

Vernacular of CNNs

The use of CNNs includes a litany of terms and constructs. TensorFlow Playground is a good tool to understand the behavior and effect of different models, feature identifiers, and the role batch sizes and epochs play in training a model. The following image is a mark-up on TensorFlow Playground describing the different terminology and parameters that compose a CNN model.

Figure 19: The different parameters of CNN deep learning models. In particular note the effects of batch sizes, epochs, and learning rate.

Forward propagation, CNN training, and backpropagation

We have seen the process of feedforward propagation as a CNN executes. Training a CNN relies on the process of backpropagation of errors and gradients, deriving a new result, and correcting errors over and over.

The same network, including all pooling layers, activation functions and matrices, is used as the backward propagation flows through the network in attempts to optimize or correct the weighting:

Figure 20: CNN forward propagation during training and inference

Backpropagation is short for "back propagation of errors." Here an error function will calculate the gradient of an error function based on the neural network weights. The calculation of the gradient is forced backward through all the hidden layers. Shown below is the backpropagation process:

Figure 21: CNN backward propagation during training

We will now explore the training process. First, we must provide a training set for the network to normalize to. The training set and feature parameters are crucial in developing a well-behaving system in the field. The training data will have an image (or just bitmap data) and a known label. This training set is iterated against using the backpropagation technique to ultimately build a neural network model that produces the most accurate classifications or predictions. Too small a training set will produce poor results. For example, if you were building a device to classify all brands of shoes, you would need more than a single image of a particular shoe brand. You want the set to include different shoes, different colors, different brands, and different images using various lighting and angles.

Second, the neural network is composed of identical initial values or random values for each weight on each neuron that needs to be trained. The first forward pass results in substantial errors that go into a loss function:

Here, the new weights are based on the previous weight W(t - 1) minus the partial derivative of the error E over the weight W (loss function). This is also called the gradient. In the equation, lambda refers to the learning rate. This is up to the designer to tune. If the rate is high (greater than 1), the algorithm will use larger steps in the trial process. This may allow the network to converge to an optimal answer faster, or it could produce a poorly trained network that will never converge to a solution. Alternatively, if lambda is set low (less than 0.01), the training will take very small steps and much longer to converge, but the accuracy of the model may be better. In the following example, the optimal convergence is the very bottom of the curve representing error and weights. This is called the gradient descent. If the learning rate is too high, we could never reach the bottom and have to settle for a near bottom toward one of the sides:

Figure 22: Global minimum. This illustration shows the basis of a learning function. The goal is to ﬁnd the minimal value through a gradient descent. The accuracy of the learning model is proportional to the number of steps (time) taken to converge to a minimum.

Finding a global minimum of an error function isn't guaranteed. That is, a local minimum may be found and resolved as a false global minimum. The algorithm often will have trouble escaping the local minimum once found. In the next graph, you see the correct global minimum and how a local minimum may be settled upon:

Figure 23: Errors in training. We see the true global minimum and maximum. Depending on factors such as the training step size or even the initial starting point of the descent, a CNN could be trained to a false minimum.

While a neural network trains and attempts to find the global minimum, a problem arises called the vanishing gradient problem. As the weights in the neural network are updated the gradient may become artificially very small. This has the effect that the weight may not change its value. It may even stop a neural network from further training altogether. The TensorFlow Playground example uses an activation function whose values range from -1 to 1. When the neural network completes an epoch and backpropagates the errors to recalculate the weights, you may reach a point where the error signal (gradient) decreases exponentially and the system may train extremely slowly. The following example from TensorFlow Playground illustrates how the neural network stopped training due to the vanishing gradient problem after 1300 epochs. Further training of the model achieved no further accuracy.

To alleviate the problem, techniques such as long short-term memory (covered in the Recurrent neural networks section) may help; fast hardware and fine-tuning the correct features and training parameters can be useful.

We can see that there is still some degree of error that the training could not resolve:

Figurev 24: TensorFlow training example. The first image is the result of 100 epochs using a batch size of 10. The second image is the result after 400 epochs. The final result is after 1316 epochs and 10 minutes of training on a 3 GHz i7 processor. Note the final result shows incorrectly classified "balls" on the bottom left and top right area of the spiral. Courtesy of Daniel Smilkov and TensorFlow Playground under Apache License 2.0.

Here, we see the training progress (from left to right) with more accuracy. The left illustration clearly shows the heavy influence of the horizontal and vertical primitive features. After a number of epochs, the training starts converging on the true solution. Even after 1316 epochs, there are still some error cases where the training didn't converge on the correct answer.

Loss will be especially heavy during the initial runs of the network. We can visualize that with TensorFlow Playground. Here again, we are training a neural network to identify spirals. At first in the training, the loss is heavy at 0.516. After 1531 epochs, we arrive at this network's weights and a loss of 0.061.

It is good to understand the difference between batches and epochs during the training process:

Batch: This refers to the number of training samples that are processed before a prediction is made and the weights are adjusted in the model.

Epoch: This refers to the number of iterations the training runs through the entire training data set. This is usually very high for well-trained models.

Learning rate: This is a parameter that controls the gradient. It can be adjusted, but the risk is the gradient descent problem.

For example, say you have 200 images of different shoes that you are using to train a deep learning model to detect shoe brands. You start by training using 1000 epoch iterations to manage product schedules and maintain shipping dates.

If you set a batch size of 5, then you will iterate through five images before correcting the model.

The 5 images result in 40 training sets out of the 200 images. This implies that there are 40 batches of images to process the entire training set. Each time you go through the training set, you complete one epoch.

Therefore, there are 40 batches * 1000 epochs, or 40,000 training sessions in total.

Training can deliver unpredictable results. It takes training to understand how the various parameters affect the results. It also is a balance between the learning rate and the number of epochs to achieve the best model (e.g. the least loss on a training set). Reducing the learning rate or increasing the number of epochs doesn't necessarily mean you will achieve the best model. The following figure illustrates this point.

A very high learning rate will often produce the worst model.

A good (balanced) learning rate may not produce the best results compared to a high learning rate within a short number of epochs. However, over time it will generally train with the best results.

Figure 25: Example of learning rates and epochs as a function of accuracy (loss) in deep learning training. A balanced approach to training and learning rate is usually best for CNN models.

Recurrent neural networks

RNNs, or recurrent neural networks, constitute a separate field of machine learning and are extremely important and relevant to IoT data. The big difference between an RNN and a CNN is that a CNN processes input on fixed-size vectors of data. Think of them as two-dimensional images—that is, a known-size input. CNNs also pass from layer to layer as fixed-sized units of data. An RNN has similarities but is fundamentally different: instead of taking a fixed-size chunk of image data, it has as input a vector and as output another vector. At its heart, the output vector is influenced not by that single input we just fed it, but by the entire history of inputs it was fed. That implies an RNN understands the temporal nature of things or can be said to maintain state. There is information to be inferred from the data, but also from the order the data was sent.

RNNs are of particular value in the IoT space, especially in time-correlated series of data, such as describing a scene in an image, describing the sentiment of a series of text or values, and classifying video streams. Data may be fed to an RNN from an array of sensors that contain a (time:value) tuple. That would be the input data to send to the RNN. In particular, such RNN models can be used in predictive analytics to find faults in factory automation systems, evaluate sensor data for abnormalities, evaluate timestamped data from electric meters, and even to detect patterns in audio data. Signal data from industrial devices is another great example. An RNN could be used to find patterns in an electrical signal or wave. A CNN would struggle with this use case. An RNN will run ahead and predict what the next value in a sequence will be if the value falls out of the predicted range, which could indicate a failure or significant event:

Figure 26: The main difference between an RNN and a CNN is the reference to time or sequence order

If you were to examine a neuron in an RNN, it would look as if it were looping back on itself. Essentially, the RNN is a collection of states going back in time. This is clear if you think of unrolling the RNN at each neuron:

Figure 27: RNN neuron. This illustrates the input from previous step xn-1 feeding the next step xn as the basis of the RNN algorithm.

The challenge with RNN systems is that they are more difficult to train over a CNNs or other models. Remember, CNN systems use backpropagation to train and reinforce the model. RNN systems don't have the notion of backpropagation. Anytime we send input into the RNN, it carries a unique timestamp. This leads to the vanishing gradient problem discussed earlier, which reduces the learning rate of the network to be useless. A CNN is also exposed to a vanishing gradient, but the difference with an RNN is that the depth of the RNN can go back many iterations, whereas a CNN traditionally has only a few hidden layers. For example, an RNN resolving a sentence structure like A quick brown fox jumped over the lazy dog will extend back nine levels. The vanishing gradient problem can be thought of intuitively: if weights in the network are small, then the gradient will shrink exponentially leading the vanishing gradient. If the components of the weights are large, then the gradient will grow exponentially and possibly explode, causing a NaN (not a number error). Exploding leads to an obvious crash, but the gradient is usually truncated or capped before that occurs. A vanishing gradient is harder for a computer to deal with.

One method to overcome this effect is to use the ReLU activation function mentioned in the CNN section. This activation function delivers a result of 0 or 1, so it isn't prone to vanishing gradients. Another option is the concept of long short-term memory (LSTM), which was proposed by researchers Sepp Hochreiter and Juergen Schmidhuber. (Long Short-Term Memory, Neural Computation, 9(8):1735-1780, 1997.) LSTM solved the vanishing gradient issue and allowed an RNN to be trained.

Here, the RNN neuron consisted of three or four gates. These gates allow the neurons to hold state information and are controlled by logistic functions with a value between 0 and 1:

Keep gate K: Controls how long a value will remain in memory
Write gate W: Controls how much a new value will affect memory
Read gate R: Controls how much a value in memory is used to create the output activation function

You can see that these gates are somewhat analog in nature. The gates vary how much information will be retained. The LSTM cell will trap errors in the memory of the cell. This is called the error carousel and allows the LSTM cell to backpropagate errors over long time periods. The LSTM cell resembles the following logical structure where the neuron is essentially the same, in terms of outward appearances, as a CNN, but internally it maintains state and memory. The LSTM cell of the RNN is illustrated as follows:

Figure 28: LSTM cell. Here is the RNN basic algorithm using internal memory to process arbitrary sequences of inputs.

An RNN builds up memory in the training process. This is seen in the diagram as the state layer under a hidden layer. An RNN is not searching for the same patterns across an image or bitmap like a CNN; rather, it is searching for a pattern across multiple sequential steps (which could be time).

The hidden layer and state layer complement are shown in the following diagram:

Figure 29: Hidden layers are fed from previous steps as additional inputs to the next step

One can see the amount of computation in training with the LSTM logistical math, as well as how the regular backpropagation is heavier than a CNN. The process of training involves backpropagating gradients through the network all the way to time zero. However, the contribution of a gradient from far in the past (say, time zero) approaches zero and will not contribute to the learning.

A good use case to illustrate an RNN is a signal analysis problem. In an industrial setting, you can collect historical signal data and attempt to infer from it whether a machine was faulty or there were runaway thermals in some component. A sensor device would be attached to a sampling tool, and a Fourier analysis performed on the data. The frequency components could then be inspected to see if a particular aberration was present. In the following graph, we have a simple sine wave that indicates normal behavior, perhaps of a machine using cast rollers and bearings. We also see two aberrations introduced (the anomaly). A fast Fourier transform (FFT) is typically used to find aberrations in a signal based on the harmonics. Here, the defect is a high-frequency spike similar to a Dirac delta or impulse function.

Figure 30: RNN use case. Here, a waveform with an aberration from audio analysis could be used as input to an RNN.

We see the following FFT registers only the carrier frequency and doesn't see the aberration:

Figure 31: High frequency spike via an FFT

An RNN specifically trained to identify the time-series correlation of a particular tone or audio sequence is a straightforward application. In this case, an RNN could replace an FFT, especially when multiple sequences of frequencies or states are used to classify a system, making it ideal for sound or speech recognition.

Industrial predictive maintenance tools rely on this type of signal analysis to find thermal and vibration-based failures of different machines. This traditional approach has limits, as we see. A machine learning model (especially an RNN) can be used to inspect the incoming stream of data for particular feature (frequency) components and could find point failures as shown in the preceding graph. Raw data, shown in the previous graph, is arguably never as clean as a sine wave. Usually, the data is quite noisy with periods of loss.

Another use case is around sensor fusion in healthcare. Healthcare products, like glucose monitors, heart rate monitors, fall indicators, respiratory meters, and infusion pumps, will transmit periodic data samples, or may send a stream of data. All these sensors are independent of each other, but together comprise a picture of patient health. They also are time-correlated. An RNN can bridge this unstructured data in aggregate and predict patient health, all dependent on the patient's activity throughout the day. This can be useful for home health monitoring, sports training, rehabilitation, and geriatric care.

You must be careful with RNNs. While they can make good inferences on time series data and predict oscillations and wave behaviors, they may behave chaotically and are very difficult to train.

Training and inference for IoT

While neural networks offer significant advantages of having a machine behave closer to a human in the areas of perception, pattern recognition, and classification, it is primarily up to the training to develop a model that works well with low loss, no overfitting, and adequate performance. In the IoT world, latency is a large issue, especially for safety-critical infrastructure. Resource constraints is another factor. Most of the edge computing devices that exist today do not have hardware accelerators such as General-Purpose Computation on Graphics Hardware (GPGPU) and field-programmable gate arrays (FPGAs) at their disposal to assist with the heavy matrix math and floating point surrounding neural networks. Data could be sent to the cloud, but that may have significant latency effects as well as bandwidth costs. The OpenFog group is provisioning a framework where edge fog nodes may be provisioned with additional compute resources and pulled on demand to provide assistance with the heavy lifting of these algorithms.

Training, for now, should be the realm of the cloud, where the computing resources are available and test sets can be created. The edge devices should report to the cloud parent when a training model fails, or if new data appears that requires a retraining effort. The cloud allows for a train once-deploy many concept, which is a strength. Alternatively, it is wise to consider training on a regional basis with bias.

The concept here is that fog nodes in particular regions may be more sensitive to certain patterns that are environmentally different. For example, monitoring temperature and humidity on equipment in the field in the Arctic will differ significantly to a tropical region.

The following table illustrates the CPU processing required for training. Generally, thousands to millions of images are needed to successfully train a model. The processors and GPUs shown come with a substantial cost and power demand that don't necessarily make sense to run at the edge.

Processor	TensorFlow speed of training (images/second)
AMD Opteron 6168 (CPU)	440
Intel i7 7500U (CPU)	415
Nvidia GeForce 940MX (GPU)	1190
Nvidia GeForce 1070	6500
Nvidia RTX2080	17000

The edge is more adept at running trained models in an inference mode. However, deploying inference engines needs to be architected well. Some CNN networks such as AlexNet have 61 million parameters, consume 249 MB of memory, and perform 1.5 billion floating point operations to classify a single image.

Reduced precision, pruning, and other techniques to perform first-run heuristics on image data are better suited for edge devices. Additionally, preparing data for upstream analytics can also help. Examples include:

Sending upstream: Only data that meets a specific condition (time, event of interest)
Data scrubbing: Reduce, crop, and clip a data set to only relevant content
Segment: Force data into grayscale to reduce traffic and prepare it for a CNN

IoT data analytics and machine learning comparison and assessment

Machine learning algorithms have their place in IoT. The typical case is when there is a plethora of streaming data that needs to produce some meaningful conclusion. A small collection of sensors may only need a simple rules engine on the edge in a latency-sensitive application. Others may stream data to a cloud service and apply rules there for systems with less-aggressive latency demands.

When large amounts of data, unstructured data, and real-time analytics come into play, we need to consider the use of machine learning to solve some of the hardest problems.

In this section, we detail some tips and reminders in deploying machine learning analytics, and what use cases may warrant such tools.

Training phase:

For a random forest, use bagging techniques to create ensembles.
When using a random forest, ensure you maximize the number of decision trees.
Watch overfitting. Overfitting will lead to inaccurate field models. Techniques such as regularization and even injecting noise into a system will reinforce the mode.
Don't train on the edge.
Gradient descent will lead to errors. RNNs naturally are susceptible.

Model in field:

Update the model with new data sets as they become available. Keep the training set current.
Running models on the edge can be reinforced with larger and more comprehensive models in the cloud.

Neural network execution can be optimized in the cloud and at the edge with a minimum of loss by considering techniques such as pruning nodes and reducing precision.

Model

Best Application

Worst Fit and Side Effects

Resource Demands

Training

Random forests (statistical models)

Anomaly detection

Systems with thousands of choice points and hundreds of inputs

Regression and classiﬁcation

Handles mixed data types

Ignores missing values

Scales linearly with input

Feature extraction

Time and sequence analysis

Low

Training based on bagging techniques for maximum effectiveness

Training fairly resource-light

Mainly supervised

RNN

(temporal and sequence- based neural networks)

Prediction of an event based on a sequence

Streaming data patterns

Time-correlated series data

Maintains knowledge of past states to predict new states (electrical signals, audio, speech recognition)

Unstructured data

Input variables may or may not be dependent

Image and video analysis

Models that use thousands of features.

Very high for training

High for inference execution

Training more cumbersome than CNN backpropagation

Very hard to train

Supervised

CNN

(deep learning)

Prediction of an object based on surrounding values

Pattern and feature identiﬁcation

2D image recognition

Unstructured data

Input variables may or may not be dependent

Time-based and sequential predictions

Models that use thousands of features.

Very high for training (ﬂoating point precision, large training sets, large memory demands)

High for inference execution

Supervised and unsupervised

Bayesian networks (probabilistic models)

Noisy and incomplete data sets

Streaming data patterns

Time-correlated series

Structured data

Signal analysis

Models developed quickly

Assumes all input variables are independent

Performs poorly with high orders of data dimensions

Low

Little training data need with respect to other artiﬁcial neural networks

Summary

This chapter was a brief introduction to data analytics for IoT in the cloud and in the fog. Data analytics is where the value is extracted out of the sea of data produced by millions or billions of sensors. Analytics is the realm of the data scientist and consists of attempts to find hidden patterns and develop predictions from an overwhelming amount of data. To be valuable, all this analysis needs to be at or near real time to make life-critical decisions. You need to understand the problem being solved and the data necessary to reveal the solution. Only then can a data analysis pipeline be architected well. This chapter exposed several data analysis models as well as an introduction to the four relevant machine learning domains.

These analytics tools are the heart of value in IoT to derive meaning from the nuances of massive amounts of data in real time. Machine learning models can predict future events based on current and historical patterns. We see how RNN and CNN cases satisfy this context through proper training. As an architect, the pipeline, storage, models, and training all need to be considered.

In the next chapter, we will talk about the security of IoT from a holistic point of view, from the sensor to the cloud. We will examine specific real-world attacks on IoT in recent years, as well as methods to counter such attacks in the future.