Data Pipelines and the AI Engine

Marcus Burton, Architect, Cloud Technology
June 16th 2020

If you follow marketing buzzword trends, you probably watched over the last few years as big data shifted to become machine learning (ML) and AI. So now everyone everywhere is talking about ML/AI, but I fear that for many networking people (myself included), the big data hype trend left us scratching our heads in jargon overload.

Every trend comes with an entire lexicon of new terms and concepts, but it’s hard to find a single framework that helps us navigate the entire topic. As our industry as a whole is still learning the data jargon, I thought it might be helpful to see how data initiatives fit into an end-to-end framework. So the focus of this article is to talk about holistic data systems (whether they’re ML/AI or not) as a data pipeline.

A data pipeline is effectively the architectural system for collecting, transporting, processing, transforming, storing, retrieving, and presenting data. These steps are the functional steps in achieving the end-goal of a data application. We often use the term data stack to refer to the software, systems, and tools that are used to deliver a data pipeline.

Conceptually, a data pipeline includes the stages described below. Actual implementations may vary, but these are the general stages, and they commonly occur in this order: collection, transport, ETL, compute, storage, query, and visualization.

Collection

Data starts somewhere, and there are many potential collection points on the network. Collectors could be client-side apps, wireless or wired sniffers, infrastructure devices (APs, switches, routers, etc.), or dedicated collection engines that pull data from endpoints or process traffic mirrored from data aggregation points in the network. Each type of collector has a unique perspective on the network, and also brings unique challenges in terms of deployment, processing, and storage capabilities. Every data initiative depends on this linchpin, which provides the metrics and telemetry needed for data analysis. Collectors should provide the right data, frequently sampled, accurately measured, properly aggregated, packaged up, and ready to send.
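To make "sampled, aggregated, packaged up" a bit more concrete, here is a minimal Python sketch of what a collector might do for one measurement window. Everything in it is an assumption for illustration: the TelemetryRecord fields, the collect_window helper, and the fake_channel_utilization data source are hypothetical names, not any vendor's actual collector.

```python
from dataclasses import dataclass, asdict
from statistics import mean
from typing import Callable, List


@dataclass
class TelemetryRecord:
    """One aggregated measurement window, packaged and ready to send."""
    device_id: str
    metric: str
    window_start: float  # timestamp of the first sample in the window
    sample_count: int
    minimum: float
    maximum: float
    average: float


def collect_window(
    device_id: str,
    metric: str,
    read_sample: Callable[[float], float],
    timestamps: List[float],
) -> TelemetryRecord:
    """Take one sample per timestamp, then aggregate the whole window."""
    samples = [read_sample(ts) for ts in timestamps]
    return TelemetryRecord(
        device_id=device_id,
        metric=metric,
        window_start=timestamps[0],
        sample_count=len(samples),
        minimum=min(samples),
        maximum=max(samples),
        average=mean(samples),
    )


def fake_channel_utilization(ts: float) -> float:
    """Hypothetical stand-in for reading a real counter from an AP."""
    return 20.0 + (ts % 5) * 10.0


record = collect_window(
    "ap-01", "channel_utilization_pct",
    fake_channel_utilization, [0.0, 1.0, 2.0, 3.0, 4.0],
)
print(asdict(record))  # a plain dict, ready for the transport step
```

The design choice worth noticing is that the collector sends one summarized record per window rather than every raw sample, which is one common way to trade detail for lower transport and storage cost.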

Transport

After collection, the next step is to package and send the data to analysis systems for processing. This transport step covers a few important areas: serialization, tunneling and encryption, and message handling/queuing.

  • Serialization is the packaging process by which data is organized into structured formats (e.g., JSON, Google Protocol Buffers (GPB), Avro, or XML) that can be read by the receivers.
  • Tunneling and encryption are familiar to networking folks—this is the collector’s way of securely sending the data to receiving endpoints.
  • Then, message queuing is the data middleman process whereby data is sent by producers (collectors) and received by the message queue/bus (there are historical differences, but I use the terms interchangeably). The queue then makes the data available to consumers for ingestion and processing. Message queues frequently use a publish/subscribe (pub/sub) model wherein many producers can publish data to the bus and many consumers can subscribe to different data topics of interest. One of our PLMs calls the message bus the circulatory system of a data-driven application because it is effectively responsible for carrying data (like blood) between many different systems and services within an application. You may have heard of Kafka, MQTT, RabbitMQ or other protocols and tools—this is where they fit.  
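Here is a rough Python sketch of two of the transport pieces above: serialization and pub/sub message handling. It is a toy under stated assumptions. The MessageBus class, the topic names, and the record fields are all hypothetical, the bus lives in memory in a single process, and tunneling and encryption are left out entirely. Real tools like Kafka, MQTT, or RabbitMQ work across the network and do far more than this.

```python
import json
from collections import defaultdict
from typing import Callable, DefaultDict, List


def serialize(record: dict) -> bytes:
    """Package a record as JSON bytes that any receiver can parse."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Turn JSON bytes back into a record on the consumer side."""
    return json.loads(payload.decode("utf-8"))


class MessageBus:
    """Toy in-memory pub/sub bus (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[bytes], None]]] = (
            defaultdict(list)
        )

    def subscribe(self, topic: str, handler: Callable[[bytes], None]) -> None:
        """Register a consumer callback for one topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Hand the payload to every subscriber of the topic.

        Returns the number of consumers that received it.
        """
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = MessageBus()
received_by_etl: List[dict] = []
received_by_alerting: List[dict] = []

# Two independent consumers subscribe to the same topic.
bus.subscribe("telemetry.wifi", lambda p: received_by_etl.append(deserialize(p)))
bus.subscribe("telemetry.wifi", lambda p: received_by_alerting.append(deserialize(p)))

# A producer (collector) publishes without knowing who is listening.
delivered = bus.publish("telemetry.wifi", serialize({"device_id": "ap-01", "clients": 17}))
ignored = bus.publish("telemetry.switch", serialize({"device_id": "sw-01", "ports_up": 24}))

print(delivered, ignored)
```

The point of the pattern is decoupling: the producer only knows the topic name, so consumers can be added or removed without touching the collector.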

ETL

Data is then picked up from a message bus and ingested using an ETL process. ETL stands for extract, transform, and load, which is when data is retrieved from its current format (extract), manipulated or massaged to change the data in some way (transform), and then installed into a database (load). ETL can be used for a lot of data manipulation tasks, like joining datasets from two databases, converting data from simple file formats into others that are higher performance for querying, or calculating aggregations of data across common time slices, like 10-minute intervals.
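As a small illustration of the three ETL steps, the Python sketch below extracts JSON messages, transforms them into 10-minute averages, and loads the result into a database. The message fields (device_id, ts, clients), the client_avg table, and the use of an in-memory SQLite database are assumptions made for the example, not a description of any particular product.

```python
import json
import sqlite3
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 600  # 10-minute time slices


def extract(raw_messages):
    """Extract: parse each serialized message into a Python dict."""
    return [json.loads(message) for message in raw_messages]


def transform(records):
    """Transform: average the client count per device per 10-minute window."""
    buckets = defaultdict(list)
    for record in records:
        window_start = record["ts"] - (record["ts"] % WINDOW_SECONDS)
        buckets[(record["device_id"], window_start)].append(record["clients"])
    return [
        (device_id, window_start, mean(values))
        for (device_id, window_start), values in sorted(buckets.items())
    ]


def load(conn, rows):
    """Load: write the aggregated rows into a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS client_avg "
        "(device_id TEXT, window_start INTEGER, avg_clients REAL)"
    )
    conn.executemany("INSERT INTO client_avg VALUES (?, ?, ?)", rows)
    conn.commit()


raw = [
    '{"device_id": "ap-01", "ts": 0, "clients": 10}',
    '{"device_id": "ap-01", "ts": 300, "clients": 20}',
    '{"device_id": "ap-01", "ts": 600, "clients": 30}',
]

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(raw)))
rows = conn.execute(
    "SELECT device_id, window_start, avg_clients "
    "FROM client_avg ORDER BY window_start"
).fetchall()
print(rows)
```

The first two messages fall in the same 10-minute window and are averaged together, while the third starts a new window.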

The ETL process happens in two primary models: batch and stream. Batch processing is a scheduled process (usually) that picks up data in chunks over a period of time—for example, waiting every 24 hours for a nightly data cleanup/backup task, or waiting for 10 minutes of data to collect, and then performing calculations on the 10-minute batch at once. Stream processing is a real-time and ongoing ingestion of data, which is becoming more common with on-demand analytics workflows.
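The difference between the two models can be sketched in a few lines of Python. This is a simplified illustration with hypothetical function names: the batch version waits for a whole chunk of readings and computes once, while the stream version updates its answer as each reading arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(chunk: List[float]) -> float:
    """Batch: the whole chunk has already been collected; compute once."""
    return sum(chunk) / len(chunk)


def stream_average(events: Iterable[float]) -> Iterator[float]:
    """Stream: emit an updated running average after every event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]

batch_result = batch_average(readings)                  # one answer, at the end
stream_results = list(stream_average(iter(readings)))   # an answer per event

print(batch_result)
print(stream_results)
```

Both approaches end at the same final value; the stream version simply has a usable answer available much earlier, which is why it suits on-demand analytics.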

Compute

Compute is similar to ETL, but it is the next step in data processing, where machine learning models are built. Applications might pull data from a source, perform some lightweight ETL on the data, then pass it to a compute algorithm to build a model. A “model” just means that it will learn some patterns that describe the data in order to help the system perform some task (e.g., prediction, clustering, or correlation) with the data or other data. This is the step where the ML/AI magic happens.
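To take some of the mystery out of the word "model", here is about the simplest one I can think of, sketched in plain Python: a straight line fitted with ordinary least squares and then used for prediction. The data is made up for the example (channel utilization as a function of client count), and real ML pipelines use far richer algorithms and libraries, but the learn-a-pattern-then-apply-it shape is the same.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Learn slope and intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model: Tuple[float, float], x: float) -> float:
    """Apply the learned pattern to a value the model has not seen."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: utilization happens to equal 2 * clients + 5.
clients = [5, 10, 15, 20]
utilization_pct = [15.0, 25.0, 35.0, 45.0]

model = fit_line(clients, utilization_pct)  # the "model" is just two numbers
forecast = predict(model, 30)               # estimate utilization at 30 clients

print(model, forecast)
```

Here the model is nothing more than a slope and an intercept learned from past data, which is enough to make a prediction about new data.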

Storage

Storage is pretty straightforward, but applications increasingly leverage multiple types of data store in the pipeline to accomplish the end goals of the application. Some data is clearly very structured (like device inventory, configurations, and admin accounts), while other data is less structured (event data, time-series statistics, health metadata, etc.), and these varieties of data drive the need for different storage formats. Likewise, some data stores are used for initial or raw storage (or long-term backups), while others are used for processed data after it’s gone through ETL and compute processes and is ready for query. For that reason, in modern architectures, it’s quite common to see three or more different types of databases, which solve for different storage formats, durations, and query requirements.
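The Python sketch below contrasts the two kinds of data in miniature. It assumes a relational table (SQLite, in memory) for structured device inventory and a toy key-value store of JSON documents for less structured events. The table, field names, and helper functions are hypothetical, and a plain dict is standing in for a real document or time-series database.

```python
import json
import sqlite3

# Structured data: every row has the same columns, enforced by a schema.
inventory = sqlite3.connect(":memory:")
inventory.execute(
    "CREATE TABLE devices "
    "(serial TEXT PRIMARY KEY, model TEXT NOT NULL, site TEXT NOT NULL)"
)
inventory.executemany(
    "INSERT INTO devices VALUES (?, ?, ?)",
    [("SN001", "ap-model-a", "hq"), ("SN002", "switch-model-b", "branch")],
)

# Less structured data: each event carries whatever fields apply to it.
event_store = {}  # toy stand-in for a document store


def put_event(event_id: str, event: dict) -> None:
    """Store an event as a JSON document under its ID."""
    event_store[event_id] = json.dumps(event)


def get_event(event_id: str) -> dict:
    """Fetch an event document by ID."""
    return json.loads(event_store[event_id])


put_event("evt-1", {"type": "roam", "client": "aa:bb", "from_ap": "SN001", "rssi": -61})
put_event("evt-2", {"type": "reboot", "device": "SN002", "reason": "upgrade"})

hq_devices = inventory.execute(
    "SELECT serial FROM devices WHERE site = ?", ("hq",)
).fetchall()
roam_event = get_event("evt-1")

print(hq_devices, roam_event)
```

Notice that the two events do not share the same fields, which would be awkward to force into one fixed table but is no problem for a document-style store.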

Query

After the data is stored, it is then available for query. There are many variations of query language, but most fit into two groups: SQL for structured databases and NoSQL for unstructured data. Backend systems may also incorporate APIs to simplify the query process for the “user” or service fetching data. There are two general types of API: private and public. Private APIs are basically internal system interfaces used to integrate components of an application that need to interact. Public APIs, in turn, are designed and made available for integrations with external entities, like third-party applications, systems, and users.

In either case, the API provides a software bridge between systems to exchange info, and often provides an abstraction (a way to reduce groups of concepts/details into simpler interactions and representations) to make life easier. In many systems, the data pipeline may stop here because the data is only shared between backend systems, and never makes it to a frontend for user interaction.
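Here is a minimal Python sketch of that abstraction idea. The HealthAPI class, its busiest_devices method, and the samples table are hypothetical names invented for the example. The caller asks a simple question and gets simple dictionaries back, without needing to know the SQL or the schema behind it.

```python
import sqlite3
from typing import Dict, List


class HealthAPI:
    """Hypothetical internal API that hides the SQL behind one simple call."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def busiest_devices(self, limit: int = 3) -> List[Dict[str, object]]:
        """Return the devices with the highest peak client count."""
        rows = self._conn.execute(
            "SELECT device_id, MAX(clients) AS peak "
            "FROM samples GROUP BY device_id "
            "ORDER BY peak DESC, device_id LIMIT ?",
            (limit,),
        ).fetchall()
        return [{"device_id": device, "peak_clients": peak} for device, peak in rows]


# Made-up backend data for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (device_id TEXT, clients INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("ap-01", 12), ("ap-01", 30), ("ap-02", 18), ("ap-03", 25)],
)

api = HealthAPI(conn)
top_two = api.busiest_devices(limit=2)  # the caller never sees the query
print(top_two)
```

If the storage layer changes later, only the body of busiest_devices needs to change; every service calling the API keeps working as before.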

Visualize

As users, if there’s no frontend for data visualization, we never see the data at all. But, in most networking applications, it’s the juicy visualizations, charts, graphs, and interactions that make data meaningful. In my opinion, data visualization is still the most underrated part of a data pipeline because the presentation of data is often the mechanism for translating value.
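Even a very crude presentation layer shows the point. The Python sketch below turns a handful of made-up peak client counts into a text bar chart. The render_bars function and the numbers are hypothetical, and a real frontend would use a proper charting library, but the comparison between devices is already easier to read than a raw list of values.

```python
from typing import Dict


def render_bars(values: Dict[str, int], width: int = 20) -> str:
    """Render one text bar per entry, scaled so the largest fills `width`."""
    peak = max(values.values())
    label_width = max(len(name) for name in values)
    lines = []
    for name, value in values.items():
        bar = "#" * round(width * value / peak) if peak else ""
        lines.append(f"{name.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)


# Made-up peak client counts per access point.
chart = render_bars({"ap-01": 30, "ap-02": 15, "ap-03": 6}, width=10)
print(chart)
```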

While everyone wants to say machine learning (ML) and artificial intelligence (AI) for the marketing pizazz, you can still solve many problems by simply presenting data in an intuitive way, but that by itself depends on an effective end-to-end data pipeline.

Bringing It All Together

Sometimes I think network applications are a lot like a body. Some parts of our body do a lot of important work, while other parts get all the attention. While ML and AI algorithms are clearly very important members of a data-driven application, they are only a small part of the body of end-to-end systems that make the data machine work properly.

Despite the incredible attention we’re all giving to ML/AI, it’s essential to understand that the attractive face of ML/AI and its applications depends on reliable data guts in the architecture. We’re all learning to speak a new language as part of the ongoing data revolution, so this primer on data pipelines is just the beginning of the longer conversation on ML and AI. If nothing else, I hope this framework helps you understand how pieces fit into a broader whole. It certainly did for me.

And Some Exciting News

If this article captured your attention and you want more depth on these layers and the ML/AI topics surrounding them, I’m excited to let you in on a little secret… I’m writing a book about it. That’s right, we’re right around the corner from releasing a new book that walks through all the fun details of cloud architecture, service consumption and deployment models, data pipelines, machine learning, and so much more. Stay tuned for more details coming soon!