
Microsoft open sources Data Accelerator for Apache Spark

Welcome to Data Accelerator!

Data Accelerator for Apache Spark simplifies streaming big data using Spark. Data Accelerator has been used within Microsoft for two years, processing streamed data across many internal deployments at Microsoft scale. It offers an easy-to-use platform for learning and evaluating your streaming needs, and we are excited to share this project with the wider community as open source.

A few of the ways Data Accelerator will make it easier to build a streaming pipeline on Spark:

  • Plug and Play: Easily set up input sources and output sinks to establish a pipeline in minutes. Data Accelerator supports reading from both Event Hubs and IoT Hub, and supports sinking data to Azure Blob storage, Cosmos DB, Event Hubs, and more.
  • No-Code Experience: Set up alerts and data processing without writing any code. Through a rules designer experience you can specify simple and aggregate data processing, tagging, and alerts.
  • SQL queries: Write complex processing in SQL – no need to work in Scala. The built-in extensibility model also supports user-defined functions (UDFs) and calling Azure Functions – e.g., to apply ML mid-stream (see the sketch after this list).
  • Live query: Validate your queries in seconds by running against a sample of incoming data, saving hours of work setting up and testing the processing of your pipeline.
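To make the "SQL queries" point above concrete, here is a rough sketch of the kind of windowed aggregation you might express, written as plain Spark SQL submitted through the Scala API. The view name and columns (events, deviceId, temperature, eventTime) are placeholders made up for illustration; Data Accelerator layers its own designer and syntax extensions on top of queries like this.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-sketch").getOrCreate()

// Assumes a view named "events" has been registered with
// deviceId, temperature, and eventTime columns.
val hotDevices = spark.sql("""
  SELECT deviceId,
         window(eventTime, '1 minute') AS win,
         AVG(temperature)              AS avgTemp
  FROM events
  GROUP BY deviceId, window(eventTime, '1 minute')
  HAVING AVG(temperature) > 90.0
""")
```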


Overview

Data Accelerator is an easy way to set up and run a streaming big data pipeline on Apache Spark. Within the Developer Tools group at Microsoft, we have used an instance of Data Accelerator to process events at Microsoft scale since the fall of 2017.

Data Accelerator isn’t just a pipe between an Event Hub and a database, however. It allows us to reshape incoming events while they continue to stream, then route different parts of the same event into different data stores, all while providing health monitoring and alerting over the status of the whole pipeline. Data Accelerator also provides a configuration UI and a rules/query designer experience that lets you be up and running without writing any code. Anyone doing stream processing also frequently needs to process data with sliding windows, handle late-arriving data, or accumulate data over time; Data Accelerator enables and simplifies the use of these advanced features.
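For readers new to those concepts, a minimal sketch in plain Structured Streaming of a sliding window plus late-arrival handling (via a watermark) looks roughly like this; the source, column names, and durations are illustrative assumptions rather than Data Accelerator's actual configuration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("windowing-sketch").getOrCreate()
import spark.implicits._

// Stand-in source; a real pipeline would read from Event Hubs or IoT Hub.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

val counts = events
  .withWatermark("timestamp", "10 minutes")                // tolerate events up to 10 minutes late
  .groupBy(window($"timestamp", "5 minutes", "1 minute"))  // 5-minute window sliding every minute
  .count()
```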

And lastly, the dev-test loop supports a fast validation cycle in which your query runs against locally sampled events, allowing the implementation to be finalized before the first deployment. We think these capabilities will appeal to many of you, and we hope some of you find the project useful enough to work with and even contribute back. We can’t wait to see what comes next!
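One way to picture that dev-test loop in plain Spark terms: the same SQL can be exercised against a static local sample before it is ever pointed at the live stream. The file path, view name, and columns below are hypothetical; in Data Accelerator, the Live Query experience does this sampling for you.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("livequery-sketch").getOrCreate()

// Hypothetical local file holding a sample of captured events.
val sample = spark.read.json("file:///tmp/sampled-events.json")
sample.createOrReplaceTempView("events")

// The same SQL that will later run on the stream can be checked here in seconds.
spark.sql("SELECT deviceId, COUNT(*) AS n FROM events GROUP BY deviceId").show()
```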

When to use Data Accelerator

We built Data Accelerator to deal with data from many incoming streams that needed to be combined and routed to many different output sinks in a way that promotes quick discovery of data insights. Naturally, normalization is a big deal here, and anyone who has worked in a heterogeneous event environment probably recognizes the peril of days spent writing and tuning event parsers. That’s why we implicitly infer the event schema from a sample of your event data.
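As a rough illustration of what schema inference from a sample buys you, plain Spark can already infer the structure of a handful of captured JSON events, sparing you from hand-writing parsers; the sample path below is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-sketch").getOrCreate()

// Hypothetical file holding a few hundred raw JSON events captured from the stream.
val sampled = spark.read.json("file:///tmp/event-sample.json")
sampled.printSchema()                 // the inferred event schema
val inferredSchema = sampled.schema   // can be reused when wiring up the streaming read
```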

But beyond reading different sources, transforming events in the stream and writing them out is of critical importance. By combining events with their inferred schema, Data Accelerator can recognize and modify events, or parts of events, as they continue streaming through the pipeline. Events can be split, merged with values from reference data or an algorithm, modified, or dropped entirely. Complex queries and policies using different time-window functions or accumulators can be set up easily. In our experience, the ability to instantly validate your queries using the Live Query feature, running against a sample of incoming data, saves hours of frustration when setting up processing on big data pipes. Finally, a lightweight health dashboard and alerting system rounds out the pipeline, standing up all the essential elements needed to evaluate a streaming big data pipeline on Spark end to end.
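Here is a sketch of that reshaping in plain Structured Streaming terms, assuming made-up sources and column names: enrich events with static reference data, then route different slices of the same stream to different sinks. This is not Data Accelerator's own configuration model, just an illustration of the underlying ideas.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("reshape-sketch").getOrCreate()

// Stand-in streaming source and a hypothetical static reference dataset.
val events  = spark.readStream.format("rate").load()
val devices = spark.read.json("file:///tmp/device-metadata.json")

// Enrich each event with reference data (a stream-static join).
val enriched = events.join(devices, events("value") === devices("deviceKey"), "left_outer")

// Route different slices of the same enriched stream to different sinks.
val hotPath = enriched.filter(col("value") % 100 === 0)
  .writeStream.format("console").start()

val archive = enriched
  .writeStream.format("memory").queryName("archive").start()
```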

There are three main scenarios where you may want to leverage Data Accelerator:

  1. You are looking to set up a streaming data pipeline on Spark and want to see the end-to-end running before the afternoon is over.
  2. You are in a heavy development-exploration-prototype cycle and need the fast develop-debug experience.
  3. You are looking to get into Spark streaming, but don’t want to configure and glue together the interop to support all the individual components or learn Scala before committing.

Data Accelerator is useful in other situations as well (we’ve had an instance in production since late 2017), but the greatest advantages of the toolset show up before the production environment has settled down into a routine of maintenance and servicing updates.

How Data Accelerator supports your pipeline needs

Data Accelerator supports three tiers of engagement with your data pipeline.

  • First level: the entire pipeline can be set up and deployed without writing any code at all. The configuration UI and designers combine to enable setup, rules generation, and creation of alerts on data content, allowing you to roll out a prototype before you roll up your sleeves.
  • Second level: the developer experience adds editor quality-of-life features modeled on VS Code, along with syntax extensions that support fast authoring of Spark SQL queries through additions like Live Query, time windowing, in-memory accumulators, and more.
  • Third level: integration with custom code written in Scala or via Azure Functions is available, for complete control over customizations (a minimal sketch follows).
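As a flavor of that third tier, here is a minimal sketch of registering a Scala UDF that SQL in the pipeline could then call; the function name and logic are invented for illustration and are not part of Data Accelerator's shipped API.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()

// Hypothetical UDF: bucket raw latency values into coarse bands.
spark.udf.register("latencyBand", (ms: Double) =>
  if (ms < 100) "fast" else if (ms < 1000) "ok" else "slow"
)

// Once registered, the SQL in the pipeline can call it directly, e.g.:
//   SELECT deviceId, latencyBand(latencyMs) AS band FROM events
```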

To help you learn about Data Accelerator, we’ve created dozens of tutorials, a documentation wiki, and a couple of live samples that deploy via source, Azure ARM template, or Docker container on Linux, Mac, or Windows.

Dig In!

We are excited to share this tool with the wider community, to help others learn and evaluate streaming options when they are facing down a big data challenge on Apache Spark. You can find all the tutorials, supporting documentation, and deployment options in our GitHub repository. Docker-deploy one of the samples for your platform of choice and start exploring today.

Questions or feedback? Let us know in the comments below.