{"id":73685,"date":"2018-07-09T10:00:25","date_gmt":"2018-07-09T17:00:25","guid":{"rendered":""},"modified":"2025-01-27T13:11:04","modified_gmt":"2025-01-27T21:11:04","slug":"how-to-data-processing-apache-kafka-spark","status":"publish","type":"post","link":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/","title":{"rendered":"How to process streams of data with Apache Kafka and Spark"},"content":{"rendered":"\n<p>Data is produced every second, it comes from millions of sources and is constantly growing.<\/p>\n\n\n\n<p>Have you ever thought how much data you personally are generating every day?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-direct-result-of-our-actions\">Data: direct result of our actions<\/h2>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1-1024x519.webp\" alt=\"An image of pendulum balls with a person dragging one red ball backward.\" class=\"wp-image-73686 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1-1024x519.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1-300x152.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1-768x389.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1-330x167.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1-800x406.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1-400x203.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1.webp 1306w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-1-1024x519.webp\"><\/figure>\n\n\n\n<p>There\u2019s data generated as a direct result of our actions and 
activities:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Browsing Twitter<\/li>\n\n\n\n<li>Using mobile apps<\/li>\n\n\n\n<li>Performing financial transactions<\/li>\n\n\n\n<li>Using a navigator in your car<\/li>\n\n\n\n<li>Booking a train ticket<\/li>\n\n\n\n<li>Creating an online document<\/li>\n\n\n\n<li>Starting a YouTube live stream<\/li>\n<\/ul>\n\n\n\n<p>Obviously, that\u2019s not all.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-produced-as-a-side-effect\">Data: produced as a side effect<\/h2>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2-1024x643.webp\" alt=\"dominos in an S shape\" class=\"wp-image-73687 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2-1024x643.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2-300x188.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2-768x483.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2-330x207.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2-800x503.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2-400x251.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2.webp 1146w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-2-1024x643.webp\"><\/figure>\n\n\n\n<p>For example, a purchase that seems like buying just one thing might generate hundreds of requests that send and generate data.<\/p>\n\n\n\n<p>Saving a document in the cloud doesn\u2019t mean storing it on one server; it means replicating it across multiple regions for fault-tolerance and availability.<\/p>\n\n\n\n<p>Performing a financial transaction 
doesn\u2019t mean just doing the domain-specific operation.<br>It also means storing logs and detailed information about every single micro step of the process, to be able to recover things if they go wrong.<\/p>\n\n\n\n<p>It also means analyzing peripheral information about it to determine if the transaction is fraudulent or not.<\/p>\n\n\n\n<p>We log tons of data. We cache things for faster access. We replicate data and set up backups. We save data for future analysis.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-continuous-metrics-and-events\">Data: continuous metrics and events<\/h2>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3-1024x412.webp\" alt=\"unlabeled bar graph\" class=\"wp-image-73688 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3-1024x412.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3-300x121.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3-768x309.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3-330x133.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3-800x322.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3-400x161.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3.webp 1386w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-3-1024x412.webp\"><\/figure>\n\n\n\n<p>There\u2019s data we track that is constantly produced by systems, sensors, and IoT devices.<\/p>\n\n\n\n<p>It can be regular system metrics that capture the state of web servers and their load, performance, etc. 
Or data that instrumented applications send out.<\/p>\n\n\n\n<p>Heart rate data, or blood sugar level data. Airplane location and speed data \u2013 to build trajectories and avoid collisions. Real-time camera monitors that observe spaces or objects to determine anomalies in process behavior.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4-1024x317.webp\" alt=\"Reaction options: urgent, not so urgent, and flexible\" class=\"wp-image-73689 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4-1024x317.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4-300x93.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4-768x238.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4-330x102.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4-800x248.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4-400x124.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4.webp 1246w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-4-1024x317.webp\"><\/figure>\n\n\n\n<p>If we think about it \u2013 some of the data is collected to be stored and analyzed later. 
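How quickly we need to react to each kind of data is what later drives the choice between real-time, near-real-time, and batch processing. As a toy illustration (not code from this article; the millisecond thresholds are made-up assumptions), the mapping could be sketched in Scala like this:

```scala
// Toy sketch: map an event source's reaction-time requirement (an assumed
// SLA in milliseconds) to a processing style. The thresholds are invented
// for illustration only.
sealed trait ProcessingMode
case object RealTime     extends ProcessingMode // e.g. patient health monitoring
case object NearRealTime extends ProcessingMode // e.g. credit card fraud checks
case object Batch        extends ProcessingMode // e.g. web server log analysis

def chooseMode(maxReactionMillis: Long): ProcessingMode =
  if (maxReactionMillis < 1000L) RealTime
  else if (maxReactionMillis < 60000L) NearRealTime
  else Batch
```

The point is only that the same event stream architecture can serve very different latency budgets, so the budget should be an explicit input to the design.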
And some of the data is extremely time sensitive.<\/p>\n\n\n\n<p>Analyzing logs of a regular web site isn\u2019t super urgent when we are not risking anyone\u2019s life.<\/p>\n\n\n\n<p>Keeping track of credit card transactions is much more time sensitive because we need to take action immediately to be able to prevent the transaction if it\u2019s malicious.<\/p>\n\n\n\n<p>Then there\u2019s something much more critical, like monitoring health data of patients, where every millisecond matters.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5-1024x490.webp\" alt=\"reaction options labeled based on time - urgent is 'real time', not-so-urgent is 'near real-time', and flexible is 'batch'\" class=\"wp-image-73690 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5-1024x490.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5-300x144.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5-768x368.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5-330x158.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5-800x383.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5-400x191.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5.webp 1270w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-5-1024x490.webp\"><\/figure>\n\n\n\n<p>All of these real-life criteria translate to technical requirements for building a data processing system:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Where and how the data is generated<\/li>\n\n\n\n<li>What is the frequency of changes and updates in the data<\/li>\n\n\n\n<li>How fast we need to react to the 
change<\/li>\n<\/ul>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6-1024x428.webp\" alt=\"event ingestion and processing reaction diagram\" class=\"wp-image-73691 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6-1024x428.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6-300x125.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6-768x321.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6-330x138.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6-800x335.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6-400x167.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6.webp 1406w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-6-1024x428.webp\"><\/figure>\n\n\n\n<p>We need to be able to build solutions that can:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Receive data from a variety of sources<\/li>\n\n\n\n<li>Perform specific computation and analysis on data on the fly<\/li>\n\n\n\n<li>Possibly perform an action as a result<\/li>\n<\/ul>\n\n\n\n<p>We can divide how we think about building such architectures into two conventional parts:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data ingestion and decoupling layer between sources of data and destinations of data<\/li>\n\n\n\n<li>Data processing and reaction<\/li>\n<\/ul>\n\n\n\n<p>Let\u2019s look at some challenges with the first part.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-ingestion-producers-and-consumers\">Data ingestion: producers and consumers<\/h2>\n\n\n\n<p>When we, as engineers, start thinking of building distributed systems that involve a 
lot of data coming in and out, we have to think about the flexibility and architecture of how these streams of data are produced and consumed.<\/p>\n\n\n\n<p>In earlier stages, we might have just a few components: a web application that produces data about user actions, and a database system where all this data is supposed to be stored. At this point we usually have a one-to-one mapping between the data producer (our web app) and the consumer (the database).<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7--1024x356.webp\" alt=\"a close up of two hands\" class=\"wp-image-73692 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7--1024x356.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7--300x104.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7--768x267.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7--330x115.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7--800x278.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7--400x139.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7-.webp 1310w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-7--1024x356.webp\"><\/figure>\n\n\n\n<p>However, when our application grows, the infrastructure grows too: we start introducing new software components, for example a cache, or an analytics system for improving user flows, which also requires the web application to send data to all those new systems.<\/p>\n\n\n\n<p>What if we also introduce a mobile app? Now we have two main sources of data with even more data to keep track 
of.<\/p>\n\n\n\n<p><em>Eventually we grow and end up with many independent data producers, many independent data consumers, and many different sorts of data flowing between them.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"there-are-challenges\">There are challenges!<\/h2>\n\n\n\n<p>How do we make sure that our architecture doesn\u2019t become cumbersome with so many producers and consumers? Especially when the same data should be available to some consumers after being read by other consumers.<\/p>\n\n\n\n<p>How do we prepare to scale when the rate of incoming events changes?<\/p>\n\n\n\n<p>How do we ensure data is durable, so we never lose any important messages?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"apache-kafka\">Apache Kafka<\/h2>\n\n\n\n<p>Apache Kafka is an open-source streaming system.<\/p>\n\n\n<figure class=\"wp-block-image alignright size-thumbnail\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-8-150x150.webp\" alt=\"logo\" class=\"wp-image-73693 webp-format\" style=\"object-fit:cover\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-8-150x150.webp\"><\/figure>\n\n\n\n<p>Kafka is used for building real-time streaming data pipelines that reliably get data between many independent systems or applications.<\/p>\n\n\n\n<p>It allows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Publishing and subscribing to streams of records<\/li>\n\n\n\n<li>Storing streams of records in a fault-tolerant, durable way<\/li>\n<\/ul>\n\n\n\n<p>It provides a unified, high-throughput, low-latency, horizontally scalable platform that is used in production in thousands of companies.<\/p>\n\n\n\n<p>To understand how Kafka does these things, let\u2019s explore a few concepts.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" 
src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9-1024x636.webp\" alt=\"Diagram of Kafka brokers and Zookeeper servers\" class=\"wp-image-73694 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9-1024x636.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9-300x186.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9-768x477.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9-330x205.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9-800x497.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9-400x248.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9.webp 1130w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-9-1024x636.webp\"><\/figure>\n\n\n\n<p>Kafka is run as a cluster on one or more servers that can span multiple datacenters. 
Those servers are usually called brokers.<\/p>\n\n\n\n<p>Kafka uses Zookeeper to store metadata about brokers, topics and partitions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"kafka-topics\">Kafka Topics<\/h2>\n\n\n\n<p>The core abstraction Kafka provides for a stream of records is the topic.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10-1024x393.webp\" alt=\"kafka partition diagram\" class=\"wp-image-73695 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10-1024x393.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10-300x115.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10-768x295.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10-330x127.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10-800x307.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10-400x154.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10.webp 1338w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-10-1024x393.webp\"><\/figure>\n\n\n\n<p>You can think of a topic as a distributed, immutable, append-only, partitioned commit log, to which producers write data and from which consumers read.<\/p>\n\n\n\n<p>Each record in a topic consists of a key, a value, and a timestamp.<\/p>\n\n\n\n<p>A topic can have zero, one, or many consumers that subscribe to the data written to it.<\/p>\n\n\n\n<p>The Kafka cluster durably persists all published records using a configurable retention period, whether or not those records have been consumed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" 
id=\"partitions\">Partitions<\/h2>\n\n\n\n<p>Each partition in a topic is an ordered, immutable sequence of records that is continually appended to a structured commit log.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11-1024x515.webp\" alt=\"diagram\" class=\"wp-image-73696 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11-1024x515.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11-300x151.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11-768x386.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11-330x166.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11-800x402.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11-400x201.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11.webp 1344w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-11-1024x515.webp\"><\/figure>\n\n\n\n<p>The records in the partitions each have an offset &#8211; number that uniquely identifies each record within the partition.<\/p>\n\n\n\n<p>Each individual partition must fit on the broker that hosts it, but a topic may have many partitions distributed over the brokers in the Kafka cluster so it can handle an arbitrary amount of data. Each partition can be replicated across a configurable number of brokers for fault tolerance.<\/p>\n\n\n\n<p>Each partition has one broker which acts as the &#8220;leader\u201d that handles all read and write requests for the partition, and zero or more brokers which act as &#8220;followers\u201d that passively replicate the leader. 
Each broker acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"kafka-producers-and-consumers\">Kafka Producers and Consumers<\/h2>\n\n\n\n<p>Producers publish data to the topics of their choice.<\/p>\n\n\n\n<p>Consumers can subscribe to topics and receive messages. Consumers can act independently or as part of a consumer group.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12-1024x398.webp\" alt=\"Kafka cluster diagram\" class=\"wp-image-73697 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12-1024x398.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12-300x117.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12-768x299.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12-330x128.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12-800x311.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12-400x156.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12.webp 1398w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-12-1024x398.webp\"><\/figure>\n\n\n\n<p>So Kafka not only helps with ingesting large amounts of data, but also works well for small data in environments where numerous systems exchange data in a many-to-many fashion. It allows producers and consumers to move at their own pace, scales well, and is fast.<\/p>\n\n\n\n<p>Now, sometimes we need a system that can process streams of events as soon as they arrive, on the fly, and then perform some action based on 
the results of the processing: an alert or a notification, something that has to happen in real time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"systems-for-stream-processing\">Systems for stream processing<\/h2>\n\n\n\n<p>Even though this article is about Apache Spark, that doesn\u2019t mean it\u2019s the best choice for all use cases. The choice of a streaming platform depends on latency guarantees, community adoption, interoperability with the libraries and ecosystem you&#8217;re using, and more.<\/p>\n\n\n\n<p>When considering building a data processing pipeline, take a look at the leading stream processing frameworks and evaluate them based on your requirements.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-medium\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-13-300x145.webp\" alt=\"words Kafka, Storm, Spark, Flink\" class=\"wp-image-73698 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-13-300x145.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-13-768x371.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-13-330x159.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-13-800x386.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-13-400x193.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-13.webp 998w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/Image-13-300x145.webp\"><\/figure>\n\n\n\n<p>For example, Storm is the oldest framework that is considered a &#8220;true&#8221; stream processing system, because each message is processed as soon as it arrives (vs in mini-batches). 
It provides low latency, though it can be cumbersome and tricky to write logic for some advanced operations and queries on data streams. It has a rather big community.<\/p>\n\n\n\n<p>Kafka Streams is a newer, fast, lightweight stream processing solution that works best if all of your data ingestion comes through Apache Kafka.<\/p>\n\n\n\n<p>Flink is another great, innovative streaming system that supports many advanced features. It has a passionate community that is a bit smaller than Storm\u2019s or Spark\u2019s, but has a lot of potential.<\/p>\n\n\n\n<p>Spark is by far the most general, popular and widely used stream processing system. It is primarily based on a micro-batch processing mode, where events are processed together in specified time intervals. Since the Spark 2.3.0 release, there is an option to switch between micro-batching and an experimental continuous streaming mode.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"apache-spark\">Apache Spark<\/h2>\n\n\n\n<p>Spark is an open source project for large-scale distributed computations.<\/p>\n\n\n\n<p>You can use Spark to build real-time and near-real-time streaming applications that transform or react to the streams of data.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-medium\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-14-300x158.webp\" alt=\"logo for Apache Spark\" class=\"wp-image-73699 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-14-300x158.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-14-600x315.webp 600w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-14-330x174.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-14-400x211.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-14.webp 640w\" 
data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-14-300x158.webp\"><\/figure>\n\n\n\n<p>Spark is similar to MapReduce, but more powerful and much faster: it supports more types of operations than just map and reduce, uses a Directed Acyclic Graph (DAG) execution model, and operates primarily in memory.<\/p>\n\n\n\n<p>As of the latest Spark release it supports both micro-batch and continuous processing execution modes.<\/p>\n\n\n\n<p>Spark can be used with a variety of schedulers, including Hadoop YARN, Apache Mesos, and Kubernetes, or it can run in standalone mode.<\/p>\n\n\n\n<p>We can use Spark SQL and do batch processing, stream processing with Spark Streaming and Structured Streaming, machine learning with MLlib, and graph computations with GraphX.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-spark-works\">How Spark works<\/h2>\n\n\n\n<p>We can submit jobs to run on Spark. At a high level, when we submit a job, Spark creates an operator graph from the code and submits it to the scheduler. 
There, operators are divided into stages of tasks, which correspond to partitions of the input data.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15-1024x617.webp\" alt=\"spark workers and cluster manager diagram\" class=\"wp-image-73700 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15-1024x617.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15-300x181.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15-768x463.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15-330x199.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15-800x482.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15-400x241.webp 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15.webp 1418w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-15-1024x617.webp\"><\/figure>\n\n\n\n<p>Spark has physical nodes called workers, where all the work happens.<\/p>\n\n\n\n<p>A driver coordinates the workers and the overall execution of tasks.<\/p>\n\n\n\n<p>An executor is a distributed agent that is responsible for executing tasks. 
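The split of a job into per-partition tasks can be illustrated with plain Scala collections. This is a conceptual sketch only, not the Spark API: the input is divided into partitions, one task per partition runs a stage independently, and the results are combined at the end.

```scala
// Conceptual sketch of the execution model using plain Scala collections.
val input: Vector[Int] = (1 to 100).toVector
val parts: Vector[Vector[Int]] = input.grouped(25).toVector // 4 partitions

// Stage 1: one task per partition; each task squares and sums its own
// partition. In Spark these would run in parallel on executors.
val taskResults: Vector[Int] = parts.map(part => part.map(x => x * x).sum)

// Stage 2: the per-task results are reduced into the final answer.
val total: Int = taskResults.sum
```

In real Spark the partitions live on different machines and the stages form a DAG, but the shape of the computation is the same: independent tasks over partitions, then a combine step.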
Executors run on worker nodes.<\/p>\n\n\n\n<p>A task is the smallest individual unit of execution.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"apache-kafka-spark-ftw\">Apache Kafka + Spark FTW<\/h2>\n\n\n\n<p>Kafka is great for durable and scalable ingestion of streams of events coming from many producers to many consumers.<\/p>\n\n\n\n<p>Spark is great for processing large amounts of data, including real-time and near-real-time streams of events.<\/p>\n\n\n\n<p>How can we combine and run Apache Kafka and Spark together to achieve our goals?<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"example-processing-streams-of-events-from-multiple-sources-with-apache-kafka-and-spark\">Example: processing streams of events from multiple sources with Apache Kafka and Spark<\/h3>\n\n\n\n<p>I\u2019m running my Kafka and Spark on Azure using services like Azure Databricks and HDInsight. This means I don\u2019t have to manage infrastructure; Azure does it for me.<\/p>\n\n\n\n<p>You\u2019ll be able to follow the example no matter what you use to run Kafka or Spark.<\/p>\n\n\n\n<p><em>Note:<\/em><\/p>\n\n\n\n<p><em>I&#8217;ve previously written about <a href=\"https:\/\/lenadroid.github.io\/posts\/kafka-hdinsight-and-spark-databricks.html\">using Kafka and Spark on Azure<\/a> and <a href=\"https:\/\/lenadroid.github.io\/posts\/sentiment-analysis-streaming-data.html\">Sentiment analysis on streaming data using Apache Spark and Cognitive Services<\/a>. These articles might be interesting to you if you haven&#8217;t seen them yet.<\/em><\/p>\n\n\n\n<p>There are many detailed instructions on how to create Kafka and Spark clusters, so I won\u2019t spend time on that here. 
Instead, we\u2019ll focus on their interaction to understand real-time streaming architecture.<\/p>\n\n\n\n<p>Existing infrastructure and resources:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Kafka cluster (HDInsight or other)<\/li>\n\n\n\n<li>Spark cluster (Azure Databricks workspace, or other)<\/li>\n\n\n\n<li>Peered Kafka and Spark Virtual Networks<\/li>\n\n\n\n<li>Sources of data: Twitter and Slack<\/li>\n<\/ul>\n\n\n\n<p>We are not looking at health data tracking, an airplane collision example, or any other life-or-death kind of example, because there are people who might use the example code for real-life solutions.<\/p>\n\n\n\n<p>Instead, we are going to look at a very atomic and specific example that would be a great starting point for many use cases.<\/p>\n\n\n\n<p>The main points it will demonstrate are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>How to build a decoupling event ingestion layer that would work with multiple independent sources and receiving systems<\/li>\n\n\n\n<li>How to do processing on streams of events coming from multiple input systems<\/li>\n\n\n\n<li>How to react to outcomes of processing logic<\/li>\n\n\n\n<li>How to do it all in a scalable, durable and simple fashion<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"example-story\">Example story\u2026<\/h3>\n\n\n\n<p>Imagine that you\u2019re in charge of a company. You have quite a few competitors in your industry. 
You want to make sure your products and tools are top quality.<\/p>\n\n\n\n<p>As an example, I am using Azure for this purpose, because there are a lot of tweets about Azure, and I&#8217;m interested in what people think about using it, so we can learn what goes well and make it better for engineers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"data-sources-external-internal-and-more\">Data sources: external, internal, and more<\/h2>\n\n\n\n<p>We can look at statements about Azure everywhere on the internet: it can be Twitter, Facebook, GitHub, Stack Overflow, LinkedIn, anywhere. There are hundreds of potential sources.<\/p>\n\n\n\n<p>There are many things we can do with statements about Azure in terms of analysis: some of it might require a reaction and be time-sensitive; some of it might not.<\/p>\n\n\n\n<p>Because I really want the example to be concise, atomic and general, we\u2019re going to analyze user feedback coming from a public data source, like Twitter, and another data source, like Slack \u2013 where people can also share their thoughts about the service.<\/p>\n\n\n\n<p>As a result, we will be watching and analyzing the incoming feedback on the fly, and if it\u2019s too negative, we will need to notify certain groups so they can fix things ASAP.<\/p>\n\n\n\n<p>When it\u2019s positive, we\u2019d like to track it, but we don\u2019t need an immediate response.<\/p>\n\n\n\n<p>Our input feedback data sources are independent, and even though in this example we\u2019re using two input sources for clarity and conciseness, there could easily be hundreds of them, used for many processing tasks at the same time.<\/p>\n\n\n\n<p>In this situation what we can do is build a streaming system that would use Kafka as a scalable, durable, fast decoupling layer that performs data ingestion from Twitter, Slack, and potentially more sources. 
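The reaction rules just described (notify someone when feedback is too negative, quietly track it when it's positive) can be sketched as a tiny scoring-and-routing function. Everything here is a hypothetical illustration: `scoreSentiment` is a toy stand-in for a real sentiment model or API, and the word lists, thresholds, and route names are made up.

```scala
// Hypothetical reaction logic: score a piece of feedback in (0, 1) and route it.
// scoreSentiment is a toy stand-in for a real sentiment model or service call.
def scoreSentiment(text: String): Double = {
  val positive = Set("great", "fast", "love")
  val negative = Set("broken", "slow", "bad")
  val words = text.toLowerCase.split("\\W+").toSeq
  val raw = words.count(positive.contains) - words.count(negative.contains)
  0.5 + 0.5 * math.tanh(raw.toDouble) // squash the word tally into (0, 1)
}

// Route by threshold: very negative feedback needs immediate attention,
// positive feedback is tracked without an urgent response.
def react(feedback: String): String = {
  val score = scoreSentiment(feedback)
  if (score < 0.3) "notify-team"          // too negative: fix things ASAP
  else if (score > 0.7) "track-positive"  // positive: record it, no alert
  else "log-only"
}
```

In the actual pipeline this decision would run inside the Spark job over events read from Kafka, with the scoring delegated to a proper sentiment service.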
It would also analyze the sentiment of the events in near real-time using Spark, and raise notifications for strongly positive or negative processing outcomes!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"feedback-from-slack-azure-databricks-notebook-1\">Feedback from Slack&nbsp;(Azure Databricks Notebook #1)<\/h2>\n\n\n\n<p>I&#8217;m using Azure Databricks interactive notebooks to run the code, as they make a great environment for demonstrations.<\/p>\n\n\n\n<p>The first part of the example is to programmatically send data to Slack, to generate feedback from users.<\/p>\n\n\n\n<p>A Slack Bot API token is necessary to run the code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"slack-api-token\">Slack API Token<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nval superSecretSlackToken = \"xoxb-...\"\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"define-slack-functions\">Define Slack functions<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport com.ullink.slack.simpleslackapi._\nimport com.ullink.slack.simpleslackapi.impl.SlackSessionFactory\nimport java.io.IOException\ndef instantiateSlackSession (token: String): SlackSession = {\n  val session = SlackSessionFactory.createWebSocketSlackSession(token)\n  session.connect()\n  return session\n}\ndef sendMessageToSlack (message: String, channel: SlackChannel, slackSession: SlackSession)\n{\n  slackSession.sendMessage(channel, message)\n}\nval session = instantiateSlackSession(superSecretSlackToken)\nval feedbackChannel = session.findChannelByName(\"feedback\")\nval hiChannel = session.findChannelByName(\"hi\")\n<\/pre><\/div>\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; 
title: ; quick-code: false; notranslate\" title=\"\">\nsendMessageToSlack(\"This is a message from bot. Hi!\", hiChannel, session)\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"listener-for-new-slack-messages-azure-databricks-notebook-2\">Listener for new Slack messages (Azure Databricks Notebook #2)<\/h2>\n\n\n\n<p>When users are sending statements that might be potential feedback, we need to be able to notice and track these messages. For example, we can check if a message is posted in a specific Slack channel and focused on a particular topic, and send it to a specific Kafka topic when it meets our &#8220;feedback&#8221; conditions.<\/p>\n\n\n\n<p>We can do so by overriding the &#8220;onEvent&#8221; method of &#8220;SlackMessagePostedListener&#8221; from the Slack API, and implementing the logic inside of it, including sending qualifying events to a Kafka topic. After defining the listener class, we have to register an instance of it. We can also unregister it when we&#8217;d like to stop receiving feedback from Slack.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"slack-session\">Slack Session<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport com.ullink.slack.simpleslackapi._\nimport com.ullink.slack.simpleslackapi.impl.SlackSessionFactory\nimport java.io.IOException\ndef instantiateSlackSession (token: String) : SlackSession = {\n  val session = SlackSessionFactory.createWebSocketSlackSession(token)\n  session.connect()\n  return session\n}\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"slack-session\">Slack API Token<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nval superSecretSlackToken = \"xoxb-...\"\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"slack-session\">Initialize Slack 
Channel<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nval session = instantiateSlackSession(superSecretSlackToken)\nval feedbackChannel = session.findChannelByName(\"feedback\")\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"slack-session\">Preparing Kafka producer to send Slack messages<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport java.util.Properties\nimport org.apache.kafka.clients.producer.Producer\nimport org.apache.kafka.clients.producer.KafkaProducer\nimport org.apache.kafka.clients.producer.ProducerRecord\ndef buildKafkaProducerForSlack () : KafkaProducer[String, String] = {\n  val kafkaBrokers = \"10.0.0.11:9092,10.0.0.15:9092,10.0.0.12:9092\"\n  val props = new Properties()\n  props.put(\"bootstrap.servers\", kafkaBrokers)\n  props.put(\"acks\", \"all\")\n  props.put(\"key.serializer\", \"org.apache.kafka.common.serialization.StringSerializer\")\n  props.put(\"value.serializer\", \"org.apache.kafka.common.serialization.StringSerializer\")\n  val producer = new KafkaProducer[String, String](props)\n  return producer\n}\nval kafka = buildKafkaProducerForSlack()\nval feedbackTopic =\"azurefeedback\"\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"slack-session\">Defining a Listener for new Slack messages in #feedback channel<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport com.ullink.slack.simpleslackapi.listeners._\nimport com.ullink.slack.simpleslackapi.SlackSession\nimport com.ullink.slack.simpleslackapi.impl.SlackSessionFactory\nimport com.ullink.slack.simpleslackapi.events.SlackMessagePosted\nimport java.io._\nclass MessageToKafkaTopicWriter 
extends SlackMessagePostedListener {\n  override def onEvent(event: SlackMessagePosted, session: SlackSession)\n  {\n     val channelOnWhichMessageWasPosted = event.getChannel()\n     val messageContent = event.getMessageContent()\n     val messageSender = event.getSender()\n     val targetChannel = session.findChannelByName(\"feedback\")\n     if (targetChannel.getId().equals(channelOnWhichMessageWasPosted.getId())\n         && messageContent.contains(\"Azure\"))\n     {\n       sendEventToKafka(messageContent, feedbackTopic)\n     }\n  }\n  def sendEventToKafka(message: String, topicName: String) = {\n    val key = java.util.UUID.randomUUID().toString()\n    kafka.send(new ProducerRecord[String, String](topicName, key, message))\n  }\n}\ndef registeringSlackToKafkaListener(session: SlackSession): MessageToKafkaTopicWriter = {\n  val messagePostedListener = new MessageToKafkaTopicWriter\n  session.addMessagePostedListener(messagePostedListener)\n  return messagePostedListener\n}\ndef removeSlackToKafkaListener (session: SlackSession, l: MessageToKafkaTopicWriter)\n{\n  session.removeMessagePostedListener(l)\n}\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"slack-session\">Register listener instance<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nval slackListener = registeringSlackToKafkaListener(session)\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"slack-session\">Remove listener instance registration<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nremoveSlackToKafkaListener(session, slackListener)\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"sending-twitter-feedback-to-kafka-azure-databricks-notebook-3\">Sending Twitter feedback to Kafka (Azure Databricks Notebook 
#3)<\/h2>\n\n\n\n<p>The majority of public feedback will probably arrive from Twitter. We need to get the latest tweets about a specific topic and send them to Kafka, so that we can receive these events together with feedback from other sources and process them all in Spark.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"preparing-kafka-producer-to-send-tweets\">Preparing Kafka Producer to send tweets<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport java.util.Properties\nimport org.apache.kafka.clients.producer.Producer\nimport org.apache.kafka.clients.producer.KafkaProducer\nimport org.apache.kafka.clients.producer.ProducerRecord\n\/\/ Configuration for Kafka brokers\nval kafkaBrokers = \"10.0.0.11:9092,10.0.0.15:9092,10.0.0.12:9092\"\nval topicName = \"azurefeedback\"\nval props = new Properties()\nprops.put(\"bootstrap.servers\", kafkaBrokers)\nprops.put(\"acks\", \"all\")\nprops.put(\"key.serializer\", \"org.apache.kafka.common.serialization.StringSerializer\")\nprops.put(\"value.serializer\", \"org.apache.kafka.common.serialization.StringSerializer\")\nval producer = new KafkaProducer[String, String](props)\ndef sendEvent(message: String) = {\n  val key = java.util.UUID.randomUUID().toString()\n  producer.send(new ProducerRecord[String, String](topicName, key, message))\n  System.out.println(\"Sent event with key: '\" + key + \"' and message: '\" + message + \"'\\n\")\n}\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"preparing-kafka-producer-to-send-tweets\">Twitter Configuration Keys<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nval twitterConsumerKey=\"\"\nval twitterConsumerSecret=\"\"\nval twitterOauthAccessToken=\"\"\nval twitterOauthTokenSecret=\"\"\n<\/pre><\/div>\n\n\n<h3 
class=\"wp-block-heading\" id=\"preparing-kafka-producer-to-send-tweets\">Getting latest #Azure tweets and sending them to Kafka topic<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport java.util._\nimport scala.collection.JavaConverters._\nimport twitter4j._\nimport twitter4j.TwitterFactory\nimport twitter4j.Twitter\nimport twitter4j.conf.ConfigurationBuilder\nval cb = new ConfigurationBuilder()\n  cb.setDebugEnabled(true)\n  .setOAuthConsumerKey(twitterConsumerKey)\n  .setOAuthConsumerSecret(twitterConsumerSecret)\n  .setOAuthAccessToken(twitterOauthAccessToken)\n  .setOAuthAccessTokenSecret(twitterOauthTokenSecret)\nval twitterFactory = new TwitterFactory(cb.build())\nval twitter = twitterFactory.getInstance()\nval query = new Query(\" #Azure \")\nquery.setCount(100)\nquery.lang(\"en\")\nvar finished = false\nwhile (!finished) {\n  val result = twitter.search(query)\n  val statuses = result.getTweets()\n  var lowestStatusId = Long.MaxValue\n  for (status <- statuses.asScala) {\n    if(!status.isRetweet()){\n      sendEvent(status.getText())\n      Thread.sleep(4000)\n    }\n    lowestStatusId = Math.min(status.getId(), lowestStatusId)\n  }\n  query.setMaxId(lowestStatusId - 1)\n}\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"preparing-kafka-producer-to-send-tweets\">Result of the code output<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nSent event with key: 'ffaead11-1967-48f1-a3f5-7c12a2d69123' and message: 'The latest Microsoft #Azure & #AzureUK Daily! 
https:\/\/t.co\/lyrK0zLvbo #healthcare'\nSent event with key: 'd4f3fb86-5816-4408-bfa5-17ca03eb01de' and message: 'Image Classification using CNTK inside Azure Machine Learning Workbench | Microsoft Docs https:\/\/t.co\/ZTFU6qJ5Uk #azure #azureml #ml'\nSent event with key: 'c78c7498-146f-43c8-9f14-d789da26feeb' and message: 'On the phone with Apple support. Their solutions generally involve some level of resetting settings\/phone and resto\u2026 https:\/\/t.co\/hxAAsS2khR'\nSent event with key: 'a26e8922-f857-4882-a36e-88b40570fe13' and message: 'Create happy customers. #Azure delivers massive scalability and globally consistent experiences to support stellar\u2026 https:\/\/t.co\/7pwtLhR49p'\nSent event with key: 'd243cd59-4336-4bc9-a826-2abbc920156d' and message: 'Microsoft introduces Self-Deploying mode and Windows Autopilot reset to their Windows AutoPilot Deployment Program\u2026 https:\/\/t.co\/jTf7tDbu6S'\nSent event with key: '60dd5224-57c5-44ed-bfdc-1ab958bb30b0' and message: 'Do you know the #Microsoft #Security Story? 
https:\/\/t.co\/nC43z3n88B #virtualization #msft #aad #azure'\nSent event with key: 'd271ccda-9342-46e6-9b5a-cbcce26a899d' and message: 'RT @brunoborges: Say hello to #Jetty support on VS @code!\n#Azure\nhttps:\/\/t.co\/9ht5VNtZyF #Developer #Coder\u2026 https:\/\/t.co\/XY41wq68Lu'\nSent event with key: 'bad66873-b311-45f2-9c5a-8c51349221c4' and message: 'Creating #VirtualMachine (VM) On #Microsoft #Azure by @AkhilMittal20 cc @CsharpCorner https:\/\/t.co\/yucpiOjf6G https:\/\/t.co\/ULwTkGQ8ss'\nSent event with key: 'f7f152a3-74e6-4652-a6e8-f6f317b0c51c' and message: 'Microsoft Azure Stack expands availability to https:\/\/t.co\/trBEqbqnDi #azureblogfeed #Azure #AzureStack #Cloud #CloudOps #Microsoft @Azure'\nSent event with key: '9011daef-c2a1-4e89-99a9-f88c9f7cc775' and message: 'December 2017 Meeting: Serverless Computing \u2013 N https:\/\/t.co\/C2JgBBdOPM #pastmeetings #Azure #AzureStack #Cloud #CloudOps #Microsoft @Azure'\nSent event with key: 'be13d0f6-bf21-40e4-9036-4f63324d4ce3' and message: 'Azure Multi-Factor Authentication: Latency or for more .. https:\/\/t.co\/FboIo3CfWt #azure #Multi-FactorAuthenticat\u2026 https:\/\/t.co\/rvT61KcB1y'\nSent event with key: 'cb2eb344-3d88-4cb7-a3ae-e3fd960e0dc6' and message: 'Top story: David Armour on Twitter: \"Learn how the #Azure cloud brings efficien\u2026 https:\/\/t.co\/h7eBIEeGBz, see more https:\/\/t.co\/dzNIo8XPyk'\nSent event with key: '128340ad-4e2a-4162-a6dd-dbeb5560cf0a' and message: 'That's it! My first blog post just got published!\nLearn how to build a nodeless CI\/CD pipeline with Jenkins, Kubern\u2026 https:\/\/t.co\/oZaKu29ykk'\nSent event with key: '03ae4371-b6c7-407b-b942-41506c631b22' and message: 'RT @pzerger: Azure AD Conditional Access support for blocking legacy auth is in Public Preview! 
https:\/\/t.co\/l8uJo3A8Ek #azuread #azure'\nSent event with key: 'a98f7941-08e6-4875-b4be-f83275671b6c' and message: 'working with one of our great Ecosystem partners STMicroelectronics showing how to @MicrosfotIoT #Azure integrates\u2026 https:\/\/t.co\/FbhkKKFYA0'\nSent event with key: '471e0244-f667-41f8-aa4d-d6075b6e7168' and message: 'Azure\u2005Marketplace new offers: May 1-15 by Azure https:\/\/t.co\/tUFTeGYqLG #azure via DotNetKicks'\nSent event with key: '357ac31d-627f-42c5-a8c0-316ca53dfead' and message: 'Microsoft doubles Azure Stack's footprint, embiggens Azure VMs https:\/\/t.co\/ywDPO4jc7N by @TheRegister #Cloud #Tech #Microsoft #Azure #VM'\nSent event with key: 'ef59abe6-0cf7-4faa-ac4d-16cb56963808' and message: 'Project Natick: Microsoft's\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"analyzing-feedback-in-real-time-azure-databricks-notebook-4\">Analyzing feedback in real-time (Azure Databricks&nbsp;Notebook #4)<\/h2>\n\n\n\n<p>Kafka is now receiving events from many sources.<\/p>\n\n\n\n<p>Now we can proceed with the reaction logic. We're going to do sentiment analysis on incoming Kafka events, and when sentiment is less than 0.3 - we'll send a notification to \"#negative-feedback\" Slack channel for review. 
When sentiment is more than 0.9 - we'll send a message to #positive-feedback channel.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"reading-a-stream-of-feedback-data-from-kafka\">Reading a stream of feedback data from Kafka<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nval kafkaBrokers = \"10.0.0.11:9092,10.0.0.15:9092,10.0.0.12:9092\"\n\/\/ Setup connection to Kafka\nval kafka = spark.readStream\n  .format(\"kafka\")\n  .option(\"kafka.bootstrap.servers\", kafkaBrokers)\n  .option(\"subscribe\", \"azurefeedback\")\n  .option(\"startingOffsets\", \"latest\")\n  .load()\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"reading-a-stream-of-feedback-data-from-kafka\">Transforming binary data into readable representation<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport org.apache.spark.sql.types._\nimport org.apache.spark.sql.functions._\nimport org.apache.spark.sql.functions.{explode, split}\nval kafkaData = kafka\n  .withColumn(\"Key\", $\"key\".cast(StringType))\n  .withColumn(\"Topic\", $\"topic\".cast(StringType))\n  .withColumn(\"Offset\", $\"offset\".cast(LongType))\n  .withColumn(\"Partition\", $\"partition\".cast(IntegerType))\n  .withColumn(\"Timestamp\", $\"timestamp\".cast(TimestampType))\n  .withColumn(\"Value\", $\"value\".cast(StringType))\n  .select(\"Value\")\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"reading-a-stream-of-feedback-data-from-kafka\">What data is coming in?<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nkafkaData.writeStream.outputMode(\"append\").format(\"console\").option(\"truncate\", 
false).start().awaitTermination()\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"reading-a-stream-of-feedback-data-from-kafka\">Cognitive Services API Key<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nval accessKey = \"...\"\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"reading-a-stream-of-feedback-data-from-kafka\">Sentiment Analysis Logic<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport java.io._\nimport java.net._\nimport java.util._\nimport javax.net.ssl.HttpsURLConnection\nimport com.google.gson.Gson\nimport com.google.gson.GsonBuilder\nimport com.google.gson.JsonObject\nimport com.google.gson.JsonParser\nimport scala.util.parsing.json._\nclass Document(var id: String, var text: String, var language: String = \"\", var sentiment: Double = 0.0) extends Serializable\nclass Documents(var documents: List[Document] = new ArrayList[Document]()) extends Serializable {\n    def add(id: String, text: String, language: String = \"\") {\n        documents.add (new Document(id, text, language))\n    }\n    def add(doc: Document) {\n        documents.add (doc)\n    }\n}\nclass CC[T] extends Serializable { def unapply(a:Any):Option[T] = Some(a.asInstanceOf[T]) }\nobject M extends CC[scala.collection.immutable.Map[String, Any]]\nobject L extends CC[scala.collection.immutable.List[Any]]\nobject S extends CC[String]\nobject D extends CC[Double]\nobject SentimentDetector extends Serializable {\n  val host = \"https:\/\/eastus.api.cognitive.microsoft.com\"\n  val languagesPath = \"\/text\/analytics\/v2.0\/languages\"\n  val sentimentPath = \"\/text\/analytics\/v2.0\/sentiment\"\n  val languagesUrl = new URL(host+languagesPath)\n  val sentimenUrl = new URL(host+sentimentPath)\n  def getConnection(path: 
URL): HttpsURLConnection = {\n    val connection = path.openConnection().asInstanceOf[HttpsURLConnection]\n    connection.setRequestMethod(\"POST\")\n    connection.setRequestProperty(\"Content-Type\", \"text\/json\")\n    connection.setRequestProperty(\"Ocp-Apim-Subscription-Key\", accessKey)\n    connection.setDoOutput(true)\n    return connection\n  }\n  def prettify (json_text: String): String = {\n    val parser = new JsonParser()\n    val json = parser.parse(json_text).getAsJsonObject()\n    val gson = new GsonBuilder().setPrettyPrinting().create()\n    return gson.toJson(json)\n  }\n  \/\/ Handles the call to Cognitive Services API.\n  \/\/ Expect Documents as parameters and the address of the API to call.\n  \/\/ Returns an instance of Documents in response.\n  def processUsingApi(inputDocs: Documents, path: URL): String = {\n    val docText = new Gson().toJson(inputDocs)\n    val encoded_text = docText.getBytes(\"UTF-8\")\n    val connection = getConnection(path)\n    val wr = new DataOutputStream(connection.getOutputStream())\n    wr.write(encoded_text, 0, encoded_text.length)\n    wr.flush()\n    wr.close()\n    val response = new StringBuilder()\n    val in = new BufferedReader(new InputStreamReader(connection.getInputStream()))\n    var line = in.readLine()\n    while (line != null) {\n        response.append(line)\n        line = in.readLine()\n    }\n    in.close()\n    return response.toString()\n  }\n  \/\/ Calls the language API for specified documents.\n  \/\/ Returns a documents with language field set.\n  def getLanguage (inputDocs: Documents): Documents = {\n    try {\n      val response = processUsingApi(inputDocs, languagesUrl)\n      \/\/ In case we need to log the json response somewhere\n      val niceResponse = prettify(response)\n      val docs = new Documents()\n      val result = for {\n            \/\/ Deserializing the JSON response from the API into Scala types\n            Some(M(map)) <- 
scala.collection.immutable.List(JSON.parseFull(niceResponse))\n            L(documents) = map(\"documents\")\n            M(document) <- documents\n            S(id) = document(\"id\")\n            L(detectedLanguages) = document(\"detectedLanguages\")\n            M(detectedLanguage) <- detectedLanguages\n            S(language) = detectedLanguage(\"iso6391Name\")\n      } yield {\n            docs.add(new Document(id = id, text = id, language = language))\n      }\n      return docs\n    } catch {\n          case e: Exception => return new Documents()\n    }\n  }\n  \/\/ Calls the sentiment API for specified documents. Needs a language field to be set for each of them.\n  \/\/ Returns documents with sentiment field set, taking a value in the range from 0 to 1.\n  def getSentiment (inputDocs: Documents): Documents = {\n    try {\n      val response = processUsingApi(inputDocs, sentimenUrl)\n      val niceResponse = prettify(response)\n      val docs = new Documents()\n      val result = for {\n            \/\/ Deserializing the JSON response from the API into Scala types\n            Some(M(map)) <- scala.collection.immutable.List(JSON.parseFull(niceResponse))\n            L(documents) = map(\"documents\")\n            M(document) <- documents\n            S(id) = document(\"id\")\n            D(sentiment) = document(\"score\")\n      } yield {\n            docs.add(new Document(id = id, text = id, sentiment = sentiment))\n      }\n      return docs\n    } catch {\n        case e: Exception => return new Documents()\n    }\n  }\n}\n\/\/ User Defined Function for processing content of messages to return their sentiment.\nval toSentiment = udf((textContent: String) => {\n  val inputDocs = new Documents()\n  inputDocs.add (textContent, textContent)\n  val docsWithLanguage = SentimentDetector.getLanguage(inputDocs)\n  val docsWithSentiment = SentimentDetector.getSentiment(docsWithLanguage)\n  \/\/ Check the final result: if either the language or the sentiment API call failed, the list is empty\n  if (docsWithSentiment.documents.isEmpty) {\n    \/\/ Placeholder value to 
display for no score returned by the sentiment API\n    (-1).toDouble\n  } else {\n    docsWithSentiment.documents.get(0).sentiment.toDouble\n  }\n})\n<\/pre><\/div>\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n\/\/ Preparing a dataframe with Content and Sentiment columns\nval streamingDataFrame = kafka.selectExpr(\"cast (value as string) AS Content\").withColumn(\"Sentiment\", toSentiment($\"Content\"))\n\/\/ Displaying the streaming data\nstreamingDataFrame.writeStream.outputMode(\"append\").format(\"console\").option(\"truncate\", false).start().awaitTermination()\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"reading-a-stream-of-feedback-data-from-kafka\">Slack API Token<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nval superSecretSlackToken = \"xoxb-...\"\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"reading-a-stream-of-feedback-data-from-kafka\">Feedback analysis and action on a streaming Data Frame<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\nimport com.ullink.slack.simpleslackapi._\nimport com.ullink.slack.simpleslackapi.impl.SlackSessionFactory\nimport java.io.IOException\nobject SlackUtil extends Serializable {\n  def sendSlackMessage(content: String, channelName: String){\n    val session = SlackSessionFactory.createWebSocketSlackSession(superSecretSlackToken)\n    session.connect()\n    val channel = session.findChannelByName(channelName)\n    session.sendMessage(channel, content)\n  }\n}\nval toSlack = udf((textContent: String, sentiment: Double) => {\n   if (sentiment <= 0.3) {\n     SlackUtil.sendSlackMessage(textContent, \"negative-feedback\")\n     \"Sent for review to #negative-feedback.\"\n   } else if 
(sentiment >= 0.9)  {\n     SlackUtil.sendSlackMessage(textContent, \"positive-feedback\")\n     \"Logged as positive in #positive-feedback\"\n   } else {\n     \"Not logged.\"\n   }\n})\nval comments =\n  kafka\n  .selectExpr(\"cast (value as string) AS Content\")\n  .withColumn(\"Sentiment\", toSentiment($\"Content\"))\n  .withColumn(\"Status\", toSlack($\"Content\", $\"Sentiment\"))\ncomments\n  .writeStream\n  .outputMode(\"append\")\n  .format(\"console\")\n  .option(\"truncate\", false)\n  .start()\n  .awaitTermination()\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"result\">Result<\/h2>\n\n\n\n<p>What we have achieved here is a streaming system where we can watch user feedback from numerous independent sources that produce data at their own pace, and get real-time insights about when we need to take action and pay attention to things that need to be fixed.<\/p>\n\n\n\n<p>We can plug in many additional independent processing scenarios, because once we send data to Kafka, it is retained and available for consumption many times.<\/p>\n\n\n\n<p>We don\u2019t need to worry about spikes in user activity or handling the load, because Kafka and Spark allow us to scale out and handle the pressure easily thanks to partitioning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"more-things\">More things<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"spark-and-experimental-continuous-processing-mode\">Spark and experimental \"Continuous Processing\" mode<\/h3>\n\n\n\n<p>Traditionally, Spark has operated in micro-batch processing mode.<\/p>\n\n\n\n<p>In Apache Spark 2.3.0, Continuous Processing mode is an experimental feature for millisecond-scale, low-latency end-to-end event processing. It provides at-least-once fault-tolerance guarantees. Currently, not all operations are supported yet (e.g. 
aggregation functions, current_timestamp() and current_date() are not supported), there are no automatic retries of failed tasks, and you need to ensure there's enough cluster power\/cores to operate efficiently.<br>To use it, add a trigger:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\ntrigger(Trigger.Continuous(\"1 second\"))\n<\/pre><\/div>\n\n\n<p>A checkpoint interval of 1 second means that the continuous processing engine will record the progress of the query every second.<\/p>\n\n\n\n<p>Currently it\u2019s best to use it with Kafka as both the source and the sink, for the lowest end-to-end latency.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"how-is-it-different-from-micro-batch\">How is it different from micro-batch<\/h3>\n\n\n\n<p>With micro-batch processing, the Spark streaming engine periodically checks the streaming source and runs a batch query on new data that has arrived since the last batch ended.<\/p>\n\n\n\n<p>This way, latencies end up being around hundreds of milliseconds.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-1024x425.webp\" alt=\"a close up of a map\" class=\"wp-image-73701 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16.webp 1282w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-300x124.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-1024x425.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-768x319.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-330x137.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-800x332.webp 800w, 
https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-400x166.webp 400w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16.png 1282w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-300x124.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-1024x425.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-768x319.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-330x137.png 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-800x332.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-16-400x166.png 400w\"><\/figure>\n\n\n\n<p>The Spark driver checkpoints progress by saving record offsets to a write-ahead log, which may then be used to restart the query.<\/p>\n\n\n\n<p>Offsets for the next batch of records are recorded before that batch starts processing.<\/p>\n\n\n\n<p>This way, some records have to wait until the end of the current micro-batch to be processed, and this takes time.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"how-continuous-processing-mode-works\">How \"Continuous Processing\" mode works<\/h3>\n\n\n\n<p>Spark launches a number of long-running tasks. 
They constantly read, process and write data.<\/p>\n\n\n\n<p>Events are processed as soon as they\u2019re available at the source.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-1024x425.webp\" alt=\"a close up of a logo\" class=\"wp-image-73702 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17.webp 1296w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-300x125.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-1024x425.webp 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-768x319.webp 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-330x137.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-800x332.webp 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-400x166.webp 400w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17.png 1296w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-300x125.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-1024x425.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-768x319.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-330x137.png 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-800x332.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-17-400x166.png 400w\"><\/figure>\n\n\n\n<p>In distinction to micro-batch mode, processed record offsets are 
saved to the log after every epoch.<\/p>\n\n\n\n<p>End-to-end latencies of just a few milliseconds are achievable because events are processed and written to the sink as soon as they are available at the source, without waiting for other records.<\/p>\n\n\n\n<p>Also, checkpointing is fully asynchronous and based on the Chandy-Lamport algorithm, so tasks are never interrupted and Spark is able to provide consistent millisecond-level latencies.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"another-new-thing-event-hubs-kafka\">Another new thing: Event Hubs + Kafka = \u2764\ufe0f<\/h3>\n\n\n\n<p>For those of you who like to use cloud environments for big data processing, this might be interesting.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18.webp\" alt=\"Kafka symbols\" class=\"wp-image-73703 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18.webp 710w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18-300x152.webp 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18-330x167.webp 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18-400x203.webp 400w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18.png 710w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18-300x152.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18-330x167.png 330w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2018\/07\/image-18-400x203.png 400w\"><\/figure>\n\n\n\n<p>Azure now has an alternative to running Kafka on HDInsight. 
You can leave your existing Kafka applications as-is and use Event Hubs as a backend through the Kafka API.<\/p>\n\n\n\n<p>Event Hubs is a service for streaming data on Azure, conceptually very similar to Kafka.<\/p>\n\n\n\n<p>In other words, Event Hubs for Kafka ecosystems provides a Kafka endpoint that can be used by your existing Kafka-based applications as an alternative to running your own Kafka cluster.<\/p>\n\n\n\n<p>It supports Apache Kafka 1.0 and newer client versions, and works with existing Kafka applications, including MirrorMaker: all you have to do is change the connection string and start streaming events from your applications that use the Kafka protocol into Event Hubs.<\/p>\n\n\n\n<p>Functionally, of course, Event Hubs and Kafka are two different things. But this feature can be useful if you already have services written to work with Kafka, and you'd rather not manage any infrastructure while trying Event Hubs as a backend without changing your code.<\/p>\n\n\n\n<p>Note:<\/p>\n\n\n\n<p>I also wrote a tutorial on how to use Spark and Event Hubs <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/azure-databricks\/databricks-stream-from-eventhubs?WT.mc_id=sparkeventhubs-blog-alehall\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"kubernetes-environment\">Kubernetes environment<\/h3>\n\n\n\n<p>Many businesses lean towards Kubernetes for big data processing, as it helps reduce costs, offers a lot of flexibility, and is convenient when many of your existing services are already running on it.<\/p>\n\n\n\n<p>Existing Kubernetes abstractions like StatefulSets are great building blocks for running stateful processing services, but they are often not enough to guarantee correct operation for systems like Kafka or Spark.<\/p>\n\n\n\n<p>I recommend looking into the topic of Operators, as they help make Kubernetes aware of the nuances involved in running these frameworks correctly.<\/p>\n\n\n\n<p>Confluent has announced the upcoming 
Kafka operator, coming out in a month or two, and I'm looking forward to trying it. More updates to come.<\/p>\n\n\n\n<p>For those of you interested in running Spark on Kubernetes, Spark has had experimental (not yet production-ready) native Kubernetes support since version 2.3. I wrote about it <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/aks\/spark-job?WT.mc_id=sparkaks-blog-alehall\">here<\/a>, <a href=\"https:\/\/github.com\/lenadroid\/show-episode-recommender\">here<\/a>, and <a href=\"https:\/\/github.com\/lenadroid\/goto-cassandra-spark\">here<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"thank-you-for-reading\">Thank you for reading!<\/h2>\n\n\n\n<p>I'd be happy to know if you liked the article or if it was useful to you. Follow me on Twitter <a href=\"https:\/\/twitter.com\/lenadroid\">@lenadroid<\/a> or on <a href=\"https:\/\/www.youtube.com\/channel\/UC93BPd533EKzu_L3LeMo6Gw\">YouTube<\/a>. Always happy to connect, feel free to reach out! For more on <a href=\"https:\/\/lenadroid.github.io\/posts\/distributed-data-streaming-action.html\">this tutorial<\/a> and other projects I'm working on, please check out <a href=\"https:\/\/lenadroid.github.io\/posts.html\">my website<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data is produced every second, it comes from millions of sources and is constantly growing. Have you ever thought how much data you personally are generating every day? 
Data: direct result of our actions There\u2019s data generated as a direct result of our actions and activities: Obviously, that\u2019s not it.<\/p>\n","protected":false},"author":5562,"featured_media":95475,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"msxcm_post_with_no_image":false,"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","footnotes":""},"post_tag":[89,158,2272,166,212],"content-type":[361],"topic":[2239,2249,2241,2242],"programming-languages":[],"coauthors":[2340],"class_list":["post-73685","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-apache","tag-kubernetes","tag-microsoft","tag-azure","tag-spark","content-type-project-updates","topic-analytics","topic-big-data","topic-cloud","topic-containers","review-flag-1593580428-734","review-flag-1593580415-931","review-flag-1593580419-521","review-flag-1593580771-946","review-flag-1-1593580432-963","review-flag-2-1593580437-411","review-flag-3-1593580442-169","review-flag-4-1593580448-609","review-flag-8-1593580468-572","review-flag-9-1593580473-997","review-flag-alway-1593580310-39","review-flag-anywh-1593580318-567","review-flag-free-1593619513-693","review-flag-iot-1680213327-385","review-flag-machi-1680214156-53","review-flag-ml-1680214110-748","review-flag-new-1593580248-669","review-flag-publi-1593580761-124","review-flag-vm-1593580807-312"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>How to process streams of data with Apache Kafka and Spark<\/title>\n<meta name=\"description\" content=\"Tutorial for how to process streams of data with Apache Kafka and Spark, including ingestion, processing, reaction, and examples.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to process streams of data with Apache Kafka and Spark\" \/>\n<meta property=\"og:description\" content=\"Tutorial for how to process streams of data with Apache Kafka and Spark, including ingestion, processing, reaction, and examples.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Open Source Blog\" \/>\n<meta property=\"article:published_time\" content=\"2018-07-09T17:00:25+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-01-27T21:11:04+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1170\" \/>\n\t<meta property=\"og:image:height\" content=\"640\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Lena Hall\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:site\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Lena Hall\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"22 min read\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/\"},\"author\":[{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/author\/lena-hall\/\",\"@type\":\"Person\",\"@name\":\"Lena Hall\"}],\"headline\":\"How to process streams of data with Apache Kafka and Spark\",\"datePublished\":\"2018-07-09T17:00:25+00:00\",\"dateModified\":\"2025-01-27T21:11:04+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/\"},\"wordCount\":3619,\"commentCount\":3,\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.webp\",\"keywords\":[\"Apache\",\"Kubernetes\",\"Microsoft\",\"Microsoft Azure\",\"Spark\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/\",\"name\":\"How to process streams of data with Apache Kafka and 
Spark\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.webp\",\"datePublished\":\"2018-07-09T17:00:25+00:00\",\"dateModified\":\"2025-01-27T21:11:04+00:00\",\"description\":\"Tutorial for how to process streams of data with Apache Kafka and Spark, including ingestion, processing, reaction, and examples.\",\"breadcrumb\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#primaryimage\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.webp\",\"contentUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.webp\",\"width\":1170,\"height\":640},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/opensource.microsoft.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How to process streams of data with Apache Kafka and 
Spark\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#website\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/\",\"name\":\"Microsoft Open Source Blog\",\"description\":\"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability\",\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\",\"name\":\"Microsoft Open Source Blog\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png\",\"contentUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png\",\"width\":259,\"height\":194,\"caption\":\"Microsoft Open Source Blog\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/OpenAtMicrosoft\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"How to process streams of data with Apache Kafka and Spark","description":"Tutorial for how to process streams of data with Apache Kafka and Spark, including ingestion, processing, reaction, and examples.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/","og_locale":"en_US","og_type":"article","og_title":"How to process streams of data with Apache Kafka and Spark","og_description":"Tutorial for how to process streams of data with Apache Kafka and Spark, including ingestion, processing, reaction, and examples.","og_url":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/","og_site_name":"Microsoft Open Source Blog","article_published_time":"2018-07-09T17:00:25+00:00","article_modified_time":"2025-01-27T21:11:04+00:00","og_image":[{"width":1170,"height":640,"url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.png","type":"image\/png"}],"author":"Lena Hall","twitter_card":"summary_large_image","twitter_creator":"@OpenAtMicrosoft","twitter_site":"@OpenAtMicrosoft","twitter_misc":{"Written by":"Lena Hall","Est. 
reading time":"22 min read"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#article","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/"},"author":[{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/lena-hall\/","@type":"Person","@name":"Lena Hall"}],"headline":"How to process streams of data with Apache Kafka and Spark","datePublished":"2018-07-09T17:00:25+00:00","dateModified":"2025-01-27T21:11:04+00:00","mainEntityOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/"},"wordCount":3619,"commentCount":3,"publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.webp","keywords":["Apache","Kubernetes","Microsoft","Microsoft Azure","Spark"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/","url":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/","name":"How to process streams of data with Apache Kafka and 
Spark","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#primaryimage"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.webp","datePublished":"2018-07-09T17:00:25+00:00","dateModified":"2025-01-27T21:11:04+00:00","description":"Tutorial for how to process streams of data with Apache Kafka and Spark, including ingestion, processing, reaction, and examples.","breadcrumb":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#primaryimage","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.webp","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO24-Azure-Retail-025.webp","width":1170,"height":640},{"@type":"BreadcrumbList","@id":"https:\/\/opensource.microsoft.com\/blog\/2018\/07\/09\/how-to-data-processing-apache-kafka-spark\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/opensource.microsoft.com\/blog\/"},{"@type":"ListItem","position":2,"name":"How to process streams of data with Apache Kafka and Spark"}]},{"@type":"WebSite","@id":"https:\/\/opensource.microsoft.com\/blog\/#website","url":"https:\/\/opensource.microsoft.com\/blog\/","name":"Microsoft Open Source Blog","description":"Open dialogue 
about openness at Microsoft \u2013 open source, standards, interoperability","publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/opensource.microsoft.com\/blog\/#organization","name":"Microsoft Open Source Blog","url":"https:\/\/opensource.microsoft.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","width":259,"height":194,"caption":"Microsoft Open Source Blog"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/OpenAtMicrosoft"]}]}},"msxcm_display_generated_audio":false,"msxcm_animated_featured_image":null,"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Microsoft Open Source 
Blog","distributor_original_site_url":"https:\/\/opensource.microsoft.com\/blog","push-errors":false,"_links":{"self":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/73685","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/users\/5562"}],"replies":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/comments?post=73685"}],"version-history":[{"count":4,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/73685\/revisions"}],"predecessor-version":[{"id":97047,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/73685\/revisions\/97047"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media\/95475"}],"wp:attachment":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media?parent=73685"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/post_tag?post=73685"},{"taxonomy":"content-type","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/content-type?post=73685"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/topic?post=73685"},{"taxonomy":"programming-languages","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/programming-languages?post=73685"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/coauthors?post=73685"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}