Why Spark Streaming
Many important applications must process large streams of live data and provide results in near-real-time
- Social network trends
- Website statistics
- Intrusion detection systems etc
Requires low latencies for faster processing
What is Spark Streaming
Framework for large scale stream processing
- Scales to 100s of nodes
- Can achieve second scale latencies
- Integrates with Spark’s batch and interactive processing
- Provides a simple batch-like API for implementing complex algorithm
- Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
Framework for Spark Streaming
Drawbacks of Spark Streaming
- Handling state for long time and efficiently is a challenge in these systems
- Lambda architecture forces the duplication of efforts in stream and batch
- As the API is limited, doing any kind of complex operation takes lot of effort
- No clear abstractions for handling stream specific interactions like late events, event time, state recovery etc.
Structured Spark Streaming
- In structured streaming, a stream is modelled as an infinite table aka infinite Dataset
- As we are using structured abstraction, it’s called structured streaming API
- All input sources, stream transformations and output sinks modelled as Dataset
- As Dataset is underlying abstraction, stream transformations are represented using SQL and Dataset DSL
- Complete Mode - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
- Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
- Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage
Handles event time and late data
Fault tolerant semantics using source and sink
Window operations on Event time
In structured streaming, data is processed window by window instead of batch by batch
This handles out of order data with the help of timestamps attached with original data
Handling late data using watermarking
Watermark tracks a point in time before which it is assumed no more late events are supposed to arrive
It handles late data
Spark supports the following different types of joins
- Static - Static : Inner, left outer, right outer and full outer. All are supported.
- Stream joins with static data : Only inner joins are supported
- Stream-Stream joins : Full outer join is not supported
We will do a deeper dive into stream stream joins in the following slides
Stream Stream - Inner Joins
Simple inner join - Need to buffer all past events as state, a match can come on the other stream any time in the future . This can be time consuming.
Inner join with watermark and time constraints – Time constraints need to be provided so that buffered events can be dropped. We take the example of ad monetisation.
We need 3 watermarks in total. One for each stream, and third for associating each click with impression. As, its not necessary that a user will click on the ad as soon as it appears.
Time constraints and watermarks are necessary for left, outer and right joins
In case of right outer join, its necessary for us to give a watermark for the left stream, vice versa for left outer join.
Overview of Stream-Stream Joins
- Inner – Optional to specify watermark on both sides + time constraints for state cleanup
- Left Outer – Watermark must be specified on the right + time constraints for correct results, optionally specify watermark on left for all state cleanup
- Right Outer – Watermark must be specified on the left + time constraints for correct results, optionally specify watermark on right for all state cleanup
- Full Outer - Not supported
About the author
Sonali Patro completed her BTech in Computer Science from NIT Rourkela in 2018. She has been working as a Data Engineer in Whiteklay since july 2018.