Amazon IoT and Kinesis: Introducing Zeppelin

In my last post, IoT with Amazon Kinesis and Spark Streaming, I discussed connecting Spark Streaming with Amazon IoT and Kinesis. Today I would like to show how to add Apache Zeppelin to the mix.

Apache Zeppelin is a web-based notebook tool. It gives data scientists easy access to big data tools such as Spark and Hive, and its Angular interpreter provides an integration point for JavaScript visualization libraries such as D3 and Plotly. In addition, Zeppelin has some built-in visualizations that can be leveraged for quick-and-dirty dashboards.

Notebooks make it possible to build interactive visualizations without needing to deploy code onto a big data platform. To follow along with this demonstration, please make sure you have set up your environment and are able to run the demo from my last post, as we will again be using the Simple Beer Simulator (SBS).

The code for this demo is slightly different from the last example. Instead of writing the data to the file system, we will load it into a DataFrame that can be queried and rendered using Spark SQL.

As data is received, we will use Zeppelin's z.run() command to update the notebook paragraph containing our visualization.

Update the interpreter dependencies

In order to run our sample application, we must add the Amazon Kinesis client jars to the notebook classpath. This is done via the Dependencies section in the interpreter settings: find the Spark interpreter, and at the bottom you will find Dependencies.

Add: org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0
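Alternatively, if you would rather keep the dependency with the notebook itself, Zeppelin's dep interpreter can load the same artifact at runtime. Note that this paragraph must run before the Spark interpreter starts:

```
%spark.dep
z.reset()
z.load("org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0")
```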

Run the SBS

As before, we will post sample data to Kinesis using the SBS.

Start Zeppelin

Today we will be using an integrated Docker container that includes both Zeppelin and Spark.

In a real-world situation it would be better to use a distributed install of Zeppelin to leverage the capacity of a multi-node cluster. Since we are only dealing with a single Kinesis shard, we can easily support this use case with a container running on a laptop.

Log in to Zeppelin

Open http://localhost:8080 in your browser.

Import the notebook

If you wish to use my notebook, import it into Zeppelin from the GitHub repo, or copy the relevant sections into new paragraphs.

Run the notebook paragraphs

The first time through, I recommend running each paragraph step by step. This way you can troubleshoot any issues before running the notebook top to bottom. The first paragraph you run may take a bit of time, as Zeppelin starts the Spark interpreter lazily the first time it is needed.

Import the Dependencies
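A minimal sketch of the imports this paragraph needs, matching the Spark 2.1 Kinesis ASL artifact added above:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
```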

Set some variables

Make sure to add a valid AWS key and secret here.
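A sketch of what this paragraph might contain; every value below is a placeholder, so substitute your own stream name, endpoint, region, and credentials:

```scala
// All values below are placeholders -- substitute your own.
val appName       = "ZeppelinKinesisDemo"  // also names the DynamoDB checkpoint table
val streamName    = "iot-data"             // the Kinesis stream the SBS writes to
val endpointUrl   = "https://kinesis.us-east-1.amazonaws.com"
val regionName    = "us-east-1"
val awsAccessKey  = "YOUR_AWS_ACCESS_KEY_ID"
val awsSecretKey  = "YOUR_AWS_SECRET_KEY"
val batchInterval = Seconds(10)
```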

Set up the streaming portion
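A minimal sketch of the streaming paragraph, assuming the variables above: it creates a StreamingContext, attaches a Kinesis receiver, and on each batch decodes the SBS's JSON records into a DataFrame registered as a temp view for Spark SQL.

```scala
val ssc = new StreamingContext(sc, batchInterval)

// A single receiver is enough for a one-shard stream.
val kinesisStream = KinesisUtils.createStream(
  ssc, appName, streamName, endpointUrl, regionName,
  InitialPositionInStream.LATEST, batchInterval,
  StorageLevel.MEMORY_AND_DISK_2, awsAccessKey, awsSecretKey)

// Each Kinesis record arrives as raw bytes; decode the JSON emitted by
// the SBS and expose the current batch to Spark SQL as a temp view.
kinesisStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val df = spark.read.json(rdd.map(bytes => new String(bytes, "UTF-8")))
    df.createOrReplaceTempView("sbs")
  }
}
```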

Execute
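Starting the context is its own paragraph. Note that we call start() but not awaitTermination(), which would block the notebook:

```scala
ssc.start()  // begins pulling records from Kinesis; the paragraph returns immediately
```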

Visualize
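The visualization paragraph is simply a Spark SQL query against the temp view, rendered with one of Zeppelin's built-in charts. The column names here are assumptions about the SBS's JSON schema, so adjust them to match what your simulator actually emits:

```sql
%sql
select deviceId, avg(temperature) as avgTemperature
from sbs
group by deviceId
```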

Update

Add a z.run() statement to the streaming portion to force the graph to refresh each time a new batch of events is processed. You can get the paragraph ID for the graph by selecting the gear icon on that paragraph and copying the value to the clipboard.
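A sketch of the modified foreachRDD; the paragraph ID shown is a placeholder, so paste in the one you copied from the gear icon:

```scala
kinesisStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    spark.read.json(rdd.map(bytes => new String(bytes, "UTF-8")))
      .createOrReplaceTempView("sbs")
    // "20170101-000000_1234567890" is a placeholder -- use the ID of
    // your %sql visualization paragraph here.
    z.run("20170101-000000_1234567890")
  }
}
```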

Conclusion

At the end of the day, this illustrates how to use Zeppelin to visualize data as it is retrieved by Spark Streaming. It is a rather contrived example, as you would most likely want some sort of persistent storage to save the results before they are queried; with the setup above, you can only query the data from the last batch interval. One possible solution is to drop the data from the streaming job into a cache backed by Elasticsearch or Redis and then build your chart by querying the desired time window.
