StreamDM: Advanced data science with Spark Streaming


An overview of StreamDM, an open source library for real-time analytics built on top of Spark Streaming and developed at Huawei’s Noah’s Ark Lab and Télécom ParisTech.

StreamDM’s tools and algorithms are designed specifically for data streams. Because real-time streams produce large volumes of data that must be processed as they arrive, such methods need to be extremely time efficient while using very little memory. StreamDM is the first library to include advanced stream-mining algorithms for Spark Streaming and is intended to be the open source gathering point for research on and implementation of data stream mining methods, while also allowing practical deployments on real-world datasets.
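
To make the time and memory constraint concrete, here is a minimal sketch in plain Python of a learner that sees each example once and keeps only a fixed-size weight vector. It does not use StreamDM or Spark Streaming, and every name in it (`OnlinePerceptron`, `prequential_accuracy`, `synthetic_stream`) is hypothetical and chosen only for illustration. The evaluation follows the test-then-train (prequential) scheme commonly used in stream mining: each example is first used to test the model and only then to update it.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Example = Tuple[List[float], int]  # (feature vector, label in {0, 1})


class OnlinePerceptron:
    """Linear classifier updated one example at a time.

    Memory use is O(number of features) and does not grow with the
    length of the stream.
    """

    def __init__(self, n_features: int, learning_rate: float = 1.0) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> int:
        score = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if score > 0 else 0

    def train(self, x: List[float], y: int) -> None:
        # error is -1, 0, or +1; weights change only on a mistake.
        error = y - self.predict(x)
        if error != 0:
            for i, v in enumerate(x):
                self.weights[i] += self.learning_rate * error * v
            self.bias += self.learning_rate * error


def prequential_accuracy(model: OnlinePerceptron, stream: Iterable[Example]) -> float:
    """Test-then-train: predict on each example before learning from it."""
    correct = 0
    seen = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test first
        model.train(x, y)                      # then train; example is discarded
        seen += 1
    return correct / seen if seen else 0.0


def synthetic_stream(n: int, seed: int = 0) -> Iterator[Example]:
    """Hypothetical data source: label is 1 when x0 > x1.

    Points too close to the boundary are skipped so the classes are
    separated by a margin.
    """
    rng = random.Random(seed)
    produced = 0
    while produced < n:
        x0, x1 = rng.random(), rng.random()
        if abs(x0 - x1) < 0.1:
            continue
        produced += 1
        yield [x0, x1], int(x0 > x1)


if __name__ == "__main__":
    model = OnlinePerceptron(n_features=2)
    accuracy = prequential_accuracy(model, synthetic_stream(5000))
    print(f"prequential accuracy: {accuracy:.3f}")
```

The stream is consumed through a generator, so no example is stored after it has been processed. A distributed library would apply the same idea per micro-batch rather than per example, but that part is outside this sketch.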

The library contains methods for classification, regression, clustering, and frequent pattern mining. Heitor and Albert explain how these advanced methods work in practice, discuss big data analytics applications in telecommunication networks, compare StreamDM’s methods with those available in MLlib and Spark ML, and demonstrate the library’s ease of use and extensibility.
