An overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei’s Noah’s Ark Lab and Télécom ParisTech.

Machine Learning for Spark Streaming with StreamDM

Published: July 08, 2018

The main goal of this tutorial is to introduce attendees to big data stream mining theory and practice. We will use the StreamDM framework to illustrate concepts and also to demonstrate how data stream mining pipelines can be deployed using StreamDM.

Big Data Stream Mining using Spark Streaming

Published: September 10, 2018

The volume of data is rapidly increasing due to the development of the technology of information and communication. This data comes mostly in the form of streams. Learning from this ever-growing amount of data requires flexible learning models that self-adapt over time. In addition, these models must take into account many constraints: (pseudo) real-time processing, high-velocity, and dynamic multi-form change such as concept drift and novelty. The tutorial was combined with a workshop on the same topic.

Machine learning for nonstationary streaming data using Structured Streaming and StreamDM

Published: September 13, 2018

The main difference between batch machine learning implementations in Spark (MLlib and Spark ML) and StreamDM is that the latter focus on algorithms that can be trained and adapted incrementally. This can be a huge advantage in some domains as it enables automatically updating the learning models. StreamDM is currently under development by Huawei Noah’s Ark Lab and Télécom ParisTech.

Streaming Random Forest Learning in Spark and StreamDM

Published: October 03, 2018

We present how to build random forest models from streaming data. This is achieved by training, predicting and adapting the model in real-time with evolving data streams. The implementation is on the open source library StreamDM, built on top of Apache Spark.

Machine learning for streaming data: Practical insights

Published: September 26, 2019

In many domains, data is generated at a fast pace. A clear example is the Internet of Things (IoT) applications, where connected sensors yield large amount of data in short periods. To build predictive models from this data, you need to either settle for traditional offline learning or attempt to learn from the data incrementally. A significant setback with the offline learning approach is that it’s slow to react to changes in the domain, and these changes can have a catastrophic impact on the model predictive performance, since the patterns in which the model was trained on are no longer valid. An online approach where the model is trained incrementally can potentially fix this; however, the untold story is that the existing challenges for offline learning are still present (and are even maximized) when processing the data online. These challenges include, but are not limited to, raw data preprocessing, efficient incremental updates to models, algorithms to detect changes and react to them, and dealing with lots of unlabeled and delayed-labeled data.

Streaming Random Patches for Evolving Data Stream Classification

Published: November 15, 2019

Presented the paper “Streaming Random Patches for Evolving Data Stream Classification.”

Lecture at the IOT Stream Data Mining course

Published: November 20, 2019

Lecture at the IOT Stream Data Mining course (Paris, France) as part of the Data and Knowledge 2nd year Master Program of Université Paris Saclay 2019-2020.

Machine learning for data streams with scikit-multiflow

Published: January 07, 2021

In this tutorial we introduced attendees to data stream mining procedures and examples of big data stream mining applications with examples using the scikit-multiflow. The tutorial was scheduled for IJCAI 2020 but ended up being presented early 2021 due to the pandemic. Website: https://streamlearningtutorial2020.netlify.app/

Machine learning for streaming data: state of the art, challenges, and opportunities

Published: January 07, 2021

Presentation at the first Artificial Intelligence Researchers Association of New Zealand meetup. Conference website My presentation video

Machine Learning for Data Streams Master Class

Published: April 25, 2021

Guest lecture for the Latin America Masterclasses organised by the New Zealand Education (ENZ)

Semi-Supervised Learning for Delayed Partially Labelled Data Streams

Published: April 25, 2021

Presentation at the machine learning seminar series at the University of Waikato

Machine Learning for Streaming Data

Published: May 17, 2021

Invited talk at the Joint Machine Learning Seminar hosted collaboratively by Cardiff University and the University of Waikato

Aprendizagem de Máquina para fluxos de dados contínuos (Machine learning for continuous data streams)

Published: September 09, 2021

Invited talk at the Itau data science meeting.

Introduction to MOA

Published: November 26, 2021

Talk at the International WEKA user conference introducing MOA Program website My presentation video

Learning for Delayed Partially Labelled Data Streams

Published: December 07, 2021

Keynote at the IncrLearn workshop in ICDM 2021. Program: https://incrlearn.sciencesconf.org/resource/page/id/7 Slides: https://incrlearn.sciencesconf.org/data/Gomes_IncrLearn21.pdf

Machine Learning for Streaming Data

Published: September 22, 2022

Guest Lecture entitled “Machine Learning for Streaming Data” for undergrad and graduate students at Michigan Technological University (USA)

Semi-Supervised Learning for Delayed Partially Labelled Data Streams

Published: November 28, 2022

Presentation at the Artificial Intelligence Researchers Association of New Zealand meetup. Conference website My presentation video

Concept Drift Detection and Applications

Published: March 31, 2023

Guest Lecture concerning Concept Drift Detection and Applications for data scientists from Stats NZ

Machine Learning for Streaming Data

Published: April 25, 2023

Guest Lecture concerning Machine Learning for Streaming Data for graduate students at China University of Mining and Technology

SSL and Delayed Labelling

Published: May 18, 2023

Guest Lecture concerning SSL and Delayed Labelling for undergraduate students at the University of Waikato (Hamilton, NZ)

PAKDD’24 tutorial: Machine Learning for Streaming Data

Published: May 08, 2024

Machine learning for data streams (MLDS) has been a significant research area since the late 90s, with increasing adoption in industry over the past few years. Despite commendable efforts in opensource libraries, a gap persists between pioneering research and accessible tools, presenting challenges for practitioners, including experienced data scientists, in implementing and evaluating methods in this complex domain. Our tutorial addresses this gap with a dual focus. We discuss advanced research topics such as partially delayed labeled streams while providing practical demonstrations of their implementation and assessment using Python. By catering to both researchers and practitioners, this tutorial aims to empower users in designing, conducting experiments, and extending existing methodologies.

IJCAI’24 tutorial: Machine Learning for Streaming Data

Published: August 03, 2024

The field of Machine Learning for Data Streams has seen growing interest and adoption in recent years, particularly in industry. Despite this progress, there remains a noticeable gap between cutting-edge research and the practical tools available, making it difficult for even experienced data scientists to apply and evaluate these techniques in real-world scenarios. Our tutorial is designed to address this issue by focusing on two key areas. We explore advanced topics, such as handling streams with partially delayed labels, while providing practical, Python-based demonstrations for implementation and evaluation. This dual approach aims to empower both researchers and practitioners to develop new experiments and extend current methodologies.

KDD’24 tutorial: Practical Machine Learning for Streaming Data

Published: August 26, 2024

Our tutorial aims to bridge this gap with a dual focus. We discuss important research topics, such as partially delayed labeled streams, while providing practical demonstrations of their implementation and assessment using CapyMOA, an open-source library that provides efficient algorithm implementations through a high-level Python API. This tutorial also included exercises and many examples. Link to publication.

teaching

COMPX523: Data Stream Mining

Undergraduate and MSc course, University of Waikato, School of Computing & Mathematical Sciences, 2020

This paper is an introduction to stream data mining. Data streams are everywhere, from F1 racing over electricity networks to news feeds. Data stream mining relies on incremental algorithms that process streams under strict resource limitations. This paper focuses on, as well as extends the methods implemented in MOA (Java) and scikit-multiflow (Python), two open-source stream mining software suites currently being developed by the Machine Learning group at the University of Waikato. More information.

COMP307: Fundamentals of Aritificial intelligence

Undergraduate, Victoria University of Wellington, ECS, 2020

The lectures cover following main topics: search techniques, machine learning including basic learning concepts and algorithms, neural networks and evolutionary learning, reasoning under uncertainty, planning and scheduling, knowledge based systems and AI Philosophy. The course includes a substantial amount of programming. The course will cover both science and engineering applications. More information.

Heitor Murilo Gomes

Posts by Collection

portfolio

publications

talks

teaching