# Open projects and theses

On this webpage, we collect ideas for BSc/MSc projects and theses. The list is usually incomplete, since new ideas come up all the time and we do not always update the website right away.
If you are interested in any of the projects listed here or other projects related to automated machine learning and deep neural networks, please get in touch with us **following the instructions on our webpage** about Hiwi positions. If you have more questions about a specific project, you can also contact the person(s) in charge of the project.

All projects listed here are flexible in the sense that they can, in principle, be tackled in a BSc or MSc project or thesis (or even a project followed by a thesis). In none of these projects is there a risk of running out of interesting questions - the only challenge is to identify an interesting part of the problem with the right scale for a BSc/MSc project or thesis.

In general, our projects can be grouped into a few categories on which
our research focuses. Those are **Algorithm Configuration and Selection**,
**Hyperparameter Optimization**, **Automated Machine Learning**,
and **Deep Learning** (with applications to the BrainLinks-BrainTools initiative).
We have a compute cluster available for the empirical studies required in our
research and some of the projects below.

Generally speaking, **Algorithm Configuration** describes the problem of
finding the best configuration of an algorithm for a given problem class,
usually defined by a set of representative instances. This problem arises
whenever an algorithm is repeatedly applied to similar instances. Examples
include scheduling, timetabling, resource allocation, and optimization, each
with many real-world applications. The related task of **Algorithm Selection**
consists in selecting an algorithm from a list of candidates in order to
achieve the best performance on a specific problem instance.

A very important special case of algorithm configuration is **Hyperparameter
Optimization** (HPO). In the context of machine learning, HPO refers to the
problem of choosing the hyperparameters of a learning algorithm. But instead
of trying to optimize the algorithm's performance on the training data, the
performance is measured on an independent data set.
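
As a minimal sketch of this idea, the following pure-Python example tunes a single hyperparameter, the number of neighbours `k` of a toy 1-D k-nearest-neighbour regressor, by scoring each candidate on a held-out validation split instead of on the training data. The toy data, the function names (`knn_predict`, `validation_error`, `tune_k`) and the candidate grid are illustrative assumptions and not part of any of our tools.

```python
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def validation_error(train, valid, k):
    """Mean squared error of the k-NN model on the held-out set `valid`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)


def tune_k(train, valid, candidates):
    """Return the candidate k with the lowest validation error."""
    return min(candidates, key=lambda k: validation_error(train, valid, k))


# Toy data (assumption): noisy samples of y = x^2 on [0, 1].
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, x * x + rng.gauss(0.0, 0.05)))
train, valid = data[:150], data[150:]

candidates = [1, 3, 5, 10, 25, 50]
best_k = tune_k(train, valid, candidates)
print("selected k:", best_k)
```

Scored on the training data itself, `k = 1` reaches zero error in this example because every training point is its own nearest neighbour, which is exactly why the selection is done on the independent split.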

Machine learning has achieved considerable successes in recent years and an
ever-growing number of disciplines rely on it. However, this success crucially
relies on human machine learning experts to perform tasks like data preprocessing,
feature selection, algorithm selection, hyperparameter optimization and analysis
of results. The field of **Automated Machine Learning** (AutoML) aims to automate
some and eventually all of these steps.

## Bayesian Optimization Algorithms

Bayesian optimization has been the state of the art for optimizing expensive black-box functions, e.g., complex machine learning algorithms. With growing interest, also in the light of big data, many different algorithms have been proposed in recent years. Unfortunately, a comprehensive study comparing them on several benchmarks does not exist, as the source code is often not open source and there is no established set of benchmarks in the community.

In this (potentially team-) project, the student(s) should implement (and improve upon) a Bayesian optimization algorithm from the literature in our Python library RoBO. Together with extensive experiments on various benchmarks, this project will help to better compare the available methods and to gain insights into their strengths and weaknesses.
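
For readers new to the topic, the basic loop shared by most Bayesian optimization algorithms can be sketched in plain Python: fit a surrogate model to the evaluations so far, maximize an acquisition function to pick the next point, evaluate it, and repeat. The sketch below assumes a Gaussian process surrogate with a fixed squared-exponential kernel and the expected improvement acquisition function on a 1-D toy objective. It is a generic illustration and does not use or mirror the RoBO API; all function names and constants are our own assumptions.

```python
import math


def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / lower[j][j]
    return lower


def solve_lower(lower, rhs):
    """Solve lower @ z = rhs by forward substitution."""
    z = []
    for i, row in enumerate(lower):
        z.append((rhs[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def solve_lower_transposed(lower, rhs):
    """Solve lower.T @ z = rhs by back substitution."""
    n = len(lower)
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * z[k] for k in range(i + 1, n))
        z[i] = (rhs[i] - s) / lower[i][i]
    return z


def fit_gp(xs, ys, noise=1e-4):
    """Fit a GP with constant mean; return what is needed for predictions."""
    y_mean = sum(ys) / len(ys)
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    lower = cholesky(gram)
    centered = [y - y_mean for y in ys]
    alpha = solve_lower_transposed(lower, solve_lower(lower, centered))
    return lower, alpha, y_mean


def predict(xs, model, x_new):
    """Posterior mean and standard deviation of the GP at x_new."""
    lower, alpha, y_mean = model
    k_star = [rbf(x, x_new) for x in xs]
    v = solve_lower(lower, k_star)
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    variance = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 1e-12)
    return mean, math.sqrt(variance)


def expected_improvement(mean, std, best):
    """Expected improvement over the incumbent `best` when minimizing."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def bayesian_optimization(objective, candidates, initial_xs, n_iter):
    """Minimize `objective` over the finite candidate set `candidates`."""
    xs = list(initial_xs)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        model = fit_gp(xs, ys)
        best = min(ys)
        x_next = max(
            candidates,
            key=lambda x: expected_improvement(*predict(xs, model, x), best),
        )
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = min(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], xs


def toy_objective(x):
    """Stand-in for an expensive black-box function (assumption)."""
    return (x - 0.3) ** 2


grid = [i / 100 for i in range(101)]
best_x, best_y, history = bayesian_optimization(toy_objective, grid, [0.0, 0.5, 1.0], 10)
print("best x:", best_x, "best value:", best_y)
```

Published algorithms differ mainly in the surrogate model, the acquisition function and how the acquisition function is optimized; these are the components a project would replace or extend.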

**Requirements**

- Basics in Bayesian optimization
- Python programming skills

**Literature**

- A Review of Bayesian Optimization

**Contact:** Stefan Falkner, Aaron Klein

## Warm-Starting of Algorithm Configuration

Often, users optimize the (hyper-)parameters of their algorithm not only on one instance set (or dataset), but on several. Starting from scratch every time wastes time and does not reuse information from previous evaluations. The goal of this project is to look into warm-starting of algorithm configuration / Bayesian optimization (in particular SMAC), such that data collected in earlier optimization runs can be reused on new instance sets or datasets.
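
One very simple form of warm-starting, shown here only to make the idea concrete, is to seed the initial design of a new run with configurations that ranked well in earlier runs. The sketch below picks them round-robin by rank, because raw costs are not comparable across datasets. The data layout and the function name `warm_start_design` are hypothetical; this is neither SMAC's interface nor the method of the papers listed below.

```python
def warm_start_design(previous_runs, n_init):
    """Pick up to n_init initial configurations from earlier runs.

    `previous_runs` maps a dataset name to a list of (config, cost) pairs,
    where lower cost is better. Configurations are taken round-robin by rank:
    first the best of every run, then the second best, and so on.
    """
    ranked = [sorted(run, key=lambda entry: entry[1]) for run in previous_runs.values()]
    if not ranked:
        return []
    design, seen = [], set()
    for position in range(max(len(run) for run in ranked)):
        for run in ranked:
            if position >= len(run):
                continue
            config = run[position][0]
            key = tuple(sorted(config.items()))  # hashable key to skip duplicates
            if key in seen:
                continue
            seen.add(key)
            design.append(config)
            if len(design) == n_init:
                return design
    return design


# Hypothetical results of two earlier optimization runs.
previous_runs = {
    "dataset_a": [
        ({"lr": 0.1, "depth": 3}, 0.20),
        ({"lr": 0.01, "depth": 5}, 0.12),
        ({"lr": 0.3, "depth": 2}, 0.35),
    ],
    "dataset_b": [
        ({"lr": 0.01, "depth": 5}, 0.08),
        ({"lr": 0.05, "depth": 4}, 0.10),
    ],
}

design = warm_start_design(previous_runs, n_init=3)
print(design)
```

More elaborate approaches also transfer the surrogate model or use dataset similarity, which is where the open research questions of this project lie.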

**Requirements**

- Basics in Bayesian optimization
- Strong Python programming skills

**Literature**

- Warm Starting Bayesian Optimization
- Initializing Bayesian Hyperparameter Optimization via Meta-Learning

**Contact:** Marius Lindauer

## Multi-Armed Bandits for Algorithm Configuration

In many algorithm configuration scenarios, the number of instances prohibits evaluating many configurations on all instances in any reasonable amount of time. Therefore, allocating the available resources to promising candidates while cutting poorly performing ones early is crucial. Two prominent algorithm configurators, SMAC and I-RACE, tackle this problem by comparing a handful of configurations on a growing subset of the instances. Using heuristics, seemingly bad configurations are identified and no longer considered.

This procedure shares many properties with the so-called Multi-Armed Bandit problem. More specifically, it is closely related to the pure exploration setting. This project investigates the potential of various bandit algorithms for algorithm configuration by implementing several algorithms from the literature in our bandit library and evaluating them on algorithm configuration scenarios.
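
To illustrate the connection, the sketch below applies Successive Halving, one well-known pure-exploration bandit strategy, to a toy configuration problem: each configuration is an arm, and evaluating it on one more instance is a pull. It is not how SMAC or I-RACE implement their comparisons and it does not use our bandit library; the cost function and all names are assumptions made for the example.

```python
import random


def successive_halving(configs, evaluate, instances, initial_budget=2):
    """Keep the better half of the configurations while doubling the budget.

    `evaluate(config, instance)` returns a cost (lower is better). In every
    round, all surviving configurations are compared on the first `budget`
    instances; the worse half is discarded and the budget is doubled.
    """
    survivors = list(configs)
    budget = initial_budget
    while len(survivors) > 1:
        subset = instances[:min(budget, len(instances))]
        mean_cost = {
            c: sum(evaluate(c, i) for i in subset) / len(subset) for c in survivors
        }
        survivors.sort(key=mean_cost.get)
        survivors = survivors[:max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0]


# Toy scenario (assumption): a configuration is a single number, an instance
# is characterized by its own optimum, and the cost is the squared distance.
rng = random.Random(1)
instances = [rng.uniform(0.68, 0.72) for _ in range(32)]
configs = [i / 10 for i in range(8)]

calls = []


def evaluate(config, instance):
    calls.append((config, instance))  # record every target algorithm run
    return (config - instance) ** 2


winner = successive_halving(configs, evaluate, instances)
print("winner:", winner, "evaluations:", len(calls))
```

In this toy run, 48 evaluations are needed, compared with 8 x 32 = 256 for evaluating every configuration on every instance.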

**Requirements**

- Basics in Statistics and Machine Learning
- C++ and Python experience

**Literature**
See links in the text.

**Contact:** Stefan Falkner

## Random Forests and co.

Random Forests remain one of the best off-the-shelf machine learning algorithms: they perform robustly, scale well to large data sets, and can handle integer or categorical parameters. This is why we still use them in our research. In recent years, more and more variants have been introduced to tackle different problems. For example, the so-called Mondrian Forests can be trained incrementally, i.e., the model does not have to be retrained from scratch when a data point is added to the data. This makes them a good candidate for sequential optimization, as we do in SMAC.

The project consists of implementing this algorithm in our Random Forest library, together with an extensive evaluation in scenarios relevant to our applications. Other potential variants for this project include Bayesian Forests, Canonical Correlation Forests, and Random Survival Forests.

**Requirements**

- Basics in Statistics
- Good C++ knowledge (or willingness to develop it)

**Literature**

- List of tree-based methods and recent applications

**Contact:** Stefan Falkner

## Approximate Ranking Trees

Meta-learning is the field of research that attempts to learn what kind of Machine Learning algorithms work well on what data. A common problem is the Algorithm Selection Problem: given a dataset, which algorithm and hyperparameter settings should be used to model the dataset? The current state of the art is an algorithm that builds so-called Approximate Ranking Trees. It uses the Spearman Correlation Coefficient to split the individual trees. Although this algorithm performs very well in practice, the effectiveness of the Spearman Correlation Coefficient is open for debate. In this project, the task is to design an alternative algorithm for constructing Approximate Ranking Trees, and improve upon the current state of the art.
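
As a reminder of what the Spearman Correlation Coefficient measures, the sketch below computes it for the performance of the same four algorithms on two datasets: it is the Pearson correlation of the two rank vectors, so it captures how similarly the datasets order the algorithms. The accuracy values are made up for illustration, and the sketch does not reproduce how Approximate Ranking Trees use the coefficient in their split criterion.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest) upwards, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = average_rank
        pos = end + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))


# Hypothetical accuracies of four algorithms on two datasets.
accuracy_dataset_1 = [0.91, 0.85, 0.78, 0.60]
accuracy_dataset_2 = [0.88, 0.80, 0.83, 0.55]

print(spearman(accuracy_dataset_1, accuracy_dataset_2))  # 0.8
```

Here the two datasets agree on the best and the worst algorithm and swap the two in the middle, which gives a coefficient of 0.8.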

**Requirements**

- Basics in Statistics and Machine Learning
- Strong programming skills (any language)

**Literature**

- Pairwise meta-rules for better meta-learning-based algorithm ranking

**Contact:** Jan van Rijn

## Massively Collaborative Meta-learning

Meta-learning is the field of research that attempts to learn what kind of Machine Learning algorithms work well on what data. A common problem is the Algorithm Selection Problem: given a dataset, which algorithm and hyperparameter settings should be used to model the dataset? Currently, all meta-learning research is performed on a well-constructed set of algorithms and datasets. However, this paradigm is suboptimal, as it makes strong assumptions about the prior knowledge of the users. In this project, we will use OpenML (an online experiment database) to do meta-learning on all datasets that it contains. This involves novel challenges, as the (meta-)data is very sparse and contains datasets from various domains.

**Requirements**

- Basics in Machine Learning
- Strong Python programming skills

**Literature**

- Toward a Justification of Meta-learning: Is the No Free Lunch Theorem a Show-stopper?
- OpenML: networked science in machine learning

**Contact:** Jan van Rijn

## Mythbusting Data Mining Urban Legends

Data mining researchers and practitioners often refer to rules of thumb and urban legends such as 'data preparation is more important than the modeling algorithm used' or 'non-linear models typically perform better'. However, large-scale experimental evidence is sometimes lacking. We will leverage OpenML, a repository with over 25000 data sets and close to 1.5 million experimental results, to try to bust or prove some of these data mining urban myths.

**Requirements**

- Basics in Machine Learning
- Decent programming skills (any language)

**Literature**

- A few useful things to know about machine learning
- OpenML: networked science in machine learning

**Contact:** Jan van Rijn

## Algorithm Configuration in the Cloud

**Introduction**

A recent trend is to migrate everything into the cloud (including services and computations). Since algorithm configuration is also a computationally intensive task, algorithm configuration in the cloud would be a perfect fit for users who do not have a large compute cluster. We have already started to look into this topic and found that, in particular, algorithm configuration with running time as the performance metric can lead to different results on different hardware. The goal of this project is to come up with a protocol for running algorithm configuration in the cloud and on different hardware.

**Requirements**

- Python programming skills
- [beneficial] Basics in cloud computing

**Contact:** Marius Lindauer

## BrainLinks-BrainTools: Practical Tuning of a Brain Signal Processing Pipeline

Decoding EEG data is a challenging task and has recently been approached with deep learning techniques. As a project within BrainLinks-BrainTools we want to optimize such a pipeline and improve upon the baseline using our tools. In this project you will use an EEG decoding pipeline using deep learning on motor imagery datasets. This project is a **group project** and can therefore only be taken by a group of students.

See also here.

**Requirements for subtasks**

- Knowledge of Python
- Knowledge of Deep Learning

**Contact:** Katharina Eggensperger, Ilya Loshchilov

### Structure Search

It is well known that the performance of deep learning is very sensitive to its hyperparameters, such as architecture and learning parameters. In this project you will explore the configuration space of different architectures using novel techniques in hyperparameter optimization.

**Literature**

- Hyperband
- Speeding up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves

### Improve Training Procedure

Besides a good architecture, the training procedure of the artificial neural network is also crucial for achieving peak performance. We want our networks to converge faster, but also to achieve higher accuracy. In this project, you will apply novel techniques that achieve state-of-the-art performance in computer vision to the brain signal processing pipeline.

**Literature**

- Online Batch Selection for Faster Training of Neural Networks
- CMA-ES for Hyperparameter Optimization of Deep Neural Networks

### Transfer Learning

It is not possible to use the same decoder network for different patients, but information can nevertheless be transferred between patients. One way to tackle this multi-task machine learning problem is to train an artificial neural network across all patients.

**Literature**

- Progressive Neural Networks

### Data Augmentation

EEG data for training a model is usually scarce, as one recording contains fewer than 1000 samples. One way to overcome this limitation is data augmentation: applying transformations to the training data and thereby making the resulting model invariant to these transformations. Well-performing transformations, such as rotation and rescaling, are known to improve generalization performance for image classification, but for EEG data it is not clear which transformations will work. In this project, you will first treat data augmentation as a hyperparameter optimization problem on computer vision datasets and then move on to EEG data.
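
To make the framing as a hyperparameter optimization problem concrete, the sketch below applies three candidate transformations (a circular time shift, amplitude scaling and additive Gaussian noise) to one multi-channel trial. Their strengths are exposed as parameters that an optimizer could tune. Whether any of these transformations helps for EEG is exactly the open question of this project; the choice of transformations, the function name `augment` and the parameter values are assumptions made for illustration.

```python
import random


def augment(trial, max_shift, scale_range, noise_std, rng):
    """Return a randomly transformed copy of one trial.

    `trial` is a list of channels, each a list of samples. The three
    arguments max_shift, scale_range and noise_std control the strength of
    the transformations and would be the hyperparameters to optimize.
    """
    shift = rng.randint(-max_shift, max_shift)
    scale = rng.uniform(1.0 - scale_range, 1.0 + scale_range)
    augmented = []
    for channel in trial:
        n = len(channel)
        # The same circular shift and scale are applied to all channels.
        shifted = [channel[(t - shift) % n] for t in range(n)]
        augmented.append([scale * v + rng.gauss(0.0, noise_std) for v in shifted])
    return augmented


# Hypothetical trial with 2 channels and 8 samples per channel.
trial = [
    [0.0, 0.1, 0.3, 0.2, -0.1, -0.3, -0.2, 0.0],
    [0.5, 0.4, 0.2, 0.0, -0.2, -0.4, -0.5, -0.3],
]

# One point in the augmentation hyperparameter space (assumed values).
aug_config = {"max_shift": 2, "scale_range": 0.1, "noise_std": 0.01}

rng = random.Random(0)
new_trial = augment(trial, rng=rng, **aug_config)
print(len(new_trial), len(new_trial[0]))
```

Setting all three strengths to zero returns the trial unchanged, so "no augmentation" is part of the same search space.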