Cross Platform Interoperability of AI Algorithms on Computational Cluster
Author
Tomáš Pajurek
Year
2016
Type
Bachelor thesis
Supervisor
Ing. Ondřej Stuchlík
Reviewers
Ing. Tomáš Borovička
Department
Summary
Data scientists and other researchers often need to find the best combination of an algorithm and its parameters. The number of such combinations can be very large, and finding the best one is computationally intensive. The aim of this thesis is to design and implement a system that can launch multiple instances of machine learning algorithms from different platforms (Python, R, Weka and RapidMiner) in parallel on a computational cluster.
The system is written in Scala and built on top of the Apache Spark framework. The focus is on developing a robust, high-quality software architecture. Important architectural decisions are based on performance measurements.
The resulting system meets all defined functional and non-functional requirements, with minor limitations related to specific platforms. The system hides the details of parallelization and of the differing algorithm implementations behind a high-level interface. Researchers can use this interface to solve large numbers of classification, regression and clustering problems, or even to execute entire custom machine learning pipelines.
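
The core idea in the summary, evaluating many algorithm and parameter combinations in parallel and keeping the best one, can be illustrated with a minimal sketch. This is not the thesis's implementation, which is written in Scala on Apache Spark and runs on a cluster. The sketch uses a Python thread pool on a single machine, and every name in it (score_knn, score_tree, SEARCH_SPACE, grid_search) is hypothetical. The scoring functions are toy stand-ins for training and evaluating a real model.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real learners: each returns a score (higher is better).
def score_knn(k):
    return 1.0 - abs(k - 5) / 10.0

def score_tree(depth):
    return 1.0 - abs(depth - 3) / 4.0

# Assumed search space: algorithm name -> (scoring function, parameter name, candidates).
SEARCH_SPACE = {
    "knn": (score_knn, "k", [1, 3, 5, 7]),
    "tree": (score_tree, "depth", [2, 4, 8]),
}

def enumerate_tasks(space):
    """Yield one independent task per (algorithm, parameter value) combination."""
    for name, (fn, param, values) in space.items():
        for value in values:
            yield name, fn, {param: value}

def run_task(task):
    """Evaluate a single combination and return (algorithm, parameters, score)."""
    name, fn, params = task
    return name, params, fn(**params)

def grid_search(space, workers=4):
    """Run all combinations in parallel and return the best result and all results."""
    tasks = list(enumerate_tasks(space))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_task, tasks))
    best = max(results, key=lambda r: r[2])
    return best, results
```

Because each combination is evaluated independently of the others, the same task list could in principle be distributed across cluster nodes instead of local threads, which is the kind of workload the summary describes.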