doc. Ing. Daniel Langr, Ph.D.

Theses

Dissertation theses

Numerical methods for computer modeling of atomic nuclei

Level
Topic of dissertation thesis
Topic description

Symmetry-adapted no-core shell model (SA-NCSM) is a state-of-the-art method for ab-initio (from first principles) modeling of atomic nuclei. The principle of this model is based on computation of nuclear wave functions in the form of eigenvectors of matrix operators relevant to a given nucleus. In the current implementation of this model, elements of these matrix operators are explicitly stored in a computer memory. This limits its applicability to only relatively light nuclei, even when the largest contemporary existing supercomputers are utilized. The goal of this dissertation thesis topic is to research and develop suitable parallel and scalable numerical algorithms and corresponding data structures that would enable “on-the-fly” computation of matrix operators in SA-NCSM calculations. Specific research of interest is the development and implementation of scalable parallel algorithms for finding extremal eigenvalues and corresponding eigenvectors of large sparse symmetric matrices.

Bachelor theses

Graphical User Interface for Visualization of Very Large Sparse Matrices

Author
Ivo Kolář
Year
2014
Type
Bachelor thesis
Supervisor
doc. Ing. Daniel Langr, Ph.D.
Reviewers
doc. Ing. Ivan Šimeček, Ph.D.

Master theses

Survey of Sparse Matrix Storage Formats for Parallel Systems

Author
Radek Bittara
Year
2015
Type
Master thesis
Supervisor
Ing. Daniel Langr, Ph.D.
Reviewers
doc. Ing. Ivan Šimeček, Ph.D.
Summary
The goal of this thesis is to provide a overview of published sparse matrix storage formats in computer memory. This thesis contains a description of the principles and potential use in a sparse matrix-vector multiplication for modern parallel systems.

Performance analysis of the LSU3shell program

Author
Martin Kočička
Year
2019
Type
Master thesis
Supervisor
Ing. Daniel Langr, Ph.D.
Reviewers
Ing. Jiří Kašpar
Summary
Ab initio approaches to nuclear structure exploration are at the forefront of current nuclear physics research. LSU3shell is an implementation of the ab initio method called symmetry-adapted no-core shell model (SA-NCSM) optimized for distributed HPC systems. The goal of this thesis was the analysis of the LSU3shell program with focus primarily on dynamic memory allocation, research of methods that can be used to improve performance and memory usage, and their application on LSU3shell. The focus on dynamic memory allocation proved to be the right one, leading to many possible optimizations. I was able to reduce the run time by 41 % on average, thus potentially saving up to 1.4 million core-hours of our total BlueWaters allocation.

Parallel Sorting in C++11

Author
Klára Horáčková
Year
2019
Type
Master thesis
Supervisor
Ing. Daniel Langr, Ph.D.
Reviewers
doc. Ing. Ivan Šimeček, Ph.D.
Summary
A new parallel variant of the quicksort algorithm called C++11Sort is presen- ted in this thesis. C++11Sort is implemented based on the C++11 threads and the respective synchronization techniques, i.e. without the use of the non- standard C++ extensions and external libraries. In the search section, existing implementations of the parallel versions of the quicksort algorithm are descri- bed. These implementations are afterwards experimentally compared to the C++11Sort algorithm. C++11Sort achieves excellent results in those expe- riments. When sorting two billions of integers, it is faster than the fastest existing implementation by 28 %.

Efficient concurrent memoization system

Author
Viacheslav Kroilov
Year
2020
Type
Master thesis
Supervisor
Ing. Daniel Langr, Ph.D.
Reviewers
Ing. Jiří Kašpar
Summary
An automatic memoization system - also known as a software cache - stores a limited number of recently accessed elements and speeds up consequent accesses to them. Least-Recently-Used (LRU) is a popular replacement policy for hardware and software caches. However, its concurrent implementation suffers from high contention due to the list reordering performed both on lookup and insertion. A novel LRU-inspired concurrent software cache, called DeferredLRU, is presented in this thesis. The main goal of the design was to make it scalable and suitable for many-core systems. These properties were achieved by using a different approach to tracking item access order. This approach substantially decreases the number of list reinsertions --- the main factor of the contention in a regular LRU cache. DeferredLRU throughput and hit-rate are sensitive to the meta-parameter setting. By fine-tuning meta parameters for specific inputs, it was possible to achieve higher hit-rate than of a regular LRU cache for every tested input. The relative difference was up to 7.8%. DeferredLRU performance was compared to existing alternatives, including corresponding caches from Intel TBB and Facebook HHVM projects. Tested caches were evaluated with up to 32 threads (on a 16 HW cores CPU). In 32 threads evaluation, DeferredLRU was faster in all 16 tests. When accesses were distributed among multiple smaller caches for better parallelism (this approach is called binning), DeferredLRU was faster in 11 of 16 tests and was close to best-performing caches in 5 other tests. DeferredLRU with binning was up to 28.8 times faster on 32 threads compared to single-threaded performance.

Low-Latency Heap Profiling

Author
Jan Tománek
Year
2021
Type
Master thesis
Supervisor
doc. Ing. Daniel Langr, Ph.D.
Reviewers
doc. Ing. Ivan Šimeček, Ph.D.
Summary
Frequent heap allocations can have a significant impact on the overall performance of a program. The problem becomes even more critical in multithreaded applications, where too many interactions with heap can grow into the main bottleneck and effectively diminish scalability. Profiling such applications with existing heap profilers is almost impossible due to the enormous runtime overhead they add. This thesis describes the typical memory layout of a Linux process, briefly shows the internals of the GNU C memory allocator, analyzes available heap profiling tools, and guides the reader through the process of constructing a~custom heap profiler. Further, the thesis proposes a concept of low-latency heap profiling. Based on that, a new profiler Heaprof has been developed and compared with other heap profiling tools. Experimental results revealed that Heaprof was the only tool capable of profiling without further compromising the program's scalability. For 8-threaded benchmarks, Heaprof achieved up to 8 times profiling speedup.

Evaluation of thread pool implementations by resolution of webserver requests

Author
Martin Mucha
Year
2024
Type
Master thesis
Supervisor
doc. Ing. Daniel Langr, Ph.D.
Reviewers
doc. Ing. Ivan Šimeček, Ph.D.
Summary
Utilizing a pool of worker threads to scale web servers is a widely adopted approach employed by many server implementations. Individual thread pool implementations can significantly vary in scheduling algorithms and task encapsulation. In this thesis, we design and implement our own thread pool utilizing C++20 coroutines and work-stealing scheduling. To ensure an efficient scheduling, we utilize lock-free structures. Our implementation not only achieves significantly lower scheduling overhead compared to OpenMP implementations but also performs comparably to, and in one test outperforms, the oneTBB thread pool implementation. Finally, we conduct testing of various versions of our thread pool in a server environment to assess performance across different server requests types.

Eigensolver problem for large sparse symmetric matrices

Author
Matěj Razák
Year
2024
Type
Master thesis
Supervisor
doc. Ing. Daniel Langr, Ph.D.
Reviewers
doc. Ing. Ivan Šimeček, Ph.D.
Summary
This work is devoted to selected methods for finding eigenvalues and eigenvectors of symmetric sparse matrices over the MPI+OpenMP hybrid parallel programming model in the C++ language. The aim of the thesis is to improve the existing implementation of the one-vector and block Lanczos algorithm. Alternative implementations developed are single-vector and block versions of the Lanczos algorithm and the LOBPCG method. This is followed by testing and mutual comparisons of individual implementations. The scale of comparison is time efficiency and the number of iterations of the algorithm. A comparison with an existing implementation showed better efficiency of my solution.