Parallel CRC optimisations on the x64 architecture: a per-partes method
Authors
Year
2024
Published
International Journal of Parallel, Emergent and Distributed Systems. 2024, 39(3), 292-316. ISSN 1744-5760.
Type
Article
Departments
Annotation
Each generation of CPU provides more resources and new features. These increase the ability to perform algorithms faster and with a higher degree of parallelism. The article discusses methods used to optimise CRC generation algorithms for long data blocks with consideration of the capabilities of contemporary systems. We analysed known software CRC algorithms and combined all known principles into a solution scalable in multiple CPU cores on single and multi-socket systems. Various algorithms were evaluated on contemporary multicore systems with 1 × 4, 1 × 64, 2 × 12, and 4 × 26 cores. The results show how the performance is affected by the architecture of the memory subsystem. Compared to the original sequential Sarwate algorithm, our algorithms are 48.0, 51.1, 38.0, and 28.8 times faster.
A parallel algorithm for approximating the silhouette using a ball tree
Authors
Šimeček, I.; Kozický, C.
Year
2023
Published
Journal of Parallel and Distributed Computing. 2023, 173 115-125. ISSN 0743-7315.
Type
Article
Departments
Annotation
Clustering is widely used in many scientific fields. The contribution of enumerating the value of the silhouette is twofold: firstly it can help choosing a suitable cluster count and secondly it can be used to evaluate the quality of a clustering. Enumerating the silhouette exactly is an extremely time-consuming task, especially in big data applications; it is therefore common to approximate its value. This article presents an efficient shared-memory parallel algorithm for approximating the silhouette, which uses a ball tree. The process of initialising the ball tree and enumerating the silhouette are fully parallelised using the OpenMP API. The results of our experiments show that the proposed parallel algorithm substantially increases the speed of computing the silhouette whilst retaining necessary precision for real-world applications.
Design of a modern fast Fourier transform and cache effective bit-reversal algorithm
Authors
Šimeček, I.; Simek, A.
Year
2023
Published
International Journal of Parallel, Emergent and Distributed Systems. 2023, 38(3), 229-248. ISSN 1744-5760.
Type
Article
Departments
Annotation
This article deals with efficient vectorization of the fast Fourier transform algorithm while focusing on Cooley-Tukey versions with power-of-two radixes.
Aside from examples of optimizations for 256 and 512-bit vectors, this work also discusses relations between individual radix-based versions, vectorization and OpenMP threading.
Ideas are progressing into a timeless design of the FFT algorithm, which can work with any vector size and radix version through conversion into radix-2 output permutation.
Furthermore, the implementation of the Cache Optimized Bit-Reversal algorithm, which doubles the performance of its predecessor, is introduced.
Overview of ParaCell package for indexing in powder diffraction
Authors
Šimeček, I.; Rohlíček, J.; Zaloga, A.
Year
2023
Published
Journal of Applied Crystallography. 2023, 56(1), 293-301. ISSN 1600-5767.
Type
Article
Departments
Annotation
This article describes the crystallography package ParaCell that integrates several indexing methods. All methods share the basic infrastructure of the program that uses graphical processing units (GPU) to evaluate the correctness of the unit cell. The individually implemented indexing methods in the program are presented along with a comparison with other indexing software. The success and time requirements were tested on several datasets, including orthorhombic, monoclinic and triclinic examples.
Joint direct and transposed sparse matrix-vector multiplication for multithreaded CPUs
Authors
Kozický, C.; Šimeček, I.
Year
2021
Published
Concurrency and Computation: Practice and Experience. 2021, 33(13), 1-26. ISSN 1532-0634.
Type
Article
DOI
Departments
Annotation
Repeatedly performing sparse matrix‐vector multiplication (SpMV) followed by transposed sparse matrix‐vector multiplication (SpMᵀV) with the same matrix is a part of several algorithms, for example, the Lanczos biorthogonalization algorithm and the biconjugate gradient method. Such algorithms can benefit from combining parallel SpMV and SpMᵀV into a single operation we call ‘joint direct and transposed sparse matrix‐vector multiplication’ (SpMMᵀV). In this article, we present a parallel SpMMᵀV algorithm for shared‐memory CPUs. The algorithm uses a sparse matrix format that divides the stored matrix into sparse matrix blocks and compresses the row and column indices of the matrix. This sparse matrix format can be also used for SpMV, SpMᵀV, and similar sparse matrix‐vector operations. We expand upon existing research by suggesting new variants of the parallel SpMMᵀV algorithm and by extending the algorithm to efficiently support symmetric matrices. We compare the performance of the presented parallel SpMMᵀV algorithm with alternative approaches, which use state‐of‐the‐art sparse matrix formats and libraries, using sparse matrices from real‐world applications. The performance results indicate that the median performance of our proposed parallel SpMMᵀV algorithm is up to 45% higher than of the alternative approaches.
A new parallel version of a dichotomy based algorithm for indexing powder diffraction data
Authors
Šimeček, I.; Trdlička, J.; Zaloga, A.
Year
2020
Published
Zeitschrift für Kristallographie - Crystalline Materials. 2020, 235(6-7), 203-212. ISSN 2194-4946.
Type
Article
Departments
Annotation
One of the key parts of the crystal structure solution process from powder diffraction data is the determination of the lattice parameters from experimental data shortly called indexing. The successive dichotomy method is the one of the most common ones for this process because it allows an exhaustive search. In this paper, we discuss several improvements for this indexing method that significantly reduce the search space and decrease the solution time. We also propose a combination of this method with other indexing methods: grid search and TREOR. The effectiveness and time-consumption of such algorithm were tested on several datasets, including orthorhombic, monoclinic, and triclinic examples. Finally, we discuss the impacts of the proposed improvements.
Space-Efficient k-d Tree-Based Storage Format for Sparse Tensors
Authors
Year
2020
Published
Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing. New York: Association for Computing Machinery, 2020. p. 29-33. HPDC ’20. ISBN 978-1-4503-7052-3.
Type
Proceedings paper
Departments
Annotation
Computations with tensors are widespread in many scientific areas. Usually, the used tensors are very large but sparse, i.e., the vast majority of their elements are zero. The space complexity of sparse tensor storage formats varies significantly. For overall efficiency, it is important to reduce the execution time and additional space requirements of the initial preprocessing (i.e., converting tensors from common storage formats to the given internal format).
The main contributions of this paper are threefold. Firstly, we present a new storage format for sparse tensors, which we call the succinct k-d tree-based tensor (SKTB) format. We compare the space complexity of common tensor storage formats and of the SKTB format and demonstrate the viability of using a tree as a data structurefor sparse tensors. Secondly, we present a parallel space-efficient algorithm for converting tensors to the SKTB format. Thirdly, we demonstrate the computational efficiency of the proposed format in sparse tensor-vector multiplication.
Analysis of Memory Footprints of Sparse Matrices Partitioned Into Uniformly-Sized Blocks
Authors
Year
2018
Published
Scalable Computing: Practice and Experience. 2018, 19(3), 275-291. ISSN 1895-1767.
Type
Article
Departments
Annotation
The presented study analyses memory footprints of 563 representative benchmark sparse matrices with respect to their partitioning into uniformly-sized blocks. Different block sizes and different ways of storing blocks in memory are considered and statistically evaluated. Memory footprints of partitioned matrices are then compared with their lower bounds and CSR, index-compressed CSR, and EBF storage formats. The results show that block-based storage formats may significantly reduce memory footprints of sparse matrices arising from a wide range of application domains. Additionally, measured consistency of results is presented and discussed, benefits of individual formats for storing blocks are evaluated, and an analysis of best-case and worst-case matrices is provided for in-depth understanding of causes of memory savings of block-based formats.
Improved Corners with Multi-Channel Signed Distance Fields
Authors
Chlumský, V.; Sloup, J.; Šimeček, I.
Year
2018
Published
COMPUTER GRAPHICS FORUM. 2018, 37(1), 273-287. ISSN 0167-7055.
Type
Article
Annotation
We propose an extension to the state-of-the-art text rendering technique based on sampling a 2D signed distance field from a texture. This extension significantly improves the visual quality of sharp corners, which is the most problematic feature to reproduce for the original technique. We achieve this by using a combination of multiple distance fields in conjunction, which together provide a more thorough representation of the given glyph's (or any other 2D shape's) geometry. This multi-channel distance field representation is described along with its application in shader-based rendering. The rendering process itself remains very simple and efficient, and is fully compatible with previous monochrome distance fields. The introduced method of multi-channel distance field construction requires a vector representation of the input shape. A comparative measurement of rendering quality shows that the error in the output image can be reduced by up to several orders of magnitude.
Multilayer Approach for Joint Direct and Transposed Sparse Matrix Vector Multiplication for Multithreaded CPUs
Authors
Šimeček, I.; Langr, D.; Kotenkov, I.
Year
2018
Published
Parallel Processing and Applied Mathematics Part I.. Cham: Springer International Publishing AG, 2018. p. 47-56. Lecture Notes in Computer Science. vol. 10777. ISSN 0302-9743. ISBN 978-3-319-78023-8.
Type
Proceedings paper
Departments
Annotation
One of the most common operations executed on modern high-perfor\-mance computing systems is multiplication of a sparse matrix by a dense vector within a shared-memory computational node. Strongly related but far less studied problem is joint direct and transposed sparse matrix-vector multiplication, which is widely needed by certain types of iterative solvers. We propose a multilayer approach for joint sparse multiplication that balances the workload of threads. Measurements prove that our algorithm is scalable and achieve high computational performance for multiple benchmark matrices that arise from various scientific and engineering disciplines.
Effective construction of convex hull algorithms
Authors
Mitura, P.; Šimeček, I.; Kotenkov, I.
Year
2017
Published
Proceedings - 2017 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2017. Los Alamitos: Conference publishing Services, 2017. p. 105-112. ISSN 2470-8801. ISBN 978-1-5386-2626-9.
Type
Proceedings paper
Departments
Annotation
Finding the convex hull of a set of points in a plane is one of the most common problems in computational geometry. We survey known algorithms for solving
this problem and look into methods of their effective and parallel implementation. A simple generator of random input datasets is created, with an option to control the number of points on resulting hull. We implement all surveyed algorithms
along with their optimizations and compare them using our
generator. Our measurements show, that Quickhull algorithm using optimizations proposed by Hoang and Linh [1] performs best among the implemented methods and is faster than the current state of the art libraries.
On Memory Footprints of Partitioned Sparse Matrices
Authors
Year
2017
Published
Proceedings of the 2017 Federated Conference on Computer Science and Information Systems. Katowice: Polish Information Processing Society, 2017. p. 513-521. Annals of Computer Science and Information Systems. vol. 11. ISSN 2300-5963. ISBN 978-83-946253-7-5.
Type
Proceedings paper
DOI
Departments
Annotation
The presented study analyses 563 representative benchmark sparse matrices with respect to their partitioning into uniformly-sized blocks. The aim is to minimize memory footprints of matrices. Different block sizes and different ways of storing blocks in memory are considered and statistically evaluated. Memory footprints of partitioned matrices are additionally compared with lower bounds and the CSR storage format. The average measured memory savings against CSR in case of single and double precision are 42.3 and 28.7 percents, respectively. The corresponding worst-case savings are 25.5 and 17.1 percents. Moreover, memory footprints of partitioned matrices were in average 5 times closer to their lower bounds than CSR. Based on the obtained results, we provide generic suggestions for efficient partitioning and storage of sparse matrices in a computer memory.
AQsort: Scalable Multi-Array In-Place Sorting with OpenMP
Authors
Year
2016
Published
Scalable Computing: Practice and Experience. 2016, 17(4), 369-391. ISSN 1895-1767.
Type
Article
Departments
Annotation
A new multi-threaded variant of the quicksort algorithm called AQsort and its C++/OpenMP implementation are presented. AQsort operates in place and was primarily designed for high-performance computing (HPC) runtime environments. It can work with multiple arrays at once; such a functionality is frequently required in HPC and cannot be accomplished with standard C pointer-based or C++ iterator-based approach. An extensive study is provided that evaluates AQsort experimentally and compares its performance with modern multi-threaded implementations of in-place and out-of-place sorting algorithms based on OpenMP, Cilk Plus, and Intel TBB. The measurements were conducted on several leading-edge HPC architectures, namely Cray XE6 nodes with AMD Bulldozer CPUs, Cray XC40 nodes with Intel Hasswell CPUs, IBM BlueGene/Q nodes, and Intel Xeon Phi coprocessors. The obtained results show that AQsort provides good scalability and sorting performance generally comparable to its competitors. In particular cases, the performance of AQsort may be slightly lower, which is the price for its universality and ability to work with substantially larger amounts of data.
Block Iterators for Sparse Matrices
Authors
Langr, D.; Šimeček, I.; Tomáš Dytrych, T.D.
Year
2016
Published
Proceedings of the 2016 Federated Conference on Computer Science and Information Systems. New York: Institute of Electrical and Electronics Engineers, 2016. pp. 695-704. ISBN 978-83-60810-90-3.
Type
Proceedings paper
DOI
Departments
Annotation
Finding an optimal block size for a given sparse matrix forms an important problem for storage formats that partition matrices into uniformly-sized blocks. Finding a solution to this problem can take a significant amount of time, which, effectively, may negate the benefits that such a format brings into sparse-matrix computations. A key for an efficient solution is the ability to quickly iterate, for a particular block size, over matrix nonzero blocks. This work proposes an efficient parallel algorithm for this task and evaluate it experimentally on modern multi-core and many-core high performance computing (HPC) architectures.
Efficient parallel evaluation of block properties of sparse matrices
Authors
Year
2016
Published
Proceedings of the 2016 Federated Conference on Computer Science and Information Systems. New York: Institute of Electrical and Electronics Engineers, 2016. pp. 709-716. ISBN 978-83-60810-90-3.
Type
Proceedings paper
Departments
Annotation
Many storage formats for sparse matrices have been developed. Majority of these formats can be parametrized, so the algorithm for finding optimal parameters is crucial. For overall efficiency, it is important to reduce the execution time of this preprocessing. In this paper, we propose a new algorithm for the determination of the number of nonzero blocks of the given size in a sparse matrix. The proposed algorithm requires relatively a small amount of auxiliary memory. Our approach is based on the Morton reordering and bitwise manipulations. We also present a parallel (multithreaded) version and evaluate its performance and space complexity.
GPU solver for systems of linear equations with infinite precision
Authors
Khun, J.; Šimeček, I.; Lórencz, R.
Year
2016
Published
17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Los Alamitos: IEEE Computer Society, 2016. p. 121-124. ISBN 978-1-5090-0461-4.
Type
Proceedings paper
Departments
Annotation
In this paper, we would like to introduce a GPU accelerated solver for systems of linear equations with an infinite precision. The infinite precision means that the system can provide a precise solution without any rounding error. These errors usually come from limited precision of floating point values within their natural computer representation. In a simplified description, the system is using modular arithmetic for transforming an original SLE into dozens of integer SLEs that are solved in parallel via GPU. In the final step, partial results are used for a calculation of the final solution. The usage of GPU plays a key role in terms of performance because the whole process is computationally very intensive. The GPU solver can provide about one magnitude higher performance than a multithreaded one.
Parallel solver of large systems of linear inequalities using Fourier--Motzkin elimination
Authors
Year
2016
Published
Computing and Informatics. 2016, 35(6), 1307-1337. ISSN 1335-9150.
Type
Article
Departments
Annotation
Fourier-Motzkin elimination is a computationally expensive but powerful method to solve a system of linear inequalities. These systems arise e.g. in execution order analysis for loop nests or in integer linear programming. This paper focuses on the analysis, design and implementation of a~parallel solver for distributed memory for large systems of linear inequalities using the Fourier--Motzkin elimination algorithm. We also measure the speedup of parallel solver and prove that this implementation results in good scalability.
Space and execution efficient formats for modern processor architectures
Authors
Year
2016
Published
17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Los Alamitos: IEEE Computer Society, 2016. pp. 98-105. ISBN 978-1-5090-0461-4.
Type
Proceedings paper
Departments
Annotation
Sparse matrix-vector multiplication (shortly SpMV) and transposed SpMV (shortly SpMTV) are the most common routines in the numerical algebra. Sparse storage formats describe a way how sparse matrices are stored in a computer memory. Since the commonly used storage formats (like COO or CSR) are not sufficient for high-performance computations, extensive research has been conducted about maximal computational efficiency of these routines. For modern CPU architectures, the main bottleneck of these routines is the limited memory bandwidth. In this paper, we introduce a new approach for these routines for modern processor architectures using a space efficient hierarchical format, which can significantly reduce the amount of transferred data from memory for almost all types of matrices arising from various application disciplines. This format represents a trade-off between space and execution efficiency. The performance of these routines with this format seems to be very close to the hardware limits.
A new parallel and GPU version of a TREOR-based algorithm for indexing powder diffraction data
Authors
Šimeček, I.; Zahradnický, T.; Langr, D.; Rohlíček, J.
Year
2015
Published
Journal of Applied Crystallography. 2015, 48(1), 166-170. ISSN 0021-8898.
Type
Article
Departments
Annotation
One of the key parts of the crystal structure solution process from powder diffraction data is indexing - the determination of the lattice parameters from experimental data. This paper presents a modification of the TREOR indexing method that makes the algorithm suitable and efficient for execution on graphics processing units. The TREOR algorithm was implemented in its pure form, which can be simply described as a `brute-force' approach. The effectiveness and time consumption of such an algorithm was tested on several data sets including monoclinic and triclinic examples. The results show the potential of using GPUs for indexing powder diffraction data.
Acceleration of Le Bail fitting method on parallel platforms
Authors
Mařík, O.; Šimeček, I.
Year
2015
Published
Proceedings of Seminar Programs and Algorithms of Numerical Mathematics 17. Praha: Matematický ústav AV ČR, 2015. pp. 136-141. ISBN 978-80-85823-64-6.
Type
Proceedings paper
Departments
Annotation
Le Bail fitting method is procedure used in the applied crystallography mainly during the crystal structure determination. As in many other applications, there is a need for a great performance and short execution time. The reason for including this method here is that it uses several mathematical operations which can be accelerated using parallel computing. Therefore short description of this method is included along with the concepts it uses.
In this paper, we describe utilization of parallel computing for mathematical operations used in Le Bail fitting.
We present an algorithm implementing this method with highlighted possible approaches to its aforementioned parallelization. Then, we propose a sample parallel version using the OpenMP API and its performance results on the real multithreaded system. Further potential for the massive parallelization is also discussed.
Efficient Converting of Large Sparse Matrices to Quadtree Format
Authors
Year
2015
Published
16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Los Alamitos: IEEE Computer Society, 2015. pp. 122-129. ISBN 978-1-4799-8448-0.
Type
Proceedings paper
Departments
Annotation
Computations with sparse matrices are widespread in scientific projects. Used data format affects strongly the performance and also the space-efficiency. Commonly used storage formats (such as COO or CSR) are not suitable neither for some numerical algebra operations (e.g., The sparse matrix-vector multiplication) due to the required indirect addressing nor for I/O file operations with sparse matrices due to their high space complexities. In our previous papers, we prove that the idea of using the quad tree for these purposes is viable. In this paper, we present a completely new algorithm based on bottom-up approach for the converting matrices from common storage formats to the quad tree format. We derive the asymptotic complexity of our new algorithm, design the parallel variant of the classical and the new algorithm, and discuss their performance.
Parallelization of Artificial Immune Systems Using a massive parallel approach via modern GPUs
Authors
Khun, J.; Šimeček, I.
Year
2015
Published
Proceedings of Seminar Programs and Algorithms of Numerical Mathematics 17. Praha: Matematický ústav AV ČR, 2015. pp. 106-111. ISBN 978-80-85823-64-6.
Type
Proceedings paper
Departments
Annotation
Parallelization is one of possible approaches for obtaining better results in terms of algorithm performance and overcome the limits of the sequential computation. In this paper, we present a study of parallelization of the opt-aiNet algorithm which comes from Artificial Immune Systems, one part of large family of population based algorithms inspired by nature. The opt-aiNet algorithm is based on an immune network theory which incorporates knowledge about mammalian immune systems in order to create a~state-of-the-art algorithm suitable for the multimodal function optimization. The algorithm is known for a combination of local and global search with an emphasis on maintaining a stable set of distinct local extrema solutions. Moreover, its modifications can be used for many other purposes like data clustering or combinatorial optimization. The parallel version of the algorithm is designed especially for modern graphics processing units. The preliminary performance results show very significant speedup over the computation with traditional central processor units.
Utilization of GPU Acceleration in Le Bail Fitting Method
Authors
Šimeček, I.; Mařík, O.; Jelínek, M.
Year
2015
Published
Romanian Journal of Information Science and Technology (ROMJIST). 2015, 18(2), 182-196. ISSN 1453-8245.
Type
Article
Departments
Annotation
Le Bail fitting method is a process used in applied crystallography. It can be employed in several phases of crystal structure determination and as it is only one step in a more complex process, it needs to be as fast as possible. This article begins with a short explanation of crystallography terms needed to understand the Le Bail fitting, then continues with the description of the Le Bail fitting method itself and basic principles on which it is based. Then the parallelization method is explained, starting with a more general process, followed by specifics of GPU accelerated computing including short part on optimization. Finally, achieved results are presented along with comparison to sequential implementation and alternative parallelization approaches.
A new code transformation technique for nested loops
Authors
Year
2014
Published
COMSIS - Computer Science and Information Systems. 2014, 11(4), 1381-1416. ISSN 1820-0214.
Type
Article
Departments
Annotation
For good performance of every computer program, good cache utilization is crucial. In numerical linear algebra libraries, good cache utilization is achieved by explicit loop restructuring (mainly loop blocking), but it requires a complicated memory pattern behavior analysis. In this paper, we describe a new source code transformation called dynamic loop reversal that can increase temporal and spatial locality. We also describe a formal method for predicting cache behavior and evaluate results of the model accuracy by the measurements on a cache monitor. The comparisons of the numbers of measured cache misses and the numbers of cache misses estimated by the model indicate that the model is relatively accurate and can be used in practice.
Parallel Data Acquisition for Visualization of Very Large Sparse Matrices
Authors
Year
2014
Published
15th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Los Alamitos: IEEE Computer Society, 2014. pp. 336-343. ISBN 978-1-4799-3035-7.
Type
Proceedings paper
Departments
Annotation
The problem of visualization of very large sparse matrices emerging on massively parallel computer systems is identified and a new method along with an accompanying algorithm for parallel acquisition of visualization data for such matrices are presented. The proposed method is based on downsampling a matrix into blocks for which the desired visualization data are saved into a file. This file is then supposed to be downloaded and processed into a final image on a personal computer. Experimental results for the evaluation of the performance and scalability of the proposed algorithm are further provided and discussed.
Space Efficient Formats for Structure of Sparse Matrices Based on Tree Structures
Authors
Year
2014
Published
15th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Los Alamitos: IEEE Computer Society, 2014. pp. 344-351. ISBN 978-1-4799-3035-7.
Type
Proceedings paper
Departments
Annotation
Very large sparse matrices are often processed on massively parallel computer systems with distributed memory architectures consisting of tens or hundreds of thousands of processor cores. The problem occurs when we want or need to load/store these matrices from/to a distributed file system.
This paper deals with the design of new formats for storing very large sparse matrices suitable for parallel I/O systems. The first one is based on arithmetic coding and the second one is based on binary tree format. We compare the space complexity of common storage formats and our new formats and prove that the latter are considerably more space efficient.
The study of impact of matrix-processor mapping on the parallel sparse matrix-vector multiplication
Authors
Šimeček, I.; Langr, D.; Srnec, E.
Year
2014
Published
15th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Los Alamitos: IEEE Computer Society, 2014. pp. 321-328. ISBN 978-1-4799-3035-7.
Type
Proceedings paper
Departments
Annotation
Sparse matrix-vector multiplication (shortly spMV) is one of the most common subroutines in the numerical linear algebra.
The parallelization of this task looks easy and straightforward, but it is not optimal in general case.
This paper discuss some matrix-processor mappings and their impact on parallel spMV execution on massively parallel systems. We try to balance the performance and the overhead of the required transformation. We also present algorithms for redistribution. We propose four quality measures and derive lower and upper bound for different mappings. Our $spMV$ algorithms are scalable for almost all matrices arising from various technical areas.
Tree-based Space Efficient Formats for Storing the Structure of Sparse Matrices
Authors
Year
2014
Published
Scalable Computing: Practice and Experience. 2014, 15(1), 1-20. ISSN 1895-1767.
Type
Article
Departments
Annotation
Sparse storage formats describe a way how sparse matrices are stored in a computer memory. Extensive research has been conducted about these formats in the context of performance optimization of the sparse matrix-vector multiplication algorithms, but memory efficient formats for storing sparse matrices are still under development, since the commonly used storage formats (like COO or CSR) are not sufficient. In this paper, we propose and evaluate new storage formats for sparse matrices that minimize the space complexity of information about matrix structure. The first one is based on arithmetic coding and the second one is based on binary tree format. We compare the space complexity of common storage formats and our new formats and prove that the latter are considerably more space efficient.
A new approach for indexing powder diffraction data suitable for GPGPU execution
Authors
Year
2013
Published
Soft Computing Models in Industrial and Environmental Applications. Heidelberg: Springer, 2013. pp. 409-416. ISSN 2194-5357. ISBN 978-3-642-32921-0.
Type
Proceedings paper
Departments
Annotation
Powder diffraction (based typically on X-ray usage) is a well-established method for a complete analysis and structure determination of crystalline materials. One of the key parts of the experimental data processing is the process of indexation - determination of lattice parameters. The lattice parameters are essential information required for phase identification as well as for eventual phase structure solution. Nowadays computer hardware with efficient implementation of indexation algorithms gives a chance to to solve several problematic situations not covered fully by existing methods. This paper deals with design of algorithm for indexing powder diffraction data suitable for massively parallel platforms such as GPUs.
Clock Math - a System for Solving SLEs Exactly
Authors
Hladík, J.; Lórencz, R.; Šimeček, I.
Year
2013
Published
Acta Polytechnica. 2013, 53(ě), 70-74. ISSN 1210-2709.
Type
Article
Departments
Annotation
In this paper, we present a GPU-accelerated hybrid system that solves ill-conditioned systems of linear equations exactly. Exactly means without rounding errors due to using integer arithmetics. First, we scale floating-point numbers up to integers, then we solve dozens of SLEs within different modular arithmetics and then we assemble sub-solutions back using the Chinese remainder theorem. This approach effectively bypasses current CPU floating-point limitations. The system is capable of solving Hilbert’s matrix without losing a single bit of precision, and with a significant speedup compared to existing CPU solvers.
Dynamic loop reversal - The new code transformation technique
Authors
Year
2013
Published
Proceedings of the 2013 Federated Conference on Computer Science and Information Systems. Los Alamitos: IEEE Computer Society, 2013. pp. 1587-1594. ISSN 2325-0348. ISBN 978-1-4673-4471-5.
Type
Proceedings paper
Departments
Annotation
In this paper, we describe a new source code transformation called dynamic loop reversal that can increase temporal and spatial locality. We also describe a formal method for predicting the cache behaviour and evaluation results of the accuracy of the model by measurements on a cache monitor. The comparisons of the numbers of measured cache misses and the numbers of cache misses estimated by the model indicate that model is relatively accurate and can be used in practice.
Storing sparse matrices to files in the adaptive-blocking hierarchical storage format
Authors
Year
2013
Published
Proceedings of the 2013 Federated Conference on Computer Science and Information Systems. Los Alamitos: IEEE Computer Society, 2013. pp. 479-486. ISSN 2325-0348. ISBN 978-1-4673-4471-5.
Type
Proceedings paper
Departments
Annotation
When there is a need to store a sparse matrix into a file system, is it worth to convert it first into some space-efficient storage format? This paper tries to answer such question for the adaptive-blocking hierarchical storage format (ABHSF), provided that the matrix is present in memory either in the coordinate (COO) or in the compressed sparse row (CSR) storage format. The conversion algorithms from COO and CSR to ABHSF are introduced and the results of performed experiments are then presented and discussed.
A New Approach for Indexing Powder Diffraction Data Based on Dichotomy Method
Authors
Year
2012
Published
Computational Science and Engineering (CSE), 2012 IEEE 15th International Conference on. Los Alamitos: IEEE Computer Society, 2012. pp. 124-129. ISBN 978-1-4673-5165-2.
Type
Proceedings paper
Departments
Annotation
Powder diffraction is a basic method in applied crystallography. One of the key parts of the experimental data processing is the process of indexation - determination of lattice parameters. The lattice parameters are essential information required for phase identification as well as for eventual phase structure solution. The most common methods for the indexation are dichotomy and TREOR algorithms. In this paper we represent some improvements to dichotomy and TREOR algorithms. Proposed algorithms improvements increase the robustness and speed of indexing powder diffraction data.
Adaptive-Blocking Hierarchical Storage Format for Sparse Matrices
Authors
Year
2012
Published
Federated Conference on Computer Science and Information Systems (FedCSIS 2012). New York: IEEE, 2012. pp. 545-551. ISBN 978-1-4673-0708-6.
Type
Proceedings paper
Departments
Annotation
Hierarchical storage formats (HSFs) can significantly reduce the space complexity of sparse matrices. They vary in storage schemes that they use for blocks and for block matrices. However, the current HSFs prescribe a fixed storage scheme for all blocks, which is not always space-optimal.We show that, generally, different storage schemes are space-optimal for different blocks. We further propose a new HSF that is based on this approach and compare its space complexity with current HSFs for benchmark matrices arising from different application areas.
Heterogeneous cluster for acceleration of linear algebra computations
Authors
Year
2012
Published
Proceedings of International Conference on Applied Mathematics (Aplimat 2012). Bratislava: Slovenská technická univerzita, Strojnícka fakulta, 2012. pp. 299-306. ISBN 978-80-89313-58-7.
Type
Proceedings paper
Departments
Annotation
Plenty of numerical algebra libraries have been developed in recent years. These libraries are tuned for the given CPU and its memory architecture, fully utilize its memory hierarchy and inner pipelines and achieve impressive computation power. There is also a new trend in the high-performance computing: GPU computing.
This paper deals with a new concept of the heterogeneous grid for acceleration of the numerical linear algebra computing. We design this grid with respect to maximal ratio between cost and computational power. It allows a parallelization of scientific codes with minimal programming effort. We also optimize grid concept to be less sensitive to network parameters.
Heterogeneous cluster for acceleration of linear algebra computations
Authors
Year
2012
Published
Aplimat - Journal of Applied Mathematics (CD ROM). 2012, 5(2), 225-232. ISSN 1337-6365.
Type
Article
Departments
Annotation
Plenty of numerical algebra libraries have been developed in recent years. These libraries are tuned for the given CPU and its memory architecture, fully utilize its memory hierarchy and inner pipelines and achieve impressive computation power. There is also a new trend in the high-performance computing: GPU computing.
This paper deals with a new concept of the heterogeneous grid for acceleration of the numerical linear algebra computing. We design this grid with respect to maximal ratio between cost and computational power. It allows a parallelization of scientific codes with minimal programming effort. We also optimize grid concept to be less sensitive to network parameters.
Minimal Quadtree Format for Compression of Sparse Matrices Storage
Authors
Year
2012
Published
Proceedings of 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2012). Los Alamitos: IEEE Computer Society, 2012. pp. 359-364. SYNASC'12. ISBN 978-0-7695-4934-7.
Type
Proceedings paper
Departments
Annotation
Computations with sparse matrices are widespread in scientific projects. Commonly used storage formats (such as COO or CSR) are not suitable for I/O file operations with sparse matrices due to their high space complexities. Memory-efficient formats are still under development.
In this paper, we present a new storage format called the Minimal quadtree (MQ) as well as algorithms for converting matrices from common storage formats to the MQ format. We compare the space complexity of common storage formats and of the MQ format and prove that the idea of using the quadtree as the data structure for sparse matrices is viable.
Modular Arithmetic for Solving Linear Equations on the GPU
Authors
Hladík, J.; Šimeček, I.
Year
2012
Published
Seminar on Numerical Analysis. Liberec: Technical University of Liberec, 2012. pp. 68-70. ISBN 978-80-7372-821-2.
Type
Proceedings paper
Departments
Annotation
In this paper I propose a GPU-running system that solves systems of linear equations on the GPU. Modern GPU hardware is capable of accelerating data-parallel algorithms so we can expect a huge speedup compared to CPU implementations.
Space-efficient sparse matrix storage formats for massively parallel systems
Authors
Year
2012
Published
HPCC 2012: IEEE 14th International Conference on High Performance Computing and Communications. Los Alamitos: IEEE Computer Society, 2012. pp. 54-60. ISBN 978-0-7695-4749-7.
Type
Proceedings paper
Departments
Annotation
In this paper, we propose and evaluate new storage formats for sparse matrices that minimize the space complexity of information about matrix structure. The motivation of our work are applications with very large sparse matrices that due to their size must be processed on massively parallel computer systems consisting of tens or hundreds of thousands of processor cores and that must be stored in a distributed file system using parallel I/O. The parallel I/O is typically the main performance bottleneck and reading or writing such matrices from/to distributed file system can take significant amount of time. We try to reduce this time by reducing the amount of data to be processed.
Space-efficient sparse matrix storage formats with 8-bit indices
Authors
Year
2012
Published
Seminar on Numerical Analysis. Liberec: Technical University of Liberec, 2012. pp. 161-164. ISBN 978-80-7372-821-2.
Type
Proceedings paper
Departments
Annotation
The paper is aimed at space-efficient storage formats for very large sparse
matrices (VLSMs). By VLSMs, we mean matrices that because of their size must be stored and
processed by massively parallel computer systems with distributed memory architectures
consisting of tens or hundreds of thousands of processor cores.
Experimental grid for numerical linear algebra
Authors
Šimeček, I.; Hladík, J.; Krupka, J.; Hovorka, M.
Year
2011
Published
SNA'11 Seminar on Numerical Analysis. Modelling and Simulation of Challenging Engineering Problems. Ostrava: Ústav geoniky AV ČR, 2011. pp. 104-107. ISBN 978-80-86407-19-7.
Type
Proceedings paper
Departments
Annotation
Time is very often the limiting factor in scientific codes. These codes can be accelerated by parallel executing on special distributed systems (grids). This is usual but very difficult
solution. In this paper, we describe a design of the new heterogenous grid for the numerical linear algebra with maximal ratio between prize and computational power. Contributions of this
paper is twofold: 1) a design of new parallel routines 2) an approach for parallelization of scientific codes by converting local numerical library calls into remote grid calls.
GPU cluster for acceleration of computations
Authors
Šimeček, I.; Šimek, F.
Year
2011
Published
Proceedings of International Conference PRESENTATION of MATHEMATICS '11. Liberec: Technická univerzita, 2011. pp. 125-130. ISBN 978-80-7372-773-4.
Type
Proceedings paper
Departments
Annotation
Plenty of numerical algebra libraries have been developed in recent years. These libraries are tuned for the given CPU and its memory architecture, fully utilize its memory hierarchy and inner pipelines and achieve impressive computation power. There is also a new trend in the high-performance computing: GPU computing.
This paper describes a new concept of the new heterogeneous grid for the numerical linear algebra computing with maximal ratio between prize and computational power. It allows a parallelization of scientific codes with minimal programming effort.
Improvement of shortest path algorithms through graph partitioning
Authors
Šimeček, I.; Šimek, F.
Year
2011
Published
Proceedings of International Conference PRESENTATION of MATHEMATICS '11. Liberec: Technická univerzita, 2011. pp. 131-137. ISBN 978-80-7372-773-4.
Type
Proceedings paper
Departments
Annotation
Route planning problem occurs often in many practical areas of life including problem of finding of optimal paths in road networks.
This paper presents an algorithm based upon traditional graph algorithms, like the Dijkstra's shortest path algorithm but it takes advantage of the specific properties of a road network graph.
Presented modification of algorithm partitions a given graph into smaller components, it results in efficient execution even for large-scale graphs. Searching in smaller components still guarantee optimality of solution.
Comparison of the Sparse Matrix Storage Formats
Authors
Year
2010
Published
Seminar on Numerical Analysis 2010. Praha: Ústav Informatiky AV ČR, v.v.i., 2010. pp. 127-130. ISBN 978-80-87136-07-2.
Type
Proceedings paper
Departments
Annotation
Computations with sparse matrices are widespread in scientific
projects. The performance of mathematical operations with sparse
matrices depends strongly on the used matrix storage format. In this
paper, we compare the performance during the execution of some basic
routines from linear solvers and its dependency on the used format.
The paper consists of four parts: a general introduction of sparse
matrix storage formats, performance testing, conclusions, and
suggestions for future work.
Parallel solvers of Poisson's equation
Authors
Krupka, J.; Šimeček, I.
Year
2010
Published
6th Doctoral Workshop on Mathematical and Engineering Methods in Computer Science. Brno: NOVPRESS, 2010. pp. 101-108. ISBN 978-80-87342-10-7.
Type
Proceedings paper
Departments
Annotation
Various problems arising from physical processes are modelled
using partial differential equations (PDEs). The Poisson's equation
is the example of elliptic PDE, which we use for presenting the finite
difference method for transforming the continuous problem to the discrete
problem and obtaining sparse system of linear equations (SLEs).
This paper discusses possibilities of parallel implementation of numerical
methods for solving these special SLEs on clusters of workstations
using message passing interface. We present theoretical background and
evaluate results of our experiments done with OpenMPI library in parallel
environment with which we have achieved reasonable high parallel
efficiency for our problem.
Superlinear speedup and constant efficiency for certain number of used
processors have been achieved.
Sparse Matrix Computations Using the Quadtree Storage Format
Authors
Year
2010
Published
Proceedings of 11th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2009). Los Alamitos: IEEE Computer Society, 2010. p. 168-173. ISBN 978-0-7695-3964-5.
Type
Proceedings paper
Departments
Annotation
Computations with sparse matrices are widespread in scientific projects. Used data format affects strongly the performance. Efficient formats for storing sparse matrices are still under
development, since the computation using widely-used formats (like XY or CSR) is slow and specialized formats (like SPARSITY or CARB) have a large transformation overhead.
In this paper, we represent some improvements to the quadtree storage format. We also compare the performance during the execution of some basic routines from the linear algebra using widely-used formats and the quadtree storage format.