High Performance Computing
Computers and computational resources in general have evolved tremendously in the last 50 years. Tools commonly used by our fathers, such as mathematical tables, slide rules, or mechanical calculators, have slowly been replaced by electronic calculators and desktop computers.
Not every computer can sit on an office desk. Some computers are too large for an office and require an entire room with a special cooling system. Going even further, there are several machines in the world whose size approaches that of a basketball court. These so-called supercomputers are built from thousands of regular servers (called nodes in supercomputing terminology) connected by a very fast interconnection network. This low-latency, high-bandwidth interconnect creates the illusion of a single enormous machine.
Our first supercomputer, Anselm, named after the first coal mine in Ostrava, was opened to researchers, students, and industrial partners on May 25th, 2013. Before Anselm was opened to the public, it was thoroughly tested and benchmarked for a couple of months. More information about Anselm and other computational resources operated by the Center of Excellence IT4Innovations can be found on the following website: https://support.it4i.cz/docs/anselm-cluster-documentation/.
Big national research centers are not the only adopters of supercomputing and parallel processing. The technologies, both hardware and software, that were originally developed for supercomputers bring faster processing to everyone who uses a computer of any size. One of the most noticeable examples is the multi-core processor, which can be found in every desktop or laptop computer on the market. To fully exploit the potential of these modern processors, one has to use a parallel programming language and compiler to produce a suitable parallel application. All these technologies were born in the high performance computing (HPC) world.
GPU (Graphics Processing Unit) acceleration is another important technology boosting the performance of today’s largest supercomputers, since a single GPU accelerator contains over 3000 processing units. In 1997 the world’s most powerful supercomputer (ASCI Red) was built from almost 10000 single-core processors. The computational power of this machine was close to 10^12 floating-point operations per second (FLOPS), in other words 1 teraFLOPS. Less than fifteen years later, a single computer with a GPU accelerator is able to deliver the same performance. These two technologies have one essential feature in common: the performance is delivered by thousands of parallel execution units. In order to fully utilize these massively parallel architectures, parallel algorithms scalable to thousands of processors have to be developed.
The parallelization of an algorithm for today’s supercomputers is done in several stages:
- Parallelization across multiple compute nodes – distributed-memory programming model – e.g. MPI, PGAS, …
- Parallelization within a compute node, across the processor cores – shared-memory programming model – e.g. OpenMP, Pthreads, …
- Heterogeneous parallelization for accelerators – hybrid model – e.g. CUDA, OpenCL, OpenACC, …
In every stage, the person writing the parallel code has to:
- Adapt the algorithm to the strengths and weaknesses of the underlying hardware architecture,
- Use a different programming model and language.
All these requirements make efficient parallel programming extremely difficult. Nevertheless, being able to produce efficient parallel code is essential.
The success of supercomputers strongly depends on the development of novel parallel algorithms that are able to utilize their performance. Parallelism has brought new constraints to algorithm development and new criteria for algorithm evaluation. An algorithm that is very efficient in sequential processing is in many cases inefficient for parallel processing, and vice versa. In addition, an implementation that suits one architecture might be inefficient on a different architecture.
So what can parallel processing do for us?
- Solve a problem in a significantly shorter time,
- Solve large-scale problems that cannot be solved on a single computer due to hardware constraints.
One of the most significant properties of any parallel algorithm is its scaling. If an algorithm has good parallel scaling, we expect that doubling the resources halves the processing time. Another important type of scaling is numerical scaling. An algorithm with good numerical scaling is able to find the solution of a problem in a constant number of iterations regardless of problem size. An efficient parallel algorithm is one that has both good parallel and good numerical scaling.
Our group has long experience in the development and implementation of efficient parallel algorithms and methods based on domain decomposition, see Basic Research. These novel methods have been successfully applied to solve large scientific and engineering problems, see Portfolio, and have been implemented in our in-house parallel libraries, see Products.