
Ansys Parallel Processing

ANSYS Academic Research HPC Workgroup 128
ANSYS Academic Research CFD (25 Tasks)
ANSYS Academic Research Mechanical and CFD (25 Tasks)

IMPORTANT

Please contact me if you need access to any of the other guides referenced in the Parallel Processing Guide.

Most of you probably need information on how to use parallel processing on a single workstation. Please review the documentation for shared-memory parallel processing (Chapter 2). Which settings to use, and whether parallel processing will benefit you at all, depends entirely on the scope of your project and what you are doing.

LICENSE

VMI has 128 HPC Workgroup (1 task) concurrent licenses available

Please Note: ANSYS counts physical cores, not virtual cores (hyperthreading). DO NOT EXCEED THE PHYSICAL CORE COUNT.

Chapter 1: Overview of Parallel Processing

Solving a large model with millions of DOFs, or a medium-sized model with nonlinearities that needs many iterations to reach convergence, can require many CPU hours. To decrease simulation time, ANSYS, Inc. offers different parallel processing options that increase the model-solving power of ANSYS products by using multiple processors (also known as cores). The following three parallel processing capabilities are available:

  • Shared-memory parallel processing (shared-memory ANSYS)
  • Distributed-memory parallel processing (Distributed ANSYS)
  • GPU acceleration (the GPU accelerator capability)

Multicore processors, and thus the ability to use parallel processing, are now widely available on all computer systems, from laptops to high-end servers. The benefits of parallel processing are compelling but are also among the most misunderstood. This chapter explains the two types of parallel processing available in ANSYS and also discusses the use of GPUs (considered a form of shared-memory parallel processing) and how they can further accelerate the time to solution.

Currently, the default scheme is to use two cores with distributed-memory parallelism. For many of the computations involved in a simulation, the speedups obtained from parallel processing are nearly linear as the number of cores is increased, making very effective use of parallel processing. However, the total benefit (measured by elapsed time) is problem dependent and is influenced by many different factors.

IMPORTANT

No matter what form of parallel processing is used, the maximum benefit attained will always be limited by the amount of work in the code that cannot be parallelized. If just 20 percent of the runtime is spent in nonparallel code, the maximum theoretical speedup is only 5X, assuming the time spent in parallel code is reduced to zero. However, parallel processing is still an essential component of any HPC system; by reducing wall clock elapsed time, it provides significant value when performing simulations.
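This limit is a direct consequence of Amdahl's law: if a fraction s of the runtime is serial, the speedup on N cores is at most 1 / (s + (1 - s)/N), which approaches 1/s as N grows. With s = 0.20, the ceiling is 1/0.20 = 5X, no matter how many cores are used.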

Distributed ANSYS, shared-memory ANSYS, and GPU acceleration can require HPC licenses. You can use up to four CPU cores, or a combination of four CPU cores and GPUs, without using any HPC licenses. Additional licenses are needed to run with more than four. See HPC Licensing (p. 3) for more information.

1.1 Parallel Processing Terminology

It is important to fully understand the terms we use, both relating to our software and to the physical hardware. The terms shared-memory ANSYS and Distributed ANSYS refer to our software offerings, which run on shared-memory or distributed-memory hardware configurations. The term GPU accelerator capability refers to our software offering which allows the program to take advantage of certain GPU (graphics processing unit) hardware to accelerate the speed of the solver computations.

The following terms describe the hardware configurations used for parallel processing:

Shared-memory hardware

This term refers to a physical hardware configuration in which a single shared-memory address space is accessible by multiple CPU cores; each CPU core "shares" the memory with the other cores. A common example of a shared-memory system is a Windows desktop machine or workstation with one or two multicore processors.

Distributed-memory hardware

This term refers to a physical hardware configuration in which multiple machines are connected together on a network (that is, a cluster). Each machine on the network (that is, each compute node on the cluster) has its own memory address space. Communication between machines is handled by interconnects (Gigabit Ethernet, Infiniband, etc.).

Virtually all clusters involve both shared-memory and distributed-memory hardware. Each compute node on the cluster typically contains two or more CPU cores, which means there is a shared-memory environment within a compute node. The distributed-memory environment requires communication between the compute nodes involved in the cluster.

GPU hardware

A graphics processing unit (GPU) is a specialized microprocessor that off-loads and accelerates graphics rendering from the microprocessor. Their highly parallel structure makes GPUs more effective than general-purpose CPUs for a range of complex algorithms. In a personal computer, a GPU on a dedicated video card is more powerful than a GPU that is integrated on the motherboard.

Head compute node

In a Distributed ANSYS run, the machine or node on which the master process runs (that is, the machine on which the job is launched). The head compute node should not be confused with the host node in a Windows cluster environment. The host node typically schedules multiple applications and jobs on a cluster, but does not typically run the application.

1.2 HPC Licensing

ANSYS, Inc. offers the following high performance computing license options:

  • ANSYS HPC - These physics-neutral licenses can be used to run a single analysis across multiple processors (cores).

  • ANSYS HPC Packs - These physics-neutral licenses share the same characteristics of the ANSYS HPC licenses, but are combined into predefined packs to give you greater value and scalability.

For detailed information on these HPC license options, see HPC Licensing in the ANSYS Licensing Guide.

The HPC license options cannot be combined with each other in a single solution; for example, you cannot use both ANSYS HPC and ANSYS HPC Packs in the same analysis solution.

The order in which HPC licenses are used is specified by your user license preferences setting. See Specify Product Order in the ANSYS Licensing Guide for more information on setting user license product order.

You can choose a particular HPC license by using the Preferred Parallel Feature command line option. The format is ansys211 -ppf <license feature name>, where <license feature name> is the name of the HPC license option that you want to use. This option forces Mechanical APDL to use the specified license feature for the requested number of parallel cores or GPUs. If the license feature is entered incorrectly or the license feature is not available, a license failure occurs.
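For example, to force an eight-core distributed run to draw from an HPC Pack license, the command would follow this pattern (the feature name shown here is purely illustrative; use the actual feature name from your license server):

ansys211 -dis -np 8 -ppf ansys_hpc_pack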

Both Distributed ANSYS and shared-memory ANSYS allow you to use four CPU cores without using any HPC licenses. ANSYS HPC licenses add cores to this base functionality, while the ANSYS HPC Pack licenses function independently of the four included cores.

In a similar way, you can use up to four CPU cores and GPUs combined without any HPC licensing (for example, one CPU and three GPUs). The combined number of CPU cores and GPUs used cannot exceed the task limit allowed by your specific license configuration.

Chapter 2: Using Shared-Memory ANSYS

When running a simulation, the solution time is typically dominated by three main parts: the time spent to create the element matrices and form the global matrices, the time to solve the linear system of equations, and the time spent calculating derived quantities (such as stress and strain) and other requested results for each element.

IMPORTANT

Shared-memory ANSYS can run a solution over multiple cores on a single machine. When using shared-memory parallel processing, you can reduce each of the three main parts of the overall solution time by using multiple cores. However, this approach is often limited by the memory bandwidth; you typically see very little reduction in solution time beyond four cores.

The main program functions that run in parallel on shared-memory hardware are:

  • Solvers such as the Sparse, PCG, ICCG, Block Lanczos, PCG Lanczos, Supernode, and Subspace solvers, running over multiple processors but sharing the same memory address space. These solvers typically have limited scalability when used with shared-memory parallelism. In general, very little reduction in time occurs when using more than four cores.
  • Forming element matrices and load vectors.
  • Computing derived quantities and other requested results for each element.
  • Pre- and postprocessing functions such as graphics, selecting, sorting, and other data and compute intensive operations.

2.1 Activating Parallel Processing in a Shared-Memory Architecture

  1. By default, shared-memory ANSYS uses two cores and does not require any HPC licenses. Additional HPC licenses are required to run with more than four cores. Several HPC license options are available. See HPC Licensing (p. 3) for more information.

  2. Open the Mechanical APDL Product Launcher:

  • Windows: Start > Programs > ANSYS 2021 R1 > Mechanical APDL Product Launcher
  • Linux: launcher211
  3. Select the correct environment and license.

  4. Go to the High Performance Computing Setup tab. Select Use Shared-Memory Parallel (SMP). Specify the number of cores to use.

  5. Alternatively, you can specify the number of cores to use via the -np command line option:

ansys211 -smp -np N

where N represents the number of cores to use.

For large multiprocessor servers, ANSYS, Inc. recommends setting N to a value no higher than the number of available cores minus one. For example, on an eight-core system, set N to 7. However, on multiprocessor workstations, you may want to use all available cores to minimize the total solution time. The program automatically limits the maximum number of cores used to be less than or equal to the number of physical cores on the machine. This is done to avoid running the program on virtual cores (for example, by means of hyperthreading), which typically results in poor per-core performance. For optimal performance, consider closing down all other applications before launching ANSYS.
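As a concrete sketch, an eight-core workstation following the guideline above could be launched on seven cores with something like the following (the job name and input/output file names are placeholders):

ansys211 -smp -np 7 -j myjob -i input.dat -o output.out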

If you have more than one HPC license feature, you can use the -ppf command line option to specify which HPC license to use for the parallel run. See HPC Licensing (p. 3) for more information.

  6. If working from the launcher, click Run to launch ANSYS.

  7. Set up and run your analysis as you normally would.

2.1.1 System Specific Considerations

For shared-memory parallel processing, the number of cores that the program uses is limited to the smallest of the following:

  • The number of ANSYS HPC licenses available (plus the first four cores which do not require any licenses)
  • The number of cores indicated via the -np command line argument
  • The actual number of cores available

You can specify multiple settings for the number of cores to use during a session. However, ANSYS, Inc. recommends that you issue the /CLEAR command before resetting the number of cores for subsequent analyses.

2.2 Troubleshooting

This section describes problems which you may encounter while using shared-memory parallel processing as well as methods for overcoming these problems. Some of these problems are specific to a particular system, as noted.

Occasionally, when running on Linux, a simulation may fail with the following message: “process killed (SIGTERM)”. This typically occurs while computing the solution and means that the system has killed the ANSYS process. The two most common causes are (1) ANSYS is using too much of the machine's hardware resources (typically memory), so the system kills the ANSYS process, or (2) a user has manually killed the ANSYS job (for example, with the kill -9 system command). Check the size of the job you are running in relation to the amount of physical memory on the machine. Most often, decreasing the model size or finding a machine with more RAM will result in a successful run.

Chapter 3: GPU Accelerator Capability

In an effort to provide faster performance during solution, Mechanical APDL supports offloading key solver computations onto graphics cards to accelerate those computations. Only high-end graphics cards, those with the largest number of cores and the most memory, can be used to accelerate the solver computations. For details on which GPU devices are supported and the corresponding driver versions, see the GPU requirements outlined in the Windows Installation Guide and the Linux Installation Guide.

It is important to understand that a GPU does not replace the CPU core(s) on which a simulation typically runs. One or more CPU cores must be used to run the Mechanical APDL program. The GPUs are used in support of the CPU to process certain calculations. The CPU continues to handle most operations and will automatically offload some of the time-intensive parallel operations performed by certain equation solvers. These parallel solver operations can usually be performed much faster on the highly parallel architecture of a GPU, thus accelerating these solvers and reducing the overall time to solution.

GPU acceleration can be used with both shared-memory parallel processing (shared-memory ANSYS) and distributed-memory parallel processing (Distributed ANSYS). In shared-memory ANSYS, one or multiple GPU accelerator devices can be utilized during solution. In Distributed ANSYS, one or multiple GPU accelerator devices per machine or compute node can be utilized during solution.

As an example, when using Distributed ANSYS on a cluster involving eight compute nodes with each compute node having two supported GPU accelerator devices, either a single GPU per node (a total of eight GPU cards) or two GPUs per node (a total of sixteen GPU cards) can be used to accelerate the solution. The GPU accelerator device usage must be consistent across all compute nodes. For example, if running a simulation across all compute nodes, it is not possible to use one GPU for some compute nodes and zero or two GPUs for the other compute nodes.

On machines containing multiple GPU accelerator devices, the program automatically selects the GPU accelerator device (or devices) to be used for the simulation. The program cannot detect if a GPU device is currently being used by other software, including another Mechanical APDL simulation. Therefore, in a multiuser environment, users should be careful not to oversubscribe the GPU accelerator devices by simultaneously launching multiple simulations that attempt to use the same GPU (or GPUs) to accelerate the solution. For more information, see Oversubscribing GPU Hardware (p. 14) in the troubleshooting discussion.

IMPORTANT

The GPU accelerator capability is only supported on the Windows 64-bit and Linux x64 platforms.

You can use up to four GPUs and CPUs combined without any HPC licensing (for example, one CPU and three GPUs). To use more than four, you need one or more ANSYS HPC licenses or ANSYS HPC Pack licenses. For more information see HPC Licensing in the ANSYS Licensing Guide.

3.1 Activating the GPU Accelerator Capability

Following is the general procedure to use the GPU accelerator capability:

  1. Before activating the GPU accelerator capability, you must have at least one GPU card installed with the proper driver level. You may also need some type of HPC license; see HPC licensing for details.

  2. Open the Mechanical APDL Product Launcher.

  • Windows: Start > Programs > ANSYS 2021 R1 > Mechanical APDL Product Launcher
  • Linux: launcher211
  3. Select the correct environment and license.

  4. Go to the High Performance Computing Setup tab, select a GPU device from the GPU Accelerator drop-down menu, and specify the number of GPU accelerator devices.

  5. Alternatively, you can activate the GPU accelerator capability via the -acc command line option (a combined launch example is shown after this list):

ansys211 -acc nvidia -na N

where the -na command line option followed by a number (N) indicates the number of GPU accelerator devices to use per machine or compute node. If only the -acc option is specified, the program uses a single GPU device per machine or compute node by default (that is, -na 1).

If you have more than one HPC license feature, you can use the -ppf command line option to specify which HPC license to use for the parallel run. See HPC Licensing (p. 3) for more information.

  6. If working from the launcher, click Run to launch Mechanical APDL.

  7. Set up and run your analysis as you normally would.
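As an illustrative sketch, a shared-memory run on four CPU cores that also uses one NVIDIA GPU device could be launched with a command along these lines (job and file names are placeholders):

ansys211 -smp -np 4 -acc nvidia -na 1 -j myjob -i input.dat -o output.out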

With the GPU accelerator capability, the acceleration obtained by using the parallelism on the GPU hardware occurs only during the solution operations. Operational randomness and numerical round-off inherent to any parallel algorithm can cause slightly different results between runs on the same machine when using or not using the GPU hardware to accelerate the simulation.

The ACCOPTION command can also be used to control activation of the GPU accelerator capability.

3.2 Supported Analysis Types and Features

Some analysis types and features are not supported by the GPU accelerator capability. Supported functionality also depends on the specified GPU hardware. The following section gives general guidelines on what is and is not supported.

These are not comprehensive lists, but represent major features and capabilities found in the Mechanical APDL program.

3.2.1 NVIDIA GPU Hardware

This section lists analysis capabilities that are supported by the GPU accelerator capability when using NVIDIA GPU cards.

3.2.1.1 Supported Analysis Types

The following analysis types are supported and will use the GPU to accelerate the solution.

  • Static linear or nonlinear analyses using the sparse, PCG, or JCG solver.
  • Buckling analyses using the Block Lanczos or subspace eigensolver.
  • Modal analyses using the Block Lanczos, subspace, PCG Lanczos, QR damped, unsymmetric, or damped eigensolver.
  • Harmonic analyses using the full method and the sparse solver.
  • Transient linear or nonlinear analyses using the full method and the sparse, PCG, or JCG solver.
  • Substructuring analyses, generation pass only, including the generation pass of component mode synthesis (CMS) analyses.

In situations where the analysis type is not supported by the GPU accelerator capability, the solution will continue but GPU acceleration will not be used.

Performance Issues for Some Solver/Hardware Combinations

When using the PCG or JCG solver, or the PCG Lanczos eigensolver, any of the recommended NVIDIA GPU devices can be expected to achieve good performance.

When using the sparse solver or eigensolvers based on the sparse solver (for example, Block Lanczos or subspace), only NVIDIA GPU devices with significant double precision performance (FP64) are recommended in order to achieve good performance. For a list of these devices, see the Windows Installation Guide and the Linux Installation Guide.

Shared-Memory Parallel Behavior

For the sparse solver (and eigensolvers based on the sparse solver), if one or more GPUs are requested, only a single GPU is used no matter how many are requested.

For the PCG and JCG solvers (and eigensolvers based on the PCG solver), all requested GPUs are used.

Distributed-Memory Parallel Behavior

For the sparse solver (and eigensolvers based on the sparse solver), if the number of GPUs exceeds the number of processes (the -na value is greater than the -np value on the command line), the number of GPUs used equals the -np value. If the number of GPUs is less than the number of processes (-na is less than -np), all requested GPUs are used.

For the PCG and JCG solvers (and eigensolvers based on the PCG solver), all requested GPUs are used, regardless of whether the number of GPUs exceeds or is less than the number of processes.

3.2.1.2 Supported Features

As the GPU accelerator capability currently only pertains to the equation solvers, virtually all features and element types are supported when using this capability with the supported equation solvers listed in Supported Analysis Types (p. 11). A few limitations exist and are listed below. In these situations, the solution will continue but GPU acceleration will not be used (unless otherwise noted):

  • Partial pivoting is activated when using the sparse solver. This most commonly occurs when using current technology elements with mixed u-P formulation, Lagrange multiplier based MPC184 elements, Lagrange multiplier based contact elements (TARGE169 through CONTA178), or certain circuit elements (CIRCU94, CIRCU124).
  • The memory saving option is activated (MSAVE,ON) when using the PCG solver. In this particular case, the MSAVE option is turned off and GPU acceleration is used.
  • Unsymmetric matrices when using the PCG solver.
  • A non-supported equation solver is used (for example, ICCG, etc.).

3.3 Troubleshooting

This section describes problems which you may encounter while using the GPU accelerator capability, as well as methods for overcoming these problems. Some of these problems are specific to a particular system, as noted.

NVIDIA GPUs support various compute modes (for example, Exclusive thread, Exclusive process). Only the default compute mode is supported. Using other compute modes may cause the program to fail to launch.
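If you need to check or reset the compute mode on an NVIDIA device, the nvidia-smi utility can usually be used (device index 0 is shown only as an example; changing the mode typically requires administrative privileges):

nvidia-smi -q -d COMPUTE
nvidia-smi -i 0 -c DEFAULT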

To list the GPU devices installed on the machine, set the ANSGPU_PRINTDEVICES environment variable to a value of 1. The printed list may or may not include graphics cards used for display purposes, along with any graphics cards used to accelerate your simulation.
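For example, the variable can be set in the shell or command prompt used to launch the program:

export ANSGPU_PRINTDEVICES=1     (Linux, bash)
set ANSGPU_PRINTDEVICES=1        (Windows command prompt)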

  • NO DEVICES

    Be sure that a recommended GPU device is properly installed and configured. Check the driver level to be sure it is current or newer than the driver version supported for your particular device. (See the GPU requirements outlined in the Windows Installation Guide and the Linux Installation Guide.)

    When using NVIDIA GPU devices, use of the CUDA_VISIBLE_DEVICES environment variable can block some or all of the GPU devices from being visible to the program. Try renaming this environment variable to see if the supported devices can be used.

Important

On Windows, the use of Remote Desktop may disable the use of a GPU device. Launching Mechanical APDL through the ANSYS Remote Solve Manager (RSM) when RSM is installed as a service may also disable the use of a GPU. In these two scenarios, the GPU accelerator capability cannot be used. Using the TCC (Tesla Compute Cluster) driver mode, if applicable, can circumvent this restriction.
  • NO VALID DEVICES

    A GPU device was detected, but it is not a recommended GPU device. Be sure that a recommended GPU device is properly installed and configured. Check the driver level to be sure it is current or newer than the supported driver version for your particular device. (See the GPU requirements outlined in the Windows Installation Guide and the Linux Installation Guide.) Consider using the ANSGPU_OVERRIDE environment variable to override the check for valid GPU devices.

    When using NVIDIA GPU devices, use of the CUDA_VISIBLE_DEVICES environment variable can block some or all of the GPU devices from being visible to the program. Try renaming this environment variable to see if the supported devices can be used.

  • POOR ACCELERATION OR NO ACCELERATION

    Simulation includes non-supported features

    A GPU device will only accelerate certain portions of a simulation, mainly the solution time. If the bulk of the simulation time is spent outside of solution, the GPU cannot have a significant effect on the overall analysis time. Even if the bulk of the simulation is spent inside solution, you must be sure that a supported equation solver is utilized during solution and that no unsupported options are used. Messages are printed in the output to alert users when a GPU is being used, as well as when unsupported options/features are chosen which deactivate the GPU accelerator capability.

    Simulation has too few DOF (degrees of freedom)

    Some analyses (such as transient analyses) may require long compute times, not because the number of DOF is large, but because a large number of calculations are performed (that is, a very large number of time steps). Generally, if the number of DOF is relatively small, GPU acceleration will not significantly decrease the solution time. Consequently, for small models with many time steps, GPU acceleration may be poor because the model size is too small to fully utilize a GPU.

    Simulation does not fully utilize the GPU

    Only simulations that spend a lot of time performing calculations that are supported on a GPU can expect to see significant speedups when a GPU is used. Only certain computations are supported for GPU acceleration. Therefore, users should check to ensure that a high percentage of the solution time was spent performing computations that could possibly be accelerated on a GPU. This can be done by reviewing the equation solver statistics files as described below. See Measuring Performance in the Performance Guide for more details on the equation solver statistics files.

    • PCG solver file: The .PCS file contains statistics for the PCG iterative solver. You should first check to make sure that the GPU was utilized by the solver. This can be done by looking at the line which begins with: “Number of cores used”. The string “GPU acceleration enabled” will be added to this line if the GPU hardware was used by the solver. If this string is missing, the GPU was not used for that call to the solver. Next, you should study the elapsed times for both the “Preconditioner Factoring” and “Multiply With A22” computations. GPU hardware is only used to accelerate these two sets of computations. The wall clock (or elapsed) times for these computations are the areas of interest when determining how much GPU acceleration is achieved.

    • Sparse solver files: The .DSP file contains statistics for the sparse direct solver. You should first check to make sure that the GPU was utilized by the solver. This can be done by looking for the following line: “GPU acceleration activated”. This line will be printed if the GPU hardware was used. If this line is missing, the GPU was not used for that call to the solver. Next, you should check the percentage of factorization computations (flops) which were accelerated on a GPU. This is shown by the line: “percentage of GPU accelerated flops”. Also, you should look at the time to perform the matrix factorization, shown by the line: “time (cpu & wall) for numeric factor”. GPU hardware is only used to accelerate the matrix factor computations. These lines provide some indication of how much GPU acceleration is achieved.

    • Eigensolver files: The Block Lanczos and Subspace eigensolvers support the use of GPU devices; however, no statistics files are written by these eigensolvers. The .PCS file is written for the PCG Lanczos eigensolver and can be used as described above for the PCG iterative solver.

      Using multiple GPU devices

      When using the sparse solver in a shared-memory parallel solution, it is expected that running a simulation with multiple GPU devices will not improve performance compared to running with a single GPU device. In a shared-memory parallel solution, the sparse solver can only make use of one GPU device.

      Oversubscribing GPU hardware

      The program automatically determines which GPU devices to use. In a multiuser environment, this could mean that one or more of the same GPUs are picked when multiple simulations are run simultaneously, thus oversubscribing the hardware.

    • If only a single GPU accelerator device exists in the machine, then only a single user should attempt to make use of it, much in the same way users should avoid oversubscribing their CPU cores.

    • If multiple GPU accelerator devices exist in the machine, you can set the ANSGPU_DEVICE environment variable, in conjunction with the ANSGPU_PRINTDEVICES environment variable mentioned above, to specify which particular GPU accelerator devices to use during the solution.

      For example, consider a scenario where ANSGPU_PRINTDEVICES shows that four GPU devices are available with device ID values of 1, 3, 5, and 7 respectively, and only the second and third devices are supported for GPU acceleration. To select only the second supported GPU device, set ANSGPU_DEVICE = 5. To select the first and second supported GPU devices, set ANSGPU_DEVICE = 3:5.
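      Continuing that scenario, the variable would be set in the environment used to launch the run, for example:

      export ANSGPU_DEVICE=3:5     (Linux, bash; on Windows: set ANSGPU_DEVICE=3:5)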

      Solver/hardware combination

      When using NVIDIA GPU devices, some solvers may not achieve good performance on certain devices. For more information, see Performance Issues for Some Solver/Hardware Combinations (p. 11).

Chapter 4: Using Distributed ANSYS

When running a simulation, the solution time is typically dominated by three main parts: the time spent to create the element matrices and form the global matrices or global systems of equations, the time to solve the linear system of equations, and the time spent calculating derived quantities (such as stress and strain) and other requested results for each element.

The distributed-memory parallelism offered via Distributed ANSYS allows the entire solution phase to run in parallel, including the stiffness matrix generation, linear equation solving, and results calculations. As a result, a simulation using distributed-memory parallel processing usually achieves much faster solution times than a similar run performed using shared-memory parallel processing (p. 5), particularly at higher core counts.

Distributed ANSYS can run a solution over multiple cores on a single machine or on multiple machines (that is, a cluster). It automatically decomposes the model into smaller domains, transfers the domains to each core, solves each domain simultaneously, and creates a complete solution to the model. The memory and disk space required to complete the solution can also be distributed over multiple machines. By utilizing all of the resources of a cluster (computing power, RAM, memory bandwidth, and I/O bandwidth), distributed-memory parallel processing can be used to solve very large problems much more efficiently than the same simulation run on a single machine.

Distributed ANSYS Behavior

Distributed ANSYS works by launching multiple ANSYS processes on either a single machine or on multiple machines (as specified by one of the following command line options: -np, -machines, or -mpifile). The machine that the distributed run is launched from is referred to as the head compute node, and the other machines are referred to as the compute nodes. The first process launched on the head compute node is referred to as the master process; all other processes are referred to as the worker processes.

Each Distributed ANSYS process is essentially a running process of shared-memory ANSYS. These processes are launched through the specified MPI software layer. The MPI software allows each Distributed ANSYS process to communicate, or exchange data, with the other processes involved in the distributed simulation.

Distributed ANSYS does not currently support all of the analysis types, elements, solution options, etc. that are available with shared-memory ANSYS (see Supported Features (p. 30)). In some cases, Distributed ANSYS stops the analysis to avoid performing an unsupported action. If this occurs, you must launch shared-memory ANSYS to perform the simulation. In other cases, Distributed ANSYS will automatically disable the distributed-memory parallel processing capability and perform the operation using shared-memory parallelism. This disabling of the distributed-memory parallel processing can happen at various levels in the program.

The master process handles the inputting of commands as well as all of the pre- and postprocessing actions. Only certain commands (for example, the SOLVE command and supporting commands such as /SOLU, FINISH, /EOF, /EXIT, and so on) are communicated to the worker processes for execution.

Therefore, outside of the SOLUTION processor (/SOLU), Distributed ANSYS behaves very similarly to shared-memory ANSYS. The master process works on the entire model during these pre- and postprocessing steps and may use shared-memory parallelism to improve performance of these operations. During this time, the worker processes wait to receive new commands from the master process.

Once the SOLVE command is issued, it is communicated to the worker processes and all Distributed ANSYS processes become active. At this time, the program makes a decision as to which mode to use when computing the solution. In some cases, the solution will proceed using only a distributed-memory parallel (DMP) mode. In other cases, similar to pre- and postprocessing, the solution will proceed using only a shared-memory parallel (SMP) mode. In a few cases, a mixed mode may be implemented which tries to use as much distributed-memory parallelism as possible for maximum performance. These three modes are described further below.

Pure DMP Mode

The simulation is fully supported by Distributed ANSYS, and distributed-memory parallelism is used throughout the solution. This mode typically provides optimal performance in Distributed ANSYS.

Mixed Mode

The simulation involves a particular set of computations that is not supported by Distributed ANSYS. Examples include certain equation solvers and remeshing due to mesh nonlinear adaptivity. In these cases, distributed-memory parallelism is used throughout the solution, except for the unsupported set of computations. When that step is reached, the worker processes in Distributed ANSYS simply wait while the master process uses shared-memory parallelism to perform the computations. After the computations are finished, the worker processes resume computing until the entire solution is completed.

Pure SMP Mode

The simulation involves an analysis type or feature that is not supported by Distributed ANSYS. In this case, distributed-memory parallelism is disabled at the onset of the solution, and shared-memory parallelism is used instead. The worker processes in Distributed ANSYS are not involved at all in the solution but simply wait while the master process uses shared-memory parallelism to compute the entire solution.

When using shared-memory parallelism inside of Distributed ANSYS (in mixed mode or SMP mode, including all pre- and postprocessing operations), the master process will not use more cores on the head compute node than the total cores you specify to be used for the Distributed ANSYS solution. This is done to avoid exceeding the requested CPU resources or the requested number of licenses.

The following table shows which steps, including specific equation solvers, can be run in parallel using shared-memory ANSYS and Distributed ANSYS.

Table 4.1: Parallel Capability in Shared-Memory and Distributed ANSYS

Solver / Feature                              Shared-Memory ANSYS   Distributed ANSYS
Sparse                                        Y                     Y
PCG                                           Y                     Y
ICCG                                          Y                     Y [1]
JCG                                           Y                     Y [1][2]
QMR [3]                                       Y                     Y [1]
Block Lanczos eigensolver                     Y                     Y
PCG Lanczos eigensolver                       Y                     Y
Supernode eigensolver                         Y                     Y [1]
Subspace eigensolver                          Y                     Y
Unsymmetric eigensolver                       Y                     Y
Damped eigensolver                            Y                     Y
QRDAMP eigensolver                            Y                     Y
Element formulation, results calculation      Y                     Y
Graphics and other pre- and postprocessing    Y                     Y [1]

  1. This solver/operation only runs in mixed mode.

  2. For static analyses and transient analyses using the full method (TRNOPT,FULL), the JCG equation solver runs in pure DMP mode only when the matrix is symmetric. Otherwise, it runs in SMP mode.

  3. The QMR solver only supports 1 core in SMP mode and in mixed mode.

The maximum number of cores allowed in a Distributed ANSYS analysis is currently set at 8192. Therefore, you can run Distributed ANSYS using anywhere from 2 to 8192 cores (assuming the appropriate HPC licenses are available) for each individual job. Performance results vary widely for every model when using any form of parallel processing. For every model, there is a point where using more cores does not significantly reduce the overall solution time. Therefore, it is expected that most models run in Distributed ANSYS cannot efficiently make use of hundreds or thousands of cores.

Files generated by Distributed ANSYS are named Jobnamen.ext, where n is the process number. (See Differences in General Behavior (p. 32) for more information.) The master process is always numbered 0, and the worker processes are 1, 2, etc. When the solution is complete and you issue the FINISH command in the SOLUTION processor, Distributed ANSYS combines all Jobnamen.RST files into a single Jobname.RST file, located on the head compute node. Other files, such as .MODE, .ESAV, .EMAT, etc., may be combined as well upon finishing a distributed solution. (See Differences in Postprocessing (p. 37) for more information.)
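For example, a four-process run with the job name myjob produces myjob0.RST, myjob1.RST, myjob2.RST, and myjob3.RST during solution, which are then combined into myjob.RST on the head compute node (the job name here is only a placeholder).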

The remaining sections explain how to configure your environment to run Distributed ANSYS, how to run a Distributed ANSYS analysis, and what features and analysis types are supported in Distributed ANSYS. You should read these sections carefully and fully understand the process before attempting to run a distributed analysis. The proper configuration of your environment and the installation and configuration of the appropriate MPI software are critical to successfully running a distributed analysis.

4.1 Configuring Distributed ANSYS

To run Distributed ANSYS on a single machine, no additional setup is required.

To run an analysis with Distributed ANSYS on a cluster, some configuration is required as described in the following sections:

4.1.1. Prerequisites for Running Distributed ANSYS
4.1.2. Setting Up the Cluster Environment for Distributed ANSYS

4.1.1 Prerequisites for Running Distributed ANSYS

Whether you are running on a single machine or multiple machines, the following condition is true:

  • By default, Distributed ANSYS uses two cores and does not require any HPC licenses. Additional licenses will be needed to run a distributed solution with more than four cores. Several HPC license options are available. For more information, see HPC Licensing (p. 3) in the Parallel Processing Guide (p. 1).

If you are running on a single machine, there are no additional requirements for running a distributed solution.

If you are running across multiple machines (for example, a cluster), your system must meet these additional requirements to run a distributed solution.

  • Homogeneous network: All machines in the cluster must be the same type, OS level, chip set, and interconnects.
  • You must be able to remotely log in to all machines, and all machines in the cluster must have identical directory structures (including the ANSYS 2021 R1 installation, MPI installation, and working directories). Do not change or rename directories after you've launched ANSYS. For more information, see Directory Structure Across Machines (p. 29) in the Parallel Processing Guide (p. 1).
  • All machines in the cluster must have ANSYS 2021 R1 installed, or must have an NFS mount to the ANSYS 2021 R1 installation. If not installed on a shared file system, ANSYS 2021 R1 must be installed in the same directory path on all systems.
  • All machines must have the same version of MPI software installed and running. The table below shows the MPI software and version level supported for each platform.

4.1.1.1 MPI Software

The MPI software supported by Distributed ANSYS depends on the platform (see the table below).

The files needed to run Distributed ANSYS using Intel MPI, MS MPI, or Open MPI are included on the installation media and are installed automatically when you install ANSYS 2021 R1. Therefore, when running on a single machine (for example, a laptop, a workstation, or a single compute node of a cluster) on Windows or Linux, or when running on a Linux cluster, no additional software is needed. However, when running on multiple Windows machines you must use a cluster setup, and you must install the MPI software separately (see Installing the Software (p. 21) later in this section).

Table 4.2: Platforms and MPI Software

Platform                          MPI Software
Linux                             Intel MPI 2018.3.222
                                  Open MPI 3.1.5 [a]
Windows 10 (Single Machine)       Intel MPI 2018.3.210
                                  MS MPI v10.1.12
Windows Server 2016 (Cluster)     Microsoft HPC Pack (MS MPI v10.1.12) [b]

[a] Mellanox OFED driver version 4.4 or higher is required.

[b] If you are running Distributed ANSYS across multiple Windows machines, you must use Microsoft HPC Pack (MS MPI) and the HPC Job Manager to start Distributed ANSYS (see Activating Distributed ANSYS (p. 25)).

4.1.1.2 Installing the Software

Install ANSYS 2021 R1 following the instructions in the ANSYS, Inc. Installation Guide for your platform. Be sure to complete the installation, including all required post-installation procedures.

To run Distributed ANSYS on a cluster, you must:

  • Install ANSYS 2021 R1 on all machines in the cluster, in the exact same location on each machine.
  • For Windows, you can use shared drives and symbolic links. Install ANSYS 2021 R1 on one Windows machine (for example, C:\Program Files\ANSYS Inc\V211) and then share that installation folder. On the other machines in the cluster, create a symbolic link (at C:\Program Files\ANSYS Inc\V211) that points to the UNC path for the shared folder. On Windows systems, you must use the Universal Naming Convention (UNC) for all file and path names for Distributed ANSYS to work correctly.
  • For Linux, you can use exported NFS file systems. Install ANSYS 2021 R1 on one Linux machine (for example, at /ansys_inc/v211), and then export this directory. On the other machines in the cluster, create an NFS mount from the first machine to the same local directory (/ansys_inc/v211). (Example commands for both the Windows and Linux approaches are sketched after this list.)
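As a rough sketch of both approaches (the machine name, share name, and paths are placeholders and may differ on your systems):

On a Windows compute node, create the directory symbolic link from an elevated command prompt:

mklink /D "C:\Program Files\ANSYS Inc\V211" "\\head_node_machine_name\ANSYS Inc\V211"

On a Linux compute node, assuming the head node exports /ansys_inc/v211, mount it at the same local path:

mount -t nfs head_node_machine_name:/ansys_inc/v211 /ansys_inc/v211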

Installing MPI Software on Windows

You can install Intel MPI from the installation launcher by choosing Install MPI for ANSYS, Inc. Parallel Processing. For installation instructions see:

Intel-MPI 2018.3.210 Installation Instructions in the ANSYS, Inc. Installation Guides

Microsoft MPI is installed and ready for use as part of the ANSYS 2021 R1 installation, but if you require MS MPI on another machine, the installer can be found at C:\Program Files\ANSYS Inc\V211\commonfiles\MPI\Microsoft\10.1.12498.18\Windows\MSMpiSetup.exe

Microsoft HPC Pack (Windows HPC Server 2016)

You must complete certain post-installation steps before running Distributed ANSYS on a Microsoft HPC Server 2016 system. The post-installation instructions provided below assume that Microsoft HPC Server 2016 and Microsoft HPC Pack (which includes MS MPI) are already installed on your system. The post-installation instructions can be found in the following README files:

Program Files\ANSYS Inc\V211\commonfiles\MPI\WindowsHPC\README.mht or Program Files\ANSYS Inc\V211\commonfiles\MPI\WindowsHPC\README.docx

Microsoft HPC Pack examples are also located in Program Files\ANSYS Inc\V211\commonfiles\MPI\WindowsHPC. Jobs are submitted to the Microsoft HPC Job Manager either from the command line or the Job Manager GUI.

To submit a job via the GUI, go to Start > All Programs > Microsoft HPC Pack > HPC Job Manager. Then click on Create New Job from Description File.

4.1.2. Setting up the Cluster Environment for Distributed ANSYS

After you've ensured that your cluster meets the prerequisites and you have ANSYS 2021 R1 and the correct version of MPI installed, you need to configure your distributed environment using the following procedure.

  1. Obtain the machine name for each machine on the cluster.
  • Windows 10 and Windows Server 2016:

    From the Start menu, pick Settings >System >About. The full computer name is listed under PC Name. Note the name of each machine (not including the domain).

  • Linux: Type hostname on each machine in the cluster. Note the name of each machine.

  2. Linux only: First determine whether the cluster uses the secure shell (ssh) or remote shell (rsh) protocol.
  • For ssh: Use the ssh-keygen command to generate a pair of authentication keys. Do not enter a passphrase. Then append the new public key to the list of authorized keys on each compute node in the cluster that you wish to use.

  • For rsh: Create a .rhosts file in the home directory. Add the name of each compute node you wish to use on a separate line in the .rhosts file. Change the permissions of the .rhosts file by issuing: chmod 600 .rhosts. Copy this .rhosts file to the home directory on each compute node in the cluster you wish to use.

Verify communication between compute nodes on the cluster via ssh or rsh. You should not be prompted for a password. If you are, correct this before continuing. For more information on using ssh/rsh without passwords, search online for "Passwordless SSH" or "Passwordless RSH", or see the man pages for ssh or rsh.
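For the ssh case, a minimal passwordless-login setup typically looks like the following, run as the user who will launch Distributed ANSYS (host names are placeholders):

ssh-keygen -t rsa
ssh-copy-id compute_node_1
ssh compute_node_1 hostname

The first command generates the key pair (press Enter at the passphrase prompts), the second appends the public key to ~/.ssh/authorized_keys on the compute node (repeat for each node), and the third verifies that no password prompt appears. If ssh-copy-id is not available, append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on each compute node manually.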

  3. Windows only: Verify that all required environment variables are properly set. If you followed the post-installation instructions described above for Microsoft HPC Pack (Windows HPC Server), these variables should be set automatically.

On the head compute node, where ANSYS 2021 R1 is installed, check these variables:

ANSYS211_DIR=C:\Program Files\ANSYS Inc\v211\ansys
ANSYSLIC_DIR=C:\Program Files\ANSYS Inc\Shared Files\Licensing

where C:\Program Files\ANSYS Inc is the location of the product install and C:\Program Files\ANSYS Inc\Shared Files\Licensing is the location of the licensing install. If your installation locations are different than these, specify those paths instead.

On Windows systems, you must use the Universal Naming Convention (UNC) for all ANSYS, Inc. environment variables on the compute nodes for Distributed ANSYS to work correctly.

On the compute nodes, check these variables:

ANSYS211_DIR=\\head_node_machine_name\ANSYS Inc\v211\ansys
ANSYSLIC_DIR=\\head_node_machine_name\ANSYS Inc\Shared Files\Licensing
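If a variable is missing on a compute node, it can be set system-wide from an elevated command prompt, for example (the head node machine name is a placeholder):

setx ANSYS211_DIR "\\head_node_machine_name\ANSYS Inc\v211\ansys" /M
setx ANSYSLIC_DIR "\\head_node_machine_name\ANSYS Inc\Shared Files\Licensing" /M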

  4. Windows only: Share out the ANSYS Inc directory on the head node with full permissions so that the compute nodes can access it.