This post was contributed by Neil Ashton (AWS), Principal CFD Specialist SA; Karthik Raman (AWS), Senior HPC Specialist SA; and Oliver Perks (Arm), Principal HPC Engineer.
Over the past 30 years, Computational Fluid Dynamics (CFD) has become a key part of many engineering design processes. From aircraft design to modeling the blood flow inside our bodies, the ability to understand the behavior of fluids has enabled countless innovations and shortened the time to market for many products. Typically, the modeling of the underlying fluid flow equations remains the major limit on accuracy; however, the scale and availability of HPC resources is arguably the main bottleneck for the typical CFD user. The need for more HPC resources has accelerated in recent years due to the move to higher-fidelity approaches, in addition to the growing use of optimization techniques and machine learning driven workflows. These workflows require thousands or even millions of small jobs to explore a design space or train an improved turbulence model. AWS enables engineers to overcome this HPC bottleneck through the quick deployment of a supercomputer that can scale to practically any size, with the right hardware to match your needs, for example, CPUs or GPUs.
CFD users have a broad choice of instance types on AWS. Most users select the compute-optimized Amazon EC2 C5 family of instances, which offer better price/performance than the Amazon EC2 M5 family for the majority of CFD cases, because these cases are limited by memory bandwidth rather than by total memory capacity. The breadth of available instances also lets users easily incorporate Amazon EC2 M5 or Amazon EC2 R5 instances when more memory is needed, for example, for pre- or post-processing.
A CFD user typically balances two competing needs: turn-around time and cost. Depending on the underlying numerical methods, simulation speed does not increase linearly with core count. Beyond some core count, the overhead of network communication causes scaling to fall away from linear, so the cost of adding more cores is no longer matched by a proportional decrease in runtime. The user therefore faces a choice between the fastest possible turn-around and the lowest cost per simulation.
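This trade-off can be sketched with a simple strong-scaling model. The numbers below are purely illustrative (a hypothetical serial fraction and per-core-hour price, assuming Amdahl-style scaling), not measurements on any instance type:

```python
# Illustrative strong-scaling model of the runtime/cost trade-off.
# All numbers are hypothetical, not measured on any instance type.

def runtime_hours(cores, serial_fraction=0.02, single_core_hours=100.0):
    """Amdahl's law: wall-clock hours for the job on `cores` cores."""
    return single_core_hours * (serial_fraction + (1.0 - serial_fraction) / cores)

def cost_usd(cores, price_per_core_hour=0.05):
    """Total cost = runtime x cores x price; grows once overhead dominates."""
    return runtime_hours(cores) * cores * price_per_core_hour

for cores in (36, 72, 144, 288, 576):
    print(f"{cores:4d} cores: {runtime_hours(cores):6.2f} h  ${cost_usd(cores):7.2f}")
```

Each doubling of the core count shaves less off the runtime while the cost per simulation keeps climbing, which is exactly the choice described above.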
The recently announced Amazon EC2 C6g instances powered by AWS-designed, Arm-based Graviton2 processors bring a new instance option. The logical question for any AWS customer is whether they are suitable for their CFD workload. In this blog, we provide benchmarking results for the open-source CFD code OpenFOAM, which can be easily compiled for and run on the Amazon EC2 C6g instance. We demonstrate up to 37% better price/performance, both for single-node cases and for multi-node cases of more than 200 million cells running on thousands of cores.
AWS Graviton2-based Instances
AWS Graviton2 processors are custom built by AWS around 64-bit Arm Neoverse N1 cores and custom silicon designed by AWS, using an advanced 7-nanometer manufacturing process, to deliver great price/performance for cloud workloads running in Amazon EC2. Graviton2-powered EC2 instances were announced at re:Invent 2019, and the new compute-optimized Amazon EC2 C6g instances are available as part of the sixth-generation EC2 offering.
Each Graviton2 core features a 64 KB L1 cache and a 1 MB L2 cache, and includes dual SIMD units that double the floating-point performance over first-generation Graviton processors, targeting high performance computing workloads. The cores also support the int8 and fp16 number formats to accelerate machine learning inference workloads. Every vCPU is a physical core (that is, no simultaneous multithreading, or SMT). The instances are single socket, so there are no NUMA concerns: every core sees the same path to memory and to the other cores. Eight DDR4 memory channels running at 3200 MT/s deliver over 200 GB/s of peak memory bandwidth.
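That peak figure follows directly from the memory configuration: each DDR4 channel moves 8 bytes per transfer, so the arithmetic is simply channels x transfer rate x transfer width:

```python
# Peak DRAM bandwidth implied by the memory configuration above:
# 8 channels x 3200 MT/s x 8 bytes per 64-bit transfer.
channels = 8
mega_transfers_per_sec = 3200
bytes_per_transfer = 8
peak_gb_per_s = channels * mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9
print(peak_gb_per_s)  # 204.8
```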
The compute-optimized Amazon EC2 C6g instances are available in nine sizes, with 1, 2, 4, 8, 16, 32, 48, or 64 vCPUs, or as bare metal instances. They support configurations with up to 128 GB of DDR4 memory, or 2 GB per vCPU, along with up to 25 Gbps of network bandwidth and 19 Gbps of Amazon EBS bandwidth. These instances are powered by the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor.
In order to demonstrate the performance of the Amazon EC2 C6g instance, we have taken two test cases that reflect the range of cases CFD users run. The first is a 4 million cell motorbike case that is part of the standard OpenFOAM tutorial suite. These cases may not represent the complexity of some production CFD workflows, but they were chosen so that readers can easily replicate the results. This case uses the OpenFOAM simpleFoam solver, a steady-state incompressible segregated SIMPLE approach combined with a multigrid method for the linear solver. The k-ω SST turbulence model is used with second-order upwinding for both the momentum and turbulence equations. The Scotch domain decomposition approach is used to ensure good parallel efficiency. However, for this first case the focus is on single-node performance, given the low cell count.
Figures 1 and 2 show the time taken to run 5000 iterations, as well as the cost of that simulation based on On-Demand pricing. The C6g.16xlarge offers a clear cost-per-simulation benefit over both the C5n.18xlarge (37%) and C5.24xlarge (29%) instances. There is, however, an increase in total run time compared to the C5.24xlarge (33%) and C5n.18xlarge (13%), which may mean that those instances are still preferred by users prioritizing turn-around time.
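Cost per simulation is simply wall-clock time multiplied by the instance's hourly On-Demand rate, which is why a slower but cheaper instance can still win on cost. A minimal sketch (the hourly rates and run times below are illustrative placeholders, not current AWS pricing or measured results; check On-Demand pricing for your Region):

```python
# Cost per simulation = wall-clock hours x On-Demand hourly rate.
# Hourly rates here are illustrative placeholders, not current AWS pricing.
def cost_per_simulation(wall_clock_hours, hourly_rate_usd):
    return wall_clock_hours * hourly_rate_usd

# An instance that runs 33% longer but has a much lower hourly rate
# can still come out roughly 29% cheaper per simulation.
baseline = cost_per_simulation(wall_clock_hours=1.00, hourly_rate_usd=4.08)
cheaper = cost_per_simulation(wall_clock_hours=1.33, hourly_rate_usd=2.176)
print(f"baseline: ${baseline:.2f}  cheaper: ${cheaper:.2f}")
```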
Figure 1 – Single-node OpenFOAM cost per simulation
Figure 2 – Single-node OpenFOAM run-time performance
Single-Node Performance Profile
Profiling the simpleFoam solver (used in the motorbike example) showed that it is limited by the sustained memory bandwidth of the instance. We therefore measured the peak sustainable memory bandwidth (Figure 3) on the three instance types (C5n.18xlarge, C6g.16xlarge, C5.24xlarge) using STREAM Triad, and then calculated the corresponding bandwidth efficiencies, shown in Figure 4.
Figure 3 – Peak memory bandwidth using STREAM
Figure 4 – Memory bandwidth efficiency using STREAM Triad
The C5n.18xlarge and C5.24xlarge instances have a higher peak memory bandwidth than the C6g.16xlarge instance because they are dual socket and therefore have more memory channels. However, the Graviton2-based processor achieves a higher percentage of its peak (~86% bandwidth efficiency) than the x86 processors (~79% on C5.24xlarge). This, in addition to the lower instance cost (47% vs. C5.24xlarge, 44% vs. C5n.18xlarge), is why the C6g.16xlarge delivers the improved cost per simulation shown in Figure 1.
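The efficiency numbers above are just measured Triad bandwidth divided by theoretical peak. A sketch of that arithmetic, using the C6g memory configuration stated earlier and a round measured value chosen to be consistent with the ~86% figure (not an exact reading from Figure 3):

```python
# Theoretical peak = sockets x channels x MT/s x 8 bytes per transfer;
# efficiency = measured STREAM Triad bandwidth / theoretical peak.
def peak_gb_per_s(channels, mts, sockets=1):
    return sockets * channels * mts * 1e6 * 8 / 1e9

def efficiency(measured_gb_per_s, peak):
    return measured_gb_per_s / peak

# Single-socket Graviton2: 8 channels of DDR4-3200 -> 204.8 GB/s peak.
c6g_peak = peak_gb_per_s(channels=8, mts=3200)
# ~176 GB/s is an illustrative Triad result consistent with ~86% efficiency.
print(f"C6g.16xlarge at ~176 GB/s measured: {efficiency(176, c6g_peak):.0%}")
```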
To assess whether the same performance trends continue for much larger multi-node simulations, a further mesh of 222 million cells was studied on the same motorbike geometry with the same underlying numerical methods. Figure 5 shows the number of iterations possible per minute at different core counts for the C5.24xlarge, C5n.18xlarge, and C6g.16xlarge instances. The scaling of the C5n.18xlarge instances is much better than that of the C5.24xlarge and C6g.16xlarge instances due to the use of the Elastic Fabric Adapter (EFA), which enables optimum scaling through low-latency, high-bandwidth communication. However, even with EFA improving the C5n.18xlarge scaling, the cost per simulation (Figure 6) shows up to 37% better price/performance for the C6g.16xlarge instances over the C5n.18xlarge and C5.24xlarge instances.
Figure 5 – Multi-node OpenFOAM Scaling Performance
Figure 6 – Multi-node OpenFOAM cost per simulation
This blog has demonstrated the ease with which both single-node and multi-node CFD simulations can be ported to the new Amazon EC2 C6g instances powered by Arm-based AWS Graviton2 processors. With no code modifications, we have been able to port an x86 workload to the new C6g instances and achieve competitive single-node and multi-node performance, with up to 37% better price/performance over other C family instances. We encourage you to test your applications for yourself and reach out to us if you have any questions!
OpenFOAM compilation on AWS Graviton2 Instances
OpenFOAM v1912 builds on modern Arm-based systems, with in-place support for both the GCC compiler and the Arm Compiler for Linux (ACfL). No source code modifications are required to obtain a working OpenFOAM binary. The process for building OpenFOAM on Arm-based systems is the same as on x86 systems: specify the desired compiler in the OpenFOAM bashrc file.
The only external dependency is a working MPI library. For this experiment, we used Open MPI version 4.0.3, built with UCX 1.8 and GCC 9.2. Note that we did not apply any additional hardware-specific performance optimizations. Although deploying OpenFOAM on the Amazon EC2 C6g instances was trivial, documentation is available on the Arm community GitLab pages covering the installation of different OpenFOAM versions with different compilers.
For the C5n.18xlarge and C5.24xlarge simulations, OpenFOAM v1912 was compiled using GCC 8.2 and IntelMPI 2019.7. For all simulations Amazon Linux 2 was the operating system. The HPC environment for all tests was created using AWS ParallelCluster, which will soon include official support for AWS’ latest Graviton2 instances, including C6g. You can stay up to date on the latest information and releases of AWS ParallelCluster on GitHub.
| Model | vCPU | Memory (GiB) | Instance Storage (GiB) | Network Bandwidth (Gbps) | EBS Bandwidth (Mbps) |
|---|---|---|---|---|---|