Compiling LAMMPS for GPU
Makefile preparation
The makefiles used to compile the GPU library and the LAMMPS executable need to be tailored to the GPU hardware and your simulation requirements. This section will go over how to prepare them for compilation.
Setting up the GPU library makefiles
The GPU library source files are located in lib/gpu. Within this directory, there are several template makefiles you can start from and modify for your specific use-case. Copy the "Makefile.linux" template to a new file, also in lib/gpu. In this example, we will name it Makefile.FSLP100.cuda10-0 (indicating that it was built for simulations on the FSL P100 GPUs with cuda10.0), but you can pick any name here.
There is a second makefile used in the compilation of the GPU library. The template file we will use is "Makefile.lammps.standard". Make a copy of this file (in lib/gpu as well) and name it as you wish. For this example, we will name it "Makefile.lammps.FSLcuda10-0"; the copy commands are sketched below.
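For reference, the copies described above can be made like this, starting from the top-level LAMMPS source directory (the new file names are just the example names chosen here):
cd lib/gpu
cp Makefile.linux Makefile.FSLP100.cuda10-0
cp Makefile.lammps.standard Makefile.lammps.FSLcuda10-0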
First we will go over the settings within the main makefile, then we will briefly discuss what to set up in the extra makefile.
- Makefile.* (Makefile.FSLP100.cuda10-0 in the example)
- EXTRAMAKE
- This parameter should be set to the name of the extramake file we created above. For this example, it is "Makefile.lammps.FSLcuda10-0"
EXTRAMAKE = Makefile.lammps.FSLcuda10-0
- CUDA_HOME
- This setting defaults to /usr/local/cuda, which works on most systems, but since FSL uses modules it needs to be changed for each version of cuda. This example will use cuda10.0. To determine where the cuda home directory is located, first load the cuda module of interest (to load cuda 10.0, we would use "module load cuda/10.0"). Next, run "whereis nvcc", and the output will tell us where the cuda home directory is (a short sketch of these steps follows below). For cuda/10.0 on FSL, this is located at /apps/cuda/10.0.130
CUDA_HOME = /apps/cuda/10.0.130
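A quick sketch of the module/whereis steps described above for locating the cuda home directory (the exact output and path will differ by system and cuda version):
module load cuda/10.0
whereis nvcc
# example output: nvcc: /apps/cuda/10.0.130/bin/nvcc
# CUDA_HOME is the directory containing bin/, i.e. /apps/cuda/10.0.130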
- CUDA_ARCH
- This setting is used to optimize the compiled code for the hardware on which it will run. Setting this to the wrong architecture will cause a loss of performance.
- Hardware / Architecture Matching:
  - Fermi (CUDA 3.2 until CUDA 8) (deprecated from CUDA 9)
    - sm20 or sm_20, compute_20 – Older cards such as GeForce 400, 500, 600, GT-630
  - Kepler (CUDA 5 and later)
    - sm30 or sm_30, compute_30 – Kepler architecture (generic – Tesla K40/K80, GeForce 700, GT-730). Adds support for unified memory programming
    - sm35 or sm_35, compute_35 – More specific Tesla K40. Adds support for dynamic parallelism; little benefit over sm30
    - sm37 or sm_37, compute_37 – More specific Tesla K80. Adds a few more registers; little benefit over sm30
  - Maxwell (CUDA 6 and later)
    - sm50 or sm_50, compute_50 – Tesla/Quadro M series
    - sm52 or sm_52, compute_52 – Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X
    - sm53 or sm_53, compute_53 – Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano
  - Pascal (CUDA 8 and later)
    - sm60 or sm_60, compute_60 – Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
    - sm61 or sm_61, compute_61 – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030, Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
    - sm62 or sm_62, compute_62 – Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
  - Volta (CUDA 9 and later)
    - sm70 or sm_70, compute_70 – DGX-1 with Volta, Tesla V100, GTX 1180 (GV104), Titan V, Quadro GV100
    - sm72 or sm_72, compute_72 – Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX
  - Turing (CUDA 10 and later)
    - sm75 or sm_75, compute_75 – GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4
  - Ampere (CUDA 11 and later)
    - sm80 or sm_80, compute_80 – RTX Ampere – RTX 3080
- The m8g cluster on FSL uses K80s (Kepler, sm_37), while the m9g cluster uses P100s (Pascal, sm_60). For this example, we will set this up for the newer P100 GPUs. This should be changed if running on different hardware.
CUDA_ARCH = -arch=sm_60
- LMP_INC
- This setting changes the size of certain integer data types. Most simulations should just use the default, which is LMP_INC = -DLAMMPS_SMALLBIG. If you need more than 2 billion atom IDs, or want more info on this option, see https://lammps.sandia.gov/doc/Build_settings.html
LMP_INC = -DLAMMPS_SMALLBIG
- Makefile.lammps.* (Makefile.lammps.FSLcuda10-0 in the example)
- CUDA_HOME
- This should be set up identically to how it is set up in the main makefile above. For cuda10.0 on FSL, the line should read:
CUDA_HOME = /apps/cuda/10.0.130
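Putting the above together, the lines you end up editing for the FSL P100 / cuda10.0 example should look roughly like this (a sketch showing only the changed settings; leave the rest of each template as-is):
# In lib/gpu/Makefile.FSLP100.cuda10-0
EXTRAMAKE = Makefile.lammps.FSLcuda10-0
CUDA_HOME = /apps/cuda/10.0.130
CUDA_ARCH = -arch=sm_60
LMP_INC = -DLAMMPS_SMALLBIG

# In lib/gpu/Makefile.lammps.FSLcuda10-0
CUDA_HOME = /apps/cuda/10.0.130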
Setting up the LAMMPS makefile
Now that the GPU library makefiles have been set up correctly, let's set up the makefile used to compile LAMMPS itself. This is located in the src/MAKE directory. Within this directory, you will find the Makefile.mpi and Makefile.serial files. We want to use the "Makefile.gpu" template, which is found in the src/MAKE/OPTIONS directory. Copy this file to the src/MAKE directory so we can use it to compile LAMMPS with GPU support.
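A sketch of that copy, run from the top-level LAMMPS source directory:
cp src/MAKE/OPTIONS/Makefile.gpu src/MAKE/Makefile.gpu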
This file has most of the same options as the Makefile.serial and Makefile.mpi files. If you need to adjust things like memory alignment, the MPI library, the FFT library, or JPEG/PNG support, this is the file to edit. More info can be found at https://lammps.sandia.gov/doc/Build_make.html
Compiling LAMMPS with GPU support
FSL ONLY: When compiling on FSL, we need to make sure the modules we require are loaded. Load the following modules:
module load gcc/7
module load openmpi/3.1
module load cuda/YOURVERSIONHERE
(for this example, we will load cuda/10.0)
- As of the Mar 3, 2020 release of LAMMPS, gcc8 is not supported, so be sure to load the gcc/7 module
- NOTE: some of these module versions may be phased out in the future or LAMMPS requirements may change and this guide may not remain updated
Outside of FSL, you will likely not need to load these modules, unless the system you are using has modules implemented.
- Move into the lib/gpu directory
cd lib/gpu
- Compile the GPU library LAMMPS requires
make -j 8 -f Makefile.FSLP100.cuda10-0
- This may take some time. Wait for the compilation to finish before proceeding.
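Assuming the build finished without errors, the compiled library (libgpu.a) and the generated Makefile.lammps (copied from the EXTRAMAKE file you specified) should now be in lib/gpu; a quick sanity check:
ls libgpu.a Makefile.lammps
# both files should be listed; if libgpu.a is missing, scroll back through the compile output for errors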
- Move into the src directory
cd ../../src
- Enable the packages required for our simulations (don't just blindly copy these--these are what I use, and your requirements are almost certainly different. Check to make sure the functionality you require is included; see the "LAMMPS Packages" section here: http://knottsgroup.groups.et.byu.net/labbook/index.php?n=Main.CompilingLAMMPS)
make yes-gpu          (for GPU support)
make yes-misc         (I use this for "fix viscosity")
make yes-user-misc    (I use this for "dihedral fourier")
make yes-molecule     (I use this for "atom_style full")
make yes-kspace       (I use this for "pair_style lj/cut/coul/long")
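If you want to double-check which packages are enabled before building, recent LAMMPS versions provide a package-status target in the src Makefile that lists the install state of every package:
make package-status
# lists each package and whether it is installed; confirm the ones enabled above show as installed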
- Begin compiling LAMMPS with GPU support
make -j 8 gpu
- This may take some time. Wait for the compilation to finish before proceeding.
Upon completion, an lmp_gpu executable should be generated in the src directory. Copy this somewhere safe and name it however you would like.
For this example, I would name it "lmp_gpu_03Mar2020_FSLP100_cuda10.0.exe" to indicate that it was built with gpu support optimized for the FSL P100s with cuda10.0 using the 03Mar2020 release of LAMMPS. I use rather verbose names because I keep multiple compiled executables on hand, but you can name it whatever you would like.
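For example, the copy might look like the following, run from the src directory (the destination folder is just an illustration; use whatever location you keep your executables in):
cp lmp_gpu ~/lammps_executables/lmp_gpu_03Mar2020_FSLP100_cuda10.0.exe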
Running simulations using LAMMPS compiled with GPU support
Now that you have a LAMMPS executable built with GPU support, we can run GPU accelerated simulations. Note that although we have built the LAMMPS executable with support for GPU acceleration, if we don't explicitly tell LAMMPS to run with GPU support, it will not use GPU acceleration.
Simulations run on FSL are commonly submitted using a submission script (a '.sbatch' file). There are some key changes that need to be made to an sbatch file from mpi/serial submission scripts in order to utilize GPU acceleration.
- The SBATCH header
- The SBATCH header is where we put instructions for the SLURM scheduler. Getting these settings right is important--otherwise your jobs may not be placed on the correct nodes or may be given too many or too few resources. A full example header combining the directives below is sketched at the end of this list.
- #SBATCH --time=10:00:00
- This setting tells the slurm scheduler how long your job will need to run. After this much time has passed, your job will be cancelled, whether it finished or not, so it is important not to undershoot this. The maximum is 72:00:00 (72 hours). Setting this too long can cause your jobs to take longer to start, as the scheduler will need to allocate the full time to your job (if there is a 2 day gap available but you ask for a 72 hr walltime, you will not be queued in that available slot in the schedule and will have to wait longer to start).
- #SBATCH --gres=gpu:N
- This setting requests N gpu gres (general resources). The pascal nodes each have 4 gpus, so you could set this to 4, but it will make it more difficult for your jobs to find time to run and will use more resources on FSL (although it may speed up your simulations).
- #SBATCH -C pascal
- This tells the slurm server that you need these jobs to run with the "pascal" requirement. You can change this to kepler if you are running on the old K80 GPUs (a great idea if the pascal nodes are all in use)
- #SBATCH --ntasks=N
- This setting is used to request N tasks (MPI processes). I always just had this mirror the number of GPUs I was requesting in the gres=gpu:N setting.
- #SBATCH --nodes=M
- This setting requests M nodes. Since we are using "-C pascal", these will be pascal nodes. Each pascal node has 4 GPUs attached, so unless you need more than 4 GPUs, you will use "#SBATCH --nodes=1".
- #SBATCH --mem=??
- This tells the slurm scheduler how much memory to reserve for your job. Requesting too little can cause your jobs to run out of memory and crash, requesting too much will cause your jobs to take longer to start or prevent your jobs from ever starting. My jobs use 2GB of memory, which I request via "#SBATCH --mem=2G". Similarly, requesting 200MB could be done via "#SBATCH --mem=200M". This setting is highly dependent on your specific simulation. You can attempt to find the sweet spot by running a simulation with a moderate amount of memory (maybe 4GB to start?) then reviewing the memory utilization in your job statistics on https://rc.byu.edu/account/stats/job/. It will report an OUT_OF_MEMORY job state if there was too little requested, and COMPLETE if the job ran successfully. On successful jobs you can then look at the "Memory Utilization" to see what portion of your requested memory you actually used. So if this reported "0.40" on a 4GB job, you could request 2GB next time and only ~80% should be used. Try to keep the "Memory Utilization" somewhere in the 0.50~0.80 range if possible.
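Pulling these directives together, a header for a single-GPU job on the pascal nodes might look like the sketch below (the time and memory values are the examples used above; adjust them, and the GPU/task counts, for your own jobs):
#SBATCH --time=10:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH -C pascal
#SBATCH --mem=2G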
- Loading required modules within the ".sbatch" script
- The sbatch script is run on the node you reserved when your job starts. On FSL, we will want to make sure we load the necessary modules before starting LAMMPS. Following the compilation steps above, we load the same gcc, openmpi, and cuda versions we used when we compiled LAMMPS for GPU support.
module load gcc/7
module load openmpi/3.1
module load cuda/10.0
- The "run" line
- This is the line in our ".sbatch" file where we actually run the simulation with LAMMPS. It should look something like this:
mpirun -np [numtasks] [path-to-lammps-executable] -sf gpu -pk gpu [numGPU] -in [path-to-LAMMPS-input-file]
- e.g.:
mpirun -np 1 ./lmp_gpu_pascal_cuda10.0.exe -sf gpu -pk gpu 1 -in ./start_EMD-TD.in
- -np [numtasks]
- [numtasks] should match the --ntasks=N used in the SBATCH header.
- [path-to-lammps-executable]
- This should be the executable you compiled for GPU. It should also be a unique executable for each simulation (copy it out to each simulation folder). I have had issues using a single common executable for multiple simulations.
- -sf gpu
- This attempts to add the /gpu suffix to the various styles (pair styles, kspace styles, etc.) used in your simulation. Note that only styles with a /gpu variant will be GPU accelerated. If none of the styles you use support the /gpu suffix, then there is no reason for you to run on GPU. Check whether your styles are supported on the LAMMPS documentation website.
- An example is the page for the mie/cut pair_style (https://lammps.sandia.gov/doc/pair_mie.html). You can see both the pair_style mie/cut and pair_style mie/cut/gpu styles listed at the top of the page. This indicates that this style DOES support the gpu suffix and will be accelerated when run on GPUs. Note that angle_style harmonic DOES NOT support GPUs (I use both of these styles in my simulations--this means that my pair interactions are greatly accelerated, but the angle terms will be calculated at normal single-threaded speeds. Since pair styles take up a lot of the computation time, this still provides a BIG speed increase for me).
- -pk gpu [numGPU]
- This tells LAMMPS you are using the gpu package and that you are using [numGPU] GPUs. This should match the value set in "#SBATCH --gres=gpu:N" above.
- -in [path-to-LAMMPS-input-file]
- This is just the path to your input file.
- Note that these are the settings I use for my jobs. Addison uses -partition, -l, and -sc settings as well (http://knottsgroup.groups.et.byu.net/labbook/index.php?n=Main.LAMMPSInputFiles), which can be added here.
- Just be sure to maintain their placement AFTER [path-to-lammps-executable]. Settings placed before [path-to-lammps-executable] are openmpi settings (-np N), and those placed after [path-to-lammps-executable] are LAMMPS settings (-sf gpu, -pk gpu N, etc.)
- An example ".sbatch" script
#!/bin/bash -l
#SBATCH --time=10:00:00
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem=2G
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=my.email@gmail.com

module load gcc/7
module load openmpi/3.1
module load cuda/10.0

mpirun -np 1 ./lmp_gpu_pascal_cuda10.0.exe -sf gpu -pk gpu 1 -in ./start_EMD-TD.in
- Running the sbatch script
- The sbatch script can be run from any FSL node with "sbatch [path-to-sbatch-file]". The paths you have to the lammps executable and lammps input file are relative to the path in which you start the job. The slurm-jobid.out files will be placed in the directory from which this command was submitted.
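For example, if the script above were saved as "run_gpu.sbatch" (a name chosen here just for illustration) in your simulation directory, submission would look like this:
cd /path/to/simulation_directory
sbatch run_gpu.sbatch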
For more information on how to run the executable you have compiled, check https://lammps.sandia.gov/doc/Speed_gpu.html