Submitting Jobs
Two cluster resources are available to members of the group. The first is a group cluster named "watt" available to the Knotts Group. The second is the Fulton Supercomputing Lab (FSL) supercomputer, available to you if you have an account. It is much faster than Watt but has walltime constraints.
Submitting jobs to either cluster requires you to first
- Compile your code to create an executable.
- Create the input files/folders needed to run the program.
- Create the submission script.
- Submit the job to the cluster.
Compiling the code and input file are discussed on other pages. Creating the submission script and submitting the job to the cluster are described below.
For Submitting a Job on the Fulton Supercomputer
The FSL uses a job scheduler called SLURM. Information on the basic SLURM commands can be found at https://marylou.byu.edu/documentation/slurm/commands. More information is available at http://slurm.schedmd.com/.
The job submission script is a shell script. It can have any desired name, but it commonly has the .sbatch file extension. The FSL has a Script Generator that can also be used to generate a SLURM submit file. Note that a serial submission script can be run in the terminal like a normal .sh bash script, which is useful for troubleshooting and testing.
To submit, enter the command:
$ sbatch submit.sbatch
You can see the status of your jobs using:
$ squeue -u username-on-fsl
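Because the serial submission script is plain bash (as noted above), it can be dry-run locally before submitting. A minimal sketch; the script body below is a placeholder echo, not a real executable from this page:

```shell
# Write a minimal serial script. Bash treats the #SBATCH lines as ordinary
# comments, so running the file locally just executes the job-step commands.
cat > submit.sbatch <<'EOF'
#!/bin/bash
#SBATCH --time=00:10:00   # walltime (ignored by bash, read by SLURM)
#SBATCH --ntasks=1        # one core (ignored by bash, read by SLURM)
echo "job step ran"
EOF
bash submit.sbatch
```

Once the script runs cleanly this way, the same file can be handed to sbatch unchanged.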
Example Submit Script for Serial Jobs
#!/bin/bash
#SBATCH --time=00:10:00 # walltime
#SBATCH --ntasks=1 # number of processor cores (i.e. tasks)
#SBATCH --mem-per-cpu=1G # memory per CPU
#SBATCH -J "hello" # job name
./TheExecutable -in TheInputFile
exit 0
Definitions and other commands:
- #SBATCH --time= defines the walltime in hr:min:sec
- #SBATCH --ntasks= defines the number of tasks (cores) requested; for serial jobs this is always 1
- #SBATCH --mem-per-cpu= defines the memory allocated per CPU
- #SBATCH -J defines the job name
- -in in the executable line defines the input file
- -l in the executable line defines log file redirection (for LAMMPS)
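For a LAMMPS serial run, the executable line using the -in and -l flags above might look like the fragment below. The binary and file names (lmp_serial, in.melt, log.melt) are placeholders, not names from this page, and this is only a fragment of a submit script, not a standalone program:

```shell
# Hypothetical LAMMPS serial line: -in names the input script,
# -l redirects the LAMMPS log to the given file.
./lmp_serial -in in.melt -l log.melt
```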
Example Submit Script for Parallel Jobs
#!/bin/bash
#SBATCH --time=00:10:00 # walltime
#SBATCH --ntasks=20 # number of processor cores (i.e. tasks)
#SBATCH --mem-per-cpu=1G # memory per CPU
#SBATCH -J "hello" # job name
##SBATCH --qos=test # run a test submission
# Set the max number of threads to use for programs if using OpenMP
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
mpiexec -np 20 ./TheExecutable.exe
exit 0
Definitions and other commands:
- Same as for Serial jobs above
- #SBATCH --qos=test tells the scheduler you want to run a "test" submission. Walltime cannot exceed 1 hr; for full submissions, comment the line out with an additional "#" (as is done above).
- -np in the executable line defines how many MPI processes the job launches. It MUST equal the ntasks number ($SLURM_NTASKS) or, if applicable, OMP_NUM_THREADS (see above).
- -p in the executable line tells the program how many "partitions" or "boxes" you are making (for LAMMPS)
- -sc in the executable line defines screen file redirection (for LAMMPS)
- Note: either the mpiexec or mpirun command works; use whichever you have installed or whichever is faster. The FSL has both.
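One way to guarantee that -np always matches --ntasks is to build the launch line from SLURM's $SLURM_NTASKS variable rather than hard-coding the number. A sketch; the variable is set by hand here only to simulate what SLURM exports inside a real job:

```shell
# SLURM exports SLURM_NTASKS inside a running job; simulate it here.
SLURM_NTASKS=20
# Build the launch line from the variable so it can never disagree
# with the #SBATCH --ntasks request.
launch="mpiexec -np $SLURM_NTASKS ./TheExecutable.exe"
echo "$launch"
```

In the real submit script you would simply write `mpiexec -np $SLURM_NTASKS ./TheExecutable.exe` and change only the #SBATCH --ntasks line when resizing the job.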
Other squeue flags, as well as other SLURM commands can be found online (http://slurm.schedmd.com/). You can also check the man pages.
For Submitting a Job on Watt
Watt uses a job scheduler named TORQUE. The basics of using TORQUE are below; you can find much more by searching the web. The job submission script is a shell script. It can have any desired name but is commonly something like submit.sub or submit.sh.
To submit, enter the command:
$ qsub submit.sh
You can see the status of your jobs using:
$ qstat -u username-on-watt
Example Submit Script for Serial (one-processor) Jobs
#!/bin/bash
#PBS -l nodes=1:ppn=1,pmem=500mb,walltime=00:20:00
#PBS -N name-of-job
#PBS -M user-name@byu.edu
#PBS -m abe
#PBS -q hex
cd /home/username/folder-for-job/
/home/username/folder-for-job/executable
exit 0
Definitions and other commands:
- nodes refers to the number of physical machines that you want. See Computer Resources for a description of the types of machines available on Watt.
- ppn refers to the number of processors requested on each machine; more specifically, this is the number of cores you want. The total number of cores requested will be nodes*ppn. For example, nodes in the hex queue have two hex-core processors for a total of 12 cores, so you could run a job on 36 processors by setting nodes=3:ppn=12.
- pmem is the amount of memory you are requesting per process. For our jobs, memory usually isn't the limitation; they usually require much less than 1 GB, and all the machines on Watt have at least 1 GB per core.
- walltime is how long your job will run, in HH:MM:SS. Give your best prediction of how long the job will run. If you always request too much walltime, your priority on future jobs will go down; if you don't request enough, your job will terminate before the simulation is over. It is still better to overpredict than to underestimate.
- -N name-of-job specifies the name of the job and can be any description you desire.
- -M user-name@byu.edu specifies that emails about the job will be sent to user-name@byu.edu.
- -m abe specifies that you will receive an email when the job (a)borts, (b)egins, and (e)nds. You can use any combination of a, b, and e.
- -q hex specifies that the job should be run on the hex queue. This can be batch, dual, quad, hex, or gpu. See Computer Resources for more information.
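The nodes*ppn arithmetic can be checked with a quick shell sketch; the numbers below are the 36-core hex-queue example from above:

```shell
# Total cores requested by "#PBS -l nodes=N:ppn=M" is N*M.
nodes=3
ppn=12
total=$((nodes * ppn))
echo "$total cores"
```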
Example Submit Script for Parallel (multiple-processor) Jobs
#!/bin/bash
#PBS -l nodes=2:ppn=8,pmem=500mb,walltime=100:00:00
#PBS -N name-of-job
#PBS -M user-name@byu.edu
#PBS -m abe
#PBS -q quad
cd /home/username/folder-for-job/
mpiexec /home/username/folder-for-job/executable
exit 0
Definitions and other commands:
- This job is requesting 16 total cores. (2 nodes and 8 cores per node)
- It is submitted to the queue quad, where each machine (node) has two quad-core processors.
- Notice that the executable is preceded by command mpiexec.
Other qstat flags, as well as other torque commands can be found online (e.g. https://kb.iu.edu/d/avgl, http://rcc.uh.edu/hpc-docs/49-using-torque-to-submit-and-monitor-jobs.html, http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/qsub.htm). You can also check the man pages.