By default, the CPU version of TensorFlow tries to use every CPU core it can see, which is a poor fit for a shared job-scheduling system: the job spreads across a large number of cores while the utilization of each core stays very low. To avoid this, explicitly set the number of CPU cores TensorFlow may use in your Python code. There are two points to note:
In your job.sh script, pass the number of allocated cores to your Python program. A sample script is shown below:
#!/bin/bash
#SBATCH --job-name="My job"
#SBATCH --partition=rtx2080ti
#SBATCH --ntasks=2
#SBATCH --gres=gpu:1
#SBATCH --time=0-1:0
#SBATCH --chdir=.
#SBATCH --output=cout.txt
#SBATCH --error=cerr.txt
sbatch_pre.sh
module load opt gcc python/3.8.10-gpu
python3 test.py $SLURM_NTASKS
sbatch_post.sh
In the above submission script, the core count reaches test.py through $SLURM_NTASKS, so the number of cores is requested with --ntasks:
For the CPU version of TensorFlow, to request 2 CPU cores, use #SBATCH --ntasks=2
For the GPU version of TensorFlow, to request 2 CPU cores and 1 GPU, use #SBATCH --ntasks=2 and #SBATCH --gres=gpu:1
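On the Python side, the value given on the command line can be backed up by the SLURM_NTASKS environment variable that Slurm sets inside the job. The following is a minimal sketch, not part of the original script: the helper name resolve_core_count is hypothetical, and falling back to the environment variable and then to 1 is an assumption about what a sensible default looks like.

```python
import os
import sys


def resolve_core_count(argv, environ):
    """Pick the core count: argv[1] if given, else SLURM_NTASKS, else 1."""
    if len(argv) >= 2:
        requested = int(argv[1])
    else:
        # SLURM_NTASKS may be missing (run outside a job) or empty.
        requested = int(environ.get("SLURM_NTASKS") or 1)
    return max(1, requested)  # never report fewer than one core


# Hypothetical usage at the top of test.py:
# cores = resolve_core_count(sys.argv, os.environ)
```

Passing argv and environ as parameters keeps the helper easy to check without submitting a job.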
In your Python code, you should also specify the number of CPU cores TensorFlow is allowed to use. An example is shown below:
import sys

import tensorflow as tf

def MyFunction(CPU_cores):  # CPU_cores: the number of allocated CPU cores
    print('Number of CPU cores: ' + str(CPU_cores))  # Print it out to verify
    # TensorFlow 1.x API (tf.compat.v1.ConfigProto / tf.compat.v1.Session under 2.x).
    # Cap TensorFlow at CPU_cores CPU devices, with one thread per thread pool.
    MyConfig = tf.ConfigProto(device_count={'CPU': CPU_cores},
                              inter_op_parallelism_threads=1,
                              intra_op_parallelism_threads=1)
    MySession = tf.Session(config=MyConfig)
    return MySession  # Return the session so the caller can run work in it

if __name__ == '__main__':
    if len(sys.argv) >= 2:  # A core count was given on the command line
        cc = int(sys.argv[1])  # First argument after the script name
        MyFunction(cc)  # Pass it to the user-defined function MyFunction
    else:  # No core count was given
        MyFunction(1)  # Default to a single CPU core
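The example above uses the TensorFlow 1.x session API. Under TensorFlow 2.x, a comparable limit can be set through tf.config.threading. The following is a rough sketch under stated assumptions, not the method used on this cluster: the function name limit_tensorflow_threads is hypothetical, and giving the intra-op pool all allocated cores while keeping the inter-op pool at one thread is an assumption that differs from the example above, which sets both pools to 1.

```python
def limit_tensorflow_threads(cpu_cores):
    """Cap TensorFlow 2.x thread pools; call before any TensorFlow op runs."""
    import tensorflow as tf  # lazy import: only needed when the cap is applied

    # Assumption: intra-op parallelism gets the allocated cores,
    # inter-op parallelism stays at a single thread.
    tf.config.threading.set_intra_op_parallelism_threads(cpu_cores)
    tf.config.threading.set_inter_op_parallelism_threads(1)
```

These setters have to run before TensorFlow initializes its runtime, so the call belongs at the very start of the program, for example limit_tensorflow_threads(int(sys.argv[1])) right after argument parsing.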