Smoke tests and performance tests: ensure GPU memory and other resources are available
There may not be enough GPU memory or other resources available when the smoke and performance tests run. The .sh files should check for this. For example, run_smoke_tests.sh could re-check every 10 seconds how much GPU memory is free and only proceed to the xvfb-run call once enough is available. I did something similar a while ago when working on parallel processing. My solution looked like this:
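# Only run if there is a GPU with enough free memory.
# Expects min_mem_free (MiB), sleep_time (seconds) and the gpu_ids array to be set.
free_mem=0
this_gpu_id=-1
# Use -lt so the loop ends when free memory equals the minimum (matches the -ge check below)
while [ "$free_mem" -lt "$min_mem_free" ]; do
    for gpu_id in "${gpu_ids[@]}"; do
        # noheader,nounits limits the output to the bare number (in MiB)
        free_mem=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i "$gpu_id" | grep -Eo '[0-9]+')
        if [ "$free_mem" -ge "$min_mem_free" ]; then
            this_gpu_id=${gpu_id}
            break
        fi
        echo "${free_mem} MiB is free on GPU ${gpu_id}, but ${min_mem_free} MiB is required. Waiting..."
        sleep "$sleep_time"
    done
done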
Note that this solution does a bit more than a plain wait: it iterates over all GPUs and automatically selects one as soon as it has enough free memory.
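For a CI script it might also be worth bounding the wait, so a permanently busy GPU does not hang the job forever. Below is a minimal sketch of the same idea wrapped in a function with a timeout. The function name wait_for_free_gpu, its argument order and the timeout behaviour are my own assumptions for illustration and are not part of the existing scripts. It only relies on nvidia-smi being on the PATH.

```shell
# Hypothetical helper (not part of the existing scripts):
# wait until one of the given GPUs has enough free memory.
# Usage: wait_for_free_gpu MIN_FREE_MIB SLEEP_S MAX_WAIT_S GPU_ID...
# Prints the selected GPU id on stdout; returns 1 if the timeout is reached.
wait_for_free_gpu() {
    local min_mem_free=$1 sleep_time=$2 max_wait=$3
    shift 3
    local waited=0 gpu_id free_mem
    while :; do
        for gpu_id in "$@"; do
            # noheader,nounits makes nvidia-smi print just the number (in MiB)
            free_mem=$(nvidia-smi --query-gpu=memory.free \
                --format=csv,noheader,nounits -i "$gpu_id" | tr -d '[:space:]')
            case "$free_mem" in
                ''|*[!0-9]*)
                    # Empty or non-numeric output: treat this GPU as unavailable
                    echo "GPU ${gpu_id}: could not read free memory." >&2
                    continue
                    ;;
            esac
            if [ "$free_mem" -ge "$min_mem_free" ]; then
                echo "$gpu_id"
                return 0
            fi
            echo "GPU ${gpu_id}: ${free_mem} MiB free, ${min_mem_free} MiB required." >&2
        done
        # Give up once the accumulated waiting time reaches the limit
        if [ "$waited" -ge "$max_wait" ]; then
            echo "No GPU became available within ${max_wait} s." >&2
            return 1
        fi
        sleep "$sleep_time"
        waited=$((waited + sleep_time))
    done
}
```

In run_smoke_tests.sh this could be called just before xvfb-run, for example as gpu=$(wait_for_free_gpu 4000 10 600 0 1) || exit 1, followed by export CUDA_VISIBLE_DEVICES="$gpu". The values 4000 MiB, 10 s, 600 s and the GPU ids 0 and 1 are placeholders, not measured requirements. Unlike the loop above, this variant sleeps once per full pass over the GPUs rather than after each one.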
As a workaround, I have now added
export CUDA_VISIBLE_DEVICES=""
in run_smoke_tests.sh, which hides all GPUs from CUDA applications.