Mitigate runaway processes
Sometimes I find myself running testsuites that typically, in order to make the most of the several cores available in the system, spawn many processes so the tests can run in parallel. This allows running the testsuites much faster.
One side effect of these mechanisms, though, is that they may not handle cancellation correctly, for instance when we press Ctrl-C.
Today we are going to see a way to mitigate this problem using systemd-run.
Systemd
Systemd is the system and service manager used in Linux these days, replacing earlier solutions based on shell scripts. In contrast to loosely coupled scripts, systemd is a more integrated solution. This has pros and cons, but the former seem to outweigh the latter, and most Linux distributions have migrated to systemd.
Systemd uses the concept of units, of which there are different kinds; we are interested in the service unit type.
Typically units are described by files on disk, and we can start them, stop them, and so on using the systemctl command.
systemd-run
The tool systemd-run allows us to create service units on the fly for ad-hoc purposes. By default systemd-run will try to use the global (system-wide) systemd session, but we can tell it to use the systemd session created when the user logged on (e.g. via ssh) using the command option --user.
One interesting flag is the --shell flag, which allows us to run $SHELL as a systemd service. This means that systemd is in control of the processes created in there.
$ systemd-run --user --shell
Running as unit: run-u100.service
Press ^] three times within 1s to disconnect TTY.
$ uname -a
Linux mybox 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
$ exit
exit
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 2.715s
CPU time consumed: 10ms
According to the documentation, the flag --shell is a shortcut for the command options --pty --same-dir --wait --collect --service-type=exec $SHELL.
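Spelled out, and taking the expansion quoted above at face value, the shortcut can be sketched as follows. The array `shell_flags` and the function `unit_shell` are names made up for this sketch, and the fallback to /bin/sh when SHELL is unset is an assumption of mine.

```shell
# Options that --shell stands for, as quoted from the documentation above.
shell_flags=(--pty --same-dir --wait --collect --service-type=exec)

# Hypothetical helper: open $SHELL as a transient service in the user session.
# The /bin/sh fallback is an assumption of this sketch.
unit_shell() {
    systemd-run --user "${shell_flags[@]}" "${SHELL:-/bin/sh}"
}
```

Calling `unit_shell` should then behave like `systemd-run --user --shell`.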
Use case
As part of my day job I often run the LLVM unit and regression tests. Once we have built LLVM, along with other projects such as clang, flang and lld, there is a target in the build system called check. This target will build the necessary infrastructure for unit tests and invoke lit.
# Build LLVM and all the projects
user:~/llvm-build$ cmake --build .
# Run the unit and regression tests
user:~/llvm-build$ cmake --build . --target check
lit is implemented in Python and, in order to exploit parallelism, uses the multiprocessing module. Unfortunately, if for some reason you need to cancel the testsuite execution early (e.g., you realised you forgot to add a test), say by pressing Ctrl-C, and your machine has lots of threads, you will end up with a large number of runaway processes. This is easy to observe when LLVM is built in Debug mode, as everything runs much slower, including tests. I have not dug further, but I assume this is a limitation of the multiprocessing module.
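I have not checked what lit does internally, so the following is not a reproduction of its behaviour. It is only a small bash sketch of a general Unix fact that makes stragglers like these possible: terminating a parent process does not, by itself, terminate its children. The variable names and the use of `sleep` as a stand-in for a test are made up for the example.

```shell
# Sketch: a parent starts a long-running child and is then terminated.
# The child is not signalled and keeps running on its own.
pidfile="$(mktemp)"
survived=no

# Parent: a separate bash process that spawns `sleep` (our stand-in for a
# test), records the child's PID in the file given as $1, and waits.
bash -c 'sleep 30 & echo "$!" > "$1"; wait' _ "$pidfile" &
parent=$!

# Give the parent up to ~5 seconds to record the child's PID.
for _ in $(seq 50); do
    [ -s "$pidfile" ] && break
    sleep 0.1
done
child="$(cat "$pidfile")"

# Terminate only the parent and reap it.
kill -TERM "$parent"
wait "$parent" 2>/dev/null || true

# The child is still alive even though its parent is gone.
if kill -0 "$child" 2>/dev/null; then
    survived=yes
    echo "child $child survived its parent"
fi

# Clean up the straggler by hand.
kill "$child" 2>/dev/null || true
rm -f "$pidfile"
```

Here the clean-up is a single `kill` because we know the PID. With dozens of test processes nobody is tracking, it is much less convenient.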
The following is an example of what typically happens if we press Ctrl-C on a machine with 16 cores (32 threads):
user:~/llvm-build$ cmake --build . --target check
[2/3] cd /home/user/soft/llvm-build... /usr/bin/python3 -m unittest discover
.................................................................................................................................
----------------------------------------------------------------------
Ran 129 tests in 1.403s
OK
[2/3] Running all regression tests
llvm-lit: /home/user/llvm-src/llvm/utils/lit/lit/llvm/config.py:488: note: using clang: /home/user/llvm-build/bin/clang
^C interrupted by user, skipping remaining tests
Testing Time: 4.53s
Total Discovered Tests: 74509
Skipped: 74509 (100.00%)
ninja: build stopped: interrupted by user.
If right after cancelling we check ps -x -f, we will see a large number of processes that have been detached from the lit process.
user:~/llvm-build$ ps -x -f
…
16574 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/memory-legalizer-global-agent.ll.script
16575 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx600 -verify-machineinstrs
16576 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck --check-prefixes=GFX6 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll
16577 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/memory-legalizer-local-singlethread.ll.script
16578 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx600 -verify-machineinstrs
16579 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck --check-prefixes=GFX6 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/memory-legalizer-local-singlethread.ll
16580 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/sched-group-barrier-pipeline-solver.mir.script
16612 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -march=amdgcn -mcpu=gfx908 -amdgpu-igrouplp-exact-solver -run-pass=machine-scheduler -o - /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/sched-group-barrier-pipeline-solver.mir
16613 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck -check-prefix=EXACT /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/sched-group-barrier-pipeline-solver.mir
16583 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/memory-legalizer-global-system.ll.script
16584 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx600 -verify-machineinstrs
16585 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck --check-prefixes=GFX6 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll
16586 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/memory-legalizer-flat-agent.ll.script
16587 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -verify-machineinstrs
16588 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck --check-prefixes=GFX7 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll
16590 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/memory-legalizer-flat-singlethread.ll.script
16591 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -verify-machineinstrs
16592 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck --check-prefixes=GFX7 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-singlethread.ll
16593 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/memory-legalizer-flat-system.ll.script
16594 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -verify-machineinstrs
16595 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck --check-prefixes=GFX7 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll
16596 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/memory-legalizer-flat-wavefront.ll.script
16597 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -verify-machineinstrs
16598 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck --check-prefixes=GFX7 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-wavefront.ll
16600 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/tools/clang/test/CodeGen/X86/Output/x86_64-xsave.c.script
16658 pts/2 R 0:04 | \_ /home/user/llvm-build/bin/clang -cc1 -internal-isystem /home/user/llvm-build/lib/clang/18/include -nostdsysteminc /home/user/llvm-src/clang/test/CodeGen/X86/x86_64-xsave.c -DTEST_XSAVE -O0
16659 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck /home/user/llvm-src/clang/test/CodeGen/X86/x86_64-xsave.c --check-prefix=XSAVE
16603 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/memory-legalizer-flat-workgroup.ll.script
16607 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -verify-machineinstrs
16608 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck --check-prefixes=GFX7 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll
16609 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/tools/clang/test/CodeGen/X86/Output/rot-intrinsics.c.script
16646 pts/2 R 0:05 | \_ /home/user/llvm-build/bin/clang -cc1 -internal-isystem /home/user/llvm-build/lib/clang/18/include -nostdsysteminc -x c -ffreestanding -triple x86_64--linux -no-enable-noundef-analysis -emit-llvm /home/roge
16647 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck /home/user/llvm-src/clang/test/CodeGen/X86/rot-intrinsics.c --check-prefixes CHECK,CHECK-64BIT-LONG
16621 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/tools/clang/test/Headers/Output/opencl-builtins.cl.script
16642 pts/2 R 0:09 | \_ /home/user/llvm-build/bin/clang -cc1 -internal-isystem /home/user/llvm-build/lib/clang/18/include -nostdsysteminc -include /home/user/llvm-src/clang/test/Headers/opencl-builtins.cl /home/ro
16622 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/tools/clang/test/CodeGen/PowerPC/Output/ppc-smmintrin.c.script
16652 pts/2 R 0:04 | \_ /home/user/llvm-build/bin/clang -S -emit-llvm -target powerpc64-unknown-linux-gnu -mcpu=pwr8 -ffreestanding -DNO_WARN_X86_INTRINSICS /home/user/llvm-src/clang/test/CodeGen/PowerPC/ppc-smmintrin.c -fno-discard-
16623 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/tools/clang/test/CodeGen/X86/Output/x86_32-xsave.c.script
16656 pts/2 R 0:04 | \_ /home/user/llvm-build/bin/clang -cc1 -internal-isystem /home/user/llvm-build/lib/clang/18/include -nostdsysteminc /home/user/llvm-src/clang/test/CodeGen/X86/x86_32-xsave.c -DTEST_XSAVE -O0
16657 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck /home/user/llvm-src/clang/test/CodeGen/X86/x86_32-xsave.c --check-prefix=XSAVE
16624 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/GlobalISel/Output/fdiv.f16.ll.script
16627 pts/2 R 0:10 | \_ /home/user/llvm-build/bin/llc -global-isel -march=amdgcn -mcpu=tahiti -denormal-fp-math=ieee -verify-machineinstrs
16629 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck -check-prefixes=GFX6,GFX6-IEEE /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll
16625 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/tools/clang/test/Headers/Output/opencl-c-header.cl.script
16648 pts/2 R 0:05 | \_ /home/user/llvm-build/bin/clang -cc1 -internal-isystem /home/user/llvm-build/lib/clang/18/include -nostdsysteminc -O0 -triple spir-unknown-unknown -internal-isystem ../../lib/Headers -include opencl-c.h -e
16649 pts/2 S 0:00 | \_ /home/user/llvm-build/bin/FileCheck /home/user/llvm-src/clang/test/Headers/opencl-c-header.cl
16636 pts/2 S 0:00 \_ /bin/bash /home/user/llvm-build/test/CodeGen/AMDGPU/Output/mad-mix.ll.script
16650 pts/2 R 0:05 \_ /home/user/llvm-build/bin/llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs
16651 pts/2 S 0:00 \_ /home/user/llvm-build/bin/FileCheck -check-prefixes=GFX900,SDAG-GFX900 /home/user/llvm-src/llvm/test/CodeGen/AMDGPU/mad-mix.ll
…
Granted, given enough time, those processes will eventually finish silently. But given that tests sometimes use deterministic intermediate files, if we run them again immediately we risk spurious failures caused by two processes writing to the same file (i.e. a kind of filesystem data race).
Running inside systemd-run
One of the downsides of running something as a service using systemd-run is that it won’t inherit the environment but instead will use the environment of the systemd session. Luckily this can be addressed using the -p EnvironmentFile=<file> option.
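If only a handful of variables matter, an alternative is the --setenv=NAME=VALUE option of systemd-run, which sets individual variables for the service. The sketch below builds those arguments from a list of variable names; the array `setenv_args` and the function `add_setenv` are hypothetical names, and the choice of PATH and HOME is only an example.

```shell
# Collected --setenv arguments (hypothetical helper state for this sketch).
setenv_args=()

# Append one --setenv=NAME=VALUE argument per variable name given.
add_setenv() {
    local name
    for name in "$@"; do
        # ${!name+x} expands to "x" only if the variable called $name is set.
        if [ -n "${!name+x}" ]; then
            setenv_args+=("--setenv=${name}=${!name}")
        fi
    done
}

add_setenv PATH HOME

# Usage (not executed here):
#   systemd-run --user --pty --same-dir --wait --collect \
#       "${setenv_args[@]}" -- some-command
```

This avoids the temporary file but requires knowing in advance which variables the command needs, so copying the whole environment through EnvironmentFile is the more general approach.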
With all this, we can build a convenient shell script.
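#!/usr/bin/env bash
set -euo pipefail

# Remove the temporary environment file on exit.
function cleanup() {
    [ -n "${ENV_FILE:-}" ] && rm -f "${ENV_FILE}"
}

ENV_FILE="$(mktemp)"
trap cleanup EXIT

# Save the current environment so the service can load it.
env > "${ENV_FILE}"

# Run the given command as a transient service in the user session.
systemd-run --user --pty --same-dir --wait --collect --service-type=exec -q \
    -p "EnvironmentFile=${ENV_FILE}" -- "$@"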
The flag -q silences the informational messages emitted by systemd-run on start and end.
Now we can run the regression tests using this convenient script, and even if we abort the execution by pressing Ctrl-C, systemd will kill the whole process tree.
user:~/llvm-build$ confine.sh cmake --build . --target check
[2/3] cd /home/user/llvm-src/clang/bindings/python && /usr/bin/cmake -E env CLANG_NO_DEFAULT_CONFIG=1 CLANG_LIBRARY_PATH=/home/user/llvm-build/lib /usr/bin/python3 -m unittest discover
.................................................................................................................................
----------------------------------------------------------------------
Ran 129 tests in 1.410s
OK
[2/3] Running all regression tests
llvm-lit: /home/user/llvm-src/llvm/utils/lit/lit/llvm/config.py:488: note: using clang: /home/user/llvm-build/bin/clang
^C interrupted by user, skipping remaining tests
Testing Time: 18.81s
Total Discovered Tests: 74509
Skipped: 74509 (100.00%)
ninja: build stopped: interrupted by user.
user:~/llvm-build$ ps -x -f | grep "bash.*\.script" | wc -l
0
Hope this is useful :)