scalar: work around hang in git-fetch(1) with fsmonitor

For a long time, we have seen CI jobs on macOS fail both in GitHub
Workflows and in GitLab CI. After some painful debugging we found that
the offending test suites are t9210 and t9211. The common symptom is a
git-fetch(1) process that hangs while seemingly doing nothing, alongside
a bunch of fsmonitor processes. When the fsmonitor processes are killed,
git-fetch(1) becomes unstuck and the test continues to run.

This issue can only be reproduced when the system is highly loaded. The
most successful way to trigger the issue is to run both of these test
suites in parallel with --stress. Eventually, tests start to get stuck
and progress grinds to a halt.
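
A reproduction run along those lines might look like the sketch below.
The helper name is hypothetical, and the script file names
t9210-scalar.sh and t9211-scalar-clone.sh are assumed to be the files
behind the two test suites; adjust them if your tree differs.

```shell
# Hypothetical helper: run both scalar test suites concurrently with
# --stress. The argument is the directory holding the test scripts and
# defaults to "t", i.e. the function is meant to be called from the
# root of a built git.git checkout.
stress_scalar_tests () {
	(
		cd "${1:-t}" || exit 1
		# Assumed script names for t9210 and t9211.
		./t9210-scalar.sh --stress &
		./t9211-scalar-clone.sh --stress &
		# Block until both runs end; a stuck test shows up as a
		# wait that never returns.
		wait
	)
}
```

Running `stress_scalar_tests` on an otherwise busy machine is what
eventually surfaced the hang; on an idle machine it may never trigger.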

All of this smells like a race condition somewhere deep in the fsmonitor
logic. The most likely scenario is that some events in the FSEventStream
used by macOS to listen for filesystem events get lost. I cannot really
tell though, and do not have enough knowledge of macOS internals to
debug this properly. It does not help that the race only triggers
sometimes and only under high load.

Instead of fixing the underlying issue, I have found a workaround that
makes the symptom go away: we can start the fsmonitor daemon manually
before we execute git-fetch(1). This means that git-fetch(1) won't have
to spawn the daemon itself anymore, and that is seemingly sufficient to
avoid the race. At least CI seems to be happy, and running the two tests
with --stress for ~30 minutes didn't surface any hanging tests anymore.
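
The ordering that the workaround relies on can be illustrated at the
command-line level. This is only a sketch with a hypothetical helper
name, not the code path that scalar itself takes; it merely shows
"start the daemon first, fetch second" using the existing
git-fsmonitor--daemon(1) subcommands.

```shell
# Hypothetical helper: make sure the fsmonitor daemon is already
# running for the repository before git-fetch(1) is executed, so that
# the fetch does not have to spawn the daemon itself.
fetch_with_running_fsmonitor () {
	repo=$1
	shift
	# "status" exits non-zero when no daemon is watching the repository.
	if ! git -C "$repo" fsmonitor--daemon status >/dev/null 2>&1
	then
		git -C "$repo" fsmonitor--daemon start || return 1
	fi
	# Remaining arguments are passed through to git-fetch(1).
	git -C "$repo" fetch "$@"
}
```

For example, `fetch_with_running_fsmonitor path/to/enlistment/src origin`
would start the daemon if needed and then fetch from origin.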

While it feels bad to paper over the issue without fully understanding
it, it does at least solve an actual bug. It shouldn't be a regression
in functionality either, as we would eventually spawn the fsmonitor
even without this change, either via git-fetch(1) or via a later call
to start_fsmonitor_daemon() via register_dir() that we execute after
the checkout.