scalar: work around hang in git-fetch(1) with fsmonitor

Patrick Steinhardt requested to merge pks-ci-macos-hang into master

For a long time, we have seen CI jobs on macOS fail both in GitHub Workflows and in GitLab CI. After some painful debugging we have found out that the offending test suites are t9210 and t9211. The common symptom is a git-fetch(1) process that hangs while seemingly doing nothing, alongside a bunch of fsmonitor processes. When the fsmonitor processes are killed, git-fetch(1) becomes unstuck and the test continues to run.

This issue can only be reproduced when the system is highly loaded. The most reliable way to trigger it is to run both of these test suites in parallel with --stress. Eventually, tests start to get stuck and progress grinds to a halt.
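The reproduction described above can be sketched roughly as follows. This is only an illustration: it assumes a built Git source tree, `GIT_SRC` and `stress_both` are hypothetical names, and the globs are expected to match exactly one script each for t9210 and t9211.

```shell
#!/bin/sh
# Sketch: run the t9210 and t9211 suites concurrently under --stress.
# GIT_SRC (hypothetical) points at a built Git source tree and defaults
# to the current directory.
stress_both () {
	(
		cd "${GIT_SRC:-.}/t" || exit 1
		for script in t9210-*.sh t9211-*.sh
		do
			# --stress keeps re-running the script in parallel jobs
			# until one of them fails.
			"./$script" --stress &
		done
		# Block until both stress runs have ended.
		wait
	)
}
```

Invoked as `GIT_SRC=/path/to/git stress_both`. When the hang triggers, the jobs do not fail but simply stop making progress, so the run has to be interrupted manually.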

All of this smells like a race condition somewhere deep in the fsmonitor logic. The most likely scenario is that some events in the FSEventStream used by macOS to listen for filesystem events get lost. I cannot really tell though, and do not have enough knowledge of macOS internals to properly debug this. It does not help that the race only triggers sometimes and only under high load.

Instead of fixing the underlying issue, I have found a workaround that makes the symptom go away: we can start the fsmonitor daemon manually before we execute git-fetch(1). This means that git-fetch(1) won't have to spawn the daemon itself anymore, and that is seemingly sufficient to avoid the race. At least CI seems to be happy, and running the two tests with --stress for ~30 minutes didn't surface any hanging tests anymore.
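Expressed as shell commands, the ordering behind the workaround looks roughly like the sketch below. It only illustrates the idea; the actual change is made in scalar's C code, not in a script. The function name `fetch_with_prestarted_daemon` is hypothetical, and the sketch assumes a platform where the builtin fsmonitor daemon is available.

```shell
#!/bin/sh
# Sketch: make sure the fsmonitor daemon is already running before
# git-fetch(1) executes, so that the fetch does not have to spawn it.
fetch_with_prestarted_daemon () {
	# "status" exits with zero if a daemon is already watching this
	# worktree; only start one when that is not the case.
	git fsmonitor--daemon status >/dev/null 2>&1 ||
	git fsmonitor--daemon start || return 1

	# The daemon is up at this point, so the fetch can simply use it.
	git fetch "$@"
}
```

Called from inside a worktree as `fetch_with_prestarted_daemon origin`.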

While it feels bad to paper over the issue without fully understanding it, it does at least solve an actual bug. It shouldn't be a regression in functionality either, as we would eventually spawn the fsmonitor even without this change -- either via git-fetch(1), or via a later call to start_fsmonitor_daemon() via register_dir() that we execute after the checkout.

Edited by Patrick Steinhardt
