Rework secondary node reconfiguration
Improve documentation on how to promote the Geo secondary nodes to a Geo primary
Original, abandoned premise:
We may be able to make these machine reconfigure steps happen after failover. The incentive is that they can take a long time to run; we want the maintenance window to be as short as possible, and we're already pushing up against a one-hour budget.
I think this includes the previously-forgotten-about step of "update the `external_url`". However, it may not: having this set incorrectly should still allow requests to be served correctly, but we may write the wrong absolute URL into system notes or other places. It may also interfere with the container registry, GitLab Pages, etc. Needs investigation.
We can't update any of these things (including `external_url`) pre-failover because that will break the secondary's conception of itself as a secondary.
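To make the "wrong absolute URL" concern concrete, here is a minimal Ruby sketch. It assumes absolute links are built by joining a configured base URL with a path; the `absolute_url` helper and both hostnames are hypothetical and are not taken from the GitLab codebase.

```ruby
require 'uri'

# Hypothetical helper: joins a configured base URL with a path, roughly the
# way an absolute link embedded in a system note might be produced.
def absolute_url(external_url, path)
  URI.join(external_url, path).to_s
end

# Assumed hostnames, for illustration only.
stale_base = 'https://old-secondary.example.com' # external_url not yet updated
fresh_base = 'https://gitlab.example.com'        # external_url after the change

path = '/group/project/issues/1'

# Requests are still served, but any link persisted while the stale value is
# in place keeps pointing at the old hostname.
puts absolute_url(stale_base, path)
puts absolute_url(fresh_base, path)
```

Under this assumption, anything written between failover and the `external_url` update would carry the old hostname and would need checking afterwards.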
Activity
- Resolved by Nick Thomas
    1. [ ] Ensure all "after the blackout" QA automated tests have succeeded - @meks
    1. [ ] Ensure all "after the blackout" QA manual tests have succeeded - @meks

    ## Post-failover

@brodock @vsizov @stanhu @ahmadsherif @andrewn what do we think of this change? Customers would be accessing the new GitLab.com at this point, having passed all QA. We'd modify the running services after the fact.
This should be safe because all the Geo-related code is gated on `::Gitlab::Geo.secondary?` returning true, which it won't do after running `set_secondary_as_primary`.

I'm less sure about the `external_url`. If we can't make that change happen post-failover, it's probably not worth having this optimization: running two changes across the whole fleet isn't likely to be much slower than running a single change, right?
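The gating argument can be sketched in Ruby as follows. This is a simplified stand-in, not the real `::Gitlab::Geo` module: `GeoSketch`, `Node`, and `run_secondary_only_work` are hypothetical names, and the promotion step is reduced to flipping a flag on an in-memory node.

```ruby
# Simplified stand-in for ::Gitlab::Geo. All names here are hypothetical;
# the real module derives node state from the database.
module GeoSketch
  Node = Struct.new(:url, :primary)

  class << self
    attr_accessor :current_node

    # True only while the current node exists and is not the primary.
    def secondary?
      !current_node.nil? && !current_node.primary
    end

    # Stand-in for the promotion step: mark the current node as primary.
    def set_secondary_as_primary
      current_node.primary = true
    end

    # Any Geo-specific work is guarded by the secondary? check.
    def run_secondary_only_work
      return :skipped unless secondary?

      :ran
    end
  end
end

GeoSketch.current_node = GeoSketch::Node.new('https://secondary.example.com', false)
puts GeoSketch.run_secondary_only_work # guard passes while still a secondary

GeoSketch.set_secondary_as_primary
puts GeoSketch.run_secondary_only_work # guard short-circuits after promotion
```

If the real code paths are all guarded like this, services left running after promotion would do no Geo work even before they are reconfigured. The sketch says nothing about `external_url`, which is not covered by the guard.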
I'm also more comfortable doing all this before opening up to external traffic again. Should we consider extending the outage window to 2 or even 3 hours, then, to ensure we've got time to do everything? /cc @andrewn
changed this line in version 2 of the diff
assigned to @ahmadsherif
assigned to @nick.thomas
added 17 commits
- 7935ad09...3abdc1e9 - 14 commits from branch master
- 33a05699 - Reconfigure secondary nodes to disable Geo services *post-failover*
- 8195355d - Post-failover steps
- d83fce05 - Remove post-failover meat
assigned to @ahmadsherif
OK @ahmadsherif, ready for another review.
I've moved most of the reconfigure-secondary-nodes meat out of the procedure itself in favour of linking to the merge request containing the details.
I think that if we run this procedure in this order, we'll end up with gstg/gprd becoming [staging.]gitlab.com with Geo not set up, but with a primary in the database matching the old hostname. This might be undesirable.
assigned to @nick.thomas
mentioned in commit 813b3ffe