Rework secondary node reconfiguration

Improve documentation on how to promote the Geo secondary nodes to a Geo primary


Original, abandoned premise:

We may be able to make these machine reconfigure steps happen after failover. The incentive is that they can take a long time to run; we want the maintenance window to be as short as possible, and we're already pushing up against a one-hour budget.

I think this includes the previously forgotten step of "update the external_url". However, it may not: having this set incorrectly should still allow requests to be served correctly, but we may write the wrong absolute URL into system notes or other places. It may also interfere with the container registry, GitLab Pages, etc. Needs investigation.

We can't update any of these things (including external_url) pre-failover because that will break the secondary's conception of itself as a secondary.
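
The external_url concern can be illustrated with a minimal sketch. It assumes, hypothetically, that persisted absolute links are built from the configured external_url rather than from the host the request arrived on. The `absolute_url` helper and both hostnames are invented for illustration and are not GitLab's actual code.

```ruby
require 'uri'

# Hypothetical helper (an assumption, not a GitLab API): builds an absolute
# link from the configured external_url, ignoring the request's own host.
def absolute_url(external_url, path)
  URI.join(external_url, path).to_s
end

# Assumed hostnames, for illustration only.
stale_external_url = 'https://old-secondary.example.com'
request_host       = 'gitlab.example.com'

# The request itself is served on request_host, but any link that gets
# persisted (for example into a system note) carries the stale host.
note_link = absolute_url(stale_external_url, '/group/project/issues/1')
puts note_link
```

Under that assumption, serving traffic keeps working while stored links quietly point at the old hostname, which is why the effect on system notes, the container registry, and GitLab Pages needs investigation.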

Edited by Nick Thomas

Activity

  • 1. [ ] Ensure all "after the blackout" QA automated tests have succeeded - @meks
    1. [ ] Ensure all "after the blackout" QA manual tests have succeeded - @meks

    ## Post-failover
    • @brodock @vsizov @stanhu @ahmadsherif @andrewn what do we think of this change? Customers would be accessing the new GitLab.com at this point, having passed all QA. We'd modify the running services after the fact.

      This should be safe because all the Geo-related code is gated on ::Gitlab::Geo.secondary? returning true, which it won't do after running set_secondary_as_primary.

      I'm less sure about the external_url. If we can't make that change happen post-failover, it's probably not worth having this optimization - running two changes across the whole fleet isn't likely to be much slower than running a single change, right?
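
A minimal sketch of the gating argument above, assuming a toy `GeoSketch` module as a stand-in. Only the names `secondary?` and `set_secondary_as_primary` come from the discussion; the module, its `@role` state, and `run_secondary_only_work` are hypothetical and do not reflect GitLab's real implementation.

```ruby
# Hypothetical stand-in for the node's Geo role; not GitLab's real code.
module GeoSketch
  @role = :secondary

  class << self
    def secondary?
      @role == :secondary
    end

    # Stand-in for the promotion step named in the comment above.
    def set_secondary_as_primary
      @role = :primary
    end

    # Geo-only work is skipped unless the node still sees itself as a secondary.
    def run_secondary_only_work
      return :skipped unless secondary?

      :ran
    end
  end
end

before = GeoSketch.run_secondary_only_work # node is still a secondary
GeoSketch.set_secondary_as_primary
after = GeoSketch.run_secondary_only_work  # node has been promoted
puts "#{before} -> #{after}"
```

If every Geo code path is guarded this way, services that are still running with secondary configuration should become inert after promotion, which is the basis for deferring the reconfigure.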

    • I think I'd rather spend another hour getting the config into a consistent and known state rather than chase possibly phantom issues due to incomplete changes. What do you think?

    • I'm also more comfortable doing all this before opening up to external traffic again. Should we consider extending the outage window to 2 or even 3 hours, then, to ensure we've got time to do everything? /cc @andrewn

    • changed this line in version 2 of the diff

  • Nick Thomas unmarked as a Work In Progress

  • Nick Thomas changed title from WIP: Post failover reconfigure to Rework secondary node reconfiguration

  • Nick Thomas changed the description

  • Nick Thomas added 17 commits

  • Nick Thomas added 1 commit

  • OK @ahmadsherif , ready for another review.

    I've moved most of the reconfigure-secondary-nodes meat out of the procedure itself in favour of linking to the merge request containing the details.

    I think that what we'll end up with, running this procedure in this order, is gstg/gprd becoming [staging.]gitlab.com with Geo not set up, but with a primary in the database that still matches the old hostname. This might be undesirable.

  • Ahmad Sherif
  • Nick Thomas added 5 commits

    • 2f957968 - 1 commit from branch master
    • 72ea79ea - Reconfigure secondary nodes to disable Geo services *post-failover*
    • c7c219fa - Post-failover steps
    • b27406fd - Remove post-failover meat
    • 48f5ee44 - Address feedback

  • Nick Thomas mentioned in commit 813b3ffe
