Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opt out of binlogs #759

Closed
wants to merge 2 commits into from
Closed

Conversation

zmoazeni
Copy link
Contributor

@zmoazeni zmoazeni commented Jun 22, 2019

This PR allows users to opt out of writing to binary logs for --test-on-replica, --migrate-on-replica, and --allow-on-master

This is still WIP. I have a few things that I want to clarify or clean up:

  • Should we allow this on --allow-on-master?
  • Run shellcheck on the new bash scripts
  • Fail early on v5.5
  • Fail early if the server has replicas
  • Should we offer a way to write empty GTID statements to a file?
  • Should we generate gtid_next for every statement? Answer: No
  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

Copy link
Contributor Author

@zmoazeni zmoazeni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shlomi-noach @evanelias this is still WIP, but mostly done. I'm curious about your thoughts on this.

@@ -90,6 +90,7 @@ func main() {
cutOver := flag.String("cut-over", "atomic", "choose cut-over type (default|atomic, two-step)")
flag.BoolVar(&migrationContext.ForceNamedCutOverCommand, "force-named-cut-over", false, "When true, the 'unpostpone|cut-over' interactive command must name the migrated table")
flag.BoolVar(&migrationContext.ForceNamedPanicCommand, "force-named-panic", false, "When true, the 'panic' interactive command must name the migrated table")
flag.BoolVar(&migrationContext.UseBinLog, "sql-log-bin", true, "When true, gh-ost will write to the binlog for all operations that perform --test-on-replica, --migration-on-replica, and --allow-on-master")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should discuss whether we should allow --allow-on-master to work considering the scope of the GTID surgery. Feels super dangerous.

go/logic/applier.go Outdated Show resolved Hide resolved
go/logic/applier.go Outdated Show resolved Hide resolved
go/logic/gtidset.go Show resolved Hide resolved
go/mysql/connection.go Outdated Show resolved Hide resolved
@@ -134,6 +134,16 @@ test_single() {
echo_dot
sleep 1
#

local callback_return
if [ -f $tests_path/$test_name/before.sh ] ; then
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new way for tests to have before.sh/after.sh to expand on our tests. Each of the new tests depend on this.

@shlomi-noach
Copy link
Contributor

@zmoazeni thank you for working on this. I'm ambivalent and am more inclined to run a RESET MASTER... operation at end of migration so as to cleanup errant GTIDs, as opposed to avoiding writes in binlog in the first place.

There is at least one crucial logic whose path goes through the binary logs: the waitForEventsUpToLock logic. I don't see this PR handling the waitForEventsUpToLock logic, and to handle it outside the binary log means changing the logic that identifies the cut-over timing.

When gh-ost identifies that all row copy is complete, it wants to cut-over. However, there can also be a few more transactions in the binary log, so it can't cut-over just yet. The logic is:

  • lock the original table for writes.
  • There can still be binlog entries for that table, of course, that must be handled. But no new binlog entries are possible.
  • gh-ost injects a "hey everything is processed" challenge
  • the migrator meanwhile keeps on applying binlog events
  • but when it sees the AllEventsUpToLockProcessed challenge in the binary log, it knows beyond doubt that all table events are processed (remember AllEventsUpToLockProcessed is injected after the table is locked).
  • (on timeout, though, a new challenge is required)
  • if all is well, proceed to cut-over.

On a test-on-replica, I guess this can be worked around, but... it will change the logic for the cut-over. And that would make the testing less reliable. test-on-replica really works through the entire logic...

"strings"
)

type GTIDSet struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. I may be missing something, but I couldn't figure out a clean way to use gtid_subtract and still have the functionality we're looking for. So I left it in.

go/logic/applier.go Outdated Show resolved Hide resolved
@zmoazeni
Copy link
Contributor Author

I'm ambivalent and am more inclined to run a RESET MASTER... operation at end of migration so as to cleanup errant GTIDs, as opposed to avoiding writes in binlog in the first place.

Ah right on. I can drop the sql_log_bin=OFF business. At this point I think it's just space saver.

There is at least one crucial logic whose path goes through the binary logs: the waitForEventsUpToLock logic. I don't see this PR handling the waitForEventsUpToLock logic, and to handle it outside the binary log means changing the logic that identifies the cut-over timing.

This all might be moot if I drop the sql_log_bin=OFF wrapping, but I believe I'm explicitly allowing those to write to the binary log via https://github.com/github/gh-ost/pull/759/files#diff-55e7b291008bc31a7fac46abdad84feaR293 - the tests would hang/wait at the cutover phase without that line.

Unless I'm missing something important here. Are you saying that by excluding binary log writing of the other statements perhaps interferes with the locking logic and assumptions?

@shlomi-noach
Copy link
Contributor

shlomi-noach commented Jun 24, 2019

Unless I'm missing something important here. Are you saying that by excluding binary log writing of the other statements perhaps interferes with the locking logic and assumptions?

What I was saying:

  • You can get rid of many binlog entries
  • But you can't get rid of all of them. There's a few which are required for the correct operation of gh-ost.
  • So you end up with errant GTID anyway.
  • And so you still need to e.g. RESET MASTER
  • In which case why go through the effort of all the sql_log_bin=OFF stuff.

The above is my preferred view/opinion. It could be possible to get rid of all binlog entries, but at the cost of changing the cut-over logic; something I'm not keen on.

@zmoazeni
Copy link
Contributor Author

In which case why go through the effort of all the sql_log_bin=OFF stuff.

Ahh I see your point. I was doing that to help #149:

The specific situation I am in is that I am compressing some large tables. Based on small tests and estimates, we should have enough space to complete the compression of one table without running out of disk space. I had intended on running the migration on the passive master, which is taking no traffic and could run out of disk space without any negative issues. With the migration being written to the binlogs, though, this would also cause the active master to run out of space as well.

I read that as: they have enough space on the server to duplicate the table but when we also account for the extra space in the binary log, things go haywire for that person.

This PR would allow all the changelog events to still write to the binary log (which I believe appropriately handles waitForEventsUpToLock), but I need to do a bit more reading on it to make sure I understand it.

The above is my preferred view/opinion. It could be possible to get rid of all binlog entries, but at the cost of changing the cut-over logic; something I'm not keen on.

I totally agree with you here. I'm not wanting to change over the cut-over logic. I'll spend some more time reading up on waitForEventsUpToLock to see if I screwed up something here. At the very least I could pare this PR down to not do any sql_log_bin=OFF and could re-introduce that in a separate PR, treating that as a separate concern (optimizing disk space).

go/base/context.go Outdated Show resolved Hide resolved
@zmoazeni
Copy link
Contributor Author

zmoazeni commented Jul 8, 2019

Thanks for the feedback gang. I've been taking some vacation time lately. Will jump back in soon-ish to make changes based on the feedback! I believe the next iteration will be a legit non-draft PR.

We can't fully stop all binary writing. But we can opt out of the vast
majority of statements. The main one we need is the final changelog
event to know when we can cut over.

Users can only use this switch for in a few scenarios:
--test-on-replica, --migrate-on-replica, or --allow-on-master
@zmoazeni zmoazeni marked this pull request as ready for review August 27, 2019 16:33
@zmoazeni zmoazeni changed the title WIP Opt out of binlogs Opt out of binlogs Aug 27, 2019
@zmoazeni
Copy link
Contributor Author

Hrm. There may be a timing issue. The test that is failing right now https://github.com/github/gh-ost/pull/759/checks?check_run_id=204694461 passes locally, but I can't seem to re-run without pushing a new commit. (I don't see a button to kick off a rebuild)

@zmoazeni
Copy link
Contributor Author

Closing unmerged. I have since moved my focus elsewhere.

@zmoazeni zmoazeni closed this Sep 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants