
Remove automatic retries in case of errors #132

Merged

Conversation

graywolf-at-work
Contributor

Currently this gem tries to automatically retry against the next (or the same)
endpoint in case of an error. This has two issues: stack overflow (the retry is
implemented via recursion) and timeouts not being respected. The stack overflow
is solvable, but the timeout issue is not, given the current architecture.

The cleanest solution is to simply rotate the endpoints and move the
responsibility for retrying to the calling application.

Fixes #130
Fixes #131
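
For illustration only (this sketch is not part of the PR; the endpoint URLs, key names, and exception classes are assumptions about the gem's and grpc's public API), moving retry to the caller could look roughly like this:

require 'etcdv3'

conn = Etcdv3.new(endpoints: 'http://etcd01:2379,http://etcd02:2379')

# The caller now owns the retry policy and the overall deadline; each failed
# attempt sees the next endpoint instead of the gem recursing internally.
attempts = 0
begin
  conn.put('foo', 'bar')
rescue GRPC::Unavailable, GRPC::DeadlineExceeded
  attempts += 1
  retry if attempts < 3
  raise
end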

@codecov

codecov bot commented Feb 11, 2019

Codecov Report

Merging #132 into master will increase coverage by 0.07%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
+ Coverage   99.08%   99.15%   +0.07%     
==========================================
  Files          30       30              
  Lines        1639     1662      +23     
==========================================
+ Hits         1624     1648      +24     
+ Misses         15       14       -1
Impacted Files Coverage Δ
spec/etcdv3/connection_wrapper_spec.rb 100% <100%> (ø) ⬆️
spec/helpers/connections.rb 93.75% <100%> (ø) ⬆️
lib/etcdv3/connection_wrapper.rb 97.29% <100%> (+3.54%) ⬆️
lib/etcdv3.rb 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3a542e8...27153fa.

@graywolf-at-work force-pushed the dont_use_recursion_for_reconnect branch from 5c3943d to 685f0ac on February 11, 2019 at 14:20
@graywolf-at-work
Contributor Author

Looks like the Travis build passed (https://travis-ci.org/davissp14/etcdv3-ruby/), but it's not updated here for some reason?

@graywolf-at-work
Contributor Author

Could we possibly get some feedback on this PR? :)

@davissp14
Owner

davissp14 commented Feb 14, 2019 via email

@davissp14
Owner

davissp14 commented Feb 19, 2019

The stack overflow is solvable, but the timeout issue is not, given the current architecture.

What's currently in place should align pretty closely with what etcd defines as the command-timeout. The dial-timeout, which I think you are referring to, hasn't been implemented yet. The generic naming of timeout probably makes things a little confusing.

Would the implementation of a dial-based timeout address your major concerns, aside from the obvious stack-overflow issue?

--command-timeout=5s -  timeout for short running command (excluding dial timeout)
--dial-timeout=2s   -  dial timeout for client connections
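
To make the distinction concrete, a rough sketch of how these map onto the gem (not from this PR; the command_timeout option and per-call timeout: are assumptions based on the gem's documented options and the examples later in this thread):

require 'etcdv3'

# command_timeout caps how long each RPC may run once a connection exists,
# roughly etcdctl's --command-timeout.
conn = Etcdv3.new(endpoints: 'http://127.0.0.1:2379', command_timeout: 5)

begin
  # Individual calls can still pass their own timeout, as the watch examples
  # later in this thread do.
  conn.watch('foo', timeout: 1)
rescue GRPC::DeadlineExceeded
  puts 'command timed out'
end

# A dial-style timeout (etcdctl's --dial-timeout) would bound only the time
# spent establishing the connection; it is not implemented yet, so there is
# nothing to show for it here.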

@graywolf-at-work
Contributor Author

graywolf-at-work commented Feb 20, 2019

The problem is that the behaviour we want is for a call to etcd to finish within X seconds. Sure, if X is for example 2, I could give it command_timeout: 1, dial_timeout: 1. That would probably work fine for .put or .get.

But I'm not sure how to make it work for e.g. the watch API. I want the watch to take at most 2 seconds and spend as much of that as possible actually watching. I could play around with something like command_timeout: 1.75, dial_timeout: 0.25, but that just seems fragile, guessing how low the dial timeout can go.

How would you recommend implementing this using the two timeouts you propose?

EDIT: Also, based on the grpc documentation, it seems that timeout currently works more like --total-timeout from etcdctl v2 than --command-timeout from etcdctl v3 (at least when there is just one endpoint). But it's possible I'm wrong.

@davissp14
Owner

davissp14 commented Feb 20, 2019

But I'm not sure how to make it work for e.g. the watch API. I want the watch to take at most 2 seconds and spend as much of that as possible actually watching. I could play around with something like command_timeout: 1.75, dial_timeout: 0.25, but that just seems fragile, guessing how low the dial timeout can go.

If you're looking for a strict holistic timeout that includes both the time it takes to dial the connection and the execution time of the command, I would probably just wrap the command in a Timeout block within your code.
http://ruby-doc.org/stdlib-2.6/libdoc/timeout/rdoc/Timeout.html

There are simply too many factors that could affect the time it takes to establish a new connection, and none of these factors are related to the watch command itself. So I think it would be unexpected in the general case for a watch(timeout=5) command to only run for 1 second because it took 4 seconds to prep the connection.

The dial_timeout implementation should be set on initialization and isn't something that should be dynamically set on a per-command basis.

I hope that makes sense!

@graywolf-at-work
Contributor Author

I'm looking into Timeout.timeout and the results look weird:

+[dockwolf@test devel]$ cat test.rb
require "etcdv3"
require "timeout"

conn = Etcdv3.new(endpoints: 'http://etcd01.de.showmax.io:2379')

begin
        Timeout.timeout(1) do
                conn.watch("foo", timeout: 5)
        end
rescue GRPC::DeadlineExceeded, Timeout::Error
        puts "Deadline"
end
+[dockwolf@test devel]$ time ruby test.rb
Deadline
E0221 08:55:18.278105569     252 backup_poller.cc:110]       run_poller: {"created":"@1550739318.278025670","description":"Shutting down timer system","file":"src/core/lib/iomgr/timer_generic.cc","file_line":704}

real    0m11.178s
user    0m0.129s
sys     0m0.020s

It takes 11 seconds, compared to the expected ~1 second. And there is the run_poller: line in the console.

Compared with using timeout: 1 instead:

+[dockwolf@test devel]$ cat test.rb
require "etcdv3"

conn = Etcdv3.new(endpoints: 'http://etcd01.de.showmax.io:2379')

begin
        conn.watch("foo", timeout: 1)
rescue GRPC::DeadlineExceeded
        puts "Deadline"
end
+[dockwolf@test devel]$ time ruby test.rb
Deadline

real    0m1.126s
user    0m0.127s
sys     0m0.007s

That looks much more like what I expected.

Could you point me in the right direction of what I'm doing wrong?

@davissp14
Owner

@graywolf-at-work

Your results will be off because you're timing the execution of the entire script, rather than just the timeout block.

#!/usr/bin/env ruby

require 'etcdv3'
require 'timeout'
require 'benchmark'

cl = Etcdv3.new(endpoints: "http://localhost:2379")

Benchmark.bm do |x|
  # 1 second timeout
  x.report do
    begin
      Timeout::timeout(1) do
        cl.watch('foo')
      end
    rescue Timeout::Error
    end
  end

  # 5 second timeout
  x.report do
    begin
      Timeout::timeout(5) do
        cl.watch('foo2')
      end
    rescue Timeout::Error
    end
  end

  # 10 second timeout
  x.report do
    begin
      Timeout::timeout(10) do
        cl.watch('foo3')
      end
    rescue Timeout::Error
    end
  end

end

Results

$ ruby timeout_test.rb
       user     system      total        real
   0.010000   0.010000   0.020000 (  1.007412)
   0.000000   0.010000   0.010000 (  5.023335)
   0.010000   0.010000   0.020000 ( 10.019187)

@dcepelik
Contributor

Hello again @davissp14,

So I think it would be unexpected in the general case for a watch(timeout=5) command to only run for 1 second because it took 4 seconds to prep the connection.

I believe this is the source of confusion.

In our perception, a method which takes a timeout argument signals to the caller that the function will not delay them, for any reason, by more than timeout seconds. Should the time-out occur, execution will be gracefully aborted and control returned to the caller to deal with the situation.

(It can usually delay them a little bit longer, but that extra overhead should be constant and cannot practically be avoided, because you have to return from the function, perhaps do some cleanup, etc.)

Providing strict guarantees to the caller is a great property for code to have, because other code which needs to provide strict guarantees can then be built on top of it. Conversely, you cannot call code which isn't dependable from something that ought to be dependable.

If you're looking for a strict holistic timeout that includes both the time it takes to dial the connection and execution time for the command, I would probably just wrap the command in a Timeout block within your code.

Yes, you can achieve something that looks similar using Timeout#timeout, but you don't get the same guarantees.

Since the lower layer (GRPC in this case) is ultimately the part which implements the timeout behavior, it has enough information to properly clean up resources after failed connection attempts and unsuccessful commands. With Timeout#timeout, on the other hand, the exception can be raised literally anywhere, possibly in native code, and to the best of my knowledge that is never a really safe way to cancel.

(There are great articles which describe these issues in detail. Even if Timeout#timeout worked flawlessly, it's a complicated device I'd rather avoid, because it may not work tomorrow.)


But I have to say that I understand your point too. The reconnect logic comes in handy a lot of the time when you don't care all that much about having proper time-outs everywhere, and having to retry yourself can be a bit of an overkill.

Hence I'd propose a compromise: keep the current behavior and add an allow_reconnect argument to Etcdv3#new, defaulting to true. true would basically mean the current behavior (a command may not take longer than timeout seconds, but if the reconnection logic kicks in, it can add an arbitrary delay).

If allow_reconnect is set to false and the current etcd connection is bad, the endpoints will be rotated and an exception will be raised, returning control to the caller to either retry or give up, whichever works better for them at the moment (because they have the information needed to decide).
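
To make the proposal concrete, a hypothetical sketch (the flag, endpoint URLs, and exception classes here are assumptions for illustration, not the final merged API):

require 'etcdv3'

# allow_reconnect: true (the proposed default) would keep today's behaviour:
# a bad connection is transparently re-established, which can delay the call
# past its timeout.
#
# allow_reconnect: false would rotate the endpoints and raise instead,
# handing control back to the caller.
conn = Etcdv3.new(endpoints: 'http://etcd01:2379,http://etcd02:2379',
                  allow_reconnect: false)

begin
  conn.watch('foo', timeout: 2)
rescue GRPC::Unavailable, GRPC::DeadlineExceeded
  # Retry, log, or give up; either way control is back within roughly 2 seconds.
end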

Would you find that acceptable?

In any case, thank you very much.

@dcepelik
Contributor

dcepelik commented Mar 4, 2019

@davissp14 Hi Shaun, would you be willing to further discuss this, please? The proposed behavior in my last comment is IMHO a nice compromise and we're blocked on this.

As usual, thanks for the time and energy you put into maintenance of this gem!

@davissp14
Owner

@dcepelik Sorry for the delay, life has been a little crazy as of late. The allow_reconnect flag sounds like a solid plan to me!

@graywolf-at-work
Contributor Author

Will implement tomorrow, thank you for greenlighting :)

Tomas Volf added 4 commits March 6, 2019 16:41
Sometimes auto-retrying calls is not a good idea. There might be user
requirements on the precision of timeouts (they don't really work if
multiple endpoints are specified) or on how failures are handled (and logged).
This commit adds new flags that allow changing the default behaviour so that
calls are not auto-retried.
@graywolf-at-work
Contributor Author

Is it acceptable like this? If yes, could we also ask for a release again?

If further changes are required, let me know.

@dcepelik please also have a look.

Owner

@davissp14 left a comment


Overall, looks pretty good. A few things need to be addressed though.

Review comments (resolved) on: lib/etcdv3.rb, lib/etcdv3/connection_wrapper.rb, README.md (outdated), README.md
@graywolf-at-work
Contributor Author

I think that should be all? :)

Owner

@davissp14 left a comment


Few more small comments.

Review comments (outdated, resolved) on: lib/etcdv3/connection_wrapper.rb, lib/etcdv3.rb
Owner

@davissp14 left a comment


LGTM

@davissp14 merged commit b6ba43c into davissp14:master on Mar 11, 2019
@davissp14
Owner

Thanks for working through this! Nice work! Will get a release up today.

@graywolf-at-work deleted the dont_use_recursion_for_reconnect branch on March 13, 2019 at 14:50
@graywolf-at-work
Contributor Author

Just a gentle reminder, if you would be so kind as to do the release for us :) Thank you

@davissp14
Owner

Sorry for the delay, just pushed it up. Thanks again!
