
Remove automatic retries in case of errors #132

Merged

Conversation

graywolf-at-work
Contributor

Currently this gem tries to automatically retry against the next (or the same)
endpoint in case of an error. This has two issues: stack overflow (the retry is
implemented via recursion) and timeouts not being respected. The stack overflow
is solvable, but the timeout issue is not, given the current architecture.

The cleanest solution is to simply rotate the endpoints and move the
responsibility for retrying to the calling application.

Fixes #130
Fixes #131
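
For illustration only (this sketch is not part of the PR; the endpoint URLs, key names, and exception classes are assumptions about the gem's and grpc's public API), moving retry to the caller could look roughly like this:

require 'etcdv3'

conn = Etcdv3.new(endpoints: 'http://etcd01:2379,http://etcd02:2379')

# The caller now owns the retry policy and the overall deadline; each failed
# attempt sees the next endpoint instead of the gem recursing internally.
attempts = 0
begin
  conn.put('foo', 'bar')
rescue GRPC::Unavailable, GRPC::DeadlineExceeded
  attempts += 1
  retry if attempts < 3
  raise
end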

@codecov

codecov bot commented Feb 11, 2019

Codecov Report

Merging #132 into master will increase coverage by 0.07%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
+ Coverage   99.08%   99.15%   +0.07%     
==========================================
  Files          30       30              
  Lines        1639     1662      +23     
==========================================
+ Hits         1624     1648      +24     
+ Misses         15       14       -1
Impacted Files Coverage Δ
spec/etcdv3/connection_wrapper_spec.rb 100% <100%> (ø) ⬆️
spec/helpers/connections.rb 93.75% <100%> (ø) ⬆️
lib/etcdv3/connection_wrapper.rb 97.29% <100%> (+3.54%) ⬆️
lib/etcdv3.rb 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3a542e8...27153fa.

@graywolf-at-work force-pushed the dont_use_recursion_for_reconnect branch from 5c3943d to 685f0ac on February 11, 2019 at 14:20
@graywolf-at-work
Contributor Author

Looks like the Travis build passed (https://travis-ci.org/davissp14/etcdv3-ruby/), but it's not updated here for some reason?

@graywolf-at-work
Contributor Author

Could we possibly get some feedback on this PR? :)

@davissp14
Owner

davissp14 commented Feb 14, 2019 via email

@davissp14
Owner

davissp14 commented Feb 19, 2019

The stack overflow is solvable, but the timeout issue is not, given the current architecture.

What's currently in place should align pretty closely with what etcd defines as the command-timeout. The dial-timeout, which I think you are referring to, hasn't been implemented yet. The generic naming of timeout probably makes things a little confusing.

Would the implementation of a dial-based timeout address your major concerns, aside from the obvious stack-overflow issue?

--command-timeout=5s -  timeout for short running command (excluding dial timeout)
--dial-timeout=2s   -  dial timeout for client connections
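
To make the distinction concrete, a rough sketch of how these map onto the gem (not from this PR; the command_timeout option and per-call timeout: are assumptions based on the gem's documented options and the examples later in this thread):

require 'etcdv3'

# command_timeout caps how long each RPC may run once a connection exists,
# roughly etcdctl's --command-timeout.
conn = Etcdv3.new(endpoints: 'http://127.0.0.1:2379', command_timeout: 5)

begin
  # Individual calls can still pass their own timeout, as the watch examples
  # later in this thread do.
  conn.watch('foo', timeout: 1)
rescue GRPC::DeadlineExceeded
  puts 'command timed out'
end

# A dial-style timeout (etcdctl's --dial-timeout) would bound only the time
# spent establishing the connection; it is not implemented yet, so there is
# nothing to show for it here.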

@graywolf-at-work
Contributor Author

graywolf-at-work commented Feb 20, 2019

The problem is that the behaviour we want is for a call to etcd to finish within X seconds. Sure, if X is for example 2, I could give it command_timeout: 1, dial_timeout: 1. That would probably work fine for .put or .get.

But I'm not sure how to make it work for e.g. the watch API. I want the watch to take at most 2 seconds and spend as much of that as possible actually watching. I could play around with something like command_timeout: 1.75, dial_timeout: 0.25, but that just seems fragile, guessing how low the dial timeout can go.

How would you recommend implementing this using the two timeouts you propose?

EDIT: Also, based on the grpc documentation, it seems that timeout currently works more like --total-timeout from etcdctl v2 than --command-timeout from etcdctl v3 (at least when there is just one endpoint). But it's possible I'm wrong.

@davissp14
Owner

davissp14 commented Feb 20, 2019

But I'm not sure how to make it work for e.g. the watch API. I want the watch to take at most 2 seconds and spend as much of that as possible actually watching. I could play around with something like command_timeout: 1.75, dial_timeout: 0.25, but that just seems fragile, guessing how low the dial timeout can go.

If you're looking for a strict holistic timeout that includes both the time it takes to dial the connection and the execution time of the command, I would probably just wrap the command in a Timeout block within your code.
http://ruby-doc.org/stdlib-2.6/libdoc/timeout/rdoc/Timeout.html

There are simply too many factors that could affect the time it takes to establish a new connection, and none of these factors are related to the watch command itself. So I think it would be unexpected in the general case for a watch(timeout=5) command to only run for 1 second because it took 4 seconds to prep the connection.

The dial_timeout implementation should be set on initialization and isn't something that should be dynamically set on a per-command basis.

I hope that makes sense!

@graywolf-at-work
Contributor Author

I'm looking into Timeout.timeout and the results look weird:

+[dockwolf@test devel]$ cat test.rb
require "etcdv3"
require "timeout"

conn = Etcdv3.new(endpoints: 'http://etcd01.de.showmax.io:2379')

begin
        Timeout.timeout(1) do
                conn.watch("foo", timeout: 5)
        end
rescue GRPC::DeadlineExceeded, Timeout::Error
        puts "Deadline"
end
+[dockwolf@test devel]$ time ruby test.rb
Deadline
E0221 08:55:18.278105569     252 backup_poller.cc:110]       run_poller: {"created":"@1550739318.278025670","description":"Shutting down timer system","file":"src/core/lib/iomgr/timer_generic.cc","file_line":704}

real    0m11.178s
user    0m0.129s
sys     0m0.020s

It takes 11 seconds, compared to the expected ~1 second. And there is the run_poller: line in the console.

Compared with using timeout: 1 instead:

+[dockwolf@test devel]$ cat test.rb
require "etcdv3"

conn = Etcdv3.new(endpoints: 'http://etcd01.de.showmax.io:2379')

begin
        conn.watch("foo", timeout: 1)
rescue GRPC::DeadlineExceeded
        puts "Deadline"
end
+[dockwolf@test devel]$ time ruby test.rb
Deadline

real    0m1.126s
user    0m0.127s
sys     0m0.007s

That looks much more like what I expected.

Could you point me in the right direction of what I'm doing wrong?

@davissp14
Owner

@graywolf-at-work

Your results will be off because you're timing the execution of the entire script, rather than just the timeout block.

#!/usr/bin/env ruby

require 'etcdv3'
require 'timeout'
require 'benchmark'

cl = Etcdv3.new(endpoints: "http://localhost:2379")

Benchmark.bm do |x|
  # 1 second timeout
  x.report do
    begin
      Timeout::timeout(1) do
        cl.watch('foo')
      end
    rescue Timeout::Error
    end
  end

  # 5 second timeout
  x.report do
    begin
      Timeout::timeout(5) do
        cl.watch('foo2')
      end
    rescue Timeout::Error
    end
  end

  # 10 second timeout
  x.report do
    begin
      Timeout::timeout(10) do
        cl.watch('foo3')
      end
    rescue Timeout::Error
    end
  end

end

Results

$ ruby timeout_test.rb
       user     system      total        real
   0.010000   0.010000   0.020000 (  1.007412)
   0.000000   0.010000   0.010000 (  5.023335)
   0.010000   0.010000   0.020000 ( 10.019187)

@dcepelik
Contributor

Hello again @davissp14,

So I think it would be unexpected in the general case for a watch(timeout=5) command to only run for 1 second because it took 4 seconds to prep the connection.

I believe this is the source of confusion.

In our perception, a method which takes a timeout argument signals to the caller that the function will not delay them, for any reason, by more than timeout seconds. Should the time-out occur, execution will be gracefully aborted and control returned to the caller to deal with the situation.

(It can usually delay them a little bit longer, but that extra overhead should be constant and cannot practically be avoided, because you have to return from the function, perhaps do some cleanup, etc.)

Providing strict guarantees to the caller is a great property for code to have, because other code which needs to provide strict guarantees can then be built on top of it. Conversely, you cannot call code which isn't dependable from something that ought to be dependable.

If you're looking for a strict holistic timeout that includes both the time it takes to dial the connection and execution time for the command, I would probably just wrap the command in a Timeout block within your code.

Yes, you can achieve something that looks similar using Timeout#timeout, but you don't get the same guarantees.

Since the lower layer (GRPC in this case) is ultimately the part which implements the timeout behavior, it has enough information to properly clean up resources after failed connection attempts and unsuccessful commands. With Timeout#timeout, on the other hand, the exception can be raised literally anywhere, possibly in native code, and to the best of my knowledge that is never a really safe way to cancel.

(There are great articles which describe these issues in detail. Even if Timeout#timeout worked flawlessly, it's a complicated device I'd rather avoid, because it may not work tomorrow.)


But I have to say that I understand your point too. The reconnect logic comes in handy a lot of the time when you don't care all that much about having proper time-outs everywhere, and having to retry yourself can be a bit of an overkill.

Hence I'd propose a compromise: keep the current behavior and add an allow_reconnect argument to Etcdv3#new, defaulting to true. true would basically mean the current behavior (a command may not take longer than timeout seconds, but if the reconnection logic kicks in, it can add an arbitrary delay).

If allow_reconnect is set to false and the current etcd connection is bad, the endpoints will be rotated and an exception will be raised, returning control to the caller to either retry or give up, whichever works better for them at the moment (because they have the information needed to decide).
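
To make the proposal concrete, a hypothetical sketch (the flag, endpoint URLs, and exception classes here are assumptions for illustration, not the final merged API):

require 'etcdv3'

# allow_reconnect: true (the proposed default) would keep today's behaviour:
# a bad connection is transparently re-established, which can delay the call
# past its timeout.
#
# allow_reconnect: false would rotate the endpoints and raise instead,
# handing control back to the caller.
conn = Etcdv3.new(endpoints: 'http://etcd01:2379,http://etcd02:2379',
                  allow_reconnect: false)

begin
  conn.watch('foo', timeout: 2)
rescue GRPC::Unavailable, GRPC::DeadlineExceeded
  # Retry, log, or give up; either way control is back within roughly 2 seconds.
end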

Would you find that acceptable?

In any case, thank you very much.

@dcepelik
Contributor

dcepelik commented Mar 4, 2019

@davissp14 Hi Shaun, would you be willing to further discuss this, please? The proposed behavior in my last comment is IMHO a nice compromise and we're blocked on this.

As usual, thanks for the time and energy you put into maintenance of this gem!

@davissp14
Owner

@dcepelik Sorry for the delay, life has been a little crazy as of late. The allow_reconnect flag sounds like a solid plan to me!

@graywolf-at-work
Contributor Author

Will implement tomorrow, thank you for greenlighting :)

Tomas Volf added 4 commits March 6, 2019 16:41
Sometimes auto-retrying calls is not a good idea. There might be user
requirements on the precision of timeouts (they don't really work if
multiple endpoints are specified) or on how failures are handled (and logged).
This commit adds new flags that allow changing the default behaviour so that
calls are not auto-retried.
@graywolf-at-work
Contributor Author

Is it acceptable like this? If yes, could we also ask for a release again?

If further changes are required, let me know.

@dcepelik please also have a look.

Owner

@davissp14 left a comment


Overall, looks pretty good. A few things need to be addressed though.

Review comments (resolved) on: lib/etcdv3.rb, lib/etcdv3/connection_wrapper.rb, README.md (outdated), README.md
@graywolf-at-work
Contributor Author

I think that should be all? :)

Owner

@davissp14 left a comment


Few more small comments.

Review comments (outdated, resolved) on: lib/etcdv3/connection_wrapper.rb, lib/etcdv3.rb
Owner

@davissp14 left a comment


LGTM

@davissp14 merged commit b6ba43c into davissp14:master on Mar 11, 2019
@davissp14
Owner

Thanks for working through this! Nice work! Will get a release up today.

@graywolf-at-work deleted the dont_use_recursion_for_reconnect branch on March 13, 2019 at 14:50
@graywolf-at-work
Contributor Author

Just a gentle reminder, if you would be so kind as to do the release for us :) Thank you

@davissp14
Owner

Sorry for the delay, just pushed it up. Thanks again!
