Modify lease keep alive call to be compatible with grpc proxy #117

Open

jcalvert wants to merge 1 commit into master from lease_keep_alive_proxy

Conversation

jcalvert

While attempting to utilize the lease_keep_alive_once method, we discovered an incompatibility when connecting to an etcd gRPC proxy. All requests would fail with a Cancelled status like:

        from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:26:in `check_status'
        from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:209:in `block in read_loop'
        from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:195:in `loop'
        from /home/jcalvert/.rvm/gems/ruby-2.4.3/gems/grpc-1.11.0-x86_64-linux/src/ruby/lib/grpc/generic/bidi_call.rb:195:in `read_loop'
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3/lease.rb:38:in `each'     
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3/lease.rb:38:in `lease_keep_alive_once'
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3/connection.rb:23:in `call'       
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3/connection_wrapper.rb:21:in `handle'
        from /home/jcalvert/etcdv3-ruby/lib/etcdv3.rb:127:in `lease_keep_alive_once'

After some extensive debugging, the culprit seems to be the EOF check the proxy makes here: when the "request" object for the bidirectional stream has completed writing, that code point is reached, the server-side receive loop returns and fires off a message to the stop channel, and so the context is marked as cancelled before the request even runs. For reference, we tried using the Go client against the etcd proxy and did not encounter this issue.

Looking at the comments here and here, the issue in this case is that signaling write completion needs to be deferred until the code has actually received a response. Doing this allows requests that are proxied via gRPC to succeed as expected.
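
To illustrate the idea, here is a simplified sketch rather than the exact change in this PR; the method signature and variable names are just for illustration, and it assumes the generated Etcdserverpb classes and a lease stub are available, as in etcdv3-ruby:

```ruby
require 'grpc'

# Sketch: keep the request enumerator open until a reply has been read, so the
# proxy never sees EOF on the write side (and cancels the context) before the
# keep-alive response has been sent back.
def lease_keep_alive_once(stub, lease_id, metadata: {})
  reply_seen = Queue.new

  # The request enumerator does not finish (i.e. does not signal "writes done")
  # until a response has actually been read back.
  requests = Enumerator.new do |yielder|
    yielder << Etcdserverpb::LeaseKeepAliveRequest.new(ID: lease_id)
    reply_seen.pop
  end

  stub.lease_keep_alive(requests, metadata: metadata).each do |reply|
    reply_seen << true # unblock the enumerator, which closes the write side
    return reply
  end
end
```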

@codecov

codecov bot commented May 21, 2018

Codecov Report

Merging #117 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
+ Coverage   98.35%   98.38%   +0.02%     
==========================================
  Files          24       24              
  Lines        1521     1545      +24     
==========================================
+ Hits         1496     1520      +24     
  Misses         25       25
Impacted Files               Coverage Δ
lib/etcdv3/lease.rb          100% <100%> (ø) ⬆️
spec/etcdv3/lease_spec.rb    100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b4373d...8335bc7. Read the comment docs.

@davissp14
Owner

davissp14 commented May 21, 2018

What version of Etcd are you using out of curiosity?

@davissp14
Owner

etcd-io/etcd#9751

@jcalvert
Author

We've tested against 3.2 and 3.3.3 as well as master, also with the latest gRPC gem.

@davissp14
Owner

davissp14 commented May 22, 2018

Good deal, I was able to reproduce the issue and am looking to work with the CoreOS guys to get this resolved upstream. If this issue is blocking you guys, I'm happy to get this merged and work to rip it out once it's fixed.

That said, I don't feel great about adding gRPC Proxy logic/specs into this project. I'd rather remain agnostic with regard to users' topology/proxy choices. That being the case, I'd ask that you remove any specs/code that relates to the gRPC Proxy and just add a comment explaining your workaround.

Does that make sense?

@jcalvert
Author

We can just run our fork, so no big rush. We can dump the proxy-specific bits you mentioned. It's a particularly odd and unexpected interaction.

@davissp14
Owner

Right on, I appreciate the help!

jcalvert force-pushed the lease_keep_alive_proxy branch from 6bf5ec7 to 1887cb0 on May 23, 2018 16:31
@jcalvert
Author

jcalvert commented Jun 5, 2018

We'll need to update this shortly because of a bug in the way it is structured.

When code makes a bidi call, the write operation runs inside a thread, which means that if the call gets a DeadlineExceeded the enumerator never completes and that write thread never exits. So when you have a poorly responding etcd server, you end up with an unbounded number of blocked threads and bad things happen. We need to ensure the enumerator completes so that the gRPC write thread exits.
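
Roughly speaking, that means wrapping the read in an ensure so the enumerator is always released. Again, this is a simplified sketch with illustrative names rather than the exact diff, with the same assumptions as the earlier snippet:

```ruby
require 'grpc'

# Sketch: always unblock the request enumerator, even when the read loop raises
# (e.g. GRPC::DeadlineExceeded), so the write thread gRPC spawns for the bidi
# call can exit instead of blocking forever.
def lease_keep_alive_once(stub, lease_id, metadata: {})
  reply_seen = Queue.new

  requests = Enumerator.new do |yielder|
    yielder << Etcdserverpb::LeaseKeepAliveRequest.new(ID: lease_id)
    reply_seen.pop # hold the write side open until we have a reply (or give up)
  end

  begin
    stub.lease_keep_alive(requests, metadata: metadata).each do |reply|
      return reply
    end
  ensure
    reply_seen << true # runs on success and on error, so the write thread finishes
  end
end
```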

@davissp14
Owner

@jcalvert Hey man, mind updating the parent etcd issue with your findings?

@mgates
Contributor

mgates commented Jun 5, 2018

I think the etcd issue is still the same - this is an additional issue with error handling of bidi calls.

jcalvert force-pushed the lease_keep_alive_proxy branch from 1887cb0 to c780849 on June 5, 2018 22:01
jcalvert force-pushed the lease_keep_alive_proxy branch from c780849 to 8335bc7 on June 6, 2018 16:23
@jcalvert
Author

jcalvert commented Jun 6, 2018

Updated again...looks like some of our forked stuff leaked over, so I cleaned that up.
