Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-4021: Solve Poll timeout failure caused by POLLNVAL #1680

Open
wants to merge 1 commit into
base: branch-3.7
Choose a base branch
from

Conversation

qwedsazzcc
Copy link

Zookeeper c client request server by dns , when the zookeeper server and dns server down at the same time , c client poll will set revent to POLLNVAL . At this time The parameter timeout of poll is not valid. it cause very frequent requests and high cpu use

@qwedsazzcc qwedsazzcc changed the title Solve Poll timeout failure caused by POLLNVAL ZOOKEEPER-4021: Solve Poll timeout failure caused by POLLNVAL Apr 13, 2021
@maoling
Copy link
Member

maoling commented Jul 1, 2021

Copy link
Contributor

@ztzg ztzg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @qwedsazzcc,

Thank you for your investigation and contribution. I understand what you are reporting, and imagine that your patch effectively makes the 100% CPU usage situation disappear—but unless I am missing something, it does not address the root cause.

Unfortunately, I have been unable to reproduce the specific sequence you are observing.

According to POSIX, POLLNVAL means "file descriptor not open," and while I could imagine some code closing the FD which is being polled, it is difficult to imagine how it would happen more than once and/or lead to a busy loop.

In your experience, is the problem deterministic? Is it "easy" to reproduce? Also: would you have log entries or some other kind of trace which might help figure out the exact conditions?

As for the error condition(s), which should indeed be handled, how about the following:

--- a/zookeeper-client/zookeeper-client-c/src/mt_adaptor.c
+++ b/zookeeper-client/zookeeper-client-c/src/mt_adaptor.c
@@ -388,7 +388,7 @@ void *do_io(void *v)
         timeout=tv.tv_sec * 1000 + (tv.tv_usec/1000);

         poll(fds,maxfd,timeout);
-        if (fd != -1) {
+        if (fd != 1 && (fds[1].revents&POLLNVAL) == 0) {
             interest=(fds[1].revents&POLLIN)?ZOOKEEPER_READ:0;
             interest|=((fds[1].revents&POLLOUT)||(fds[1].revents&POLLHUP))?ZOOKEEPER_WRITE:0;
         }

Not resetting interest should lead to check_events accessing the socket and noticing the closed state. Or does such a patch still cause 100% CPU usage?

Cheers, -D

@qwedsazzcc
Copy link
Author

Hi @qwedsazzcc,

Thank you for your investigation and contribution. I understand what you are reporting, and imagine that your patch effectively makes the 100% CPU usage situation disappear—but unless I am missing something, it does not address the root cause.

Unfortunately, I have been unable to reproduce the specific sequence you are observing.

According to POSIX, POLLNVAL means "file descriptor not open," and while I could imagine some code closing the FD which is being polled, it is difficult to imagine how it would happen more than once and/or lead to a busy loop.

In your experience, is the problem deterministic? Is it "easy" to reproduce? Also: would you have log entries or some other kind of trace which might help figure out the exact conditions?

As for the error condition(s), which should indeed be handled, how about the following:

--- a/zookeeper-client/zookeeper-client-c/src/mt_adaptor.c
+++ b/zookeeper-client/zookeeper-client-c/src/mt_adaptor.c
@@ -388,7 +388,7 @@ void *do_io(void *v)
         timeout=tv.tv_sec * 1000 + (tv.tv_usec/1000);

         poll(fds,maxfd,timeout);
-        if (fd != -1) {
+        if (fd != 1 && (fds[1].revents&POLLNVAL) == 0) {
             interest=(fds[1].revents&POLLIN)?ZOOKEEPER_READ:0;
             interest|=((fds[1].revents&POLLOUT)||(fds[1].revents&POLLHUP))?ZOOKEEPER_WRITE:0;
         }

Not resetting interest should lead to check_events accessing the socket and noticing the closed state. Or does such a patch still cause 100% CPU usage?

Cheers, -D

The error log in console is

2021-07-09 01:31:17,352:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,374:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,385:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,417:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,432:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,442:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,458:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,476:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,492:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,509:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,521:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,538:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,552:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,570:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,587:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,604:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,621:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,632:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,647:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,658:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,674:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,683:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,702:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,717:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,735:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,750:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,760:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,778:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,793:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,813:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,831:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,841:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,858:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,873:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,892:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,905:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,929:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2021-07-09 01:31:17,938:5615(0x7fa0f9ffb700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

It's call getaddrinfo too fast cause hight cpu use .

I run zookeeper server in local , and add a dns rule in hosts (127.0.0.1 zookeeper) , my process connected zookeeper with zookeeper:2181 , remove the rule in hosts and then stop zookeeper server immediately . It's might reproduce .

I used this

 if (fd != -1 && (fds[1].revents&POLLNVAL) == 0) {
             interest=(fds[1].revents&POLLIN)?ZOOKEEPER_READ:0;
             interest|=((fds[1].revents&POLLOUT)||(fds[1].revents&POLLHUP))?ZOOKEEPER_WRITE:0;
         }

It didn't solve my problem.

Thanks for your reply

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants