DWR / DWA not synced causing issues with timeouts #70
Comments
hi @SummaNetworks , as far as I understand the issue: |
Hi, I will try to explain :), |
hi @SummaNetworks
It follows that the processEvent function is not synchronized, and that causes the issue. I could not reproduce the issue. I tried synchronizing that function and checked the impact on the legacy tests: they are OK, with no delays and no failing tests.
About the issue itself, I just wonder why the DWA comes back so fast, before the DWR has finished processing? |
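To make the suggestion above concrete, here is a minimal sketch of what serializing event processing on one lock could look like. It is not the jdiameter PeerFSMImpl code: the class name WatchdogFsmSketch, its enums, and the three-state transition table are hypothetical and heavily simplified, chosen only to show the idea that a DWA_EVENT and a TIMEOUT_EVENT can never read and update the peer state at the same time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a peer state machine.
// It does NOT reproduce the jdiameter implementation or its full state table.
final class WatchdogFsmSketch {
    enum State { OKAY, SUSPECT, DOWN }
    enum Event { DWA_EVENT, TIMEOUT_EVENT }

    private State state = State.OKAY;
    private final List<String> transitions = new ArrayList<>();

    // synchronized: only one thread at a time may read and update the state,
    // so a DWA and a timeout arriving together are handled one after the other.
    synchronized State processEvent(Event event) {
        State before = state;
        switch (event) {
            case DWA_EVENT:
                // A watchdog answer means the peer is alive (unless already down).
                if (state != State.DOWN) {
                    state = State.OKAY;
                }
                break;
            case TIMEOUT_EVENT:
                if (state == State.OKAY) {
                    state = State.SUSPECT;   // waiting for an answer
                } else if (state == State.SUSPECT) {
                    state = State.DOWN;      // no answer in time: disconnect
                }
                break;
        }
        transitions.add(event + ": " + before + " -> " + state);
        return state;
    }

    synchronized State getState() {
        return state;
    }

    synchronized List<String> getTransitions() {
        return new ArrayList<>(transitions);
    }

    public static void main(String[] args) {
        WatchdogFsmSketch fsm = new WatchdogFsmSketch();
        fsm.processEvent(Event.TIMEOUT_EVENT);
        fsm.processEvent(Event.DWA_EVENT);
        for (String t : fsm.getTransitions()) {
            System.out.println(t);
        }
    }
}
```

Locking only guarantees that the two events are applied one at a time. It does not decide which of them should win when they arrive within the same millisecond, so the ordering question raised in this thread is still open.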
Hi xphudin!!!
At 10:54:31,568 we receive a DWA_EVENT:
2016-11-08 10:54:31,568 DEBUG SCTPTransportServer$ServerAssociationListener SCTP Server received a message of length: [72]
At 10:54:59,631 we send a DWR_EVENT over the SCTP connection:
2016-11-08 10:54:59,630 DEBUG [PeerImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=DOWN; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca; incCon{aaa://10.225.164.62:43395=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@6ed51aca} }_1-1) Send DWR message
At 10:54:59,635 we receive the DWA_EVENT.
At 10:55:31,686 we send a DWR_EVENT, but you can check in the pcap file that this DWR_EVENT doesn't appear. Any clue?
2016-11-08 10:55:31,686 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Sending timeout event
Since we get no DWA_EVENT in response to this ghost DWR_EVENT, the timer tries to send a new DWR_EVENT at 10:55:41,701, but as you can see in the pcap file this event doesn't appear either:
2016-11-08 10:55:41,701 DEBUG [PeerFSMImpl](FSM-SPeer{Uri=aaa://vcmS6a; State=null; con=null; incConnull }_1-0) Sending timeout event
As you can see, there are HEARTBEAT and HEARTBEAT_ACK messages on the connection, so the connection is alive, but there is no trace of the two DWR_EVENTs. Any clue?
For this test we used only one thread to avoid the "synchronization" problem, but in the end we hit the same problem: the connection SHUTDOWNs.
Thanks |
hi @SummaNetworks Did you attach the pcap file ? |
It only accepts zip files :), here it is |
Unable to replicate, please let us know if it happens again and/or any steps to reproduce. |
When a DWA is received from a peer, the timeout event and the DWA event arrive at almost the same time, which causes the peer to be disconnected. Two threads process them concurrently, so we have a race condition. You can see in the log below that two different threads are handling the DWA event and the timeout.
2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (Thread-14) Handling event with type [DWA_EVENT]
2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (Thread-14) Not performing validation to message since validator is DISABLED.
2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (Thread-14) Placing event [Event{name:DWA_EVENT, key:aaa://10.225.20.206:16018, object:MessageImpl{commandCode=280, flags=0}}] into linked blocking queue with remaining capacity: [10000].
2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_2-1) SPeer{Uri=aaa://10.225.20.206; State=OKAY; con=org.jdiameter.server.impl.io.sctp.SCTPServerConnection@162db65b; incCon{} } FSM switch state: OKAY -> SUSPECT
2016-11-02 23:50:46,477 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_10-9) Got Event [Event{name:DWA_EVENT, key:aaa://10.225.20.206:16018, object:MessageImpl{commandCode=280, flags=0}}] from Queue
2016-11-02 23:50:46,478 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_10-9) Process event [Event{name:DWA_EVENT, key:aaa://10.225.20.206:16018, object:MessageImpl{commandCode=280, flags=0}}]. Peer State is [OKAY]
2016-11-02 23:50:55,466 DEBUG [TimerTask] (SLEE-TimerFacility-thread-3) Task with id -6ee284d6:158250d0338:-7fbb is recurring, not removing it locally nor in the cluster
2016-11-02 23:50:55,466 DEBUG [TimerTask] (SLEE-TimerFacility-thread-3) Firing Timer with id -6ee284d6:158250d0338:-7fbb
2016-11-02 23:50:55,466 DEBUG [TimerFacilityTimerTask] (SLEE-TimerFacility-thread-3) Executing task with timer ID -6ee284d6:158250d0338:-7fbb
2016-11-02 23:50:55,466 DEBUG [TimerFacilityTimerTask] (SLEE-TimerFacility-thread-3) Delay till execution is 1
2016-11-02 23:50:55,466 DEBUG [TimerFacilityTimerTask] (SLEE-TimerFacility-thread-3) Remaining executions:2147483647
2016-11-02 23:51:15,519 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Sending timeout event
2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Handling event with type [TIMEOUT_EVENT]
2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Not performing validation to message since validator is DISABLED.
2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Placing event [Event{name:TIMEOUT_EVENT, key:null, object:null}] into linked blocking queue with remaining capacity: [10000].
2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Got Event [Event{name:TIMEOUT_EVENT, key:null, object:null}] from Queue
2016-11-02 23:51:15,520 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Process event [Event{name:TIMEOUT_EVENT, key:null, object:null}]. Peer State is [SUSPECT]
2016-11-02 23:51:15,520 DEBUG [SCTPServerConnection] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Disconnecting SCTP Server Connection aaa://10.225.20.206:16018
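To check which threads handle which events in a log like the one above, the thread name can be pulled out of each line. The sketch below is only a helper for reading such logs, not part of jdiameter: the class LogThreadSketch and its regular expression are assumptions based on the line layout shown here (timestamp, level, logger in square brackets, thread name in parentheses, message).

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for log lines shaped like:
//   <date> <time> <LEVEL> [<logger>] (<thread>) <message>
final class LogThreadSketch {
    // Group 1 is the thread name: everything between "(" and the first ") ".
    private static final Pattern LINE = Pattern.compile(
            "^\\S+ \\S+ \\w+\\s+\\[\\w+\\] \\((.*?)\\) .*$");

    // Returns the thread name of one log line, or null if the line does not match.
    static String threadOf(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    // Returns the distinct thread names in order of first appearance.
    static Set<String> distinctThreads(List<String> lines) {
        Set<String> threads = new LinkedHashSet<>();
        for (String line : lines) {
            String thread = threadOf(line);
            if (thread != null) {
                threads.add(thread);
            }
        }
        return threads;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "2016-11-02 23:50:46,475 DEBUG [PeerFSMImpl] (Thread-14) Handling event with type [DWA_EVENT]",
                "2016-11-02 23:51:15,519 DEBUG [PeerFSMImpl] (FSM-SPeer{Uri=aaa://10.225.20.206; State=null; con=null; incConnull }_8-7) Sending timeout event");
        for (String thread : distinctThreads(lines)) {
            System.out.println(thread);
        }
    }
}
```

Applied to the lines above, this makes it easier to see that the DWA_EVENT is taken from the queue by the thread ending in _10-9, while the TIMEOUT_EVENT is raised and processed by the thread ending in _8-7.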