Dear NVDS authors,
Thank you for publishing this outstanding work. I have some questions after reading your paper. Since the depth prediction network is fixed while the stabilization network is trained, I would like to understand why there is a spatial loss term L(t-1). As I understand it, at inference the stabilization network takes four depth inputs and outputs the depth for the target frame only, without explicitly producing the depth for t-1. So why does training include a spatial loss term L(t-1)? Does the stabilization network output stabilized depths for all four frames at once? If not, does each training step run the network twice before the backward pass: once on inputs t-4 to t-1 to produce the depth for t-1, and once on inputs t-3 to t to produce the depth for t, with the loss computed from both outputs?
Apart from this question, I would also like to understand how the t-1 depth used by the temporal loss is obtained during training.
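To make the "two forward passes" reading concrete, here is a minimal sketch of what I imagine a training step might look like. Everything in it is my own assumption rather than the NVDS code: `stabilize`, `l1`, and `training_losses` are hypothetical names, the averaging "network" is only a stand-in, depth maps are flattened toy lists, and the temporal term is a plain difference with no motion compensation.

```python
# Hypothetical sketch of the "two forward passes" reading; not the NVDS code.

def stabilize(window):
    """Stand-in for the stabilization network: maps 4 depth maps to one.

    It only averages the window pixel-wise; the real network is learned.
    """
    n = len(window)
    return [sum(frame[i] for frame in window) / n for i in range(len(window[0]))]


def l1(a, b):
    """Mean absolute difference between two flattened depth maps."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def training_losses(initial_depths, gt_depths, t):
    """Losses for target index t, using two overlapping 4-frame windows."""
    # Pass 1: frames t-4 .. t-1 give the stabilized depth for t-1.
    pred_prev = stabilize(initial_depths[t - 4:t])
    # Pass 2: frames t-3 .. t give the stabilized depth for t.
    pred_curr = stabilize(initial_depths[t - 3:t + 1])

    spatial_prev = l1(pred_prev, gt_depths[t - 1])  # L(t-1)
    spatial_curr = l1(pred_curr, gt_depths[t])      # L(t)
    # Placeholder temporal term: difference between consecutive outputs,
    # without any warping (an assumption made to keep the sketch short).
    temporal = l1(pred_curr, pred_prev)
    return spatial_prev, spatial_curr, temporal
```

If this is roughly what happens, both spatial terms and the temporal term would come from the same two outputs. If the network instead emits all four frames in one pass, the second call would be unnecessary.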
Thank you for your clarification.