# You Only Watch Once (YOWO)

PyTorch implementation of the article "You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization". ***Code will be uploaded soon!***

In this work, we present ***YOWO*** (***Y**ou **O**nly **W**atch **O**nce*), a unified CNN architecture for real-time spatiotemporal action localization in video streams. *YOWO* is a single-stage framework: the input is a clip consisting of several successive frames of a video, while the output predicts bounding box positions as well as the corresponding class labels for the current frame. Afterwards, with a specific strategy, these detections can be linked together to generate *Action Tubes* over the whole video.

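Since the code has not been uploaded yet, the snippet below is only a rough PyTorch sketch of this single-stage interface: a clip of successive frames goes in, and a dense grid of box coordinates, confidence scores and action-class scores for the current frame comes out. The layer sizes, the 7x7 output grid, the anchor count and the 24-class head are illustrative assumptions, not the actual YOWO network.

```python
import torch
import torch.nn as nn

class ClipDetectorSketch(nn.Module):
    """Toy stand-in for a single-stage clip detector (not the YOWO architecture itself)."""

    def __init__(self, num_classes=24, num_anchors=5):
        super().__init__()
        # Toy 3D-CNN backbone over the whole clip (B, 3, T, H, W).
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Collapse the temporal axis so the prediction refers to the current frame only.
        self.pool = nn.AdaptiveAvgPool3d((1, 7, 7))
        # Per grid cell: num_anchors * (4 box coords + 1 confidence + num_classes scores).
        self.head = nn.Conv2d(64, num_anchors * (5 + num_classes), kernel_size=1)

    def forward(self, clip):
        feat = self.pool(self.backbone(clip)).squeeze(2)  # (B, 64, 7, 7)
        return self.head(feat)                            # (B, A * (5 + C), 7, 7)

clip = torch.randn(1, 3, 16, 224, 224)  # one 16-frame RGB input clip
preds = ClipDetectorSketch()(clip)
print(preds.shape)  # torch.Size([1, 145, 7, 7]); decoding and linking these yields action tubes
```
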
Since we do not separate the human detection and action classification procedures, the whole network can be optimized by a joint loss in an end-to-end framework. We have carried out a series of comparative evaluations on two challenging, representative datasets, **UCF101-24** and **J-HMDB-21**. Our approach outperforms other state-of-the-art results while retaining real-time capability, providing 34 frames per second on 16-frame input clips and 62 frames per second on 8-frame input clips.

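To make the idea of a joint, end-to-end objective concrete, here is a minimal sketch assuming a simple weighted sum of a box-regression term and an action-classification term; the actual loss terms and weights used by YOWO are defined in the paper and the forthcoming code, not here.

```python
import torch
import torch.nn as nn

# Assumed loss forms, for illustration only.
box_loss = nn.SmoothL1Loss()      # localization term
cls_loss = nn.CrossEntropyLoss()  # action-classification term

def joint_loss(pred_boxes, gt_boxes, pred_logits, gt_labels, lam=1.0):
    """One scalar objective, so detection and classification are trained together end to end."""
    return box_loss(pred_boxes, gt_boxes) + lam * cls_loss(pred_logits, gt_labels)

# Toy tensors with plausible shapes: 8 matched boxes, 24 action classes (as in UCF101-24).
pred_boxes = torch.randn(8, 4, requires_grad=True)
pred_logits = torch.randn(8, 24, requires_grad=True)
loss = joint_loss(pred_boxes, torch.rand(8, 4), pred_logits, torch.randint(0, 24, (8,)))
loss.backward()  # a single backward pass reaches both the box and the class branches
```
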
We show some detection results obtained with our framework below.

<div align="center" style="width:image width px;"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/biking.gif" width=300 alt="biking"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/fencing.gif" width=300 alt="fencing"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/biking.gif" width=240 alt="biking"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/fencing.gif" width=240 alt="fencing"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/golf_swing.gif" width=240 alt="golf-swing"> | ||
</div> | ||
|
||
<div align="center" style="width:image width px;"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/golf_swing.gif" width=300 alt="golf-swing"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/catch.gif" width=300 alt="catch"> | ||
<div align="center" style="width:image width px;"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/catch.gif" width=240 alt="catch"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/brush_hair.gif" width=240 alt="brush-hair"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/pull_up.gif" width=240 alt="pull-up"> | ||
</div> | ||
|
||
<div align="center" style="width:image width px;"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/brush_hair.gif" width=300 alt="brush-hair"> | ||
<img src="https://github.com/wei-tim/YOWO/blob/master/examples/pull_up.gif" width=300 alt="pull-up"> | ||
</div> | ||