
Commit

MISC - fixed alignment between checkpoint and config in README
MidoAssran committed Feb 19, 2024
1 parent 2dfdfac commit b1b3bb7
Showing 1 changed file with 10 additions and 10 deletions.
README.md: 10 additions, 10 deletions
@@ -1,6 +1,6 @@
# V-JEPA: Video Joint Embedding Predictive Architecture

-Official PyTorch codebase for the *video joint-embedding predictive architecture*, V-JEPA, a method for self-supervised learning of visual representations from video.
+Official PyTorch codebase for the _video joint-embedding predictive architecture_, V-JEPA, a method for self-supervised learning of visual representations from video.

**[Meta AI Research, FAIR](https://ai.facebook.com/research/)**

@@ -23,7 +23,7 @@ V-JEPA pretraining is based solely on an unsupervised feature prediction objective

## Visualizations
As opposed to generative methods that have a pixel decoder, V-JEPA has a predictor that makes predictions in latent space.
-We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process.
+We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process.
The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video.

The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video.
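
As a side note on the decoding setup described above (frozen encoder and predictor, a separately trained decoder that only sees the predicted features for the masked regions): the snippet below is a minimal, self-contained sketch of that idea, not the repository's code. All module names, shapes, and index choices are made up, and a plain regression head stands in for the conditional diffusion decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the pretrained V-JEPA encoder and predictor;
# random linear layers here so the sketch runs end to end.
feat_dim, patch_pixels, num_patches = 128, 16 * 16 * 3, 64
encoder = nn.Linear(patch_pixels, feat_dim)
predictor = nn.Linear(feat_dim, feat_dim)
for p in list(encoder.parameters()) + list(predictor.parameters()):
    p.requires_grad_(False)  # both pretrained networks stay frozen

# Plain regression head standing in for the conditional diffusion decoder.
decoder = nn.Sequential(
    nn.Linear(feat_dim, 4 * feat_dim), nn.GELU(), nn.Linear(4 * feat_dim, patch_pixels)
)
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

video_patches = torch.randn(2, num_patches, patch_pixels)  # [B, N, P] dummy clip
ctx_idx = torch.arange(0, 48)    # visible (unmasked) patches
tgt_idx = torch.arange(48, 64)   # masked patches whose features get predicted

with torch.no_grad():  # no gradients ever reach encoder or predictor
    ctx_feats = encoder(video_patches[:, ctx_idx])
    pred_feats = predictor(ctx_feats)[:, : tgt_idx.numel()]  # crude stand-in for masked-token prediction

# The decoder is fed only the predicted features for the missing regions,
# never the unmasked pixels, and is trained to reconstruct those regions.
recon = decoder(pred_feats)
loss = F.mse_loss(recon, video_patches[:, tgt_idx])
opt.zero_grad()
loss.backward()
opt.step()
print(f"decoder step loss: {loss.item():.4f}")
```

The only point of the sketch is the training setup: no gradient reaches the pretrained networks, and the decoder never receives unmasked pixels.
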
@@ -96,7 +96,7 @@ The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal
<td>224x224</td>
<td>80.8</td>
<td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/k400-probe.pth.tar">attentive probe checkpoint</a></td>
<td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_k400_16x8x3.yaml">configs</a></td>
<td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_k400_16x8x3.yaml">configs</a></td>
</tr>
<tr>
<td>ViT-H/16</td>
@@ -107,10 +107,10 @@ The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal
</tr>
<tr>
<td>ViT-H/16</td>
-<td>224x224</td>
+<td>384x384</td>
<td>81.9</td>
<td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/k400-probe.pth.tar">attentive probe checkpoint</a></td>
<td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_k400_16x8x3.yaml">configs</a></td>
<td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_k400_16x8x3.yaml">configs</a></td>
</tr>
</table>
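
As a quick sanity check that a downloaded probe matches the intended config, the checkpoint file can be opened with plain PyTorch. The sketch below only lists what the file contains; the dictionary layout is not documented here, and the filename is whatever the download was saved as.

```python
import torch

# Peek inside a downloaded attentive-probe checkpoint, e.g. the file fetched from
# https://dl.fbaipublicfiles.com/jepa/vitl16/k400-probe.pth.tar.
# Depending on your torch version you may need to pass weights_only=False.
ckpt = torch.load("k400-probe.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        shape = getattr(value, "shape", None)
        print(key, tuple(shape) if shape is not None else type(value).__name__)
else:
    print(type(ckpt))
```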

@@ -171,7 +171,7 @@ The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal
</tr>
<tr>
<td>ViT-H/16</td>
-<td>224x224</td>
+<td>384x384</td>
<td>77.4</td>
<td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/in1k-probe.pth.tar">attentive probe checkpoint</a></td>
<td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_in1k.yaml">configs</a></td>
@@ -203,7 +203,7 @@ The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal
</tr>
<tr>
<td>ViT-H/16</td>
-<td>224x224</td>
+<td>384x384</td>
<td>62.8</td>
<td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/places-probe.pth.tar">attentive probe checkpoint</a></td>
<td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_places.yaml">configs</a></td>
@@ -235,7 +235,7 @@ The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal
</tr>
<tr>
<td>ViT-H/16</td>
-<td>224x224</td>
+<td>384x384</td>
<td>72.6</td>
<td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/inat-probe.pth.tar">attentive probe checkpoint</a></td>
<td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_inat.yaml">configs</a></td>
@@ -274,14 +274,14 @@ All experiment parameters are specified in config files (as opposed to command-line

### Video Datasets
V-JEPA pretraining and evaluations works with many standard video formats.
-To make a video dataset compatible with the V-JEPA codebase, you simply need to create a ```.csv``` file with the following format, and then specify the path to this csv file in your config.
+To make a video dataset compatible with the V-JEPA codebase, you simply need to create a `.csv` file with the following format, and then specify the path to this csv file in your config.
```
/absolute_file_path.[mp4, webvid, etc.] $integer_class_label
/absolute_file_path.[mp4, webvid, etc.] $integer_class_label
/absolute_file_path.[mp4, webvid, etc.] $integer_class_label
...
```
-Since V-JEPA is entirely unsupervised, the pretraining code will disregard the ```$integer_class_label``` in the csv file.
+Since V-JEPA is entirely unsupervised, the pretraining code will disregard the `$integer_class_label` in the csv file.
Thus, feel free to put a random value in this column.
However, if you wish to run a supervised video classification evaluation on your video dataset, you must replace ```$integer_class_label``` with the ground truth label for each video.
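
As a concrete illustration, a folder of videos can be converted into a csv of this form with a few lines of Python. This is only a sketch: the one-subfolder-per-class layout, the paths, and the label assignment below are assumptions, not something the codebase requires.

```python
import csv
from pathlib import Path

# Build the space-separated csv described above from a folder of videos.
# Assumed (hypothetical) layout: one subfolder per class, e.g. videos/<class_name>/clip.mp4.
video_root = Path("/path/to/videos")  # placeholder
class_names = sorted(d.name for d in video_root.iterdir() if d.is_dir())
label_of = {name: i for i, name in enumerate(class_names)}

with open("my_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=" ")
    for class_dir in sorted(video_root.iterdir()):
        if not class_dir.is_dir():
            continue
        for video in sorted(class_dir.glob("*.mp4")):
            # Pretraining ignores the label, so any integer works there;
            # for a supervised evaluation it must be the ground-truth class.
            writer.writerow([str(video.resolve()), label_of[class_dir.name]])
```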

