Switch to Vimeo for paper visualizations (rerun-io#3344)
Replaces all YouTube links with Vimeo links.

Closes rerun-io#3138.

Co-authored-by: Nikolaus West <[email protected]>
roym899 and nikolausWest authored Sep 19, 2023
1 parent fa5723b commit 690fb56
Showing 8 changed files with 45 additions and 37 deletions.
14 changes: 7 additions & 7 deletions examples/python/differentiable_blocks_world/README.md
@@ -6,11 +6,11 @@ thumbnail: https://static.rerun.io/fd44aa668cdebc6a4c14ff038e28f48cfb83c5ee_dbw_
thumbnail_dimensions: [480, 311]
---

Finding a textured mesh decomposition from a collection of posed images is a very challenging optimization problem. Differentiable Block Worlds by @t_monnier et al. shows impressive results using differentiable rendering. I visualized how this optimization works using the Rerun SDK.
Finding a textured mesh decomposition from a collection of posed images is a very challenging optimization problem. "Differentiable Block Worlds" by Tom Monnier et al. shows impressive results using differentiable rendering. Here we visualize how this optimization works using the Rerun SDK.

https://www.youtube.com/watch?v=Ztwak981Lqg?playlist=Ztwak981Lqg&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865326948?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:7309

In Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives the authors describe an optimization of a background icosphere, a ground plane, and multiple superquadrics. The goal is to find the shapes and textures that best explain the observations.
In "Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives" the authors describe an optimization of a background icosphere, a ground plane, and multiple superquadrics. The goal is to find the shapes and textures that best explain the observations.

<picture>
<source media="(max-width: 480px)" srcset="https://static.rerun.io/71b822942cb6ce044d6f5f177350c61f0ab31d80_dbw-overview_480w.png">
@@ -20,20 +20,20 @@ In “Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Pri
<img src="https://static.rerun.io/a8fea9769b734b2474a1e743259b3e4e68203c0f_dbw-overview_full.png" alt="">
</picture>

The optimization is initialized with an initial set of superquadrics (blocks), a ground plane, and a sphere for the background. From here, the optimization can only reduce the number of blocks, not add additional ones.
The optimization is initialized with an initial set of superquadrics ("blocks"), a ground plane, and a sphere for the background. From here, the optimization can only reduce the number of blocks, not add additional ones.

https://www.youtube.com/watch?v=bOon26Zdqpc?playlist=bOon26Zdqpc&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865327350?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6497

A key difference to other differentiable renderers is the addition of transparency handling. Each mesh has an associated opacity that is optimized. When the opacity drops below a threshold, the mesh is discarded in the visualization. This makes it possible to optimize the number of meshes.
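
To make the thresholding concrete, here is a minimal sketch of how a visualization script might skip discarded blocks; the block data, the 0.5 cutoff, and the entity paths are assumptions, and the logging uses the Rerun Python SDK's archetype-style API (rerun-sdk >= 0.9).

```python
import numpy as np
import rerun as rr  # assumes the archetype-style API (rerun-sdk >= 0.9)

rr.init("dbw_opacity_demo", spawn=True)

# Hypothetical per-block data: vertices (N x 3) and a learned opacity in [0, 1].
blocks = [
    {"name": "block_0", "vertices": np.random.rand(100, 3), "opacity": 0.9},
    {"name": "block_1", "vertices": np.random.rand(100, 3), "opacity": 0.05},
]
OPACITY_THRESHOLD = 0.5  # assumed cutoff; the paper's value may differ

for block in blocks:
    if block["opacity"] < OPACITY_THRESHOLD:
        continue  # treat the block as discarded and skip logging it
    rr.log(
        f"world/blocks/{block['name']}",
        rr.Points3D(block["vertices"], radii=0.01),
    )
```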

https://www.youtube.com/watch?v=d6LkS63eHXo?playlist=d6LkS63eHXo&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865327387?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:7037

To stabilize the optimization and avoid local minima, a 3-stage optimization is employed:
1. the texture resolution is reduced by a factor of 8,
2. the full resolution texture is optimized, and
3. transparency-based optimization is deactivated, and only the opaque meshes are optimized from here on (see the schedule sketch below).
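
As a rough illustration of such a schedule (not the paper's code), the sketch below runs three stages over a dummy texture and per-mesh opacities; the step counts, learning rate, and the placeholder `render_loss` are all assumptions.

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins: a learnable texture and per-mesh opacities.
texture = torch.rand(1, 3, 256, 256, requires_grad=True)
opacities = torch.rand(8, requires_grad=True)

def render_loss(tex, opac):
    # Placeholder for the differentiable render + photometric loss.
    return tex.mean() + opac.abs().mean()

stages = [
    {"steps": 100, "texture_scale": 1 / 8, "optimize_opacity": True},   # stage 1: low-res texture
    {"steps": 100, "texture_scale": 1.0,   "optimize_opacity": True},   # stage 2: full-res texture
    {"steps": 100, "texture_scale": 1.0,   "optimize_opacity": False},  # stage 3: opacity frozen
]

for stage in stages:
    params = [texture] + ([opacities] if stage["optimize_opacity"] else [])
    opt = torch.optim.Adam(params, lr=1e-2)
    for _ in range(stage["steps"]):
        tex = texture
        if stage["texture_scale"] != 1.0:
            tex = F.interpolate(texture, scale_factor=stage["texture_scale"], mode="bilinear")
        loss = render_loss(tex, opacities)
        opt.zero_grad()
        loss.backward()
        opt.step()
```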

https://www.youtube.com/watch?v=irxqjUGm34g?playlist=irxqjUGm34g&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865329177?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:8845

Check out the [project page](https://www.tmonnier.com/DBW/), which also contains examples of physical simulation and scene editing enabled by this kind of scene decomposition.

10 changes: 5 additions & 5 deletions examples/python/limap/README.md
@@ -8,7 +8,7 @@ thumbnail_dimensions: [480, 277]

Human-made environments contain a lot of straight lines, which are currently not exploited by most mapping approaches. With their recent work "3D Line Mapping Revisited" Shaohui Liu et al. take steps towards changing that.

https://www.youtube.com/watch?v=UdDzfxDo7UQ?playlist=UdDzfxDo7UQ&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865327785?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:5819

The work covers all stages of line-based structure-from-motion: line detection, line matching, line triangulation, track building, and joint optimization. As shown in the figure, detected points and their interaction with lines are also used to aid the reconstruction.

@@ -22,18 +22,18 @@ The work covers all stages of line-based structure-from-motion: line detection,

LIMAP matches detected 2D lines between images and computes 3D candidates for each match. These are scored, and only the best candidate is kept (green in the video). To remove duplicates and reduce noise, candidates are grouped together when they likely belong to the same line.
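
The selection and grouping logic is easy to express in plain Python; the candidate data, scoring, and merge tolerance below are illustrative assumptions rather than LIMAP's actual implementation.

```python
import numpy as np

# Hypothetical 3D line candidates per 2D match: each is (start_xyz, end_xyz, score).
candidates = {
    "match_0": [(np.zeros(3), np.ones(3), 0.9), (np.zeros(3), np.array([1.0, 1.1, 0.9]), 0.4)],
    "match_1": [(np.ones(3), 2 * np.ones(3), 0.7)],
}

# Keep only the best-scoring candidate per match (the "green" line in the video).
best = {m: max(cands, key=lambda c: c[2]) for m, cands in candidates.items()}

def endpoints_close(a, b, tol=0.1):
    """True if two candidates' endpoints are within `tol`, i.e. likely the same 3D line."""
    return np.linalg.norm(a[0] - b[0]) < tol and np.linalg.norm(a[1] - b[1]) < tol

# Naive grouping: merge best candidates that likely belong to the same line.
groups = []
for line in best.values():
    for group in groups:
        if endpoints_close(line, group[0]):
            group.append(line)
            break
    else:
        groups.append([line])
```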

https://www.youtube.com/watch?v=kyrD6IJKxg8?playlist=kyrD6IJKxg8&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865905458?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=1000:767

Focusing on a single line, LIMAP computes a score for each candidate (the brighter, the higher the cost). These scores are used to decide which line candidates belong to the same line. The final line shown in red is computed based on the candidates that were grouped together.

https://www.youtube.com/watch?v=JTOs_VVOS78?playlist=JTOs_VVOS78&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865973521?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=1000:767

Once the lines are found, LIMAP further uses point-line associations to jointly optimize lines and points. Often 3D points lie on lines or intersections thereof. Here we highlight the line-point associations in blue.

https://www.youtube.com/watch?v=0xZXPv1o7S0?playlist=0xZXPv1o7S0&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865973652?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=1000:767

Human-made environments often contain many parallel and orthogonal lines. LIMAP exploits this by detecting sets of lines that are likely parallel or orthogonal and optimizing them globally. Here we visualize these parallel lines; each color is associated with one vanishing point.
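
A visualization like this maps naturally onto line-strip logging in the Rerun SDK; the segments, vanishing-point assignments, and color palette below are made up for illustration (archetype-style API, rerun-sdk >= 0.9).

```python
import numpy as np
import rerun as rr

rr.init("limap_vanishing_points", spawn=True)

# Hypothetical 3D segments and the vanishing point each was assigned to.
segments = [
    np.array([[0, 0, 0], [1, 0, 0]]),
    np.array([[0, 1, 0], [1, 1, 0]]),
    np.array([[0, 0, 0], [0, 0, 1]]),
]
vp_ids = [0, 0, 1]  # first two segments are parallel, third belongs to another direction

palette = [(255, 99, 71), (65, 105, 225), (60, 179, 113)]  # one color per vanishing point
rr.log(
    "world/lines",
    rr.LineStrips3D(segments, colors=[palette[i] for i in vp_ids]),
)
```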

https://www.youtube.com/watch?v=qyWYq0arb-Y?playlist=qyWYq0arb-Y&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865973669?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=1000:767

There is a lot more to unpack, so check out the [paper](https://arxiv.org/abs/2303.17504) by Shaohui Liu, Yifan Yu, Rémi Pautrat, Marc Pollefeys, and Viktor Larsson. It also gives an educational overview of the strengths and weaknesses of both line-based and point-based structure-from-motion.
15 changes: 10 additions & 5 deletions examples/python/mcc/README.md
@@ -8,23 +8,28 @@ thumbnail_dimensions: [480, 274]

By combining MetaAI's [Segment Anything Model (SAM)](https://github.com/facebookresearch/segment-anything) and [Multiview Compressive Coding (MCC)](https://github.com/facebookresearch/MCC) we can get a 3D object from a single image.

https://www.youtube.com/watch?v=kmgFTWBZhWU?playlist=kmgFTWBZhWU&loop=1&hd=1&rel=0&autoplay=1

https://vimeo.com/865973817?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:8133

The basic idea is to use SAM to create a generic object mask so we can exclude the background.
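
With the `segment-anything` package, getting such a mask roughly looks like the sketch below; the checkpoint path, image file, and click coordinates are placeholders.

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Placeholder checkpoint path and image; use whatever SAM weights you have locally.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.asarray(Image.open("input.jpg").convert("RGB"))
predictor.set_image(image)

# One foreground click on the object is enough for a generic mask.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),  # 1 = foreground
    multimask_output=True,
)
object_mask = masks[np.argmax(scores)]  # pick the highest-scoring proposal
```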

https://www.youtube.com/watch?v=7qosqFbesL0?playlist=7qosqFbesL0&loop=1&hd=1&rel=0&autoplay=1

https://vimeo.com/865973836?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:7941

The next step is to generate a depth image. Here we use the awesome [ZoeDepth](https://github.com/isl-org/ZoeDepth) to get realistic depth from the color image.

https://www.youtube.com/watch?v=d0u-MoNVR6o?playlist=d0u-MoNVR6o&loop=1&hd=1&rel=0&autoplay=1

https://vimeo.com/865973850?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:7941

With depth, color, and an object mask, we have everything needed to create a colored point cloud of the object from a single view.
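
A minimal sketch of that back-projection under a simple pinhole model; the intrinsics, image sizes, and dummy inputs stand in for the real ZoeDepth output, color image, and SAM mask, and logging uses the Rerun archetype-style API (rerun-sdk >= 0.9).

```python
import numpy as np
import rerun as rr

def backproject(depth: np.ndarray, rgb: np.ndarray, mask: np.ndarray, fx, fy, cx, cy):
    """Turn a masked depth + color image into a colored point cloud (pinhole model)."""
    v, u = np.nonzero(mask)          # pixel coordinates inside the object mask
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    colors = rgb[v, u]
    return points, colors

# Dummy inputs standing in for the ZoeDepth output, the image, and the SAM mask.
depth = np.ones((480, 640), dtype=np.float32)
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:400] = True

points, colors = backproject(depth, rgb, mask, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
rr.init("mcc_pointcloud", spawn=True)
rr.log("world/object", rr.Points3D(points, colors=colors, radii=0.005))
```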

https://www.youtube.com/watch?v=LI0mE7usguk?playlist=LI0mE7usguk&loop=1&hd=1&rel=0&autoplay=1

https://vimeo.com/865973862?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:11688

MCC encodes the colored points and then creates a reconstruction by sweeping through the volume, querying the network for occupancy and color at each point.
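
Conceptually the sweep is just evaluating a function on a regular 3D grid and keeping the occupied samples; `query_network` below is a stand-in for MCC's decoder, not its actual interface, and the occupancy threshold is arbitrary.

```python
import numpy as np

def query_network(points: np.ndarray):
    """Stand-in for MCC's decoder: returns (occupancy in [0, 1], RGB color) per query point."""
    occupancy = (np.linalg.norm(points, axis=-1) < 0.5).astype(np.float32)  # dummy sphere
    colors = np.full((len(points), 3), 200, dtype=np.uint8)
    return occupancy, colors

# Sweep a regular grid through the volume of interest.
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 64)] * 3, indexing="ij"), axis=-1).reshape(-1, 3)
occupancy, colors = query_network(grid)

keep = occupancy > 0.1  # occupancy threshold; the real value is a tuning choice
reconstruction_points = grid[keep]
reconstruction_colors = colors[keep]
```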

https://www.youtube.com/watch?v=RuHv9Nx6PvI?playlist=RuHv9Nx6PvI&loop=1&hd=1&rel=0&autoplay=1

https://vimeo.com/865973880?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=1:1

This is a really great example of how a lot of cool solutions are built these days: by stringing together more targeted pre-trained models. The details of the three building blocks can be found in the respective papers:
- [Segment Anything](https://arxiv.org/abs/2304.02643) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick
8 changes: 4 additions & 4 deletions examples/python/shape_pointe/README.md
@@ -8,7 +8,7 @@ thumbnail_dimensions: [480, 293]

OpenAI has released two models for text-to-3D generation: Point-E and Shap-E. Both of these methods are fast and interesting, but still low fidelity for now.

https://www.youtube.com/watch?v=f9QWkamyWZI?playlist=f9QWkamyWZI&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974160?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6545

First off, how do these two methods differ from each other? Point-E represents its 3D shapes via point clouds. It does so using a 3-step generation process: first, it generates a single synthetic view using a text-to-image diffusion model (in this case GLIDE).

@@ -22,7 +22,7 @@ First off, how do these two methods differ from each other? Point-E represents i

Second, it produces a coarse 3D point cloud using a diffusion model that conditions on the generated image; third, it generates a fine 3D point cloud using an upsampling network. Finally, another model is used to predict an SDF from the point cloud, and marching cubes turns it into a mesh. As you can tell, the results aren’t very high quality, but they are fast.

https://www.youtube.com/watch?v=37Rsi7bphQY?playlist=37Rsi7bphQY&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974180?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6095

Shap-E improves on this by representing 3D shapes implicitly. This is done in two stages. First, an encoder is trained that takes images or a point cloud as input and outputs the weights of a NeRF.

@@ -36,10 +36,10 @@ Shap-E improves on this by representing 3D shapes implicitly. This is done in tw

In the second stage, a diffusion model is trained on a dataset of NeRF weights generated by the previous encoder. This diffusion model is conditioned on either images or text descriptions. The resulting NeRF also outputs SDF values so that meshes can be extracted using marching cubes again. Here we see the prompt "a cheeseburger" turn into a 3D mesh and a set of rendered views.

https://www.youtube.com/watch?v=oTVLrujriiQ?playlist=oTVLrujriiQ&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974191?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6545

When compared to Point-E on both image-to-mesh and text-to-mesh generation, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.

https://www.youtube.com/watch?v=DskRD5nioyA?playlist=DskRD5nioyA&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974209?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6545

Check out the respective papers to learn more about the details of both methods: "[Shap-E: Generating Conditional 3D Implicit Functions](https://arxiv.org/abs/2305.02463)" by Heewoo Jun and Alex Nichol; "[Point-E: A System for Generating 3D Point Clouds from Complex Prompts](https://arxiv.org/abs/2212.08751)" by Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen.
9 changes: 6 additions & 3 deletions examples/python/simplerecon/README.md
@@ -8,15 +8,18 @@ thumbnail_dimensions: [480, 271]

SimpleRecon is a back-to-basics approach to 3D scene reconstruction from posed monocular images by Niantic Labs. It offers state-of-the-art depth accuracy and competitive 3D scene reconstruction, which makes it well suited for resource-constrained environments.

https://www.youtube.com/watch?v=TYR9_Ql0w7k?playlist=TYR9_Ql0w7k&loop=1&hd=1&rel=0&autoplay=1

https://vimeo.com/865974318?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:7627

SimpleRecon's key contributions include using a 2D CNN with a cost volume, incorporating metadata via an MLP, and avoiding the computational cost of 3D convolutions. The different frustums in the visualization show each source frame used to compute the cost volume. These source frames have their features extracted and back-projected onto the current frame's depth plane hypotheses.
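
A toy version of that dot-product cost volume, assuming the source features have already been warped onto each depth hypothesis; the tensor sizes and number of depth planes are arbitrary.

```python
import torch

B, C, D, H, W = 1, 32, 64, 60, 80  # batch, channels, depth hypotheses, feature height/width

ref_feats = torch.randn(B, C, H, W)        # features of the current (reference) frame
warped_src = torch.randn(B, C, D, H, W)    # source features warped to each depth plane

# Matching cost per pixel and depth plane: dot product over the channel dimension.
cost_volume = (ref_feats.unsqueeze(2) * warped_src).sum(dim=1)  # (B, D, H, W)

# A 2D CNN would process this (plus metadata) instead of expensive 3D convolutions;
# here we just take the best-matching depth plane per pixel as a crude depth estimate.
best_plane = cost_volume.argmax(dim=1)      # (B, H, W) index of the best hypothesis
```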

https://www.youtube.com/watch?v=g0dzm-k1-K8?playlist=g0dzm-k1-K8&loop=1&hd=1&rel=0&autoplay=1

https://vimeo.com/865974327?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6522

SimpleRecon only uses camera poses, depths, and surface normals (generated from depth) for supervision, allowing out-of-distribution inference, e.g. from an ARKit-compatible iPhone.

https://www.youtube.com/watch?v=OYsErbNdQSs?playlist=OYsErbNdQSs&loop=1&hd=1&rel=0&autoplay=1

https://vimeo.com/865974337?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:11111

The method works well for applications such as robotic navigation, autonomous driving, and AR. It takes input images, their intrinsics, and relative camera poses to predict dense depth maps, combining monocular depth estimation and MVS via plane sweep.

12 changes: 6 additions & 6 deletions examples/python/slahmr/README.md
@@ -8,7 +8,7 @@ thumbnail_dimensions: [480, 293]

SLAHMR robustly tracks the motion of multiple moving people filmed with a moving camera and works well on “in-the-wild” videos. It’s a great showcase of how to build working computer vision systems by intelligently combining several single-purpose models.

https://www.youtube.com/watch?v=eGR4H0KkofA?playlist=eGR4H0KkofA&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974657?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6835

“Decoupling Human and Camera Motion from Videos in the Wild” (SLAHMR) combines the outputs of ViTPose, PHALP, DROID-SLAM, HuMoR, and SMPL over three optimization stages. It’s interesting to see how it becomes more and more consistent with each step.

@@ -22,22 +22,22 @@ https://www.youtube.com/watch?v=eGR4H0KkofA?playlist=eGR4H0KkofA&loop=1&hd=1&rel

Input to the method is a video sequence. ViTPose is used to detect 2D skeletons, PHALP for 3D shape and pose estimation of the humans, and DROID-SLAM to estimate the camera trajectory. Note that the 3D poses are initially quite noisy and inconsistent.

https://www.youtube.com/watch?v=84hWddApYtI?playlist=84hWddApYtI&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974668?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6835

In the first stage, the 3D translation and rotation predicted by PHALP are optimized to better match the 2D keypoints from ViTPose. (left = before, right = after)
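
In spirit, this first stage is a reprojection-error minimization. The sketch below optimizes only a root translation against 2D keypoints through a pinhole projection, with made-up data and a far simpler parameterization than SLAHMR's actual optimization.

```python
import torch

# Dummy data: 3D joints from the pose estimator (J x 3) and detected 2D keypoints (J x 2).
joints_3d = torch.randn(17, 3) + torch.tensor([0.0, 0.0, 5.0])
keypoints_2d = torch.rand(17, 2) * torch.tensor([640.0, 480.0])
K = torch.tensor([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])

translation = torch.zeros(3, requires_grad=True)  # the quantity being optimized
optimizer = torch.optim.Adam([translation], lr=1e-2)

def project(points: torch.Tensor) -> torch.Tensor:
    """Pinhole projection of (J x 3) camera-space points to pixel coordinates."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

for _ in range(200):
    optimizer.zero_grad()
    reprojection = project(joints_3d + translation)
    loss = ((reprojection - keypoints_2d) ** 2).mean()  # 2D reprojection error
    loss.backward()
    optimizer.step()
```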

https://www.youtube.com/watch?v=iYy1sfDZsEc?playlist=iYy1sfDZsEc&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974684?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6835

In the second stage, in addition to 3D translation and rotation, the scale of the world and the shape and pose of the bodies are optimized. To do so, priors on joint smoothness, body shape, and body pose are added to the previous optimization term. (left = before, right = after)

https://www.youtube.com/watch?v=XXMKn29MlRI?playlist=XXMKn29MlRI&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974714?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6835

This step is crucial in that it finds the correct scale such that the humans don't drift in the 3D world. This can best be seen by overlaying the two estimates (the highlighted data is before optimization).

https://www.youtube.com/watch?v=FFHWNnZzUhA?playlist=FFHWNnZzUhA&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974747?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6835

Finally, in the third stage, a motion prior (HuMoR) is added to the optimization, and the ground plane is estimated to enforce realistic ground contact. This step further removes some jerky and unrealistic motions. Compare the highlighted blue figure. (left = before, right = after)

https://www.youtube.com/watch?v=6rsgOXekhWI?playlist=6rsgOXekhWI&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865974760?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6835

For more details check out the [paper](https://arxiv.org/abs/2302.12827) by Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa.
8 changes: 4 additions & 4 deletions examples/python/tapir/README.md
@@ -9,7 +9,7 @@ thumbnail_dimensions: [480, 288]

Tracking any point in a video is a fundamental problem in computer vision. The paper “TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement” by Carl Doersch et al. significantly improved over prior state-of-the-art.

https://www.youtube.com/watch?v=5EixnuJnFdo?playlist=5EixnuJnFdo&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865975034?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:9015

“TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement” proposes a two-stage approach:
1. compare the query point's feature with the target image features to estimate an initial track, and
@@ -23,14 +23,14 @@ https://www.youtube.com/watch?v=5EixnuJnFdo?playlist=5EixnuJnFdo&loop=1&hd=1&rel

In the first stage, the image features at the query point in the query image are compared to the feature maps of the other images using the dot product. The resulting similarity map (or “cost volume”) gives a high score for similar image features.
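
Computing such a similarity map is a one-liner once the features exist; the feature maps and query location below are random placeholders.

```python
import numpy as np

T, H, W, C = 8, 64, 64, 128               # frames, feature-map height/width, channels
feature_maps = np.random.randn(T, H, W, C).astype(np.float32)

qy, qx = 32, 20                           # query pixel in the query frame (frame 0)
query_feature = feature_maps[0, qy, qx]   # (C,)

# Dot product of the query feature with every feature in every frame -> (T, H, W).
cost_volume = np.einsum("thwc,c->thw", feature_maps, query_feature)
```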

https://www.youtube.com/watch?v=dqvcIlk55AM?playlist=dqvcIlk55AM&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865975051?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=1:1

From here, the position of the point is predicted as a heatmap. In addition, the probabilities that the point is occluded and that its position is accurate are predicted. Only when a point is predicted as both non-occluded and accurate is it classified as visible in a given frame.
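
The gating can be summarized as a soft-argmax over the heatmap plus two thresholds; the array shapes and the 0.5 cutoffs below are assumptions, not TAPIR's exact values.

```python
import numpy as np

def soft_argmax(heatmap: np.ndarray) -> np.ndarray:
    """Expected (x, y) position under the softmax of a (H, W) heatmap."""
    probs = np.exp(heatmap - heatmap.max())
    probs /= probs.sum()
    ys, xs = np.mgrid[0 : heatmap.shape[0], 0 : heatmap.shape[1]]
    return np.array([(probs * xs).sum(), (probs * ys).sum()])

heatmap = np.random.randn(64, 64)     # predicted position heatmap for one frame
p_occluded = 0.2                      # predicted probability the point is occluded
p_accurate = 0.9                      # predicted probability the position is accurate

position = soft_argmax(heatmap)
visible = (p_occluded < 0.5) and (p_accurate > 0.5)  # visible only if non-occluded AND accurate
```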

https://www.youtube.com/watch?v=T7w8dXEGFzY?playlist=T7w8dXEGFzY&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865975071?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:5052

The previous step gives an initial track, but it is still noisy since the inference is done on a per-frame basis. Next, the position, occlusion, and accuracy probabilities are iteratively refined using spatially and temporally local feature volumes.

https://www.youtube.com/watch?v=mVA_svY5wC4?playlist=mVA_svY5wC4&loop=1&hd=1&rel=0&autoplay=1
https://vimeo.com/865975078?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:6699

Check out the [paper](https://arxiv.org/abs/2306.08637) by Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. It also includes a nice visual comparison to previous approaches.