This baseline is not documented in the main paper due to space constraints. It is constructed by taking CLIP (specifically StreetCLIP), rescaling and center-cropping the input image to 334 x 334, and then computing I) a ranking score for each patch and II) a clustering of the top-1000 patches by their per-position features. It is exactly the algorithm from the main paper, adapted to CLIP.
To rank each patch we try 3 ranking scores:

a) Difference: sim(patch, f"{country}") - sim(patch, f"")

b) Softmax: softmax([sim(patch, f"{country}"), sim(patch, f"")])[0]

c) Similarity: sim(patch, f"{country}")
To extract features for clustering, we simply upscale the token feature map to the input resolution and take the l2-normalized average of the token features that correspond to the input patch, similar to what we do for our DIFT features.
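A minimal sketch of this feature extraction, under assumptions we make explicit: `patch_cluster_feature`, the 24 x 24 token grid, and the nearest-neighbour upscaling are illustrative choices of ours, not details taken from the paper.

```python
import numpy as np

def patch_cluster_feature(token_feats, box, grid=24, img=336):
    """l2-normalized average of the token features under an input patch.

    token_feats: (grid, grid, D) spatial token features from the CLIP
                 vision encoder (grid=24 assumed here; adjust to the
                 actual backbone).
    box:         (x0, y0, x1, y1) patch coordinates in input pixels.
    Hypothetical helper mirroring the recipe described above.
    """
    # Nearest-neighbour upscale of the token grid to image resolution.
    scale = img // grid
    up = np.repeat(np.repeat(token_feats, scale, axis=0), scale, axis=1)

    # Average all upscaled token features inside the patch box.
    x0, y0, x1, y1 = box
    region = up[y0:y1, x0:x1].reshape(-1, token_feats.shape[-1])
    f = region.mean(axis=0)
    return f / np.linalg.norm(f)  # l2-normalize before clustering
```

The resulting vectors can then be fed to any off-the-shelf clustering routine (e.g. k-means) over the top-ranked patches.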
Results for our 10 mined countries can be found in our supplementary material. Although this algorithm is extremely fast (30 minutes per country with parallelization over 32 CPUs), the results are unfortunately not satisfying. Note also that, to the best of our knowledge, this method cannot be extended to arbitrary-size images because of CLIP's learned positional embeddings.