# CLIP Baseline

This baseline is not documented in the main paper due to space constraints. It is constructed by taking CLIP (specifically StreetCLIP), rescaling an input image to 334 x 334 (by center cropping), and then computing I) a ranking of each patch and II) a clustering of the top-1000 patches by their per-position features. It is exactly our algorithm from the main paper, adapted to CLIP.
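As a rough sketch of the preprocessing step (not the repository's actual code), the center-crop and rescale to 334 x 334 could look like the following in plain NumPy, using nearest-neighbor resampling as a simplification:

```python
import numpy as np

def center_crop_resize(img, size=334):
    """Center-crop an (H, W, C) image array to a square, then resize it
    to (size, size, C) with nearest-neighbor sampling (a simplification;
    the real pipeline may use bilinear interpolation)."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    crop = img[top:top + s, left:left + s]
    # Nearest-neighbor index map from the target grid back to the crop.
    idx = (np.arange(size) * s / size).astype(int)
    return crop[idx][:, idx]
```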

## I. Ranking each patch

To rank each patch we try three ranking scores:

a) Difference: `sim(patch, f"{country}") - sim(patch, f"")`.
b) Softmax: `softmax([sim(patch, f"{country}"), sim(patch, f"")])[0]`.
c) Similarity: `sim(patch, f"{country}")`.
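Given precomputed, l2-normalized patch and text features (the feature-extraction code itself is not shown here), the three scores can be sketched in NumPy as follows; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def ranking_scores(patch_feats, country_feat, empty_feat):
    """Compute the three patch-ranking scores.

    patch_feats: (N, D) l2-normalized patch features.
    country_feat: (D,) l2-normalized text feature for f"{country}".
    empty_feat: (D,) l2-normalized text feature for the empty prompt f"".
    Returns (difference, softmax, similarity), each of shape (N,).
    """
    sim_country = patch_feats @ country_feat  # sim(patch, f"{country}")
    sim_empty = patch_feats @ empty_feat      # sim(patch, f"")
    # a) Difference score.
    diff = sim_country - sim_empty
    # b) Softmax score: softmax over the two similarities, keep entry 0.
    logits = np.stack([sim_country, sim_empty], axis=-1)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    soft = exp[..., 0] / exp.sum(axis=-1)
    # c) Plain similarity score.
    return diff, soft, sim_country
```

Note that with only two candidates, the softmax score reduces to a sigmoid of the difference score, so b) is a monotone rescaling of a); the two rankings differ only if a temperature or additional prompts are introduced.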

## II. Clustering

To extract features for clustering, we simply upscale the token features to the input resolution and take their l2-normalized average over the region corresponding to each input patch, similar to what we do for our DIFT features.
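A minimal sketch of this feature extraction, assuming a (h, w, D) grid of ViT token features and using nearest-neighbor upscaling as a stand-in for whatever interpolation the pipeline actually uses (the function name and box format are hypothetical):

```python
import numpy as np

def patch_cluster_features(token_feats, patch_boxes, scale):
    """Upscale a (h, w, D) token-feature grid by `scale` and return one
    l2-normalized averaged feature per patch box.

    patch_boxes: iterable of (y0, y1, x0, x1) in upscaled coordinates.
    """
    # Nearest-neighbor upscale to (h * scale, w * scale, D).
    up = token_feats.repeat(scale, axis=0).repeat(scale, axis=1)
    feats = []
    for y0, y1, x0, x1 in patch_boxes:
        # Average all upscaled features falling inside the patch region.
        f = up[y0:y1, x0:x1].reshape(-1, up.shape[-1]).mean(axis=0)
        feats.append(f / (np.linalg.norm(f) + 1e-8))  # l2-normalize
    return np.stack(feats)
```

The resulting per-patch features can then be fed to any standard clustering routine (e.g. k-means) over the top-1000 ranked patches.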

## Results

Results for our 10 mined countries can be found in our supplementary material. Although this algorithm is extremely fast **(!)** (about 30 minutes per country when parallelized across 32 CPUs), the results are unfortunately not that satisfying. Note that, to the best of our knowledge, this method cannot be extended to arbitrarily sized images because of CLIP's learned positional embeddings.