This is a quantitative method for finding duplicate and near-duplicate images using the sentence-transformers library (https://github.com/UKPLab/sentence-transformers), which provides an easy way to compute dense vector representations of images. We can use OpenAI's Contrastive Language-Image Pre-training (CLIP) model (https://github.com/openai/CLIP), a neural network that has been trained on a variety of (image, text) pairs. To find duplicates and near-duplicates, we encode all images into a vector space and then look for high-density regions, which correspond to areas where the images are very similar.
When two images are compared, they receive a score between 0 and 1.00. We can use a threshold parameter to decide whether two images are similar or different. By setting the threshold lower, you get larger clusters containing less similar images. Duplicate images receive a score of 1.00, which means the two images are exactly the same. To find near-duplicate images, we can set the threshold to any arbitrary value, for example 0.9: if the score between two images is greater than 0.9, we can conclude that they are near-duplicates.
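The score itself is just the cosine similarity between the two embedding vectors (sentence-transformers exposes this as `util.cos_sim`). A minimal pure-Python sketch of the same formula, using toy 4-dimensional vectors in place of the real 512-dimensional CLIP embeddings:

```python
import math

def cos_sim(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

emb_a = [1.0, 0.0, 1.0, 0.0]  # stand-in for one image's CLIP embedding
emb_b = [1.0, 0.0, 1.0, 0.0]  # identical image -> score 1.00
emb_c = [1.0, 1.0, 0.0, 0.0]  # different image -> lower score

print(f"{cos_sim(emb_a, emb_b):.2f}")  # 1.00
print(f"{cos_sim(emb_a, emb_c):.2f}")  # 0.50
```

Identical vectors always score exactly 1.00, which is why exact duplicates are so easy to detect.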
An example:
This dataset has 5 images; note that cat #1 has a duplicate, while the others are distinct.
Finding duplicate images
Score: 100.000%
.\cat1 copy.jpg
.\cat1.jpg
cat1 and its copy are identical.
Finding near-duplicate images
Score: 91.116%
.\cat1 copy.jpg
.\cat2.jpg
Score: 91.116%
.\cat1.jpg
.\cat2.jpg
Score: 91.097%
.\bear1.jpg
.\bear2.jpg
Score: 59.086%
.\bear2.jpg
.\cat2.jpg
Score: 56.025%
.\bear1.jpg
.\cat2.jpg
Score: 53.659%
.\bear1.jpg
.\cat1 copy.jpg
Score: 53.659%
.\bear1.jpg
.\cat1.jpg
Score: 53.225%
.\bear2.jpg
.\cat1.jpg
Now we get more interesting score comparisons between different images: the higher the score, the more similar the images; the lower the score, the less similar. Using a threshold of 0.9, or 90%, we can filter out the near-duplicate images.
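Filtering by threshold is a one-line list comprehension over the (score, image_id1, image_id2) triplets. A sketch with hypothetical scores (the triplet format matches what `util.paraphrase_mining_embeddings` returns, already sorted by decreasing score):

```python
# Hypothetical (score, image_id1, image_id2) triplets, sorted by decreasing score.
pairs = [(1.0, 0, 1), (0.911, 0, 2), (0.911, 1, 2), (0.591, 2, 3)]

THRESHOLD = 0.9  # 90%

# Exact duplicates score 1.00; near-duplicates score above the
# threshold but below 1.00.
duplicates = [p for p in pairs if p[0] >= 1.0]
near_duplicates = [p for p in pairs if THRESHOLD <= p[0] < 1.0]

print(duplicates)       # [(1.0, 0, 1)]
print(near_duplicates)  # [(0.911, 0, 2), (0.911, 1, 2)]
```

The full script below takes a slightly different route: it drops everything scoring 0.99 or above and then prints the top N of what remains, which yields the same ranking of near-duplicates.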
Comparing only two images
Score: 91.097%
.\bear1.jpg
.\bear2.jpg
Score: 91.116%
.\cat1.jpg
.\cat2.jpg
Score: 93.715%
.\tower1.jpg
.\tower2.jpg
Code
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import os
# Load the OpenAI CLIP Model
print('Loading CLIP Model...')
model = SentenceTransformer('clip-ViT-B-32')
# Next we compute the embeddings
# To encode an image, you can use the following code:
# from PIL import Image
# encoded_image = model.encode(Image.open(filepath))
image_names = list(glob.glob('./*.jpg'))
print("Images:", len(image_names))
encoded_image = model.encode([Image.open(filepath) for filepath in image_names], batch_size=128, convert_to_tensor=True, show_progress_bar=True)
# Now we run the mining step. This function compares each image against
# all other images and returns a list of the pairs with the highest
# cosine similarity scores
processed_images = util.paraphrase_mining_embeddings(encoded_image)
NUM_SIMILAR_IMAGES = 10
# =================
# DUPLICATES
# =================
print('Finding duplicate images...')
# Filter list for duplicates. Results are triplets (score, image_id1, image_id2) sorted in decreasing order
# A duplicate image will have a score of 1.00
duplicates = [image for image in processed_images if image[0] >= 1]
# Output the top X duplicate images
for score, image_id1, image_id2 in duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])
# =================
# NEAR DUPLICATES
# =================
print('Finding near duplicate images...')
# Use a threshold parameter to identify two images as similar. By setting the threshold lower,
# you will get larger clusters which have fewer similar images in them. Threshold 0 - 1.00
# A threshold of 1.00 means the two images are exactly the same. Since we are finding near
# duplicate images, we can set it at 0.99 or any number 0 < X < 1.00.
threshold = 0.99
near_duplicates = [image for image in processed_images if image[0] < threshold]
for score, image_id1, image_id2 in near_duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])