# Automatic photo categorization


Goals:
  - Categorize photos into semantically similar groups.
  - Mark similar photos for removal.


## Table of contents
 1. [Features](#features)
 2. [Clustering](#clustering)
 3. [Deduplication](#deduplication)

In [None]:
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
%matplotlib inline

from tqdm import tqdm
#from tqdm.notebook import tqdm

from toolz import compose
from toolz.curried import map, filter

In [None]:
from photocat import fs, photo, group

INPUT_DIR = 'data/photos'

<a name="features"></a>
## Features

Extract features from EXIF data and YOLOv3 output.
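The source does not say which EXIF fields are used. As a minimal sketch, assuming the capture time is one of them, the snippet below turns an EXIF-style timestamp string (`YYYY:MM:DD HH:MM:SS`) into numeric features. The function name `time_features` and the choice of features are hypothetical and are not part of `photocat`.

```python
import math
from datetime import datetime

# EXIF stores date/time values as "YYYY:MM:DD HH:MM:SS".
EXIF_DATETIME_FORMAT = "%Y:%m:%d %H:%M:%S"

def time_features(exif_datetime):
    """Hypothetical helper: map an EXIF timestamp string to numeric features."""
    taken = datetime.strptime(exif_datetime, EXIF_DATETIME_FORMAT)
    seconds_of_day = taken.hour * 3600 + taken.minute * 60 + taken.second
    # Encode the time of day on a circle so that 23:59 and 00:01 end up close.
    angle = 2 * math.pi * seconds_of_day / 86400
    return {
        "day": taken.toordinal(),      # day number, separates different dates
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
    }

features = time_features("2019:07:14 06:00:00")
print(features)
```

The cyclic encoding is one possible design choice: it keeps photos taken around midnight close to each other, which a raw hour value would not.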

In [None]:
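def show_photos(photos, n_row, n_col, size=4):
    """Plot the thumbnails of `photos` on an n_row x n_col grid."""
    # squeeze=False keeps `axs` two-dimensional even for a 1x1 grid,
    # so flatten() always works.
    _, axs = plt.subplots(n_row, n_col, figsize=(n_col*size, n_row*size), squeeze=False)
    axs = axs.flatten()
    for ax in axs:
        ax.axis('off')  # hide ticks, including on unused grid cells
    for p, ax in zip(photos, axs):
        ax.imshow(p.thumbnail)
    plt.show()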

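# compose applies its functions right to left:
# list the image files, wrap each in a Photo, show progress, collect into a list.
photos = compose(
    list,
    tqdm,
    map(photo.Photo),
    fs.list_images
)(INPUT_DIR)

# Preview the first 24 photos on a 6x4 grid.
show_photos(photos[:24], 6, 4)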


<a name="clustering"></a>
## Clustering

Normalize features and cluster with DBSCAN.
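The two steps can be sketched as follows. This is a dependency-free illustration, not the notebook's implementation: in practice scikit-learn's `DBSCAN` would be the usual choice. The feature rows, `eps`, and `min_samples` are made-up values, and `zscore` and `dbscan` are hypothetical helper names.

```python
import math
from statistics import mean, pstdev

def zscore(rows):
    """Scale every feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels

# Made-up rows: [hours since first photo, number of detected objects]
features = [
    [0.0, 1.0], [0.1, 1.0], [0.2, 1.1],
    [5.0, 4.0], [5.1, 4.2], [5.2, 4.1],
    [20.0, 0.0],
]
labels = dbscan(zscore(features), eps=0.5, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Normalizing first matters because DBSCAN uses a single distance threshold `eps` for all dimensions; without it, the feature with the largest numeric range would dominate the clustering.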

<a name="deduplication"></a>
## Deduplication

Use Euclidean distance between the outputs of the topmost YOLOv3 layers as a metric for photo similarity.
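A minimal sketch of how such a metric could be used to mark near-duplicates. The short vectors below are made-up stand-ins for flattened layer outputs, and the threshold and the greedy keep-first strategy are assumptions rather than the notebook's actual method.

```python
import math

def mark_duplicates(embeddings, threshold):
    """Greedy pass: keep a photo unless it is within `threshold` of a kept one.

    Returns (kept, removed) as lists of indices into `embeddings`.
    """
    kept, removed = [], []
    for i, emb in enumerate(embeddings):
        # math.dist computes the Euclidean distance between two vectors.
        if any(math.dist(emb, embeddings[k]) < threshold for k in kept):
            removed.append(i)
        else:
            kept.append(i)
    return kept, removed

# Made-up embeddings: photos 0, 1 and 3 are nearly identical.
embeddings = [
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0],
    [3.0, 4.0, 0.0],
    [0.05, 0.0, 1.0],
]
kept, removed = mark_duplicates(embeddings, threshold=0.5)
print(kept)     # [0, 2]
print(removed)  # [1, 3]
```

Comparing each photo only against the ones already kept means the first photo of every group of near-duplicates is the one that survives.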