# Ship vs iceberg discriminator
TL;DR: Discriminate between ships and icebergs from SAR imagery.
## Approach
Data augmentation and parameter sharing, using CNNs and residual networks (ResNets).
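A minimal augmentation sketch, assuming a Keras `ImageDataGenerator` pipeline
(the `.h5` weight files suggest Keras); the ranges and batch size below are
illustrative, not the settings used here:

```python
# Illustrative augmentation setup -- an assumption, not the exact pipeline used here.
from keras.preprocessing.image import ImageDataGenerator

# SAR chips have no canonical orientation, so flips plus small rotations
# and shifts are cheap, label-preserving augmentations.
augmenter = ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
)

# X: (N, 75, 75, 2) band tensor, y: (N,) iceberg labels.
# flow() then yields augmented batches, e.g.:
#     model.fit_generator(augmenter.flow(X, y, batch_size=32), ...)
```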
## Data directory
The data directory is expected to have the structure shown below:

    data/
    ├── params
    │   ├── base_cnn-scaling.pkl
    │   ├── base_cnn-weights-loss.h5
    │   ├── base_cnn-weights-val_loss.h5
    │   ├── icenet-weights-loss.h5
    │   └── icenet-weights-val_loss.h5
    ├── predictions
    │   └── icenet-dev.csv
    ├── sample_submission.csv
    ├── test.json
    └── train.json
Here `{train,test}.json` are the data files from the
[kaggle website](https://www.kaggle.com/c/statoil-iceberg-classifier-challenge).
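For reference, a sketch of loading these files, with field names as documented
on the challenge's data page (`band_1`/`band_2` are flattened 75x75 dB images):

```python
import numpy as np
import pandas as pd

# Load the Kaggle JSON (a list of records) into a DataFrame.
train = pd.read_json("data/train.json")

# Reshape each flattened band to 75x75 and stack the two polarization
# bands into a (N, 75, 75, 2) tensor.
band1 = np.array([np.array(b).reshape(75, 75) for b in train["band_1"]])
band2 = np.array([np.array(b).reshape(75, 75) for b in train["band_2"]])
X = np.stack([band1, band2], axis=-1)
y = train["is_iceberg"].values
```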
## Log
### Residual base CNN
Summary:
* Test loss: 0.5099
* Test accuracy: 0.7932
* Epochs: 100
* Best validation loss at epoch 70 (training kept converging through epoch 100
  and did not overfit)
Comments:
* Low variance -- training loss is consistently a bit lower than validation
loss.
* Since the images are "artificially" labeled, it is hard to say what the bias
  is. There should be some bias, since this network does not overfit and
  training looks converged after 100 epochs (with a decaying learning rate).
* There may also be labeling noise. It is indeed suspicious that the validation
loss converges with very low variance. Perhaps revisit the labeling
approach for the base generator.
* Conclusion: Check labeling, then bring out the big guns and expand the
  residual net (see the residual-block sketch below).
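
A minimal sketch of the kind of residual block such an expansion could stack;
the filter count and layer choices are assumptions, not icenet's actual
configuration:

```python
# Hypothetical residual block -- illustrates the pattern, not icenet's layers.
from keras.layers import Activation, Add, BatchNormalization, Conv2D

def residual_block(x, filters=64):
    """Two 3x3 convs with an identity shortcut (He et al., 2016).

    Assumes x already has `filters` channels so the addition is valid.
    """
    shortcut = x
    y = Conv2D(filters, (3, 3), padding="same")(x)
    y = BatchNormalization()(y)
    y = Activation("relu")(y)
    y = Conv2D(filters, (3, 3), padding="same")(y)
    y = BatchNormalization()(y)
    y = Add()([shortcut, y])  # identity skip connection
    return Activation("relu")(y)
```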
Using this model as the shared basis for the 9 regions, followed by a reshape,
a conv layer, and two dense layers, yields OK performance: around 0.20 loss
after a few epochs. However, validation loss is often lower than training loss.
It might be that the train/validation splits do not follow the same
distribution for both networks -- check the random seed and verify! It might
also just be noisy training: augmentation is applied to training batches only,
which tends to push training loss above validation loss.
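
A hypothetical end-to-end sketch of that shared-base architecture; the 25x25
crop size, the feature width `F`, the stand-in `make_base_cnn`, and all layer
widths are assumptions made for illustration:

```python
from keras.layers import (Concatenate, Conv2D, Cropping2D, Dense,
                          Flatten, Input, MaxPooling2D, Reshape)
from keras.models import Model

F = 32  # assumed length of the per-region feature vector

# Stand-in for the trained base CNN; the real weights would live in
# data/params/base_cnn-weights-*.h5.
def make_base_cnn():
    crop_in = Input(shape=(25, 25, 2))
    h = Conv2D(16, (3, 3), activation="relu")(crop_in)
    h = MaxPooling2D((2, 2))(h)
    h = Flatten()(h)
    h = Dense(F, activation="relu")(h)
    return Model(crop_in, h)

base_cnn = make_base_cnn()  # one instance => weights shared across all crops

inp = Input(shape=(75, 75, 2))
features = []
for i in range(3):
    for j in range(3):
        # Non-overlapping 25x25 crop at grid position (i, j).
        crop = Cropping2D(((25 * i, 75 - 25 * (i + 1)),
                           (25 * j, 75 - 25 * (j + 1))))(inp)
        features.append(base_cnn(crop))

# Lay the 9 feature vectors out on a 3x3 grid, convolve over the grid,
# then classify with two dense layers.
x = Concatenate()(features)          # (None, 9 * F)
x = Reshape((3, 3, F))(x)
x = Conv2D(64, (2, 2), activation="relu")(x)
x = Flatten()(x)
x = Dense(64, activation="relu")(x)
out = Dense(1, activation="sigmoid")(x)

model = Model(inp, out)
```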