## Summary

In summary:

We introduced the problem of Image Classification, in which we are given a set of images that are all labeled with a single category. We are then asked to predict these categories for a novel set of test images and measure the accuracy of the predictions.
We introduced a simple classifier called the Nearest Neighbor classifier. We saw that there are multiple hyper-parameters (such as value of k, or the type of distance used to compare examples) that are associated with this classifier and that there was no obvious way of choosing them.
We saw that the correct way to set these hyperparameters is to split your training data into two: a training set and a fake test set, which we call validation set. We try different hyperparameter values and keep the values that lead to the best performance on the validation set.
If the lack of training data is a concern, we discussed a procedure called cross-validation, which can help reduce noise in estimating which hyperparameters work best.
Once the best hyperparameters are found, we fix them and perform a single evaluation on the actual test set.
We saw that Nearest Neighbor can get us about 40% accuracy on CIFAR-10. It is simple to implement but requires us to store the entire training set and it is expensive to evaluate on a test image.

Finally, we saw that the use of L1 or L2 distances on raw pixel values is not adequate since the distances correlate more strongly with backgrounds and color distributions of images than with their semantic content.
In next lectures we will embark on addressing these challenges and eventually arrive at solutions that give 90% accuracies, allow us to completely discard the training set once learning is complete, and they will allow us to evaluate a test image in less than a millisecond.

### Summary: Applying kNN in practice

If you wish to apply kNN in practice (hopefully not on images, or perhaps as only a baseline) proceed as follows:

Preprocess your data: Normalize the features in your data (e.g. one pixel in images) to have zero mean and unit variance. We will cover this in more detail in later sections, and chose not to cover data normalization in this section because pixels in images are usually homogeneous and do not exhibit widely different distributions, alleviating the need for data normalization.

If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA (wiki ref, CS229ref, blog ref) or even Random Projections.
Split your training data randomly into train/val splits. As a rule of thumb, between 70-90% of your data usually goes to the train split. This setting depends on how many hyperparameters you have and how much of an influence you expect them to have. If there are many hyperparameters to estimate, you should err on the side of having larger validation set to estimate them effectively. If you are concerned about the size of your validation data, it is best to split the training data into folds and perform cross-validation. If you can afford the computational budget it is always safer to go with cross-validation (the more folds the better, but more expensive).
Train and evaluate the kNN classifier on the validation data (for all folds, if doing cross-validation) for many choices of k (e.g. the more the better) and across different distance types (L1 and L2 are good candidates)
If your kNN classifier is running too long, consider using an Approximate Nearest Neighbor library (e.g. FLANN) to accelerate the retrieval (at cost of some accuracy).

Take note of the hyperparameters that gave the best results. There is a question of whether you should use the full training set with the best hyperparameters, since the optimal hyperparameters might change if you were to fold the validation data into your training set (since the size of the data would be larger). In practice it is cleaner to not use the validation data in the final classifier and consider it to be burned on estimating the hyperparameters. Evaluate the best model on the test set. Report the test set accuracy and declare the result to be the performance of the kNN classifier on your data.

Here are some (optional) links you may find interesting for further reading:

A Few Useful Things to Know about Machine Learning, where especially section 6 is related but the whole paper is a warmly recommended reading.
Recognizing and Learning Object Categories, a short course of object categorization at ICCV 2005.

## CNN

1. 局部感受野, 注意步长, 滤波器大小, zero padding大小.
2. 共享权重, 减少参数数目. 在整个图像上学习一个特征, 因为图像的翻译不变形结构.
3. 混合, 进一步减少参数数目.

CNN由一个序列的层构成, 最常见的包括卷积层, 混合层, 全连接层, relu. 每一个层的输出是一个三维的结构, 输出也是一个三维结构, 中间使用的是一些可微的函数. 有的层可以没有参数, 如RELU, 混合, 有的层没有超参数. 超参数是用来自己设置的, 比如高斯核函数的超参数, 训练速率. 卷积层相当于学习一个滤波器, 当遇到图像的某个特征的时候将会激活, 比如某个方向上的角. 然后一个卷积层一般有很多卷积核, 就可以学习很多个图像的特征. 感受野的大小就是超参数, 感受野是局部的, 但是对于深度是全部的.

1. 输出三维结构的深度, 相当于我们想要学习的滤波器的个数, 每个滤波器想要从图像中学习不同的特征.
2. 步长

$$\frac{(W-F+2P)}{S}+1$$

INPUT -> [[CONV -> RELU]N -> POOL?]M -> [FC -> RELU]*K -> FC

1. 后者直接计算局部的线性映射, 而前者可以学习到非线性的特征.
2. 后者的参数更多.

1. 输出层最好能除以2很多倍. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512.
2. 卷积层应该不改变输入的维数, 用多层小的卷积层叠加而不是大的单个卷积层. 一般大的卷积层只在第一层卷积层使用.
3. 混合层最常用2X2的最大混合. 可以丢弃75%的激活层. 大于3X3的效果都很差.
4. 步长最好用1, 使得卷积层只改变数据的深度, 用混合层进行采样降维.

1. LeNet(第一个)
2. AlexNet(ILSVRC 2012 winner), 使用很深的堆叠卷积网络, 在这之前一般卷积层都马上接一个混合层.
3. ZF Net(ILSVRC 2013 winner), 对AlexNet的超参数的修改, 扩大了中间卷积层的大小, 并且让第一层的步长和卷积核大小变小了.
4. GoogLeNet(ILSVRC 2014 winner), 引入感知单元, 显著性减低了网络的参数数目, (4M, 而AlexNet有60M), 最后使用平均值混合而不是全连接层, 减少了很多参数.(Inception-4)
5. VGGNet(很深的网络, 很小的卷积核, ILSVRC 2014 runner-up), 缺点是占用很多内存.
6. ResNet(目前最好用的, ILSVRC 2015 winner), 跳跃连接, batch正则化, 末端无全连接层.

INPUT: [224x224x3]        memory:  224*224*3=150K   weights: 0
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64]  memory:  112*112*64=800K   weights: 0
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128]  memory:  56*56*128=400K   weights: 0
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256]  memory:  28*28*256=200K   weights: 0
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512]  memory:  14*14*512=100K   weights: 0
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512]  memory:  7*7*512=25K  weights: 0
FC: [1x1x4096]  memory:  4096  weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096]  memory:  4096  weights: 4096*4096 = 16,777,216
FC: [1x1x1000]  memory:  1000 weights: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters