In summary:

We introduced the problem of Image Classification, in which we are given a set of images that are all labeled with a single category. We are then asked to predict these categories for a novel set of test images and measure the accuracy of the predictions.
We introduced a simple classifier called the Nearest Neighbor classifier. We saw that there are multiple hyperparameters (such as the value of k, or the type of distance used to compare examples) associated with this classifier and that there is no obvious way of choosing them.
We saw that the correct way to set these hyperparameters is to split your training data into two: a training set and a fake test set, which we call the validation set. We try different hyperparameter values and keep the values that lead to the best performance on the validation set.
If the lack of training data is a concern, we discussed a procedure called cross-validation, which can help reduce noise in estimating which hyperparameters work best.
Once the best hyperparameters are found, we fix them and perform a single evaluation on the actual test set.
We saw that Nearest Neighbor can get us about 40% accuracy on CIFAR-10. It is simple to implement but requires us to store the entire training set and it is expensive to evaluate on a test image.

Finally, we saw that the use of L1 or L2 distances on raw pixel values is not adequate since the distances correlate more strongly with backgrounds and color distributions of images than with their semantic content.
In the next lectures we will address these challenges and eventually arrive at solutions that give 90% accuracies, allow us to completely discard the training set once learning is complete, and allow us to evaluate a test image in less than a millisecond.
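
As a reminder of how little there is to the Nearest Neighbor classifier, here is a minimal sketch in plain Python. The function names and the toy data are made up for illustration; real image data would be flattened pixel arrays, and a vectorized library would be far faster.

```python
from collections import Counter


def l1_distance(a, b):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_predict(train_x, train_y, query, k=1, distance=l1_distance):
    """Label `query` by majority vote among its k nearest training points."""
    # "Training" is just storing the data; all the work happens at query time.
    order = sorted(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data standing in for flattened images (hypothetical values).
train_x = [[0, 0], [1, 0], [0, 1], [9, 9], [10, 9], [9, 10]]
train_y = ["cat", "cat", "cat", "ship", "ship", "ship"]

print(knn_predict(train_x, train_y, [1, 1], k=3))                        # cat
print(knn_predict(train_x, train_y, [8, 8], k=3, distance=l2_distance))  # ship
```

Every prediction scans the whole training set, which is exactly the storage and test-time cost noted in the summary.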

Summary: Applying kNN in practice

If you wish to apply kNN in practice (hopefully not on images, or perhaps as only a baseline) proceed as follows:

Preprocess your data: Normalize the features in your data (e.g. one pixel in images) to have zero mean and unit variance. We will cover this in more detail in later sections, and chose not to cover data normalization in this section because pixels in images are usually homogeneous and do not exhibit widely different distributions, alleviating the need for data normalization.

If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA (wiki ref, CS229ref, blog ref) or even Random Projections.
Split your training data randomly into train/val splits. As a rule of thumb, between 70-90% of your data usually goes to the train split. This setting depends on how many hyperparameters you have and how much of an influence you expect them to have. If there are many hyperparameters to estimate, you should err on the side of having a larger validation set to estimate them effectively. If you are concerned about the size of your validation data, it is best to split the training data into folds and perform cross-validation. If you can afford the computational budget it is always safer to go with cross-validation (the more folds the better, but more expensive).
Train and evaluate the kNN classifier on the validation data (for all folds, if doing cross-validation) for many choices of k (e.g. the more the better) and across different distance types (L1 and L2 are good candidates).
If your kNN classifier is running too long, consider using an Approximate Nearest Neighbor library (e.g. FLANN) to accelerate the retrieval (at the cost of some accuracy).

Take note of the hyperparameters that gave the best results. There is a question of whether you should use the full training set with the best hyperparameters, since the optimal hyperparameters might change if you were to fold the validation data into your training set (since the size of the data would be larger). In practice it is cleaner to not use the validation data in the final classifier and consider it to be burned on estimating the hyperparameters. Evaluate the best model on the test set. Report the test set accuracy and declare the result to be the performance of the kNN classifier on your data.
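
The procedure above (normalize, split into folds, sweep k and the distance type, keep the best setting) can be sketched as follows. This is a rough illustration under stated assumptions, not a reference implementation: the data is synthetic, the helper names are hypothetical, and the final test-set evaluation is left out.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def standardize(train, other):
    """Scale features to zero mean / unit variance using train statistics only."""
    cols = list(zip(*train))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant features
    scale = lambda rows: [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]
    return scale(train), scale(other)


def knn_predict(train_x, train_y, query, k, p):
    """k-NN vote using the Lp distance (p=1 for L1, p=2 for L2)."""
    # The p-th root is skipped: it does not change the ordering of neighbors.
    dist = lambda a: sum(abs(x - y) ** p for x, y in zip(a, query))
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(xs, ys, k, p, folds=5):
    """Mean validation accuracy over `folds` folds for one (k, p) setting."""
    scores = []
    for f in range(folds):
        val_idx = set(range(f, len(xs), folds))
        tr_x = [x for i, x in enumerate(xs) if i not in val_idx]
        tr_y = [y for i, y in enumerate(ys) if i not in val_idx]
        va_x = [xs[i] for i in sorted(val_idx)]
        va_y = [ys[i] for i in sorted(val_idx)]
        tr_x, va_x = standardize(tr_x, va_x)
        hits = sum(knn_predict(tr_x, tr_y, q, k, p) == y for q, y in zip(va_x, va_y))
        scores.append(hits / len(va_x))
    return mean(scores)


# Synthetic two-class data (hypothetical), shuffled once before splitting.
rng = random.Random(0)
data = [([rng.gauss(c * 3, 1), rng.gauss(c * 3, 1)], c) for c in (0, 1) for _ in range(50)]
rng.shuffle(data)
xs, ys = [d[0] for d in data], [d[1] for d in data]

# Sweep k and the distance type; keep the setting with the best validation score.
results = {(k, p): cross_val_accuracy(xs, ys, k, p) for k in (1, 3, 5, 7) for p in (1, 2)}
best_k, best_p = max(results, key=results.get)
print(f"best k={best_k}, distance=L{best_p}, val acc={results[(best_k, best_p)]:.3f}")
```

Note that the normalization statistics are computed on the training folds only, so the validation fold stays a fair stand-in for unseen data.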

Further Reading

Here are some (optional) links you may find interesting for further reading:

A Few Useful Things to Know about Machine Learning, where section 6 is especially relevant, but the whole paper is warmly recommended reading.
Recognizing and Learning Object Categories, a short course on object categorization at ICCV 2005.


Dimensionality reduction: t-SNE, PCA, Random Projection




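A regular neural network uses only fully connected layers, which leads to too many parameters. A convolutional neural network exploits the fact that its input is an image: it arranges its layers in three dimensions, adds convolutional and pooling layers, and relies mainly on three properties.

  1. Local receptive fields. Pay attention to the stride, the filter size, and the amount of zero-padding.
  2. Shared weights, which reduce the number of parameters. A feature is learned once and applied across the whole image, because images have a translation-invariant structure.
  3. Pooling, which further reduces the number of parameters.

A CNN is a sequence of layers; the most common are convolutional, pooling, fully connected, and ReLU layers. Each layer takes a 3-D volume as input and produces a 3-D volume as output through a differentiable function. Some layers have no parameters (ReLU, pooling), and some have no hyperparameters. Hyperparameters are values you set yourself, such as the hyperparameter of a Gaussian kernel or the learning rate. A convolutional layer in effect learns a filter that activates when it sees a particular feature in the image, such as an edge or corner at a particular orientation. A convolutional layer usually has many filters, so it can learn many image features. The size of the receptive field is a hyperparameter; the receptive field is local in width and height but always spans the full depth of the input.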


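Three hyperparameters control the size of the output volume:

  1. The depth of the output volume, which equals the number of filters we want to learn; each filter looks for a different feature in the image.
  2. The stride.
  3. Zero-padding: padding the border of the image with zeros.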



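The output size along each spatial dimension is (W - F + 2P)/S + 1, where W is the input size, F is the filter size, S is the stride, and P is the amount of zero-padding.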
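
The standard relation between these four quantities is that the output size is (W - F + 2P)/S + 1. A small helper makes it easy to check a layer configuration before building it; the function name is hypothetical, and treating a non-integer result as an error is a design choice of this sketch rather than something every framework does.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size of a conv (or pooling) layer along one dimension."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError(f"filter does not tile the input: W={w}, F={f}, S={s}, P={p}")
    return span // s + 1


print(conv_output_size(227, 11, s=4, p=0))  # 55: large first-layer filter, stride 4
print(conv_output_size(32, 3, s=1, p=1))    # 32: 3x3 filter with padding 1 keeps the size
print(conv_output_size(224, 2, s=2, p=0))   # 112: 2x2 pooling with stride 2 halves the size
```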



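Stacked convolutional layers can build up more complex features before a pooling layer applies its destructive (approximating) downsampling.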


INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
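
To make the pattern concrete, the sketch below expands it into a flat list of layer names, reading N, M and K as repeat counts and POOL? as an optional pooling layer. The function is purely illustrative and builds no actual network.

```python
def layer_pattern(n, m, k, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC."""
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("FC")  # final classifier layer
    return layers


print(" -> ".join(layer_pattern(n=2, m=2, k=1)))
```

With n=2, m=2, k=1 this gives two blocks of two CONV/RELU pairs, each followed by a POOL, and then FC -> RELU -> FC.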

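Filters are generally kept small and stacked over several layers, rather than using one large filter directly. The drawbacks of the latter are:

  1. A single large filter computes only a linear function of its local input, whereas a stack of small filters with non-linearities in between can learn non-linear features.
  2. A single large filter has more parameters.

In short, the stack gives more expressive power with fewer parameters, at the cost of more memory, because backpropagation has to keep more intermediate results. Note that although a convolutional layer has far fewer parameters than a fully connected layer, it takes up significantly more memory, because it has many more connected units. (There has been some recent progress on replacing the linear layer-to-layer connections with more complex connectivity structures.)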
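
A quick calculation backs up the parameter argument. Assuming stride 1, the same number of channels C going in and out of every layer, and ignoring biases (all assumptions of this sketch), three stacked 3x3 layers see the same 7x7 region as one 7x7 layer but use 27*C^2 weights instead of 49*C^2.

```python
def stacked_receptive_field(num_layers, f=3):
    """Receptive field of `num_layers` stacked FxF convs with stride 1."""
    return 1 + num_layers * (f - 1)


def conv_weights(f, channels, num_layers=1):
    """Weight count (biases ignored) for convs with `channels` in and out."""
    return num_layers * (f * f * channels) * channels


C = 64  # hypothetical channel count
print(stacked_receptive_field(3, f=3))   # 7: same view of the input as one 7x7 filter
print(conv_weights(3, C, num_layers=3))  # 110592 = 27 * C^2
print(conv_weights(7, C))                # 200704 = 49 * C^2
```

The stack has to keep three activation maps for backpropagation instead of one, which is the memory cost mentioned above.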

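In practice, it is best to use a network that performs well on ImageNet. These networks perform very well, so download a pretrained network and train it further on your own data, rather than designing and training a network yourself.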



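  1. The input layer should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512.
  2. Convolutional layers should preserve the spatial size of their input. Stack several small convolutional layers rather than using one large one; large filters are generally used only in the first convolutional layer.
  3. The most common pooling layer is 2x2 max pooling, which discards 75% of the activations. Pooling windows larger than 3x3 work poorly.
  4. Prefer a stride of 1, so that convolutional layers change only the depth of the data, and leave spatial downsampling to the pooling layers.
  5. Use zero-padding, both to preserve the spatial size and to avoid losing information at the borders.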
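
Two of the sizing rules above are easy to check by hand: 2x2 max pooling with stride 2 keeps one value out of every four, and for an odd filter size F at stride 1 a zero-padding of (F - 1)/2 preserves the spatial size. The sketch below works on a plain nested list with even height and width; the helper names are hypothetical.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 on a 2-D list with even height and width."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]


def same_padding(f):
    """Zero-padding that keeps the spatial size for an odd FxF filter at stride 1."""
    return (f - 1) // 2


activations = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(activations)
print(pooled)  # [[4, 2], [2, 8]]

kept = sum(len(row) for row in pooled) / sum(len(row) for row in activations)
print(f"fraction of activations kept: {kept:.2f}")  # 0.25, i.e. 75% discarded

print(same_padding(3), same_padding(5))  # 1 2
```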

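Memory usage can be estimated by counting the values in each volume and multiplying by 4 (a single-precision float is 4 bytes, a double is 8 bytes). GPU memory is generally small, 12GB at most. When memory is tight, you can use a larger filter (with a larger stride) to shrink the spatial dimensions and reduce memory usage; this is generally done in the first layer.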


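  1. LeNet (the first).
  2. AlexNet (ILSVRC 2012 winner): a much deeper network with convolutional layers stacked directly on top of each other; before this, a convolutional layer was usually followed immediately by a pooling layer.
  3. ZF Net (ILSVRC 2013 winner): AlexNet with modified hyperparameters; it enlarged the middle convolutional layers and made the stride and filter size of the first layer smaller.
  4. GoogLeNet (ILSVRC 2014 winner): introduced the Inception module, which dramatically reduced the number of parameters (4M, versus 60M for AlexNet), and used average pooling instead of fully connected layers at the top of the network, which removed many more parameters. (Most recent version: Inception-v4.)
  5. VGGNet (ILSVRC 2014 runner-up): a very deep network with very small filters; its drawback is that it uses a lot of memory.
  6. ResNet (ILSVRC 2015 winner, the best choice in practice at the time of writing): skip connections, batch normalization, and no fully connected layers at the end.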

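Some proposals drop pooling layers altogether and use convolutions with a larger stride to shrink the representation instead. Dropping pooling layers has also proved very important for generative models such as VAEs or GANs, and it seems likely that future architectures will use fewer pooling layers. Pooling usually uses F=2, S=2 or F=3, S=2; other settings are uncommon. Max pooling is the usual choice; average pooling and L2 pooling exist but are rarely used.

There are also normalization layers, which generally do not help much in practice.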


INPUT: [224x224x3]        memory:  224*224*3=150K   weights: 0
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64]  memory:  112*112*64=800K   weights: 0
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128]  memory:  56*56*128=400K   weights: 0
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256]  memory:  28*28*256=200K   weights: 0
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512]  memory:  14*14*512=100K   weights: 0
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512]  memory:  7*7*512=25K  weights: 0
FC: [1x1x4096]  memory:  4096  weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096]  memory:  4096  weights: 4096*4096 = 16,777,216
FC: [1x1x1000]  memory:  1000 weights: 4096*1000 = 4,096,000

TOTAL memory: ~15M * 4 bytes ~= 61MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
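
The per-layer rows above can be re-derived with a few lines of Python, which is a handy sanity check on the totals. This sketch assumes what the rows imply: 3x3 convolutions with stride 1 and padding 1, 2x2 pooling with stride 2, and no bias terms. Summing the rows this way gives about 15.2M activations (roughly 61 MB at 4 bytes each) and about 138M weights.

```python
# (layer type, output channels) in the order listed above; "P" is a 2x2 pool.
VGG16 = [("C", 64), ("C", 64), ("P", None), ("C", 128), ("C", 128), ("P", None),
         ("C", 256), ("C", 256), ("C", 256), ("P", None),
         ("C", 512), ("C", 512), ("C", 512), ("P", None),
         ("C", 512), ("C", 512), ("C", 512), ("P", None)]
FC_SIZES = [4096, 4096, 1000]


def tally(side=224, depth=3):
    """Return (activation count, weight count), biases ignored."""
    activations = side * side * depth  # the input volume itself
    weights = 0
    for kind, out_depth in VGG16:
        if kind == "C":  # 3x3 conv, stride 1, pad 1: spatial size unchanged
            weights += 3 * 3 * depth * out_depth
            depth = out_depth
        else:            # 2x2 max pool, stride 2: halves height and width
            side //= 2
        activations += side * side * depth
    fan_in = side * side * depth  # 7*7*512 values feed the first FC layer
    for size in FC_SIZES:
        weights += fan_in * size
        activations += size
        fan_in = size
    return activations, weights


acts, params = tally()
print(f"activations: {acts:,} (~{acts * 4 / 1e6:.0f} MB at 4 bytes each)")
print(f"weights:     {params:,}")
```

Most of the activation memory sits in the first few convolutional layers, while most of the weights sit in the first fully connected layer.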

The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of. Because the parameters, data, gradients, and optimizer state all have to be stored as well, a common estimate is that memory usage is at least 3 times the size of the network. If memory runs short, the most common fix is to reduce the batch size.
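
A back-of-the-envelope estimate shows why the batch size is the usual knob to turn. The sketch below is only a rough model built on assumptions: activations are doubled to account for the backward pass, parameters are tripled to cover gradients and optimizer state, values are 4-byte floats, and everything else (data buffers, framework overhead) is ignored. The counts are roughly VGG-16-sized and the function names are hypothetical.

```python
def training_memory_gb(acts_per_image, num_params, batch_size,
                       bytes_per_value=4, act_factor=2, param_factor=3):
    """Rough training-memory estimate in GB (decimal); all factors are assumptions."""
    act_bytes = acts_per_image * act_factor * batch_size * bytes_per_value
    param_bytes = num_params * param_factor * bytes_per_value
    return (act_bytes + param_bytes) / 1e9


def largest_batch(acts_per_image, num_params, budget_gb, start=256):
    """Halve the batch size until the estimate fits within `budget_gb`."""
    batch = start
    while batch > 1 and training_memory_gb(acts_per_image, num_params, batch) > budget_gb:
        batch //= 2
    return batch


ACTS, PARAMS = 15_000_000, 138_000_000  # roughly VGG-16-sized, for illustration
for b in (16, 64, 256):
    print(b, round(training_memory_gb(ACTS, PARAMS, b), 2))  # 3.58, 9.34, 32.38
print(largest_batch(ACTS, PARAMS, budget_gb=12))             # 64
```

Under these assumptions the activation term grows linearly with the batch size while the parameter term stays fixed, so halving the batch is the cheapest way to fit a given GPU.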