• [✔️] R-CNN (Slow)
• [✔️] SPP (Spatial Pyramid Pooling)-Net (Feature Map, SPP Layer)
• [✔️] Fast R-CNN (Single SPP, End-to-End)
• [✔️] Faster R-CNN (Region proposal Layer, anchor)
• [✔️] YOLO (You Only Look Once)
• [➖] SSD
• [✔️] Selective Search (Raw)
• [✔️] Non-maximum suppression (Suppress overlap)

1. Object detection (not just classification; localize the objects of interest with bounding boxes)
2. Object segmentation (beyond detection, also output the object's contour)
3. Image classification
4. Image captioning
5. Image depth estimation

### Object Detection

1. R-CNN
2. SPP-Net
3. Fast R-CNN
4. Faster R-CNN
5. YOLO (You Only Look Once)
6. SSD (Single Shot Detector)

### 2014: R-CNN

The goal of R-CNN: given an input image, correctly localize the main objects in the image with bounding boxes.

1. Propose roughly 2000 regions most likely to contain objects of interest.
2. Warp each region to the fixed input size the CNN expects.
3. Feed the CNN features (a 4096-d embedding) to SVMs for classification.
4. Use a linear regression model to tighten the bounding box.

• Fine-tune the network with a softmax classifier (log loss)
• Train post-hoc linear SVMs (hinge loss)
• Train post-hoc bounding-box regressors (squared loss)

#### Training

1. The SVMs are one-vs-all classifiers: N+1 linear classifiers are trained, one per class plus one for the background.
2. Training does not feed in all ~2000 proposals at once. Positive RoIs (those with sufficiently large overlap with a ground-truth region) are trained on first; negative samples are then fed in for another round of training.
3. Each linear classifier's parameters are a 4096-d weight vector plus a bias term.
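The scoring side of this setup can be sketched in NumPy. Everything below is an illustrative stand-in (random features and weights, 2 object classes plus background), not trained R-CNN parameters:

```python
import numpy as np

def score_rois(features, W, b):
    """Score each RoI feature against N+1 one-vs-all linear classifiers.

    features: (num_rois, 4096) CNN features, one row per region proposal.
    W:        (num_classes + 1, 4096) weight vectors; the last row is background.
    b:        (num_classes + 1,) bias terms.
    Returns the raw SVM scores and the argmax class per RoI.
    """
    scores = features @ W.T + b          # (num_rois, num_classes + 1)
    return scores, scores.argmax(axis=1)

# Toy example: 3 RoIs, 2 object classes + background, 4096-d features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4096))
W = rng.normal(size=(3, 4096))
b = np.zeros(3)
scores, labels = score_rois(feats, W, b)
```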

#### Testing and Visualization

1. After generating many RoIs, classify them; discard RoIs with low classification confidence (0.5 is a common threshold) and those classified as background. Non-maximum suppression is needed here so that each object is detected only once (avoiding duplicate detections).
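Greedy non-maximum suppression can be sketched in a few lines of NumPy; the boxes and scores below are toy values:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) as [x1, y1, x2, y2]
    scores: (N,) detection confidences
    Returns indices of the surviving boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the second box overlaps the first and is suppressed
```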

1. The usual evaluation metric is mAP (mean Average Precision): for each class, plot the PR (precision-recall) curve, compute the area under it, and average these areas over all classes (see the paper for details).
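A hedged sketch of the per-class AP computation, using the simple discrete sum over true positives (the exact interpolation rules in the PASCAL VOC protocol differ slightly); mAP is then just the mean of these per-class APs:

```python
import numpy as np

def average_precision(scores, is_positive, num_gt):
    """Area under the precision-recall curve for one class.

    scores:      (N,) detection confidences for this class
    is_positive: (N,) 1 if the detection matches a ground-truth box, else 0
    num_gt:      total number of ground-truth boxes for this class
    """
    order = np.argsort(scores)[::-1]          # rank detections by confidence
    tp = np.asarray(is_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # AP: average of the precision values at each true-positive rank
    return float(np.sum(precision * tp) / num_gt)

# Toy example: 4 detections, 2 ground-truth boxes.
ap = average_precision(np.array([0.9, 0.8, 0.7, 0.6]),
                       np.array([1, 0, 1, 0]), num_gt=2)
```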

The PR results are shown below.

Results using 200 ROIs (this number is too low to get good accuracy but for demo purposes allows for fast training and scoring):

| Dataset | | | | | | Mean AP |
|---|---|---|---|---|---|---|
| Training set | 0.91 | 0.76 | 0.46 | 0.81 | … | 0.62 |
| Test set | 0.64 | 1.00 | 0.64 | 1.00 | … | 0.62 |

Results using 2000 ROIs:

| Dataset | | | | | | Mean AP |
|---|---|---|---|---|---|---|
| Training set | 1.00 | 0.76 | 1.00 | 1.00 | … | 0.89 |
| Test set | 1.00 | 0.55 | 0.64 | 1.00 | … | 0.88 |

#### Selective Search Implementation

Goals:

1. Detect objects at any scale.
- Hierarchical algorithms are good at this.
2. Consider multiple grouping criteria.
- Detect differences in color, texture, brightness, etc.
3. Be fast.

Step 1: Generate initial sub-segmentation
Goal: Generate many regions, each of which belongs to at most one object.

Step 2: Recursively combine similar regions into larger ones.
Greedy algorithm:

1. From set of regions, choose two that are most similar.
2. Combine them into a single, larger region.
3. Repeat until only one region remains.

This yields a hierarchy of successively larger regions, just like we want.
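The greedy merging loop can be sketched as follows. The similarity matrix here is a toy stand-in, and the update rule for the merged region is a placeholder (the real algorithm recomputes similarities from the merged region's colour/texture histograms):

```python
import numpy as np

def greedy_merge(similarity):
    """Greedily merge the most-similar pair of regions until one remains.

    similarity: symmetric (N, N) matrix of pairwise region similarities.
    Returns the merge order as a list of (i, j) pairs (the hierarchy).
    """
    sim = similarity.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)            # never merge a region with itself
    active = list(range(len(sim)))
    merges = []
    while len(active) > 1:
        # pick the most similar pair of still-active regions
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda p: sim[p])
        merges.append((i, j))
        # the merged region takes over row/column i; placeholder similarity update
        sim[i, :] = np.minimum(sim[i, :], sim[j, :])
        sim[:, i] = sim[i, :]
        active.remove(j)                      # region j is absorbed
    return merges

sim = np.array([[0.0, 0.9, 0.1],
                [0.9, 0.0, 0.2],
                [0.1, 0.2, 0.0]])
order = greedy_merge(sim)   # merges the most similar pair (0, 1) first
```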

Step 3: Use the generated regions to produce candidate object locations.

Goals:

1. Use multiple grouping criteria.
2. Lead to a balanced hierarchy of small to large objects.
3. Be efficient to compute: should be able to quickly combine measurements in two regions.

Two-pronged approach:

1. Choose a color space that captures interesting things.
- Different color spaces have different invariants, and different
responses to changes in color.
2. Choose a similarity metric for that space that captures everything we’re interested in: color, texture, size, and shape.

• RGB; its drawback is that a change in brightness affects all three channels.

• HSV(hue, saturation, value), This color space describes colors (hue or tint) in terms of their shade (saturation or amount of gray) and their brightness value.

• Lab uses a lightness channel and two color channels (a and b). It’s calibrated to be perceptually uniform. Like HSV, it’s also somewhat invariant to changes in brightness and shadow.

• Colour similarity

For each colour channel, build a 25-bin histogram, giving a 75-dimensional colour descriptor per region. Colour similarity is then computed with the following formula:

$$s_{colour}(r_i,r_j)=\sum^n_{k=1}\min(c_i^k,c_j^k)$$
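A sketch of the colour descriptor and the histogram-intersection similarity above, assuming 8-bit pixel values (the per-region pixel arrays below are synthetic):

```python
import numpy as np

def colour_histogram(region_pixels, bins=25):
    """75-d colour descriptor: a 25-bin histogram per channel, L1-normalised.

    region_pixels: (num_pixels, 3) array of pixel values in [0, 255].
    """
    hists = [np.histogram(region_pixels[:, c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def colour_similarity(h_i, h_j):
    """Histogram intersection: the sum of per-bin minima (the formula above)."""
    return float(np.minimum(h_i, h_j).sum())

rng = np.random.default_rng(1)
h1 = colour_histogram(rng.integers(0, 256, size=(500, 3)))
h2 = colour_histogram(rng.integers(0, 256, size=(400, 3)))
s = colour_similarity(h1, h2)   # in [0, 1]; 1 means identical histograms
```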

• Texture similarity

HOG-like features can be used to measure texture similarity, as follows:

1. For each channel, compute Gaussian derivatives of the image in 8 orientations.
2. Build a 10-bin histogram per orientation and channel, giving a 240-dimensional feature in total (3 channels × 8 orientations × 10 bins).

• Size similarity

$$s_{size}(r_i,r_j)=1-\frac{size(r_i)+size(r_j)}{size(im)}$$

• Fill (how well two regions fit into each other)

$$fill(r_i,r_j) = 1-\frac{size(BB_{ij})-size(r_i)-size(r_j)}{size(im)}$$

The four measures are combined as a weighted sum:

$$s(r_i,r_j) = a_1s_{colour}(r_i,r_j)+a_2s_{texture}(r_i,r_j)+a_3s_{size}(r_i,r_j)+a_4s_{fill}(r_i,r_j)$$
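The size and fill terms can be implemented directly from the formulas above (the boxes and areas below are toy values; the colour and texture terms would come from the histogram similarities described earlier):

```python
def size_similarity(size_i, size_j, size_im):
    """s_size: encourages small regions to merge early (formula above)."""
    return 1.0 - (size_i + size_j) / size_im

def fill_similarity(box_i, box_j, size_i, size_j, size_im):
    """fill: how well regions r_i and r_j fit into each other.

    box_*: [x1, y1, x2, y2] bounding boxes; BB_ij is the tight enclosing box.
    """
    x1 = min(box_i[0], box_j[0]); y1 = min(box_i[1], box_j[1])
    x2 = max(box_i[2], box_j[2]); y2 = max(box_i[3], box_j[3])
    bb = (x2 - x1) * (y2 - y1)                # size of BB_ij
    return 1.0 - (bb - size_i - size_j) / size_im

def combined_similarity(s_colour, s_texture, s_size, s_fill, a=(1, 1, 1, 1)):
    """Weighted sum s(r_i, r_j) of the four components."""
    return a[0] * s_colour + a[1] * s_texture + a[2] * s_size + a[3] * s_fill

# Toy 100x100 image (area 10000) with two adjacent 10x10 regions.
s_size = size_similarity(100, 100, 10000)
s_fill = fill_similarity([0, 0, 10, 10], [10, 0, 20, 10], 100, 100, 10000)
```

Two adjacent regions tile their enclosing box exactly, so the fill term comes out as 1.0 here.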

Proposal quality is measured with the Average Best Overlap (ABO) between the ground-truth boxes $G^c$ of a class and the generated locations $L$:

$$ABO=\frac{1}{|G^c|}\sum_{g_i^c\in G^c}\max_{l_j\in L}Overlap(g_i^c,l_j)$$

#### Summary

Although R-CNN performs well, it is very slow: not only is training slow, but even with a trained model, detecting objects in one image can take minutes, which is hard to call practical. Many improvements followed.

### 2014: SPP-Net

SPP-Net differs from R-CNN mainly in that it first runs the whole image through the convolutional layers once to produce a feature map (the conv5 features), and then projects each selective-search proposal directly onto that feature map, avoiding repeated computation. Because the proposals have different sizes, their mapped features have different shapes, so a Spatial Pyramid Pooling (SPP) layer pools each of them to a fixed-size feature, which is then fed to a fully connected layer. As in R-CNN, an SVM and a linear regressor are still trained on top, so three separate models must still be trained. The overall pipeline is as described above.

SPP-Net's main improvement is a large speed-up in detection, i.e. it makes testing fast; the drawback is that the convolutional layers that extract the features cannot be trained (fine-tuned) through the SPP layer.
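A minimal NumPy sketch of an SPP layer with a {1, 2, 4} pyramid; the grid sizes and channel count are illustrative, and a real implementation derives its pooling windows from the feature-map size in the same spirit:

```python
import numpy as np

def spp_layer(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling: max-pool an HxWxC map to a fixed-length vector.

    feature_map: (H, W, C) conv features for one region (H, W vary per region).
    levels: grid sizes of the pyramid; output length = sum(n*n for n) * C.
    """
    H, W, C = feature_map.shape
    pooled = []
    for n in levels:
        # split the map into an n x n grid and max-pool each cell
        h_edges = np.linspace(0, H, n + 1).astype(int)
        w_edges = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[h_edges[i]:max(h_edges[i+1], h_edges[i]+1),
                                   w_edges[j]:max(w_edges[j+1], w_edges[j]+1)]
                pooled.append(cell.max(axis=(0, 1)))
    return np.concatenate(pooled)

# Two regions of different spatial size yield vectors of identical length.
v1 = spp_layer(np.random.rand(13, 9, 256))
v2 = spp_layer(np.random.rand(7, 21, 256))
```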

### 2015: Fast R-CNN

1. RoI (Region of Interest) Pooling

The RoI pooling layer is essentially a simplified SPP-Net: instead of the multi-level pyramid SPP-Net applies to each proposal, it simply downsamples each proposal to a single 7×7 feature map. For VGG16, conv5_3 has 512 feature maps, so every region proposal becomes a 7×7×512 = 25088-dimensional feature vector that is fed to the fully connected layers.
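A sketch of RoI pooling in NumPy, assuming the RoI is already given in feature-map coordinates (real implementations also handle the image-to-feature-map scaling and run on the GPU):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """Max-pool one region of a conv feature map to a fixed 7x7xC grid.

    feature_map: (H, W, C), e.g. conv5_3 of VGG16 with C = 512.
    roi: [x1, y1, x2, y2] in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w, C = region.shape
    out = np.empty((output_size, output_size, C))
    h_edges = np.linspace(0, h, output_size + 1).astype(int)
    w_edges = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            cell = region[h_edges[i]:max(h_edges[i+1], h_edges[i]+1),
                          w_edges[j]:max(w_edges[j+1], w_edges[j]+1)]
            out[i, j] = cell.max(axis=(0, 1))
    return out

fmap = np.random.rand(38, 50, 512)
pooled = roi_pool(fmap, [5, 10, 33, 30])
flat = pooled.reshape(-1)   # 7*7*512 = 25088-d input to the FC layers
```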

2. Merging the separate models into a single network

Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe).

### 2016: Faster R-CNN

The core idea of the RPN is to generate region proposals directly with a convolutional network; the underlying method is essentially a sliding window. The design is clever: the RPN only has to slide once over the final convolutional layer, because the anchor mechanism and bounding-box regression produce region proposals at multiple scales and aspect ratios.

Faster R-CNN works to combat the somewhat complex training pipeline that both R-CNN and Fast R-CNN exhibited. The authors insert a region proposal network (RPN) after the last convolutional layer. This network is able to just look at the last convolutional feature map and produce region proposals from that. From that stage, the same pipeline as R-CNN is used (ROI pooling, FC, and then classification and regression heads).
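Anchor generation can be sketched as follows; the stride, scales, and ratios below match the commonly cited Faster R-CNN defaults, but treat them as illustrative:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(scales) * len(ratios) anchors at every feature-map cell.

    Returns (feat_h * feat_w * k, 4) boxes as [x1, y1, x2, y2] in image
    coordinates; each anchor has area scale**2 and aspect ratio w/h = ratio.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell centre
            for s in scales:
                for r in ratios:
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    anchors.append([cx - w/2, cy - h/2, cx + w/2, cy + h/2])
    return np.array(anchors)

a = generate_anchors(3, 4)    # 3*4 positions x 9 anchors = 108 boxes
```

The RPN then classifies each anchor as object/background and regresses box offsets relative to it.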

### 2016: YOLO

YOLO is a convolutional network that predicts multiple box locations and classes in a single pass, achieving end-to-end detection and recognition; its biggest advantage is speed. In essence, it treats detection as a regression problem, so the CNN needs no elaborate design. Instead of sliding windows or proposal extraction, YOLO trains directly on whole images, which helps it distinguish objects from background; by contrast, the proposal-trained Fast R-CNN often misdetects background regions as specific objects. Of course, YOLO trades some accuracy for this speed. The YOLO detection pipeline is as follows:

1. Given an input image, first divide it into a 7×7 grid.
2. For each grid cell, predict 2 bounding boxes (including, for each box, a confidence that it contains an object, plus the cell's probabilities over the classes).
3. The previous step yields 7×7×2 = 98 candidate windows; remove the low-probability ones by thresholding, then let NMS remove the redundant windows. The whole process is very simple: no intermediate region-proposal step is needed to find objects, and a direct regression determines both location and class.
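Decoding the output tensor can be sketched as below. The tensor layout (B boxes of 5 numbers, then C class probabilities per cell) is one common convention, and the toy tensor is hand-filled:

```python
import numpy as np

def decode_yolo(pred, conf_threshold=0.2, S=7, B=2, num_classes=20):
    """Turn the S x S x (B*5 + C) output tensor into candidate detections.

    Each cell predicts B boxes (x, y, w, h, confidence) plus C class
    probabilities shared by its boxes; score = box confidence * class prob.
    """
    detections = []
    for i in range(S):
        for j in range(S):
            cell = pred[i, j]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                cls = int(class_probs.argmax())
                score = conf * class_probs[cls]
                if score >= conf_threshold:
                    detections.append((i, j, b, cls, float(score)))
    return detections   # NMS would then remove the redundant windows

pred = np.zeros((7, 7, 2 * 5 + 20))
pred[3, 3, 4] = 0.9          # confidence of box 0 in cell (3, 3)
pred[3, 3, 10 + 5] = 0.8     # probability of class 5 in that cell
dets = decode_yolo(pred)
```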

### Generating Image Descriptions

Using this training data, a deep neural network “infers the latent alignment between segments of the sentences and the region that they describe” (quote from the paper).

### Resources

Conferences

• CVPR - Computer Vision and Pattern Recognition
• ICCV - International Conference on Computer Vision
• ECCV - European Conference on Computer Vision
• BMVC - British Machine Vision Conference
• ICIP - IEEE International Conference on Image Processing

Textbooks

• Computer Vision: A Modern Approach (2nd Edition) by David A. Forsyth and Jean Ponce
• Computer Vision by Linda G. Shapiro and George C. Stockman
• Computer Vision: Algorithms and Applications by Richard Szeliski
• Algorithms for Image Processing and Computer Vision by J. R. Parker
• Computer Vision: Models, Learning, and Inference by Dr Simon J. D. Prince
• Computer and Machine Vision, Fourth Edition: Theory, Algorithms, Practicalities by E. R. Davies

Beginner Books

• Programming Computer Vision with Python: Tools and algorithms for analyzing images by Jan Erik Solem
• Practical Computer Vision with SimpleCV : The Simple Way to Make Technology See by Kurt Demaagd, Anthony Oliver, Nathan Oostendorp, and Katherine Scott
• OpenCV Computer Vision with Python by Joseph Howse
• Learning OpenCV: Computer Vision with the OpenCV Library by Gary Bradski and Adrian Kaehler
• OpenCV 2 Computer Vision Application Programming Cookbook by Robert Laganière
• Mastering OpenCV with Practical Computer Vision Projects by Daniel Lélis Baggio, Shervin Emami, David Millán Escrivá, Khvedchenia Ievgen, Jason Saragih, and Roy Shilkrot
• SciPy and NumPy: An Overview for Developers by Eli Bressert

Python Libraries

When I first became interested in computer vision and image search engines over eight
years ago, I had no idea where to start. I didn’t know which language to use, I didn’t
know which libraries to install, and the libraries I found I didn’t know how to use. I WISH
there had been a list like this one, detailing the best Python libraries to use for image
processing, computer vision, and image search engines.
This list is by no means complete or exhaustive. It’s just my favorite Python libraries that
I use each and every day for computer vision and image search engines. If you think that
I’ve left an important one out, please leave me an email at adrian@pyimagesearch.com.

NumPy
NumPy is a library for the Python programming language that (among other things)
provides support for large, multi-dimensional arrays. Why is that important? Using
NumPy, we can express images as multi-dimensional arrays. Representing images as
NumPy arrays is not only computationally and resource efficient, but many other image
processing and machine learning libraries use NumPy array representations as well.
Furthermore, by using NumPy’s built-in high-level mathematical functions, we can
quickly perform numerical analysis on an image.

SciPy
Going hand-in-hand with NumPy, we also have SciPy. SciPy adds further support for
scientific and technical computing. One of my favorite sub-packages of SciPy is the
spatial package which includes a vast amount of distance functions and a kd-tree
implementation. Why are distance functions important? When we “describe” an image,
we perform feature extraction. Normally after feature extraction an image is represented
by a vector (a list) of numbers. In order to compare two images, we rely on distance
functions, such as the Euclidean distance. To compare two arbitrary feature vectors, we
simply compute the distance between their feature vectors. In the case of the Euclidean
distance, the smaller the distance the more “similar” the two images are.
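For example, with plain NumPy (the feature vectors here are toy 3-bin histograms):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

# Two 3-bin "descriptors": the closer pair is the more similar pair of images.
d_close = euclidean([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])   # identical features
d_far = euclidean([0.5, 0.3, 0.2], [0.1, 0.1, 0.8])
```

`scipy.spatial.distance.euclidean` computes the same thing, along with many other distance functions.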

matplotlib
Simply put, matplotlib is a plotting library. If you’ve ever used MATLAB before, you’ll
probably feel very comfortable in the matplotlib environment. When analyzing images,
we'll make use of matplotlib: whether we're plotting the overall accuracy of search systems
or simply viewing the image itself, matplotlib is a great tool to have in your toolbox.

PIL and Pillow
These two packages are good at what they do: simple image manipulations, such as
resizing, rotation, etc. If you need to do some quick and dirty image manipulations,
definitely check out PIL and Pillow, but if you're serious about learning about image
processing, computer vision, and image search engines, I would highly recommend that
you move on to the libraries below.

OpenCV
If NumPy’s main goal is large, efficient, multi-dimensional array representations, then,
by far, the main goal of OpenCV is real-time image processing. This library has been
around since 1999, but it wasn’t until the 2.0 release in 2009 that we saw the incredible
NumPy support. The library itself is written in C/C++, but Python bindings are provided
when running the installer. OpenCV is hands down my favorite computer vision library,
but it does have a learning curve. Be prepared to spend a fair amount of time learning
the intricacies of the library and browsing the docs (which have gotten substantially
better now that NumPy support has been added). If you are still testing the computer
vision waters, you might want to check out the SimpleCV library mentioned below,
which has a substantially smaller learning curve.

SimpleCV
The goal of SimpleCV is to get you involved in image processing and computer vision
as soon as possible. And they do a great job at it. The learning curve is substantially
smaller than that of OpenCV, and as their tagline says, “it’s computer vision made
easy”. That all said, because the learning curve is smaller, you don’t have access to as
many of the raw, powerful techniques supplied by OpenCV. If you’re just testing the
waters, definitely try this library out.

mahotas
Mahotas, just like OpenCV and SimpleCV, relies on NumPy arrays. Much of the
functionality implemented in Mahotas can be found in OpenCV and/or SimpleCV, but in
some cases, the Mahotas interface is just easier to use, especially when it comes to
their features package.

scikit-learn
Alright, you got me, Scikit-learn isn’t an image processing or computer vision library —
it’s a machine learning library. That said, you can’t have advanced computer vision
techniques without some sort of machine learning, whether it be clustering, vector
quantization, classification models, etc. Scikit-learn also includes a handful of image
feature extraction functions as well.

scikit-image
Scikit-image is fantastic, but you have to know what you are doing to effectively use this
library – and I don’t mean this in a “there is a steep learning curve” type of way. The
learning curve is actually quite low, especially if you check out their gallery. The
algorithms included in scikit-image (I would argue) follow closer to the state-of-the-art in
computer vision. New algorithms right from academic papers can be found in
scikit-image, but in order to (effectively) use these algorithms, you need to have developed
some rigor and understanding in the computer vision field. If you already have some
experience in computer vision and image processing, definitely check out scikit-image;
otherwise, I would continue working with OpenCV and SimpleCV to start.

ilastik
I’ll be honest. I’ve never used ilastik. But through my experiences at computer vision
conferences, I’ve met a fair amount of people who do, so I felt compelled to put it in this
list. Ilastik is mainly for image segmentation and classification and is especially geared
towards the scientific community.

pprocess
Extracting features from images is inherently a parallelizable task. You can reduce the
amount of time it takes to extract features from an entire dataset by using a
multithreading/multitasking library. My favorite is pprocess, due to the simple nature I
need it for, but you can use your favorite.

h5py
The h5py library is the de-facto standard in Python to store large numerical datasets.
The best part? It provides support for NumPy arrays. So, if you have a large dataset
represented as a NumPy array, and it won’t fit into memory, or if you want efficient,
persistent storage of NumPy arrays, then h5py is the way to go. One of my favorite
techniques is to store my extracted features in an h5py dataset and then apply
scikit-learn’s MiniBatchKMeans to cluster the features. The entire dataset never has to be
entirely loaded off disk at once and the memory footprint is extremely small, even for
thousands of feature vectors.