Computer Vision Technology Survey Report

Published 2017-07-14, filed under Algorithms and Hardware

This material was collected from the internet. If it infringes your rights, please press ALT+F4.

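0x01 Basic Deep Learning Frameworks

1. Open-source frameworks, algorithms, and papers for face detection, tracking, recognition, and 3D modeling

(1) Face detection

Two face detectors are in common use. The first is Tiny Face, which can detect fairly small faces. A good first step is to get its demo running.
The second is more widely used and works well in fixed scenarios such as video calls. It also outputs facial landmarks.
Paper: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks (Code). Many recent startups use it, but it needs to be re-trained if your scenario is unusual.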

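(2) Face tracking

Face tracking is no longer a frontier research problem, so there are no papers on it at CVPR or ICCV.
OpenCV has a fairly good tutorial on face tracking: http://opencv-java-tutorials.readthedocs.io/en/latest/06-face-detection-and-tracking.html
In engineering practice, however, the usual approach is still to run the best available face detector (for example, Tiny Face) on every frame and then link the detections together with optical flow. Face detection is now fast enough that tracking is no longer needed as a speed-up. We can discuss the details in person.
My student found an open-source library (for reference only, not recommended), but it is not backed by a formal paper: https://github.com/kylemcdonald/ofxFaceTracker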
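
The detect-then-link approach described above can be sketched roughly as follows. This is a minimal illustration, not a production tracker: it assumes the per-track motion `(dx, dy)` has already been estimated by some optical flow routine (for example OpenCV's `calcOpticalFlowPyrLK`), and the names `link_detections`, `min_iou`, and `next_id` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def shift(box, flow):
    """Move a box by the (dx, dy) motion that optical flow estimated for it."""
    dx, dy = flow
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def link_detections(tracks, flows, detections, min_iou=0.3, next_id=0):
    """Assign track ids to the face boxes detected in the current frame.

    tracks:     {track_id: box} from the previous frame
    flows:      {track_id: (dx, dy)}, assumed to come from an optical flow step
    detections: list of boxes from the per-frame detector
    Returns the updated {track_id: box} and the next unused track id.
    """
    predicted = {tid: shift(box, flows.get(tid, (0.0, 0.0)))
                 for tid, box in tracks.items()}
    # Score every (track, detection) pair and visit the best overlaps first.
    pairs = sorted(
        ((iou(pbox, det), tid, j)
         for tid, pbox in predicted.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    updated, used_tracks, used_dets = {}, set(), set()
    for overlap, tid, j in pairs:
        if overlap < min_iou:
            break  # remaining pairs overlap even less
        if tid in used_tracks or j in used_dets:
            continue
        updated[tid] = detections[j]
        used_tracks.add(tid)
        used_dets.add(j)
    # Detections that matched no existing track start new tracks.
    for j, det in enumerate(detections):
        if j not in used_dets:
            updated[next_id] = det
            next_id += 1
    return updated, next_id
```

Greedy matching is used here only to keep the sketch short. Tracks that find no detection are simply dropped, which a real system would probably soften with a grace period.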

(3) Face recognition

Most major face recognition papers report results on LFW: http://vis-www.cs.umass.edu/lfw/results.html The main thing to look at is Table 6, "Mean classification accuracy û and standard error of the mean SE".
Many of the top-ranked methods have no public code, though. Startups currently mostly use [1], which I have guided others in using. [2] also has a good reputation but is not the latest. So my suggestion is to take [1] and train it on a large dataset. I will update this when better open-source code appears.

[1] Yandong Wen, Kaipeng Zhang, Zhifeng Li, Yu Qiao. A Discriminative Feature Learning Approach for Deep Face Recognition. ECCV 2016. (Code)
[2] Brandon Amos, Bartosz Ludwiczuk, Mahadev Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. CMU-CS-16-118, CMU School of Computer Science, 2016. (Code)
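
For readers who have not seen [1], its key ingredient is a "center loss" that pulls each face feature toward the center of its identity class, used alongside the usual softmax loss. The following is a plain-Python sketch of that idea as we understand it from the paper, not the authors' code. The function names and the `alpha` default are illustrative assumptions.

```python
def center_loss(features, labels, centers):
    """0.5 * sum of squared distances between each feature and its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total


def update_centers(features, labels, centers, alpha=0.5):
    """Move every class center toward the mini-batch features of that class.

    delta_c = sum(c - x over class members) / (1 + number of members)
    c_new   = c - alpha * delta_c
    """
    new_centers = {}
    for cls, c in centers.items():
        members = [x for x, y in zip(features, labels) if y == cls]
        delta = [sum(c[d] - x[d] for x in members) / (1 + len(members))
                 for d in range(len(c))]
        new_centers[cls] = [c[d] - alpha * delta[d] for d in range(len(c))]
    return new_centers
```

In training, this term would be added to the softmax loss with a small weight, so that features stay separable across identities while becoming compact within each identity.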

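(4) 3D face modeling

This is a fairly cutting-edge area, so there are essentially no public datasets and no public code.

The work generally regarded as good in the field is:
Paper: Real-time Facial Animation on Mobile Devices (Code)

The following two are landmark-based:
Paper: 3D Shape Regression for Real-time Facial Animation (Code)

Paper: Face Alignment Across Large Poses: A 3D Solution (Code)
Note: this problem has many pitfalls, and people keep their code private in order to make money from it. At present Zhou Kun at Zhejiang University has mature technology. If you need to purchase it, I can get in touch.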

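2. Algorithms and papers for object detection, object localization, object detection in video, scene classification, and scene analysis

(1) Object detection and object localization

Object detection comprises two parts: object localization and object recognition. Traditionally, you first locate where an object is and then recognize what it is (cat, dog, car). Since Faster R-CNN, however, localization and recognition are done jointly. The main open-source implementations today are SSD, Faster R-CNN, and YOLO. Each has its strengths and weaknesses: SSD and Faster R-CNN have higher recall, while YOLO has higher precision. Overall, if I had to pick one, I would recommend YOLO. Note that object detection is currently evaluated by mAP, but a difference of a few mAP points says little about how well a detector works in practice. Our group has written a detailed write-up on object detection (delivered together with this document).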
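
To make the recall and precision comparison above concrete, here is a small sketch of how the two numbers are typically computed for one image and one class, by matching detections to ground-truth boxes through IoU. The 0.5 threshold and the function names are illustrative assumptions, and benchmark-specific rules (difficult objects, per-class averaging into mAP) are left out.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def precision_recall(detections, ground_truth, iou_thr=0.5):
    """detections: list of (box, score); ground_truth: list of boxes."""
    matched = set()
    tp = 0
    # Visit detections from most to least confident; each ground-truth box
    # can be claimed by at most one detection.
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_j, best_iou = None, iou_thr
        for j, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp      # detections that hit nothing
    fn = len(ground_truth) - tp    # objects that were missed
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    return precision, recall
```

A detector tuned for recall tends to emit more boxes (fewer misses, more false positives), while one tuned for precision emits fewer. This is one reason why a small mAP gap alone does not settle which detector suits a given product.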

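Another paper that has just come out is Mask R-CNN, which outperforms current object detectors. The paper is here: https://arxiv.org/abs/1703.06870

A discussion in Chinese is here: https://www.zhihu.com/question/57403701 However, no code is available yet, and our group is also working on reproducing it.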

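(2) Object detection in video

In academia this is called multi-object tracking. The basic idea is to train tracking and object detection jointly. The main algorithms can be found on these two sites:
This one is dedicated to multi-object tracking: https://motchallenge.net/
It lists a number of papers.
The other site is: http://www.cvlibs.net/datasets/kitti/eval_tracking.php
It targets self-driving cars, but the principles are much the same, and you can use their models.

For code, I recommend the following work by a friend of mine: http://yuxng.github.io/
Both of these papers have code:
Learning to Track: Online Multi-Object Tracking by Decision Making.
Subcategory-aware Convolutional Neural Networks for Object Proposals and Detection. (No model is released for this one. If you need it, I can send it to you for research use, but for commercial use please contact the authors.)

Multi-object tracking methods also tend to be slow. If you need something faster, you can run YOLO on every frame and then do a simple reimplementation of this paper: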
Seq-NMS for Video Object Detection https://arxiv.org/abs/1602.08465
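
As a rough guide to what such a reimplementation involves, here is a simplified sketch of the Seq-NMS idea: link boxes in adjacent frames when they overlap enough, pick the chain with the highest summed score by dynamic programming, rescore the chain with its average, suppress overlapping boxes, and repeat. The thresholds, the averaging choice, and all names are assumptions for illustration. Scores are assumed positive, and details of the paper may differ.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def seq_nms(frames, link_iou=0.5, nms_iou=0.3):
    """frames: one list per frame of (box, score), single class.

    Returns one list per frame of (box, rescored_score) for the kept boxes.
    """
    boxes = [[b for b, _ in dets] for dets in frames]
    scores = [[s for _, s in dets] for dets in frames]
    alive = [[True] * len(dets) for dets in frames]
    kept = [[] for _ in frames]

    while any(any(row) for row in alive):
        # best[t][i]: highest summed score of a linked chain ending at box i of frame t.
        best, back = [], []
        for t in range(len(frames)):
            best_t, back_t = [], []
            for i in range(len(boxes[t])):
                top, arg = 0.0, None
                if alive[t][i] and t > 0:
                    for j in range(len(boxes[t - 1])):
                        if (alive[t - 1][j] and best[t - 1][j] > top
                                and iou(boxes[t - 1][j], boxes[t][i]) > link_iou):
                            top, arg = best[t - 1][j], j
                best_t.append(scores[t][i] + top if alive[t][i] else float("-inf"))
                back_t.append(arg)
            best.append(best_t)
            back.append(back_t)

        # Pick the end of the best chain and walk the back-pointers.
        t, i = max(
            ((t, i) for t in range(len(frames)) for i in range(len(boxes[t])) if alive[t][i]),
            key=lambda ti: best[ti[0]][ti[1]],
        )
        path = []
        while i is not None:
            path.append((t, i))
            t, i = t - 1, back[t][i]

        # Rescore the chain with its average, then suppress overlaps per frame.
        avg = sum(scores[t][i] for t, i in path) / len(path)
        for t, i in path:
            kept[t].append((boxes[t][i], avg))
            for k in range(len(boxes[t])):
                if alive[t][k] and (k == i or iou(boxes[t][k], boxes[t][i]) > nms_iou):
                    alive[t][k] = False
    return kept
```

The useful effect is that a weak detection sitting on a strong temporal chain gets its score lifted, which is how this kind of post-processing recovers objects that a per-frame detector scores poorly in some frames.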

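(3) Scene classification and scene analysis

Scene classification is essentially an image classification problem: label your images with scene classes and train a classifier on them. The best classifiers at the moment are two that came after ResNet: ResNeXt, a later work involving Kaiming He, and Inception-v4.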

• Aggregated Residual Transformations for Deep Neural Networks, (Code)
• Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (Code)

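The mainstream scene classification dataset is the MIT scene dataset, which you can put to good use for pre-training.
There is a demo here that is quite cool and that you can use directly: http://places.csail.mit.edu/demo.html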

3. Algorithms and papers for image scene description

In academia this is called image captioning. COCO maintains a leaderboard for it, where you can find the top-ranked papers and code. However, the evaluation metrics are widely criticized: a higher score does not necessarily mean better captions, so you should still judge the output yourself. http://mscoco.org/dataset/#captions-leaderboard

Here are a few representative ones that have code:

• Paper: Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Show and Tell: A Neural Image Caption Generator, arXiv:1411.4555. (Code)
• Paper: Andrej Karpathy, Li Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR, 2015. (Code)
• Paper: Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio, Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention, arXiv:1502.03044 / ICML 2015 (Code)
• Paper: Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel, Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, arXiv:1411.2539. (Code)
• Paper: Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell, Long-term Recurrent Convolutional Networks for Visual Recognition and Description, arXiv:1411.4389 (Code)
• Paper: Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C. Lawrence Zitnick, Exploring Nearest Neighbor Approaches for Image Captioning, arXiv:1505.04467. Code: none available, but it is fairly simple to implement.

4. Algorithms and papers for video scene description

I recommend a few papers that have code:

• Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko, Sequence to Sequence–Video to Text, arXiv:1505.00487. (Code)
• Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Translating Videos to Natural Language Using Deep Recurrent Neural Networks, arXiv:1412.4729. (Code)
• Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, Describing Videos by Exploiting Temporal Structure, arXiv:1502.08029 (Code)

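5. Training set sizes, training times, and achievable results for face, object, and scene algorithms, plus some parameter tuning and training tips

• Face dataset LFW, http://vis-www.cs.umass.edu/lfw/ 6,000 face pairs, used for verification (deciding whether a face belongs to a given person)
• FDDB, http://vis-www.cs.umass.edu/fddb/ 2,800 images, used for detection and testing
• CelebA http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html more than 200,000 images, used for alignment, detection, and attribute analysis
• AFLW https://lrs.icg.tugraz.at/research/aflw/ more than 23,000 images with detection annotations, often used for training
• MegaFace http://megaface.cs.washington.edu/ 1 million images, covering both face recognition and verification (currently a popular dataset)
• Chinese Academy of Sciences dataset http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html 500,000 face verification images of more than 10,000 people
• Object detection datasets: for small-scale validation you can use the VOC dataset http://host.robots.ox.ac.uk/pascal/VOC/ which has on the order of ten thousand images
• For large-scale work there are two datasets, ImageNet http://image-net.org/ and COCO http://mscoco.org/ both on the order of a million images
• Tuning tips are hard to state in general because they depend on the specific situation. The main knob is the learning rate: start with a relatively large value, then reduce it gradually once training has stabilized.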
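
The learning-rate advice in the last item can be sketched as a small "reduce on plateau" helper: keep the rate while the validation loss is still improving, and shrink it once the loss has stopped moving for a few epochs. The class name and the defaults (`factor`, `patience`, `min_delta`) are illustrative assumptions, not recommended values.

```python
class ReduceOnPlateau:
    """Shrink the learning rate when the validation loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_delta=1e-3):
        self.lr = lr
        self.factor = factor        # multiply lr by this when training has stalled
        self.patience = patience    # stalled epochs to tolerate before shrinking
        self.min_delta = min_delta  # smallest loss drop that counts as progress
        self.best = float("inf")
        self.stale = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the lr to use next."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr
```

Most deep learning frameworks ship an equivalent scheduler, so in practice this logic rarely needs to be written by hand.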

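6. Algorithms for text detection in natural scenes

A large collection of algorithms and papers on this problem is here: https://github.com/chongyangtao/Awesome-Scene-Text-Recognition
Going through these papers and their code will cover most of what you need. The best work in this area is by Xiang Bai at Huazhong University of Science and Technology. If you need to purchase the technology, I can make contact.

For code, you can try: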
https://github.com/baidu-research/warp-ctc
https://github.com/bgshih/crnn

These two papers are worth a look:
Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition.
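
Both papers rely on CTC to turn per-frame character predictions into a text string. The decoding side is simple enough to sketch: take the most likely symbol at each time step, collapse consecutive repeats, and drop blanks. This is the greedy (best-path) decoder only. Placing the blank symbol at index 0 is an assumption of this sketch, and `ctc_greedy_decode` is a hypothetical name.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_probs: one list per time step with probabilities over
                 [blank] + alphabet (blank at index 0 by assumption)
    alphabet:    string of output characters, alphabet[k] has class index k + 1
    """
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], blank
    for idx in best_path:
        # Emit a character only when it is not blank and not a repeat of the
        # previous frame; a blank between two equal characters keeps both.
        if idx != blank and idx != prev:
            decoded.append(alphabet[idx - 1])
        prev = idx
    return "".join(decoded)
```

For example, the frame-level path `a a _ a b b _` (with `_` as blank) decodes to `aab`.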

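7. Real-time detection and re-identification in natural scenes with many people (and with occlusion, such as occluded faces)

Academia has a dedicated topic for this, called person re-identification. The best work in this area is by Wei-Shi Zheng at Sun Yat-sen University. If you need to purchase the technology, I can help you get in touch. My survey follows (I recommend the first entry):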

[1] Xiao T, Li H, Ouyang W, et al. Learning deep feature representations with domain guided dropout for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1249-1258., (Code)

[2] Yang Yang, Longyin Wen, Siwei Lyu, Stan Z. Li, Unsupervised Learning of Multi-Level Descriptors for Person Re-Identification, Association for the Advancement of Artificial Intelligence (AAAI), San Francisco, California, USA, 2017, (Code): need email to author

[3] Matsukawa T, Okabe T, Suzuki E, et al. Hierarchical gaussian descriptor for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1363-1372., (Code)

[4] You J, Wu A, Li X, et al. Top-push video-based person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1345-1353. , (Code)

[5] Zheng Z, Zheng L, Yang Y. A Discriminatively Learned CNN Embedding for Person Re-identification[J]. arXiv preprint arXiv:1611.05666, 2016., (Code)

[6] Ahmed E, Jones M, Marks T K. An improved deep learning architecture for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3908-3916., (Code)

[7] Liao S, Hu Y, Zhu X, et al. Person re-identification by local maximal occurrence representation and metric learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2197-2206., (Code)

[8] Zheng L, Wang S, Tian L, et al. Query-adaptive late fusion for image search and person re-identification[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1741-1750. ((Code) google drive, (Code) baidu pan)

[9] Yang Y, Yang J, Yan J, et al. Salient color names for person re-identification[C]//European Conference on Computer Vision. Springer International Publishing, 2014: 536-551., (Code): need email to author

[10] Bazzani L, Cristani M, Murino V. Symmetry-driven accumulation of local features for human characterization and re-identification[J]. Computer Vision and Image Understanding, 2013, 117(2): 130-144., (Code)

[11] Xiong F, Gou M, Camps O, et al. Person re-identification using kernel-based metric learning methods[C]//European conference on computer vision. Springer International Publishing, 2014: 1-16., (Code)

[12] Zhao R, Ouyang W, Wang X. Unsupervised salience learning for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013: 3586-3593., (Code)

[13] Farenzena M, Bazzani L, Perina A, et al. Person re-identification by symmetry-driven accumulation of local features[C]//Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010: 2360-2367. , (Code)
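
Whatever feature extractor is chosen from the list above, the re-identification step itself usually comes down to ranking a gallery of known people by feature distance to the query. The sketch below shows that ranking and the common rank-1 accuracy measure. Cosine distance and the function names are assumptions for illustration, and the features would come from one of the listed models.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def rank_gallery(query_feature, gallery):
    """gallery: list of (person_id, feature). Returns ids, closest first."""
    ranked = sorted(gallery, key=lambda item: cosine_distance(query_feature, item[1]))
    return [pid for pid, _ in ranked]


def rank1_accuracy(queries, gallery):
    """queries: list of (person_id, feature). Fraction whose top match is correct."""
    hits = sum(1 for pid, feat in queries if rank_gallery(feat, gallery)[0] == pid)
    return hits / len(queries)
```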

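0x02 Related Questions

1. Face recognition in low-resolution video (below 32×32)
A: There is usually not much that can be done here. You can try deblocking or denoising as a preprocessing step, which gives a small improvement.

2. Face recognition in video with occlusion (where the occluder moves)
A: If the occlusion is not large, deep learning is fairly robust to it.

3. When existing face detectors are not accurate enough, is there a way to combine deep learning with traditional face detection methods to get a fast and efficient approach?
A: If the face region is fixed (for example, face login), you can first extract deep learning features and then apply AdaBoost on top of them to produce the detection boxes. Fast R-CNN is also more or less usable. See: To Boost or Not to Boost? On the Limits of Boosted Trees for Object Detection

4. When the training data is incomplete (for example, it contains no one wearing glasses), how can recognition stay unaffected if the person being recognized wears glasses or other accessories? (For glasses, is there any approach other than locating the eyes and adding occlusion during data augmentation?)
A: There are ways to handle this. The usual one is to use deep learning to generate a glasses-free image and then run recognition on it. See: Robust Deep Auto-encoder for Occluded Face Recognition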

0x03 Academic Frontier

1. Current frontier research directions, algorithmic frameworks, and related papers in image and video

At present the frontier roughly covers these directions:

(1) Generative adversarial models:

• Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks, NIPS, 2014. (the earliest paper)
• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets, arXiv:1411.1784 [cs.LG] (a well-known paper)
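
For orientation, the two papers above are built around the following minimax objectives, reproduced here in LaTeX as a reading aid. $D$ is the discriminator, $G$ the generator, $z$ the input noise, and $y$ the conditioning information in the conditional variant.

```latex
% Generative adversarial network objective
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]

% Conditional variant: both networks also receive y
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x \mid y)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]
```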

These are all recent works:

• Jost Tobias Springenberg, “Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks”, ICLR 2016
• Harrison Edwards, Amos Storkey, “Censoring Representations with an Adversary”, ICLR 2016.
• Jun-Yan Zhu, Philipp Krahenbuhl, Eli Shechtman, and Alexei A. Efros, “Generative Visual Manipulation on the Natural Image Manifold”, ECCV 2016.
• Mixing Convolutional and Adversarial Networks
  ◦ Alec Radford, Luke Metz, Soumith Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016.

(2) Deep reinforcement learning

• Reinforcement Learning Course by David Silver
• Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
• The book is closely related to Silver’s course; you can read or skip the corresponding chapters as you take the course

Some early papers on Deep Reinforcement Learning (DQN)
• Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013, December 20). Playing Atari with Deep Reinforcement Learning. arXiv.org. SJTU Machine Vision and Intelligence Group page 7
• Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

Some recent papers
• Wang, Z., de Freitas, N., & Lanctot, M. (2015). Dueling Network Architectures for Deep Reinforcement Learning. CoRR.
• van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-Learning. AAAI.
• Hausknecht, M. J., & Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI.
• Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., & Munos, R. (2015). Increasing the Action Gap - New Operators for Reinforcement Learning. CoRR, cs.AI.
• Osband, I., Blundell, C., Pritzel, A., & Van Roy, B. (2016, February 15). Deep Exploration via Bootstrapped DQN. arXiv.org.
• Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized Experience Replay. CoRR.
• Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., et al. (2016, February 5). Asynchronous Methods for Deep Reinforcement Learning. arXiv.org.
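
As a pointer into this list, the difference between the original DQN target and the Double Q-learning target comes down to how the next action is chosen. The sketch below computes both from plain lists of Q-values. The function names and the `gamma` default are illustrative assumptions.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN: y = r + gamma * max_a Q_target(s', a)."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    The online network picks the action and the target network scores it,
    which reduces the over-estimation caused by taking a max over noisy values.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]
```

With online values `[1.0, 2.0]` and target values `[5.0, 3.0]`, the plain DQN target bootstraps from 5.0, whereas the Double DQN target bootstraps from 3.0 because the online network prefers the second action.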

Some application-oriented papers
• Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
• Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in Neural Information Processing Systems (pp. 2204-2212).
• Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., & Farhadi, A. (2016, September 17). Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning. arXiv.org.

(3) Human pose estimation

Two representative papers:
• Haoshu Fang, Shuqin Xie, Cewu Lu, RMPE: Regional Multi-person Pose Estimation, arXiv:1612.00137 [cs.CV]
• Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, Realtime Multi-person 2D Pose Estimation using Part Affinity Fields, CVPR 2017

(4) Visual question answering

Some papers in this area:

• Xiong, Caiming, Stephen Merity, and Richard Socher. “Dynamic Memory Networks for Visual and Textual Question Answering.” arXiv:1603.01417 (2016).
• Mateusz Malinowski, Marcus Rohrbach, Mario Fritz, Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, arXiv:1505.01121.
• Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, VQA: Visual Question Answering, CVPR 2015 SUNw: Scene Understanding Workshop.
• Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu, Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, arXiv:1505.05612.
• Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han, Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction, arXiv:1511.05765
• Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2015). Stacked Attention Networks for Image Question Answering. arXiv:1511.02274.
• Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Multimodal Residual Learning for Visual QA, arXiv:1606.01455
• Hyeonwoo Noh and Bohyung Han, Training Recurrent Answering Units with Joint Loss Minimization for VQA, arXiv:1606.03647
• Jin-Hwa Kim, Kyoung Woon On, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Hadamard Product for Low-rank Bilinear Pooling, arXiv:1610.04325.

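(5) Mask R-CNN

Another paper that has just come out is Mask R-CNN, which is also getting a lot of attention.

The paper is here: https://arxiv.org/abs/1703.06870
A discussion in Chinese is here: https://www.zhihu.com/question/57403701
However, no code is available yet, and our group is also working on reproducing it.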