

0x01 Deep learning: basic frameworks


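(1) Face detection

Two face detectors are commonly used. One is Tiny Face, which can detect fairly small faces; it is worth getting its demo running first. The other is:
Paper: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks (Code). Many recent startups use this one, but it needs to be re-trained if your scenario is unusual.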

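(2) Face tracking

Because this is no longer a frontier research problem, CVPR and ICCV no longer carry papers on it.
OpenCV has a fairly good tutorial on face tracking: http://opencv-java-tutorials.readthedocs.io/en/latest/06-face-detection-and-tracking.html
In engineering practice, however, the usual approach is still to run the best available face detector (e.g. Tiny Face) on every frame and then link the detections across frames with optical flow. Face detection is now fast enough that there is no need to use tracking to speed things up. We can discuss the details in person.
My student found an open-source library (for reference only; not recommended for use), although it is not backed by a formal paper: https://github.com/kylemcdonald/ofxFaceTracker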
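
The detect-on-every-frame-then-link approach described above can be sketched roughly as follows. This is a minimal illustration, not a production tracker: it assumes that a face detector has already produced boxes for the current frame and that a per-track displacement `(dx, dy)` has already been estimated by some optical-flow routine (for example OpenCV's `calcOpticalFlowPyrLK`). The function names, the greedy matching rule, and the `min_iou` threshold are all hypothetical choices made for this sketch.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shift(box, flow):
    """Move a box by an optical-flow displacement (dx, dy)."""
    dx, dy = flow
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def link_detections(tracks, flows, detections, next_id, min_iou=0.3):
    """Link the current frame's detections to existing tracks.

    tracks:     dict track_id -> box in the previous frame
    flows:      dict track_id -> (dx, dy), assumed to come from optical flow
    detections: list of boxes found by the detector in the current frame
    Returns (new_tracks, next_id). Unmatched tracks are dropped and
    unmatched detections start new tracks.
    """
    # Predict where each tracked face should be in the current frame.
    predicted = {tid: shift(box, flows.get(tid, (0.0, 0.0)))
                 for tid, box in tracks.items()}
    # Greedy matching: consider the best-overlapping pairs first.
    pairs = sorted(
        ((iou(p, d), tid, j)
         for tid, p in predicted.items()
         for j, d in enumerate(detections)),
        reverse=True,
    )
    new_tracks, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break
        if tid in new_tracks or j in used:
            continue
        new_tracks[tid] = detections[j]
        used.add(j)
    # Every detection that matched nothing starts a new track.
    for j, d in enumerate(detections):
        if j not in used:
            new_tracks[next_id] = d
            next_id += 1
    return new_tracks, next_id
```

Calling `link_detections` once per frame, feeding its output back in as `tracks`, keeps a stable ID for each face as long as the detector keeps finding it. A real system would also need to tolerate a few missed detections before dropping a track.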

(3) Face recognition

Most face recognition papers report results on LFW: http://vis-www.cs.umass.edu/lfw/results.html The main thing to look at is Table 6, "Mean classification accuracy û and standard error of the mean SE".
However, many of the top-ranked methods have no released code. Startups currently tend to use [1], and I have guided others in using it. [2] also has a good reputation but is not the newest. So I suggest you simply use [1] and train it on a large dataset. I will update this when better open-source code comes out.

[1] Yandong Wen, Kaipeng Zhang, Zhifeng Li, Yu Qiao. A Discriminative Feature Learning Approach for Deep Face Recognition. ECCV 2016. (Code)
[2] Brandon Amos, Bartosz Ludwiczuk, Mahadev Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. CMU-CS-16-118, CMU School of Computer Science, 2016. (Code)

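(4) 3D face modeling


Paper: Real-time Facial Animation on Mobile Devices (Code)

Paper: 3D Shape Regression for Real-time Facial Animation (Code)

Paper: Face Alignment Across Large Poses: A 3D Solution (Code)
Note: this problem has many pitfalls, and people tend to keep their code private so that they can make money from it. At present, Zhou Kun at Zhejiang University has mature technology; if you need to purchase it, I can get in touch.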


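(1) Object detection and object localization

Object detection comprises two parts: object localization and object recognition. Traditionally, you first locate the object and then recognize what it is (cat, dog, car). Since Faster R-CNN, however, localization and recognition have been done jointly. The main open-source implementations today are SSD, Faster R-CNN, and YOLO. Each has strengths and weaknesses: SSD and Faster R-CNN have higher recall, while YOLO has higher precision. Overall, if I had to pick one, I would recommend YOLO. Note that object detection is currently evaluated with mAP, but a difference of a few mAP points says little about which detector works better in practice. Our group has written a detailed introduction to object detection (delivered together with this document).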
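
Since the paragraph above leans on precision, recall, and mAP, here is a minimal sketch of how average precision is typically computed for one class, and how mAP averages it over classes. It assumes the matching of detections to ground-truth boxes has already been done, so each detection arrives as a `(score, is_true_positive)` pair. The function names are hypothetical, and the area-under-the-precision-envelope rule used here is one common convention; benchmarks differ in the details.

```python
def average_precision(detections, num_gt):
    """AP for one class.

    detections: list of (score, is_true_positive) pairs
    num_gt:     number of ground-truth objects of this class
    """
    if num_gt == 0:
        return 0.0
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class: dict class_name -> (detections, num_gt)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

For example, three detections scored 0.9 (correct), 0.8 (wrong), and 0.7 (correct) against two ground-truth objects give an AP of 5/6. A single extra false positive ranked high is enough to move the number by several points, which is one reason small mAP gaps are hard to interpret.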

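There is also the newly released Mask R-CNN, which outperforms existing object detectors. The paper is here: https://arxiv.org/abs/1703.06870

A discussion in Chinese is here: https://www.zhihu.com/question/57403701 However, no code has been released yet; our group is also working on reproducing it.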

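(2) Object detection in video

In academia this is called multi-object tracking; the basic idea is to train tracking and object detection jointly. The main algorithms can be found on these two sites:
This one is dedicated to multi-object tracking: https://motchallenge.net/
The other site is: http://www.cvlibs.net/datasets/kitti/eval_tracking.php

If you want code, I recommend the following, which is my friend's work: http://yuxng.github.io/
Learning to Track: Online Multi-Object Tracking by Decision Making.
Subcategory-aware Convolutional Neural Networks for Object Proposals and Detection. (No model is released for this paper. If you need it, I can send it to you for research use, but for commercial use please contact the authors.)

In addition, multi-object tracking methods are all fairly slow. If you need something faster, you can run YOLO on every frame and then do a simple reimplementation of the following paper:
Seq-NMS for Video Object Detection https://arxiv.org/abs/1602.08465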
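
As a rough guide to what such a reimplementation involves, the sketch below covers only the core of the idea as I understand it: link boxes in adjacent frames when they overlap enough, pick the chain with the highest total score by dynamic programming, and then rescore every box on that chain with the chain's mean score. It is a simplified illustration, not the paper's full procedure (the suppression step and the outer loop over remaining boxes are left out). All names and the `link_iou` threshold are assumptions made for this sketch, and per-frame detections are assumed to have positive scores.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def best_sequence(frames, link_iou=0.5):
    """Find the highest-scoring chain of linked boxes.

    frames: list (one entry per frame) of lists of (box, score)
    Returns a list of (frame_index, box_index) in temporal order.
    """
    # best[t][i]: best total score of any chain ending at box i of frame t.
    best = [[score for _, score in frame] for frame in frames]
    prev = [[None] * len(frame) for frame in frames]
    for t in range(1, len(frames)):
        for i, (box, score) in enumerate(frames[t]):
            for j, (prev_box, _) in enumerate(frames[t - 1]):
                candidate = best[t - 1][j] + score
                if iou(prev_box, box) > link_iou and candidate > best[t][i]:
                    best[t][i] = candidate
                    prev[t][i] = j
    ends = [(best[t][i], t, i)
            for t in range(len(frames)) for i in range(len(frames[t]))]
    if not ends:
        return []
    _, t, i = max(ends)
    # Walk the back-pointers from the best end box to the start of the chain.
    path = []
    while i is not None:
        path.append((t, i))
        i = prev[t][i]
        t -= 1
    return path[::-1]


def rescore(frames, path):
    """Give every box on the chain the chain's mean score (in place)."""
    mean = sum(frames[t][i][1] for t, i in path) / len(path)
    for t, i in path:
        box, _ = frames[t][i]
        frames[t][i] = (box, mean)
    return mean
```

The practical effect is that a weak detection sandwiched between two confident detections of the same object gets pulled up, so it is less likely to be discarded by a score threshold.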

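(3) Scene classification and scene analysis

Scene classification is essentially an image classification problem: label each image with its scene class and train a classifier on those labels. As for the best classifiers at present, two have improved on ResNet; they are listed below. The first, ResNeXt, is a later work by Kaiming He and colleagues.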

Aggregated Residual Transformations for Deep Neural Networks, (Code)
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (Code)

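The mainstream scene classification dataset is the MIT scene dataset, which you can put to good use for pre-training.
Here is a demo, which is pretty cool and which you can use directly: http://places.csail.mit.edu/demo.html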


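In academia this is called image captioning. COCO maintains a leaderboard for it, where you can find the top-ranked papers and code. However, the evaluation metrics are widely criticized: a high score does not necessarily mean good results, so you will need to judge the output yourselves. http://mscoco.org/dataset/#captions-leaderboard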
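
To make the complaint about metrics concrete, here is a small sketch of clipped n-gram precision, the building block of BLEU-style overlap scores. The text above does not say which metric it has in mind, so treating it as an n-gram overlap metric is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate caption.

    candidate:  caption string produced by the model
    references: list of human-written caption strings
    """
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram only earns credit up to that reference count.
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

With the reference "a cat sits on a table", the scrambled candidate "on a table a cat sits" gets a unigram precision of 1.0 and a bigram precision of 0.8, even though it is not a sentence anyone would write. That is the kind of gap between score and perceived quality that makes it worth reading the generated captions directly.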


• Paper: Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Show and Tell: A Neural Image Caption Generator, arXiv:1411.4555. (Code)
• Paper: Andrej Karpathy, Li Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR, 2015. (Code)
• Paper: Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio, Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention, arXiv:1502.03044 / ICML 2015 (Code)
• Paper: Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel, Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, arXiv:1411.2539. (Code)
• Paper: Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell, Long-term Recurrent Convolutional Networks for Visual Recognition and Description, arXiv:1411.4389 (Code)
• Paper: Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C. Lawrence Zitnick, Exploring Nearest Neighbor Approaches for Image Captioning, arXiv:1505.04467. Code: none available, but it is fairly simple to implement.



• Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko, Sequence to Sequence–Video to Text, arXiv:1505.00487. (Code)
• Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Translating Videos to Natural Language Using Deep Recurrent Neural Networks, arXiv:1412.4729. (Code)
• Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, Describing Videos by Exploiting Temporal Structure, arXiv:1502.08029 (Code)


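Face datasets:
LFW http://vis-www.cs.umass.edu/lfw/ 6,000 face pairs, used for verification (deciding whether a face belongs to a given person)
FDDB http://vis-www.cs.umass.edu/fddb/ 2,800 images, used for detection testing
CelebA http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html more than 200,000 images, used for alignment, detection, and attribute analysis
AFLW https://lrs.icg.tugraz.at/research/aflw/ more than 23,000 images with detection annotations; often used for training
MegaFace http://megaface.cs.washington.edu/ 1 million images, covering both face recognition and verification (currently a popular dataset)
Chinese Academy of Sciences dataset http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html 500,000 face images for verification, from more than 10,000 people
Object detection datasets: for small-scale validation you can use the VOC dataset http://host.robots.ox.ac.uk/pascal/VOC/ which contains on the order of tens of thousands of images
• For large-scale work there are two datasets: ImageNet http://image-net.org/ and COCO http://mscoco.org/ Both are at the million-image scale.
• Tuning tips: this depends on the specific situation, and it is hard to give a general rule. The main factor is the learning rate: make it fairly large at the start, then reduce it gradually once training has become fairly stable.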
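
The learning-rate advice above (start large, reduce once training stabilizes) can be sketched as a small plateau-based schedule. This is one possible reading of that advice, not a recommended recipe: the class name, the halving factor, the patience of three evaluations, and the use of validation loss as the stability signal are all assumptions chosen for illustration.

```python
class PlateauDecay:
    """Keep the learning rate high until the loss stops improving, then shrink it."""

    def __init__(self, lr=0.1, factor=0.5, patience=3, min_lr=1e-5):
        self.lr = lr                # current learning rate
        self.factor = factor        # multiplier applied on each reduction
        self.patience = patience    # evaluations without improvement to wait
        self.min_lr = min_lr        # floor for the learning rate
        self.best = float("inf")    # best validation loss seen so far
        self.stale = 0              # evaluations since the last improvement

    def step(self, val_loss):
        """Report a new validation loss and get back the learning rate to use."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale = 0
        return self.lr
```

In a training loop you would call `step` once per epoch with the validation loss and pass the returned value to the optimizer. Most deep learning frameworks ship an equivalent scheduler, which is usually a better choice than a hand-written one.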

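6 Algorithms for text detection in natural scenes

For this problem, a large collection of algorithms and papers is available here: https://github.com/chongyangtao/Awesome-Scene-Text-Recognition
Reading through these papers and their code should cover most of what you need. The strongest work in this area comes from Xiang Bai at Huazhong University of Science and Technology. If you need to purchase the technology, I can get in touch.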


Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition.
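
Both papers above rely on CTC to turn per-frame predictions into a label sequence of a different length. The sketch below shows only the simplest decoding rule (best path, also called greedy decoding): take the most likely symbol in each frame, merge consecutive repeats, then drop the blank symbol. It is an illustration with hypothetical names; the assumption that index 0 is the blank is a convention chosen here, and it does not cover CTC training or beam-search decoding.

```python
from itertools import groupby


def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_probs: list of per-frame probability lists, one value per symbol
    alphabet:    list of symbols; alphabet[blank] is the CTC blank
    """
    # Most likely symbol index in each frame.
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # Merge runs of the same index, then remove blanks.
    collapsed = [index for index, _ in groupby(best)]
    return "".join(alphabet[index] for index in collapsed if index != blank)
```

The blank is what allows a genuinely repeated character to survive: the frame-level path a, a, blank, a, b, b decodes to "aab", whereas without the blank the three a's would collapse into one.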

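7 Real-time detection and re-identification in natural scenes with many people (including occlusion, e.g. occluded faces)

Academia has a dedicated topic for this, called person re-identification. The strongest work in this area comes from Weishi Zheng at Sun Yat-sen University. If you need to purchase the technology, I can help you get in touch. My survey is below (I recommend the first entry):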

[1] Xiao T, Li H, Ouyang W, et al. Learning deep feature representations with domain guided dropout for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1249-1258., (Code)

[2] Yang Yang, Longyin Wen, Siwei Lyu, Stan Z. Li, Unsupervised Learning of Multi-Level Descriptors for Person Re-Identification, Association for the Advancement of Artificial Intelligence (AAAI), San Francisco, California, USA, 2017
(Code): need email to author

[3] Matsukawa T, Okabe T, Suzuki E, et al. Hierarchical gaussian descriptor for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1363-1372., (Code)

[4] You J, Wu A, Li X, et al. Top-push video-based person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1345-1353. , (Code)

[5] Zheng Z, Zheng L, Yang Y. A Discriminatively Learned CNN Embedding for Person Re-identification[J]. arXiv preprint arXiv:1611.05666, 2016., (Code)

[6] Ahmed E, Jones M, Marks T K. An improved deep learning architecture for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3908-3916., (Code)

[7] Liao S, Hu Y, Zhu X, et al. Person re-identification by local maximal occurrence representation and metric learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2197-2206., (Code)

[8] Zheng L, Wang S, Tian L, et al. Query-adaptive late fusion for image search and person re-identification[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1741-1750. ((Code) google drive, (Code) baidu pan)

[9] Yang Y, Yang J, Yan J, et al. Salient color names for person re-identification[C]//European Conference on Computer Vision. Springer International Publishing, 2014: 536-551., (Code): need email to author

[10] Bazzani L, Cristani M, Murino V. Symmetry-driven accumulation of local features for human characterization and re-identification[J]. Computer Vision and Image Understanding, 2013, 117(2): 130-144., (Code)

[11] Xiong F, Gou M, Camps O, et al. Person re-identification using kernel-based metric learning methods[C]//European conference on computer vision. Springer International Publishing, 2014: 1-16., (Code)

[12] Zhao R, Ouyang W, Wang X. Unsupervised salience learning for person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013: 3586-3593., (Code)

[13] Farenzena M, Bazzani L, Perina A, et al. Person re-identification by symmetry-driven accumulation of local features[C]//Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010: 2360-2367. , (Code)
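
Whatever method from the list above produces the features, re-identification at test time usually comes down to ranking a gallery of known people by distance to a query and checking how high the correct identity lands. The sketch below illustrates that step with plain Python lists. The function names and the choice of cosine distance are assumptions made for this example, and the feature vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_gallery(query_feature, gallery):
    """Return gallery person IDs ordered from most to least similar.

    gallery: list of (person_id, feature) pairs
    """
    ordered = sorted(gallery,
                     key=lambda item: cosine_distance(query_feature, item[1]))
    return [person_id for person_id, _ in ordered]


def rank_k_accuracy(queries, gallery, k=1):
    """Fraction of queries whose true ID appears in the top k results.

    queries: list of (true_person_id, feature) pairs
    """
    hits = sum(1 for person_id, feature in queries
               if person_id in rank_gallery(feature, gallery)[:k])
    return hits / len(queries)
```

Reporting accuracy at several values of `k` (rank-1, rank-5, and so on) gives the cumulative matching curve that re-identification papers commonly use to compare methods.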

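0x02 Related questions

1. Video face recognition at low resolution (below 32*32)
A: This generally cannot be solved well. You can try deblocking or denoising as preprocessing, which gives a small improvement.

A: If the occlusion is not large, deep learning is reasonably robust to it.

A: If the face region is fixed (e.g. face login), you can first extract deep-learning features and then apply AdaBoost to produce the detection box. Fast R-CNN is more or less usable. See: To Boost or Not to Boost? On the Limits of Boosted Trees for Object Detection

A: There are ways to solve this. The usual approach is to use deep learning to generate a photo with the glasses removed and then run recognition on it. See: Robust Deep Auto-encoder for Occluded Face Recognition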

0x03 Research frontiers



(1) Generative adversarial models:

• Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks, NIPS, 2014. (the earliest paper)
• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets, arXiv:1411.1784 [cs.LG] (a well-known paper)


• Jost Tobias Springenberg, “Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks”, ICLR 2016
• Harrison Edwards, Amos Storkey, “Censoring Representations with an Adversary”, ICLR 2016,
• Jun-Yan Zhu, Philipp Krahenbuhl, Eli Shechtman, and Alexei A. Efros, “Generative Visual Manipulation on the Natural Image Manifold”, ECCV 2016.
• Mixing Convolutional and Adversarial Networks
  ◦ Alec Radford, Luke Metz, Soumith Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016.

(2) Deep reinforcement learning

Reinforcement Learning Course by David Silver
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
• Highly related with Silver’s course, you can read/skip the corresponding chapters while taking the courses

Some early papers on deep reinforcement learning (DQN)
• Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013, December 20). Playing Atari with Deep Reinforcement Learning. arXiv.org. SJTU Machine Vision and Intelligence Group page 7
• Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

• Wang, Z., de Freitas, N., & Lanctot, M. (2015). Dueling Network Architectures for Deep Reinforcement Learning. CoRR.
• van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-Learning. AAAI.
• Hausknecht, M. J., & Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI.
• Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., & Munos, R. (2015). Increasing the Action Gap - New Operators for Reinforcement Learning. CoRR, cs.AI.
• Osband, I., Blundell, C., Pritzel, A., & Van Roy, B. (2016, February 15). Deep Exploration via Bootstrapped DQN. arXiv.org.
• Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized Experience Replay. CoRR.
• Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., et al. (2016, February 5). Asynchronous Methods for Deep Reinforcement Learning. arXiv.org.

• Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
• Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in Neural Information Processing Systems (pp. 2204-2212).
• Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., & Farhadi, A. (2016, September 17). Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning. arXiv.org.

(3) Human pose estimation

• Haoshu Fang, Shuqin Xie, Cewu Lu, RMPE: Regional Multi-person Pose Estimation, arXiv:1612.00137 [cs.CV]
• Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, Realtime Multi-person 2D Pose Estimation using Part Affinity Fields, CVPR 2017

(4) Visual question answering


• Xiong, Caiming, Stephen Merity, and Richard Socher. “Dynamic Memory Networks for Visual and Textual Question Answering.” arXiv:1603.01417 (2016).
• Mateusz Malinowski, Marcus Rohrbach, Mario Fritz, Ask Your Neurons: A Neural-based Approach to Answering Questions about Images, arXiv:1505.01121.
• Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, VQA: Visual Question Answering, CVPR, 2015 SUNw:Scene Understanding workshop.
• Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu, Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, arXiv:1505.05612.
• Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han, Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction, arXiv:1511.05765
• Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2015). Stacked Attention Networks for Image Question Answering. arXiv:1511.02274.
• Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Multimodal Residual Learning for Visual QA, arXiv:1606.01455
• Hyeonwoo Noh and Bohyung Han, Training Recurrent Answering Units with Joint Loss Minimization for VQA, arXiv:1606.03647
• Jin-Hwa Kim, Kyoung Woon On, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang, Hadamard Product for Low-rank Bilinear Pooling, arXiv:1610.04325.

(5) Mask-R-CNN

There is also the newly released Mask R-CNN, which is attracting a lot of attention.

The paper is here: https://arxiv.org/abs/1703.06870
A discussion in Chinese is here: https://www.zhihu.com/question/57403701