







Hierarchical Action Classification with Network Pruning

Mahdi Davoodikakhki

KangKang Yin


Research on human action classification has made significant progress in the past few years. Most deep learning methods focus on improving performance by adding more network components. We propose, however, to better utilize auxiliary mechanisms, including hierarchical classification, network pruning, and skeleton-based preprocessing, to boost model robustness and performance. We test the effectiveness of our method on four commonly used benchmark datasets: NTU RGB+D 60 [1], NTU RGB+D 120 [2], Northwestern-UCLA Multiview Action 3D [3], and the UTD Multimodal Human Action Dataset [4]. Our experiments show that our method achieves comparable or better performance on all four datasets. In particular, our method sets a new baseline for NTU 120, the largest of the four. We also analyze our method with extensive comparisons and ablation studies.

Method

Here we briefly describe our method; we encourage readers to consult our paper [5] for a more detailed explanation.

Network Model

We build our network on top of the Glimpse Clouds network [6] and improve its accuracy with three auxiliary mechanisms: cropping the area around the people in each video (video cropping), hierarchical classification, and network pruning. Hierarchical classification helps the initial stacks of the network extract more meaningful features, while network pruning helps overcome overparameterization. A simplified sketch of the hierarchical heads follows the figure below.

Structure of our neural network model. Each level contains superclasses, each grouping a separate subset of classes.
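To make the hierarchical classification idea concrete, here is a minimal, self-contained sketch rather than the Glimpse Clouds-based architecture of the paper: a toy convolutional backbone with one classification head per level, trained with the sum of per-level cross-entropy losses against progressively finer superclass labels. The module names, the tiny 2D backbone, the level sizes (5, 10, 30, 60), and the uniform loss weighting are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Toy backbone with one classification head per hierarchy level.

    Level i predicts the superclass label of granularity i; the last
    level predicts the actual action classes. This is a simplified
    illustration, not the Glimpse Clouds-based model used in the paper.
    """
    def __init__(self, classes_per_level=(5, 10, 30, 60)):  # illustrative sizes
        super().__init__()
        channels = (32, 64, 128, 256)
        self.stages = nn.ModuleList()
        self.heads = nn.ModuleList()
        in_ch = 3
        for out_ch, n_cls in zip(channels, classes_per_level):
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ))
            # One head per level: global average pooling + linear classifier.
            self.heads.append(nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(out_ch, n_cls),
            ))
            in_ch = out_ch

    def forward(self, x):
        logits = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            logits.append(head(x))
        return logits  # one logit tensor per hierarchy level

def hierarchical_loss(logits, level_targets):
    """Sum of cross-entropy losses, one per hierarchy level."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(l, t) for l, t in zip(logits, level_targets))
```

In the paper, the coarse-level targets come from the superclass assignment described in the next section, and the per-level losses need not be weighted uniformly.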

Superclasses

To encourage our network to extract features gradually, we define four hierarchy levels, each containing superclasses with an equal number of classes per superclass within that level. We first train our model without any hierarchical classification and measure the similarities among the classes. We then try to put the most similar classes in the same superclass, so that classification is easy at the coarse levels and gradually becomes harder toward the finest level. To obtain the superclasses, we first assign the classes to superclasses uniformly at random, and then use a greedy algorithm that repeatedly swaps the pair of classes whose exchange most decreases the similarity between superclasses.

We continue swapping until no further improvement can be achieved. We also repeat the algorithm 1000 times from different random initializations and use the best result as the superclass configuration for each level; a sketch of this procedure follows the figure below.

N-UCLA: all 10 action classes and the derived superclasses.
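The grouping procedure can be sketched as follows, assuming a symmetric class-similarity matrix S (for example, derived from the confusion matrix of the non-hierarchical model). Minimizing the similarity between superclasses is equivalent to maximizing the similarity within them, which is what this sketch optimizes; the function name, the exact similarity measure, and the tie-breaking are illustrative assumptions, and the paper's implementation may differ in detail.

```python
import numpy as np

def group_classes(S, n_super, n_restarts=1000, seed=0):
    """Greedily group classes into equal-sized superclasses.

    S: (n, n) symmetric similarity matrix between classes.
    Starting from a random uniform assignment, repeatedly apply the
    single swap of two classes (from different superclasses) that most
    increases within-superclass similarity, until no swap helps.
    The whole search is restarted n_restarts times; the best result wins.
    """
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    assert n % n_super == 0, "classes must split evenly into superclasses"
    size = n // n_super

    def within_similarity(assign):
        # Total similarity between classes that share a superclass.
        same = assign[:, None] == assign[None, :]
        return S[same].sum()

    best_assign, best_score = None, -np.inf
    for _ in range(n_restarts):
        assign = rng.permutation(np.repeat(np.arange(n_super), size))
        score = within_similarity(assign)
        improved = True
        while improved:
            improved = False
            best_swap, best_gain = None, 0.0
            for i in range(n):
                for j in range(i + 1, n):
                    if assign[i] == assign[j]:
                        continue  # swapping within a superclass changes nothing
                    assign[i], assign[j] = assign[j], assign[i]
                    gain = within_similarity(assign) - score
                    assign[i], assign[j] = assign[j], assign[i]
                    if gain > best_gain:
                        best_swap, best_gain = (i, j), gain
            if best_swap is not None:
                i, j = best_swap
                assign[i], assign[j] = assign[j], assign[i]
                score += best_gain
                improved = True
        if score > best_score:
            best_assign, best_score = assign.copy(), score
    return best_assign
```

For instance, with a 60-class similarity matrix and n_super=5 this produces one level's superclass assignment; running it once per level with different superclass counts yields the full hierarchy.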

Results

Here we compare our method with recent publications; we achieve state-of-the-art results in most comparisons. Our paper [5] provides further analysis and ablation studies on the backpropagated gradients, superclass sizes, and pruning percentages.


Comparison on NTU 60. – indicates no results available.

| Method | Year | Cross-View | Cross-Subject |
|---|---|---|---|
| Glimpse Clouds [6] | 2018 | 93.2% | 86.6% |
| FGCN [7] | 2020 | 96.25% | 90.22% |
| MS-G3D Net [8] | 2020 | 96.2% | 91.5% |
| PoseMap [9] | 2018 | 95.26% | 91.71% |
| MMTM [10] | 2019 | – | 91.71% |
| Action Machine [11] | 2019 | 97.2% | 94.3% |
| PGCN [12] | 2019 | – | 96.4% |
| Ours | 2020 | 98.79% | 95.66% |

Comparison on NTU 120. – indicates no results available. * indicates results obtained from author-released code.

| Method | Year | Cross-Setup | Cross-Subject |
|---|---|---|---|
| Action Machine [11] | 2019 | – | – |
| TSRJI [13] | 2019 | 62.8% | 67.9% |
| PoseMap from Papers with Code [14] | 2019 | 66.9% | 64.6% |
| SkeleMotion [15] | 2019 | 66.9% | 67.7% |
| GVFE + AS-GCN with DH-TCN [16] | 2019 | 79.8% | 78.3% |
| Glimpse Clouds [6] | 2018 | 83.84%* | 93.52%* |
| FGCN [7] | 2020 | 87.4% | 85.4% |
| MS-G3D Net [8] | 2020 | 88.4% | 86.9% |
| Ours | 2020 | 94.54% | 93.69% |

Comparison on UTD-MHAD. * indicates results obtained from author-released code.

| Method | Year | Cross-Subject |
|---|---|---|
| Glimpse Clouds [6] | 2018 | 84.19%* |
| JTM [17] | 2016 | 85.81% |
| Optical Spectra [18] | 2018 | 86.97% |
| JDM [19] | 2017 | 88.10% |
| Action Machine Archived Version [20] | 2019 | 92.5%* |
| PoseMap [9] | 2018 | 94.51% |
| Ours [21] | 2020 | 91.63% |

Comparison on N-UCLA. – indicates no results available.

| Method | Year | View1 | View2 | View3 | Average |
|---|---|---|---|---|---|
| Ensemble TS-LSTM [22] | 2017 | – | – | 89.22% | – |
| EleAtt-GRU(aug.) [23] | 2018 | – | – | 90.7% | – |
| Enhanced Viz. [24] | 2017 | – | – | 92.61% | – |
| Glimpse Clouds [6] | 2018 | 83.4% | 89.5% | 90.1% | 87.6% |
| FGCN [7] | 2020 | – | – | 95.3% | – |
| Action Machine [11] | 2019 | 88.3% | 92.2% | 96.5% | 92.3% |
| Ours | 2020 | 91.10% | 91.95% | 98.92% | 93.99% |

References

[1]Shahroudy, A., Liu, J., Ng, T., Wang, G.: Ntu rgb+d: A large scale dataset for 3d human activity analysis. In: CVPR. pp. 1010–1019 (2016)

[2]Liu, J., Shahroudy, A., Perez, M.L., Wang, G., Duan, L.Y., Kot Chichung, A.: Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)

[3]J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu. Cross-view action modeling, learning, and recognition. IEEE Conference on Computer Vision and Pattern Recognition, pages 2649–2656, 2014.

[4]C. Chen, R. Jafari, and N. Kehtarnavaz. Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In IEEE International Conference on Image Processing, pages 168–172, 2015.

[5]Hierarchical Action Classification with Network Pruning, Mahdi Davoodikakhki and KangKang Yin. 15th International Symposium on Visual Computing (ISVC 2020).

[6]Fabien Baradel, Christian Wolf, Julien Mille, and Graham W. Taylor. Glimpse clouds: Human activity recognition from unstructured feature points. IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[7]Hao Yang, Dan Yan, Li Zhang, Dong Li, YunDa Sun, ShaoDi You, and Stephen J. Maybank. Feedback graph convolutional network for skeleton-based action recognition, 2020.

[8]Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition, 2020.

[9]M. Liu and J. Yuan. Recognizing human actions as the evolution of pose estimation maps. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1159–1168, 2018.

[10]Hamid Reza Vaezi Joze, Amirreza Shaban, Michael L. Iuzzolino, and Kazuhito Koishida. Mmtm: Multimodal transfer module for cnn fusion, 2019.

[11]Jiagang Zhu, Wei Zou, Zheng Zhu, Liang Xu, and Guan Huang. Action machine: Toward person-centric action recognition in videos. IEEE Signal Processing Letters, 2019.

[12]Lei Shi, Yifan Zhang, Jian Cheng, and Han-Qing Lu. Action recognition via pose-based graph convolutional networks with intermediate dense supervision. ArXiv, abs/1911.12509, 2019.

[13]Carlos Caetano, François Bremond, and William Robson Schwartz. Skeleton image representation for 3d action recognition based on tree structure and reference joints. SIBGRAPI Conference on Graphics, Patterns and Images, 2019.

[14]M. Liu and J. Yuan. Recognizing human actions as the evolution of pose estimation maps. https://paperswithcode.com/paper/recognizing-human-actions-as-the-evolution-of. Accessed: 2020-05-12.

[15]Carlos Caetano, Jessica Sena, François Bremond, Jefersson A. Dos Santos, and William Robson Schwartz. Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition. IEEE International Conference on Advanced Video and Signal Based Surveillance, 2019.

[16]Konstantinos Papadopoulos, Enjie Ghorbel, Djamila Aouada, and Björn Ottersten. Vertex feature encoding and hierarchical temporal modeling in a spatial-temporal graph convolutional network for action recognition, 2019.

[17]Pichao Wang, Zhaoyang Li, Yonghong Hou, and Wanqing Li. Action recognition based on joint trajectory maps using convolutional neural networks. ACM International Conference on Multimedia, 2016.

[18]Y. Hou, Z. Li, P. Wang, and W. Li. Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(3):807–811, 2018.

[19]C. Li, Y. Hou, P. Wang, and W. Li. Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Processing Letters, 24(5):624–628, 2017.

[20]Jiagang Zhu, Wei Zou, Liang Xu, Yiming Hu, Zheng Zhu, Manyu Chang, Junjie Huang, Guan Huang, and Dalong Du. Action machine: Rethinking action recognition in trimmed videos, 2018.

[21]Hierarchical Action Classification with Network Pruning, Mahdi Davoodikakhki and KangKang Yin. arXiv, 2020.

[22]Lee, I., Kim, D., Kang, S., Lee, S.: Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In: ICCV. pp. 1012–1020 (2017)

[23]Pengfei Zhang, Jianru Xue, Cuiling Lan, Wenjun Zeng, Zhanning Gao, and Nanning Zheng. Adding attentiveness to the neurons in recurrent neural networks. In Proceedings of the European Conference on Computer Vision, pages 135–151, 2018.

[24]Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 03 2017.