References:
[1] CUI W. A Chinese text classification system based on Naive Bayes algorithm[J]. MATEC Web of Conferences, 2016, 44: 01015.
[2] MIAO F, ZHANG P, JIN L B, et al. Chinese news text classification based on machine learning algorithm[C]//2018 10th International Conference on Intelligent Human-Machine Systems and Cybernetics(IHMSC). Hangzhou: IEEE, 2018: 48-51.
[3] GOUDJIL M, KOUDIL M, BEDDA M, et al. A novel active learning method using SVM for text classification[J]. International Journal of Automation and Computing, 2018, 15(3): 290-298.
[4] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2. New York: ACM, 2013: 3111-3119.
[5] PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP). Stroudsburg: Association for Computational Linguistics, 2014: 1532-1543.
[5] PENNINGTON J, SOCHER R, MANNING C. Glove: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP). Stroudsburg: Association for Computational Linguistics, 2014: 1532-1543.
[6] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1(Long and Short Papers). Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186.
[7] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[EB/OL]. (2015-03-09). arXiv: 1503.02531. https://arxiv.org/abs/1503.02531.
[8] GOU J P, YU B S, MAYBANK S J, et al. Knowledge distillation: a survey[J]. International Journal of Computer Vision, 2021, 129(6): 1789-1819.
[9] LI L, XIAO L L, WANG N Z, et al. Text classification method based on convolution neural network[C]//2017 3rd IEEE International Conference on Computer and Communications(ICCC). Chengdu: IEEE, 2018: 1985-1989.
[10] LUAN Y D, LIN S F. Research on text classification based on CNN and LSTM[C]//2019 IEEE International Conference on Artificial Intelligence and Computer Applications(ICAICA). Dalian: IEEE, 2019: 352-355.
[11] SUN X, TANG Z, ZHAO Y Y, et al. Text classification model based on hierarchical hybrid attention mechanism[J]. Journal of Chinese Information Processing, 2021, 35(2): 69-77. (in Chinese)
[12] WU Z H, PAN S R, CHEN F W, et al. A comprehensive survey on graph neural networks[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(1): 4-24.
[13] YAO L, MAO C S, LUO Y. Graph convolutional networks for text classification[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 7370-7377.
[14] BASTINGS J, TITOV I, AZIZ W, et al. Graph convolutional encoders for syntax-aware neural machine translation[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2017: 1957-1967.
[15] HUANG L Z, MA D H, LI S J, et al. Text level graph neural network for text classification[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing(EMNLP-IJCNLP). Stroudsburg: Association for Computational Linguistics, 2019: 3444-3450.
[16] SUN C, QIU X P, XU Y G, et al. How to fine-tune BERT for text classification?[C]//SUN M, HUANG X, JI H, et al. China National Conference on Chinese Computational Linguistics. Cham: Springer, 2019: 194-206.
[17] TANG R, LU Y, LIU L Q, et al. Distilling task-specific knowledge from BERT into simple neural networks[EB/OL]. (2019-03-28). arXiv: 1903.12136. https://arxiv.org/abs/1903.12136.
[18] SUN S Q, CHENG Y, GAN Z, et al. Patient knowledge distillation for BERT model compression[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing(EMNLP-IJCNLP). Stroudsburg: Association for Computational Linguistics, 2019: 4323-4332.
[19] JIAO X Q, YIN Y C, SHANG L F, et al. TinyBERT: distilling BERT for natural language understanding[C]//Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg: Association for Computational Linguistics, 2020: 4163-4174.
[20] OZEROV A, DUONG N Q K. Inplace knowledge distillation with teacher assistant for improved training of flexible deep neural networks[C]//2021 29th European Signal Processing Conference(EUSIPCO). Dublin: IEEE, 2021: 1356-1360.
[21] HUANG Z H, YANG S Z, LIN W, et al. A survey of knowledge distillation research[J]. Chinese Journal of Computers, 2022, 45(3): 624-653. (in Chinese)
[22] CUI Y M, CHE W X, LIU T, et al. Revisiting pre-trained models for Chinese natural language processing[C]//Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg: Association for Computational Linguistics, 2020: 657-668.
[23] HUANG S, PAPERNOT N, GOODFELLOW I, et al. Adversarial attacks on neural network policies[C]//Proceedings of the International Conference on Learning Representations(Workshop Track). Toulon: ICLR, 2017.
[24] KARIMI A, ROSSI L, PRATI A. Adversarial training for aspect-based sentiment analysis with BERT[C]//2020 25th International Conference on Pattern Recognition(ICPR). Milan: IEEE, 2021: 8797-8803.
[25] KIM Y. Convolutional neural networks for sentence classification[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP). Stroudsburg: Association for Computational Linguistics, 2014: 1746-1751.
[26] LIU P F, QIU X P, HUANG X J. Recurrent neural network for text classification with multi-task learning[C]//Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. New York: ACM, 2016: 2873-2879.
[27] JOHNSON R, ZHANG T. Deep pyramid convolutional neural networks for text categorization[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics(Volume 1: Long Papers). Stroudsburg: Association for Computational Linguistics, 2017: 562-570.