Enhancing Knowledge Distillation for Text Summarization

Kotit, Mohammad Basheer

Date

2024-01

Author

Kotit, Mohammad Basheer

Abstract

In natural language processing, recent advances have been shaped significantly by large pretrained Seq2Seq Transformer models, including BART, PEGASUS, and T5. These models have improved the accuracy and fluency of text generation applications such as machine translation, text summarization, and chatbot development. However, deploying them for text summarization is challenging in environments with limited computational resources. This research proposes compact student models that emulate the capabilities of their larger pretrained counterparts (teacher models) while reducing computational demands and increasing processing speed. Knowledge distillation, a popular model optimization technique, typically takes one of two forms: direct knowledge distillation and the use of pseudo-labels. Our research enhances direct knowledge distillation by introducing an effective behavior function. This function selectively emphasizes the more certain predictions from the teacher model, thereby addressing the exposure bias issue that arises from differences between training and testing environments. In addition, we propose a novel approach for selecting the most reliable predictions from the teacher model. These high-confidence predictions are then used as pseudo-summaries to train the student model through the pseudo-label technique. Both approaches focus on the confidence of teacher predictions and aim to improve the student model’s performance while maintaining computational efficiency. We evaluated our methods using BART on the CNN/DM dataset and PEGASUS on the XSUM dataset. The evaluations showed that our approaches not only achieved the knowledge distillation objectives but also significantly surpassed the performance of the teacher models.
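
The two ideas in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's behavior function or its selection criterion, so everything below is an assumption made for illustration: confidence is taken to be the teacher's top token probability, the distillation loss is a confidence-weighted cross-entropy, and pseudo-summaries are kept when their mean token log-probability passes a hypothetical threshold. The function names are likewise hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def confidence_weighted_kd_loss(teacher_logits, student_logits):
    """Cross-entropy between teacher and student token distributions,
    weighted per token by the teacher's confidence.

    Assumption: confidence = the teacher's highest token probability.
    This is a stand-in, not the behavior function from the thesis.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        teacher_probs = [math.exp(lp) for lp in log_softmax(t_row)]
        student_log_probs = log_softmax(s_row)
        cross_entropy = -sum(
            p * lq for p, lq in zip(teacher_probs, student_log_probs)
        )
        weight = max(teacher_probs)  # more certain positions count more
        weighted_sum += weight * cross_entropy
        weight_total += weight
    return weighted_sum / weight_total


def select_pseudo_summaries(candidates, threshold):
    """Keep teacher-generated summaries whose mean token log-probability
    is at least `threshold` (a hypothetical selection rule).

    `candidates` is a list of (summary_text, token_log_probs) pairs.
    """
    kept = []
    for summary, token_log_probs in candidates:
        mean_log_prob = sum(token_log_probs) / len(token_log_probs)
        if mean_log_prob >= threshold:
            kept.append(summary)
    return kept
```

In this sketch, a student whose logits match the teacher's at a confident position gets a lower loss than one that disagrees there, and only summaries the teacher generated with high average likelihood are passed on as training targets.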

DOI/handle

http://hdl.handle.net/10576/51500

Collections

Computing