Enhancing Knowledge Distillation for Text Summarization
Abstract
In the realm of natural language processing, recent advancements have been significantly
shaped by the development of large pretrained Seq2Seq Transformer models,
including BART, PEGASUS, and T5. These models have revolutionized various text
generation applications, such as machine translation, text summarization, and chatbot
development, by offering remarkable improvements in accuracy and fluency.
However, their deployment in text summarization often encounters significant challenges
in environments with limited computational resources. This research proposes
an innovative solution: the development of compact student models. These models are
designed to emulate the capabilities of their larger pretrained counterparts (teacher models)
while ensuring reduced computational demands and increased processing speed,
thus maintaining high performance with greater efficiency.
Knowledge distillation, a popular model compression technique, is typically carried out
in one of two ways: direct knowledge distillation or training on pseudo-labels. Our
research enhances direct knowledge distillation by introducing an effective behavior
function. This function selectively emphasizes the more certain predictions from the
teacher model, thereby addressing the exposure bias that arises because the student conditions
on reference tokens during training but on its own predictions at test time. In addition, we propose a novel
approach to select the most reliable predictions from the teacher model. These high-confidence
predictions are then used as pseudo-summaries, optimizing the student
model’s training through the pseudo-label technique. This dual approach mainly focuses
on the confidence of teacher predictions and offers a comprehensive solution to enhance
the model’s performance while maintaining computational efficiency.
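The two ideas above can be illustrated with a minimal sketch. The abstract does not define the behavior function or the selection rule, so everything here is an assumption made for illustration: teacher confidence is taken to be the maximum token probability, the distillation loss is a confidence-weighted KL divergence, and pseudo-summaries are ranked by mean token log-probability. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_kd_loss(teacher_logits, student_logits):
    """Confidence-weighted average of per-token KL(teacher || student).

    Assumption: "confidence" is the teacher's maximum token probability,
    so positions where the teacher is more certain contribute more.
    Both arguments are lists of per-position logit lists.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # teacher distribution
        q = softmax(s_row)  # student distribution
        kl = sum(
            pi * math.log(pi / max(qi, 1e-12))
            for pi, qi in zip(p, q)
            if pi > 0
        )
        confidence = max(p)
        weighted_sum += confidence * kl
        weight_total += confidence
    return weighted_sum / weight_total


def select_pseudo_summary(candidates):
    """Pick the candidate the teacher is most confident about.

    Assumption: each candidate is (text, token_log_probs) and confidence
    is the length-normalized (mean) token log-probability.
    """
    return max(candidates, key=lambda c: sum(c[1]) / len(c[1]))[0]
```

In this sketch the loss is zero when the student matches the teacher exactly, and it grows fastest at positions where the teacher's distribution is sharply peaked. A real system would compute both quantities on tensors from the teacher's decoder rather than on Python lists.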
We evaluated our methods using BART on the CNN/DM dataset and PEGASUS on
the XSum dataset. The results show that our approaches not
only achieved the knowledge distillation objectives but also significantly
surpassed the performance of the teacher models.
DOI/handle
http://hdl.handle.net/10576/51500