Assisted Pre-training for Decoder-Only Language Models

Frigione, Luigi
2024/2025

Abstract

Large Language Models (LLMs) are becoming increasingly popular and have demonstrated strong capabilities across a wide range of natural language processing tasks. Nonetheless, pre-training large language models remains a complex and resource-intensive process. Several techniques have been developed to simplify this procedure, including methods that reduce model size while aiming to preserve performance. One of the most widely used approaches is knowledge distillation, which transfers knowledge from a larger pre-trained model (the teacher) to a smaller, more efficient model (the student), often with minimal performance degradation. This technique typically involves training the student model to replicate the output distribution produced by the teacher, rather than learning directly from the training corpus.

This thesis proposes and analyzes a novel approach that inverts the traditional direction of knowledge transfer. Instead of distilling from a larger to a smaller model, we explore the potential of transferring a combination of corpus-based knowledge and information from a smaller model to improve the training of a larger one.
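For illustration, below is a minimal PyTorch sketch of the conventional teacher-to-student distillation objective described in the abstract: a next-token cross-entropy term on the corpus combined with a KL term toward the teacher's output distribution. The function name, temperature, and mixing weight are assumptions made for this example; they are not details of the thesis, whose proposed method reverses the direction of transfer.

    # Minimal sketch of a standard knowledge-distillation loss for language models.
    # Generic illustration only; hyperparameters are assumed, not taken from the thesis.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets,
                          temperature=2.0, alpha=0.5):
        """Blend corpus cross-entropy with a KL term toward the teacher distribution."""
        # Soft targets: teacher distribution at a higher temperature.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        # KL divergence, scaled by T^2 as is conventional in distillation.
        kd = F.kl_div(log_soft_student, soft_teacher,
                      reduction="batchmean") * temperature ** 2
        # Standard next-token cross-entropy against the corpus labels.
        ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                             targets.view(-1))
        return alpha * kd + (1.0 - alpha) * ce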
Keywords: NLP, Large Language Models, Deep Learning, Transformer NNs
Files in this item: Frigione_Luigi.pdf (Adobe PDF, 4.13 MB, open access)


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/99592