Valutazione di Strumenti Basati su LLM per il Fixing Automatico di Vulnerabilità di Sicurezza in Applicazioni Android

AHMED, SAAD
2023/2024

Abstract

The ever-growing landscape of Android applications necessitates robust security mechanisms to mitigate potential vulnerabilities. This thesis presents a comprehensive evaluation of three Large Language Model (LLM)-based tools (ChatGPT, Google Bard, and Android Studio Bot) for automated repair of Android security vulnerabilities. To ensure the authenticity of the evaluation, a dataset of 80 vulnerable code snippets sourced from Google Android Security Bulletins is used. The evaluation records the outcome of the vulnerability fix produced by each tool and applies two distinct techniques: first, the calculation of BLEU scores to quantify the syntactic and semantic correctness of the repairs, and second, manual human evaluation for a more nuanced assessment. Comparing the actual fixes with those generated by the LLM-based tools highlights their efficacy in addressing security vulnerabilities. The results provide insights into the strengths and limitations of each tool with respect to syntactic and semantic accuracy, revealing cases where the models produce effective repairs and identifying areas for improvement. Combining automated evaluation metrics with human assessment adds depth to the analysis and enhances the reliability of the findings. In conclusion, this thesis contributes to the understanding of the capabilities of LLM-based tools in automating the repair of Android security vulnerabilities. The resulting suggestions and findings serve as valuable guidance for developers and researchers seeking to leverage these tools effectively, ultimately advancing the state of automated security practices in Android application development.
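The BLEU-based comparison described in the abstract can be sketched in plain Python. This is a minimal, self-contained illustration of the metric (clipped n-gram precision with a brevity penalty), not the evaluation pipeline used in the thesis; whitespace tokenization and the smoothing floor are simplifying assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(ngrams(cand, n))
        if not cand_ngrams:          # candidate shorter than n tokens
            return 0.0
        ref_ngrams = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        precisions.append(overlap / total if overlap else 1e-9)  # smoothing floor
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# A generated fix identical to the ground-truth fix scores 1.0;
# a partial or unrelated fix scores lower.
ground_truth = "if ( uri == null ) { return null ; }"
print(bleu(ground_truth, ground_truth))  # 1.0
```

In practice, per-tool scores would be averaged over all 80 snippets, with the ground-truth patch from the security bulletin as the reference and each tool's output as the candidate.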
2023
Evaluation of LLM-based Tools for the Automated Repair of Security Vulnerabilities in Android Apps
LLM models
Effectiveness of LLMs
ChatGPT, Bard
Language models
Android security bugs
Files in this item:
File: Ahmed_Saad.pdf
Access: open access
Size: 751.84 kB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12608/64049