AUTOMATING INFORMATION EXTRACTION FROM PEROVSKITE SOLAR CELLS LITERATURE USING LARGE LANGUAGE MODELS

GAD, RADWA ESSAM

Date

2025-06

Abstract

As research on perovskite solar cells (PSCs) advances rapidly, efficiently extracting structured data from the scientific literature has become essential for accelerating materials discovery and development. PSC studies often report multiple device configurations within a single paper, which makes traditional single-device extraction approaches insufficient. In this thesis, we propose the first automated information extraction pipeline that leverages large language models (LLMs) to extract structured attributes for all devices reported in a PSC research paper. Our experiments cover both open-source and closed-source LLMs, including GPT-4o-mini, LLaMA 3.1 70B, and Qwen 2.5 72B, to evaluate the pipeline across different model architectures. We also introduce the first multi-device evaluation framework, which uses an optimization-based matching algorithm, and we define a wide range of PSC-specific attributes selected to make the extracted dataset practically useful to researchers. Experimental results show that the proposed pipeline outperforms existing approaches, achieving an F1 score of 90.06% for champion-device extraction, 78.70% for multi-device extraction, and 90.98% for the best device in multi-device extraction. These findings indicate that our approach offers a scalable, reproducible, and efficient solution for automating structured information extraction from the PSC literature.
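The abstract does not say how the optimization-based matching works, so the following is only a minimal sketch of one plausible reading: extracted devices are paired one-to-one with annotated devices so that the total number of agreeing attributes is maximal, and a micro-averaged F1 score is computed over the matched pairs. The attribute names (`absorber`, `pce`), the exact-match comparison, and the exhaustive search are all assumptions made for illustration, not the thesis's method.

```python
from itertools import permutations


def attribute_overlap(pred, gold):
    """Count gold attributes that the prediction reproduces (case-insensitive exact match)."""
    return sum(
        1
        for key, value in gold.items()
        if key in pred and str(pred[key]).strip().lower() == str(value).strip().lower()
    )


def best_matching(preds, golds):
    """Exhaustively search for the one-to-one pairing with the most matched attributes.

    Assumption: device lists are short, so trying every permutation is affordable.
    """
    # Pad the shorter list with empty devices so both sides have equal length.
    n = max(len(preds), len(golds))
    padded_preds = preds + [{}] * (n - len(preds))
    padded_golds = golds + [{}] * (n - len(golds))

    best_score, best_perm = -1, ()
    for perm in permutations(range(n)):
        score = sum(
            attribute_overlap(padded_preds[i], padded_golds[j])
            for i, j in enumerate(perm)
        )
        if score > best_score:
            best_score, best_perm = score, perm

    # Keep only pairs in which both sides are real (non-padding) devices.
    pairs = [
        (i, j) for i, j in enumerate(best_perm) if i < len(preds) and j < len(golds)
    ]
    return pairs, best_score


def micro_f1(preds, golds):
    """Micro-averaged F1 over attributes, using the optimal device pairing."""
    _, true_positives = best_matching(preds, golds)
    n_pred = sum(len(p) for p in preds)
    n_gold = sum(len(g) for g in golds)
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two annotated devices, extracted in a different order.
gold_devices = [
    {"absorber": "MAPbI3", "pce": "18.2"},
    {"absorber": "FAPbI3", "pce": "20.1"},
]
pred_devices = [
    {"absorber": "FAPbI3", "pce": "20.1"},
    {"absorber": "MAPbI3", "pce": "17.9"},
]
pairs, matched = best_matching(pred_devices, gold_devices)
score = micro_f1(pred_devices, gold_devices)
```

The exhaustive search keeps the sketch dependency-free but grows factorially with the number of devices; for longer lists, an assignment solver such as SciPy's `linear_sum_assignment` would be the usual substitute.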

DOI/handle

http://hdl.handle.net/10576/66442

Collections

Computing [117 items]