Large language models overcome the challenges of unstructured text data in ecology
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F67985939%3A_____%2F24%3A00598400" target="_blank" >RIV/67985939:_____/24:00598400 - isvavai.cz</a>
Alternative codes found
RIV/00216208:11310/24:10489198
Result on the web
<a href="https://doi.org/10.1016/j.ecoinf.2024.102742" target="_blank" >https://doi.org/10.1016/j.ecoinf.2024.102742</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1016/j.ecoinf.2024.102742" target="_blank" >10.1016/j.ecoinf.2024.102742</a>
Alternative languages
Result language
English
Original language name
Large language models overcome the challenges of unstructured text data in ecology
Original language description
The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87-100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81-97%), whereas LLaMA-2-70B showed the worst performance (37-73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10618 - Ecology
Result continuities
Project
<a href="/en/project/GA23-07278S" target="_blank" >GA23-07278S: Harnessing iEcology and culturomics to advance invasion science</a>
Continuities
I - Institutional support for the long-term conceptual development of a research organisation
Others
Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Ecological Informatics
ISSN
1574-9541
e-ISSN
1878-0512
Volume of the periodical
82
Issue of the periodical within the volume
September
Country of publishing house
NL - THE KINGDOM OF THE NETHERLANDS
Number of pages
7
Pages from-to
102742
UT code for WoS article
001290358400001
EID of the result in the Scopus database
2-s2.0-85200389928