Large language models overcome the challenges of unstructured text data in ecology

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F67985939%3A_____%2F24%3A00598400" target="_blank" >RIV/67985939:_____/24:00598400 - isvavai.cz</a>
Alternative codes found
RIV/00216208:11310/24:10489198
Result on the web
<a href="https://doi.org/10.1016/j.ecoinf.2024.102742" target="_blank" >https://doi.org/10.1016/j.ecoinf.2024.102742</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1016/j.ecoinf.2024.102742" target="_blank" >10.1016/j.ecoinf.2024.102742</a>

Alternative languages

Result language
English
Title in the original language
Large language models overcome the challenges of unstructured text data in ecology
Result description in the original language
The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87-100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81-97%), whereas LLaMA-2-70B showed the worst performance (37-73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.

Classification

Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10618 - Ecology

Result continuities

Project
<a href="/cs/project/GA23-07278S" target="_blank" >GA23-07278S: Use of internet information sources (iEcology and culturomics) in biological invasion research</a>
Continuities
I - Institutional support for the long-term conceptual development of a research organisation

Other

Year of publication
2024
Data confidentiality code
S - Complete and true project data are not subject to protection under special legal regulations

Data specific to the result type

Periodical name
Ecological Informatics
ISSN
1574-9541
e-ISSN
1878-0512
Periodical volume
82
Issue of the periodical within the volume
September
Country of the periodical's publisher
NL - Netherlands
Number of pages of the result
7
Pages from-to
102742
UT WoS code of the article
001290358400001
EID of the result in the Scopus database
2-s2.0-85200389928