Large language models overcome the challenges of unstructured text data in ecology

Result identifiers

  • Result code in IS VaVaI

    RIV/67985939:_____/24:00598400 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F67985939%3A_____%2F24%3A00598400)

  • Alternative codes found

    RIV/00216208:11310/24:10489198

  • Result on the web

    https://doi.org/10.1016/j.ecoinf.2024.102742

  • DOI - Digital Object Identifier

    10.1016/j.ecoinf.2024.102742

Alternative languages

  • Result language

    English

  • Title in the original language

    Large language models overcome the challenges of unstructured text data in ecology

  • Result description in the original language

    The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87-100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81-97%), whereas LLaMA-2-70B showed the worst performance (37-73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.

Classification

  • Type

    Jimp - Article in a periodical indexed in the Web of Science database

  • CEP field

  • OECD FORD field

    10618 - Ecology

Result linkages

  • Project

    GA23-07278S: Use of internet information sources (iEcology and culturomics) in research on biological invasions

  • Linkages

    I - Institutional support for the long-term conceptual development of a research organisation

Other

  • Publication year

    2024

  • Data confidentiality code

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific to the result type

  • Name of the periodical

    Ecological Informatics

  • ISSN

    1574-9541

  • e-ISSN

    1878-0512

  • Volume of the periodical

    82

  • Issue of the periodical within the volume

    September

  • Country of the periodical's publisher

    NL - Netherlands

  • Number of pages

    7

  • Pages from-to

    102742

  • UT WoS code of the article

    001290358400001

  • EID of the result in the Scopus database

    2-s2.0-85200389928