Efficient Use of Large Language Models for Analysis of Text Corpora

Result identifiers

  • Result code in IS VaVaI

    RIV/61988987:17610/24:A2502N5Z - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61988987%3A17610%2F24%3AA2502N5Z)

  • Alternative codes found

    RIV/68407700:21730/24:00381571

  • Result on the web

    https://www.scitepress.org/Papers/2024/123498/123498.pdf

  • DOI - Digital Object Identifier

    10.5220/0012349800003654 (http://dx.doi.org/10.5220/0012349800003654)

Alternative languages

  • Result language

    English

  • Title in original language

    Efficient Use of Large Language Models for Analysis of Text Corpora

  • Result description in original language

    In this paper, we propose an efficient approach for tracking a given phenomenon in a corpus using natural language processing (NLP) methods. The topic of tracking phenomena in a corpus is important, especially in the fields of sociology, psychology, and economics, which study human behavior in society. Unlike existing approaches that rely on universal large language models (LLMs), which are computationally expensive, we focus on using computationally less expensive methods. These methods allow for high data processing speed while maintaining high accuracy. Our approach is inspired by the cascade approach to optimization, where we first roughly filter out unwanted information and then gradually use more accurate models, which are computationally more expensive. In this way, we are able to process large amounts of data with high accuracy using different models, while also reducing the overall cost of computations. To demonstrate the proposed method, we chose a task that consists of finding the frequency of occurrence of a certain phenomenon in a large text corpus, which is divided into individual months of the year. In practice, this means that we can, for example, use Internet discussions to find out how much people are discussing a particular topic. The entire solution is presented as a pipeline, which consists of individual phases that successively process text data using methods selected to minimize the overall cost of processing all data.

  • Title in English

    Efficient Use of Large Language Models for Analysis of Text Corpora

  • Result description in English

    In this paper, we propose an efficient approach for tracking a given phenomenon in a corpus using natural language processing (NLP) methods. The topic of tracking phenomena in a corpus is important, especially in the fields of sociology, psychology, and economics, which study human behavior in society. Unlike existing approaches that rely on universal large language models (LLMs), which are computationally expensive, we focus on using computationally less expensive methods. These methods allow for high data processing speed while maintaining high accuracy. Our approach is inspired by the cascade approach to optimization, where we first roughly filter out unwanted information and then gradually use more accurate models, which are computationally more expensive. In this way, we are able to process large amounts of data with high accuracy using different models, while also reducing the overall cost of computations. To demonstrate the proposed method, we chose a task that consists of finding the frequency of occurrence of a certain phenomenon in a large text corpus, which is divided into individual months of the year. In practice, this means that we can, for example, use Internet discussions to find out how much people are discussing a particular topic. The entire solution is presented as a pipeline, which consists of individual phases that successively process text data using methods selected to minimize the overall cost of processing all data.
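
The cascade idea in the description above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the three-stage split, the thresholds `low` and `high`, and the toy scoring functions are all assumptions made for illustration. The sketch drops clearly irrelevant texts with a cheap keyword filter, lets an inexpensive scorer decide the confident cases, and calls the expensive model only for the uncertain remainder, while counting matches per month.

```python
from collections import Counter
from typing import Callable, Iterable


def keyword_filter(text: str, keywords: set[str]) -> bool:
    """Stage 1 (cheapest): keep only texts that mention at least one keyword."""
    return bool(set(text.lower().split()) & keywords)


def cascade_monthly_counts(
    docs: Iterable[tuple[str, str]],         # (month, text) pairs
    keywords: set[str],
    cheap_score: Callable[[str], float],     # hypothetical inexpensive model, returns 0..1
    expensive_label: Callable[[str], bool],  # hypothetical costly model (e.g. an LLM call)
    low: float = 0.2,                        # assumed rejection threshold
    high: float = 0.8,                       # assumed acceptance threshold
) -> tuple[Counter, Counter]:
    """Count occurrences of a phenomenon per month, escalating only uncertain texts."""
    counts: Counter = Counter()
    stats: Counter = Counter()
    for month, text in docs:
        if not keyword_filter(text, keywords):
            stats["filtered"] += 1
            continue
        score = cheap_score(text)
        if score >= high:
            # Stage 2 is confident the phenomenon is present.
            counts[month] += 1
            stats["cheap_accept"] += 1
        elif score <= low:
            # Stage 2 is confident the phenomenon is absent.
            stats["cheap_reject"] += 1
        else:
            # Stage 3: only uncertain texts reach the expensive model.
            stats["expensive_calls"] += 1
            if expensive_label(text):
                counts[month] += 1
    return counts, stats


TOPIC_WORDS = {"inflation", "prices"}


def toy_cheap_score(text: str) -> float:
    """Toy stand-in: number of topic words found, scaled and capped at 1.0."""
    hits = sum(token in TOPIC_WORDS for token in text.lower().split())
    return min(1.0, hits / 2)


def toy_expensive_label(text: str) -> bool:
    """Toy stand-in for a costly model: treats a negation as 'not about the topic'."""
    return "not" not in text.lower().split()


docs = [
    ("2023-01", "inflation and prices keep rising"),
    ("2023-01", "nice weather today"),
    ("2023-02", "inflation is not my concern"),
    ("2023-02", "worried about inflation again"),
]
counts, stats = cascade_monthly_counts(docs, TOPIC_WORDS, toy_cheap_score, toy_expensive_label)
print(dict(counts))  # per-month frequency of the phenomenon
print(dict(stats))   # how many texts each stage handled
```

In this toy run only two of the four texts reach the expensive stage, which is the cost saving the cascade is meant to deliver; in a real setting the thresholds would have to be tuned against labelled data.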

Classification

  • Type

    D - Article in proceedings

  • CEP field

  • OECD FORD field

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result linkages

  • Project

  • Linkages

    I - Institutional support for the long-term conceptual development of a research organisation

Other

  • Year of application

    2024

  • Data confidentiality code

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

  • Title of the article in proceedings

    ICPRAM 2024

  • ISBN

    978-989758684-2

  • ISSN

    2184-4313

  • e-ISSN

  • Number of pages

    11

  • Pages from-to

    695-705

  • Publisher name

  • Place of publication

    Roma, Italy

  • Event location

    Roma, Italy

  • Event date

    28 January 2024

  • Event type by nationality

    WRD - Worldwide event

  • UT WoS article code