Odvození větné struktury bez anotovaných korpusů

Project title in English
Sentence structure induction without annotated corpora
Abstract in English
Syntactic analysis of sentences is one of the fundamental problems of computational linguistics. At present, this is done with supervised approaches that need large syntactically annotated corpora (treebanks) to learn the syntax of a language. Their disadvantage is the cost and time needed to build such a corpus, and the need to create a new treebank for each additional language. In this project, we will work on an alternative method: the syntactic relations will be learned automatically from text corpora with no linguistic annotation. Such "unsupervised" methods have recently become very popular, and it turns out that for certain types of tasks they outperform the supervised methods. Their advantages are simplicity and independence of language and domain. We will test the induced grammar models in applications where simple n-gram models currently outperform syntactic ones, for example in machine translation. Our hypothesis is that syntactic models based solely on data, and not on linguistic rules, can improve machine translation results.

R&D category
ZV - Basic research
CEP - main field
AI - Linguistics
CEP - secondary field
IN - Informatics
CEP - additional secondary field
—
OECD FORD - corresponding fields (according to the converter at http://www.vyzkum.cz/storage/att/E6EF7938F0E854BAE520AC119FB22E8D/Prevodnik_oboru_Frascati.pdf)
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
60201 - General language studies
60202 - Specific languages
60203 - Linguistics

Evaluation by the provider
U - Fulfilled the assignment (with published or patented results, etc.)
Assessment of project results
All main objectives of the project were met. The open-source software LiStr, which contains tools for unsupervised induction of sentence structure, was released. The key outcomes are the findings on the capabilities of unsupervised parsing, including its use in machine translation.

Data confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
System identifier of the data delivery
CEP17-GA0-GP-U/01:1
Record delivery date
30 June 2017
