Analytical grammar forms extraction as a new challenge for corpora (Case of conditional mood in Polish and Ukrainian)

Fokin, S. (2022). Analytical grammar forms extraction as a new challenge for corpora (Case of conditional mood in Polish and Ukrainian). Polonica, 42(1). https://doi.org/10.17651/POLON.42.9

Abstract

Tagging analytical grammatical categories is a particular challenge for modern text corpora. The components of such categories may be separated by other words in certain contexts, or may even appear in inverted order. Of particular interest for the extraction of analytical forms is the conditional mood in some Slavic languages, which is expressed by two words: a past verb form and the particle by/б/би/бы. Because the form consists of two words, most modern corpora have no dedicated tag for it. Polish is especially complicated because the particle by may either be merged with the participle or used separately; moreover, the separate form may carry a personal verb ending. Queries designed for this purpose and tested on Polish and Ukrainian corpora make it possible to extract the analytical forms in question.
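The matching problem described in the abstract can be illustrated with a minimal Python sketch. It is not the query set used in the article, which is formulated in corpus query languages; it only shows the three situations the abstract names: a particle separated from the past form by other words, an inverted order, and a Polish form in which the particle is merged with the past form. The tags PAST and PART, the function name find_conditionals, the max_gap parameter, and the assumption that a merged form is tagged as a single PAST token are all hypothetical and do not correspond to any real corpus tagset.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, simplified tag); the tags are hypothetical

# Separate forms of the conditional particle. Polish forms may carry a
# personal ending (bym, byś, byśmy, byście); б and би are Ukrainian.
PARTICLES = {"by", "bym", "byś", "byśmy", "byście", "б", "би"}

# Endings of Polish forms in which the particle is merged with the past form.
MERGED_ENDINGS = ("byśmy", "byście", "bym", "byś", "by")


def find_conditionals(tokens: List[Token], max_gap: int = 3) -> List[Tuple[int, ...]]:
    """Return token index tuples for candidate conditional forms.

    A one-element tuple marks a merged form; a two-element tuple marks a
    particle and a past form with at most max_gap tokens between them,
    in either order.
    """
    hits = []
    for i, (word, tag) in enumerate(tokens):
        if tag != "PAST":
            continue
        # Assumption: a merged form is a single token tagged PAST.
        if word.lower().endswith(MERGED_ENDINGS):
            hits.append((i,))
            continue
        # Scan a window on both sides of the past form, left to right,
        # and take the first particle found.
        start = max(0, i - max_gap - 1)
        stop = min(len(tokens), i + max_gap + 2)
        for j in range(start, stop):
            if j != i and tokens[j][1] == "PART" and tokens[j][0].lower() in PARTICLES:
                hits.append(tuple(sorted((i, j))))
                break
    return hits


# Polish, particle with a personal ending placed before the verb.
polish_split = [("Ja", "PRON"), ("bym", "PART"), ("to", "PRON"),
                ("chętnie", "ADV"), ("zrobił", "PAST")]
# Polish, particle merged with the past form.
polish_merged = [("Zrobiłbym", "PAST"), ("to", "PRON")]
# Ukrainian, particle after and before the verb.
ukrainian_after = [("Я", "PRON"), ("зробив", "PAST"), ("би", "PART"), ("це", "PRON")]
ukrainian_before = [("Я", "PRON"), ("б", "PART"), ("це", "PRON"), ("зробив", "PAST")]

for sentence in (polish_split, polish_merged, ukrainian_after, ukrainian_before):
    print(find_conditionals(sentence))
```

The window size is a trade-off: a larger max_gap finds more widely separated pairs but also pairs a past form with a particle that belongs to a different clause, so any real query would need manual checking of its results.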


