Nagaoka Tigrinya Corpus

1. What is a corpus?

"In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed)" [Wikipedia].

2. The Nagaoka Tigrinya corpus 1.0 (NTC 1.0)

The Nagaoka Tigrinya Corpus is the first publicly available part-of-speech (PoS) tagged corpus of the Tigrinya language.
It was compiled at Nagaoka University of Technology.
The corpus is a collection of news articles from an Eritrean newspaper called "Haddas Ertra". 
It contains about 100 articles published between March 2013 and December 2013.
The current release of NTC (NTC 1.0) has a total of 72,080 tokens. 
On average, one sentence contains 15 words.

The text was randomly selected from the following domains (topics) of Haddas Ertra.

 Topic          Articles
 Agriculture          10
 Business              5
 Culture              14
 Health               13
 History               4
 Law                   9
 Politics              7
 Relationship          8
 Sport                11
 Social               12
 General               7
 Total               100

3. Tagset design

The corpus was manually tagged for part of speech, with a few enhancements made automatically.
NTC 1.0 is labelled with 20 Tigrinya part-of-speech tags that carry level-1 (major PoS category) and level-2 (type of category) information. The tags are as follows:

 Category       Type           Label
 Noun                          N
                Verbal         N_V
                Proper         N_PRP
 Pronoun                       PRO
 Verb                          V
                Perfective     V_PRF
                Imperfective   V_IMF
                Imperative     V_IMV
                Gerundive      V_GER
                Auxiliary      V_AUX
                Relative       V_REL
 Adjective                     ADJ
 Adverb                        ADV
 Preposition                   PRE
 Conjunction                   CON
 Interjection                  INT
 Numeral                       NUM
 Punctuation                   PUN
 Foreign Word                  FW
 Unclassified                  UNC
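
The two-level labels lend themselves to simple programmatic handling. The following is a minimal Python sketch, not part of the NTC distribution, that splits a label into its level-1 and level-2 parts on the underscore. The names NTC_TAGS, split_tag and group_by_category are hypothetical and chosen for illustration only.

```python
# Labels copied from the NTC 1.0 tagset table.
NTC_TAGS = [
    "N", "N_V", "N_PRP", "PRO",
    "V", "V_PRF", "V_IMF", "V_IMV", "V_GER", "V_AUX", "V_REL",
    "ADJ", "ADV", "PRE", "CON", "INT", "NUM", "PUN", "FW", "UNC",
]


def split_tag(label):
    """Return (level1, level2); level2 is None for a bare category label."""
    if label not in NTC_TAGS:
        raise ValueError(f"unknown NTC tag: {label!r}")
    level1, _, level2 = label.partition("_")
    return level1, (level2 or None)


def group_by_category(labels):
    """Group labels under their level-1 category, preserving input order."""
    groups = {}
    for label in labels:
        level1, _ = split_tag(label)
        groups.setdefault(level1, []).append(label)
    return groups


print(split_tag("V_PRF"))               # ('V', 'PRF')
print(split_tag("ADJ"))                 # ('ADJ', None)
print(group_by_category(NTC_TAGS)["N"])  # ['N', 'N_V', 'N_PRP']
```

Grouping by the level-1 part makes it possible to collapse the tagset to major categories when a coarser analysis is enough.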

The guidelines for tagging NTC 1.0 were developed based on three Tigrinya grammar books:
1) Tigrinya Grammar by Adi Ghebre (2000)
2) A Comprehensive Tigrinya Grammar by Amanuel Sahle (1998)
3) Tigrinya Grammar by John Mason (1996)
 
4. Format of NTC

Tigrinya uses the Ge'ez script as its writing system.
The corpus is available in both Ge'ez script and transliterated Latin script. The SERA transliteration scheme was used with a few adjustments. The upper case 'I' is used exclusively to mark the epenthetic vowel (known as 'sads' in Ge'ez script).
For machine readability and flexible manipulation, the corpus was pre-processed (cleaned) and encoded in TEI corpus format.
The retained punctuation marks are ፡ (two dots), ። (four dots), ፧ (three dots) or ?, !, "" and (). The first three are specific to languages that use the Ge'ez script.
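
As an illustration of how the retained punctuation marks can be handled, here is a minimal Python sketch that makes each mark its own token. It is not the pre-processing code used to build NTC; the names RETAINED_PUNCT and separate_punctuation are hypothetical, and whitespace tokenization is an assumption.

```python
import re

# Punctuation retained in NTC 1.0, as listed in the text above.
RETAINED_PUNCT = "፡።፧?!\"()"

# Character class matching any single retained mark.
_PUNCT_RE = re.compile("([" + re.escape(RETAINED_PUNCT) + "])")


def separate_punctuation(text):
    """Split text on whitespace, with each retained mark as its own token."""
    spaced = _PUNCT_RE.sub(r" \1 ", text)
    return spaced.split()


print(separate_punctuation("ክጽሕፍ። (kISIHIfI)?"))
# ['ክጽሕፍ', '።', '(', 'kISIHIfI', ')', '?']
```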
In order to normalize the corpus, cliticized words (words joined by an apostrophe) are separated into their constituent parts. 
For example, ክጽሕፍ’ዩ /kISIHIfI’yu/ ‘he will write’ is a cliticized form of the two words ክጽሕፍ /kISIHIfI/ and እዩ /Iyu/. 
Such forms arise because it is customary in writing to mask laryngeals such as እ ‘I’, ኣ ‘a’ or ኢ ‘i’ with an apostrophe.
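
The separation of cliticized words can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used for NTC: the lookup table FULL_FORMS is hypothetical and contains only the pair from the example above, since recovering the masked laryngeal in general requires knowing which full form the remainder stands for.

```python
# Apostrophe variants that may join a cliticized word (ASCII and U+2019).
APOSTROPHES = "'\u2019"

# Hypothetical lookup from the part after the apostrophe to its full form.
# Only the pair from the example in the text is included here.
FULL_FORMS = {
    "ዩ": "እዩ",    # Ge'ez script
    "yu": "Iyu",  # transliterated Latin script
}


def split_clitic(word):
    """Split a cliticized word into its parts; return [word] if not cliticized."""
    for mark in APOSTROPHES:
        head, sep, tail = word.partition(mark)
        if sep and head and tail:
            # Restore the masked laryngeal when the remainder is known;
            # otherwise the remainder is returned unchanged.
            return [head, FULL_FORMS.get(tail, tail)]
    return [word]


print(split_clitic("kISIHIfI\u2019yu"))  # ['kISIHIfI', 'Iyu']
```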


5. Downloads

NTC 1.0 can be used freely for research purposes.

    1. Download NTC 1.0 - TEI format in Ge'ez script
    2. Download NTC 1.0 - TEI format in Latin script (Transliterated)
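
Once a file has been downloaded, the tagged tokens could be read along the following lines. This Python sketch assumes that sentences are marked with an element named "s", tokens with an element named "w", and that the PoS label sits in a "type" attribute. These names, the sample document and the tags in it are assumptions made for illustration and are not taken from the NTC files, so the parameters should be adjusted to the markup actually found in the download.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names, attribute name and tags are illustrative only.
SAMPLE = """<TEI><text><body>
<s><w type="V_IMF">kISIHIfI</w><w type="V_AUX">Iyu</w></s>
</body></text></TEI>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def read_tagged_sentences(xml_text, sentence_tag="s", token_tag="w", pos_attr="type"):
    """Return a list of sentences, each a list of (token, tag) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for elem in root.iter():
        if local_name(elem.tag) != sentence_tag:
            continue
        tokens = [
            ((child.text or "").strip(), child.get(pos_attr))
            for child in elem.iter()
            if local_name(child.tag) == token_tag
        ]
        sentences.append(tokens)
    return sentences


print(read_tagged_sentences(SAMPLE))
# [[('kISIHIfI', 'V_IMF'), ('Iyu', 'V_AUX')]]
```

Matching on local names keeps the sketch working whether or not the file declares the TEI namespace.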

6. Contact us

For suggestions, corrections, or questions about using the corpus, please reach us at:

yemane@jnlp.org or yemanekeleta@gmail.com.

We appreciate your input to help us improve the quality of NTC.

We hope this corpus will encourage further Natural Language Processing (NLP) research on Tigrinya and other Eritrean languages.


7. Uses of the corpus

 


 

