ota-roberta-base-ner (Ottoman Turkish NER)
This model is a Named Entity Recognition (NER) model for Ottoman Turkish, fine-tuned from enesyila/ota-roberta-base.
It recognizes PERSON, LOCATION, ORGANIZATION, and MISC entities in Ottoman Turkish texts.
Model Details
- Developed by: Enes Yılandiloğlu
- Model type: Token classification (NER)
- Language(s): Ottoman Turkish (ota)
- License: cc-by-nc-4.0
- Finetuned from: enesyila/ota-roberta-base
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned NER model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("enesyila/ota-roberta-base-ner")
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-roberta-base-ner")

# aggregation_strategy="average" groups token predictions into entity spans.
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="average")

text = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi Medresesinden Meḥmed Efendi mevsûl oldu."
print(nlp(text))
```

Output:

```
[{'entity_group': 'PER','score': 0.9800526,'word': 'Aḥmed Paşa','start': 0,'end': 10},
{'entity_group': 'LOC','score': 0.95372033,'word': 'Edrine','start': 21,'end': 27},
{'entity_group': 'ORG','score': 0.8995747,'word': 'Meḥmed Efendi Medresesinden','start': 32,'end': 59},
{'entity_group': 'PER','score': 0.9849827,'word': 'Meḥmed Efendi','start': 60,'end': 73}]
```
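The pipeline returns a flat list of dictionaries, so a common next step is to group the recognized entities by label. The sketch below works on the sample output shown above; the helper name `group_entities` and the `min_score` threshold are illustrative assumptions, not part of the model or of the `transformers` API.

```python
from collections import defaultdict

# Sample pipeline output copied from the example above.
predictions = [
    {"entity_group": "PER", "score": 0.9800526, "word": "Aḥmed Paşa", "start": 0, "end": 10},
    {"entity_group": "LOC", "score": 0.95372033, "word": "Edrine", "start": 21, "end": 27},
    {"entity_group": "ORG", "score": 0.8995747, "word": "Meḥmed Efendi Medresesinden", "start": 32, "end": 59},
    {"entity_group": "PER", "score": 0.9849827, "word": "Meḥmed Efendi", "start": 60, "end": 73},
]


def group_entities(preds, min_score=0.0):
    """Group entity strings by label, skipping spans scored below min_score.

    Hypothetical helper: the threshold is an illustrative choice, not a
    recommendation from the model card.
    """
    grouped = defaultdict(list)
    for pred in preds:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


all_entities = group_entities(predictions)
confident_entities = group_entities(predictions, min_score=0.9)
print(all_entities)
print(confident_entities)
```

With `min_score=0.9`, the ORG span (score 0.8995747) is dropped while both PER spans and the LOC span are kept.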
Training Procedure
- Loss: Cross-entropy loss
- Batch size: 16 (train), 16 (eval)
- Optimizer: AdamW
- Learning rate: 3e-5
- Learning rate scheduler: Linear
- Warmup ratio: 0.01
- Epochs: 10 (early stopping enabled)
- Gradient checkpointing: Enabled
- Mixed precision: Enabled (fp16)
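To make the learning-rate settings above concrete, here is a minimal sketch of a linear schedule with warmup in plain Python. The dictionary keys mirror `transformers.TrainingArguments` argument names, but that the author used those exact arguments, that the batch size is per device, and the total step count (`TOTAL_STEPS`) are all assumptions made for illustration.

```python
import math

# Hyperparameters as listed above. Mapping them to TrainingArguments-style
# names (and treating batch size as per-device) is an assumption.
HPARAMS = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 10,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.01,
    "gradient_checkpointing": True,
    "fp16": True,
}


def linear_lr(step, total_steps,
              base_lr=HPARAMS["learning_rate"],
              warmup_ratio=HPARAMS["warmup_ratio"]):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 1000  # hypothetical; the real value depends on the dataset size
for step in (0, 5, 10, 500, 1000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

With a warmup ratio of 0.01, only the first 1% of steps ramp the rate up to 3e-5; the remaining steps decay it linearly toward zero.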
Training Data
The model was fine-tuned on a manually annotated corpus of five classical Ottoman Turkish works, in both prose and verse, transliterated with the IJMES alphabet. The corpus contains 6,992 NER spans labeled PER, LOC, ORG, or MISC.
The following works were used as training data:
- Ḳıṣâṣ-i Enbiyâ (16th century)
- Zeyl-i Şakâʾik (17th century)
- Veḳâyiʿü'l-Fużala (1731)
- Neticetü'l-Fikriyye (18th century)
- Silkü'l-Leʾal-i ʿÂl-i Os̱mân (18th century)
Named entity distribution by dataset split (roughly 80/10/10):
| Split | LOC | MISC | ORG | PER | TOTAL |
|---|---|---|---|---|---|
| Train | 1313 | 609 | 813 | 2835 | 5570 |
| Dev | 147 | 68 | 133 | 365 | 713 |
| Test | 162 | 61 | 124 | 362 | 709 |
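As a quick sanity check on the table above, the following sketch recomputes the row totals and each split's share of all annotated spans. The variable names are illustrative; the counts are copied from the table.

```python
# Entity counts per split, copied from the table above.
split_counts = {
    "Train": {"LOC": 1313, "MISC": 609, "ORG": 813, "PER": 2835},
    "Dev": {"LOC": 147, "MISC": 68, "ORG": 133, "PER": 365},
    "Test": {"LOC": 162, "MISC": 61, "ORG": 124, "PER": 362},
}

# Sum each row, then express it as a share of all spans.
split_totals = {name: sum(counts.values()) for name, counts in split_counts.items()}
grand_total = sum(split_totals.values())
shares = {name: total / grand_total for name, total in split_totals.items()}

print(grand_total)  # 6992
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# Train: 79.7%
# Dev: 10.2%
# Test: 10.1%
```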
Evaluation Results
Span-level results on the test set:
| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| LOC | 0.8971 | 0.9385 | 0.9173 | 195 |
| MISC | 0.7317 | 0.7895 | 0.7595 | 76 |
| ORG | 0.9195 | 0.9195 | 0.9195 | 149 |
| PER | 0.9066 | 0.9278 | 0.9171 | 471 |
Span-level (micro avg):
- Precision: 0.8909
- Recall: 0.9169
- F1: 0.9038
Span-level (macro avg):
- Precision: 0.8637
- Recall: 0.8938
- F1: 0.8783
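The span-level averages can be approximately reproduced from the per-label rows. The sketch below computes the macro average as the unweighted mean over labels, and estimates the micro average by reconstructing true-positive and predicted counts from the rounded precision, recall, and support values. This reconstruction is an assumption made for illustration; the model card does not state which evaluation tool produced the figures.

```python
# (precision, recall, f1, support) per label, copied from the span-level table.
span_scores = {
    "LOC": (0.8971, 0.9385, 0.9173, 195),
    "MISC": (0.7317, 0.7895, 0.7595, 76),
    "ORG": (0.9195, 0.9195, 0.9195, 149),
    "PER": (0.9066, 0.9278, 0.9171, 471),
}


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 across labels."""
    n = len(scores)
    return tuple(sum(row[i] for row in scores.values()) / n for i in range(3))


def micro_average(scores):
    """Pool counts over labels; counts are estimated from rounded scores."""
    tp = predicted = gold = 0
    for precision, recall, _f1, support in scores.values():
        label_tp = round(recall * support)        # estimated true positives
        tp += label_tp
        predicted += round(label_tp / precision)  # estimated predicted spans
        gold += support
    p, r = tp / predicted, tp / gold
    return p, r, 2 * p * r / (p + r)


macro = macro_average(span_scores)
micro = micro_average(span_scores)
print("macro:", [f"{x:.4f}" for x in macro])
print("micro:", [f"{x:.4f}" for x in micro])
```

The macro average treats MISC (76 spans) the same as PER (471 spans), which is why it sits below the micro average: the weakest label pulls it down more.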
Token-level results (excluding the “O” label):
| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| B-PER | 0.9470 | 0.9490 | 0.9480 | 471 |
| I-PER | 0.9574 | 0.9662 | 0.9618 | 2956 |
| B-LOC | 0.9254 | 0.9538 | 0.9394 | 195 |
| I-LOC | 0.9176 | 0.9125 | 0.9150 | 537 |
| B-ORG | 0.9589 | 0.9396 | 0.9492 | 149 |
| I-ORG | 0.9701 | 0.9602 | 0.9651 | 979 |
| B-MISC | 0.8800 | 0.8684 | 0.8742 | 76 |
| I-MISC | 0.9128 | 0.8738 | 0.8929 | 515 |
Token-level (macro avg, excl. “O”):
- Precision: 0.9336
- Recall: 0.9279
- F1: 0.9307
Token-level (weighted avg, excl. “O”):
- Precision: 0.9491
- Recall: 0.9495
- F1: 0.9493
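The B-/I- prefixes in the token-level table suggest a BIO tagging scheme, in which B- opens an entity and I- continues it. The sketch below, assuming that scheme, converts a tag sequence into spans and shows why span-level scores tend to be lower than token-level ones: a single wrong boundary token invalidates the whole span. The function name and the toy tag sequences are illustrative only.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label, start = None, None
        # Open a span on B-, or leniently on an I- with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Toy example: the prediction misses the last token of the entity.
gold = ["B-PER", "I-PER", "I-PER", "O"]
pred = ["B-PER", "I-PER", "O", "O"]

token_matches = sum(g == p for g, p in zip(gold, pred))
span_matches = len(set(bio_to_spans(gold)) & set(bio_to_spans(pred)))
print(token_matches, "of", len(gold), "tokens match")
print(span_matches, "of", len(bio_to_spans(gold)), "spans match")
```

Here three of four tokens match, yet no span matches exactly, because the predicted entity ends one token early.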
Model Card Author
Enes Yılandiloğlu
Model Card Contact