# NyishiBERT
NyishiBERT is a monolingual, transformer-based masked language model for Nyishi (njz-Latn), a Sino-Tibetan language spoken in Northeast India.
## Model Details

### Model Description
- Developed by: MWire Labs
- Model type: Masked Language Model (MLM)
- Language: Nyishi (ISO 639-3: njz, Roman script)
- License: CC-BY-4.0
- Base architecture: ModernBERT-Base
- Parameters: 149M
- Training data: 55,870 sentences from WMT EMNLP 2025 (WMT25)
## Model Architecture

- Architecture: ModernBERT-Base
- Parameters: 149M
- Layers: 22
- Hidden size: 768
- Attention heads: 12
- Context window: 1024 tokens
- Positional embeddings: RoPE (Rotary Position Embeddings)
- Normalization: Pre-LayerNorm
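These hyperparameters can be checked against the configuration that ships with the checkpoint. A minimal sketch, assuming the field names of the standard ModernBERT configuration in `transformers`:

```python
from transformers import AutoConfig

# Load the configuration stored with the released checkpoint
config = AutoConfig.from_pretrained("MWireLabs/nyishibert")

# Inspect the main architectural hyperparameters
print(config.num_hidden_layers)        # expected: 22
print(config.hidden_size)              # expected: 768
print(config.num_attention_heads)      # expected: 12
print(config.max_position_embeddings)  # expected: 1024
```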
## Training Details

**Training Data:**
- Source: WMT EMNLP 2025 (Tenth Conference on Machine Translation)
- Total sentences: 55,870
- Training split: 44,696 sentences (80%)
- Validation split: 5,587 sentences (10%)
- Test split: 5,587 sentences (10%)
- Script: Roman (njz-Latn)
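An equivalent 80/10/10 split can be made with the `datasets` library. This is a sketch under stated assumptions: the file name and random seed are placeholders, not the values used to produce the official splits.

```python
from datasets import load_dataset

# One Nyishi sentence per line; the file name is illustrative
dataset = load_dataset("text", data_files={"all": "nyishi_wmt25.txt"})["all"]

# Hold out 20% for evaluation, then split that portion evenly into validation and test
split = dataset.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], heldout["train"], heldout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # 44,696 / 5,587 / 5,587 of 55,870 sentences
```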
**Training Configuration:**
- Objective: Masked Language Modeling (15% masking probability)
- Optimizer: AdamW
- Learning rate: 2e-5 (linear warmup + decay)
- Warmup ratio: 10%
- Batch size: 16 (effective)
- Training epochs: 10
- Total steps: 27,940
- Precision: bfloat16
- Hardware: 1× NVIDIA A40 (48GB)
- Training time: ~1.7 hours
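This configuration maps naturally onto the Hugging Face `Trainer` API. The sketch below is an illustrative reconstruction, not the released training script: it assumes from-scratch initialization of a ModernBERT-Base-sized model with the shared tokenizer, and `train_ds`/`val_ds` are assumed to be the tokenized splits described above.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    ModernBertConfig,
    ModernBertForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Shared SentencePiece Unigram tokenizer (see Tokenization below)
tokenizer = AutoTokenizer.from_pretrained("MWireLabs/ne-bert")

# ModernBERT-Base-sized model with the shared vocabulary and a 1024-token context window
config = ModernBertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=22,
    num_attention_heads=12,
    max_position_embeddings=1024,
)
model = ModernBertForMaskedLM(config)

# Dynamic masking at the reported 15% probability
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="nyishibert-mlm",
    learning_rate=2e-5,          # linear warmup + decay is the Trainer default schedule
    warmup_ratio=0.1,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # tokenized training split (assumed defined)
    eval_dataset=val_ds,     # tokenized validation split (assumed defined)
    data_collator=collator,
)
trainer.train()
```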
**Tokenization:**
- Tokenizer: SentencePiece Unigram, shared with NE-BERT (MWireLabs/ne-bert)
- Vocabulary size: 50,368
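To see how the shared tokenizer segments Nyishi text, it can be loaded directly from this checkpoint; the example sentence is the one used in the Usage section below.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")

# Reported vocabulary size is 50,368 (special tokens may be counted separately)
print(tokenizer.vocab_size)
print(tokenizer.tokenize("Ngulug [MASK] nyilakuma"))
```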
## Performance

### Intrinsic Evaluation
Evaluated on held-out test set (5,587 sentences):
| Metric | Score |
|---|---|
| Test Loss | 3.03 |
| Perplexity | 20.78 |
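Perplexity is derived from the mean masked-token cross-entropy loss as exp(loss):

```python
import math

test_loss = 3.03                      # rounded test loss from the table above
print(round(math.exp(test_loss), 2))  # ~20.7; the reported 20.78 uses the unrounded loss
```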
## Usage

### Direct Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")
model = AutoModelForMaskedLM.from_pretrained("MWireLabs/nyishibert")

# Example: fill in a masked token
text = "Ngulug [MASK] nyilakuma"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the masked position and take the highest-scoring token
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
logits = outputs.logits[0, masked_index, :]
predicted_token_id = logits.argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted word: {predicted_token}")
```
### Pipeline Usage

```python
from transformers import pipeline

# Create a fill-mask pipeline
unmasker = pipeline("fill-mask", model="MWireLabs/nyishibert")

# Predict the masked token
result = unmasker("Ngulug [MASK] nyilakuma")
print(result)
```
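The pipeline returns the highest-scoring candidates for the masked position (five by default); the `top_k` argument controls how many are returned:

```python
# Return the ten best candidates with their scores
for candidate in unmasker("Ngulug [MASK] nyilakuma", top_k=10):
    print(candidate["token_str"], candidate["score"])
```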
### Fine-tuning
This model can be fine-tuned for downstream tasks such as:
- Text classification
- Named entity recognition
- Part-of-speech tagging
- Dependency parsing
```python
from transformers import AutoModelForSequenceClassification

# Load the pretrained encoder with a fresh sequence-classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "MWireLabs/nyishibert",
    num_labels=2
)
# ... add your fine-tuning code (e.g. with the Trainer API, sketched below)
```
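A minimal fine-tuning loop with the `Trainer` API might look like the following sketch. The datasets (`raw_train`/`raw_val`), label count, and hyperparameters are placeholders, not values validated for NyishiBERT.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")
model = AutoModelForSequenceClassification.from_pretrained(
    "MWireLabs/nyishibert", num_labels=2
)

def tokenize(batch):
    # Truncate to the model's 1024-token context window
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# raw_train / raw_val are assumed to be datasets with "text" and "label" columns
train_ds = raw_train.map(tokenize, batched=True)
val_ds = raw_val.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nyishibert-cls",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```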
## Limitations and Bias

### Known Limitations
**Script:** Trained exclusively on Roman script (njz-Latn). The model will not work with other scripts.

**Orthographic variation:** Nyishi lacks a standardized orthography. The model reflects the spelling conventions present in the WMT25 training data, which may differ from other writing practices.

**Domain coverage:** Training data comes from mixed domains in WMT25. Performance may vary on specialized domains not represented in the training corpus.

**Data size:** Trained on 55,870 sentences. While this is sufficient for meaningful language modeling, larger corpora would likely improve performance.

**Vocabulary coverage:** Uses the tokenizer shared with NE-BERT, so some Nyishi-specific terms may be suboptimally tokenized.
### Potential Biases
- The model may reflect biases present in the WMT25 training corpus
- Performance may be better on domains well-represented in training data
- Spelling variations common in digital Nyishi text may not all be equally represented
## Ethical Considerations

### Language and Community
- Community engagement: This model is intended to support Nyishi language technology and preservation efforts.
- Data sovereignty: All training data is from publicly available WMT25 resources.
- Orthography: Use of Roman script reflects current digital practice but does not constitute endorsement of any particular writing system.
### Responsible Use
- This model is a research tool for Nyishi language technology development
- Users should be aware of the model's limitations when deploying in production
- Community feedback on model behavior and outputs is welcomed
## Citation
If you use NyishiBERT in your research, please cite:
```bibtex
@misc{nyishibert2026,
  author       = {MWire Labs},
  title        = {NyishiBERT: A Monolingual Language Model for Nyishi},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWireLabs/nyishibert}},
}
```
Training data citation:
```bibtex
@inproceedings{wmt25,
  title     = {Findings of the 2025 Conference on Machine Translation (WMT25)},
  booktitle = {Proceedings of the Tenth Conference on Machine Translation},
  year      = {2025},
  month     = {November},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics}
}
```
## Model Card Contact
For questions, feedback, or issues regarding this model:
- Organization: MWire Labs
- Issues: Please open an issue on the model repository
## Acknowledgments
- Training data: WMT EMNLP 2025 (Tenth Conference on Machine Translation)
- Tokenizer: Shared with NE-BERT
- Architecture: ModernBERT