# NyishiBERT
NyishiBERT is a monolingual, transformer-based masked language model for Nyishi (njz-Latn), a Sino-Tibetan language spoken in Northeast India.
## Model Details

### Model Description
- Developed by: MWire Labs
- Model type: Masked Language Model (MLM)
- Language: Nyishi (ISO 639-3: njz, Roman script)
- License: CC-BY-4.0
- Base architecture: ModernBERT-Base
- Parameters: 149M
- Training data: 55,870 sentences from WMT EMNLP 2025 (WMT25)
## Model Architecture

- Architecture: ModernBERT-Base
- Parameters: 149M
- Layers: 22
- Hidden size: 768
- Attention heads: 12
- Context window: 1024 tokens
- Positional embeddings: RoPE (Rotary Position Embeddings)
- Normalization: Pre-LayerNorm
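These hyperparameters can be checked against the configuration that ships with the checkpoint. A minimal sketch, assuming the field names of the standard ModernBERT configuration in `transformers`:

```python
from transformers import AutoConfig

# Load the configuration stored with the released checkpoint
config = AutoConfig.from_pretrained("MWireLabs/nyishibert")

# Inspect the main architectural hyperparameters
print(config.num_hidden_layers)        # expected: 22
print(config.hidden_size)              # expected: 768
print(config.num_attention_heads)      # expected: 12
print(config.max_position_embeddings)  # expected: 1024
```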
## Training Details

**Training Data:**
- Source: WMT EMNLP 2025 (Tenth Conference on Machine Translation)
- Total sentences: 55,870
- Training split: 44,696 sentences (80%)
- Validation split: 5,587 sentences (10%)
- Test split: 5,587 sentences (10%)
- Script: Roman (njz-Latn)
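An equivalent 80/10/10 split can be made with the `datasets` library. This is a sketch under stated assumptions: the file name and random seed are placeholders, not the values used to produce the official splits.

```python
from datasets import load_dataset

# One Nyishi sentence per line; the file name is illustrative
dataset = load_dataset("text", data_files={"all": "nyishi_wmt25.txt"})["all"]

# Hold out 20% for evaluation, then split that portion evenly into validation and test
split = dataset.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], heldout["train"], heldout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # 44,696 / 5,587 / 5,587 of 55,870 sentences
```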
**Training Configuration:**
- Objective: Masked Language Modeling (15% masking probability)
- Optimizer: AdamW
- Learning rate: 2e-5 (linear warmup + decay)
- Warmup ratio: 10%
- Batch size: 16 (effective)
- Training epochs: 10
- Total steps: 27,940
- Precision: bfloat16
- Hardware: 1× NVIDIA A40 (48GB)
- Training time: ~1.7 hours
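This configuration maps naturally onto the Hugging Face `Trainer` API. The sketch below is an illustrative reconstruction, not the released training script: it assumes from-scratch initialization of a ModernBERT-Base-sized model with the shared tokenizer, and `train_ds`/`val_ds` are assumed to be the tokenized splits described above.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    ModernBertConfig,
    ModernBertForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Shared SentencePiece Unigram tokenizer (see Tokenization below)
tokenizer = AutoTokenizer.from_pretrained("MWireLabs/ne-bert")

# ModernBERT-Base-sized model with the shared vocabulary and a 1024-token context window
config = ModernBertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=22,
    num_attention_heads=12,
    max_position_embeddings=1024,
)
model = ModernBertForMaskedLM(config)

# Dynamic masking at the reported 15% probability
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="nyishibert-mlm",
    learning_rate=2e-5,          # linear warmup + decay is the Trainer default schedule
    warmup_ratio=0.1,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # tokenized training split (assumed defined)
    eval_dataset=val_ds,     # tokenized validation split (assumed defined)
    data_collator=collator,
)
trainer.train()
```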
**Tokenization:**
- Tokenizer: SentencePiece Unigram, shared with NE-BERT (MWireLabs/ne-bert)
- Vocabulary size: 50,368
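To see how the shared tokenizer segments Nyishi text, it can be loaded directly from this checkpoint; the example sentence is the one used in the Usage section below.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")

# Reported vocabulary size is 50,368 (special tokens may be counted separately)
print(tokenizer.vocab_size)
print(tokenizer.tokenize("Ngulug [MASK] nyilakuma"))
```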
## Performance

### Intrinsic Evaluation
Evaluated on held-out test set (5,587 sentences):
| Metric | Score |
|---|---|
| Test Loss | 3.03 |
| Perplexity | 20.78 |
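Perplexity is derived from the mean masked-token cross-entropy loss as exp(loss):

```python
import math

test_loss = 3.03                      # rounded test loss from the table above
print(round(math.exp(test_loss), 2))  # ~20.7; the reported 20.78 uses the unrounded loss
```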
## Usage

### Direct Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")
model = AutoModelForMaskedLM.from_pretrained("MWireLabs/nyishibert")

# Example: fill in a masked token
text = "Ngulug [MASK] nyilakuma"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the masked position and take the highest-scoring token
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
logits = outputs.logits[0, masked_index, :]
predicted_token_id = logits.argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted word: {predicted_token}")
```
### Pipeline Usage

```python
from transformers import pipeline

# Create a fill-mask pipeline
unmasker = pipeline("fill-mask", model="MWireLabs/nyishibert")

# Predict the masked token
result = unmasker("Ngulug [MASK] nyilakuma")
print(result)
```
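The pipeline returns the highest-scoring candidates for the masked position (five by default); the `top_k` argument controls how many are returned:

```python
# Return the ten best candidates with their scores
for candidate in unmasker("Ngulug [MASK] nyilakuma", top_k=10):
    print(candidate["token_str"], candidate["score"])
```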
### Fine-tuning
This model can be fine-tuned for downstream tasks such as:
- Text classification
- Named entity recognition
- Part-of-speech tagging
- Dependency parsing
```python
from transformers import AutoModelForSequenceClassification

# Load the pretrained encoder with a fresh sequence-classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "MWireLabs/nyishibert",
    num_labels=2
)
# ... add your fine-tuning code (e.g. with the Trainer API, sketched below)
```
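A minimal fine-tuning loop with the `Trainer` API might look like the following sketch. The datasets (`raw_train`/`raw_val`), label count, and hyperparameters are placeholders, not values validated for NyishiBERT.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")
model = AutoModelForSequenceClassification.from_pretrained(
    "MWireLabs/nyishibert", num_labels=2
)

def tokenize(batch):
    # Truncate to the model's 1024-token context window
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# raw_train / raw_val are assumed to be datasets with "text" and "label" columns
train_ds = raw_train.map(tokenize, batched=True)
val_ds = raw_val.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nyishibert-cls",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```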
## Limitations and Bias

### Known Limitations
**Script:** Trained exclusively on Roman script (njz-Latn). The model will not work with other scripts.

**Orthographic variation:** Nyishi lacks a standardized orthography. The model reflects the spelling conventions present in the WMT25 training data, which may differ from other writing practices.

**Domain coverage:** Training data comes from mixed domains in WMT25. Performance may vary on specialized domains not represented in the training corpus.

**Data size:** Trained on 55,870 sentences. While this is sufficient for meaningful language modeling, larger corpora would likely improve performance.

**Vocabulary coverage:** Uses the tokenizer shared with NE-BERT, so some Nyishi-specific terms may be suboptimally tokenized.
### Potential Biases
- The model may reflect biases present in the WMT25 training corpus
- Performance may be better on domains well-represented in training data
- Spelling variations common in digital Nyishi text may not all be equally represented
## Ethical Considerations

### Language and Community
- Community engagement: This model is intended to support Nyishi language technology and preservation efforts.
- Data sovereignty: All training data is from publicly available WMT25 resources.
- Orthography: Use of Roman script reflects current digital practice but does not constitute endorsement of any particular writing system.
### Responsible Use
- This model is a research tool for Nyishi language technology development
- Users should be aware of the model's limitations when deploying in production
- Community feedback on model behavior and outputs is welcomed
## Citation
If you use NyishiBERT in your research, please cite:
```bibtex
@misc{nyishibert2026,
  author       = {MWire Labs},
  title        = {NyishiBERT: A Monolingual Language Model for Nyishi},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWireLabs/nyishibert}},
}
```
Training data citation:
```bibtex
@inproceedings{wmt25,
  title     = {Findings of the 2025 Conference on Machine Translation (WMT25)},
  booktitle = {Proceedings of the Tenth Conference on Machine Translation},
  year      = {2025},
  month     = {November},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics}
}
```
## Model Card Contact
For questions, feedback, or issues regarding this model:
- Organization: MWire Labs
- Issues: Please open an issue on the model repository
## Acknowledgments
- Training data: WMT EMNLP 2025 (Tenth Conference on Machine Translation)
- Tokenizer: Shared with NE-BERT
- Architecture: ModernBERT