
NyishiBERT

NyishiBERT is a monolingual, transformer-based masked language model for Nyishi (njz-Latn), a Sino-Tibetan language spoken in Northeast India.

Model Details

Model Description

  • Developed by: MWire Labs
  • Model type: Masked Language Model (MLM)
  • Language: Nyishi (ISO 639-3: njz, Roman script)
  • License: CC-BY-4.0
  • Base architecture: ModernBERT-Base
  • Parameters: 149M
  • Training data: 55,870 sentences from WMT EMNLP 2025 (WMT25)

Model Architecture

Architecture: ModernBERT-Base
- Parameters: 149M
- Layers: 22
- Hidden size: 768
- Attention heads: 12
- Context window: 1024 tokens
- Positional embeddings: RoPE (Rotary Position Embeddings)
- Normalization: Pre-LayerNorm
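
A quick way to check these values against the published configuration (attribute names follow the standard ModernBERT config in transformers):

from transformers import AutoConfig

# Load the published configuration and print the key architecture fields.
config = AutoConfig.from_pretrained("MWireLabs/nyishibert")
print(config.num_hidden_layers)     # 22 layers
print(config.hidden_size)           # hidden size 768
print(config.num_attention_heads)   # 12 attention heads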

Training Details

Training Data:

  • Source: WMT EMNLP 2025 (Tenth Conference on Machine Translation)
  • Total sentences: 55,870
  • Training split: 44,696 sentences (80%)
  • Validation split: 5,587 sentences (10%)
  • Test split: 5,587 sentences (10%)
  • Script: Roman (njz-Latn)
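
An 80/10/10 split of this kind can be reproduced with the datasets library, as in the sketch below; the file path and seed are illustrative placeholders, not released artifacts.

from datasets import load_dataset

# One sentence per line; the path is a placeholder.
raw = load_dataset("text", data_files={"all": "nyishi_wmt25.txt"})["all"]

# 80% train, then split the remaining 20% evenly into validation and test.
split = raw.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, valid_ds, test_ds = split["train"], heldout["train"], heldout["test"]
print(len(train_ds), len(valid_ds), len(test_ds))  # ~44,696 / 5,587 / 5,587 for 55,870 lines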

Training Configuration:

  • Objective: Masked Language Modeling (15% masking probability)
  • Optimizer: AdamW
  • Learning rate: 2e-5 (linear warmup + decay)
  • Warmup ratio: 10%
  • Batch size: 16 (effective)
  • Training epochs: 10
  • Total steps: 27,940
  • Precision: bfloat16
  • Hardware: 1× NVIDIA A40 (48GB)
  • Training time: ~1.7 hours
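
To make the configuration above concrete, the sketch below wires these hyperparameters into a standard Hugging Face Trainer masked-LM run. It is a minimal reconstruction, not the released training script: the from-scratch initialization, the plain-text data files, and the output directory are assumptions.

from datasets import load_dataset
from transformers import (AutoConfig, AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")

# Assumption: the model is initialized from scratch with the published
# configuration; the card lists the architecture but not the initialization.
config = AutoConfig.from_pretrained("MWireLabs/nyishibert")
model = AutoModelForMaskedLM.from_config(config)

# Placeholder plain-text sentence files (one sentence per line).
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "valid.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

# 15% dynamic token masking, as listed above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="nyishibert-mlm",        # illustrative output path
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=16,
    num_train_epochs=10,
    bf16=True,
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"])
trainer.train()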

Tokenization:

  • Tokenizer: SentencePiece Unigram, shared with NE-BERT (MWireLabs/ne-bert)
  • Vocabulary size: 50,368
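
A quick way to inspect the shared tokenizer (the sample phrase is taken from the usage example below):

from transformers import AutoTokenizer

# Load the shared SentencePiece Unigram tokenizer.
tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")
print(len(tokenizer))                          # vocabulary size, expected around 50,368
print(tokenizer.tokenize("Ngulug nyilakuma"))  # subword segmentation of a Nyishi phrase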

Performance

Intrinsic Evaluation

Evaluated on the held-out test set (5,587 sentences):

  • Test loss: 3.03
  • Perplexity: 20.78
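
The reported perplexity is the exponential of the test cross-entropy loss, so the two figures agree up to rounding of the loss:

import math

# Perplexity is exp(cross-entropy loss).
print(math.exp(3.03))  # ~20.7, consistent with the reported perplexity of 20.78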

Usage

Direct Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")
model = AutoModelForMaskedLM.from_pretrained("MWireLabs/nyishibert")

# Example: Fill mask
text = "Ngulug [MASK] nyilakuma"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get predictions
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
logits = outputs.logits[0, masked_index, :]
predicted_token_id = logits.argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"Predicted word: {predicted_token}")

Pipeline Usage

from transformers import pipeline

# Create fill-mask pipeline
unmasker = pipeline('fill-mask', model='MWireLabs/nyishibert')

# Predict masked tokens
result = unmasker("Ngulug [MASK] nyilakuma")
print(result)

Fine-tuning

This model can be fine-tuned for downstream tasks such as:

  • Text classification
  • Named entity recognition
  • Part-of-speech tagging
  • Dependency parsing

from transformers import AutoModelForSequenceClassification

# Load for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    "MWireLabs/nyishibert",
    num_labels=2
)
# ... add your fine-tuning code
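
One way to finish that loop with the Hugging Face Trainer is sketched below; the labeled examples, label count, and hyperparameters are placeholders rather than values from this card.

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("MWireLabs/nyishibert")
model = AutoModelForSequenceClassification.from_pretrained(
    "MWireLabs/nyishibert", num_labels=2
)

# Placeholder labeled data -- substitute a real Nyishi classification dataset.
texts = ["Ngulug nyilakuma", "Ngulug nyilakuma"]
labels = [0, 1]

class TextDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and integer labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="nyishibert-cls",   # illustrative
                         learning_rate=2e-5,
                         per_device_train_batch_size=16,
                         num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=TextDataset(texts, labels))
trainer.train()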

Limitations and Bias

Known Limitations

  1. Script: Trained exclusively on Roman script (njz-Latn). The model will not work with other scripts.

  2. Orthographic variation: Nyishi lacks standardized orthography. The model reflects spelling conventions present in the WMT25 training data, which may vary from other writing practices.

  3. Domain coverage: Training data comes from mixed domains in WMT25. Performance may vary on specialized domains not represented in the training corpus.

  4. Data size: Trained on 55,870 sentences. While sufficient for meaningful language modeling, larger corpora would likely improve performance.

  5. Vocabulary coverage: Uses NE-BERT's shared tokenizer. Some Nyishi-specific terms may be suboptimally tokenized.

Potential Biases

  • The model may reflect biases present in the WMT25 training corpus
  • Performance may be better on domains well-represented in training data
  • Spelling variations common in digital Nyishi text may not all be equally represented

Ethical Considerations

Language and Community

  • Community engagement: This model is intended to support Nyishi language technology and preservation efforts.
  • Data sovereignty: All training data is from publicly available WMT25 resources.
  • Orthography: Use of Roman script reflects current digital practice but does not constitute endorsement of any particular writing system.

Responsible Use

  • This model is a research tool for Nyishi language technology development
  • Users should be aware of the model's limitations when deploying in production
  • Community feedback on model behavior and outputs is welcomed

Citation

If you use NyishiBERT in your research, please cite:

@misc{nyishibert2026,
  author = {MWire Labs},
  title = {NyishiBERT: A Monolingual Language Model for Nyishi},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWireLabs/nyishibert}},
}

Training data citation:

@inproceedings{wmt25,
  title = {Findings of the 2025 Conference on Machine Translation (WMT25)},
  booktitle = {Proceedings of the Tenth Conference on Machine Translation},
  year = {2025},
  address = {Suzhou, China},
  month = {November},
  publisher = {Association for Computational Linguistics}
}

Model Card Contact

For questions, feedback, or issues regarding this model:

  • Organization: MWire Labs
  • Issues: Please open an issue on the model repository

Acknowledgments

  • Training data: WMT EMNLP 2025 (Tenth Conference on Machine Translation)
  • Tokenizer: Shared with NE-BERT
  • Architecture: ModernBERT
