AST Fine-tuned on ESC-50

An Audio Spectrogram Transformer (AST) model fine-tuned on the ESC-50 dataset for environmental sound classification.

Model Description

This model is based on the Audio Spectrogram Transformer architecture and is fine-tuned to classify 50 categories of environmental sounds. The AST is a purely attention-based model: it splits an audio spectrogram into a sequence of patches and processes that sequence with a Transformer encoder, much as a Vision Transformer (ViT) processes image patches.
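To make the patch-sequence idea concrete, the following minimal sketch counts how many patches a spectrogram yields. The default values (128 mel bins, 1024 time frames, 16x16 patches, stride 10 on both axes) are assumptions taken from the commonly published AST AudioSet configuration, not from this card; the function name `num_patches` is hypothetical.

```python
def num_patches(num_mel_bins=128, max_frames=1024, patch_size=16,
                freq_stride=10, time_stride=10):
    """Count overlapping patches cut from a (mel bins x frames) spectrogram.

    All defaults are assumed values, not confirmed by this model card.
    """
    # Sliding-window count along each axis: floor((size - patch) / stride) + 1
    freq_patches = (num_mel_bins - patch_size) // freq_stride + 1
    time_patches = (max_frames - patch_size) // time_stride + 1
    return freq_patches, time_patches, freq_patches * time_patches


if __name__ == "__main__":
    freq, time, total = num_patches()
    print(f"{freq} x {time} = {total} patches")
```

Under these assumed settings the spectrogram becomes a 12 x 101 grid, so the Transformer sees 1212 patch embeddings per clip.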

Training

  • Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
  • Dataset: ESC-50 (Environmental Sound Classification)
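The card does not describe the training procedure. As a rough sketch only, one common way to adapt the base checkpoint to a new label set with the `transformers` library is to reload it with a fresh classification head. The helper names and the alphabetical label ordering below are assumptions for illustration, not the author's confirmed setup.

```python
BASE_CHECKPOINT = "MIT/ast-finetuned-audioset-10-10-0.4593"


def build_label_maps(labels):
    """Build id2label / label2id dictionaries.

    Alphabetical ordering is an assumption; the released model may use
    a different index order.
    """
    ordered = sorted(labels)
    id2label = dict(enumerate(ordered))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_finetuning(labels):
    """Load the base AST checkpoint with a new head sized for `labels`."""
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import ASTForAudioClassification

    id2label, label2id = build_label_maps(labels)
    return ASTForAudioClassification.from_pretrained(
        BASE_CHECKPOINT,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
        # The base head was trained on a different label set, so its
        # weights are discarded and re-initialised.
        ignore_mismatched_sizes=True,
    )
```

Calling `load_for_finetuning` downloads the base checkpoint; the returned model would still need to be trained on ESC-50 before it is useful.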

Labels

The model classifies audio into 50 environmental sound categories:

Animals: cat, chirping_birds, cow, crow, dog, frog, hen, insects, pig, rooster, sheep

Natural Sounds: crackling_fire, crickets, rain, sea_waves, thunderstorm, water_drops, wind

Human Sounds: breathing, brushing_teeth, clapping, coughing, crying_baby, drinking_sipping, footsteps, laughing, sneezing, snoring

Domestic Sounds: clock_alarm, clock_tick, door_wood_creaks, door_wood_knock, glass_breaking, keyboard_typing, mouse_click, toilet_flush, vacuum_cleaner, washing_machine

Urban Sounds: airplane, car_horn, church_bells, engine, fireworks, helicopter, siren, train

Mechanical/Tools: can_opening, chainsaw, hand_saw, pouring_water
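The grouping above can be turned into a lookup table and combined with the `transformers` audio-classification pipeline, as in the sketch below. It assumes the repository id `bioamla/ast-esc50` and assumes the model's output labels use exactly the strings listed above; `top_prediction` and `classify_file` are hypothetical helper names.

```python
GROUPS = {
    "Animals": ["cat", "chirping_birds", "cow", "crow", "dog", "frog", "hen",
                "insects", "pig", "rooster", "sheep"],
    "Natural Sounds": ["crackling_fire", "crickets", "rain", "sea_waves",
                       "thunderstorm", "water_drops", "wind"],
    "Human Sounds": ["breathing", "brushing_teeth", "clapping", "coughing",
                     "crying_baby", "drinking_sipping", "footsteps",
                     "laughing", "sneezing", "snoring"],
    "Domestic Sounds": ["clock_alarm", "clock_tick", "door_wood_creaks",
                        "door_wood_knock", "glass_breaking", "keyboard_typing",
                        "mouse_click", "toilet_flush", "vacuum_cleaner",
                        "washing_machine"],
    "Urban Sounds": ["airplane", "car_horn", "church_bells", "engine",
                     "fireworks", "helicopter", "siren", "train"],
    "Mechanical/Tools": ["can_opening", "chainsaw", "hand_saw",
                         "pouring_water"],
}

# Reverse lookup: label string -> group name
LABEL_TO_GROUP = {label: group
                  for group, labels in GROUPS.items()
                  for label in labels}


def top_prediction(predictions):
    """Pick the highest-scoring entry from a list of {"label", "score"} dicts."""
    best = max(predictions, key=lambda p: p["score"])
    group = LABEL_TO_GROUP.get(best["label"], "unknown")
    return best["label"], group, best["score"]


def classify_file(path, model_id="bioamla/ast-esc50"):
    """Classify one audio file; downloads the model on first use."""
    # Imported lazily so the lookup helpers work without transformers installed.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_id)
    return top_prediction(classifier(path))
```

For example, `classify_file("clip.wav")` would return a `(label, group, score)` tuple for a local file named `clip.wav` (a placeholder path). Decoding a file path through the pipeline typically requires ffmpeg to be installed.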

License

BSD-3-Clause

Model Details

  • Format: Safetensors
  • Model size: 86.2M params
  • Tensor type: F32