---
license: apache-2.0
tags:
- multimodal
- agentic-ai
- retrieval-augmented
- explainable-ai
- reasoning
- automation
- accessibility
- vision-language
- audio-processing
- table-understanding
language:
- en
- multilingual
pipeline_tag: any-to-any
---

# Universal-Multimodal-Agent (UMA)

## New: Multimodal Datasets Catalog (Phase 1 Data Collection)

We’re kicking off data collection for UMA. Below is a curated, growing catalog of widely used public multimodal datasets by category, with brief notes and links for immediate use.

### A. Text–Image

- LAION-5B — Massive web-scale image–text pairs; LAION-400M/1B subsets available. https://laion.ai/blog/laion-5b/
- COCO (2017 Captions) — Image captioning and detection; strong baselines. https://cocodataset.org/#home
- Visual Genome — Dense region descriptions, objects, attributes, relationships. https://visualgenome.org/
- Conceptual Captions (3M/12M) — Web image–alt-text pairs. https://ai.google.com/research/ConceptualCaptions/
- Flickr30k / Flickr8k — Classic captioning sets. https://hockenmaier.cs.illinois.edu/CS546-2014/data/flickr30k.html
- CC3M/CC12M — Common Crawl–derived image–text pairs. https://github.com/google-research-datasets/conceptual-captions
- SBU Captions — Image–text pairs from Flickr. http://www.cs.virginia.edu/~vicente/sbucaptions/
- TextCaps — OCR-centric captioning with text in images. https://textvqa.org/textcaps/
- VizWiz — Images taken by blind users; accessibility focus. https://vizwiz.org/
- WebLI (if accessible) — Large-scale multilingual image–text pairs. https://ai.google/discover/papers/webli/

### B. Text–Image Reasoning / VQA / Document QA

- VQAv2 — Visual Question Answering benchmark. https://visualqa.org/
- GQA — Compositional reasoning over scenes. https://cs.stanford.edu/people/dorarad/gqa/
- OK-VQA / A-OKVQA — Requires external knowledge. https://okvqa.allenai.org/
- ScienceQA — Multimodal science questions with diagrams. https://scienceqa.github.io/
- DocVQA / TextVQA — Reading text in images. https://textvqa.org/
- InfographicVQA — VQA on charts/infographics. https://www.microsoft.com/en-us/research/project/infographicvqa/
- ChartQA / PlotQA / Chart-to-Text — Chart understanding and reasoning. https://github.com/vis-nlp/ChartQA

### C. Text–Table (Structured Data)

- TabFact — Table fact verification from Wikipedia. https://tabfact.github.io/
- WikiTableQuestions — Semantic parsing over tables. https://ppasupat.github.io/WikiTableQuestions/
- ToTTo — Controlled table-to-text generation. https://github.com/google-research-datasets/ToTTo
- SQA — Sequential (multi-turn) QA over tables. https://allenai.org/data/sqa
- Spider — Text-to-SQL over multiple DBs (semi-structured). https://yale-lily.github.io/spider
- TURL — Table understanding pretraining. https://github.com/sunlab-osu/TURL
- OpenTabQA — Open-domain QA over tables. https://github.com/IBM/OpenTabQA
- MultiTab / TABBIE resources — Tabular reasoning. https://multitab-project.github.io/

### D. Text–Audio / Speech

- LibriSpeech — ASR with read English speech. https://www.openslr.org/12
- Common Voice — Multilingual crowdsourced speech. https://commonvoice.mozilla.org/
- Libri-Light — Large-scale unlabeled speech for self-supervised learning. https://github.com/facebookresearch/libri-light
- TED-LIUM / How2 — Talks with transcripts and multimodal context. https://lium.univ-lemans.fr/ted-lium/
- AudioSet — Weakly labeled ontology of sounds (with YouTube links). https://research.google.com/audioset/
- ESC-50 / UrbanSound8K — Environmental sound classification. https://github.com/karoldvl/ESC-50
- VoxCeleb — Speaker identification/verification. http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- SPGISpeech (if license allows) — Financial-domain ASR. https://datasets.kensho.com/s/sgpispeech

### E. Full Multimodal / Multi-domain (Text–Image–Table–Audio and more)

- MMMU — Massive Multi-discipline Multimodal Understanding benchmark. https://mmmu-benchmark.github.io/
- MMBench / MME / LVLM-eHub — Comprehensive LVLM evaluation suites. https://mmbench.opencompass.org.cn/
- EgoSchema / Ego4D (video+audio+text) — Egocentric multi-sensor datasets. https://ego4d-data.org/
- MultiModal C4 (MMC4) — Web-scale corpus of documents with interleaved images and text. https://github.com/allenai/mmc4
- WebQA / MultimodalQA — QA over web images and text. https://github.com/omni-us/research-multimodalqa
- Chart/Document suites: DocLayNet, PubLayNet, DocVQA series. https://github.com/ibm-aur-nlp/PubLayNet
- ArXivDoc / ChartX / SynthChart — Synthetic + real doc/chart sets. https://github.com/vis-nlp/ChartX

### F. Safety, Bias, and Accessibility-focused Sets

- Hateful Memes — Multimodal bias/toxicity benchmark. https://github.com/facebookresearch/mmf/tree/main/projects/hateful_memes
- ImageNet-A/O/R — Robustness variants. https://github.com/hendrycks/imagenet-r
- VizWiz (again) — Accessibility-oriented images/questions. https://vizwiz.org/
- MS MARCO (multimodal passages via docs) + OCR corpora — Retrieval grounding. https://microsoft.github.io/msmarco/

### G. Licensing and Usage Notes

- Always check each dataset’s license and terms of use; some require access requests or restrict commercial use.
- Maintain separate manifests with source, license, checksum, and intended use (see the sketch below). Prefer mirrored, deduplicated shards with exact provenance.
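To make the manifest guidance concrete, here is a minimal sketch of what one line of `datasets/manifest/*.jsonl` could record and how an entry might be checked before committing it. The field names (`name`, `source_url`, `license`, `sha256`, `intended_use`, `modalities`) and the helper functions are proposals for discussion, not a settled schema.

```python
# Proposed (not finalized) manifest format: one JSON object per line in
# datasets/manifest/*.jsonl, recording source, license, checksum, and
# intended use so provenance can be audited and shards deduplicated later.
import hashlib
import json
from pathlib import Path

REQUIRED_FIELDS = {"name", "source_url", "license", "sha256", "intended_use", "modalities"}


def sha256_of(path: Path) -> str:
    """Checksum a local shard so a mirrored copy can be verified byte-for-byte."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def append_entry(manifest_path: Path, entry: dict) -> None:
    """Validate the required fields, then append one JSONL record."""
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"manifest entry missing fields: {sorted(missing)}")
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Illustrative values only; fill sha256 via sha256_of() once a shard is mirrored.
    append_entry(
        Path("datasets/manifest/text_image.jsonl"),
        {
            "name": "coco-2017-captions",
            "source_url": "https://cocodataset.org/#home",
            "license": "check source terms before use",
            "sha256": "TODO",
            "intended_use": "captioning pretraining and evaluation",
            "modalities": ["image", "text"],
        },
    )
```

Keeping the license and checksum next to every source lets the Phase 1 downloaders and Phase 2 deduplication passes work from a single auditable record of what may be redistributed.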
---

## Call for Collaboration: Build UMA with Us

We’re assembling an open team. If you’re passionate about agentic multimodal AI, join us.

Roles we’re seeking (volunteer or sponsored collaborations):

- Research Scientists: Multimodal learning, alignment, grounding, evaluation.
- Research Engineers: Training pipelines, distributed systems, retrieval, tool-use interfaces.
- Data Scientists / Data Engineers: Dataset curation, cleaning, deduplication, data governance.
- Domain Experts: Finance, healthcare, education, accessibility, scientific communication.
- Accessibility Specialists: Inclusive design, alt-text/sonification, screen-reader workflows, disability advocacy.
- MLOps/Infra: Dataset storage, versioning, scalable training/eval infrastructure (HF Datasets, WebDataset, Parquet, Arrow).
- Community & Documentation: Tutorials, examples, benchmark harnesses, governance.

How to get involved now:

- Open a Discussion with your background and interests: https://huggingface.co/amalsp/Universal-Multimodal-Agent/discussions
- Propose datasets or contribute manifests via PRs (add to datasets/manifest/*.jsonl)
- Share domain-specific tasks and evaluation rubrics
- Star and watch the repo for updates

Initial roadmap for data:

- Phase 1: Curate public datasets and licenses; build manifests and downloaders
- Phase 2: Unified preprocessing (image, OCR, tables, audio), deduplication, quality filters
- Phase 3: Balanced training mixtures + eval suites (MMMU/MMBench/DocVQA/ASR)

Ethics & Safety:

- Respect dataset licenses, privacy, and consent. Implement filter lists and red-teaming sets.
- Document known biases and limitations; enable opt-out mechanisms where applicable.

Contributors will be acknowledged in the README and a future preprint.

## Original Project Overview

[Existing content retained below]