---
license: apache-2.0
tags:
- multimodal
- agentic-ai
- retrieval-augmented
- explainable-ai
- reasoning
- automation
- accessibility
- vision-language
- audio-processing
- table-understanding
language:
- en
- multilingual
pipeline_tag: any-to-any
---

# Universal-Multimodal-Agent (UMA)

## New: Multimodal Datasets Catalog (Phase 1 Data Collection)

We’re kicking off data collection for UMA. Below is a curated, growing catalog of widely used public multimodal datasets by category, with brief notes and links for immediate use.

### A. Text–Image

- LAION-5B — Massive web-scale image–text pairs; LAION-400M/1B subsets available. https://laion.ai/blog/laion-5b/
- COCO (2017 Captions) — Image captioning and detection; strong baselines. https://cocodataset.org/#home
- Visual Genome — Dense region descriptions, objects, attributes, relationships. https://visualgenome.org/
- Conceptual Captions (3M/12M) — Web image–alt-text pairs. https://ai.google.com/research/ConceptualCaptions/
- Flickr30k / Flickr8k — Classic captioning sets. https://hockenmaier.cs.illinois.edu/CS546-2014/data/flickr30k.html
- CC3M/CC12M — Common Crawl–derived image–text pairs. https://github.com/google-research-datasets/conceptual-captions
- SBU Captions — Image–text pairs from Flickr. http://www.cs.virginia.edu/~vicente/sbucaptions/
- TextCaps — OCR-centric captioning with text in images. https://textvqa.org/textcaps/
- VizWiz — Images taken by blind users; accessibility focus. https://vizwiz.org/
- WebLI (if accessible) — Large-scale multilingual image–text pairs. https://ai.google/discover/papers/webli/

### B. Text–Image Reasoning / VQA / Document QA

- VQAv2 — Visual Question Answering benchmark. https://visualqa.org/
- GQA — Compositional reasoning over scenes. https://cs.stanford.edu/people/dorarad/gqa/
- OK-VQA / A-OKVQA — Requires external knowledge. https://okvqa.allenai.org/
- ScienceQA — Multimodal science questions with diagrams. https://scienceqa.github.io/
- DocVQA / TextVQA — Reading text in images. https://textvqa.org/
- InfographicVQA — VQA on charts/infographics. https://www.microsoft.com/en-us/research/project/infographicvqa/
- ChartQA / PlotQA / Chart-to-Text — Chart understanding and reasoning. https://github.com/vis-nlp/ChartQA

### C. Text–Table (Structured Data)

- TabFact — Table fact verification from Wikipedia. https://tabfact.github.io/
- WikiTableQuestions — Semantic parsing over tables. https://ppasupat.github.io/WikiTableQuestions/
- ToTTo — Controlled table-to-text generation. https://github.com/google-research-datasets/ToTTo
- SQA — Sequential (multi-turn) QA over tables. https://allenai.org/data/sqa
- Spider — Text-to-SQL over multiple DBs (semi-structured). https://yale-lily.github.io/spider
- TURL — Table understanding pretraining. https://github.com/sunlab-osu/TURL
- OpenTabQA — Open-domain QA over tables. https://github.com/IBM/OpenTabQA
- MultiTab / TABBIE resources — Tabular reasoning. https://multitab-project.github.io/

### D. Text–Audio / Speech

- LibriSpeech — ASR with read English speech. https://www.openslr.org/12
- Common Voice — Multilingual crowdsourced speech. https://commonvoice.mozilla.org/
- Libri-Light — Large-scale unlabeled speech for self-supervised learning. https://github.com/facebookresearch/libri-light
- TED-LIUM / How2 — Talks with transcripts and multimodal context. https://lium.univ-lemans.fr/ted-lium/
- AudioSet — Weakly labeled ontology of sounds (with YouTube links). https://research.google.com/audioset/
- ESC-50 / UrbanSound8K — Environmental sound classification. https://github.com/karoldvl/ESC-50
- VoxCeleb — Speaker identification/verification. http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- SPGISpeech (if license allows) — Financial-domain ASR. https://datasets.kensho.com/s/sgpispeech

### E. Full Multimodal / Multi-domain (Text–Image–Table–Audio and more)

- MMMU — Massive Multi-discipline Multimodal Understanding benchmark. https://mmmu-benchmark.github.io/
- MMBench / MME / LVLM-eHub — Comprehensive LVLM evaluation suites. https://mmbench.opencompass.org.cn/
- EgoSchema / Ego4D (video+audio+text) — Egocentric multi-sensor datasets. https://ego4d-data.org/
- MultiModal C4 (MMC4) — Web-scale corpus of documents with interleaved images and text. https://github.com/allenai/mmc4
- WebQA / MultimodalQA — QA over web images and text. https://github.com/omni-us/research-multimodalqa
- Chart/Document suites: DocLayNet, PubLayNet, DocVQA series. https://github.com/ibm-aur-nlp/PubLayNet
- ArXivDoc / ChartX / SynthChart — Synthetic + real doc/chart sets. https://github.com/vis-nlp/ChartX

### F. Safety, Bias, and Accessibility-focused Sets

- Hateful Memes — Multimodal bias/toxicity benchmark. https://github.com/facebookresearch/mmf/tree/main/projects/hateful_memes
- ImageNet-A/O/R — Robustness variants. https://github.com/hendrycks/imagenet-r
- VizWiz (again) — Accessibility-oriented images/questions. https://vizwiz.org/
- MS MARCO (multimodal passages via docs) + OCR corpora — Retrieval grounding. https://microsoft.github.io/msmarco/

### G. Licensing and Usage Notes

- Always check each dataset’s license and terms of use; some require access requests or restrict commercial use.
- Maintain separate manifests with source, license, checksum, and intended use (see the sketch below). Prefer mirrored, deduplicated shards with exact provenance.
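To make the manifest guidance concrete, here is a minimal sketch of what one line of `datasets/manifest/*.jsonl` could record and how an entry might be checked before committing it. The field names (`name`, `source_url`, `license`, `sha256`, `intended_use`, `modalities`) and the helper functions are proposals for discussion, not a settled schema.

```python
# Proposed (not finalized) manifest format: one JSON object per line in
# datasets/manifest/*.jsonl, recording source, license, checksum, and
# intended use so provenance can be audited and shards deduplicated later.
import hashlib
import json
from pathlib import Path

REQUIRED_FIELDS = {"name", "source_url", "license", "sha256", "intended_use", "modalities"}


def sha256_of(path: Path) -> str:
    """Checksum a local shard so a mirrored copy can be verified byte-for-byte."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def append_entry(manifest_path: Path, entry: dict) -> None:
    """Validate the required fields, then append one JSONL record."""
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"manifest entry missing fields: {sorted(missing)}")
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Illustrative values only; fill sha256 via sha256_of() once a shard is mirrored.
    append_entry(
        Path("datasets/manifest/text_image.jsonl"),
        {
            "name": "coco-2017-captions",
            "source_url": "https://cocodataset.org/#home",
            "license": "check source terms before use",
            "sha256": "TODO",
            "intended_use": "captioning pretraining and evaluation",
            "modalities": ["image", "text"],
        },
    )
```

Keeping the license and checksum next to every source lets the Phase 1 downloaders and Phase 2 deduplication passes work from a single auditable record of what may be redistributed.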
---

## Call for Collaboration: Build UMA with Us

We’re assembling an open team. If you’re passionate about agentic multimodal AI, join us.

Roles we’re seeking (volunteer or sponsored collaborations):

- Research Scientists: Multimodal learning, alignment, grounding, evaluation.
- Research Engineers: Training pipelines, distributed systems, retrieval, tool-use interfaces.
- Data Scientists / Data Engineers: Dataset curation, cleaning, deduplication, data governance.
- Domain Experts: Finance, healthcare, education, accessibility, scientific communication.
- Accessibility Specialists: Inclusive design, alt-text/sonification, screen-reader workflows, disability advocacy.
- MLOps/Infra: Dataset storage, versioning, scalable training/eval infrastructure (HF Datasets, WebDataset, Parquet, Arrow).
- Community & Documentation: Tutorials, examples, benchmark harnesses, governance.

How to get involved now:

- Open a Discussion with your background and interests: https://huggingface.co/amalsp/Universal-Multimodal-Agent/discussions
- Propose datasets or contribute manifests via PRs (add to datasets/manifest/*.jsonl)
- Share domain-specific tasks and evaluation rubrics
- Star and watch the repo for updates

Initial roadmap for data:

- Phase 1: Curate public datasets and licenses; build manifests and downloaders
- Phase 2: Unified preprocessing (image, OCR, tables, audio), deduplication, quality filters
- Phase 3: Balanced training mixtures + eval suites (MMMU/MMBench/DocVQA/ASR)

Ethics & Safety:

- Respect dataset licenses, privacy, and consent. Implement filter lists and red-teaming sets.
- Document known biases and limitations; enable opt-out mechanisms where applicable.

Contributors will be acknowledged in the README and a future preprint.

## Original Project Overview

[Existing content retained below]