Vasu Sharma
Applied Research Scientist Lead at Meta AI | Building the Future of Multimodal Foundation Models
Leading groundbreaking research in multimodal AI, working on the Llama 3/4 and Chameleon models. Graduate degree in Multimodal Machine Learning from Carnegie Mellon University (Dept. Rank #1) and an undergraduate degree in Computer Science and Engineering from IIT Kanpur (JEE rank 165, GPA 9.9/10). Passionate about pushing the boundaries of AI.
Leading AI Research at Scale
I'm an Applied Research Scientist Lead at Meta AI, where I spearhead the development of multimodal foundation models including Chameleon, Llama 3/4, and DINOv2. My work spans trillion-token-scale pretraining, mixture-of-experts architectures, and next-generation conversational AI agents.
With a Master's from Carnegie Mellon (Department Rank #1) and experience across Meta, Amazon Alexa, and Citadel, I bridge cutting-edge research with real-world applications.
Multimodal Foundation Models
Large Language Models (LLMs)
Computer Vision & NLP
Speech & Audio Processing
Reinforcement Learning
AI Safety & Evaluation
Research & Industry Experience
Education
Key Research Internships
Research Impact & Publications
Advancing the frontiers of AI and machine learning through impactful research contributions
DINOv2: Learning Robust Visual Features without Supervision
A breakthrough self-supervised learning approach that learns powerful visual representations without requiring labeled data, achieving state-of-the-art performance across multiple computer vision tasks.
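As a rough illustration of the idea (not the DINOv2 training code), the sketch below trains a student network to match an exponential-moving-average teacher on two augmentations of the same unlabeled batch; the networks, dimensions, and hyperparameters are toy values invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only (not the DINOv2 implementation): a student network
# learns to match an EMA teacher's output on a different augmentation of the
# same unlabeled image, so no labels are required.

def toy_backbone(out_dim=128):
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.GELU(),
                         nn.Linear(256, out_dim))

student, teacher = toy_backbone(), toy_backbone()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def self_distillation_step(view_a, view_b, temp_s=0.1, temp_t=0.04, ema=0.996):
    with torch.no_grad():
        target = F.softmax(teacher(view_a) / temp_t, dim=-1)   # soft teacher targets
    log_pred = F.log_softmax(student(view_b) / temp_s, dim=-1)
    loss = -(target * log_pred).sum(dim=-1).mean()             # cross-view cross-entropy
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                                      # EMA update of the teacher
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
    return loss.item()

# Two random "augmented views" of a toy unlabeled batch.
images = torch.randn(8, 3, 32, 32)
loss = self_distillation_step(images + 0.1 * torch.randn_like(images),
                              images + 0.1 * torch.randn_like(images))
```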
MAViL: Masked Audio-Video Learners
Novel multi-modal learning framework that jointly processes audio and video through masked reconstruction, enabling robust cross-modal understanding.
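The sketch below illustrates the general masked audio-video idea rather than the MAViL implementation: a fraction of audio and video patch tokens is masked, the visible tokens are encoded jointly, and the model is trained to reconstruct the masked content. All module names and dimensions here are invented for the example.

```python
import torch
import torch.nn as nn

# Illustrative sketch only (not the MAViL code): joint masked reconstruction
# over concatenated audio and video patch tokens.

class ToyMaskedAVLearner(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.audio_proj = nn.Linear(128, dim)   # toy audio patch features
        self.video_proj = nn.Linear(768, dim)   # toy video patch features
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.audio_decoder = nn.Linear(dim, 128)
        self.video_decoder = nn.Linear(dim, 768)

    def forward(self, audio_patches, video_patches, mask_ratio=0.75):
        a = self.audio_proj(audio_patches)
        v = self.video_proj(video_patches)
        tokens = torch.cat([a, v], dim=1)
        # Randomly mask tokens by zeroing them (real MAE-style pipelines drop them).
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        visible = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
        encoded = self.encoder(visible)
        a_enc, v_enc = encoded.split([a.shape[1], v.shape[1]], dim=1)
        a_mask, v_mask = mask.split([a.shape[1], v.shape[1]], dim=1)
        # Reconstruction loss computed on masked positions only.
        loss = (
            ((self.audio_decoder(a_enc) - audio_patches) ** 2).mean(-1)[a_mask].mean()
            + ((self.video_decoder(v_enc) - video_patches) ** 2).mean(-1)[v_mask].mean()
        )
        return loss

model = ToyMaskedAVLearner()
loss = model(torch.randn(2, 32, 128), torch.randn(2, 64, 768))
```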
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Pioneering foundation model architecture that seamlessly integrates multiple modalities through early fusion, enabling unified reasoning across text, images, and code.
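The sketch below is a minimal illustration of the early-fusion idea, not the Chameleon architecture itself: images are assumed to be quantized into discrete codebook tokens, interleaved with text tokens in one sequence, and modeled autoregressively by a single transformer over a shared vocabulary. All vocabulary and model sizes are toy values.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: one token sequence, one embedding table, and one
# autoregressive transformer shared across text and image tokens.

TEXT_VOCAB = 32000        # toy sizes, not the real ones
IMAGE_CODEBOOK = 8192
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK

class ToyEarlyFusionLM(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)   # one table for both modalities
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(dim, VOCAB)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        # Causal mask so each position attends only to earlier tokens.
        n = token_ids.shape[1]
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.lm_head(h)            # next-token logits over the joint vocab

# Text token ids and (hypothetical) image codebook ids share one sequence.
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))   # offset into the image range
sequence = torch.cat([text_ids, image_ids], dim=1)
logits = ToyEarlyFusionLM()(sequence)
```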
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
Innovative approach to creating mixture-of-experts models by combining specialized language models, achieving superior performance with efficient parameter utilization.
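As a rough illustration (not the Branch-Train-MiX code), the sketch below treats the feed-forward blocks of several separately trained expert models as the experts of a single mixture-of-experts layer, with a small learned router selecting the top-k experts per token; every name and dimension here is hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: combine pre-trained expert FFNs behind a learned
# token-level router to form one mixture-of-experts layer.

class ToyMoEFromExperts(nn.Module):
    def __init__(self, expert_ffns, dim, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)   # e.g. math, code, general FFNs
        self.router = nn.Linear(dim, len(expert_ffns))
        self.top_k = top_k

    def forward(self, x):                           # x: (batch, seq, dim)
        scores = self.router(x)                     # (batch, seq, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Route each token through its selected experts and mix the outputs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                hit = chosen[..., slot] == e        # tokens assigned to expert e
                if hit.any():
                    out[hit] += weights[..., slot][hit].unsqueeze(-1) * expert(x[hit])
        return out

dim = 64
# Stand-ins for FFN blocks lifted from separately trained expert models.
experts = [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
           for _ in range(3)]
moe = ToyMoEFromExperts(experts, dim)
y = moe(torch.randn(2, 10, dim))
```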
Demystifying CLIP Data
Comprehensive analysis of CLIP training data, providing crucial insights into data quality, bias, and their impact on model performance and fairness.
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models
Thorough evaluation framework for vision-language models, revealing important limitations and proposing improvements for better multi-modal understanding.
Let's Build the Future of AI Together
Interested in collaborating on cutting-edge AI research? Looking for expertise in multimodal foundation models or large-scale AI systems? I'm always excited to discuss innovative projects and research opportunities.