@Scale: AI & DATA

June 25, 2025

Location: Santa Clara Convention Center
5001 Great America Pkwy, Santa Clara, CA 95054

Meta’s Engineering and Infrastructure teams are excited to bring together a global contingent of engineers who are interested in building, operating, and using AI and data systems at scale.

This year, we will focus on building a world in which Agents interact with billions of users, a critical step towards unlocking the full potential of AI and data systems. Our in-person talks and panels will delve into the latest advancements in agent development, deployment, and product integration, featuring expert insights on topics such as data for agents, agent tools & environments, safety, and privacy. Attendees can expect to gain practical knowledge and strategies for building AI-powered products, as well as a deeper understanding of the evolving ecosystem and its implications for traditional BI and product analytics.

In addition to our in-person talks and panels, our poster session will showcase a wide range of topics relevant to product analytics: exploring market opportunities, understanding trends, making better decisions, and ensuring the products and systems teams build run robustly and reliably at scale. The conference will foster open discussion and collaboration across the industry, and highlight open source solutions that others can leverage as a foundation to build on.

Register to join us in person and be entered for a chance to win a pair of Ray-Ban Meta Wayfarers!

RSVPS CLOSED

EVENT AGENDA

Event times below are displayed in PT.

Schedule

08:30 AM - 10:00 AM
Attendee Registration
08:30 AM - 10:00 AM
Breakfast, Raffle Submissions, Networking
09:55 AM - 10:00 AM
Event Welcome
Speaker Jelena Pješivac-Grbović, Meta
10:00 AM - 10:10 AM
Opening Remarks
Speaker Aparna Ramani, Meta
10:10 AM - 10:35 AM
Fireside Chat
Speaker Aparna Ramani, Meta
Speaker Ion Stoica, AnyScale
10:35 AM - 10:55 AM
Metamate: From Chatbot to Coworker

Speaker Hussein Mehanna, Meta
10:55 AM - 11:15 AM
Beyond RAG: Production-Ready AI Agents Powered by Enterprise-Scale Data

At Snowflake, we’ve been exploring how to use agents and AI to let any business user talk to their data. Enterprise data isn’t always tidy—and AI agents need more than great retrieval to drive real value. In this session, we’ll share what we’ve learned at Snowflake about enabling agents that deeply understand and reason over structured business data. We’ll cover challenges like navigating messy schemas, generating trustworthy SQL, ensuring consistency in definitions like “revenue,” and making the agent’s process visible to non-technical users. Whether you’re scaling agent use across departments or starting to integrate them into core business workflows, you’ll leave with strategies to make agents effective, reliable, and trusted partners in the enterprise.

Speaker Jeff Hollan, Snowflake
11:15 AM - 11:35 AM
Agentic Evals
Speaker Shishir Patil, Meta
11:35 AM - 11:55 AM
Silent Errors in Large-Scale LLM Training: Challenges and Lessons Learned

GPU cluster reliability is a growing challenge as AI models and the clusters that host them grow to unprecedented scale. Insidious errors such as Silent Data Corruptions (SDCs) are particularly difficult to address due to their highly elusive and non-deterministic nature, and their effect on large-scale LLM training and inference is poorly understood. In this talk, we will present how NVIDIA is leveraging its deep expertise in GPUs and AI to holistically tackle this challenge from silicon to data centers. We will go over the work we are doing to improve our understanding of these complex errors and their effect in real-world, at-scale AI cluster deployments, and the solutions we are developing to help researchers, cluster builders, and the industry protect against SDCs.

Speaker Cyril Meurillon, NVIDIA
Speaker Devin O’Kelly, NVIDIA
11:55 AM - 12:20 PM
Q&A Session
Moderator Jelena Pješivac-Grbović, Meta
Speaker Hussein Mehanna, Meta
Speaker Jeff Hollan, Snowflake
Speaker Shishir Patil, Meta
Speaker Cyril Meurillon, NVIDIA
Speaker Devin O’Kelly, NVIDIA
12:20 PM - 01:20 PM
Lunch & Poster Sessions
01:20 PM - 01:50 PM
Live Panel: GenAI Startups

We will explore the world of reinforcement learning, post-training, and what it’s like to build a startup on open models.

Moderator Joe Spisak, Meta
Panelist Horace He, Thinking Machines
Panelist Dhruv Batra, Yutori
Panelist Sal Candido, Evolutionary Scale AI
Panelist Eugen Hotaj, Perplexity
Panelist Carina Hong, Axiom
01:50 PM - 02:15 PM
How to Prepare Your Agents for the Ice(berg) Age

In a future world where agents interact with billions of users, many of these agents will also have to interact with data querying tools to provide answers grounded in facts. As enterprise data analytics is rapidly moving towards open table formats like Apache Iceberg, these agents need to be able to speak to Iceberg-based data. Once agents can speak Iceberg, they gain an advantage - they become portable. They can run in public clouds, locally on a laptop (during development), or on-premise, accessing enterprise data that can't be moved to public clouds. Portability is important because, while Nvidia GPUs dominate in the cloud, the GPU stack looks different on-premises and on consumer hardware. In this talk, we will discuss how Apache Iceberg tooling and portable application runtimes make agents grounded in facts and enable them to run across different GPU stacks and deployment models.
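
To make the portability concrete, here is a hedged sketch of an agent tool reading an Iceberg table through PyIceberg; the catalog name, table, and filter below are hypothetical, and the same code runs wherever an Iceberg catalog is reachable (cloud, laptop, or on-premise).

```python
# Sketch of an agent tool grounded in Iceberg data via PyIceberg.
# "default", "sales.orders", and the columns are invented for illustration.
from pyiceberg.catalog import load_catalog

def revenue_for_region(region: str) -> float:
    catalog = load_catalog("default")           # configured per environment
    table = catalog.load_table("sales.orders")  # any Iceberg catalog works
    scan = table.scan(
        row_filter=f"region == '{region}'",
        selected_fields=("region", "amount"),
    )
    return scan.to_pandas()["amount"].sum()     # grounded, queryable facts
```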

Speaker Serhii Sokolenko, Tower.dev
02:15 PM - 02:40 PM
Agentic Solution for Data Warehouse Access

Meta manages a large-scale data warehouse where security is a critical component. Every day, teams across Meta are tasked with managing access to the data they oversee and obtaining access to data through internal data products. In this talk, we delve into the challenges of managing internal data access at Meta's scale and its growing complexity. We will also share how we developed an agentic solution to empower both data users and data owners in addressing these challenges.

Speaker Can Lin, Meta
Speaker Uday Ramesh Savagaonkar, Meta
02:40 PM - 03:05 PM
Break & Poster Sessions
03:05 PM - 03:30 PM
Agentic Observability - Making LLM Apps Debuggable, Trustworthy, and Scalable

As LLM applications evolve into multi-agent systems and power complex decision-making workflows, the ability to observe and debug their behavior becomes a core engineering challenge. These systems are dynamic, non-deterministic, and increasingly reliant on external tools and APIs, making traditional monitoring approaches insufficient. At Fiddler, we've worked with enterprise and federal teams deploying LLMs at scale, and what we’ve consistently seen is that the absence of effective observability creates blind spots that delay iteration and introduce risk. In this talk, we will introduce Agentic Observability, a set of techniques and infrastructure to monitor production LLM systems. We will walk through how we trace agent reasoning and tool usage in structured form, apply Fast Trust Models to evaluate output quality beyond token-level accuracy, and monitor shifts in behavior using statistical and embedding-based methods. We will also share how we enable integration testing for agent workflows by simulating decision paths and validating semantic intent, all while operating under the scale and latency constraints of modern AI stacks. This work bridges AI science, platform engineering, and real-world GenAI deployment. We will highlight engineering lessons learned from high-scale environments, and how these observability tools are helping teams move faster, catch failures earlier, and build AI systems that can be trusted in production.

Speaker Krishna Gade, Fiddler
03:30 PM - 03:55 PM
Bringing AI Into the Real World

Join us as we delve into the high-level design and architecture that enabled the creation of Ray-Ban Meta, a best-in-class AI wearable used by millions. To make AI truly useful, it must be seamlessly integrated into our daily lives, providing reliable and high-performance capabilities. However, achieving this requires overcoming real-world physical limitations through innovative engineering and model design.

In this talk, we'll explore how a user-centric approach drove the development of a complex architecture that harmoniously brings together multiple components to meet the unique needs of our users. From running models directly on the frames to optimizations informed by user behavior, we'll break down the key elements that have made Ray-Ban Meta a game-changer in the world of AI wearables.

Speaker Alexandru Petrescu, Meta
Related reading: “Advancing AI Wearables: The Technical Journey of Ray-Ban Meta”
03:55 PM - 04:25 PM
Live Panel: Infrastructure in an Agentic World

If the future is agentic, what does this mean for Infrastructure?

Moderator Karthik Lakshminarayanan, Meta
Panelist Barak Yagour, Meta
Panelist Anna Berenberg, Google
Panelist Barr Moses, Monte Carlo
Panelist Qi Ke, Microsoft
04:25 PM - 04:30 PM
Closing Remarks
Speaker Jelena Pješivac-Grbović, Meta
04:30 PM - 06:00 PM
Happy Hour & Poster Sessions

POSTER SESSIONS

Accelerate Training Data Consumption at Meta Through E2E-Optimized Log Ingestion

Authors: Peng Li, Karthik Vijayakumar, Todd Porter, Agustin Vergottini, Saurav Sen, Meta

Meta’s Data Ingestion service feeds tens of TB/s of log data into the vast majority of RecSys systems at Meta. Traditionally, the ingestion service treated log data as a black box. Over the last couple of years, we have been optimizing the log ingestion path end-to-end together with its upstream and downstream systems. Examples include in-line training data clustering, propagating large columnar row batches end-to-end, and efficient data partitioning. As a result, training data now has an up to ~40% higher storage compression ratio and uplifted training data consumption QPS, while ingestion has become cheaper and more reliable. This is an example of how Meta addresses the challenge of explosive training data ingestion needs through holistic optimization.

AI Data Storage for Exabyte-Scale Training of Recommendation Models

Authors: Weiran Liu, Vivek Vaidya, Manju Anand, Harsha Rastogi, Shubham Ajmera, Wenqi Wu, Zhenyuan Zhao, Sarang Masti, Murali Vilayannur, Alex Zhavnerchik, Lucas Vasconcelos Santana, Meta

AI Dev Productivity: Enabling Federated GPU Clusters Transparently for Researchers Through Global Scheduling

Author: Sachin Kumar Lakharia, NVIDIA

GPU capacity is often spread across multiple regions and CSPs, typically resulting in the creation of separate Slurm clusters for each regional pool. This leads to resource fragmentation, demand imbalances, and inefficiencies, causing both resource waste and user frustration. In this session, we’ll share how we leverage Slurm for global scheduling across diverse regional capacity pools, addressing challenges like heterogeneous hardware, data locality, user code, and checkpoints—all while maintaining reliability at scale.
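
A toy sketch of the routing decision at the heart of such a global scheduler: score each regional cluster by data locality and backlog, then submit the job to the best match. This is illustrative only, not NVIDIA's implementation; the cluster model and scoring rule are invented.

```python
# Hypothetical global-scheduling heuristic across regional GPU clusters.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    gpu_type: str           # e.g. "H100"
    free_gpus: int
    queued_gpu_demand: int  # GPUs requested by pending jobs
    datasets: set           # datasets already staged in this region

def pick_cluster(clusters, gpus_needed, gpu_type, dataset):
    def score(c):
        locality_penalty = 0 if dataset in c.datasets else 1_000
        backlog = max(0, gpus_needed - c.free_gpus) + c.queued_gpu_demand
        return locality_penalty + backlog
    eligible = [c for c in clusters if c.gpu_type == gpu_type]
    return min(eligible, key=score) if eligible else None

clusters = [
    Cluster("us-east", "H100", 512, 2048, {"webtext-v2"}),
    Cluster("eu-west", "H100", 1024, 128, {"code-corpus"}),
]
print(pick_cluster(clusters, 256, "H100", "code-corpus").name)  # eu-west
```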

AI for Error Mitigation: A Case Study

Authors: Vidya Rajaraman, Jim Beveridge, Joost de Nijs, Penni Johnson, Victor Voronov, Meta

Case Study: An AI-Powered Error Assistant

A pilot project was integrated into Meta's internal distributed scheduler, leveraging Large Language Models (LLMs) to assist users in understanding error messages.

Goal

The primary objective is to alleviate the burden on on-call engineers by clarifying errors and directing users to the relevant forums for support.

Solution

A custom AI-backed UI component, coupled with curated and targeted prompting on the back end, performs two key tasks:

  • Provides a basic explanation of the error message.
  • Directs users to a support group for follow-up assistance. Note that this AI solution aims to guide users to technical support as needed, rather than offering a direct solution.

Without this AI-first approach, users were unsure about which support group to contact, leading to significant delays in problem resolution as they were passed from one group to another.
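
A minimal sketch of the two tasks, assuming a generic chat-completion function llm_complete(prompt) (hypothetical) and a hand-curated mapping from error categories to support groups; the real system relies on curated, targeted prompting on the back end.

```python
# Hypothetical sketch: explain an error, then route to a support group.
SUPPORT_GROUPS = {
    "permissions": "Data Access Support",
    "quota": "Capacity & Quota Support",
    "scheduler": "Scheduler Oncall",
}

PROMPT = """You are an assistant embedded in a job scheduler UI.
Explain the following error in two sentences of plain language, then
classify it as one of: permissions, quota, scheduler.

Error: {error}
Answer as: EXPLANATION: ... CATEGORY: ..."""

def assist(error_message: str, llm_complete) -> dict:
    reply = llm_complete(PROMPT.format(error=error_message))
    explanation, _, category = reply.partition("CATEGORY:")
    return {
        "explanation": explanation.replace("EXPLANATION:", "").strip(),
        # Guide users to the right group rather than attempt a direct fix.
        "support_group": SUPPORT_GROUPS.get(category.strip().lower(),
                                            "General Infra Support"),
    }
```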

Results

Our internal scheduler, a distributed cron worker pool, manages job clusters at an impressive scale, handling millions of jobs per day. In this context, the in-UI solution has attracted significant user traffic, averaging 357 Weekly Active Users (WAU) and 1400 Monthly Active Users (MAU). Our primary user base consists of data and software engineers.

Future Developments

The team is iterating on this solution to incorporate recent AI features, address user feedback, and develop a more Retrieval-Augmented Generation (RAG)-based, potentially agentic solution in the next phase.

AI Strategy for Models: Measurement and Governance

Authors: Sunny Cui, Albert Huang (DE), Vivian Shen (DS), Jae Cho, Victoria Yang, Zhe Wang (SWEs), Meta

This initiative aims to enhance the performance of AI models by standardizing metrics and improving governance. It combines efforts to create a centralized data foundation for models, standardizing sampling strategies, and streamlining precision/recall metrics measurement for better decision-making. Additionally, it focuses on model governance by deprecating outdated models and increasing the refreshing cadence of existing ones, ensuring compliance with governance standards and optimal model performance. By integrating these efforts, the project provides a comprehensive view of model metrics, enhancing performance and streamlining evaluation and decision-making processes for more effective AI applications.

Architecting AI SaaS at Scale with TiDB: Lessons from 500,000 Databases

Author: Xin Shi, PingCAP

As AI becomes table stakes in modern SaaS, many teams rush to isolate tenants with one-database-per-customer architectures — only to face an explosion of operational complexity. What happens when that number hits 500,000?

In this session, we’ll share how TiDB, an open-source, MySQL-compatible distributed SQL database, helped a hyper-growth AI SaaS platform scale beyond 500K databases—without losing control. You’ll hear the key architectural pivots that reduced complexity, improved performance, and enabled a sustainable path to growth.

You’ll learn how to:

  • Scale multi-tenant architectures without sacrificing isolation or performance
  • Reduce operational overhead with smarter data layer design
  • Enable usage-based pricing models by rethinking infrastructure boundaries

Whether you're launching your first AI feature or supporting millions of users, this talk offers proven strategies to stay ahead—without the infrastructure debt.

At Facebook, Your Feedback Matters! How LLMs Help Meta Understand and Address User Pain Points

Author: Bee Padalkar, Meta

At FB, we value user feedback as a crucial component of our product development. User feedback typically comes in unstructured formats, such as free-text descriptions, making it challenging to process and analyze at scale using traditional methods. By harnessing the power of LLMs together with open source tooling, we have significantly improved our ability to measure and interpret feedback, enabling us to identify system issues more effectively and implement impactful product changes.

In this poster, we’ll dive deeper into: 1) how LLMs can parse unstructured user feedback and extract sentiment; 2) how we used open source tech to build at scale; 3) how we influence cross-team collaboration across areas to drive product impact; and 4) our learnings from building with AI in a privacy-safe environment!

Note: While this poster will detail Facebook’s approach specifically, many other areas within Meta are adopting a similar approach.

Building a Computation Engine for an Agentic Multimodal Database

Authors: Paritosh Aggarwal & Nathan Weigand, Snowflake

Snowflake Cortex AI / AI SQL builds the next generation of data warehouses, leveraging the strengths of generative AI to process unstructured data together with structured data. We extend the familiar concepts of SQL to bring bulk processing of such data directly into the data warehouse.

In this poster, we will share our use of multi-model approaches and intelligent query planning to achieve speedups of up to 100x compared to naive approaches, while keeping quality the same. We will also show how Snowflake thinks about SQL and extending abstractions to support multimodal data.

Building Multimodal AI Agents for Data-Heavy, Real-World Applications

Author: Miles Gordenker, Datagrid

As AI agents move from copilots to autonomous systems, one of the most unsolved challenges is enabling them to deeply understand and act on complex, multimodal, and context-rich data environments.

This talk will explore how to build AI agents capable of reasoning over massive, heterogeneous data — spanning documents, databases, spreadsheets, diagrams, images, and unstructured files — while grounding their outputs in reliable context.

We’ll share our experience designing multimodal research agents that serve industries where precision, reliability, and scale are critical: from finance to heavy operations to the built world. These agents go beyond chat—they retrieve, synthesize, reason, and execute tasks across diverse data types, helping teams answer complex questions, automate workflows, and drive decision-making with real-time context.

Collective Wisdom of Models: Advanced Feature Importance Techniques

Author: Assaf Cohen, Meta

How can we leverage multiple models' feature importance scores to create an aggregated importance score across features and models for better feature exploration and selection? At Meta's scale, finding the right features to use in models (including agents) is a complex task. Our approach aggregates feature importance reports from multiple models to make feature exploration more data-driven and to provide an additional signal for the feature selection process.

This poster is based on a recent post we published on Meta's Medium blog: https://medium.com/@AnalyticsAtMeta/collective-wisdom-of-models-advanced-feature-importance-techniques-at-meta-1a7a8d2f9e27
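
For intuition, here is a toy sketch of the aggregation idea (see the linked post for the actual method). Because each model reports importances on its own scale, one reasonable approach is to rank-normalize per model before averaging; all names below are hypothetical.

```python
# Toy rank-normalized aggregation of feature importances across models.
def aggregate_importance(per_model: dict) -> dict:
    features = {f for scores in per_model.values() for f in scores}
    agg = {f: 0.0 for f in features}
    for scores in per_model.values():
        # Rank within this model: 1.0 = most important, 0.0 = least/absent.
        ranked = sorted(scores, key=scores.get, reverse=True)
        for i, f in enumerate(ranked):
            agg[f] += 1.0 - i / max(1, len(ranked) - 1)
    return {f: total / len(per_model) for f, total in agg.items()}

scores = aggregate_importance({
    "ctr_model":  {"age": 0.9, "country": 0.4, "device": 0.1},
    "conv_model": {"country": 0.7, "device": 0.5},
})
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```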

Embedding Offloading using SSD Storage for Large Model Training

Authors: Intaik Park, Joe Wang, David Lai, Raahul Kalyaan Jakka, Meta

Meta has been at the forefront of developing cutting-edge techniques for recommendation systems, which has led to a significant increase in the size of embedding tables. The significant growth in these tables has resulted in storage bottlenecks during GPU training, as the High-Bandwidth Memory (HBM) on GPUs is insufficient to accommodate such large tensors. To address these challenges, a new approach was developed that offloads data to SSD storage, retrieving the necessary tensors at runtime while hiding slow access latency with a hierarchical cache system. This approach enables a scalable infrastructure, allowing larger models to be trained without additional GPU hardware.

SSD presents different storage characteristics and constraints compared to HBM or DRAM, which requires different approaches and innovations. A new system design to incorporate SSD in model training is presented along with challenges and solutions.
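
A minimal sketch of the hierarchical lookup described above (HBM, then DRAM, then SSD) with LRU spill between tiers; the tier capacities and the dict-like SSD backing store are invented for illustration.

```python
# Toy tiered cache: hot rows in "HBM", warm rows in "DRAM", cold rows on SSD.
from collections import OrderedDict

class TieredEmbeddingCache:
    def __init__(self, ssd, hbm_cap=2, dram_cap=4):
        self.ssd = ssd               # dict-like backing store on SSD
        self.hbm = OrderedDict()     # smallest, fastest tier
        self.dram = OrderedDict()    # middle tier hides SSD latency
        self.hbm_cap, self.dram_cap = hbm_cap, dram_cap

    def get(self, row_id):
        if row_id in self.hbm:
            self.hbm.move_to_end(row_id)      # refresh LRU position
            return self.hbm[row_id]
        value = self.dram.pop(row_id, None)
        if value is None:
            value = self.ssd[row_id]          # slow path: SSD read
        self._promote(row_id, value)
        return value

    def _promote(self, row_id, value):
        self.hbm[row_id] = value
        if len(self.hbm) > self.hbm_cap:      # spill coldest row to DRAM
            old_id, old_val = self.hbm.popitem(last=False)
            self.dram[old_id] = old_val
            if len(self.dram) > self.dram_cap:
                evict_id, evict_val = self.dram.popitem(last=False)
                self.ssd[evict_id] = evict_val  # write back to SSD
```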

Faiss & Vector Search at Meta

Authors: Junjie Qi, Ramil Bakhshyiev, Matthijs Douze, Gergely Szilvasy, Honghao Qiu, Vishal Gandhi, Meta

In this poster we describe recent accomplishments and improvements to Faiss (Foundational AI Similarity Search), an industry-leading open-source vector search and clustering library developed by Meta over the past decade. The library has recently been enhanced through a collaboration with NVIDIA to integrate state-of-the-art GPU-accelerated algorithms from the cuVS library, resulting in significant performance improvements, including up to 12.3x faster index build times and 8.1x faster search latency, as demonstrated by benchmarks on large-scale datasets.
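
For readers new to the library, here is a small self-contained example of the IVF index family that benchmarks like these exercise (CPU API shown; the cuVS-accelerated GPU path uses the same index abstractions). The data is random and purely illustrative.

```python
import faiss
import numpy as np

d, nb, nq = 128, 100_000, 5
rng = np.random.default_rng(0)
xb = rng.standard_normal((nb, d)).astype("float32")  # database vectors
xq = rng.standard_normal((nq, d)).astype("float32")  # query vectors

# IVF index: cluster the database into nlist cells, search nprobe of them.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)
index.train(xb)    # k-means over the database builds the coarse quantizer
index.add(xb)
index.nprobe = 16  # probe more cells to trade speed for accuracy

distances, ids = index.search(xq, 10)
print(ids[0])      # ids of the 10 nearest database vectors to query 0
```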

RaBitQ, now supported in Faiss, offers a novel binary quantization approach that enhances vector search by performing scalar quantization on transformed query vectors, achieving faster search speeds than traditional asymmetric binary quantization methods, with experimental results showing it performs as well or better than certain configurations of OPQ, PQFS, and LSQ for speed versus accuracy trade-offs, although current implementations lack SIMD optimizations which could further enhance performance.

The Offline Vector KNN Search workflow achieved a significant milestone by successfully processing 1 trillion embeddings using synthetic vectors, leveraging the power of Meta's internal infrastructure and Faiss library to demonstrate the feasibility of large-scale vector search, while also highlighting opportunities for future enhancements in approximate search, multi-GPU processing, and optimized indexing techniques.

A prototype for distributed clustering on GPUs achieved over a 10x reduction in latency compared to CPUs, training 2 million centroids on a 22 million dataset with 768 dimensions in just 1 hour using 16 GPUs, compared to 11 hours on 71 high memory CPUs, while maintaining near-identical data quality across cluster size and embedding-centroid distance metrics.

Growth Analytics Agent

Authors: Chirag Parmar, Joe Kumar, Sowmyasri Muthupandi, Lu Xu, Meta

Understanding product growth is paramount, yet deriving key metrics like Daily/Monthly Active Users (DAU/MAU), cohort retention, and user stickiness often requires complex data manipulation. This poster presents an innovative agent leveraging Large Language Models (LLMs) to answer natural language questions specifically focused on product growth analytics. Users can ask, "What were our DAU and MAU last month?", "Calculate 7-day retention for users who signed up in January", or "What is our current DAU/MAU stickiness ratio?". The agent intelligently interfaces with product analytics databases, particularly utilizing a datelist dataset pattern for efficient querying and interpretation of time-series user activity. It translates natural language into complex queries for these growth metrics, synthesizing data into understandable insights, charts, and summaries. This approach aims to empower product teams, designers, and researchers to self-serve their data needs, accelerate the discovery of critical product insights, and foster a truly data-informed product development culture.
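
To make the datelist pattern concrete, here is a toy sketch of how DAU, MAU, and stickiness fall out of a per-user active-date structure without scanning raw event logs; the schema and numbers are hypothetical.

```python
# Datelist pattern: one row per user holding that user's active dates.
import datetime as dt

datelist = {  # user_id -> set of active dates (toy data)
    "u1": {dt.date(2025, 6, d) for d in (1, 2, 3, 24, 25)},
    "u2": {dt.date(2025, 6, 25)},
    "u3": {dt.date(2025, 5, 30)},
}

def dau(day):
    return sum(day in dates for dates in datelist.values())

def mau(day, window=28):
    start = day - dt.timedelta(days=window - 1)
    return sum(any(start <= d <= day for d in dates)
               for dates in datelist.values())

day = dt.date(2025, 6, 25)
print(dau(day), mau(day), round(dau(day) / mau(day), 2))  # 2 3 0.67
```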

Heterogeneous Training Numerics in Semi-Synchronous Training

Authors: Shangfu Peng, Jiaming Cui, Chuanhao Zhuge, Xiaodong Wang, Meta

As Meta prepares for the future of cross-region heterogeneous hardware training, semi-synchronous training emerges as a promising paradigm. Our experiments reveal that synchronous SGD Nesterov (DiLoCo with local_worker M=1) outperforms the current synchronous AdamW baseline. Furthermore, semi-synchronous training with M=2/4 achieves results comparable to synchronous SGD Nesterov while enabling cross-region scalability.

In heterogeneous hardware environments, imbalanced performance between workers necessitates workload skewing. Our findings indicate that increasing workload skew—through batch size (BS), data parallel groups (DP), or local steps (H)—not only maintains but can enhance training quality, aligning closely with synchronous SGD Nesterov outcomes.

This study simulates real-world scenarios where local workers differ in workloads, memory, and network capabilities, reflecting the anticipated heterogeneous hardware landscape. Our research, supported by external studies, confirms that synchronous SGD Nesterov outperforms traditional synchronous training (AdamW). Scaling law experiments with semi-synchronous training (M=2/4) show evaluation metrics comparable to M=1 results.

We explored heterogeneous training numerics by adjusting workloads per local worker using three adaptive strategies:

  • Heterogeneous Batch Size: Configuring batch sizes (BS) for each data parallel group per worker.
  • Asymmetric Hierarchical HSDP: Configuring the number of data parallel groups (DP) per worker.
  • Dynamic Local Steps: Configuring the number of local steps (H) per worker.

Our experiments demonstrate that increased workload skew improves the performance of the loss curve, effectively moving semi-synchronous training towards synchronous SGD Nesterov. By employing these adaptive strategies, training behavior and loss curves align with synchronous SGD Nesterov settings. As workload skews to the extreme, where all workload is assigned to one local worker, training behavior transitions as follows:

  • Adjusting workload via BS moves training towards synchronous SGD Nesterov with setting (1, BS*M, DP, H).
  • Adjusting workload via DP moves training towards synchronous SGD Nesterov with setting (1, BS, DP*M, H).
  • Adjusting workload via H moves training towards synchronous SGD Nesterov with setting (1, BS, DP, H*M).

This research underscores the potential of semi-synchronous training to effectively harness heterogeneous hardware, paving the way for scalable and efficient cross-region training solutions.
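
For intuition, here is a single-process toy of the semi-synchronous loop described above: M local workers each take H inner AdamW steps, after which one outer Nesterov SGD step is applied to the averaged parameter delta. The model, data, and hyperparameters are stand-ins, not Meta's training configuration.

```python
import copy
import torch

global_model = torch.nn.Linear(16, 1)
outer_opt = torch.optim.SGD(global_model.parameters(),
                            lr=0.7, momentum=0.9, nesterov=True)
M, H = 4, 8  # local workers, inner steps per round

for _round in range(10):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for _ in range(M):  # in practice these run in parallel, one per worker
        worker = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(worker.parameters(), lr=1e-3)
        for _ in range(H):
            x = torch.randn(32, 16)
            loss = ((worker(x) - x.sum(dim=1, keepdim=True)) ** 2).mean()
            inner_opt.zero_grad(); loss.backward(); inner_opt.step()
        for d, pg, pw in zip(deltas, global_model.parameters(),
                             worker.parameters()):
            d += (pg.detach() - pw.detach()) / M  # average outer "gradient"
    # Use the averaged delta as the gradient for the outer Nesterov step.
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
    outer_opt.zero_grad()
```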

Job-o-scope: Detecting Stragglers Significantly Faster in Large Scale Training

Authors: Anubhav Chaturvedi, Charles Yoon, Harsha Bommireddy, Jayesh Seshadri, Michael Au-Yeung, Shyam Sundar Chandrasekaran, Uttam Thakore, Vladimir Ivanov and Ya Liu

Identifying slow ranks (“stragglers”) is a multi-million-dollar problem for Meta today. When training machine-learning models at any scale, slowness of just a few ranks in reaching a common, required state can cause the whole cluster to wait for those ranks and pause training progress, especially on Meta’s large-scale jobs running on over 16,000 ranks. This leads to significant losses from operational expenses and lost productivity. Moreover, the probability of such an event increases exponentially with the number of ranks in a training job. For instance, an interruption or slowdown of 1 hour on our largest clusters leads to $25-50K in operational power loss from compute spent powering machines that were waiting to be debugged, and often requires thousands of dollars more in human resources to debug each instance of these problems.

Here, we propose Job-o-scope, a novel approach for enabling highly visual and interactive monitoring and debugging capabilities for large-scale training jobs. It scales to 1M GPUs, provides millisecond-level time resolution, operates in real time, is always-on yet imposes zero overhead on training performance, and has a low implementation and maintenance cost due to its technical simplicity. Job-o-scope aims to be not just occasionally useful, but a one-stop-shop pre-triaging solution, substantially simplifying a wide range of scenarios, including health monitoring, isolating performance issues, finding systemic synchronization skews and outliers, and providing visual imagery for dashboards and data for alerts.

Further, to improve the automation and detection of stragglers, we propose a data storage engine and associated straggler-detection algorithms that can significantly reduce the reaction time and cost impact of stragglers. Debugging these issues by hand today on a per-issue basis can be challenging due to variations in capability and familiarity with the tooling, and this helps address that.
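
One simple detection rule in the spirit of the approach above: compare per-rank arrival times at a common synchronization point and flag robust outliers. The threshold and data below are illustrative, not the poster's algorithm.

```python
import numpy as np

def find_stragglers(arrival_ms: np.ndarray, k: float = 6.0) -> np.ndarray:
    """arrival_ms[i] = time at which rank i reached the collective."""
    med = np.median(arrival_ms)
    mad = np.median(np.abs(arrival_ms - med)) or 1.0  # avoid a zero MAD
    return np.flatnonzero(arrival_ms > med + k * mad)

ts = np.full(16_384, 1000.0)   # 16K ranks arrive together...
ts[[17, 9001]] += 250.0        # ...except two that are 250 ms late
print(find_stragglers(ts))     # [  17 9001]
```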

Meta Log Collection @Scale: System for handling logs from billions of users

Authors: Satish Asok, Tej Singh, Kedarnath Kurnool Gandla, and Nicholas Chadwick, Meta

At Meta, we tackle the immense challenge of collecting, transporting, and processing log data at an unprecedented scale. By leveraging cutting-edge technologies and innovative design principles, we’ve built a highly scalable system capable of supporting the vast volume of data generated by our global user base. We also employ real-time streaming technologies to ingest critical machine learning logs, enabling us to use fresh data to power ranking systems and deliver personalized experiences to billions of users. Join us as we explore the technical details and architectural decisions behind our logging infrastructure, and discover how we're pushing the boundaries of what's possible in large-scale log collection.

Nimble: A Flexible File Format for Training Data Efficiency

Authors: Huameng Jiang, Yoav Helfman, Serge Druzkin, Meta

We have been evolving the file format for ML data over the past few years to accommodate the exponential growth of the ML training footprint at Meta. The rapid growth in scale of both training data and training complexity posed significant challenges for training and storage capacity. As we made several innovations in the current warehouse columnar file format to cater to the new workload shape and resource bottlenecks, we discovered major pain points in metadata handling and extensibility. Enter Nimble, a file format designed with ML data characteristics in mind that supports much faster extension and iteration.

Nimble is optimized for ML oriented feature map layouts (flatmaps). The layout was critical for IO, CPU and memory efficiency when multiple training workloads share the same source of truth. However, the layout also requires several orders of magnitude more metadata, and introduces tricky worst case tradeoffs downstream. Nimble incorporates the ML feature map layout as a first class concept and solves those previous pain points from the design. We are continuing to optimize for the ML feature map layouts in Nimble by leveraging more parallelism, laziness and customizability.

Nimble is a file format with the core philosophy of hyper customization in mind. Nimble hosts a repertoire of encodings and an encoding selection framework (adhoc and history based) to truly offer the optimal efficiency for datasets with different downstream bottlenecks. Nimble also allowed easy extension for its encodings and on disk layouts, enabling workload specific optimizations from product group customers. The level of customization allowed Nimble to bring 15% efficiency wins for generic ML workloads, and up to 45% with workload specific optimizations.

Nimble is en route to being the default file format for the entire data warehouse, and an important part of the OSS Velox ecosystem.

Optimizing Data Loading for Efficient AI Model Training

Authors: Moto Hira, Abhinandan Krishnan, Francisc Bungiu, Olga Gerasimova, Victor Bourgin, Yuta Inoue

Inefficient GPU utilization during AI model training poses significant challenges, hindering innovation and increasing costs. A prevalent issue is the inefficiency in data loading, which often leaves GPUs underutilized. As GPUs become faster, overcoming the data loading bottleneck becomes increasingly difficult.

To tackle this challenge, we have developed a novel data loading framework featuring an efficient data execution engine, along with flexible multi-threading and multi-processing capabilities. Our framework also provides real-time statistics for each stage of the data loading pipeline, enabling observability of data loading and iterative optimizations by helping identify bottlenecks in the pipeline.

We demonstrate the effectiveness of our framework through case studies that optimize production pipelines at Meta Reality Labs, underscoring its potential to enhance training efficiency and accelerate innovation. Our approach delivers substantial benefits compared to existing data loading solutions, including improvements in training speed ranging from 30% to 300%, and optimized pipelines achieving up to 90% SM utilization. The framework offers real-time insights that facilitate iterative optimization and allows for the flexible composition of multi-threading and multi-processing strategies.

Our efficient data loading framework empowers machine learning practitioners to identify bottlenecks and make informed, data-driven decisions to improve their data loading pipeline, resulting in significant improvements in training speed and GPU utilization for ML training workflows.
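
A minimal sketch (not the framework's API) of two of the ideas above: pipeline stages executed by a thread pool, with per-stage timing stats that expose which stage is the bottleneck. The stage functions are trivial stand-ins.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

STATS = defaultdict(list)  # stage name -> per-item latencies (seconds)

def timed(stage, fn):
    def wrapper(item):
        t0 = time.perf_counter()
        out = fn(item)
        STATS[stage].append(time.perf_counter() - t0)
        return out
    return wrapper

decode = timed("decode", lambda raw: raw * 2)  # stand-in for decoding
augment = timed("augment", lambda x: x + 1)    # stand-in for transforms

with ThreadPoolExecutor(max_workers=8) as pool:
    batch = list(pool.map(augment, pool.map(decode, range(1024))))

for stage, times in STATS.items():  # which stage should we optimize first?
    print(f"{stage}: {sum(times) * 1e3:.1f} ms total over {len(times)} items")
```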

Realtime Graph Algorithm Computation @ Meta Scale

Authors: Jie Tian, Ming Chen, Jingfang Liu, Delin Sun, Enis Soztutar, Meta

Realtime graph algorithms at scale require specialized data infrastructure to handle on the order of a billion nodes and hundreds of billions of edges. We present learnings from a joint collaboration between the Monetization and Data Infrastructure orgs at Meta, where we built a real-time platform for efficient computation of complex graph algorithms like Personalized PageRank at Meta scale.

Scaling Business AI Agents: Evaluating Data Quality and Gaps

Authors: Claire Sun, Javid Jafarov, Rohit Babbar, Tingting Zhao, Meta

The quality of backend data directly influences RAG-based Business AI Agent’s ability to respond accurately and effectively to user inquiries, which impacts user satisfaction and trust.

Assessing the quality of textual information consistently across businesses and identifying data gaps is challenging due to complex business contexts, diverse data formats (structured or unstructured) and storage (across multiple sources).

Data quality is defined through two primary dimensions: sufficiency, which refers to comprehensiveness and detail of available information, and accuracy, which refers to precisely reflecting the most up-to-date sources of truth.

This poster introduces data entities and LLM judges as a solution to this challenge by using the presence of data entities to assess sufficiency, enabling consistent measurement and identification of data gaps across diverse business contexts.

A data entity is a self-contained data point serving business needs (e.g., product features, shipping policies) and contributes to a comprehensive knowledge graph for answering customer inquiries.

Scalable Data Modeling Patterns for GenAI Training Data

Author: Sky Yin, Meta

Sharing our experience tackling the scalability challenges of preparing internet-scale data for Llama pre-training. Above the 100-billion-row level, many common practices for tuning our compute engines no longer work. Instead, we use a series of data modeling techniques (sharding, indexing, sampling, compaction, etc.) to scale.

SEVmate - Rooting out the cause: An AI approach to SEV investigations

Authors: Narayanan Sankaran, Henry Bond, Rahul Kindi, Edward Yao, Meta

SEVmate automatically finds the code change responsible for causing a SEV. This product aims to fundamentally reshape how software incidents are investigated at Meta, drastically reducing revenue loss and improving developer productivity.

In this talk we will explore how we solve this root-cause-analysis problem using a multi-stage ranking pipeline that leverages a combination of retrieval techniques, SNNs, embeddings, and LLMs to reason about the connections between code changes and SEVs.

We will discuss how we analyze changes at Meta scale, and how we incorporate auxiliary signals from our knowledge graph, such as observability data (metrics, events, logs, and traces), employee investigation activity, and our extensive code and service graph.

Table Compare: Safeguarding Data Integrity at Meta

Authors: Asaf Avigal, Yoel Gottlieb, Meta

In the dynamic environment of Meta, where billions of data points influence every decision, ensuring data integrity is paramount. The Table Compare tool is a pivotal innovation designed to address the challenges of maintaining data reliability amidst thousands of daily code changes. This tool automates and standardizes the validation process during code reviews, ensuring that even minor discrepancies in data are identified before deployment. By comparing test tables against production tables, Table Compare detects discrepancies in row populations, column values, and structural changes, providing a comprehensive report that engineers can use to ensure data accuracy.
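
An illustrative sketch of that comparison (not Meta's implementation), using pandas DataFrames as stand-ins for warehouse tables: it diffs row populations, column sets, and per-column values keyed on a primary key.

```python
import pandas as pd

def table_compare(prod: pd.DataFrame, test: pd.DataFrame, key: str) -> dict:
    report = {
        "row_count": (len(prod), len(test)),
        "columns_only_in_prod": sorted(set(prod) - set(test)),
        "columns_only_in_test": sorted(set(test) - set(prod)),
    }
    shared = [c for c in prod.columns if c in test.columns and c != key]
    merged = prod.merge(test, on=key, how="outer",
                        suffixes=("_prod", "_test"), indicator=True)
    report["rows_missing_in_test"] = int((merged["_merge"] == "left_only").sum())
    report["rows_added_in_test"] = int((merged["_merge"] == "right_only").sum())
    both = merged[merged["_merge"] == "both"]
    report["value_mismatches"] = {   # note: NaN != NaN counts as a mismatch
        c: int((both[f"{c}_prod"] != both[f"{c}_test"]).sum()) for c in shared
    }
    return report
```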

Transforming Snowflake Intelligence with Agentic Innovations from Snowflake AI Research

Author: Zhewei Yao, Snowflake

Snowflake Intelligence is the front-end agentic solution for Snowflake customers. Its backend combines a series of research efforts from Snowflake AI researchers and Snowflake engineers. We will discuss and illustrate how we seamlessly combine structured and unstructured data to produce conclusions for users, and how we use reasoning chains to resolve multi-hop and/or complex questions.

Tulib: A Unified, Schema-Driven (De)Serialization Layer Powering AI & Data Pipelines at Meta Scale

Author: Abdullah Ozturk, Meta

Large-scale AI and other critical workloads at Meta ingest TBs of heterogeneous data per second, spanning high-throughput logs, columnar pages, legacy text, and GPU-ready tensors. Historically, each ingestion, streaming, or ML pipeline embedded its own parser. Fragmented (de)serialization stacks—each tied to its own schema rules—were inflating latency, duplicating engineering effort, slowing innovation, and hiding silent data drift that could poison downstream models. Even a trivial field-name disagreement could corrupt terabytes, trigger outages and force costly re-processing. Tulib eliminates this friction by providing a single, runtime-pluggable library that speaks every wire and storage format in use while unifying dynamic serialization and enforcing schema contracts at runtime.

Key Ideas:

  1. One API, many formats. The modular SerDe translates between row and columnar layouts, allowing each pipeline stage to choose the most performance-appropriate format.
  2. Schema-first. Every read or write is validated against a concrete schema fetched from the authoritative store, with backward and forward compatibility.
  3. Column-aware. Integrated predicate evaluation and vector builders materialize only the required columns/features or rows, so downstream systems avoid reading bytes they'll discard anyway.
  4. Pluggable encodings. Encodings such as dictionary encoding are supported on a column-by-column basis.
  5. Hot reload. Schema changes are detected and adopted without process restarts.

Since its 2021 rollout, Tulib has become the de facto middleware in logging, Scribe queues, stream processing, data prep for ML, and warehouse ingestion. Zero-copy vector builders feed analytic engines directly, and traffic on Scribe has fallen 20-45% in real workloads with column-aware optimizations. By treating “format” and “schema” as orthogonal concerns, Tulib turns format diversity from a liability into a plug-and-play advantage, and shows how large organizations can evolve data pipelines rapidly without sacrificing performance or correctness.
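
A toy sketch of the "one API, many formats" and "schema-first" ideas under stated assumptions: codecs register behind a single facade, and every write is validated against a schema before encoding. The names and the schema shape are invented; Tulib's real interfaces are internal to Meta.

```python
import json
from typing import Callable

CODECS: dict = {}  # format name -> (encode, decode)

def register(fmt: str, encode: Callable, decode: Callable) -> None:
    CODECS[fmt] = (encode, decode)

register("json", lambda rec: json.dumps(rec).encode(), json.loads)

def serialize(record: dict, schema: dict, fmt: str) -> bytes:
    # Schema-first: reject records that disagree with the contract.
    for field, typ in schema.items():
        if not isinstance(record.get(field), typ):
            raise TypeError(f"field {field!r} must be {typ.__name__}")
    encode, _ = CODECS[fmt]  # one API, many pluggable formats
    return encode(record)

payload = serialize({"user_id": 42, "event": "click"},
                    {"user_id": int, "event": str}, fmt="json")
```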

Understanding Customer Feedback Using Interactive LLMs

Authors: Tim Dorn, Tony Fu, Gaurav Shrivastava, Meta

LLMs are revolutionizing the analysis of customer feedback. For example, large online retailers are leveraging LLMs to highlight key aspects of product reviews for their consumers, eliminating the need for manual sifting and summarizing. For Meta, the value of customer feedback lies in refining our products and services to meet the evolving needs of our users. By tuning into their voices, we can ensure we’re delivering products and services that truly matter to them.

Unified Investigation Runbooks

Authors: Chinmay Gandhi, Vlad Tsvang, Ankit Agarwal, JuanPablo, Neeru Sharma, Meta

Building Investigation Workflows @ Meta. Unified Runbook is a robust framework that enables experts to create dynamic, interactive workflows with either hierarchical or linear structures. It supports both code-based and UI-based authoring, and has seamless integrations with Meta's monitoring and incident management stack.

Zero-Copy Cloning

Authors: Brian Dahmen, Derek Chen, Vyom Munshi , Meta

Meta's data warehouse provides users the ability to apply delta modifications to their warehouse tables via DELETE and UPDATE SQL statements.

Executing delta mutations directly on top of production assets can have unintended consequences, from both a business and an infrastructure perspective. This was pushing customers towards physically replicating their data into distinct tables for isolation, duplicating storage costs.

Meta developed a zero-copy-cloning feature in the warehouse, providing customers with the ability to efficiently "clone" subsets of data from one table to another.
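
A conceptual sketch of the idea, with all structures hypothetical: a clone copies only table metadata pointing at immutable data files, so storage is shared until a mutation rewrites an affected file (copy-on-write).

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataFile:
    path: str  # immutable once written

@dataclass
class Table:
    name: str
    files: list = field(default_factory=list)

def clone(src: Table, new_name: str) -> Table:
    # O(metadata), not O(data): both tables reference the same files.
    return Table(new_name, list(src.files))

def rewrite_for_delete(table: Table, victim: DataFile, rewritten: DataFile):
    # Copy-on-write: only this table's pointer changes; other tables that
    # reference the original immutable file are unaffected.
    table.files = [rewritten if f is victim else f for f in table.files]

prod = Table("events", [DataFile("f1.parquet"), DataFile("f2.parquet")])
dev = clone(prod, "events_dev")  # instant: no data bytes are copied
rewrite_for_delete(dev, dev.files[0], DataFile("f1_v2.parquet"))
print([f.path for f in prod.files])  # ['f1.parquet', 'f2.parquet'] unchanged
```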

SPEAKERS AND MODERATORS

Dr. Jelena Pješivac-Grbović is an engineering director in Data Infrastructure at Meta. Her teams...

Jelena Pješivac-Grbović

Meta

Aparna is VP Engineering at Meta, responsible for AI Infrastructure, Data Infrastructure and Developer...

Aparna Ramani

Meta

Ion Stoica is a Professor in the EECS Department at the University of California...

Ion Stoica

AnyScale

I’m a technology executive with a deep background in building AI systems—from physical autonomy...

Hussein Mehanna

Meta

Jeff is the Director of Product for AI Agents and Applications at Snowflake. His...

Jeff Hollan

Snowflake

Shishir is a Research Scientist on the Llama post-training team, where he led the...

Shishir Patil

Meta

Cyril Meurillon is a software engineer at NVIDIA, where he covers resiliency. His work...

Cyril Meurillon

NVIDIA

Devin O'Kelly is a Senior HPC Engineer at NVIDIA where he focuses on fleet...

Devin O’Kelly

NVIDIA

Joe Spisak is Product Director and Head of Open Source in Meta’s Generative AI...

Joe Spisak

Meta

Horace is interested in making both researchers and GPUs happy. He currently works...

Horace He

Thinking Machines

Dhruv Batra is a co-founder and the Chief Scientist of Yutori. Previously, he was...

Dhruv Batra

Yutori

Sal is the Chief Technology Officer for EvolutionaryScale, a public benefit corporation developing artificial...

Sal Candido

Evolutionary Scale AI

Member of technical staff at Perplexity; formerly Llama post-training at Meta.

Eugen Hotaj

Perplexity

Carina Hong is the Founder/CEO of Axiom Math, a company building toward quantitative superintelligence...

Carina Hong

Axiom

Serhii Sokolenko is the CEO and co-founder of Tower.dev, a hassle-free platform for data...

Serhii Sokolenko

Tower.dev

Can Lin is a software engineer in the AI & Data Infrastructure Responsibility area...

Can Lin

Meta

Uday Ramesh Savagaonkar

Meta

Krishna Gade is the Founder/CEO of Fiddler.AI, a Model Performance Monitoring startup. Prior to...

Krishna Gade

Fiddler

Alexandru Petrescu has been a Software Engineer at Meta for the past 12 years,...

Alexandru Petrescu

Meta

Karthik Lakshminarayanan is a Product Management Director at Meta.

Karthik Lakshminarayanan

Meta

Barak Yagour is the Vice President of Engineering at Meta, leading the Data Infrastructure...

Barak Yagour

Meta

Anna Berenberg is an Engineering Fellow and Uber Tech Lead for GCP as Platform....

Anna Berenberg

Google

Barr Moses is CEO & Co-Founder of Monte Carlo, a data and AI observability...

Barr Moses

Monte Carlo

Corporate Vice President, Cloud + AI, Qi Ke leads several key initiatives within Microsoft...

Qi Ke

Microsoft

2025 Events

@Scale is a technical conference series for engineers who build or maintain systems designed for scale. New this year, in-person and virtual attendance options will be available at all four of our programs, each of which brings together complementary themes to create event communities that spark cross-discipline collaboration.

NETWORKING - AUGUST 13, 2025

Hosted In Person & Virtually
Santa Clara Convention Center

In 2025, @Scale: Networking will continue to focus on the evolution of AI Networking. To address the growing complexity of network operations, we will examine a full-stack perspective towards debugging, encompassing the communications layer through to the hardware. By adopting a holistic approach across both our front-end and back-end networks we can identify and mitigate potential bottlenecks, ensuring optimal network performance. You will also hear from industry experts and leading researchers who are at the forefront of building large scale networks. Attendees will benefit from the opportunity to learn about diverse approaches to solve common challenges and explore potential collaborations.

Joining us are speakers from Alibaba, ByteDance, Meta, Microsoft, Oracle Cloud Infrastructure, and more!

Register today and stay tuned for upcoming speaker & agenda announcements.

PRODUCT - OCTOBER 22, 2025

Hosted In Person & Virtually
Meta Campus, Menlo Park

@Scale: Product is an exciting evolution of the conference series, bringing together the best of Product @Scale, RTC @Scale, Mobile @Scale, and Video @Scale. This comprehensive program is designed for engineers who are passionate about building and optimizing large-scale products. Attendees will gain insights into the latest innovations, best practices, and tools that drive efficiency and performance across product development, real-time communication, mobile platforms, and video technologies.

Register today and stay tuned for upcoming agenda announcements.

SYSTEMS & RELIABILITY - PAST EVENT

Hosted In Person & Virtually
Meta Campus, Menlo Park

The first installment of the 2025 @Scale conference series will combine two of the most foundational topics across the stack, Systems & Reliability. This two-track program will feature technical talks about the demands of AI and the conference theme of "rising to the challenge." The themed talks will include compelling stories about solving the hardest hyper-scale problems with distributed systems, infra resilience and many more complex challenges by speakers from around the industry.

AI & DATA - PAST EVENT

Meta’s Engineering and Infrastructure teams are excited to bring together a global contingent of engineers who are interested in building, operating, and using AI and data systems at scale.

This year, we will focus on building a world in which Agents interact with billions of users, a critical step towards unlocking the full potential of AI and data systems. Our in-person talks and panels will delve into the latest advancements in agent development, deployment, and product integration, featuring expert insights on topics such as data for agents, agent tools & environments, safety, and privacy. Attendees can expect to gain practical knowledge and strategies for building AI-powered products, as well as a deeper understanding of the evolving ecosystem and its implications for traditional BI and product analytics.

LATEST NOTES

AI Infra @Scale
06/25/2025
Advancing AI Wearables: The Technical Journey of Ray-Ban Meta
In November 2023, Meta introduced the Ray-Ban Meta, AI-powered wearable glasses that aimed to seamlessly integrate multi-modal AI technology into...
