Program

Half-day Event
We expect a half-day event. Preferred date: 11 June. The estimated breakdown among
invited talks, contributed presentations, and the poster session is as follows:
• Opening and closing: 15 minutes
• Invited talks: 6 x 30 minutes
• Contributed work presentations: 6 x 10 minutes
• Poster session: 50 minutes
The total amount of time is 390 minutes (6.5 hours), including coffee and lunch breaks.
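
The listed items account for 305 minutes of program content, so within the stated 390-minute total about 85 minutes remain for coffee and lunch breaks. A minimal Python sketch of that arithmetic (the 85-minute break figure is derived here, not stated in the program):

    # Sanity check of the proposed time budget.
    # Note: the break allocation is derived from the stated 390-minute total,
    # it is not listed explicitly in the program.
    program = {
        "opening_and_closing": 15,
        "invited_talks": 6 * 30,
        "contributed_presentations": 6 * 10,
        "poster_session": 50,
    }
    content = sum(program.values())   # 305 minutes of program content
    total_minutes = 390               # stated total (6.5 hours)
    breaks = total_minutes - content  # 85 minutes for coffee and lunch breaks
    print(content, breaks, total_minutes / 60)  # 305 85 6.5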

Schedule

  • Oral papers are grouped into two blocks of 20 minutes each (2 × 10 min per block).
  • The poster session and coffee break start at 3:30 PM.

Time                Activity
1:00 PM – 1:05 PM   Opening Remarks
1:05 PM – 1:30 PM   Sara Beery – Generalization vs Specialization: AI Deployment in the Era of Foundation Models
1:30 PM – 1:55 PM   Eric Granger – Domain Generalization for Cross- and Mixed-Modal Visible-Infrared Re-Identification
1:55 PM – 2:20 PM   Francesco Locatello – Representation Learning for Downstream Causal Inference
2:20 PM – 2:45 PM   Aditi Raghunathan – Predicting the Performance of Foundation Models Under Distribution Shift
2:45 PM – 3:05 PM   Oral Paper 1 (10 min) + Oral Paper 2 (10 min)
3:05 PM – 3:30 PM   Yasuhiro Tsuchida – Advancing Edge AI and Video Analysis Technology: AWL’s Global Impact and Real-World Implementation
3:30 PM – 4:10 PM   Coffee Break + Poster Session
4:10 PM – 4:35 PM   Elisa Ricci – Toward Generalizable Vision-Language Models: Improving Fine-Grained Understanding from Limited Image Samples and Synthetic Videos
4:35 PM – 4:55 PM   Oral Paper 3 (10 min) + Oral Paper 4 (10 min)
4:55 PM – 5:20 PM   Kai Han – Category Discovery: An Open-World Learning Perspective
5:20 PM – 5:25 PM   Award Announcements & Closing Remarks

Oral Papers

Oral 1: Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation
Jia Wei · Xiaoqi Zhao · Jonghye Woo · Jinsong Ouyang · Georges El Fakhri · Qingyu Chen · Xiaofeng Liu

Oral 2: IMC: A Benchmark for Invariant Learning under Multiple Causes
Taero Kim · Seonggyun Lee · Joonseong Kang · Youngjun Choi · Wonsang Yun · Nicole Hee-Yeon Kim · Ziyu Chen · Lexing Xie · Kyungwoo Song

Oral 3: Task-Level Contrastiveness for Cross-Domain Few-Shot Learning
Kristi Topollaj · Anna Choromanska

Oral 4: PiCaZo: Pixel-Aligned Contrastive Learning for Zero-Shot Domain Adaptation
Aniruddh Sikdar · Arya Kishor · Ishika Kadam · Suresh Sundaram

Invited Talks

Elisa Ricci

Associate Professor, University of Trento · e.ricci@unitn.it · Confirmed

Title: Toward Generalizable Vision-Language Models: Improving Fine-Grained Understanding from Limited Image Samples and Synthetic Videos

Abstract: Vision-Language Models have shown impressive performance on a wide range of tasks, yet their generalization capabilities remain a key challenge, especially in fine-grained image and video understanding tasks. In this talk, I will present two recent works that explore novel strategies to address this limitation. First, I will consider the problem of few-shot adaptation for image recognition. I will introduce Two-Stage Few-Shot Adaptation (2SFS), a novel and simple strategy that explicitly separates task-level feature extraction and concept specialization. 2SFS yields improved generalization capabilities over baselines and consistent gains across multiple datasets, backbones, and settings. Second, I will present SynViTA, a novel framework for improving video-language alignment using synthetic videos. SynViTA mitigates the noise and distribution shift in generated video content by weighting samples based on semantic similarity and enforcing fine-grained caption consistency, leading to consistent gains on multiple video benchmarks and downstream tasks.

Sara Beery

Assistant Professor, CSAIL, MIT · beery@mit.edu · Confirmed

Title: Generalization vs Specialization: AI Deployment in the Era of Foundation Models

Abstract: Foundation model training aims for broad generalization: training on as much high-quality but general-purpose data as possible, usually from massive internet-scale datasets, with the goal of building models that work off-the-shelf for as many users as possible. In practice, however, foundation models can be suboptimal for specific deployments. We explore this remaining generalization gap to better understand factors that contribute to performance inequities, including distributions over categories and data characteristics that differ from the general training data pool. We introduce dataset subset selection for specialization as one possible mechanism to efficiently optimize foundation models for specific deployments, seeking to identify finetuning subsets closely aligned with the target deployment to achieve superior performance under the given distribution and attribute shifts.
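
To make the subset-selection idea concrete, a minimal sketch is given below: it picks pool examples by feature similarity to a small unlabeled deployment sample. The selection criterion, function names, and the choice of k are illustrative assumptions, not the specific method described in the talk.

    # Illustrative sketch only: pick a deployment-aligned finetuning subset
    # by feature similarity. Criterion and names are assumptions, not the
    # method presented in the talk.
    import numpy as np

    def select_subset(pool_feats, deploy_feats, k):
        """Return indices of the k pool examples closest to the deployment sample.

        pool_feats:   (N, d) features of the general-purpose training pool
        deploy_feats: (M, d) features of a small unlabeled deployment sample
        """
        pool = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
        dep = deploy_feats / np.linalg.norm(deploy_feats, axis=1, keepdims=True)
        sims = pool @ dep.T            # (N, M) cosine similarities
        scores = sims.max(axis=1)      # nearest-deployment-neighbor score per pool example
        return np.argsort(-scores)[:k]

    # Usage (hypothetical): finetune the foundation model on the examples
    # indexed by select_subset(pool_feats, deploy_feats, k=10_000).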

Kai Han

Assistant Professor, University of Hong Kong · kaihanx@hku.hk · Confirmed

Title: Category Discovery: An Open-World Learning Perspective

Abstract: In this talk, I will present our recent advances in category discovery, an emerging open-world learning task that aims to automatically categorize visual concepts in unlabeled data by transferring knowledge from labeled data, where the unlabeled set may contain both known and novel classes. I will discuss our progress in tackling key challenges, particularly through effective utilization of foundation models, including: enhancing sensitivity to semantic shifts while maintaining robustness to domain shifts, enabling continual learning for evolving category discovery, mitigating label bias during training, leveraging hyperbolic space for richer geometric representations of categories, etc. Finally, I will summarize our key insights, practical takeaways, and lessons learned from overcoming these challenges.

Francesco Locatello

Assistant Professor, Institute of Science and Technology Austria · Francesco.Locatello@ist.ac.at · Confirmed

Title: Representation Learning for Downstream Causal Inference

Abstract: Machine learning and AI have the potential to transform data-driven scientific discovery, not only enabling accurate predictions for a range of scientific phenomena but also accelerating causal understanding. In this talk, I will present the challenges and our initial progress in learning representations for the causal inference downstream task of treatment effect estimation. I will specifically focus on the setting where the outcome of interest is recorded in high-dimensional observations during an experiment, using our real-world ISTAnt benchmark in experimental ecology as a motivating example. First, I will discuss how common and seemingly harmless choices in machine learning lead to biased estimates in a transfer learning setting and how to fix them. Most relevant to the workshop, I will show how domain generalization techniques can be successfully used to correct selection bias. Finally, I will show how to zero-shot generalize a predictor trained on past experiments, making it possible to draw correct causal conclusions on an unseen target experimental population (e.g., generalizing to a new treatment).

Eric Granger

Professor, ETS Montreal, Canada · eric.granger@etsmtl.ca · Confirmed

Title: Domain Generalization for Cross- and Mixed-Modal Visible-Infrared Re-Identification

Abstract: Visible-infrared person re-identification (VI-ReID) aims to match individuals across different camera modalities, a critical task in modern surveillance systems.
A key challenge in VI-ReID is training a backbone model capable of effectively addressing the significant discrepancies across modalities. State-of-the-art methods that generate a single intermediate bridging domain are often less effective, as this generated domain may not adequately capture sufficient common discriminant information. We introduce Bidirectional Multi-step Domain Generalization (BMDG), a novel approach for unifying feature representations across diverse modalities. To minimize the cross-modal gap, BMDG creates multiple virtual intermediate domains by learning and aligning body part features extracted from both the infrared (I) and visible (V) modalities. First, BMDG aligns modalities in the feature space by learning shared and modality-invariant body part prototypes from V and I images. Then, it generalizes the feature representation by applying bidirectional multi-step learning, which progressively refines feature representations in each step and incorporates more prototypes from both modalities. Based on these prototypes, multiple bridging steps enhance the feature representation.

While state-of-the-art VI-ReID methods focus on cross-modality matching, real-world applications often involve mixed galleries containing both V and I images, where these methods show performance limitations due to large domain shifts and low discrimination across mixed modalities. This is because gallery images from the same modality may have smaller domain gaps yet correspond to different identities. We introduce a realistic set of mixed-modal ReID settings, where galleries contain data from both modalities. To address the domain shift in inter-modal matching and the low discrimination capacity in intra-modal matching, we propose the Mixed Modality-Erased and -Related (MixER) method. This learning approach disentangles modality-specific and modality-shared identity information through orthogonal decomposition, modality confusion, and ID-modality-related objectives. MixER enhances feature robustness across modalities, improving performance in both cross-modal and mixed-modal settings. Our extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets indicate that our methods achieve state-of-the-art performance using a single backbone and highlight their versatility across cross-modal and mixed-modal settings.
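
As a rough illustration of the orthogonal-decomposition idea mentioned above, the sketch below shows a generic orthogonality penalty between a modality-shared and a modality-specific embedding head. It is a simplified, assumption-laden example, not the MixER objective itself.

    # Illustrative sketch only: a generic orthogonality penalty between
    # modality-shared and modality-specific embeddings. Not the MixER
    # implementation; names and the loss weighting are assumptions.
    import torch
    import torch.nn.functional as F

    def orthogonality_loss(shared, specific):
        """Encourage shared and specific features to carry non-overlapping information.

        shared, specific: (batch, d) embeddings from two projection heads.
        """
        shared = F.normalize(shared, dim=1)
        specific = F.normalize(specific, dim=1)
        # Squared per-sample cosine similarity, pushed toward zero
        return (shared * specific).sum(dim=1).pow(2).mean()

    # Usage (hypothetical):
    # total_loss = reid_loss + lam * orthogonality_loss(z_shared, z_specific)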

Aditi Raghunathan

Assistant Professor, Carnegie Mellon University, USA · aditirag@andrew.cmu.edu · Confirmed

Title: Predicting the Performance of Foundation Models Under Distribution Shift

Abstract: Can we forecast how a foundation model will behave once the data distribution drifts, without labeling a single new example? I will describe a striking phenomenon called agreement-on-the-line that provides a surprisingly precise answer. Across independently trained networks, the pairwise agreement measured in-distribution versus out-of-distribution lies on the very same straight line as in-distribution versus out-of-distribution accuracy. Because agreement is observable with unlabeled data, fitting this line allows us to estimate out-of-distribution accuracy without any labels. I will then demonstrate how this method can be applied to the modern paradigm of fine-tuning foundation models. Finally, I will discuss how various failures in vision-language models can be traced to a cross-modal attention imbalance. Together, these insights aim to turn robustness to distribution shifts into a predictive science.
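
A rough sketch of how the agreement-on-the-line observation can be turned into a label-free estimator is shown below. It omits refinements of the published method (such as probit scaling), and the function names and data layout are assumptions.

    # Illustrative sketch only: estimate OOD accuracy via agreement-on-the-line.
    # Assumes preds_id / preds_ood are lists of per-model prediction arrays on a
    # labeled ID test set and an unlabeled OOD set.
    import itertools
    import numpy as np

    def pairwise_agreement(preds):
        """Fraction of matching predictions for every pair of models."""
        return np.array([(a == b).mean() for a, b in itertools.combinations(preds, 2)])

    def estimate_ood_accuracy(preds_id, preds_ood, labels_id):
        agr_id = pairwise_agreement(preds_id)    # observable without labels
        agr_ood = pairwise_agreement(preds_ood)  # observable without labels
        slope, bias = np.polyfit(agr_id, agr_ood, deg=1)  # fit the agreement line
        acc_id = np.array([(p == labels_id).mean() for p in preds_id])
        # Agreement-on-the-line: reuse the same line to map ID accuracy to OOD accuracy
        return slope * acc_id + bias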

Yasuhiro Tsuchida

Director and CTO, AWL

Title: Advancing Edge AI and Video Analysis Technology: AWL’s Global Impact and Real-World Implementation

Abstract: Founded in Tokyo in June 2016, AWL has expanded its research and development footprint to Bangalore, India, and Hanoi, Vietnam, focusing on the global advancement and social implementation of edge AI and video analysis technology.

Recent developments in applications such as Agent AI have significantly enhanced human task support. However, these applications often require costly GPU servers, presenting substantial barriers in terms of cost and power consumption.

AWL addresses these challenges with its core technology, the “AWL Engine,” and associated video analysis AI applications. These innovations aim to deliver AI solutions for video analysis that are both cost-effective and energy-efficient.

The AWL Engine is designed to overcome the difficulties of AI model optimization and accuracy degradation due to environmental changes. Miniaturized AI models often lose generalization accuracy, necessitating fine-tuning for specific installation environments. However, these models are prone to significant accuracy deterioration during operation. The AWL Engine continuously monitors AI model performance, collects data for fine-tuning, and automates the fine-tuning process to maintain optimal accuracy. Additionally, it establishes an infrastructure for deploying low-cost, low-power AI applications by leveraging federated learning to continuously improve the foundation model without compromising privacy.

This keynote will explore the core technology of the AWL Engine, its diverse applications, and the current status of its real-world implementation in retail and manufacturing sectors.

Profile: Yasuhiro Tsuchida, the visionary Director and CTO of AWL, stands at the helm of the company’s global Research and Development on Artificial Intelligence Technology. With an unparalleled ability to craft analytical frameworks that drive the company’s success, Yasuhiro sensed the dawn of a revolutionary AI era and joined AWL to lead the charge from his hometown of Hokkaido. His mission: to conquer the global market and establish AWL as a titan in the AI industry. At AWL, Yasuhiro masterminded the creation and nationwide deployment of AWLBOX and AWL Lite, now operational in over 10,000 locations.

Before his transformative journey with AWL, Yasuhiro held prestigious leadership roles at Matsushita Electric Industrial Co., Ltd., now known as Panasonic Corporation. His tenure included a remarkable five-year stint as Director of New Business Development at Panasonic Silicon Valley LAB, where he accelerated the evolution of mobile O2O Commerce. Prior to this, Yasuhiro spearheaded numerous groundbreaking projects in the Corporate R&D Division, solidifying his reputation as a pioneer of innovation.

Yasuhiro’s illustrious career began at Mobile Communications Company, where he developed a cutting-edge middleware platform for NTT DoCoMo, Japan’s largest carrier. He holds a master’s degree in Computer Science from Hokkaido University Graduate School, a testament to his profound expertise and relentless pursuit of excellence.