Enhancing Mixture-of-Experts Specialization
via Cluster-Aware Upcycling

CVPR 2026
Sanghyeok Chu*1,2, Pyunghwan Ahn†2, Gwangmo Song2, SeungHwan Kim2, Honglak Lee2,3, Bohyung Han†1,4
1ECE & 4IPAI, Seoul National University  2LG AI Research  3University of Michigan
* Work done during an internship at LG AI Research  † Corresponding authors
Key Idea: We leverage semantic structure in a pretrained dense model's activations to initialize both experts and the router in MoE upcycling, breaking expert symmetry and promoting early specialization.
Teaser figure

Abstract

Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized from the subspace representation of its corresponding cluster via truncated SVD, and the router's initial weights are set to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance from an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.

Cluster-Aware Initialization

Cluster-aware Upcycling initialization pipeline

The initialization pipeline consists of three components:

Step 1: Clustering Input Activations

Input activations from each FFN block are partitioned via spherical k-means clustering based on cosine similarity, which directly aligns with the router's logit computation.
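Step 1 can be sketched in a few lines of NumPy: spherical k-means is ordinary k-means applied to unit-normalized points, with each centroid updated as the renormalized mean of its cluster (a minimal sketch; the function name and defaults are ours, not from the paper's code):

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Cluster rows of X by cosine similarity (spherical k-means).

    X: (n, d) activation matrix. Returns (centroids, assignments);
    centroids are unit-norm, so `Xn @ C.T` is cosine similarity.
    """
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize tokens
    C = Xn[rng.choice(len(Xn), size=k, replace=False)]  # init from data points
    a = np.zeros(len(Xn), dtype=int)
    for _ in range(iters):
        sim = Xn @ C.T                  # cosine similarity to each centroid
        a = sim.argmax(axis=1)          # hard assignment
        for j in range(k):
            pts = Xn[a == j]
            if len(pts):                # skip empty clusters
                m = pts.sum(axis=0)
                C[j] = m / np.linalg.norm(m)  # renormalized mean
    return C, a
```

Because the centroids stay unit-norm, the assignment rule is the same dot-product form as the router's logit computation.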

Step 2: Cluster-Aware Expert Initialization

Each expert is initialized to capture the principal subspace of its corresponding cluster, preserving pretrained knowledge within the assigned semantic region while promoting diversity across experts.
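The paper derives each expert from the truncated SVD of its cluster's activations; the exact way the SVD factors enter the expert weights is specified in the paper, so the sketch below shows only one plausible form, projecting the dense FFN weight onto the cluster's top-r principal subspace (`init_expert_from_cluster` is a hypothetical name):

```python
import numpy as np

def init_expert_from_cluster(W_dense, X_cluster, rank):
    """Initialize one expert from its cluster's principal subspace.

    W_dense:   (d_ff, d) pretrained FFN up-projection weight.
    X_cluster: (n, d) input activations assigned to this cluster.
    Returns a weight whose rows lie in the cluster's top-`rank` subspace.
    """
    # Top-`rank` right singular vectors span the cluster's principal subspace.
    _, _, Vt = np.linalg.svd(X_cluster, full_matrices=False)
    V_r = Vt[:rank]          # (rank, d) orthonormal basis
    P = V_r.T @ V_r          # orthogonal projector onto the subspace
    return W_dense @ P       # keep pretrained knowledge within the cluster's subspace
```

Since each expert sees a different projector, the initial weights differ across experts, which is what breaks the symmetry of vanilla Sparse Upcycling.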

Step 3: Router Initialization with Cluster Centroids

The router weights are set to the cluster centroids, ensuring that early routing decisions align with the data's semantic structure rather than random routing.
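Step 3 maps directly to code: the router matrix is simply the stack of cluster centroids, so the initial logits are dot-product similarities between a token and each cluster (a minimal sketch; `route` is our name):

```python
import numpy as np

def route(X, centroids, top_k=1):
    """Initial routing with centroid-initialized router weights.

    X:         (n, d) token activations.
    centroids: (num_experts, d) cluster centroids = router weight rows.
    Returns (top-k expert indices, full softmax routing probabilities).
    """
    W_router = np.asarray(centroids)                 # router init = centroids
    logits = X @ W_router.T                          # similarity to each cluster
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)        # softmax over experts
    return np.argsort(-logits, axis=1)[:, :top_k], probs
```

With this initialization, a token near a cluster centroid is routed to that cluster's expert from the very first step, instead of being scattered by a random router.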

Expert-Ensemble Self-Distillation

EESD loss

Tokens with near-uniform routing probabilities indicate weak alignment between the input and the experts, making it difficult to preserve and reinforce consistent specialization. The Expert-Ensemble Self-Distillation (EESD) loss addresses this with a dense EMA ensemble teacher that activates all experts simultaneously, providing stable supervision for the sparse MoE model. The distillation signal is strongest when routing is uncertain and remains small for confident tokens. EESD introduces only modest overhead in our experiments: ~5.3% in wall-clock time and ~2.8% in GPU memory.
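The exact weighting of the loss is given in the paper; the sketch below is an illustrative variant under one assumption we make explicit: the per-token distillation error is weighted by normalized routing entropy, so uncertain tokens (near-uniform routing) receive stronger teacher guidance while confident tokens contribute little (`eesd_loss` is our name):

```python
import numpy as np

def eesd_loss(student_out, teacher_out, router_probs, eps=1e-9):
    """Illustrative entropy-weighted EESD term (our weighting assumption).

    student_out:  (n, d) sparse MoE outputs.
    teacher_out:  (n, d) outputs of the dense EMA teacher that averages
                  over all experts (the ensemble teacher from the text).
    router_probs: (n, k) routing softmax for each token.
    """
    n, k = router_probs.shape
    ent = -(router_probs * np.log(router_probs + eps)).sum(axis=1)
    w = ent / np.log(k)                              # normalized to [0, 1]
    err = ((student_out - teacher_out) ** 2).mean(axis=1)
    return (w * err).mean()                          # uncertain tokens weigh more
```

Under this form, a token with near-uniform routing gets weight close to 1, while a token routed with high confidence gets weight close to 0, matching the behavior described above.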

Zero-Shot Results

Zero-shot retrieval (MSCOCO, Recall@1) and classification accuracy on CLIP ViT-B/16. Cluster-aware Upcycling achieves the best performance across most benchmarks.

| Method | MSCOCO I→T | MSCOCO T→I | MSCOCO Avg. | IN Val | IN-V2 | IN-A | IN-R | IN-Sketch | ObjNet | IN Avg. | VTAB Nat. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 34.3 | 50.8 | 42.6 | 62.5 | 54.4 | 23.5 | 70.6 | 45.8 | 42.5 | 49.9 | 62.6 |
| Drop-Upcycling | 34.1 | 51.3 | 42.7 | 62.0 | 54.5 | 22.7 | 70.8 | 45.7 | 42.9 | 49.8 | 60.9 |
| Sparse Upcycling | 34.9 | 50.9 | 42.9 | 63.0 | 55.1 | 23.7 | 71.2 | 46.3 | 42.3 | 50.3 | 62.0 |
| CLIP-MoE | 34.0 | 51.5 | 42.8 | 62.9 | 54.9 | 24.5 | 71.6 | 46.2 | 43.4 | 50.6 | 62.8 |
| Cluster-aware Upcycling (Ours) | 35.4 | 51.6 | 43.5 | 63.2 | 55.1 | 24.1 | 72.1 | 46.8 | 43.5 | 50.8 | 63.3 |

Few-Shot & Fine-Tuning Results

ImageNet-1k few-shot and full fine-tuning accuracy on the upcycled ViT-B/16. Improvements are most pronounced in few-shot regimes, where initialization quality is critical due to limited training signals.

| Method | 5-shot | 10-shot | Full FT |
|---|---|---|---|
| Dense | 50.4 | 57.1 | 72.8 |
| Sparse Upcycling | 50.9 | 57.8 | 73.0 |
| Drop-Upcycling | 51.1 | 57.9 | 73.1 |
| CLIP-MoE | 51.3 | 58.0 | 73.2 |
| Cluster-aware Upcycling (Ours) | 51.5 | 58.2 | 73.3 |

Ablation Study

Cluster-aware initialization and EESD play complementary roles. EESD alone shows modest gains, but combined with cluster-aware initialization it yields further improvements across all benchmarks.

| Cluster-init. | EESD | MSCOCO I→T | MSCOCO T→I | IN Val | IN 10-shot |
|---|---|---|---|---|---|
|  |  | 34.9 | 50.9 | 63.0 | 57.8 |
|  | ✓ | 35.1 | 51.1 | 63.2 | 58.1 |
| ✓ |  | 34.6 | 51.4 | 62.7 | 57.8 |
| ✓ | ✓ | 35.4 | 51.6 | 63.2 | 58.2 |

Analysis

We analyze how Cluster-aware Upcycling influences expert specialization across four dimensions, quantitatively confirming that our method mitigates expert symmetry and redundancy.

Relative Compactness

(a) Relative Compactness. Measures overlap between intra- and inter-expert variance. Lower values indicate more disentangled, geometrically independent expert subspaces.

Expert Similarity

(b) Expert Similarity. Pairwise cosine similarity between expert weights. Cluster-aware Upcycling maintains significantly higher parameter diversity across experts.
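The metric in (b) is straightforward to compute; a minimal sketch, assuming similarity is taken between flattened expert weight matrices and averaged over off-diagonal pairs:

```python
import numpy as np

def mean_expert_similarity(expert_weights):
    """Mean pairwise cosine similarity between experts' flattened weights.

    expert_weights: list of (d_ff, d) arrays, one per expert.
    Lower values indicate more diverse, less redundant experts.
    """
    W = np.stack([w.ravel() for w in expert_weights])       # (k, d_ff * d)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)       # unit rows
    S = Wn @ Wn.T                                           # (k, k) cosine sims
    off_diag = S[~np.eye(len(W), dtype=bool)]               # drop self-similarity
    return off_diag.mean()
```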

Routing Entropy

(c) Routing Entropy. Our model starts with low entropy, increases during training as load-balancing encourages exploration, and stabilizes at a lower level than baselines.
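The curve in (c) can be computed as the per-token Shannon entropy of the router's softmax distribution, averaged over tokens (a minimal sketch; the function name is ours):

```python
import numpy as np

def mean_routing_entropy(router_probs, eps=1e-9):
    """Average per-token entropy of routing distributions.

    router_probs: (n, k) routing softmax per token.
    Lower entropy means more confident expert assignment;
    the maximum value log(k) corresponds to uniform routing.
    """
    H = -(router_probs * np.log(router_probs + eps)).sum(axis=1)
    return H.mean()
```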

Expert Utilization

(d) Expert Utilization. Balanced and stable utilization across all experts without routing collapse, confirming structured specialization without imbalance.

BibTeX

Coming soon.