How is K-means clustering used to group assets for portfolio construction, and what are its limitations with financial return data?
I'm exploring unsupervised learning methods in CFA quantitative methods and want to understand how K-means can replace traditional sector classifications for portfolio construction. The idea of letting return patterns define asset groups sounds appealing, but I'm worried about K-means assumptions (spherical clusters, equal variance) clashing with the reality of financial data.
K-means clustering partitions assets into K groups by minimizing the within-cluster sum of squared distances to each cluster's centroid. Instead of relying on sector labels, which can group companies whose returns behave quite differently, clustering uses statistical patterns in returns and fundamentals to reveal data-driven groupings.

Algorithm Steps:
1. Choose K (number of clusters)
2. Initialize K centroids randomly
3. Assign each asset to the nearest centroid (Euclidean distance in feature space)
4. Recompute centroids as the mean of assigned assets
5. Repeat steps 3-4 until assignments stabilize

Because Euclidean distance is scale-sensitive, standardize each feature (for example, to z-scores) before clustering; otherwise a wide-ranging feature such as P/E dominates a narrow one such as dividend yield. The result also depends on the random initialization, so it is common to run the algorithm from several starting points and keep the solution with the lowest within-cluster sum of squares.

Worked Example:
Lakefront Asset Management wants to diversify a 50-stock portfolio beyond traditional GICS sectors. They compute 8 features for each stock: 12-month return, 60-day volatility, beta, dividend yield, P/E ratio, debt-to-equity, revenue growth, and earnings stability.

Running K-means with K=6 produces:

| Cluster | Profile | No. of Stocks | Traditional Sectors Mixed |
|---|---|---|---|
| 1 | High-growth, high-vol | 8 | Tech + Biotech + Consumer Discretionary |
| 2 | Stable dividend payers | 11 | Utilities + REITs + Consumer Staples |
| 3 | Cyclical value | 9 | Industrials + Materials + Energy |
| 4 | Defensive low-beta | 7 | Healthcare + Telecom + Utilities |
| 5 | Leveraged growth | 8 | Financials + Tech + Real Estate |
| 6 | Quality compounders | 7 | Tech + Healthcare + Consumer |

Clusters 2 and 4 both contain Utilities stocks, but cluster 2 groups some of them with REITs based on yield characteristics, while cluster 4 groups others with Healthcare based on low-beta behavior. This captures economically meaningful distinctions that sector labels miss.

Choosing K (Elbow Method):
Plot within-cluster sum of squares (WCSS) against K and look for the point where adding clusters stops producing large reductions. Lakefront tested K=3 to K=12; selected results:

- K=3: WCSS=284 (too few, heterogeneous clusters)
- K=6: WCSS=121 (elbow point, clear inflection)
- K=10: WCSS=89 (marginal improvement, overly granular)

Going from K=3 to K=6 cut WCSS by 163 (about 54 per added cluster), while going from K=6 to K=10 cut it by only 32 (8 per added cluster). K=6 provided the best balance between granularity and statistical stability.

Limitations with Financial Data:
- K-means implicitly assumes roughly spherical clusters of similar variance; groupings in financial data are often elongated or asymmetric
- Sensitive to outliers (extreme returns distort centroids)
- Clusters may be unstable across time periods; quarterly reclustering often reassigns 20-30% of assets
- Euclidean distance becomes less informative in high dimensions (the curse of dimensionality)

Alternatives: Hierarchical clustering does not require fixing K in advance and, depending on the linkage, can capture non-spherical shapes; DBSCAN does not take K as an input (the number of clusters follows from its density parameters) and labels outliers as noise.
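The algorithm steps and the elbow scan described above can be sketched in plain Python. This is a minimal illustration and not Lakefront's actual process: the 30 assets, the two features (volatility and dividend yield), the helper names (`standardize`, `kmeans`, `best_of`), and the choice of 50 random restarts are all assumptions made for the example.

```python
import math
import random

def standardize(rows):
    """Z-score each feature column so no single scale dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One run of steps 2-5; returns (labels, centroids, wcss)."""
    centroids = [list(p) for p in rng.sample(points, k)]  # step 2: random start
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each asset to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # step 5: assignments have stabilized
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

def best_of(points, k, n_init=50, seed=0):
    """Repeat from n_init random starts and keep the lowest-WCSS run."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(n_init)),
               key=lambda run: run[2])

# Hypothetical data: 30 assets, 2 features (annualized volatility, dividend yield %),
# built as three tight groups of 10 so the elbow is easy to see.
centers = [(0.12, 4.5), (0.25, 2.0), (0.40, 0.3)]
offsets = [(-2, -1), (-1, 2), (0, 0), (1, -2), (2, 1),
           (-2, 2), (2, -2), (-1, -1), (1, 1), (0, 2)]
raw = [[cv + 0.005 * dx, cy + 0.05 * dy]
       for cv, cy in centers for dx, dy in offsets]
features = standardize(raw)

# Elbow scan: WCSS for each candidate K.
wcss_by_k = {k: best_of(features, k)[2] for k in range(1, 7)}
for k, w in wcss_by_k.items():
    print(f"K={k}: WCSS={w:.3f}")

# Cluster labels at the K where the curve flattens in this toy data.
labels3 = best_of(features, 3)[0]
```

On this toy data WCSS should drop sharply up to K=3 and flatten afterwards, which is the pattern the elbow method looks for. With real return data the curve is rarely this clean, and a library implementation such as scikit-learn's `KMeans` would normally replace the hand-written loop.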