What's the difference between supervised and unsupervised learning, and how are they used in finance?
CFA Level II now covers machine learning basics. I get that supervised learning uses labeled data and unsupervised doesn't, but I'm unclear on practical finance applications. When would a portfolio manager or analyst use each type?
Machine learning (ML) in finance is a growing topic on the CFA exam. The fundamental distinction between supervised and unsupervised learning is whether you have a target variable (label) to predict.
Supervised Learning:
You have input features (X) and a known output (Y). The algorithm learns the relationship X -> Y from historical data, then predicts Y for new data.
Finance Applications:
- Credit scoring: Predict default probability (Y = default/no default) from borrower characteristics (income, debt ratio, credit history)
- Stock return prediction: Predict next-month return from factors (value, momentum, quality)
- Fraud detection: Classify transactions as fraudulent or legitimate
- Earnings forecasting: Predict quarterly EPS from fundamental and market data
Common Algorithms:
- Linear/logistic regression (simplest, most interpretable)
- Decision trees and random forests
- Support vector machines
- Neural networks (most flexible, least interpretable)
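To make the X -> Y idea concrete, here is a minimal sketch of the credit scoring case using logistic regression fitted by plain gradient descent. The borrower data, the feature choice (debt ratio and income), and the function names are all hypothetical and chosen only for illustration; in practice an analyst would more likely use a library implementation and far more data.

```python
import math

# Hypothetical toy data: (debt ratio, income in $100k) per borrower.
borrowers = [
    (0.90, 0.3), (0.80, 0.4), (0.70, 0.2), (0.85, 0.5),
    (0.20, 0.9), (0.30, 1.2), (0.10, 0.8), (0.25, 1.0),
]
# Known labels (Y): 1 = defaulted, 0 = did not default.
defaulted = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Estimated probability of default for one borrower."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent on log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(X, y):
            error = predict_proba(weights, bias, x) - target
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


weights, bias = fit_logistic(borrowers, defaulted)

# Score two new (hypothetical) applicants the model has not seen.
risky = predict_proba(weights, bias, (0.95, 0.25))
safe = predict_proba(weights, bias, (0.15, 1.10))
print(f"P(default) risky applicant: {risky:.2f}")
print(f"P(default) safe applicant:  {safe:.2f}")
```

The fitted weights are what make this model interpretable: a positive weight on debt ratio and a negative weight on income can be read directly as the direction of each feature's effect on default risk.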
Unsupervised Learning:
You only have input features — no target variable. The algorithm finds hidden patterns, groupings, or structure in the data.
Finance Applications:
- Portfolio clustering: Group stocks by return behavior (not just sector) to build truly diversified portfolios
- Regime detection: Identify market regimes (bull/bear/sideways) from price and volatility patterns
- Anomaly detection: Flag unusual trading patterns without pre-defining what 'unusual' means
- Risk factor discovery: Find hidden factors driving asset returns beyond traditional Fama-French factors
Common Algorithms:
- K-means clustering
- Principal Component Analysis (PCA)
- Hierarchical clustering
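As a rough illustration of clustering without labels, the sketch below runs a basic k-means loop on a handful of stocks described by volatility and beta. The tickers, feature values, and starting centroids are hypothetical; real applications would typically standardize the features and use many more observations.

```python
import math

# Hypothetical features per stock: (annualized volatility, market beta).
stocks = {
    "UTIL_A": (0.12, 0.50), "UTIL_B": (0.14, 0.60), "STAPLE_C": (0.13, 0.55),
    "TECH_D": (0.35, 1.40), "TECH_E": (0.40, 1.50), "BIOTECH_F": (0.38, 1.30),
}


def kmeans(points, centroids, iterations=20):
    """Plain k-means: alternate assignment and centroid updates until stable."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


names = list(stocks)
points = [stocks[n] for n in names]
# Starting centroids are picked by hand here so the run is deterministic.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])

clusters = {}
for name, lab in zip(names, labels):
    clusters.setdefault(lab, []).append(name)
print(clusters)
```

Note that no sector or "defensive/growth" label is ever supplied: the grouping comes entirely from the features, which is why a biotech name can land in the same cluster as tech stocks.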
Hybrid Approach — Semi-Supervised:
In practice, financial datasets often contain a few labeled examples and many unlabeled ones. Semi-supervised methods use both: they train on the labeled examples and use the unlabeled data to refine the model's picture of the underlying data distribution.
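One common semi-supervised technique is self-training (pseudo-labeling), sketched below with a simple nearest-centroid classifier. This is only one of several possible approaches, and the transaction features, the `margin` threshold, and the function names are hypothetical choices made for illustration.

```python
import math

# Hypothetical features per transaction: (amount z-score, time-of-day z-score).
# Label 1 = confirmed fraud, 0 = confirmed legitimate.
labeled = [((2.5, 2.0), 1), ((2.8, 1.7), 1), ((-0.5, -0.3), 0), ((0.1, 0.2), 0)]
unlabeled = [(2.2, 1.9), (3.0, 2.4), (-0.2, 0.1), (0.3, -0.4), (1.2, 1.0)]


def class_centroids(examples):
    """Mean feature vector of each class."""
    centroids = {}
    for label in {y for _, y in examples}:
        members = [x for x, y in examples if y == label]
        centroids[label] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids


def self_train(labeled, unlabeled, margin=1.0, rounds=5):
    """Repeatedly pseudo-label confident points and refit the centroids."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        centroids = class_centroids(labeled)
        confident, uncertain = [], []
        for x in remaining:
            dists = sorted((math.dist(x, c), lab) for lab, c in centroids.items())
            # Accept a pseudo-label only if the nearest centroid is clearly
            # closer than the runner-up; otherwise leave the point unlabeled.
            if dists[1][0] - dists[0][0] >= margin:
                confident.append((x, dists[0][1]))
            else:
                uncertain.append(x)
        if not confident:
            break
        labeled.extend(confident)
        remaining = uncertain
    return class_centroids(labeled), remaining


centroids, still_unlabeled = self_train(labeled, unlabeled)
print("centroids:", centroids)
print("left unlabeled:", still_unlabeled)
```

The margin check is the key design choice: ambiguous points stay unlabeled rather than feeding a possibly wrong pseudo-label back into the model.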
Key Tradeoffs for Analysts:
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Data requirement | Labeled data (expensive) | Unlabeled data (abundant) |
| Evaluation | Clear metrics (accuracy, RMSE) | Subjective (are clusters meaningful?) |
| Interpretability | Varies by algorithm | Often harder to interpret |
| Overfitting risk | High (model can fit noise in the training data) | Present but harder to detect (e.g., too many clusters; no labels to validate against) |
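The evaluation row can be illustrated with a short sketch. Accuracy and RMSE compare predictions against known labels, while a clustering score such as within-cluster sum of squares has no ground truth to compare against. All numbers below are hypothetical.

```python
import math


def accuracy(y_true, y_pred):
    """Share of classifications that match the known labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Supervised, classification: default (1) vs. no default (0).
acc = accuracy([1, 0, 0, 1, 0], [1, 0, 1, 1, 0])

# Supervised, regression: actual vs. predicted monthly returns.
err = rmse([0.02, -0.01, 0.03], [0.01, 0.00, 0.01])

# Unsupervised: the same four points scored with two clusters, then four.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
wcss_k2 = within_cluster_ss(points, [0, 0, 1, 1], [(0, 1), (10, 1)])
wcss_k4 = within_cluster_ss(points, [0, 1, 2, 3], points)

print(f"accuracy={acc:.2f}  rmse={err:.4f}  wcss(k=2)={wcss_k2}  wcss(k=4)={wcss_k4}")
```

Within-cluster sum of squares keeps falling as clusters are added, reaching zero when every point is its own cluster, so a lower score alone does not show that the clusters are meaningful. That is why judging unsupervised results still requires analyst judgment.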