Hypothesis Testing for CFA Level I: A Complete Walkthrough
If hypothesis testing is the part of Quantitative Methods that keeps tripping you up, you are not alone. The mechanics are simple in isolation — pick a test, compute a statistic, compare against a threshold — but on the exam those pieces appear inside a wordy stem that can mislead a candidate who has not built a clean process. This guide walks through the entire workflow the way it actually shows up on Level I, including the small judgment calls that distinguish a candidate who has done five mocks from one who has done fifty.
We will use a fictional asset manager, Brenwood Capital, for the step-by-step walkthrough. Brenwood runs a US large-cap equity strategy and uses statistical screens to evaluate whether new factors materially improve returns. That running example lets us tie every step of hypothesis testing back to a decision a real portfolio manager would actually make. A second fictional firm appears later in the full worked example.
Why Hypothesis Testing Is Tested So Often
CFA Level I uses hypothesis testing in two places: the dedicated Quantitative Methods reading, and embedded inside FRA, Equity, Fixed Income, and Portfolio Management questions where the test is implicit. That second category is where most candidates lose points. A vignette will hand you a sample of returns and ask whether they support a manager's claim — without ever using the word "hypothesis." If you have not internalized the framework, you may answer the wrong question entirely.
The Five-Step Framework
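Every hypothesis testing question follows the same five steps. Memorize the sequence and run it like a checklist.
- Step 1: State the null and alternative hypotheses.
- Step 2: Choose the significance level.
- Step 3: Select the test statistic and compute it.
- Step 4: Compare the statistic to the critical value (or the p-value to α).
- Step 5: State the decision and the economic conclusion.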
A surprising number of mistakes happen before step 3. Candidates rush past steps 1 and 2 because they feel mechanical, then discover at step 4 that the test statistic was set up against the wrong tail. Slow down on the early steps. They are not where time is wasted.
Step 1: Writing the Null Hypothesis Without Second-Guessing
The null hypothesis (H₀) is always the status quo — the claim that nothing has changed, that the parameter equals some target, or that two groups are the same. The alternative (H₁) is what you would need evidence to support.
The rule that survives exam pressure: the null always contains the equality. Whatever the question is asking you to "prove" goes into H₁.
- A manager claims her strategy beats a 1.0% monthly benchmark → H₀: μ ≤ 1.0%, H₁: μ > 1.0% (one-tailed, right side)
- An analyst suspects two industries have different mean P/E ratios → H₀: μ₁ = μ₂, H₁: μ₁ ≠ μ₂ (two-tailed)
- A regulator suspects a fund's tracking error exceeds its 50 bps limit → H₀: σ ≤ 0.005, H₁: σ > 0.005 (one-tailed, right side)
The verb in the prompt is the tell. "Beats", "exceeds", "is greater than" → right-tail. "Is less than", "underperforms" → left-tail. "Differs from", "is not equal" → two-tail.
At Brenwood Capital, lead researcher Priya Anand wants to test whether a new low-volatility screen produces a higher mean monthly return than the 0.85% baseline. The hypotheses write themselves: H₀: μ ≤ 0.85%, H₁: μ > 0.85%. One-tailed, right side.
Step 2: Choosing the Significance Level
The significance level α is the probability of rejecting H₀ when it is actually true — in other words, the probability of a Type I error. On Level I, α is almost always handed to you (1%, 5%, or 10%). When the question does not specify, default to 5%.
The trade-off is direct: lowering α makes rejection harder, which reduces Type I error but increases Type II error (failing to reject a false null). The two errors move in opposite directions for a fixed sample size — only increasing n improves both.
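The trade-off can be made concrete with a small sketch. It assumes a right-tailed z-test with a known population standard deviation, which is a simplification chosen so that only the Python standard library is needed. The function name `type_ii_error` and the "true" mean of 1.10% are illustrative assumptions, not figures from an exam question.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(alpha, mu0, mu_true, sigma, n):
    """Beta for a right-tailed z-test of H0: mu <= mu0 when the true mean is mu_true."""
    std_err = sigma / sqrt(n)
    # Smallest sample mean that leads to rejection at this alpha.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * std_err
    # Beta is the chance the sample mean lands below that cutoff under the true mean.
    return NormalDist(mu_true, std_err).cdf(cutoff)

# Hypothetical inputs in percent per month: baseline 0.85, true mean 1.10, sigma 0.62.
betas_n25 = {a: type_ii_error(a, 0.85, 1.10, 0.62, 25) for a in (0.10, 0.05, 0.01)}
beta_n100 = type_ii_error(0.01, 0.85, 1.10, 0.62, 100)

for a, b in betas_n25.items():
    print(f"alpha = {a:.2f}, n = 25:  beta = {b:.3f}")
print(f"alpha = 0.01, n = 100: beta = {beta_n100:.3f}")
```

With n fixed at 25, β rises each time α is tightened. Only the larger sample brings β back down without loosening α.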
Brenwood uses α = 0.05 for all internal factor research. That is a deliberate choice: a tighter level (0.01) would cause too many genuinely good factors to be missed (more Type II errors); a looser level (0.10) would approve too many factors that are just noise (more Type I errors).
Step 3: Selecting the Right Test Statistic
This is the step where a decision tree beats memorization.
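One way to write such a decision tree down is sketched here, assuming the rules commonly taught for a test of a single mean. The function name `choose_test_statistic` and the n ≥ 30 cutoff for a "large" sample are illustrative assumptions (the cutoff is a rule of thumb, not a hard boundary).

```python
def choose_test_statistic(normal_population, variance_known, n, large_n=30):
    """Pick a statistic for a single-mean test using simplified textbook rules."""
    large_sample = n >= large_n  # rule-of-thumb threshold, an assumption
    if not normal_population and not large_sample:
        # Small sample from a non-normal population: neither z nor t is reliable.
        return "no standard z or t test"
    if variance_known:
        return "z"
    # Unknown variance: t is the default (z is a common approximation when n is large).
    return "t"

print(choose_test_statistic(normal_population=True, variance_known=False, n=25))  # t
```

The first question is always whether the population variance is known. On Level I it almost never is, which is why the t-test dominates.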
For a one-sample mean test with unknown population variance, the t-statistic is:
t = (x̄ − μ₀) / (s / √n)
with degrees of freedom n − 1.
Brenwood's research team collected 25 monthly returns from the low-volatility screen. Sample mean = 1.10%. Sample standard deviation = 0.62%. Plugging in:
t = (1.10 − 0.85) / (0.62 / √25) = 0.25 / 0.124 = 2.02
Degrees of freedom = 24.
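The same arithmetic can be checked with a few lines of code. This is a minimal sketch; the helper name `one_sample_t` is a hypothetical choice, and the inputs are the Brenwood figures quoted above, in percent.

```python
from math import sqrt

def one_sample_t(sample_mean, hypothesized_mean, sample_std, n):
    """Return the one-sample t-statistic and its degrees of freedom."""
    std_err = sample_std / sqrt(n)  # standard error of the sample mean
    t_stat = (sample_mean - hypothesized_mean) / std_err
    return t_stat, n - 1

t_stat, df = one_sample_t(1.10, 0.85, 0.62, 25)
print(f"t = {t_stat:.2f}, df = {df}")  # t = 2.02, df = 24
```

Keeping every input in the same unit (here, percent) matters more than which unit you pick, because the unit cancels in the ratio.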
Step 4: Critical Values and p-values
You have two interchangeable ways to make the reject/fail-to-reject decision:
Method A — Compare test statistic to critical value. From a t-table with df = 24 and α = 0.05 (one-tailed), the critical value is 1.711. Brenwood's t = 2.02 exceeds 1.711, so we reject H₀.
Method B — Compare p-value to α. The p-value is the probability of observing a test statistic at least as extreme as the one calculated, assuming H₀ is true. For t = 2.02 with df = 24, the one-tailed p-value is approximately 0.027. Since 0.027 < 0.05, we reject H₀.
The two methods always agree when applied correctly. The exam often provides only one of them, so be comfortable with both.
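To see both methods side by side, here is a sketch that uses only the standard library. Because the standard library has no t-distribution, the p-value is approximated by integrating the t density numerically with Simpson's rule. The function names `t_pdf` and `right_tail_p_value` are assumptions made for this illustration, and the critical value is the table figure quoted above.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def right_tail_p_value(t_stat, df, steps=2000):
    """Approximate P(T >= t_stat) for t_stat >= 0 using Simpson's rule on [0, t_stat]."""
    h = t_stat / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric around zero, so the area to the right of zero is 0.5.
    return 0.5 - area_zero_to_t

t_stat, df, alpha = 2.02, 24, 0.05
critical_value = 1.711  # one-tailed table value for df = 24

p_value = right_tail_p_value(t_stat, df)
reject_by_critical_value = t_stat > critical_value  # Method A
reject_by_p_value = p_value < alpha                 # Method B

print(f"p-value = {p_value:.3f}")
print(reject_by_critical_value, reject_by_p_value)
```

Both flags come out the same, which is the point: the critical value and the p-value are two views of the same tail area.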
A subtle but high-value rule: the p-value is the smallest α at which you would reject the null. If the p-value is 0.027, you would reject at α = 0.05 or α = 0.10, but fail to reject at α = 0.01. Many exam questions test exactly this.
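That rule reduces to a single comparison per significance level. A minimal sketch using the Brenwood p-value quoted above:

```python
p_value = 0.027  # one-tailed p-value from the Brenwood example

# Reject H0 whenever the p-value is below the chosen significance level.
decisions = {alpha: p_value < alpha for alpha in (0.01, 0.05, 0.10)}

for alpha, reject in decisions.items():
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {verdict}")
```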
Step 5: Writing the Conclusion
Never stop at "reject H₀." Write the economic conclusion. In Brenwood's case: at the 5% significance level, there is sufficient evidence that the low-volatility screen produces a higher mean monthly return than the 0.85% baseline. That sentence is the answer the exam is asking for — the rest is procedure.
Type I, Type II, and Power — The Three Concepts Most Often Confused
| Concept | Definition | Symbol |
|---|---|---|
| Type I error | Rejecting H₀ when H₀ is true | α |
| Type II error | Failing to reject H₀ when H₀ is false | β |
| Power | Probability of rejecting H₀ when H₁ is true | 1 − β |
| Confidence level | Probability of failing to reject H₀ when H₀ is true | 1 − α |
Three relationships are worth committing to memory:
- α and β trade off for a fixed sample size. You cannot simultaneously lower both without collecting more data.
- Power increases with sample size, all else equal. A larger n produces a tighter sampling distribution, which makes true differences easier to detect.
- Power increases as the true effect size increases. Detecting a tiny edge requires a large sample; detecting a huge edge can be done with very little.
A common trap: a question states that the analyst doubled the sample size and asks what happened to power. The right answer is "increased" — but recognize that doubling n does not double power. The improvement is non-linear and depends on where you started.
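The non-linearity is easy to see numerically. The sketch below assumes a right-tailed z-test with a known standard deviation, a simplification that keeps it within the standard library. The function name `power_right_tailed` and the "true" mean of 0.95% are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed(alpha, mu0, mu_true, sigma, n):
    """Power of a right-tailed z-test of H0: mu <= mu0 when the true mean is mu_true."""
    std_err = sigma / sqrt(n)
    # Smallest sample mean that leads to rejection at this alpha.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * std_err
    # Power is the chance the sample mean lands above the cutoff under the true mean.
    return 1 - NormalDist(mu_true, std_err).cdf(cutoff)

# Hypothetical inputs in percent per month: baseline 0.85, true mean 0.95, sigma 0.62.
powers = {n: power_right_tailed(0.05, 0.85, 0.95, 0.62, n) for n in (25, 50, 100, 200)}

for n, power in powers.items():
    print(f"n = {n:3d}: power = {power:.3f}")
```

Each doubling of n raises power, but by a different amount each time, and never by a factor of two in this example.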
A Worked Example, Start to Finish
Scenario. Lattimer Asset Management evaluates whether a momentum signal beats the firm's 0.50% monthly baseline. They collect 36 monthly observations: x̄ = 0.68%, s = 0.84%. Significance level: 5%. Should they adopt the signal?
Step 1. H₀: μ ≤ 0.50%. H₁: μ > 0.50%. One-tailed, right side.
Step 2. α = 0.05.
Step 3. Population variance unknown, n = 36 (large but not enormous). Use a t-test. df = 35.
t = (0.68 − 0.50) / (0.84 / √36) = 0.18 / 0.14 = 1.286.
Step 4. Critical t (df = 35, one-tailed, α = 0.05) ≈ 1.690. Since 1.286 < 1.690, do not reject H₀.
Step 5. At the 5% significance level, there is insufficient evidence that the momentum signal beats the 0.50% baseline.
Notice that Lattimer's sample mean (0.68%) is higher than the baseline. A weak candidate stops there and concludes the signal works. A strong candidate sees that the sample standard deviation is too large to be confident the difference is real — the test is doing exactly the job it is designed to do.
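The Lattimer calculation can be reproduced in a few lines. This is a sketch: the helper name `right_tailed_t_test` is hypothetical, and the critical value is passed in from the t-table figure quoted in Step 4 rather than computed.

```python
from math import sqrt

def right_tailed_t_test(sample_mean, mu0, sample_std, n, critical_value):
    """Return the t-statistic, degrees of freedom, and the reject decision."""
    t_stat = (sample_mean - mu0) / (sample_std / sqrt(n))
    return t_stat, n - 1, t_stat > critical_value

t_stat, df, reject = right_tailed_t_test(0.68, 0.50, 0.84, 36, critical_value=1.690)
evidence = "sufficient" if reject else "insufficient"

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject}")  # t = 1.286, df = 35, reject H0: False
print(f"At the 5% level there is {evidence} evidence that the signal beats 0.50%.")
```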
The Three Mistakes That Cost the Most Points
- Wrong tail. Setting up a two-tailed test when the prompt asked a one-tailed question (or vice versa) inflates or deflates the critical value enough to flip the decision.
- Wrong test statistic. Using a z-test when n is small and variance is unknown. The t-distribution has fatter tails, so the critical value is larger — using z falsely shrinks the threshold and over-rejects.
- Confusing α and β. Anchor both errors to the null before labeling them. When a regulator wants to minimize the chance of approving a faulty fund and the null is that the fund is faulty, approving means rejecting H₀, so the worry is Type I error (incorrectly approving). When a manager wants to make sure a good factor is not missed and the null is that the factor adds nothing, missing it means failing to reject a false H₀, so the worry is Type II error.
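The first two mistakes can each flip a decision, as this sketch shows. The one-tailed critical value (1.711) and the Brenwood statistic (2.02) come from the walkthrough above. The two-tailed value (2.064) is the standard t-table entry for df = 24 with 2.5% in each tail, and the borderline statistic of 1.68 is a hypothetical number chosen to sit between the z and t thresholds.

```python
from statistics import NormalDist

alpha = 0.05
t_stat = 2.02  # Brenwood example, df = 24

# Mistake 1: wrong tail.
t_crit_one_tailed = 1.711  # df = 24, 5% in the right tail
t_crit_two_tailed = 2.064  # df = 24, 2.5% in each tail
reject_one_tailed = t_stat > t_crit_one_tailed       # correct setup
reject_two_tailed = abs(t_stat) > t_crit_two_tailed  # wrong setup for this question

# Mistake 2: wrong test statistic (z threshold used where t is required).
z_crit_one_tailed = NormalDist().inv_cdf(1 - alpha)  # about 1.645
borderline_stat = 1.68  # hypothetical statistic, also with df = 24
reject_with_z = borderline_stat > z_crit_one_tailed  # over-rejects
reject_with_t = borderline_stat > t_crit_one_tailed  # correct threshold

print(reject_one_tailed, reject_two_tailed)  # True False
print(reject_with_z, reject_with_t)          # True False
```

In both pairs the data are identical. Only the setup changed, and that was enough to reverse the answer.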
A 90-Minute Final-Week Plan for Hypothesis Testing
If hypothesis testing is on your weak list and the exam is one week away:
- Day 1 (30 min): Re-derive the t-statistic and z-statistic formulas from scratch on a blank page. If you cannot, that is your gap.
- Day 2 (30 min): Work through 10 mixed problems, alternating one-tailed and two-tailed, alternating means/variances/proportions. Time yourself: 90 seconds per question.
- Day 3 (30 min): Review every miss from Day 2. Write one sentence on why you missed it. Patterns will emerge.
The point of the plan is not to study more. It is to convert hypothesis testing from a topic you reread into a process you execute on demand.
Practice Now
Ready to put the framework into practice? Try our CFA Level I question bank — the hypothesis testing items are written with the same trap categories above, so each miss maps to a fixable habit. If you get stuck on the setup, the explanations walk through which step in the five-step framework you skipped.