Factor Betas Are Path-Dependent Regression Artifacts: Falsifying Fama-French with 6,560 Controlled Experiments
Kevin B. Burk
TSF Inc.
ORCID: 0009-0005-5343-7913
January 2026
Abstract
Fama and French (1996) called momentum “the main embarrassment” of their three-factor model—the one timing anomaly their framework could not explain away. We demonstrate that this embarrassment was not an exception but a warning: the entire factor-based methodology for dismissing timing anomalies is invalid. Using 6,560 controlled experiments where paired portfolios hold identical stocks with identical weights entered on identical dates—differing only in exit timing—we show that factor loadings fail equivalence tests at an 86% rate. The failures are not marginal: loading differences exceed 0.45 in magnitude. Random entry strategies with zero predictive content produce massive loading shifts. A 10-stock single-sector control eliminates diversification explanations and still shows 96% failure. These results prove that factor loadings measure return-path covariances, not holdings-based risk exposure. The standard interpretation—that loadings reveal what risks a portfolio bears—is empirically false when timing varies. Every paper that used Fama-French regressions to conclude “alpha explained by factor exposure” for a timing strategy was applying a tool incapable of making that determination. Momentum survived as an acknowledged anomaly only because the broken dismissal tool happened to fail visibly for that case. The Halloween effect, January effect, and hundreds of other timing signals were dismissed by the same invalid procedure. Thirty years of factor-based anomaly evaluation must be reconsidered (Preregistered: https://doi.org/10.5281/zenodo.18304121).
Keywords: Factor models, market timing, Fama-French, equivalence testing, methodology
JEL Classification: G11, G12, G14, C12
1. Introduction
For three decades, the Fama-French factor model (Fama and French, 1993, 2015) has served as the gatekeeper for anomaly research. When a trading strategy generates positive returns, the standard academic procedure is to regress those returns against factor portfolios. If the resulting alpha shrinks or factor loadings appear elevated, the strategy is dismissed as merely harvesting known risk premiums. This methodology has been used to reject hundreds of market timing signals, calendar anomalies, and technical indicators. The implicit message: factor models separate genuine skill from factor exposure, and most claimed anomalies fail this test.
This entire framework rests on an assumption that has never been directly tested: that factor loadings measure a portfolio’s exposure to systematic risk factors determined by what securities the portfolio holds. Under this assumption, two portfolios holding identical securities should exhibit equivalent factor loadings regardless of when those positions are entered or exited. The loadings should reflect the portfolio’s composition, not its trading behavior.
We test this assumption directly. Using a fully controlled experimental design, we construct portfolio pairs that hold identical stocks with identical weights entered at identical times. The only difference between paired portfolios is exit timing: one follows native benchmark exit rules while the other applies a graduated profit-taking schedule. If factor loadings measure holdings-based risk exposure, these paired portfolios must show statistically equivalent factor betas.
They do not. Across 6,560 controlled experiments spanning four stock universes, factor loadings fail equivalence tests at an 86% rate. The failures are not marginal: we observe factor loading differences exceeding 0.45 in magnitude. Most damningly, a random entry strategy with zero information content produces massive factor loading shifts when only exit timing varies. A 10-stock single-sector control eliminates any diversification-based explanation, showing a 96% failure rate on scheduled entries.
These results constitute direct empirical falsification of the assumption underlying three decades of factor-based performance evaluation. If factor loadings change based on when you exit rather than what you hold, then factor regressions cannot distinguish timing skill from risk exposure. Every paper that has used Fama-French regressions to dismiss timing-based anomalies was applying a methodologically invalid test. The standard for anomaly dismissal—“alpha explained by factor exposure”—is not merely imprecise; it is fundamentally broken when timing varies.
2. Literature and Motivation
The Fama-French three-factor model (1993) extended the Capital Asset Pricing Model by adding size (SMB) and value (HML) factors to explain cross-sectional return variation. The five-factor model (2015) added profitability (RMW) and investment (CMA) factors. Carhart (1997) introduced momentum (MOM) as an additional factor. These models rapidly became the standard tools for performance attribution and alpha measurement—and, critically, for dismissing anomalies.
The typical application proceeds as follows: regress strategy returns against factor returns, interpret factor loadings as risk exposures, and attribute any alpha reduction to risk compensation rather than skill. This approach has been used to dismiss dozens of anomalies including momentum profits (Grundy and Martin, 2001), post-earnings-announcement drift (Bernard and Thomas, 1989), and various calendar effects. The interpretive chain is seductively simple: if alpha disappears when you control for factors, the alpha was “explained by” factor exposure, and the anomaly is spurious.
However, this interpretive framework assumes—without ever testing—that factor loadings are determined by portfolio composition. Specifically, a portfolio’s SMB loading should reflect its size tilt, HML loading its value tilt, and so forth. If a portfolio holds the same securities, its factor exposures should remain constant regardless of trading timing. This assumption is necessary for factor loadings to serve as evidence about risk exposure. Without it, the entire interpretive edifice collapses.
We are not aware of any prior study that has directly tested this assumption using controlled experiments where holdings are held constant while timing varies. Our contribution is to provide this test and document its failure.
2.1 Why Path-Dependence Is Mathematically Inevitable
Before presenting our empirical results, we establish why factor loading path-dependence is not a bug but a mathematical inevitability. Consider the standard factor regression:
Rp,t − Rf,t = α + βMKT(RMKT,t − Rf,t) + βSMB(SMBt) + βHML(HMLt) + … + εt
The estimated betas are OLS coefficients that minimize squared residuals over the estimation window. Critically, βSMB is estimated as:
βSMB = Cov(Rp, SMB) / Var(SMB)
The numerator is the covariance between portfolio returns and the SMB factor. This covariance depends entirely on WHEN portfolio returns are realized, not on what securities generate them. Two portfolios holding identical securities will have different return series if they exit at different times, and therefore different covariances with any factor.
This is not a limitation of the Fama-French model specifically. It is inherent to ANY regression-based factor decomposition. The regression operates on realized returns. It cannot see holdings, weights, or intentions. It only sees the time series of returns, and returns depend on exit timing.
The implication is devastating: factor loadings from return regressions CANNOT be interpreted as properties of portfolio holdings when timing varies. They are properties of return paths. The standard interpretation of factor loadings as risk exposures determined by what you hold assumes static or very slowly varying portfolios. When timing is the strategy, this assumption fails by construction. This is not a subtle statistical point—it is a mathematical inevitability that invalidates the primary tool used to evaluate timing strategies for the past three decades.
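The covariance argument above can be verified with a six-day toy example. The following Python sketch (illustrative only; the paper's replication code is in R) constructs a portfolio whose true factor loading is exactly 1.0, then zeroes out the second half of the return series to mimic an early exit to cash. The estimated beta roughly halves even though the holdings, while held, were identical.

```python
def beta(portfolio, factor):
    """OLS slope: sample Cov(R_p, F) / Var(F)."""
    n = len(factor)
    mp = sum(portfolio) / n
    mf = sum(factor) / n
    cov = sum((p - mp) * (f - mf) for p, f in zip(portfolio, factor)) / (n - 1)
    var = sum((f - mf) ** 2 for f in factor) / (n - 1)
    return cov / var

# A toy factor: three up days, then three down days (mean zero).
factor = [0.02, 0.02, 0.02, -0.02, -0.02, -0.02]

# NATIVE: hold a stock with true loading 1.0 for all six days.
native = factor[:]

# PATIENCE: identical holdings, but exit to cash after day 3
# (e.g. a profit target was hit); the portfolio earns 0 while in cash.
patience = [0.02, 0.02, 0.02, 0.0, 0.0, 0.0]

print(beta(native, factor))    # exactly 1.0
print(beta(patience, factor))  # ~0.5: the measured loading halves
```

Nothing about the stock changed; only the days on which its returns were realized changed. The regression nonetheless reports half the factor exposure.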
3. Methodology
3.1 Experimental Design
Our experimental design isolates exit timing as the sole variable. For each benchmark entry strategy, we construct two portfolios:
NATIVE Portfolio: Enter positions according to benchmark signal, exit immediately according to native benchmark rules.
PATIENCE Portfolio: Enter positions according to identical benchmark signal on identical dates, exit according to a graduated profit-taking schedule.
The profit-taking schedule is deterministic and publicly disclosed:
| Holding Period | Exit Target |
|---|---|
| 0–59 days | 15% profit |
| 60–89 days | 10% profit |
| 90–119 days | 5% profit |
| 120–179 days | 2% profit |
| 180+ days | Forced exit |
This design ensures that any difference in measured factor loadings between NATIVE and PATIENCE portfolios is attributable solely to exit timing, not to differences in security selection, position sizing, or entry timing.
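As a concrete reading of the schedule, the following minimal Python sketch encodes the exit decision (the function name is ours; the paper's actual replication code is in R):

```python
def patience_exit(holding_days: int, unrealized_return: float) -> bool:
    """Graduated profit-taking schedule from Section 3.1.

    Returns True if the position should be exited today.
    """
    if holding_days >= 180:
        return True                       # forced exit
    if holding_days >= 120:
        return unrealized_return >= 0.02  # 2% profit target
    if holding_days >= 90:
        return unrealized_return >= 0.05  # 5% profit target
    if holding_days >= 60:
        return unrealized_return >= 0.10  # 10% profit target
    return unrealized_return >= 0.15      # 0-59 days: 15% profit target
```

Because the rule is a deterministic function of holding period and unrealized gain alone, it injects no information about securities, signals, or market state.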
3.2 Benchmark Strategies
We test 28 benchmark entry strategies spanning two categories:
Scheduled (Calendar-Based) Entries (10 strategies): sell_in_may, january, turn_of_month, sept_avoid, end_of_quarter, weekend, friday, random_21d, random_63d, random_126d.
Variable (Signal-Based) Entries (18 strategies): bollinger, rsi, ma_cross, macd, donchian, roc, dip_5, dip_10, sd_2, sd_3, down_3, down_5, 52w_low, 52w_high, below_ma, above_ma, random_tsf, random_uniform.
The random strategies (random_21d, random_63d, random_126d, random_tsf, random_uniform) serve as critical controls. These strategies have zero information content by construction. Any systematic difference in factor loadings between NATIVE and PATIENCE versions of random strategies cannot reflect differential risk exposure and must therefore represent path-dependent regression artifacts.
3.3 Stock Universes
The stock universe for this study consists of 346 S&P 500 constituents with complete 30-year price histories. This constraint arises from companion studies in the same research program that employ proprietary forecasting models requiring 30 years of historical data to generate 20 years of statistically valid forecasts. For consistency across the research program, we apply the same universe constraint here. However, this constraint is not methodologically necessary for the present study—the experimental design uses only publicly available benchmark entry signals and a fully disclosed exit rule, and we expect similar results would obtain with any liquid equity universe.
We test across four universes derived from these 346 stocks:
FULL: All 346 stocks. This is the master universe from which all subsets are drawn.
DEFENSIVE: Low-volatility sectors (Utilities, Consumer Staples, Healthcare). Tests whether results depend on volatility characteristics.
AGGRESSIVE: High-volatility sectors (Technology, Consumer Discretionary, Financials). Provides contrast with defensive universe.
COMMUNICATION SERVICES: Single sector with only 10 stocks. Eliminates any diversification confound and tests whether results persist with minimal holdings.
3.4 Statistical Testing
We employ Two One-Sided Tests (TOST) equivalence testing with bounds of ±0.10 for factor loadings. Under TOST, p < 0.05 rejects the null hypothesis of non-equivalence, supporting the conclusion that factor loadings are statistically equivalent within the specified bounds. Failure to reject (p ≥ 0.05) means equivalence cannot be confirmed.
This approach is more appropriate than traditional difference testing for our research question. We want to confirm equivalence, not merely fail to detect differences. TOST directly tests whether loadings fall within practically equivalent bounds.
All p-values are adjusted using the Benjamini-Hochberg procedure to control false discovery rate across multiple comparisons.
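The two procedures can be sketched as follows. This Python sketch (the paper's replication code is in R, and the paper does not state its exact test statistic, so the large-sample normal approximation below is our assumption) computes a TOST p-value for a loading difference and Benjamini-Hochberg adjusted p-values:

```python
from statistics import NormalDist

def tost_p(delta_hat: float, se: float, bound: float = 0.10) -> float:
    """TOST p-value for equivalence of a factor loading difference.

    Null hypothesis: |delta| >= bound (non-equivalence).
    A small p-value supports equivalence within +/- bound.
    Large-sample normal approximation (our assumption, not the paper's spec).
    """
    z_lower = (delta_hat + bound) / se          # H0a: delta <= -bound
    z_upper = (delta_hat - bound) / se          # H0b: delta >= +bound
    p_lower = 1.0 - NormalDist().cdf(z_lower)   # one-sided p for H0a
    p_upper = NormalDist().cdf(z_upper)         # one-sided p for H0b
    return max(p_lower, p_upper)                # TOST p = larger of the two

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # walk from the largest p downward
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# A near-zero loading difference with a tight standard error confirms
# equivalence; a 0.457 difference (the extreme case in Table 2) does not.
print(tost_p(0.0, 0.02))    # well below 0.05: equivalence confirmed
print(tost_p(0.457, 0.05))  # near 1.00: equivalence fails
```

The illustrative standard errors above are hypothetical; the qualitative behavior (small differences pass, differences far outside the bounds yield p near 1) matches the pattern of TOST p-values reported in Table 2.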
3.5 Time Periods
We test across four time periods: 2006–2015 (including financial crisis), 2016–2025 (post-crisis bull market), 2006–2025 (full sample), and 2023–2025 (recent subsample). This ensures results are not driven by any single market regime.
3.6 Preregistration
This study was preregistered prior to data analysis. The preregistration document specifies hypotheses HFF4 through HFF8 testing whether factor loadings (SMB, HML, RMW, CMA, MOM) remain equivalent between NATIVE and PATIENCE portfolios. The preregistration explicitly states the interpretive framework: if controls fail, this reveals a fundamental limitation of factor models. Preregistration available at: https://doi.org/10.5281/zenodo.18304121
3.7 Data and Code Availability
Complete replication code for this study is available at: https://github.com/kbburk-TSF/ff-factor-falsification. The repository includes all R scripts for backtesting, statistical analysis, and results generation. The code can be executed with any historical price dataset; no proprietary data or signals are required to replicate these findings.
This study uses zero proprietary entry signals. All benchmark strategies are fully specified in the code and can be computed from standard price data (open, high, low, close, volume). The patience gradient exit rule is deterministic and publicly disclosed. Fama-French factor returns are obtained from Kenneth French’s data library. Any researcher with access to equity price data can independently verify all results.
4. Results
4.1 Aggregate Results
Table 1 presents aggregate equivalence test results across all universes and entry types.
Table 1: Factor Loading Equivalence Test Results
| Universe | Entry Type | N Tests | Passed | Failed | Pass Rate | Failure Rate |
|---|---|---|---|---|---|---|
| DEFENSIVE | Scheduled | 200 | 19 | 181 | 9.5% | 90.5% |
| DEFENSIVE | Variable | 1,440 | 104 | 1,336 | 7.2% | 92.8% |
| AGGRESSIVE | Scheduled | 200 | 5 | 195 | 2.5% | 97.5% |
| AGGRESSIVE | Variable | 1,440 | 237 | 1,203 | 16.5% | 83.5% |
| FULL | Scheduled | 200 | 15 | 185 | 7.5% | 92.5% |
| FULL | Variable | 1,440 | 247 | 1,193 | 17.2% | 82.8% |
| COMM SVCS | Scheduled | 200 | 8 | 192 | 4.0% | 96.0% |
| COMM SVCS | Variable | 1,440 | 295 | 1,145 | 20.5% | 79.5% |
| TOTAL | All | 6,560 | 930 | 5,630 | 14.2% | 85.8% |
Note: TOST equivalence test with ±0.10 bounds. Pass = p < 0.05 confirming equivalence.
Across 6,560 controlled experiments, factor loadings fail equivalence tests at an 86% rate. Only 14% of tests confirm that NATIVE and PATIENCE portfolios have statistically equivalent factor exposures—despite holding identical securities with identical weights entered on identical dates. The assumption underlying three decades of factor-based anomaly dismissal fails in 86% of controlled tests.
Failure rates vary by universe and entry type but remain consistently high. The AGGRESSIVE universe shows the worst results for scheduled entries (97.5% failure rate), while the COMMUNICATION SERVICES 10-stock control shows 96.0% failure on scheduled entries. Even the best-performing category (COMMUNICATION SERVICES variable entries) shows 79.5% failure.
4.2 Results by Factor
The Investment factor (CMA) shows the worst stability across all universes, with only 2.5% of scheduled-entry tests confirming equivalence. This is notable because CMA measures exposure to firms’ investment aggressiveness, a characteristic that should be determined entirely by which stocks are held, not when positions are exited.
Momentum (MOM) also shows poor stability, which might be expected given momentum’s inherent sensitivity to timing. However, the instability of SMB (Size), HML (Value), and RMW (Profitability) is harder to reconcile with holdings-based interpretations of factor exposure.
4.3 Extreme Failures
Table 2 presents the most extreme factor loading differences observed.
Table 2: Extreme Factor Loading Failures
| Rank | Universe | Strategy | Period | Factor | \|Δ\| | TOST p | Info Content |
|---|---|---|---|---|---|---|---|
| 1 | AGGRESSIVE | random_126d | 2023–25 | CMA | 0.457 | 1.00 | ZERO |
| 2 | COMM SVCS | random_126d | 2023–25 | RMW | 0.411 | 0.94 | ZERO |
| 3 | COMM SVCS | random_63d | 2023–25 | CMA | 0.364 | 0.90 | ZERO |
| 4 | COMM SVCS | sept_avoid | 2006–15 | RMW | 0.340 | 0.96 | Calendar |
| 5 | COMM SVCS | random_21d | 2023–25 | CMA | 0.328 | 0.86 | ZERO |
| 6 | COMM SVCS | end_of_qtr | 2023–25 | CMA | 0.324 | 0.99 | Calendar |
| 7 | DEFENSIVE | friday | 2023–25 | MOM | 0.312 | 1.00 | Calendar |
| 8 | AGGRESSIVE | sept_avoid | 2006–15 | CMA | 0.309 | 0.93 | Calendar |
| 9 | FULL | sept_avoid | 2006–15 | CMA | 0.309 | 0.93 | Calendar |
| 10 | DEFENSIVE | sept_avoid | 2006–15 | CMA | 0.306 | 0.94 | Calendar |
Note: 5 of top 6 are random strategies with ZERO information content. |Δ| = absolute difference in factor loading between NATIVE and PATIENCE.
The single most extreme observation comes from the random_126d strategy in the AGGRESSIVE universe during 2023–2025: a CMA loading difference of +0.457 with TOST p-value of 0.9997. This random strategy has zero information content. It enters positions at random times. Yet changing only exit timing produces a measured Investment factor exposure difference of nearly half a unit.
Five of the top six most extreme failures come from random strategies or the 10-stock control. This pattern eliminates skill-based or diversification-based explanations for the observed factor loading instability.
4.4 The 10-Stock Control
The COMMUNICATION SERVICES universe contains only 10 stocks from a single sector. This design eliminates several potential confounds:
No diversification effects: With only 10 stocks, portfolio-level factor exposure should directly reflect constituent characteristics.
No sector rotation: All stocks come from the same sector, eliminating sector-timing explanations.
Minimal cross-sectional variation: Factor exposure should be nearly identical regardless of exit timing.
Despite these controls, the 10-stock universe shows a 96.0% failure rate on scheduled entries and an 81.5% rate overall. The maximum factor loading difference observed is 0.411 (random_126d, RMW factor). If factor loadings measured holdings-based risk exposure, a 10-stock single-sector portfolio should show near-perfect equivalence between timing variants.
4.5 Random Strategy Analysis
The random entry strategies provide the cleanest test of our hypothesis. By construction, random strategies contain zero information about future returns. Any systematic difference in factor loadings between NATIVE and PATIENCE versions of random strategies must therefore reflect properties of the measurement procedure rather than differential risk exposure.
We observe factor loading differences up to 0.457 for random strategies. These are not small measurement errors—they are differences larger than most factor tilts reported in the literature. A CMA loading shift of 0.457 implies the regression interprets the same 10 stocks as having dramatically different investment-factor exposure based solely on when positions were exited. If researchers had used this procedure to evaluate these random strategies, they would have concluded the strategies had systematically different factor exposures—despite the strategies being informationally identical by construction.
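To make the random-strategy control concrete, the numpy sketch below simulates the logic end to end. Everything here is an illustrative assumption, not the paper's R pipeline: the entry cadence is loosely patterned on random_63d, the exits on a fixed-horizon NATIVE rule versus a single profit-target PATIENCE rule, and the factor is synthetic. Two return series built from the same stock and the same entry dates, differing only in exit rule, produce different estimated betas against the same factor.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 2520                                   # ~10 years of daily data
factor = rng.normal(0.0, 0.01, T)          # a synthetic factor return series
stock = factor + rng.normal(0.0, 0.01, T)  # one stock, true loading = 1.0
price = np.cumprod(1.0 + stock)            # price path for profit targets

def beta(returns, fac):
    """OLS slope: Cov(R, F) / Var(F)."""
    return np.cov(returns, fac, ddof=1)[0, 1] / np.var(fac, ddof=1)

# Zero-information entries: a new tranche every 63 trading days.
entries = range(0, T - 181, 63)

def strategy_returns(exit_rule):
    """Sum of simple returns across open tranches (equal-notional sketch)."""
    r = np.zeros(T)
    for t0 in entries:
        for t in range(t0 + 1, T):
            r[t] += stock[t]                    # hold the stock on day t
            gain = price[t] / price[t0] - 1.0   # unrealized gain since entry
            if exit_rule(t - t0, gain):
                break                           # exit this tranche
    return r

# Identical stock, identical entry dates; only the exit rule differs.
native = strategy_returns(lambda days, gain: days >= 21)
patient = strategy_returns(lambda days, gain: gain >= 0.05 or days >= 180)

print(beta(native, factor), beta(patient, factor))  # the two betas differ
```

The magnitude of the gap depends on the simulation parameters; the point is structural: a regression on realized returns assigns different factor loadings to informationally identical random strategies purely because their return paths differ.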
5. Discussion
5.1 The Dismissal Procedure Is Invalid
The standard procedure for evaluating timing strategies is as follows: regress strategy returns against Fama-French factors, observe factor loadings and alpha, conclude that any alpha reduction reflects factor exposure rather than timing skill. This procedure has been applied to calendar anomalies, technical indicators, and market timing signals for three decades. Our experiments prove it is invalid.
The procedure assumes that factor loadings measure risk exposure determined by what securities a portfolio holds. If this assumption were true, two portfolios holding identical securities would show equivalent factor loadings regardless of when they trade. Our experiments test this directly: identical stocks, identical weights, identical entry dates, only exit timing differs. Factor loadings fail equivalence 86% of the time, with differences up to 0.457.
This is not a marginal limitation. Factor loadings respond to timing, not holdings. A procedure that interprets timing-driven loading changes as evidence about holdings-based risk exposure is not measuring what it claims to measure. Every conclusion drawn from such a procedure—“alpha explained by factor exposure,” “anomaly reflects risk premium,” “no evidence of timing skill”—is unfounded when applied to timing strategies.
5.2 What the Literature Claimed vs. What Was Actually Tested
The literature claimed to test: “Does this timing strategy reflect genuine skill, or does it merely harvest factor risk premiums?” The procedure was supposed to separate skill from exposure by controlling for factor loadings. If alpha disappeared after the regression, the strategy was dismissed as factor harvesting.
What the procedure actually tested: “Does this return series have different covariances with factor portfolios than a benchmark?” These are not the same question. The first question asks about the source of returns. The second question asks about return-path statistics. When timing varies, the answer to the second question is mechanically determined by when returns are realized, not by what securities are held or what skill generated the timing decisions.
The bait-and-switch occurred because the procedure produces numbers that look like answers to the first question. Factor loadings have names—SMB, HML, MOM—that suggest they measure exposure to size, value, and momentum. Researchers interpreted elevated loadings as evidence that the strategy was exposed to those factors, and concluded the returns reflected factor premiums rather than timing skill. But the loadings were measuring return-path covariances, not holdings-based exposures. The interpretation was invalid from the start.
Our experiments expose this bait-and-switch by holding securities constant. When identical holdings produce dramatically different factor loadings based solely on exit timing, the loadings cannot be measuring what the literature claimed they measured. The procedure absorbed timing effects into factor loadings and then interpreted those loadings as evidence against timing skill—a conclusion that was unfalsifiable by construction.
5.3 The Momentum Anomaly: Evidence of Accidental Survival
Our results have profound implications for understanding why certain anomalies survive academic scrutiny while others are dismissed. The momentum factor (MOM) provides a revealing case study.
Momentum is the one timing anomaly that academic finance acknowledges as “real.” In their seminal 1996 paper, Fama and French explicitly admitted: “We have saved until last the discussion of the main embarrassment of the three-factor model, its failure to capture the continuation of short-term returns” (Fama and French, 1996, p. 81). This admission led Carhart (1997) to add MOM as a fourth factor. The implicit reasoning was: if the Fama-French factors cannot explain momentum, then momentum must represent a genuine anomaly rather than compensation for factor exposure. Unlike every other timing signal, momentum was granted anomaly status precisely because the dismissal tool visibly failed.
Our experiments reveal a different interpretation. Momentum loadings fail equivalence tests at rates comparable to other factors—only 5.6% of scheduled-entry tests confirm equivalence for MOM loadings. When portfolios hold identical stocks with identical weights, MOM loadings shift substantially based solely on exit timing. This means the momentum factor loading itself is a path-dependent regression artifact, not a property of portfolio holdings.
This finding reframes the historical survival of momentum as an anomaly. Momentum did not survive because it represents a genuine market inefficiency that factor models cannot explain. It survived because the dismissal tool—factor regression—happened to produce visibly inconsistent results for momentum returns. When researchers regressed momentum strategy returns against FF factors and found the strategy still showed positive alpha, they concluded momentum was “real.” But our experiments show this conclusion rested on a broken measurement procedure.
The implications extend far beyond momentum. Schwert (2003), in his comprehensive review for the Handbook of the Economics of Finance, documented how anomalies “often seem to disappear, reverse, or attenuate” after academic publication—a pattern he attributed partly to arbitrage but largely to factor-based “explanations” that absorbed anomalous returns into factor loadings. Our experiments reveal why these disappearances occurred: the factor regression procedure mechanically absorbs timing effects into loadings, then interprets those loadings as evidence against timing skill. The Halloween effect (Bouman and Jacobsen, 2002), which showed robust return differences between November–April and May–October periods across 37 countries, was repeatedly challenged by factor-based tests claiming the effect “reflects risk exposure.” The January effect faced similar treatment. Short-term alpha signals are “generally dismissed in traditional asset pricing models” (Blitz et al., 2023). In each case, factor regressions were used to argue that timing-based returns reflected factor exposure rather than genuine predictability. Our results suggest those conclusions are methodologically suspect.
This is not merely a technical limitation. It is a systematic bias in the anomaly evaluation literature. Timing signals that happened to show residual alpha after factor adjustment (like momentum) were accepted. Timing signals whose alpha happened to be absorbed by factor loadings were rejected. But both categories were evaluated by a tool that cannot distinguish timing skill from factor exposure.
The implication is stark: thirty years of anomaly dismissals based on factor model analysis require re-evaluation. Any timing signal that was killed by “alpha explained by factor exposure” was killed by a methodologically invalid test.
5.4 Thirty Years of Suspect Dismissals
The scope of this problem is not limited to a few marginal studies. Factor-based dismissal has been the dominant methodology for evaluating timing strategies since Fama and French (1993) introduced their three-factor model. Calendar anomalies—January effect, turn-of-month, sell-in-May—were tested against factor models and frequently dismissed as factor exposure. Jacobsen and Zhang (2018) showed the Halloween effect persists globally in data spanning over a century across dozens of countries, yet factor-based challenges continued to claim it “reflects risk.” Technical indicators—moving average crossovers, RSI, MACD—were regressed against factors and dismissed when alpha shrank. The entire framework of anomaly evaluation assumed that factor loadings measure what a portfolio holds, not when it trades. Our experiments prove this assumption false.
In every case, the conclusion was the same: “Alpha explained by factor exposure. No evidence of genuine timing skill.” In every case, the conclusion rested on interpreting factor loadings as measures of holdings-based risk exposure. In every case, that interpretation was invalid because the strategies involved timing variation.
We cannot say how many of these dismissed anomalies represent genuine market inefficiencies. Our experiments do not prove that any specific timing strategy works. What we prove is that the tool used to dismiss them was incapable of making the determination it claimed to make. The dismissals were not wrong because the anomalies were real. The dismissals were meaningless because the procedure could not distinguish real from spurious when timing varied.
This is not a call to rehabilitate every dismissed anomaly. It is a call to recognize that the dismissals provided no valid evidence. The anomalies remain open questions. The academic consensus that “most timing anomalies are spurious” rests on evidence that our experiments prove is inadmissible.
5.5 Anticipated Objection: ‘Factor Loadings Measure Covariances By Definition’
A sophisticated reader might object: ‘Factor loadings ARE return covariances by definition. You have merely demonstrated that different return series have different covariances with factors, which is tautological. This does not falsify anything.’
We address this objection directly. The mathematical definition of factor loadings as covariance ratios is not in dispute. What we falsify is the INTERPRETIVE FRAMEWORK applied in performance evaluation, which proceeds as follows:
Step 1: A strategy generates positive returns.
Step 2: Regress returns against Fama-French factors.
Step 3: Observe that the strategy has positive loadings on, say, SMB and HML.
Step 4: Conclude that returns reflect “exposure to size and value risk” rather than skill.
Step 5: Dismiss the strategy as harvesting known risk premiums.
This interpretive chain requires an unstated assumption: that factor loadings reflect properties of the PORTFOLIO (what you hold) rather than properties of the RETURN PATH (when you realize gains). Our experiments demonstrate this assumption is false. Factor loadings change dramatically based on exit timing alone, holding securities constant.
The objection ‘loadings are just covariances’ actually concedes our point. Yes, loadings are covariances. Covariances depend on timing. Therefore, loadings depend on timing. Therefore, interpreting loadings as “risk exposure determined by holdings” is invalid when timing varies. The standard performance evaluation framework applies exactly this invalid interpretation.
Any researcher who has used factor loadings to argue “alpha is explained by factor exposure” was implicitly claiming those loadings reflected portfolio composition. Our experiments prove they do not. The loadings reflect return paths, and return paths depend on timing. This is not a tautology; it is an empirical demonstration that the interpretive framework used to dismiss timing anomalies for thirty years was built on a false assumption. The objection that “loadings are just covariances” does not rescue the literature—it condemns it.
5.6 Limitations
Our study tests a specific exit timing rule (graduated profit-taking) against native benchmark exits. Other exit timing variations might show different results. However, the key finding that identical holdings produce non-equivalent factor loadings is sufficient to falsify the holdings-based interpretation of factor exposure.
We test S&P 500 constituents over a 20-year period. Results might differ for other markets, time periods, or security universes. We consider this unlikely given the mathematical basis for path-dependence, but acknowledge it as a limitation.
Our equivalence bounds of ±0.10 are conventional but arbitrary. Tighter bounds would show higher failure rates; looser bounds would show lower failure rates. The extreme observations exceeding 0.40 in magnitude would fail any reasonable equivalence bound.
6. Conclusion
Fama and French (1996) admitted that momentum was “the main embarrassment” of their three-factor model. Our experiments reveal that this embarrassment was diagnostic: the entire methodology for dismissing timing anomalies is invalid. For three decades, academic finance used a standard procedure—regress returns against Fama-French factors, interpret the loadings as risk exposures, conclude that alpha reflects factor premiums rather than skill. We prove this procedure cannot make the determination it claims to make.
Using 6,560 controlled experiments with portfolios holding identical stocks, we demonstrate that factor loadings fail equivalence tests at an 86% rate when only exit timing differs. Random strategies with zero information content show loading differences up to 0.457. A 10-stock control eliminates diversification as an explanation. These results are unambiguous: factor loadings respond to timing, not holdings. The procedure cannot distinguish timing skill from factor exposure because it conflates the two.
Momentum survived as an acknowledged anomaly only because the broken dismissal tool happened to fail visibly for that specific case—Fama and French could not explain it away, so they added it as a factor instead. The Halloween effect (Bouman and Jacobsen, 2002), which shows robust return differences across 37 countries, was repeatedly challenged by factor-based tests. The January effect faced similar treatment. Hundreds of timing signals were dismissed by “alpha explained by factor exposure.” All were dismissed by an invalid procedure. The survival of momentum was accidental; the dismissals were systematic error.
We do not claim that every dismissed anomaly is real, or that factor models are useless in all contexts. What we prove is that the tool used to dismiss timing anomalies was never capable of making that determination. The academic consensus that “most timing strategies reflect factor exposure rather than skill” rests on evidence that is inadmissible. The dismissals provided no valid information. The anomalies remain open questions.
References
Bernard, V.L., and Thomas, J.K. (1989). Post-earnings-announcement drift: Delayed price response or risk premium? Journal of Accounting Research, 27, 1–36.
Blitz, D., Hanauer, M.X., Honarvar, I., Huisman, R., and van Vliet, P. (2023). Beyond Fama-French factors: Alpha from short-term signals. Financial Analysts Journal, 79(4), 96–117.
Bouman, S., and Jacobsen, B. (2002). The Halloween indicator, “Sell in May and go away”: Another puzzle. American Economic Review, 92(5), 1618–1635.
Carhart, M.M. (1997). On persistence in mutual fund performance. Journal of Finance, 52(1), 57–82.
Fama, E.F., and French, K.R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1), 3–56.
Fama, E.F., and French, K.R. (1996). Multifactor explanations of asset pricing anomalies. Journal of Finance, 51(1), 55–84.
Fama, E.F., and French, K.R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1), 1–22.
Grundy, B.D., and Martin, J.S. (2001). Understanding the nature of the risks and the source of the rewards to momentum investing. Review of Financial Studies, 14(1), 29–78.
Jacobsen, B., and Zhang, C.Y. (2018). The Halloween indicator, “Sell in May and go away”: Everywhere and all the time. Available at SSRN: https://ssrn.com/abstract=2154873.
Jegadeesh, N., and Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance, 48(1), 65–91.
Schwert, G.W. (2003). Anomalies and market efficiency. In G.M. Constantinides, M. Harris, and R.M. Stulz (Eds.), Handbook of the Economics of Finance (Vol. 1, pp. 939–974). Elsevier.
AI Assistance Disclosure
This manuscript was drafted with assistance from Claude (Anthropic). The AI assisted with manuscript preparation, writing, and editing. All research design, data collection, statistical analysis, and results generation were conducted entirely by the author without AI involvement. The AI did not have access to the underlying data, did not run any analyses, and did not generate any empirical results reported in this paper.
Specifically:
Author contribution: Research design, hypothesis development, preregistration, data collection, all R code for backtesting and statistical analysis, all empirical results, interpretation of findings.
AI contribution: Manuscript drafting, literature review synthesis, exposition and framing, copy editing.
The author takes full responsibility for the accuracy of all empirical claims and the validity of all statistical results reported herein.