Open-Source Barra Risk Model Guide

Building an Open-Source Barra-Like Specific Return Model Abstract This working paper outlines how the open-source hornli-hsr project builds a free alternative to Barra’s proprietary specific return model. It combines daily market data and quarterly fundamentals to compute standardized descriptors (e.g., size, value, momentum, leverage, volatility) which are then aggregated into style factor exposures. On each trading day, factor returns are estimated via cross-sectional regressions of stock returns on their industry, country and style exposures; specific returns are obtained as residuals and used to build an EWMA covariance estimator. To avoid look-ahead bias, fundamental data are aligned by quarter-end dates and shifted so only information available at the time is used. Factor and specific return series and descriptor-level exposures are released publicly, while raw price and fundamental data remain proprietary. HSR (Hornli Specific Return) is an open-source research tool, not affiliated with or endorsed by MSCI/Barra and not intended as investment advice. Keywords — Barra risk model; factor models; cross-sectional regression; style factors; open-source risk model; look-ahead bias; EWMA covariance; financial risk management. 1 Introduction Risk models such as MSCI’s BARRA US Equity models forecast portfolio risk by representing each asset’s return as a linear combination of factor exposures multiplied by factor returns plus a specific (idiosyncratic) return. Factors include industry dummies, country/region exposures and style factors (size, value, momentum). Factor returns are estimated via cross-sectional regression of asset returns on their exposures, and the resulting time series build the factor covariance matrix. Because exposures are neutral in aggregate, the cap-weighted sum of industry returns approximates the market return. The hornli-hsr project reproduces a Barra-like specific return model by releasing descriptor-level exposures and the resulting factor and specific return panels while preserving vendor proprietary data. 2 Data and Universe • Universe and dates. The model targets U.S. equities in the Russell 3000 universe. Daily market data include open, high, low, close and adjusted prices, while fundamental data include quarterly income-statement and balance-sheet items. The default backtest begins on 1 January 2015 and runs to the present. • Non-public data. Raw market and fundamental data (under DEFAULT_PATH/input) cannot be released. Instead, the project constructs descriptors for each style factor and stores them under DEFAULT_PATH/intermediate/descriptor; these descriptor files and the processing code are provided to the public. Final factor and specific return panels are written to DEFAULT_PATH/output. 1 • Avoiding look-ahead bias. Look-ahead bias arises when a study uses information not yet available at the decision time. Fundamental data are reported with a lag; to avoid bias, fundamentals are aligned by quarter-end and shifted by one period. After loading a company’s fundamental history, the series is shifted and forward-filled so the model only uses the most recently reported data. 3 Descriptor Construction Style factors are inspired by MSCI/Barra. Each factor is built from one or more descriptors computed from price and fundamental data. Key descriptors include: • Market capitalisation & Size — Market cap is close price times diluted shares; size is the log of market cap, and a nonlinear term uses the cube of this log. • Value — Book-to-price (total common equity divided by price), sales-to-price and cash-flowto-price descriptors. • Earnings yield — Trailing 12-month net income per share divided by price. • Growth — Measures payout ratio, asset growth rate, earnings growth and recent earnings change. • Leverage — Combines market leverage and book leverage. • Momentum — Twelve-month momentum (excluding the most recent month) computed as the mean of daily returns over a 252-day window, shifted by 21 days. • Volatility — Standard deviation of daily returns over 60 days and high-low range over 21 days. • Trading activity — Rolling sums of daily volume divided by average shares outstanding over annual, quarterly and 5-year windows. • Earnings variability — Rolling standard deviation divided by mean of earnings, sales and cash flows over five years; accruals are net income minus cash flow divided by total assets. • Dividend yield — Trailing four quarters of dividends per share divided by price. • Management quality — Growth rates of assets, shares outstanding and capital expenditures, and the CAPEX-to-assets ratio. Daily alignment maps quarterly values to daily series by shifting each quarter’s value to the first available trading day and forward-filling across days. After computing descriptors, they are winsorized (e.g., at 1% and 99% quantiles) and standardized cross-sectionally so that zero denotes average exposure and each unit represents one standard deviation. 4 Aggregating Descriptors into Style Exposures Descriptors are grouped into high-level style factors. After standardizing individual descriptors, the code averages z-scores across descriptors within each style (e.g., value factor averages normalized book-to-price, sales-to-price and cash-flow-to-price). Industry one-hot dummies are built from GICS names, and since all stocks are U.S., country exposure is a single column of ones. Exposures are stored in intermediate/loadings.parquet, and market cap data are stored separately to support regression weighting. 2 5 Cross-Sectional Regression and Factor Returns On each trading day, the project regresses the vector of stock returns on the design matrix of exposures. The cross-sectional regression model is: rt,i = K X xt,i,k ft,k + ut,i , (1) k=1 where x is the matrix of exposures, ft,k are factor returns and ut,i are specific returns. Daily style and industry factor returns are estimated via weighted least squares with constraints: 1. Winsorize returns to clip extreme values at the 1% and 99% quantiles. 2. Regression weights are proportional to the inverse of specific variance estimates and scaled by the square root of market capitalization. 3. Constraints enforce that cap-weighted sums of country and industry factor returns equal zero. 4. Solve regression using a block matrix system to estimate factor returns and specific returns. The resulting panel of factor returns (dates × factors) is stored in output/factor_return.parquet and specific returns (dates × assets) in output/specific_return.parquet. 6 Risk Model and Covariance Estimation The factor covariance matrix is estimated using an EWMA of factor returns. Separate half-life parameters for volatility and correlation are used, and a regime proxy adjusts volatility for changing market conditions. The factor covariance on day t is Σt = Dt ρt Dt + λI, (2) where Dt is a diagonal matrix of factor volatilities scaled by the regime multiplier, ρt is the correlation matrix, and λ is a ridge parameter for numerical stability. Specific variances are updated via an EWMA of squared specific returns with Bayesian shrinkage toward the cross-sectional median. 7 Risk Attribution Given factor exposures and the risk model, the proportion of variance explained by factors for a portfolio is computed by β⊤ p Σf β p 2 Rport = ⊤ , (3) β p Σf β p + w⊤ diag(σ 2 ) w where β p is the vector of portfolio exposures, Σf is the factor covariance matrix and σ 2 is the vector of specific variances. The research script stores R2 panels per asset and date. 3 8 Robustness Considerations Winsorization and standardization control extreme observations; clipping descriptors and returns at 1% and 99% quantiles reduces the influence of outliers. Regression weighting by the square root of market capitalisation and the inverse of specific variance reduces noise from small-cap stocks, and constraints ensure industry and country factor returns measure relative performance. Shifting quarterly fundamentals backward and aligning them to the first trading date prevents look-ahead bias. Half-life parameters (volatility ≈ 42 days, correlation ≈ 200 days, regime ≈ 21 days) follow common risk-model practice. 9 Limitations The HSR framework has limitations: it models only U.S. equities (Russell 3000); global exposures and emerging-market dynamics are not included. Descriptor sets and factor definitions are simplified relative to MSCI’s commercial models. Factor returns may differ from MSCI’s due to proprietary weighting schemes; replicators must obtain equivalent raw data to recompute descriptors. Thus, HSR is a research-oriented approximation rather than a professional risk model replacement. 10 Data Release and Usage Deliverables include descriptor files, loadings, factor and specific returns, factor covariance and specific variance panels, and R2 panels. Researchers should cite the work and underlying methodology; raw inputs remain proprietary. 11 Versioning and Reproducibility Each release is tied to a specific code commit and its output files. This paper corresponds to commit f512a6ee02b4b10cceb4511b667efa52830b4b67 on the main branch of the ajde0606/hornli-hsr GitHub repository (14 Oct 2025). Researchers can reproduce results by checking out this commit and executing the processing scripts. When citing this work, reference the commit hash and retrieval date; an example citation is: “[Author(s)]. Building an Open-Source Barra-Like Specific Return Model. Working paper, 2025. Version f512a6e.” 12 Conclusion The working paper presents an open-source implementation of a Barra-like specific return model. It follows the MSCI/Barra methodology for constructing descriptors, aggregating them into style factors, estimating factor returns via constrained cross-sectional regression and building a factor covariance matrix using EWMA techniques. Key safeguards include winsorising and standardising descriptors, weighting regressions by market capitalisation and specific variance, imposing sum-tozero constraints, and shifting fundamental data to avoid look-ahead bias. By releasing descriptor-level data and code, the project provides a transparent framework for exploring factor-based risk models while respecting data licensing restrictions. 4

Open-Source Barra Risk Model Guide

Related documents

Products

Support

Open-Source Barra Risk Model Guide

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib