Building an Open-Source Barra-Like Specific Return Model
Abstract
This working paper outlines how the open-source hornli-hsr project builds a free alternative
to Barra’s proprietary specific return model. It combines daily market data and quarterly
fundamentals to compute standardized descriptors (e.g., size, value, momentum, leverage,
volatility) which are then aggregated into style factor exposures. On each trading day, factor
returns are estimated via cross-sectional regressions of stock returns on their industry, country
and style exposures; specific returns are obtained as residuals and used to build an EWMA
covariance estimator. To avoid look-ahead bias, fundamental data are aligned by quarter-end
dates and shifted so only information available at the time is used. Factor and specific return
series and descriptor-level exposures are released publicly, while raw price and fundamental data
remain proprietary. HSR (Hornli Specific Return) is an open-source research tool, not affiliated
with or endorsed by MSCI/Barra and not intended as investment advice.
Keywords — Barra risk model; factor models; cross-sectional regression; style factors; open-source
risk model; look-ahead bias; EWMA covariance; financial risk management.
1 Introduction
Risk models such as MSCI’s BARRA US Equity models forecast portfolio risk by representing
each asset’s return as a linear combination of factor exposures multiplied by factor returns plus a
specific (idiosyncratic) return. Factors include industry dummies, country/region exposures and
style factors (size, value, momentum). Factor returns are estimated via cross-sectional regression of
asset returns on their exposures, and the resulting time series build the factor covariance matrix.
Because exposures are neutral in aggregate, the cap-weighted sum of industry returns approximates
the market return. The hornli-hsr project reproduces a Barra-like specific return model by releasing
descriptor-level exposures and the resulting factor and specific return panels while preserving vendor
proprietary data.
2 Data and Universe
• Universe and dates. The model targets U.S. equities in the Russell 3000 universe. Daily
market data include open, high, low, close and adjusted prices, while fundamental data
include quarterly income-statement and balance-sheet items. The default backtest begins on
1 January 2015 and runs to the present.
• Non-public data. Raw market and fundamental data (under DEFAULT_PATH/input) cannot
be released. Instead, the project constructs descriptors for each style factor and stores them
under DEFAULT_PATH/intermediate/descriptor; these descriptor files and the processing
code are provided to the public. Final factor and specific return panels are written to
DEFAULT_PATH/output.
1
• Avoiding look-ahead bias. Look-ahead bias arises when a study uses information not
yet available at the decision time. Fundamental data are reported with a lag; to avoid bias,
fundamentals are aligned by quarter-end and shifted by one period. After loading a company’s
fundamental history, the series is shifted and forward-filled so the model only uses the most
recently reported data.
3 Descriptor Construction
Style factors are inspired by MSCI/Barra. Each factor is built from one or more descriptors
computed from price and fundamental data. Key descriptors include:
• Market capitalisation & Size — Market cap is close price times diluted shares; size is the
log of market cap, and a nonlinear term uses the cube of this log.
• Value — Book-to-price (total common equity divided by price), sales-to-price and cash-flowto-price descriptors.
• Earnings yield — Trailing 12-month net income per share divided by price.
• Growth — Measures payout ratio, asset growth rate, earnings growth and recent earnings
change.
• Leverage — Combines market leverage and book leverage.
• Momentum — Twelve-month momentum (excluding the most recent month) computed as
the mean of daily returns over a 252-day window, shifted by 21 days.
• Volatility — Standard deviation of daily returns over 60 days and high-low range over 21
days.
• Trading activity — Rolling sums of daily volume divided by average shares outstanding
over annual, quarterly and 5-year windows.
• Earnings variability — Rolling standard deviation divided by mean of earnings, sales and
cash flows over five years; accruals are net income minus cash flow divided by total assets.
• Dividend yield — Trailing four quarters of dividends per share divided by price.
• Management quality — Growth rates of assets, shares outstanding and capital expenditures,
and the CAPEX-to-assets ratio.
Daily alignment maps quarterly values to daily series by shifting each quarter’s value to the
first available trading day and forward-filling across days. After computing descriptors, they are
winsorized (e.g., at 1% and 99% quantiles) and standardized cross-sectionally so that zero denotes
average exposure and each unit represents one standard deviation.
4 Aggregating Descriptors into Style Exposures
Descriptors are grouped into high-level style factors. After standardizing individual descriptors, the
code averages z-scores across descriptors within each style (e.g., value factor averages normalized
book-to-price, sales-to-price and cash-flow-to-price). Industry one-hot dummies are built from GICS
names, and since all stocks are U.S., country exposure is a single column of ones. Exposures are
stored in intermediate/loadings.parquet, and market cap data are stored separately to support
regression weighting.
2
5 Cross-Sectional Regression and Factor Returns
On each trading day, the project regresses the vector of stock returns on the design matrix of
exposures. The cross-sectional regression model is:
rt,i =
K
X
xt,i,k ft,k + ut,i ,
(1)
k=1
where x is the matrix of exposures, ft,k are factor returns and ut,i are specific returns. Daily style
and industry factor returns are estimated via weighted least squares with constraints:
1. Winsorize returns to clip extreme values at the 1% and 99% quantiles.
2. Regression weights are proportional to the inverse of specific variance estimates and scaled
by the square root of market capitalization.
3. Constraints enforce that cap-weighted sums of country and industry factor returns equal
zero.
4. Solve regression using a block matrix system to estimate factor returns and specific returns.
The resulting panel of factor returns (dates × factors) is stored in output/factor_return.parquet
and specific returns (dates × assets) in output/specific_return.parquet.
6 Risk Model and Covariance Estimation
The factor covariance matrix is estimated using an EWMA of factor returns. Separate half-life
parameters for volatility and correlation are used, and a regime proxy adjusts volatility for changing
market conditions. The factor covariance on day t is
Σt = Dt ρt Dt + λI,
(2)
where Dt is a diagonal matrix of factor volatilities scaled by the regime multiplier, ρt is the correlation
matrix, and λ is a ridge parameter for numerical stability. Specific variances are updated via an
EWMA of squared specific returns with Bayesian shrinkage toward the cross-sectional median.
7 Risk Attribution
Given factor exposures and the risk model, the proportion of variance explained by factors for a
portfolio is computed by
β⊤
p Σf β p
2
Rport
= ⊤
,
(3)
β p Σf β p + w⊤ diag(σ 2 ) w
where β p is the vector of portfolio exposures, Σf is the factor covariance matrix and σ 2 is the vector
of specific variances. The research script stores R2 panels per asset and date.
3
8 Robustness Considerations
Winsorization and standardization control extreme observations; clipping descriptors and returns at
1% and 99% quantiles reduces the influence of outliers. Regression weighting by the square root
of market capitalisation and the inverse of specific variance reduces noise from small-cap stocks,
and constraints ensure industry and country factor returns measure relative performance. Shifting
quarterly fundamentals backward and aligning them to the first trading date prevents look-ahead
bias. Half-life parameters (volatility ≈ 42 days, correlation ≈ 200 days, regime ≈ 21 days) follow
common risk-model practice.
9 Limitations
The HSR framework has limitations: it models only U.S. equities (Russell 3000); global exposures
and emerging-market dynamics are not included. Descriptor sets and factor definitions are simplified
relative to MSCI’s commercial models. Factor returns may differ from MSCI’s due to proprietary
weighting schemes; replicators must obtain equivalent raw data to recompute descriptors. Thus,
HSR is a research-oriented approximation rather than a professional risk model replacement.
10 Data Release and Usage
Deliverables include descriptor files, loadings, factor and specific returns, factor covariance and
specific variance panels, and R2 panels. Researchers should cite the work and underlying methodology;
raw inputs remain proprietary.
11 Versioning and Reproducibility
Each release is tied to a specific code commit and its output files. This paper corresponds to commit
f512a6ee02b4b10cceb4511b667efa52830b4b67 on the main branch of the ajde0606/hornli-hsr
GitHub repository (14 Oct 2025). Researchers can reproduce results by checking out this commit
and executing the processing scripts. When citing this work, reference the commit hash and retrieval
date; an example citation is: “[Author(s)]. Building an Open-Source Barra-Like Specific Return
Model. Working paper, 2025. Version f512a6e.”
12 Conclusion
The working paper presents an open-source implementation of a Barra-like specific return model.
It follows the MSCI/Barra methodology for constructing descriptors, aggregating them into style
factors, estimating factor returns via constrained cross-sectional regression and building a factor
covariance matrix using EWMA techniques. Key safeguards include winsorising and standardising
descriptors, weighting regressions by market capitalisation and specific variance, imposing sum-tozero constraints, and shifting fundamental data to avoid look-ahead bias. By releasing descriptor-level
data and code, the project provides a transparent framework for exploring factor-based risk models
while respecting data licensing restrictions.
4