1. Introduction
Multiple linear regression uses two or more independent variables to describe the variation of the dependent variable, rather than just one independent variable as in simple linear regression. It allows the analyst to estimate more complex models with multiple explanatory variables and, if used correctly, may lead to better predictions, better portfolio construction, or better understanding of the drivers of security returns. If used incorrectly, however, multiple linear regression may yield spurious relationships, lead to poor predictions, and offer a poor understanding of relationships.
The analyst must first specify the model and make several decisions in this process, answering, among others, the following questions: What is the dependent variable of interest? What independent variables are important? What form should the model take? What is the goal of the model: prediction or understanding of the relationship?
The analyst specifies the dependent and independent variables and then employs software to estimate the model and produce related statistics. The good news is that software, such as the programs shown in Exhibit 1, does the estimation. Our primary tasks are to specify the model and to interpret the output from this software, which are the main subjects of this content.
Exhibit 1: Examples of Regression Software
| Software | Programs/Functions |
|---|---|
| Excel | Data Analysis > Regression |
| Python | scipy.stats.linregress; statsmodels.api.OLS; sklearn.linear_model.LinearRegression |
| R | lm |
| SAS | PROC REG; PROC GLM |
| STATA | regress |
Learning Module Overview
- Multiple linear regression is used to model the linear relationship between one dependent variable and two or more independent variables.
- In practice, multiple regressions are used to explain relationships between financial variables, to test existing theories, or to make forecasts.
- The regression process covers several decisions the analyst must make, such as identifying the dependent and independent variables, selecting the appropriate regression model, testing whether the assumptions behind linear regression are satisfied, examining goodness of fit, and making needed adjustments.
- A multiple regression model is represented by the following equation:
$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + \dots + b_k X_{ki} + \varepsilon_i, \quad i = 1, 2, 3, \dots, n,$$
where $Y$ is the dependent variable, the $X$s are the independent variables from 1 to $k$, $\varepsilon_i$ is the error term, and the model is estimated using $n$ observations.
- Coefficient $b_0$ is the model's "intercept," representing the expected value of $Y$ if all independent variables are zero.
- Parameters $b_1$ to $b_k$ are the slope coefficients (or partial regression coefficients) for independent variables $X_1$ to $X_k$. Slope coefficient $b_j$ describes the impact of independent variable $X_j$ on $Y$, holding all the other independent variables constant.
- Five main assumptions underlying multiple regression models must be satisfied: (1) linearity, (2) homoskedasticity, (3) independence of errors, (4) normality, and (5) independence of independent variables.
- Diagnostic plots can help detect whether these assumptions are satisfied. Scatterplots of the dependent variable versus the independent variables are useful for detecting non-linear relationships, while residual plots are useful for detecting violations of homoskedasticity and independence of errors.
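To make the regression equation and its coefficients concrete, the following is a minimal sketch of ordinary least squares estimation with two independent variables, written in plain Python. The helper names (`solve_linear_system`, `fit_ols`, `predict`) and the data are assumptions made for illustration: the data is synthetic, generated without noise from Y = 1 + 2·X1 − 3·X2, so the fit should recover those values. In practice, one of the packages in Exhibit 1 would do this work.

```python
# Minimal ordinary least squares (OLS) sketch in plain Python.
# All data below is synthetic and for illustration only.

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] as a list of rows.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest absolute pivot for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        # Normalise the pivot row, then eliminate this column from other rows.
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def fit_ols(x_rows, y):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals."""
    # Prepend a constant 1.0 to each row so that b0 acts as the intercept.
    design = [[1.0] + [float(v) for v in row] for row in x_rows]
    p = len(design[0])
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Fitted value b0 + b1*x1 + ... + bk*xk for one observation."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


# Synthetic observations (X1, X2); Y is generated from Y = 1 + 2*X1 - 3*X2.
x_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (3, 2)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in x_rows]

coefs = fit_ols(x_rows, y)  # [intercept, slope on X1, slope on X2]
residuals = [yi - predict(coefs, row) for row, yi in zip(x_rows, y)]
print("coefficients:", [round(c, 6) for c in coefs])
print("largest absolute residual:", max(abs(r) for r in residuals))
```

Each slope is a partial coefficient: the value on X1 is the change in the fitted Y for a one-unit change in X1 with X2 held fixed. With real data, the `residuals` list is what a residual plot would display when checking for homoskedasticity and independence of errors.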
2. Uses of Multiple Linear Regression
Learning Outcome
The candidate should be able to:
- describe the types of investment problems addressed by multiple linear regression and the regression process
There are many investment problems in which the analyst needs to consider the impact of multiple factors on the subject of research rather than a single factor. In the complex world of investments, it is intuitive that explaining or forecasting a financial variable by a single factor may be insufficient. The complexity of financial and economic relations calls for models with multiple explanatory variables, subject to fundamental justification and various statistical tests.
Examples of how multiple regression may be used include the following:
- A portfolio manager wants to understand how returns are influenced by a set of underlying factors: market risk, the size effect, the value effect, profitability, and investment aggressiveness. The goal is to estimate a Fama-French five-factor model that will provide an understanding of the factors that are important for driving a particular stock's excess returns.
- A financial adviser wants to identify whether certain variables, such as financial leverage, profitability, revenue growth, and changes in market share, can predict whether a company will face financial distress.
- An analyst wants to examine the effect of different dimensions of country risk, such as political stability, economic conditions, and environmental, social, and governance (ESG) considerations, on equity returns in that country.
Multiple regression can be used to identify relationships between variables, to test existing theories, or to forecast. We outline the general process of regression analysis in Exhibit 2. As you can see, there are many decisions that the analyst must make in this process.
For example, if the dependent variable is continuous, such as returns, the traditional regression model is typically the first step. If, however, the dependent variable is discrete (for example, an indicator variable such as whether or not a company is a takeover target), then, as we shall see, the model may be estimated as a logistic regression.
In either case, the process of determining the best model follows a similar path. The model must first be specified, including independent variables that may be continuous, such as company financial features, or discrete (i.e., dummy variables), indicating membership in a class, such as an industry sector. Next, the regression model is estimated and analyzed to ensure it satisfies key underlying assumptions and meets the analyst's goodness-of-fit criteria. Once the model is tested and its out-of-sample performance is deemed acceptable, then it can be used for further identifying relationships between variables, for testing existing theories, or for forecasting.
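As an illustration of discrete independent variables, the sketch below turns a class label such as an industry sector into 0/1 dummy variables. The function name `dummy_encode` and the sector labels are hypothetical. One category is left out as the baseline, a common convention that avoids perfect collinearity with the intercept.

```python
def dummy_encode(labels, baseline=None):
    """Encode class labels as 0/1 dummy columns, omitting a baseline class.

    Returns (column_names, rows); rows[i][j] is 1 when observation i belongs
    to class column_names[j], and 0 otherwise.
    """
    categories = sorted(set(labels))
    if baseline is None:
        baseline = categories[0]  # assumption: first class alphabetically
    if baseline not in categories:
        raise ValueError("baseline must be one of the observed classes")
    # With n classes, only n - 1 dummies are needed alongside an intercept.
    kept = [c for c in categories if c != baseline]
    rows = [[1 if label == c else 0 for c in kept] for label in labels]
    return kept, rows


# Hypothetical sector membership for four companies.
sectors = ["Energy", "Financials", "Technology", "Energy"]
columns, rows = dummy_encode(sectors, baseline="Energy")
print(columns)
print(rows)
```

Each dummy's slope coefficient is then read relative to the omitted baseline class, here "Energy".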
Exhibit 2: The Regression Process
The first step leads to a question: Is the dependent variable continuous? If no, use logistic regression. If yes, estimate the regression model and then analyze the residuals. Next, are the assumptions of regression satisfied? If no, adjust the model. If yes, examine the goodness of fit of the model and ask whether the overall fit is significant. If no, adjust the model and start again from the beginning. If yes, is this model the best of the possible models? If no, adjust the model and restart. If yes, use the model for analysis or prediction.
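The decision flow in Exhibit 2 can be sketched as a small function. The function name, parameter names, and returned strings are hypothetical. In practice each yes/no answer comes from the analyst's own tests and judgment, not from a single flag.

```python
def regression_next_action(dependent_is_continuous,
                           assumptions_satisfied,
                           overall_fit_significant,
                           is_best_model):
    """Return the next action in the regression process of Exhibit 2."""
    if not dependent_is_continuous:
        # A discrete dependent variable calls for a different model type.
        return "use logistic regression"
    if not assumptions_satisfied:
        # Residual analysis revealed a violated regression assumption.
        return "adjust the model"
    if not overall_fit_significant:
        # The goodness-of-fit check failed; start again.
        return "adjust the model"
    if not is_best_model:
        # A better specification exists among the candidates.
        return "adjust the model"
    return "use the model for analysis or prediction"


# Example: continuous dependent variable, but the residuals are not homoskedastic.
print(regression_next_action(True, False, True, True))
```

The order of the checks mirrors the exhibit: the type of dependent variable is settled first, the assumptions are examined before goodness of fit, and the model is used only after every check passes.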
Knowledge Check
Assessment: Multiple Regression: Types of Investment Problems and Process
You are a junior analyst assisting in the development of various multiple regression models for your industry sector. Identify the action you should take to resolve each of the following issues:
| Issue | Action |
|---|---|
| The dependent variable takes on a value of 1 if the company is a merger target and 0 otherwise. | |
| The analyst estimates a model with five independent variables, and none of these variables are significant explanatory variables. | |
| The residuals do not appear to be homoskedastic, thus violating a regression assumption. | |
| The regression assumptions are satisfied, the overall fit is significant, and the model is the best model of the possible models. | |
Solution
| Issue | Action |
|---|---|
| The dependent variable takes on a value of 1 if the company is a merger target and 0 otherwise. | Use logistic regression. |
| The analyst estimates a model with five independent variables, and none of these variables are significant explanatory variables. | Adjust the model and re-estimate. |
| The residuals do not appear to be homoskedastic, thus violating a regression assumption. | Adjust the model and re-estimate. |
| The regression assumptions are satisfied, the overall fit is significant, and the model is the best model of the possible models. | Use the model... |