Concept
Joint Expectation
Definition 133 (Joint Expectation)
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
The joint expectation is
\[
\mathbb{E}[XY]=\sum_{y \in \Omega_Y} \sum_{x \in \Omega_X} x y \, p_{X, Y}(x, y)
\]
if \(X\) and \(Y\) are discrete, or
\[
\mathbb{E}[XY]=\int_{\Omega_Y} \int_{\Omega_X} x y \, f_{X, Y}(x, y) \, dx \, dy
\]
if \(X\) and \(Y\) are continuous. Joint expectation is also called correlation.
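To make the discrete case concrete, here is a minimal NumPy sketch that computes \(\mathbb{E}[XY]\) for a small joint pmf; the supports and pmf values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical joint pmf p_{X,Y}(x, y) over X in {0, 1, 2} (rows)
# and Y in {0, 1} (columns); entries must sum to 1.
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([0.0, 1.0])
p_xy = np.array([
    [0.10, 0.20],
    [0.25, 0.15],
    [0.20, 0.10],
])

# E[XY] = sum_x sum_y x * y * p(x, y), via an outer product over the supports.
joint_expectation = np.sum(np.outer(x_vals, y_vals) * p_xy)
print(joint_expectation)  # 1*1*0.15 + 2*1*0.10 = 0.35
```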
Definition 134 (Cosine Dot Product)
Let \(\boldsymbol{x} \in \mathbb{R}^N\) and \(\boldsymbol{y} \in \mathbb{R}^N\) be two vectors.
The cosine angle \(\cos \theta\) is defined as
\[
\cos \theta=\frac{\boldsymbol{x}^T \boldsymbol{y}}{\|\boldsymbol{x}\|\|\boldsymbol{y}\|}
\]
where \(\|\boldsymbol{x}\|=\sqrt{\sum_{i=1}^N x_i^2}\) is the norm of the vector \(\boldsymbol{x}\), and \(\|\boldsymbol{y}\|=\sqrt{\sum_{i=1}^N y_i^2}\) is the norm of the vector \(\boldsymbol{y}\).
The inner product \(\boldsymbol{x}^T \boldsymbol{y}\) measures the degree of similarity/correlation between the two vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\), and \(\cos \theta\) is the cosine of the angle between them.
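A minimal NumPy sketch of Definition 134 (the helper name `cosine_similarity` and the example vectors are ours):

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two vectors: x^T y / (||x|| ||y||)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(x, y))   # 1.0: y = 2x, so the vectors are parallel
print(cosine_similarity(x, -y))  # -1.0: anti-parallel
```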
Theorem 35 (Cauchy-Schwarz Inequality)
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
Then we have,
\[
\mathbb{E}[XY]^2 \leq \mathbb{E}\left[X^2\right] \mathbb{E}\left[Y^2\right] .
\]
We can then view the joint expectation as an inner product between the two random variables, analogous to the cosine dot product above. See [Chan, 2021], Section 5.2.1, pages 259-261.
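A quick numerical illustration of the inequality on sample moments; the joint distribution below is an arbitrary choice, and the inequality holds exactly on sample averages as well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dependent random variables, chosen arbitrarily for illustration:
# X ~ N(0, 1) and Y = X + independent N(0, 1) noise.
x = rng.standard_normal(100_000)
y = x + rng.standard_normal(100_000)

lhs = np.mean(x * y) ** 2             # E[XY]^2
rhs = np.mean(x**2) * np.mean(y**2)   # E[X^2] E[Y^2]
print(lhs <= rhs)  # True: Cauchy-Schwarz holds
```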
Covariance and Correlation Coefficient
Definition 135 (Covariance)
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
Then the covariance of \(X\) and \(Y\) is defined as,
\[
\operatorname{Cov}(X, Y)=\mathbb{E}\left[\left(X-\mu_X\right)\left(Y-\mu_Y\right)\right]
\]
where \(\mu_X=\mathbb{E}[X]\) and \(\mu_Y=\mathbb{E}[Y]\) are the means of \(X\) and \(Y\) respectively.
Note that if \(X = Y\), then \(\operatorname{Cov}(X, Y)\) reduces to the variance of \(X\). Consequently, the covariance is a generalization of the variance.
Theorem 36 (Covariance)
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
Then we have,
\[
\operatorname{Cov}(X, Y)=\mathbb{E}[XY]-\mathbb{E}[X] \mathbb{E}[Y] .
\]
Proof. The proof is relatively straightforward; we just apply Definition 135 and expand:
\[
\begin{aligned}
\operatorname{Cov}(X, Y) &=\mathbb{E}\left[\left(X-\mu_X\right)\left(Y-\mu_Y\right)\right] \\
&=\mathbb{E}\left[XY-\mu_X Y-\mu_Y X+\mu_X \mu_Y\right] \\
&=\mathbb{E}[XY]-\mu_X \mathbb{E}[Y]-\mu_Y \mathbb{E}[X]+\mu_X \mu_Y \\
&=\mathbb{E}[XY]-\mathbb{E}[X] \mathbb{E}[Y] .
\end{aligned}
\]
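We can also verify Theorem 36 numerically; the two forms agree exactly on sample averages (the simulated data below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100_000)
y = 0.5 * x + rng.standard_normal(100_000)

cov_def = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X - mu_X)(Y - mu_Y)]
cov_alt = np.mean(x * y) - x.mean() * y.mean()       # E[XY] - mu_X mu_Y
print(np.isclose(cov_def, cov_alt))  # True: the two forms agree
```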
Theorem 37 (Linearity of Covariance)
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
Then we have,
\[
\operatorname{Cov}(\alpha X+\beta, Y)=\alpha \operatorname{Cov}(X, Y)
\]
where \(\alpha\) and \(\beta\) are constants.
And,
\[
\operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2 \operatorname{Cov}(X, Y) .
\]
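Both identities can be checked on simulated data; a minimal sketch, where the helper `cov` (using the biased \(1/N\) convention, matching NumPy's default `ddof=0` for `np.var`) and the data are ours:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(100_000)
y = 0.8 * x + rng.standard_normal(100_000)

def cov(a, b):
    """Sample covariance mean((A - mean A)(B - mean B)), 1/N convention."""
    return np.mean((a - a.mean()) * (b - b.mean()))

alpha, beta = 3.0, -2.0
print(np.isclose(cov(alpha * x + beta, y), alpha * cov(x, y)))                # True
print(np.isclose(np.var(x + y), np.var(x) + np.var(y) + 2 * cov(x, y)))      # True
```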
Property 23 (Covariance)
For any two random variables \(X\) and \(Y\) with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively, we have the following properties:
\(\operatorname{Cov}(X, Y)=\operatorname{Cov}(Y, X)\)
\(\operatorname{Cov}(X, Y)=0\) if \(X\) and \(Y\) are independent
\(\operatorname{Cov}(X, X)=\operatorname{Var}(X)\)
Having defined the covariance, we can formally define the correlation coefficient of \(X\) and \(Y\) below. We can treat the correlation coefficient \(\rho\) as the cosine angle of the centralized random variables \(X\) and \(Y\) [Chan, 2021]. Note that to fully appreciate why the correlation coefficient is defined as a cosine angle, one can see the derivation in [Chan, 2021], Section 5.2.1, pages 259-261.
Definition 136 (Correlation Coefficient)
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
Then the correlation coefficient of \(X\) and \(Y\) is defined as,
\[
\rho(X, Y)=\frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}=\frac{\mathbb{E}\left[\left(X-\mu_X\right)\left(Y-\mu_Y\right)\right]}{\sqrt{\operatorname{Var}(X)} \sqrt{\operatorname{Var}(Y)}}
\]
where \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\) respectively.
Property 24 (Correlation Coefficient)
\(-1 \leq \rho(X, Y) \leq 1\), an immediate consequence of the definition of cosine angle.
If \(\rho(X, Y)=1\), then \(X\) and \(Y\) are perfectly positively correlated, in other words, \(Y = \alpha X + \beta\) for some constants \(\alpha\) and \(\beta\), \(\alpha > 0\).
If \(\rho(X, Y)=-1\), then \(X\) and \(Y\) are perfectly negatively correlated, in other words, \(Y = \alpha X + \beta\) for some constants \(\alpha\) and \(\beta\), \(\alpha < 0\).
If \(\rho(X, Y)=0\), then \(X\) and \(Y\) are uncorrelated; in linear algebra lingo, the centralized random variables are orthogonal (and hence linearly independent).
\(\rho(\alpha X + \beta, \gamma Y + \delta) = \rho(X, Y)\), where \(\alpha\), \(\beta\), \(\gamma\), and \(\delta\) are constants with \(\alpha \gamma > 0\); if \(\alpha \gamma < 0\), the sign of \(\rho\) flips.
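The affine-invariance property is easy to check numerically; a minimal sketch on arbitrary simulated data, using NumPy's `np.corrcoef` for \(\rho\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
y = 0.6 * x + rng.standard_normal(10_000)

rho_xy = np.corrcoef(x, y)[0, 1]
rho_affine = np.corrcoef(3.0 * x + 1.0, 0.5 * y - 4.0)[0, 1]
print(np.isclose(rho_xy, rho_affine))  # True: invariant when alpha * gamma > 0

# Flipping the sign of one scale factor flips the sign of rho.
print(np.isclose(np.corrcoef(-x, y)[0, 1], -rho_xy))  # True
```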
Independence and Correlation Coefficient
Theorem 38 (Independence and Joint Expectation)
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
If \(X\) and \(Y\) are independent, then
\[
\mathbb{E}[XY]=\mathbb{E}[X] \mathbb{E}[Y] .
\]
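A quick Monte Carlo sanity check of Theorem 38, with two independently drawn samples (the choice of distributions is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)     # X ~ N(0, 1)
y = rng.uniform(0.0, 1.0, 1_000_000)   # Y ~ Uniform(0, 1), drawn independently

print(np.mean(x * y))            # ~ 0 = E[X] E[Y]
print(np.mean(x) * np.mean(y))   # ~ 0
```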
Theorem 39 (Independence and Covariance)
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
Consider the following two statements:
\(X\) and \(Y\) are independent;
\(\operatorname{Cov}(X, Y)=0\).
Then statement 1 implies statement 2, but statement 2 does not imply statement 1. Independence is therefore a stronger condition than uncorrelatedness [Chan, 2021].
In other words:
Independence \(\implies\) Uncorrelated;
Uncorrelated \(\not\implies\) Independence.
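The classic counterexample for the missing converse is \(Y = X^2\) with \(X\) symmetric about zero: \(Y\) is a deterministic function of \(X\), yet their covariance vanishes. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# X is symmetric about 0, and Y = X^2 is a deterministic function of X,
# so X and Y are clearly NOT independent.
x = rng.uniform(-1.0, 1.0, 1_000_000)
y = x**2

# Yet Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0, so they ARE uncorrelated.
print(np.cov(x, y)[0, 1])  # approximately 0
```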
Empirical (Sample) Correlation Coefficient
Everything defined previously is for the population, but we can also estimate the correlation coefficient from a sample.
Theorem 40 (Empirical Correlation Coefficient)
Given a dataset \(\mathcal{S}_{\{x, y\}}=\left\{\left(x^{(n)}, y^{(n)}\right)\right\}_{n=1}^N\) of \(N\) samples with \(D=1\) feature and a target variable \(Y\),
where \(x^{(n)}\) is the \(n\)-th sample and \(y^{(n)}\) is the \(n\)-th target value.
Then the empirical correlation coefficient of \(X\) and \(Y\) is defined as,
\[
\hat{\rho}\left(\mathcal{S}_{\{x, y\}}\right)=\frac{\sum_{n=1}^N\left(x^{(n)}-\bar{x}\right)\left(y^{(n)}-\bar{y}\right)}{\sqrt{\sum_{n=1}^N\left(x^{(n)}-\bar{x}\right)^2} \sqrt{\sum_{n=1}^N\left(y^{(n)}-\bar{y}\right)^2}}
\]
where \(\bar{x} = \frac{1}{N} \sum_{n=1}^N x^{(n)}\) and \(\bar{y} = \frac{1}{N} \sum_{n=1}^N y^{(n)}\) are the sample means of \(X\) and \(Y\) respectively.
As \(N \rightarrow \infty\), \(\hat{\rho}\left(\mathcal{S}_{\{x, y\}}\right) \rightarrow \rho(X, Y)\).
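The estimator is straightforward to implement; a minimal sketch (the helper name `empirical_corrcoef` and the simulated data are ours), cross-checked against NumPy's built-in `np.corrcoef`:

```python
import numpy as np

def empirical_corrcoef(x: np.ndarray, y: np.ndarray) -> float:
    """Empirical correlation coefficient rho-hat from the formula above."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

rng = np.random.default_rng(2)
x = rng.standard_normal(50_000)
y = 0.5 * x + rng.standard_normal(50_000)

print(empirical_corrcoef(x, y))   # matches the built-in estimate below
print(np.corrcoef(x, y)[0, 1])
```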
In order to generate some plots of correlation, we prematurely introduce the concept of the covariance matrix as a \(2 \times 2\) matrix; this scales to higher dimensions as well, which we will learn later.
Definition 137 (Covariance Matrix (2D))
Let \(X\) and \(Y\) be two random variables with sample spaces \(\Omega_X\) and \(\Omega_Y\) respectively.
Then the covariance matrix of \(X\) and \(Y\) is defined as,
\[
\boldsymbol{\Sigma}=\begin{bmatrix} \operatorname{Var}(X) & \operatorname{Cov}(X, Y) \\ \operatorname{Cov}(Y, X) & \operatorname{Var}(Y) \end{bmatrix} .
\]
In addition, we define a 2D Gaussian distribution characterized by its mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\).
Definition 138 (Multivariate Gaussian Distribution (2D))
Let \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) be the mean vector and covariance matrix of a 2D Gaussian distribution respectively.
Then the multivariate Gaussian distribution is defined as,
\[
f_{\boldsymbol{X}}(\boldsymbol{x})=\frac{1}{2 \pi \sqrt{\operatorname{det} \boldsymbol{\Sigma}}} \exp \left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right) .
\]
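A minimal sketch that evaluates this density directly and cross-checks it against `scipy.stats.multivariate_normal` (assuming SciPy is available; the parameters and query point are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

def gaussian_2d_pdf(x, mu, sigma):
    """Evaluate the 2D Gaussian density from the formula above."""
    diff = x - mu
    norm_const = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

x = np.array([0.5, -0.5])
print(gaussian_2d_pdf(x, mu, sigma))          # manual formula
print(multivariate_normal(mu, sigma).pdf(x))  # SciPy agrees
```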
We then generate some data from a 2D Gaussian distribution with the following parameters (see the sampling sketch after this list):
A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 0\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\).
A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 0.5\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\).
A bivariate Gaussian distribution with a correlation coefficient of \(\rho = 1\) can be generated by mean vector \(\boldsymbol{\mu} = [0, 0]^T\) and covariance matrix \(\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}\).
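A minimal sampling sketch for the three parameter settings above, using NumPy's `Generator.multivariate_normal`; the scatter plots are omitted here, but the empirical correlation coefficients can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
sigmas = {
    0.0: np.array([[2.0, 0.0], [0.0, 2.0]]),
    0.5: np.array([[2.0, 1.0], [1.0, 2.0]]),
    1.0: np.array([[2.0, 2.0], [2.0, 2.0]]),  # singular but still PSD
}

for rho, sigma in sigmas.items():
    samples = rng.multivariate_normal(mu, sigma, size=10_000)
    rho_hat = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
    print(f"true rho = {rho}, empirical rho = {rho_hat:.3f}")
```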
Notice that the empirical correlation coefficient \(\hat{\rho}(\mathcal{S}_{\{x, y\}})\) is close to the true correlation coefficient \(\rho(X, Y)\) when \(N\) is large.