Re-Understanding Z Score

Hongtao Hao / 2022-08-20


Ask anyone who has attended Stats101 and s/he will tell you that they understand Z-score. But, really? Could you answer the following questions without thinking?

  1. What is the mean of z-scores?
  2. What is the standard deviation of z-scores?
  3. What is the sum of squared z-scores?
  4. Is the z-score distribution the same as the original distribution of sample values?
  5. What do z-scores above 0 mean?

If you cannot answer them without thinking, then you don’t really understand z-scores.

Next, let’s gain a deeper understanding of z-scores by looking at the above questions.

The above questions are from this post .

What is z-score? #

Z-score is also called the standard score. In a one-dimensional array, i.e., a vector, the z-score of a number within this array indicates the distance between this number and the expected value of this array, i.e., the mean, measured by the standard deviation of this array.

$$z = \frac{x_i - \mu}{\sigma}$$

Where $x_i$ is a number in an array, $\mu$ is the mean of this array and $\sigma$ is the standard deviation.

Before we can calcuate z-scores,we need to calculate the mean and the standard deviation:

import math 

def my_mean(array):
    return sum(array)/len(array)

def my_std(array):
    mn = my_mean(array)
    my_sum = 0
    for i in array:
        my_sum += (i - mn)**2
    return math.sqrt(my_sum/len(array))
# an example, 
a = [1,4,6,8,10]
my_mean(a)
5.8
my_std(a)
3.1240998703626617
def zscores(array):
    mn = my_mean(array)
    std = my_std(array)
    return [(i-mn)/std for i in array]
z_scores = zscores(a)
z_scores
[-1.5364425591947517,
 -0.5761659596980319,
 0.06401843996644804,
 0.7042028396309279,
 1.3443872392954077]

The sum and the mean of z-scores #

The sum, and the mean of z-scores are always zero. Why?

$$\sum_{i=1}^n z_i = \sum_{i=1}^n \frac{x_i - \mu}{\sigma} = \frac{\sum_{i=1}^n (x_i - \mu)}{\sigma}$$

We have:

$$$$

$$\sum_{i=1}^n (x_i - \mu) = \sum_{i=1}^n x_i - n\cdot \mu$$

Because

$$\mu = \frac{\sum_{i=1}^n x_i}{n}$$

We have

$$\sum_{i=1}^n (x_i - \mu) = 0$$

So the sum of z-scores is zero. When the sum is zero, the mean is of course zero as well.

The standard deviation of z-scores #

Let’s calculate the standard devitaiton of z-scores.

$$\sigma_z = \sqrt{\frac{\sum_{i=1}^n (z_i - E(z))^2}{n}}$$

Because $E(z) = 0$, we have:

$$\sigma_z = \sqrt{\frac{\sum_{i=1}^n (z_i)^2}{n}}$$

This, in fact, leads to our third question:

The sum of squared z-scores #

$$\sum_{i=1}^n (z_i)^2 = \sum_{i=1}^n \frac{(x_i - \mu)^2}{\sigma^2} = \frac{\sum_{i=1}^n (x_i - \mu)^2}{\sigma^2}$$

Because

$$\sigma = \sqrt{\frac{\sum_{i=1}^n{(x_i - \mu)^2}}{n}}$$

So we have

$$\sigma^2 = \frac{\sum_{i=1}^n{(x_i - \mu)^2}}{n}$$

Therefore

$$\sum_{i=1}^n (z_i)^2 = n$$

That is to say, the sum of squared z-scores is the number of items in an array.

Then, go back to the standard deviation of z-scores, we can know that

$$\sigma_z = \sqrt{\frac{\sum_{i=1}^n (z_i)^2}{n}} = \sqrt{\frac{n}{n}} = 1$$

The distribution of z-scores #

The distribution of z-scores is the same as the original array. I think of it this way: the z-scores are the result of original numbers moving leftward horizontally by $\mu$ and squeezed vertically by $\sigma$. Because they all move together, their relative positions stay the same. And that’s why their distributions are also the same.

The meaning of the sign of z-scores #

From the formula of z-scores, we know that if a z-score is above zero, it means the corresponding number in the array is bigger than the mean of the array. If a z-scores is below zero, it means the corresponding number in the array is smaller than the mean.

#ML

Last modified on 2023-01-18