Ask anyone who has taken Stats 101 and they will tell you that they understand z-scores. But do they really? Could you answer the following questions without thinking?
- What is the mean of z-scores?
- What is the standard deviation of z-scores?
- What is the sum of squared z-scores?
- Is the z-score distribution the same as the original distribution of sample values?
- What do z-scores above 0 mean?
If you cannot answer them without thinking, then you don’t really understand z-scores.
Next, let’s gain a deeper understanding of z-scores by looking at the above questions.
The above questions are from this post.
What is a z-score? #
A z-score is also called a standard score. In a one-dimensional array, i.e., a vector, the z-score of a number tells you how far that number lies from the mean of the array, measured in units of the array's standard deviation.
$$z = \frac{x_i - \mu}{\sigma}$$
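where $x_i$ is a number in the array, $\mu$ is the mean of the array, and $\sigma$ is its standard deviation. Throughout this post, $\sigma$ is the population standard deviation, i.e., the one that divides by $n$.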
Before we can calculate z-scores, we need to calculate the mean and the standard deviation:
import math

def my_mean(array):
    # arithmetic mean of the array
    return sum(array) / len(array)

def my_std(array):
    # population standard deviation: divides by n, not n - 1
    mn = my_mean(array)
    my_sum = 0
    for i in array:
        my_sum += (i - mn) ** 2
    return math.sqrt(my_sum / len(array))
# an example,
a = [1,4,6,8,10]
my_mean(a)
5.8
my_std(a)
3.1240998703626617
def zscores(array):
    mn = my_mean(array)
    std = my_std(array)
    return [(i - mn) / std for i in array]
z_scores = zscores(a)
z_scores
[-1.5364425591947517,
-0.5761659596980319,
0.06401843996644804,
0.7042028396309279,
1.3443872392954077]
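As a sanity check, the same z-scores can be reproduced with Python's standard `statistics` module instead of the hand-written helpers. This is only a sketch: it assumes the population standard deviation (`statistics.pstdev`, which divides by n) is the one intended, since that matches the helper defined above.

```python
import math
import statistics

a = [1, 4, 6, 8, 10]

# Population statistics from the standard library (both divide by n).
mu = statistics.fmean(a)
sigma = statistics.pstdev(a)

z_scores = [(x - mu) / sigma for x in a]

# Values printed by the hand-written zscores() function above.
expected = [
    -1.5364425591947517,
    -0.5761659596980319,
    0.06401843996644804,
    0.7042028396309279,
    1.3443872392954077,
]

# Compare with a tolerance, because the two implementations may round differently.
matches = all(math.isclose(z, e, rel_tol=1e-9) for z, e in zip(z_scores, expected))
print(matches)
```

Comparing with `math.isclose` rather than `==` avoids false alarms from last-digit floating-point differences.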
The sum and the mean of z-scores #
The sum and the mean of z-scores are always zero. Why?
$$\sum_{i=1}^n z_i = \sum_{i=1}^n \frac{x_i - \mu}{\sigma} = \frac{\sum_{i=1}^n (x_i - \mu)}{\sigma}$$
We have:
$$\sum_{i=1}^n (x_i - \mu) = \sum_{i=1}^n x_i - n\cdot \mu$$
Because
$$\mu = \frac{\sum_{i=1}^n x_i}{n}$$
we have $n\cdot \mu = \sum_{i=1}^n x_i$, and therefore
$$\sum_{i=1}^n (x_i - \mu) = 0$$
So the sum of z-scores is zero. When the sum is zero, the mean is of course zero as well.
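The derivation can be checked numerically. The sketch below redefines a small `zscores` helper so that it runs on its own. It assumes the population standard deviation and a non-constant array (for a constant array $\sigma$ is zero and z-scores are undefined).

```python
import math
import statistics

def zscores(array):
    mu = statistics.fmean(array)
    sigma = statistics.pstdev(array)  # population std, divides by n
    return [(x - mu) / sigma for x in array]

data = [1, 4, 6, 8, 10]
z = zscores(data)

total = sum(z)
mean_z = total / len(z)

# Floating-point rounding can leave a tiny nonzero residue,
# so compare against zero with an absolute tolerance instead of ==.
print(math.isclose(total, 0.0, abs_tol=1e-12))
print(math.isclose(mean_z, 0.0, abs_tol=1e-12))
```

In exact arithmetic both values are exactly zero. In floating point they are typically on the order of 1e-16, which is why the check uses a tolerance.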
The standard deviation of z-scores #
Let’s calculate the standard deviation of z-scores.
$$\sigma_z = \sqrt{\frac{\sum_{i=1}^n (z_i - E(z))^2}{n}}$$
Because $E(z) = 0$, we have:
$$\sigma_z = \sqrt{\frac{\sum_{i=1}^n (z_i)^2}{n}}$$
This, in fact, leads to our third question:
The sum of squared z-scores #
$$\sum_{i=1}^n (z_i)^2 = \sum_{i=1}^n \frac{(x_i - \mu)^2}{\sigma^2} = \frac{\sum_{i=1}^n (x_i - \mu)^2}{\sigma^2}$$
By definition,
$$\sigma = \sqrt{\frac{\sum_{i=1}^n{(x_i - \mu)^2}}{n}}$$
So we have
$$\sigma^2 = \frac{\sum_{i=1}^n{(x_i - \mu)^2}}{n}$$
Multiplying both sides by $n$ gives $\sum_{i=1}^n (x_i - \mu)^2 = n\sigma^2$. Therefore
$$\sum_{i=1}^n (z_i)^2 = n$$
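That is to say, the sum of squared z-scores equals the number of items in the array. Note that this relies on $\sigma$ being the population standard deviation (dividing by $n$). With the sample standard deviation (dividing by $n - 1$), the sum would be $n - 1$ instead.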
Going back to the standard deviation of z-scores, we get
$$\sigma_z = \sqrt{\frac{\sum_{i=1}^n (z_i)^2}{n}} = \sqrt{\frac{n}{n}} = 1$$
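Both results can be verified on the example array. The sketch below also computes z-scores with the sample standard deviation (`statistics.stdev`, dividing by n - 1), purely for contrast. That variant is not the definition used in this post, and it shows that the sum of squares then becomes n - 1 rather than n.

```python
import math
import statistics

data = [1, 4, 6, 8, 10]
n = len(data)
mu = statistics.fmean(data)

# Population std (divide by n): the definition used in this post.
sigma_pop = statistics.pstdev(data)
z_pop = [(x - mu) / sigma_pop for x in data]
sum_sq_pop = sum(z * z for z in z_pop)   # close to n
std_z = math.sqrt(sum_sq_pop / n)        # close to 1

# Sample std (divide by n - 1): shown only for contrast.
sigma_samp = statistics.stdev(data)
z_samp = [(x - mu) / sigma_samp for x in data]
sum_sq_samp = sum(z * z for z in z_samp)  # close to n - 1

print(sum_sq_pop, std_z, sum_sq_samp)
```

The printed values may differ from 5, 1, and 4 in the last few digits because of floating-point rounding.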
The distribution of z-scores #
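The distribution of z-scores has the same shape as the distribution of the original array, although its location and scale differ: the mean becomes 0 and the standard deviation becomes 1. I think of it this way: the z-scores are the original numbers shifted along the number line by $\mu$ and then rescaled (squeezed or stretched horizontally) by a factor of $1/\sigma$. Because every number is shifted and rescaled in the same way, their order and relative spacing stay the same, and that is why the shape of the distribution is preserved. In particular, computing z-scores does not turn skewed data into normally distributed data.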
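One way to see this numerically is to compare a shape statistic before and after standardizing. The sketch below uses a small right-skewed array (hypothetical data chosen for illustration) and a hand-written population skewness helper. Both are assumptions made for this example, not part of the original post.

```python
import math
import statistics

def zscores(array):
    mu = statistics.fmean(array)
    sigma = statistics.pstdev(array)  # population std, divides by n
    return [(x - mu) / sigma for x in array]

def skewness(array):
    # Population skewness: the mean of the cubed standardized deviations.
    mu = statistics.fmean(array)
    sigma = statistics.pstdev(array)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in array])

# Hypothetical right-skewed data, already sorted in ascending order.
data = [1, 2, 2, 3, 3, 3, 4, 10, 25]
z = zscores(data)

skew_x = skewness(data)
skew_z = skewness(z)

# The order of the values is unchanged by standardizing.
order_preserved = z == sorted(z)

print(order_preserved)
print(math.isclose(skew_x, skew_z, rel_tol=1e-9))
```

The skewness is the same before and after, so the data stay right-skewed. Only the mean (now about 0) and the standard deviation (now about 1) change.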
The meaning of the sign of z-scores #
From the formula for z-scores, we know that if a z-score is above zero, the corresponding number in the array is larger than the mean of the array. If a z-score is below zero, the corresponding number is smaller than the mean, and a z-score of exactly zero means the number equals the mean. This holds because $\sigma$ is positive, so the sign of $z$ is the sign of $x_i - \mu$.
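A short sketch makes the sign rule concrete on the example array used earlier. The variable names are illustrative, and the population standard deviation is assumed.

```python
import statistics

data = [1, 4, 6, 8, 10]
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)  # population std, divides by n

for x in data:
    z = (x - mu) / sigma
    # The sign of z depends only on whether x is above or below the mean.
    if z > 0:
        side = "above"
    elif z < 0:
        side = "below"
    else:
        side = "equal to"
    print(f"{x:>2}: z = {z:+.3f} -> {side} the mean")
```

Here 1 and 4 fall below the mean of 5.8 and get negative z-scores, while 6, 8, and 10 fall above it and get positive ones.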
#ML
Last modified on 2022-10-01