Introduction to Statistics – Summary Statistics

Summary Statistics

Video Transcript

So we just discussed visual ways to summarize data,
now let’s discuss numerical ways to summarize data.
So, I’ll present some summary statistics
starting with the sample mean.
So, the sample mean is just the average of the data points you
have, and you shouldn’t confuse x-bar with mu.
Usually, mu is the mean of the whole distribution,
whereas x-bar is just the average of the points
that we draw from the distribution.
So just be careful, because everything in these slides
is a quantity that we compute from a sample of the data,
not the computation that we would
do if we had the underlying population or the PDF.
As long as the sample’s large enough, though,
as long as n is large enough, x-bar is typically very,
very close to mu, the population mean.
So that’s called the sample mean.
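To make the x-bar versus mu distinction concrete, here is a minimal Python sketch. The normal distribution, the values mu = 4 and sigma = 2, the seed, and the sample sizes are all assumptions chosen for illustration; none of them come from the lecture.

```python
import random

random.seed(0)          # fixed seed so the run is repeatable
MU, SIGMA = 4.0, 2.0    # assumed population mean and standard deviation

def sample_mean(xs):
    """x-bar: the plain average of the observed data points."""
    return sum(xs) / len(xs)

gaps = {}
for n in (10, 1_000, 100_000):
    # Draw n points from the assumed population and average them.
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    xbar = sample_mean(sample)
    gaps[n] = abs(xbar - MU)
    print(f"n={n:>6}  x-bar={xbar:.4f}  |x-bar - mu|={gaps[n]:.4f}")
```

The gap usually shrinks as n grows, though any single small sample can land close to mu by luck.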
And then quantiles and
percentiles are also useful summary statistics.
And to compute those, you’d order the data, and
then you find which data point is the one that divides
the data into two parts, where a fraction p of
the data values are less than or equal to this guy, and
the remaining fraction, 1 - p, are greater than it.
Okay, so just to lay it out graphically:
you put the data points in order from smallest to largest.
And then you find the point, let’s say you’re looking for
the 0.55 quantile or the 55th percentile, you look for
the point where 55% of the data are below it and
the other 45% are above it.
And that’s your point, that’s that quantile, okay?
And then the sample median is just the middle data point.
You can also use the definition of quantiles and
percentiles to get to it, because it’s the 50th percentile, or
the 0.5 quantile.
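The ordering recipe above can be sketched in a few lines of Python. The helper name percentile and the toy data are hypothetical, and the rule used here (take the smallest data value that has at least the requested share of the data at or below it) is only one of several conventions. Library routines often interpolate between neighboring points and can return slightly different numbers.

```python
import statistics

def percentile(data, pct):
    """Smallest data value with at least pct percent of the data <= it.

    pct is a whole number from 0 to 100 in this sketch.
    """
    ordered = sorted(data)                      # smallest to largest
    k = max(1, -(-pct * len(ordered) // 100))   # integer ceiling of pct*n/100
    return ordered[k - 1]

# Toy data: the integers 1..20 in scrambled order.
data = [7, 1, 9, 3, 5, 8, 2, 10, 4, 6, 12, 15, 11, 14, 13, 20, 18, 16, 19, 17]

p55 = percentile(data, 55)      # 55% of the values are at or below this point
median = percentile(data, 50)   # the 50th percentile is the median
print(p55, median, statistics.median_low(data))
```

With an even number of points there is no single middle value. This sketch, like statistics.median_low, takes the lower of the two middle points, whereas statistics.median would average them.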
Now, I’ve put up the formula for the sample variance here, and
there’s something strange going on here.
So this part you would expect, right?
This looks very similar to the definition of the population
variance.
You take each point, you subtract something like a mean,
and you square it and you add it up.
But you’d think this would be a 1/n here, wouldn’t you?
But, surprise, it’s n-1.
So what is that n-1 doing there?
And the reason for this is actually very technical,
it has to do with something called bias.
If we put a 1/n there, the sample variance would be what’s
called a biased estimator of the true population variance.
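For anyone who does want to see the bias, a small exact check is possible. The sketch below assumes a toy population of three values and lists every possible sample of size 2 drawn with replacement; the population, the sample size, and the helper name spread are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

population = [0, 2, 4]   # assumed toy population
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N   # true population variance

def spread(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by divisor."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample) / divisor

n = 2
samples = list(product(population, repeat=n))   # every equally likely sample

# Average each estimator over all possible samples.
avg_with_n = sum(spread(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(spread(s, n - 1) for s in samples) / len(samples)

print("population variance:", pop_var)
print("average using 1/n:", avg_with_n)
print("average using 1/(n-1):", avg_with_n_minus_1)
```

Averaged over all samples, the 1/(n-1) version matches the population variance exactly, while the 1/n version comes out too small by a factor of (n-1)/n.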
You shouldn’t worry about this, though;
just let the software do it for you.
Every piece of software that I have ever used has had a command
to compute the sample variance.
You just give it the data, and you ask it for the variance, and
it will remember this 1/(n-1) and give you
the sample variance, so that you don’t have to remember it.
And then luckily, the sample standard deviation needs nothing new,
it’s just the square root of the sample variance.
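As one example of letting the software do it, here is a short sketch using Python's standard statistics module, checked against the formula s^2 = (1/(n-1)) * sum of (x_i - x-bar)^2. The data values are made up for illustration.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample
n = len(data)
xbar = sum(data) / n

# Sample variance by hand, dividing by n - 1.
manual_var = sum((x - xbar) ** 2 for x in data) / (n - 1)

lib_var = statistics.variance(data)    # sample variance, uses n - 1
lib_sd = statistics.stdev(data)        # sample standard deviation
pop_var = statistics.pvariance(data)   # population formula, divides by n

print(manual_var, lib_var, lib_sd, pop_var)
```

Other tools have different defaults. NumPy's var, for instance, divides by n unless ddof=1 is passed, so it is worth checking which convention a given command uses.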
I have 500 points, they’re in this giant vector,
which I didn’t feel like writing out in full.
And I decided to make a histogram.
So I type my histogram command, and this is what I got.
And now, I just want to summarize the important bits.
I don’t wanna preserve all of this detail in here which I
don’t necessarily need.
So one thing I could look at is the five number summary.
Let me break down for you what this five number summary is.
The range is the distance between the minimum and
the maximum.
No surprise.
The interquartile range is the distance between
the first quartile, which is the 25th percentile,
and then the 75th percentile, which is the third quartile.
Now that you know how to compute percentiles,
you can compute these things.
Okay, so the five number summary is actually the smallest value
and the largest value, the median, which is Q2,
and then Q1 and Q3,
which are the 25th percentile and the 75th percentile.
So once you’ve got that,
that tells you a lot about the distribution.
So the five number summary is this collection of five values:
the minimum and the maximum,
the median, which is the 50th percentile,
and then Q1 and Q3, which are the 25th percentile and
the 75th percentile.
So it’s a nice simple way to characterize the distribution.
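The five number summary, the range, and the interquartile range can be computed with a few standard-library calls. This is a sketch on made-up data. statistics.quantiles interpolates between data points by default, so other tools may report slightly different quartiles on the same data.

```python
import statistics

data = [7, 3, 11, 1, 9, 5, 2, 10, 4, 8, 6]   # made-up sample

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)

five_number_summary = {
    "min": min(data),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(data),
}
data_range = max(data) - min(data)   # distance from minimum to maximum
iqr = q3 - q1                        # interquartile range

print(five_number_summary)
print("range:", data_range, "IQR:", iqr)
```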
Now, back to the visualizations.
A box plot is a great visual way to represent a distribution;
it has less information than the full distribution.
This is the histogram, the full distribution.
Here’s the box plot.
And it just conveys a summary of the information.
Now, you can see that the center is the same, right?
The middle is at about 4, and so that’s where the median is.
This is the median over here.
And if I’m okay with summarizing a probability distribution
like this, then I can get away with plotting many
distributions succinctly on the same page.
Cuz it’s much easier to plot 50 of these things next to each
other than 50 of those things.
It’s just visually easier for humans to understand.
But how to get from the distribution to the box plot is
a little bit weird, so let me tell you about that.
Okay, so first, the middle bit is the median,
as I mentioned, it’s not the mean, it’s the median.
And strangely enough,
the mean usually doesn’t get marked on a box plot.
I’m not sure why not.
Some people mark it on there, which I think is a good idea.
But just because I think it’s a good idea doesn’t mean that
people who write visualization software think it’s a good idea.
But in any case, that is the median.
And then the first quartile is marked and
the third quartile is marked.
This main box just tells you where the quartiles are.
By the way, don’t get the word quartile mixed up with quantile.
The quartiles split the data into just four quarters.
Quartile sounds like quarter, okay?
But there’s an arbitrary number of quantiles.
And then, the whiskers are more complicated.
So for this point here,
you take the interquartile range, which is Q3 minus Q1,
you multiply that by 1.5 and you add it to Q3.
Okay, so it’s Q3 + 1.5 times the interquartile range,
which is Q3 minus Q1.
And then you find the point,
the data point that’s actually in your data,
that is just below that, okay?
It’s a very weird set of instructions,
but that’s what it is.
They want these whiskers to be exactly at a data point.
So they have you go up 1.5 times the IQR, and
then down to the nearest data point.
And then this bottom one,
analogously, is the data point just above
Q1 minus 1.5 times the IQR.
Okay, and then
any other point outside the whiskers gets marked with a plus.
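The whisker instructions are easier to follow as code. The sketch below applies the 1.5 times IQR rule described above to made-up data with one extreme value. The variable names are hypothetical, and plotting libraries may compute the quartiles slightly differently.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]   # toy data with one extreme value

q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr   # go up 1.5 * IQR from Q3
lower_fence = q1 - 1.5 * iqr   # go down 1.5 * IQR from Q1

# Whiskers stop at actual data points inside the fences.
upper_whisker = max(x for x in data if x <= upper_fence)
lower_whisker = min(x for x in data if x >= lower_fence)

# Anything beyond the whiskers is drawn as an individual marker.
outliers = [x for x in data if x < lower_whisker or x > upper_whisker]

print(lower_whisker, q1, q2, q3, upper_whisker, outliers)
```

In this toy data the upper whisker stops at 10, and the largest value, 30, shows up as a separate marker rather than as the end of the whisker.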
So the five number summary is here.
One, two, three, four, and five.
So you’ve got the min and the max marked.
Also, the first, second, and third quartiles.
And the nice thing about these box plots, as I mentioned, is
that you can visually understand the distribution very quickly.
You don’t need all the little wiggles in the histogram
to understand what’s going on.
And you can pack several box plots into a figure
and have many of them on the same screen, because they’re so
compact.