Introduction to Statistics – Demo: Viewing Summary Statistics

Demo: Viewing Summary Statistics

For a demonstration of how to view summary statistics using R, watch the first video in this topic. For a Python-based demonstration, scroll down and view the second video.

Video Transcript – Viewing Summary Statistics in R

So Cynthia has been discussing how summary statistics are computed, with the formulas, and a little bit about how we interpret them. In this demo I’d like to show you how we actually compute the summary statistics using R, and we’ll talk about some practical issues of what we can do to interpret summary statistics.
On my screen here I have a notebook which I’ve started in Azure Machine Learning, and this first cell contains the auto-generated code which takes my data set, automobile price data raw.csv, and gives me back an R data frame. So let me just run that code. There, so we have the data frame loaded; it’s named dat.
In another auto-generated piece of code we can see the head, so basically the first five rows of that data frame. There are about 200 automobiles in this data set, and I’m not going to talk about every column at this point; let’s just talk about the ones we care about. This column called horsepower gives you the horsepower of the car’s engine, city miles per gallon gives you the miles per gallon of that car in city driving, and price is the price of that car. Simple enough.
But there is a problem: there are missing values in some of these numeric columns. So this code basically iterates over these numeric columns. We’re looking for a text string, "?", that indicates a missing value, and we’re going to replace it with an R NA value. Then there’s an R function called complete.cases: it returns TRUE if the row has no missing values and FALSE if the row has a missing value. So we wind up with a clean data frame, and then we make sure we have coerced all those columns to numeric. So let me run this for you. Oh, and then we’re going to run the str function on that, so we’ll see a summary of some of those data.
All right, so there’s the summary of these columns in the data. The ones we care about: horsepower is numeric, and it looks like the values are all integers; city miles per gallon is actually an integer column, and the values look like integers; price is numeric. So those are the three columns that we care about. We’re not going to look at the rest of these just now.
And we can compute some summary statistics. There’s a summary function in R, actually called summary. So we select the column we want from the data frame and we run summary on it. I’m doing some things to add the standard deviation, which doesn’t normally show up in the summary statistics the way R does it, and to give it some proper names. We’ll just do that for one column; we’ll do it for price. And there we have it.
So first off let’s look at the mean. The mean is just a little over 13,000, but notice that the median is just barely over 10,000, so it’s much lower, and that discrepancy between mean and median indicates that we probably have some asymmetry in this distribution. We also have a pretty wide standard deviation, about 8,000 on a mean of 13,000, so it indicates a wide spread. And you can see the minimum is only about 5,000, whereas our most expensive car in this data set is over 45,000, so again indicating there’s a wide range of data values, a wide spread or dispersion of those data. And if you look at the first and third quartiles, so the 25% and 75% values, you see there’s quite a bit of asymmetry in their distances from the median. So we’re expecting a distribution where there are more cheaper cars and fewer expensive cars, just from examining those summary statistics.
And we can check that visually. In this code here I’m using the ggplot2 package, which we’ll talk about in future lessons, but I’m going to make a histogram and I’m going to make a box plot, and we’re just going to lay those one on top of the other, in two rows basically. So let me run that for you.
All right, so let’s start with our box plot. Cynthia discussed the box plot. The dark line here is our median value, which as we expect is just over 10,000. And look at this lower quartile: it’s pretty narrow compared to this upper quartile, again very much indicating strong asymmetry. And this whisker is pretty short here, down to the minimum value, which was a little over 5,000, whereas this whisker is quite long, as it’s one and a half times the interquartile range, which is as long as the whisker can be. And then we have some cars that are really quite expensive; you see these few outlier cars, very expensive cars, are shown as the dots.
We can get a different view of that from the histogram. You see the most frequent price of cars is in here, maybe around eight thousand dollars or something, so relatively inexpensive cars. There are a lot of inexpensive cars, and the distribution tapers off, and we only have these few cars that are over thirty thousand dollars; those correspond to our outliers here.
And we can look at one other column, engine size, which we’ll use later. We see it’s a little less asymmetric, because the range of this lower quartile is only slightly smaller than the range of this upper quartile, and the lower whisker is only slightly shorter than the upper whisker. We still do have a few outliers, and the histogram helps confirm that idea. You see we have a few cars with very small engines, we have kind of a cluster of cars around a hundred cubic inches, plus or minus, and we have a few cars with really large engines.
So I hope that this demo has given you some idea of the practical aspects of just computing summary statistics, but also of how, when you look at a data set and you’re trying to understand the variables, a few summary statistics can really give you some initial guidance into some aspects of the behavior of those variables.

Video Transcript – Viewing Summary Statistics in Python

So Cynthia has been discussing summary statistics, how we compute summary statistics and how we interpret summary statistics. In this demo I’m going to show you some actual code where we’re computing summary statistics; specifically, we’re going to use some Python to compute some summary statistics, and we’ll talk a little bit about what they mean.
On my screen here I have my notebook that I started in Azure Machine Learning, and this first cell is the auto-generated code from Azure Machine Learning when I started the notebook from this automobile price data raw.csv data set. OK, so let me just load that; that’ll get us a data frame. There we go. The auto-generated code would just give you the whole frame; it would just have the word frame here, which is the name of our pandas DataFrame, but I’ve added this .head so we just see the first five rows.
There we go. I’m not going to discuss all these columns, but suffice it to say there are about 200 automobiles in this sample, and what we’re going to look at is only the horsepower of each car, the city miles per gallon of each car, and the price of each car. So we’re going to do some summary statistics on those three columns. But there’s a little issue: there are some missing values there. So this code here, for some of these numeric columns, looks through and finds where the value is coded as a "?", a text string "?", and then we remove those rows. We drop those rows, so we’re just going to throw them out, not going to worry about them, and then we have to coerce the columns back to numeric. So that’s all we’re doing there.
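
The notebook code itself is not reproduced in the transcript, so the following is only a minimal sketch of the cleaning step as described: find rows where a value is the text string "?", drop them, and convert the columns back to numeric. The tiny sample frame, the column names, and the helper name clean_auto_data are assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample standing in for the automobile data. The column
# names are assumptions chosen to match the columns named in the demo.
frame = pd.DataFrame({
    "horsepower": ["111", "154", "?", "102"],
    "city-mpg": [21, 19, 24, 24],
    "price": ["13495", "16500", "13950", "?"],
})

def clean_auto_data(df, cols):
    """Drop rows where any of `cols` is coded as '?', then coerce to numeric."""
    # Compare as text so numeric and text columns are handled the same way.
    is_missing = df[cols].astype(str).eq("?").any(axis=1)
    # Throw out every row with a missing value in the columns of interest.
    df = df.loc[~is_missing].copy()
    for col in cols:
        # The affected columns were read as text, so convert them to numbers.
        df[col] = pd.to_numeric(df[col])
    return df

cols = ["horsepower", "city-mpg", "price"]
clean = clean_auto_data(frame, cols)
clean.info()
```

Calling info on the cleaned frame, as in the demo, is a quick way to confirm the non-null counts and that the columns now have numeric types.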
Let me run that for you. I printed what’s called the info on that data frame, and let’s just scroll to the bottom. So which are the columns we care about? We’ve got horsepower, and we see there are 195 values that are non-null, and it’s an integer. We have city miles per gallon, which again has 195 non-null values and is also an integer, and our price, which is the same as the others, also an integer. So that’s what we’re going to work on, those three.
So let’s start by using the pandas describe method here, which computes summary statistics. For this little function, we give it the name of a data frame when we call it, and we give it the column name that we want the summary statistics for, in this case price. There’s a little bit of managing here just to get the name median in our list, so that’s just a detail. I’ve computed those summary statistics, and you can see, as we already knew, there are 195 cases.
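
As a rough illustration of the kind of helper described here, the sketch below calls the pandas describe method on one column and relabels the "50%" row as "median". The function name describe_column and the sample prices are hypothetical, and the demo's own function may well differ.

```python
import pandas as pd

def describe_column(df, col):
    """Summary statistics for one column, with the 50% row labeled 'median'."""
    stats = df[col].describe()
    # describe() labels the median '50%'; rename it so it reads 'median'.
    return stats.rename({"50%": "median"})

# Made-up prices for illustration only, not the values from the demo data.
prices = pd.DataFrame({"price": [5000, 7000, 8000, 10000, 35000]})
summary = describe_column(prices, "price")
print(summary)
```

In this made-up sample the single expensive car pulls the mean well above the median, which is the same kind of discrepancy the demo points to as a sign of asymmetry.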
The mean is a little over thirteen thousand dollars; the median is quite a bit less, it’s just barely ten thousand dollars, so that would indicate to me that there’s quite a lot of asymmetry in the distribution of the price of these automobiles. The standard deviation is about eight thousand dollars or so, so we expect quite a spread of auto prices. The minimum is a little over five thousand dollars and the maximum is over 45,000 dollars, so again confirming we have a widely spread, asymmetric distribution here. And if you look at these 25% and 75% quantiles, that also confirms that initial hypothesis.
But we can also visualize that distribution, and I’ve just got a little bit of code here to do that. All I’m going to do is compute a histogram and a box plot, and I’m going to stack the two one on top of the other. So that’s all that code is doing, and we’ll talk later about the details of Python plotting, but let me just run that for you.
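
The plotting code is not shown in the transcript. One way to stack a histogram on top of a box plot, assuming matplotlib is the plotting library, might look like the sketch below. The function name plot_hist_and_box, the bin count, the axis labels, and the sample prices are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

def plot_hist_and_box(values, label):
    """Draw a histogram in the top row and a box plot in the bottom row."""
    fig, (ax_hist, ax_box) = plt.subplots(2, 1, figsize=(8, 8))
    ax_hist.hist(values, bins=10)
    ax_hist.set_xlabel(label)
    ax_hist.set_ylabel("Number of cars")
    # Default matplotlib box plot: whiskers reach the most extreme points
    # within 1.5 times the interquartile range; points beyond are drawn
    # individually as outliers.
    ax_box.boxplot(values)
    ax_box.set_ylabel(label)
    return fig

# Made-up prices for illustration only.
prices = [5000, 6500, 7000, 7500, 8000, 9000, 10000, 13000, 17000, 35000, 45000]
fig = plot_hist_and_box(prices, "price")
```

In a notebook the Agg line would normally be dropped so the figure renders inline.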
And there you have it. So, as Cynthia discussed, in the box plot we’ve got the median shown by the red bar here, and we’ve got the lower quartile here and the upper quartile here around the median. So again there’s definitely asymmetry: look, this range here is much shorter than this range there. And we have a short whisker here and a long whisker there, going out to one and a half times the interquartile range, and then we have a bunch of outliers. So basically we have a lot of relatively inexpensive cars and a few very expensive cars.
If we look at the histogram we get a different view of that, which basically has the same interpretation. You can see the most frequent auto price is probably around seven or eight thousand dollars in this data set, and only a few cars are, say, over thirty thousand dollars, so it’s very asymmetric. And of course, because it’s the price of cars, there are no cars at or near zero.
And so let’s look at one other statistic here; we’ll just look at engine size. We’re just going to plot that, and again it’s a little more symmetric looking in the box plot, but again we have a few outliers here. And if we look at that histogram we can see that the most frequent value is probably in the low hundreds of horsepower, a hundred and something, a hundred and twenty horsepower maybe. There are very few cars with extremely low horsepower, and just a few cars with very high horsepower.
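
The whisker rule mentioned for the box plots above can be made concrete with a small calculation. This is a sketch under a common convention, in which each whisker ends at the most extreme data point within 1.5 times the interquartile range of the box and anything beyond is treated as an outlier. The function name box_plot_limits and the sample prices are hypothetical.

```python
import statistics

def box_plot_limits(values):
    """Return quartiles, whisker ends, and outliers under the 1.5 x IQR rule."""
    # 'inclusive' interpolates linearly between the sorted data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }

# Made-up prices for illustration only.
prices = [5000, 6500, 7000, 7500, 8000, 9000, 10000, 13000, 17000, 35000, 45000]
limits = box_plot_limits(prices)
print(limits)
```

For these made-up prices the lower whisker is short because it stops at the minimum, the upper whisker is longer, and the two most expensive cars fall outside the upper fence, which mirrors the shape described in the demo.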
So I hope that gives you some idea of how to compute some summary statistics in practice, and how we think about them to do an initial exploration of some data.
