Probability and Random Variables – Discrete Random Variables
Video transcript
- Let’s talk about discrete random variables.
- Now, the kind of quintessential discrete random
- variable is the value on a die when you roll it.
- Now, the name of the random variable is X,
- it’s the value of that die when you roll it.
- Now, little x is gonna be an outcome.
- So here it’s gonna take values either 1, 2, 3, 4, 5 or 6.
- And then the probability that random variable X takes
- outcome value little x is denoted this way, okay?
- So this is the probability that random variable
- X will have outcome little x.
- And the probabilities have to sum to 1.
- But for the roll of a die, the outcomes are equally probable,
- they all have probability 1/6.
- So make sure you understand the notation here, okay?
- So this says that the probability
- that random variable X takes outcome 4 is 1/6.
- And this table is called the probability mass function.
- It tells you, for each outcome, what is
- the probability of attaining that outcome.
- Now, here the probability mass function is constant,
- it’s always 1/6.
- But there are other random variables where it’s not
- constant.
- So here, for instance, is a PMF for a very weird die.
- This die, instead of having values 1, 2, 3, 4, 5, or
- 6 on its sides, it has 10, 20, 30, 40, 50, and 60.
- And it’s also a weighted die,
- where not all the probabilities are 1/6.
- So here, one of the probabilities is 3/12 and
- another is 1/12.
- So just double check for yourself that the probabilities
- all add up to 1, and it looks like they do.
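The "add up to 1" check can be sketched in a few lines of Python. The tables shown on the slides are not part of the transcript, so the weighted die below is an assumption: the speaker only says that one probability is 3/12 and another is 1/12, and which sides they belong to is a guess made for illustration. The helper name `is_valid_pmf` is also hypothetical.

```python
from fractions import Fraction

# Fair six-sided die: every outcome 1..6 has probability 1/6.
fair_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Weighted die with sides 10..60. Only "one probability is 3/12 and another
# is 1/12" comes from the transcript; the placement on 20 and 50 is assumed.
weighted_pmf = {
    10: Fraction(2, 12),
    20: Fraction(3, 12),  # assumed heavier side
    30: Fraction(2, 12),
    40: Fraction(2, 12),
    50: Fraction(1, 12),  # assumed lighter side
    60: Fraction(2, 12),
}

def is_valid_pmf(pmf):
    """Return True if no probability is negative and they sum to exactly 1."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

print(fair_pmf[4])                 # P(X = 4) for the fair die
print(is_valid_pmf(fair_pmf))
print(is_valid_pmf(weighted_pmf))
```

Using `Fraction` rather than floats keeps the sum exact, so the comparison with 1 does not need a tolerance.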
- Well, okay, so let’s go back and review the general notation.
- So the random variable is called capital X, okay, and
- the possible outcomes are called little x, and the probability
- for outcome little x is denoted like this, okay?
- So, this is the probability that random variable X equals
- outcome little x, little x3, say, and that equals P3.
- Okay, so when I’m talking about a discrete random variable,
- I am talking about its PMF.
- You cannot refer to a discrete random variable without
- at least thinking about what its PMF is because that’s
- what defines probability in this context.
- Okay.
- So, armed with this notation, let us talk about how to
- summarize a discrete random variable.
- Now, it’s important that when I refer to a random variable,
- I’m referring to its probability mass function.
- Let us talk about how one might summarize a PMF
- without having to present the whole thing.
- The whole thing could be very big,
- it could be very overwhelming, and we just want one or
- two numbers that really summarize what it looks like.
- Okay, so how are we gonna compute the mean of the PMF,
- mean of the random variable, and
- how are we gonna compute the spread of it?
- So let’s talk about the mean.
- The mean is a measure of centrality for
- the probability mass function.
- What is the middle number of the PMF?
- So we’ll use this die’s PMF, which is constant here.
- And this is the die that
- has the labeled sides 10, 20, 30, 40, 50, 60.
- Okay, so what is the average outcome that you should get when
- you roll this die over and over again?
- And I’m sure you all know the answer, which is that it’s 35,
- which is right in the middle there.
- And here’s the computation you sorta did in your head in order
- to get that.
- You multiplied each outcome
- by the probability that you get that outcome.
- And then you add them all up.
- So that’s the computation you did to get to the middle
- number there.
- So I can write that in general notation this way.
- I can say that it’s outcome 1 times the probability of
- outcome 1 plus outcome 2 times the probability of outcome 2
- and so on.
- And then I can write that in summation notation like this.
- Okay, it’s the sum over i for outcome i and
- probability of outcome i.
- Okay, and that’s the formula for the mean of a discrete random
- variable. Okay, so let’s try this die.
- So its PMF is slightly different, but
- we can just apply the formula.
- So it’s each outcome times its probability of occurring,
- add them all up, and that is the mean.
- Okay, and then here,
- when I did that, it was a little bit less than 35.
- And you can see why that is by looking at the probability
- mass function.
- See, we’ve got more mass on the smaller values here, right?
- We moved some of the mass down lower, so we get a smaller mean.
- Okay, so we’re now likely to choose 20 more often, so
- that lowers the average.
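A minimal sketch of the mean formula (each outcome times its probability, summed) applied to both dice. The uniform die with sides 10 to 60 follows the transcript. The weighted probabilities are an assumption consistent with what is said (one side at 3/12, one at 1/12, more mass on 20), since the actual table is only on the slide.

```python
from fractions import Fraction

def mean(pmf):
    """Expected value: sum of outcome * probability over all outcomes."""
    return sum(x * p for x, p in pmf.items())

sides = [10, 20, 30, 40, 50, 60]

# Constant PMF: every side has probability 1/6.
uniform_pmf = {x: Fraction(1, 6) for x in sides}

# Assumed weights in twelfths: 20 is more likely (3/12), 50 less likely (1/12).
weighted_pmf = dict(zip(sides, [Fraction(n, 12) for n in (2, 3, 2, 2, 1, 2)]))

print(mean(uniform_pmf))   # 35
print(mean(weighted_pmf))  # 65/2, i.e. 32.5, a little below 35
```

Under these assumed weights the mean drops to 32.5, which matches the speaker's remark that shifting mass toward smaller values lowers the average.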
- Okay, more practice.
- Here’s a new random variable, and here is its PMF.
- And you can sorta look at that for
- a bit, and realize that the values in the middle are more
- likely than the values at the extremes, okay?
- And you can see that from looking at these
- numbers directly.
- But see, what if there were a heck of a lot more numbers,
- and what if the table was sort of hundreds of
- lines longer, thousands of lines long, then you wouldn’t be
- able to look at all the numbers and figure out that some of the ones
- in the middle are higher than the ones at the extremes.
- Okay, so one way to handle that is to actually visualize it.
- So I’m gonna do a bar chart of the PMF.
- So for outcome zero, I plotted the probability to get zero.
- For outcome 1, I plotted the probability to get 1, and so on.
- And you can see the nice PMF without having to try and
- summarize a table of numbers in your head.
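The bar chart idea can be sketched without a plotting library by printing one row of `#` characters per outcome. The PMF below is an assumption: the transcript only states P(0) = 0.03 and P(1) = 0.14 for this variable, and the remaining values were chosen to be consistent with the mean and variance quoted later in the lecture. The function name `text_bar_chart` is hypothetical.

```python
# Assumed PMF: only the first two probabilities appear in the transcript.
pmf = {0: 0.03, 1: 0.14, 2: 0.36, 3: 0.29, 4: 0.18}

def text_bar_chart(pmf, width=50):
    """Render a PMF as text, one bar per outcome, scaled so 1.0 is `width` marks."""
    lines = []
    for x in sorted(pmf):
        bar = "#" * round(pmf[x] * width)
        lines.append(f"{x:>3} | {bar} {pmf[x]:.2f}")
    return "\n".join(lines)

print(text_bar_chart(pmf))
```

Even in this crude form, the longer bars in the middle make the shape of the distribution visible at a glance, which is the point the speaker is making.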
- Okay, more practice.
- So computing the mean using the formula that I discussed
- earlier with you, a sanity check: does it look right?
- Is 2.45 in the center of the distribution?
- Yes, it is, so that looks good.
- Okay, so there’s the formula for the mean,
- that’s what you just learned.
- Now the mean is also called the expectation, by the way, or
- the expected value.
- And I’m gonna use both terminologies and both notations
- kind of throughout, so let’s just remember that they stand
- for the same thing. Okay, so now that
- we have a measure of the center of the distribution, let us try
- to get some way of measuring the spread of the distribution.
- Here’s my random variable.
- This is the PMF for my random variable, and I want some
- measure of the spread of this thing around the mean.
- I wanna know how spread out it is, so let me try a guess.
- So here’s my guess.
- I take the distance of each outcome xi from the mean,
- that’s the first step.
- Now, remember that xi could be on either side of the mean, so
- I need the absolute value here, right, to compute distance.
- And then I’m going to multiply each distance
- by how often it occurs, and I will call that the spread.
- Okay, so if the distances are very large, fairly often,
- then this thing will be large.
- Okay, so what do you think of that?
- Cool?
- Well, so this thing, it’s a good guess but
- it’s not quite what I’m looking for.
- But it is something that is only a little bit different.
- But the intuition for this thing holds for
- the real definition of the spread that I’m going to use.
- So here’s the real definition.
- Instead of computing the mean distance from the mean outcome,
- it’s actually the mean squared distance, okay?
- So, since you’re still looking at distances from the center of
- the distribution this really is the measure of the spread,
- right?
- This is the official definition of the variance of
- a distribution.
- Okay, so when you look at this,
- what you should see is distances from the mean of
- the distribution weighted by their probability of occurring.
- And that is the variance.
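The slides the speaker points to are not included in the transcript. As a sketch, the formulas being described can be written as follows, assuming the usual notation: outcomes x_i with probabilities p_i = P(X = x_i), and μ for the mean.

```latex
% Mean (expectation) of a discrete random variable
\mu = E[X] = \sum_i x_i \, p_i

% The first guess at a spread measure: mean absolute distance from the mean
\sum_i p_i \, \lvert x_i - \mu \rvert

% The definition actually used: mean squared distance from the mean (variance)
\operatorname{Var}(X) = \sum_i p_i \, (x_i - \mu)^2
```

The only difference between the guess and the variance is that the absolute value is replaced by a square.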
- Let’s do this computation here.
- Let’s compute the variance of X.
- Here are the outcomes and here are the probabilities.
- And here is the formula that we just derived.
- Now, let’s look at this top line here.
- We have probability of 0.03 times the distance of x,
- which is zero minus 2.45, which is the mean we computed earlier.
- And then we square that, the squared distance from the mean.
- Okay, good.
- So that’s for the first term, so that’s this term, and
- now let’s do the rest of the terms.
- Now here’s the second term.
- 0.14 is the probability of that outcome
- times the distance of 1 from 2.45 squared.
- And then I just put some dot dot dots just so
- we don’t have to write them all out.
- And then here is the last term for that line right there.
- And you get 1.0675.
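The term-by-term computation can be sketched in Python. The transcript gives only the first two probabilities (0.03 and 0.14), the mean (2.45), and the result (1.0675); the other three probabilities below are an assumption chosen so that the numbers agree with those stated values.

```python
# Assumed PMF: P(0) and P(1) are from the transcript, the rest are
# illustrative values consistent with the stated mean and variance.
pmf = {0: 0.03, 1: 0.14, 2: 0.36, 3: 0.29, 4: 0.18}

# Mean: each outcome weighted by its probability.
mean = sum(x * p for x, p in pmf.items())

# One variance term per outcome: probability * squared distance from the mean.
terms = {x: p * (x - mean) ** 2 for x, p in pmf.items()}
variance = sum(terms.values())

for x, term in terms.items():
    print(f"{pmf[x]:.2f} * ({x} - {mean:.2f})^2 = {term:.6f}")
print(f"variance = {variance:.4f}")
```

The first printed line corresponds to the "top line" in the lecture, 0.03 times (0 minus 2.45) squared, and the second to the 0.14 term.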
- Now that’s great.
- So we’ve got the variance, but
- the problem with the variance is that it’s not in units that
- really make any sense, cuz it’s in units of distance squared.
- So that’s why we wanna talk about the standard deviation.
- Now, the standard deviation is just the square root
- of the variance.
- Okay, so I put it here.
- Standard deviation, it’s also written sigma,
- it’s just the square root of the variance.
- And that is in units that make sense.
- So it’s back in dollars again, not dollars squared.
- Okay, so the value here is 1.033 and you can
- actually measure that along the horizontal axis here, and it
- makes sense because it’s in the same units as the outcomes are.
- So if these are in dollars,
- the standard deviation is in dollars.
- And you can see that by moving away from the mean here,
- the mean’s 2.45, which is sort of right there.
- And you can see what one standard deviation will get you.
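A short sketch of the last step: take the square root of the variance and mark off one standard deviation on either side of the mean. The mean and variance are the values quoted in the lecture; the names `low` and `high` are introduced here for illustration.

```python
import math

mean = 2.45        # mean quoted in the lecture
variance = 1.0675  # variance quoted in the lecture

# Standard deviation: square root of the variance, in the same units as X.
std_dev = math.sqrt(variance)
print(f"sigma = {std_dev:.3f}")  # sigma = 1.033

# One standard deviation on either side of the mean, on the outcome axis.
low, high = mean - std_dev, mean + std_dev
print(f"mean - sigma = {low:.3f}, mean + sigma = {high:.3f}")
```

Because the standard deviation is in the same units as the outcomes, `low` and `high` are points on the horizontal axis of the PMF plot.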