Probability and Random Variables – Central Limit Theorem
Video transcript
- Well, let me tell you about the central limit theorem and
- the normal distribution.
- Now, every day a bunch of customers go into a store,
- each with their own kooky distribution
- of how much they wanna spend.
- Like, this guy wants to spend, sometimes a little bit,
- sometimes a lot, but not much in between.
- This person spends a little bit, a medium amount, or
- a lot, sometimes an awful lot.
- And this guy, he just likes ties.
- So he always buys ties, and they’re always this price,
- somewhere in this price range.
- Now, the question is, at the end of the day,
- the store counts the total amount of money that
- they collect from all of these customers.
- So how much total do they spend at the store?
- And let’s assume we have a lot of customers in the store.
- And we even have some students taking this course going
- to the store, and bringing their computers to the store just for
- fun, while they’re learning data science.
- And each of them has their own distribution
- of how much they spend.
- Every evening, the store counts their total profit, and every
- day it was slightly different than it was the day before.
- And now we ask the store to
- plot the distribution of its profits.
- Here’s what it looks like.
- A bump.
- And we ask them what the center of the bump is, and magically as
- it turns out, the center of this distribution of total profits
- is exactly the sum of the means of the individual customers.
- My goodness, that is cool.
- Now, start over again.
- Now, this is a totally different store with totally different
- customers.
- In fact, it’s on the other side of the world.
- Everything’s totally different.
- And now we ask,
- what is the distribution of this store’s profits?
- And rather interestingly, ha, ha,
- they also end up with a very similar looking sort of bump.
- Totally different store.
- Totally different customers, and
- yet, a single bump with the same shape.
- And the mean has the same formula too.
- It’s the sum of the customers’ individual means.
- But what the heck is going on here?
- Are these stores in cahoots?
- Did they talk to each other and
- negotiate the distribution of their sales?
- No way.
- It turns out, and get this, it always happens.
- The same bump with the same shape, right?
- It might be stretched out a bit, or it might be squeezed a bit,
- or scaled a bit, or
- shifted a bit, but it’s really the same shape.
- So this is called the normal distribution.
- And its shape is given by this particular formula, and
- it’s always the same formula.
- It’s always the same shape.
- And the formula has a mean in it, mu, the mean of this thing.
- And the standard deviation sigma,
- that’s the measure of the spread of that distribution.
- Now if you know those two things, if you know the mean and
- you know the sigma, you have the whole formula.
- So you know the whole shape of that curve.
- And it always integrates to one.
- Now funny things happen when you fiddle with the mean and
- the variance there.
- So you can get these very kinda peaky normal distributions
- with small standard deviations.
- Or you could get these very broad ones with large standard
- deviations.
- And the mean can actually be any value it wants to be.
- And the standard deviation, as long as it’s positive,
- can be anything.
- And there’s that formula again.
- So, as long as you know the mean and the standard deviation,
- you got the full shape.
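The points about the formula can be checked numerically. The sketch below assumes the formula on the slide is the standard normal density, which the transcript does not spell out. The function names `normal_pdf` and `integrate_pdf` are made up for this illustration. It shows that a small sigma gives a tall, peaky curve, a large sigma gives a broad one, and the area under the curve stays at one wherever the mean sits.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma (sigma > 0)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate_pdf(mu, sigma, width=10.0, steps=20000):
    """Trapezoid-rule area under the density on [mu - width*sigma, mu + width*sigma]."""
    a = mu - width * sigma
    b = mu + width * sigma
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h


# Illustrative parameter choices: a peaky curve, a broad curve, a shifted curve.
for mu, sigma in [(0.0, 0.5), (0.0, 3.0), (-40.0, 1.0)]:
    peak_height = normal_pdf(mu, mu, sigma)
    area = integrate_pdf(mu, sigma)
    print(f"mu={mu}, sigma={sigma}: peak height={peak_height:.4f}, area={area:.6f}")
```

The integration range of ten standard deviations on each side is an arbitrary choice that leaves a negligible amount of area outside it.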
- Now, the cool thing is this fact that I told you,
- that the sum of a large number of independent random variables
- is approximately normal.
- And this is actually called the central limit theorem.
- And this is one of the most famous theorems in the world,
- the central limit theorem.
- So, even though all of the different customers had
- a totally different distribution,
- when you add them up, the sum is approximately normal.
- So let’s have X1 through Xn be independent random variables.
- Their means are mu 1 through mu n.
- The standard deviations are sigma 1 through sigma n.
- Now I take their sum.
- Okay, this is the total sales for the store.
- And the Xs are the sales for the individual customers.
- And now as it turns out, that sum is approximately
- normal with mean, which is the sum of the means.
- So the mean of the sum is the sum of the means.
- And the variance of the sum turns out to be
- the sum of the variances of the independent random variables.
- Of course, the standard deviation is just the square
- root of the variance.
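Written out in symbols, the statement above reads roughly as follows. The name S_n for the sum is introduced here for convenience. The approximation in the last line also relies on conditions the transcript does not go into, such as finite variances and no single customer dominating the total.

```latex
S_n = X_1 + X_2 + \dots + X_n, \qquad X_1, \dots, X_n \text{ independent}

\mathbb{E}[S_n] = \mu_1 + \mu_2 + \dots + \mu_n

\operatorname{Var}(S_n) = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_n^2,
\qquad \operatorname{sd}(S_n) = \sqrt{\sigma_1^2 + \sigma_2^2 + \dots + \sigma_n^2}

S_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\sum_{i=1}^{n} \mu_i,\; \sum_{i=1}^{n} \sigma_i^2\right) \quad \text{for large } n
```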
- Now, this theorem only works when
- the variables are independent.
- So you just have to make sure that that’s true.
- And that’s true for
- sales because each customer comes into the store and
- doesn’t worry about what another customer is doing.
- Now the larger n is, the closer to normal,
- that sum is.
- And it turns out that if the Xn’s are actually normal in
- the first place, then their sum is normal anyway.
- Exactly normal.
- So if you start out with weird distributions though,
- then n needs to be a bit larger to look normal.
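The store example can be tried out as a small simulation. Everything specific in this sketch is an assumption made for illustration: the three customer types, their dollar amounts and probabilities, the 120 customers, and the 2000 simulated days. None of these numbers come from the lecture. Each customer draws independently, and the daily totals are compared with the sum of the means and the sum of the variances.

```python
import math
import random
import statistics


def discrete_customer(values, probs):
    """Customer who spends one of a few fixed amounts; returns (sampler, mean, variance)."""
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

    def sample(rng):
        return rng.choices(values, weights=probs, k=1)[0]

    return sample, mean, var


def uniform_customer(lo, hi):
    """Customer who spends a uniformly random amount in [lo, hi] (the tie buyer)."""
    def sample(rng):
        return rng.uniform(lo, hi)

    return sample, (lo + hi) / 2.0, (hi - lo) ** 2 / 12.0


def simulate_store(customers, days, seed=0):
    """Daily totals, with every customer drawing independently each day."""
    rng = random.Random(seed)
    return [sum(sample(rng) for sample, _, _ in customers) for _ in range(days)]


# Hypothetical customer mix: "a little or a lot", "little/medium/lot", and "ties".
kinds = [
    discrete_customer([5, 100], [0.5, 0.5]),
    discrete_customer([10, 50, 200], [0.5, 0.3, 0.2]),
    uniform_customer(20, 40),
]
customers = kinds * 40  # 120 customers in total

totals = simulate_store(customers, days=2000)
expected_mean = sum(m for _, m, _ in customers)  # sum of the means
expected_var = sum(v for _, _, v in customers)   # sum of the variances
expected_sd = math.sqrt(expected_var)

# For a normal distribution, about 68% of values fall within one sd of the mean.
within_one_sd = sum(abs(t - expected_mean) <= expected_sd for t in totals) / len(totals)

print("mean of totals:", statistics.mean(totals), "sum of means:", expected_mean)
print("variance of totals:", statistics.pvariance(totals), "sum of variances:", expected_var)
print("fraction within one sd:", within_one_sd)
```

Shrinking the number of customers, for example `kinds * 1`, should make the totals look less like a single bump, in line with the remark that weird distributions need a larger n.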
- Now, because of the central limit theorem,
- this distribution and this formula pops up all the time.
- Now, this formula is not something that a person made up,
- it’s something that exists in nature.
- It’s just as natural as the patterns you might find when
- looking at water waves or seashells.
- It’s something that comes with the earth.
- It’s the distribution of the amount of rain in
- Boston over the year.
- It’s the distribution of test grades
- assuming no systematic cheating.
- And it’s the distribution of anything that’s a sum of
- independent events.
- This is like E equals mc squared for probability.
- You might say, well, but I already
- thought you told us about sums of independent random variables?
- Doesn’t the binomial distribution come from a sum of
- independent random variables?
- Well, as it turns out, the limiting binomial distribution
- is actually normal, so isn’t that lovely?
- So, if I start out with just one trial with probability 0.5,
- the binomial distribution is pretty boring,
- it just looks like this, right?
- This is just a fair coin with a single coin flip.
- Half the time you’ll get 0, half the time you’ll get 1.
- Then when I start flipping more coins
- it starts to look a little bit more normal.
- Flip 3 coins.
- 4 coins.
- 5, and then we can flip 10 coins, 100 coins, and
- 1000 coins.
- And then you’ll see what looks like very much a beautiful
- normal looking distribution.
- So the limiting binomial is a normal.
- Isn’t that lovely?
- Okay, so, yes, so that’s the point.
- For large n, the binomial distribution actually becomes
- the normal distribution.
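The coin-flip progression can be reproduced in a few lines. This sketch compares the binomial probabilities for a fair coin with a normal curve that has the same mean, n times p, and the same standard deviation, the square root of n times p times (1 minus p). The helper names and the choice to report the largest gap relative to the height of the normal peak are assumptions made for this illustration.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials), computed in log space to avoid underflow."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def relative_gap(n, p=0.5):
    """Largest |binomial - normal| over k = 0..n, divided by the normal peak height."""
    mu = n * p
    sigma = math.sqrt(n * p * (1.0 - p))
    peak = normal_pdf(mu, mu, sigma)
    gap = max(
        abs(binomial_pmf(k, n, p) - normal_pdf(k, mu, sigma)) for k in range(n + 1)
    )
    return gap / peak


# Same numbers of coin flips as in the lecture.
for n in [1, 3, 4, 5, 10, 100, 1000]:
    print(f"n={n}: relative gap={relative_gap(n):.5f}")
```

The gap shrinks as n goes from 10 to 100 to 1000, which is the sense in which the binomial distribution becomes normal for large n.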