Introduction to Statistics – Z-Scores
Z-Scores
Video transcript
- So this lecture is all about how to view data with respect to
- other data.
- So if you’re telling me that you’re an excellent salesman
- because you sold $100,000 worth of widgets,
- I have no idea what that means.
- You can be terrible compared to your peers, but
- how would I know that based on what you told me?
- If you tell me the z-score or what percentile you were,
- then that’s a different story, right, that’s meaningful.
- Obviously, covariance and correlation are about
- the relationships between two random variables.
- Let’s start.
- It’s helpful to think of values relative to other values within
- the same distribution, and
- that’s what a z-score tells you, it tells you where a point is
- relative to other points in the distribution.
- It’s the number of standard deviations above or
- below the mean for a particular point.
- If I tell you that I just sold 1200 units,
- that doesn’t mean much. You have no idea how good
- a salesman I am, because you don’t have enough context.
- What if I told you the mean was 1000 units?
- Still, that doesn’t tell you very much.
- You have no idea how unusual it is to go above 1200 units.
- Is that very unusual?
- Or was I just slightly above the mean?
- What you need also is the standard deviation.
- If I tell you it is 100 units, then you know I’m in business
- because I’m selling two standard deviations above the mean.
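The arithmetic behind "two standard deviations above the mean" can be sketched in a few lines of Python. The function name `z_score` and its parameter names are illustrative choices, not something from the lecture; the numbers are the ones quoted above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Figures from the lecture: 1200 units sold, mean 1000, standard deviation 100.
my_z = z_score(1200, mean=1000, std_dev=100)
print(my_z)  # 2.0
```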
- Now, here’s, say, a histogram, and
- I’m selling here at 1200 and the mean is over here.
- Now the vast majority of other salespeople sell much
- less than 1200, so I’m at the top of the pile if I’m up here.
- Of course, this assumes that sales figures have an approximately
- normal distribution, which may or may not be true for
- a specific company, but we’ll let that go for now.
- The z-score of a point x is the number of standard deviations
- that x lies above or below the mean of X.
- And an easy way to compute that is to use this formula
- right here.
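The formula shown on the slide is not captured in the transcript. Assuming it is the standard definition, with μ the mean and σ the standard deviation of X (symbols chosen here to match the "mean mu" mentioned later in the lecture), it would read:

```latex
z = \frac{x - \mu}{\sigma}
```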
- But that looks a little complicated, so
- let’s break it down a bit.
- Well, let us start with the original PDF of X.
- Let’s say that this is X’s PDF, where X is a random variable.
- Then I’m gonna subtract the mean, so now this
- thing has mean 0, because I subtracted the mean.
- When I divide by the standard deviation here,
- I actually squish the distribution, so
- the distribution now has mean 0 and standard deviation 1.
- When I think about one standard deviation above the mean of X,
- it’s exactly at the point 1 of this new distribution where
- z is 1.
- What I did when I subtracted the mean and divided by the standard
- deviation is that I shifted the distribution to have mean 0 and
- I scaled it to have standard deviation 1;
- in other words, I standardized the distribution.
- In this way, z measures how many standard deviations X is
- above or below the mean.
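The shift-then-scale step can be checked numerically. This is a minimal sketch using Python's standard `statistics` module; the `sales` list is made up for illustration and is treated as the whole population, which is why `pstdev` (population standard deviation) is used.

```python
from statistics import mean, pstdev

# Hypothetical sales figures, invented for this example.
sales = [800, 900, 1000, 1100, 1200]

mu = mean(sales)       # mean of the data
sigma = pstdev(sales)  # population standard deviation of the data

# Subtract the mean (shift), then divide by the standard deviation (scale).
z_values = [(x - mu) / sigma for x in sales]

print(mean(z_values))    # approximately 0
print(pstdev(z_values))  # approximately 1
```

Whatever the original mean and spread were, the standardized values end up with mean 0 and standard deviation 1, up to floating-point rounding.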
- Now if you’re working with data, you don’t actually have
- the mean mu, you only have the sample mean X bar.
- People often get confused and call these z-scores. In fact,
- I do it myself, but they’re really sample z-scores.
- The sample z-score of X is actually the number of sample
- standard deviations above or below the sample mean.
- Okay, so here is a histogram of my data, and
- you can see that the sample mean is 1000.
- And my sample z-score is about 2 because
- I’m 2 sample standard deviations above the sample mean.
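A sample z-score uses the sample mean and the sample standard deviation in place of the true values. The sketch below assumes a tiny, made-up list `peer_sales`, chosen so the sample mean is 1000 as in the lecture; `statistics.stdev` computes the sample standard deviation (dividing by n - 1).

```python
from statistics import mean, stdev

# Hypothetical peer sales, invented so that the sample mean is 1000.
peer_sales = [900, 1000, 1100]
my_sales = 1200

x_bar = mean(peer_sales)  # sample mean
s = stdev(peer_sales)     # sample standard deviation (n - 1 in the denominator)

# Number of sample standard deviations above or below the sample mean.
sample_z = (my_sales - x_bar) / s
print(sample_z)
```

With this data the sample standard deviation is 100, so the sample z-score comes out at 2, matching the figure quoted in the lecture.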
- Just to give you some perspective, let me show
- you how rare that actually is.
- Here is a standard normal with mean 0 and variance 1.
- Now it turns out that 68% of the time,
- you’re within one standard deviation of the mean.
- Now you can’t calculate this analytically, by the way;
- you actually need a computer to get that 68%.
- It also turns out that about 95% of the time,
- you’re within two standard deviations of the mean.
- And 99.7% of the time,
- you’re within three standard deviations of the mean.
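These three percentages can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8) and evaluates the standard normal CDF at plus and minus k standard deviations; the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

within = {}
for k in (1, 2, 3):
    # Probability of landing within k standard deviations of the mean.
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {within[k]:.4f}")
```

The results are roughly 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% quoted in the lecture.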
- Now you can put it into context.
- So I sold 1200 units, the mean is 1000 and
- the standard deviation is 100, so the z-score is 2.
- I sold two standard deviations above the mean, and
- the probability of being that far above the mean is only about 2.5%.
- So I’m a pretty unusual salesperson.
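The 2.5% figure follows from the 95% rule: about 5% of the probability lies more than two standard deviations from the mean, and half of that is on the high side. A minimal sketch of the more precise upper-tail calculation, assuming sales really are normally distributed:

```python
from statistics import NormalDist

# Probability of a z-score greater than 2 under a standard normal distribution.
upper_tail = 1 - NormalDist().cdf(2)
print(f"{upper_tail:.4f}")
```

The computed value is a little under 2.3%, slightly smaller than the rounded 2.5%.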