Using freely available material from Codecademy's Introduction to Data Analysis intensive, we'll learn the basics of hypothesis testing using SciPy. You’ll practice the mathematically informed decision-making and experimental design used by marketers, social scientists, product managers, and data analysts everywhere.
When we compare datasets, we often need a way to tell with confidence whether they are significantly different from each other.
Some situations involve numerical data, such as:
- A professor expects an exam average to be roughly 75%, and wants to know if the actual scores line up with this expectation. Was the test actually too easy or too hard?
- A manager of a chain of stores wants to know if certain locations have different revenues on different days of the week. Are the revenue differences a result of natural fluctuations or a significant difference between the stores' sales patterns?
- A product manager for a website wants to compare the time spent on different versions of a homepage. Does one version make users stay on the page significantly longer?
Others involve categorical data, such as:
- A pollster wants to know if men and women have significantly different yogurt flavor preferences. Does a result where men more often answer "chocolate" as their favorite reflect a significant difference in the population?
- Do different age groups have significantly different emotional reactions to different ads?
In this lesson, you will learn how we can use hypothesis testing to answer these questions. There are several different types of hypothesis tests for the various scenarios you may encounter. Luckily, SciPy has built-in functions that perform these tests for us, usually in just one line of code.
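To give a feel for what "one line of code" looks like, here is a minimal sketch of the professor's question from the examples above, using `scipy.stats.ttest_1samp`. The exam scores are made up purely for illustration, and the 0.05 significance threshold is a common convention rather than anything prescribed by the lesson.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical exam scores, invented for this illustration.
scores = np.array([72, 81, 68, 90, 77, 74, 85, 69, 79, 83])

# Null hypothesis: the true average score is 75.
# The test itself is a single call.
t_stat, p_value = ttest_1samp(scores, 75)

print(f"sample mean = {scores.mean():.1f}")
print(f"t statistic = {t_stat:.3f}, p-value = {p_value:.3f}")

# Assumed convention: treat p < 0.05 as "significantly different".
if p_value < 0.05:
    print("The average differs significantly from 75.")
else:
    print("No significant difference from 75 was detected.")
```

A small p-value would suggest the exam was easier or harder than expected; a large one means the data are consistent with an average of 75.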
For numerical data, we will cover:
- One-Sample T-Tests
- Two-Sample T-Tests
- ANOVA
- Tukey Tests
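As a rough preview of the multi-sample tests in this list, the sketch below applies them to daily revenue at three stores. The revenue figures and variable names are hypothetical. It assumes SciPy 1.8 or later, where `scipy.stats.tukey_hsd` is available; the original course material may use a different library for the Tukey test.

```python
import numpy as np
from scipy.stats import ttest_ind, f_oneway, tukey_hsd

# Hypothetical daily revenues (in dollars) for three stores.
store_a = np.array([30, 45, 55, 44, 53, 40, 48])
store_b = np.array([61, 58, 70, 66, 59, 72, 64])
store_c = np.array([33, 47, 50, 42, 51, 39, 46])

# Two-sample t-test: do stores A and B have different mean revenue?
t_stat, p_two_sample = ttest_ind(store_a, store_b)
print(f"two-sample t-test p-value: {p_two_sample:.4f}")

# ANOVA: is there any difference among the three store means?
f_stat, p_anova = f_oneway(store_a, store_b, store_c)
print(f"ANOVA p-value: {p_anova:.4f}")

# Tukey's range test: which specific pairs of stores differ?
# result.pvalue is a 3x3 matrix of pairwise p-values.
tukey_result = tukey_hsd(store_a, store_b, store_c)
print(tukey_result.pvalue)
```

ANOVA only reports whether some difference exists among the groups; the Tukey test is the follow-up that points to the specific pairs.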
For categorical data, we will cover:
- Binomial Tests
- Chi-Square Tests
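The categorical tests follow the same pattern. This sketch uses invented counts: a purchase rate compared against an assumed 10% baseline, and a hypothetical table of yogurt flavor preferences. It assumes SciPy 1.7 or later, which provides `scipy.stats.binomtest`.

```python
import numpy as np
from scipy.stats import binomtest, chi2_contingency

# Binomial test: hypothetical data where 41 of 500 visitors made a
# purchase, compared against an assumed expected rate of 10%.
binom_result = binomtest(k=41, n=500, p=0.10)
print(f"binomial test p-value: {binom_result.pvalue:.4f}")

# Chi-square test: hypothetical survey counts.
# Rows are men and women; columns are chocolate, vanilla, strawberry.
flavor_counts = np.array([
    [30, 10, 20],
    [35, 5, 20],
])
chi2, p_chi2, dof, expected = chi2_contingency(flavor_counts)
print(f"chi-square p-value: {p_chi2:.4f} (degrees of freedom: {dof})")
```

In both cases a small p-value would indicate that the observed counts are unlikely under the null hypothesis (the assumed rate, or no association between group and preference).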
After this lesson, you will have a wide range of tools in your arsenal for finding statistically significant differences in data.