Hello, readers! Today we will look in detail at an important statistical test in data science: the **ANOVA test**, and how to run it in Python.

So, let us get started!!

## Emergence of ANOVA test

In the domain of data science and machine learning, the data needs to be understood and processed prior to modelling. That is, we need to analyze every variable of the dataset and its credibility in terms of its contribution to the target value.

Usually there are two kinds of variables:

- **Continuous variables**
- **Categorical variables**

Below are the most commonly used statistical tests for analyzing numeric variables:

- **T-test**
- Correlation regression analysis, etc.

The ANOVA test, by contrast, is a statistical test that works on categorical variables.

## What is ANOVA test all about?

The **ANOVA test** (Analysis of Variance) is a statistical test for analyzing categorical variables. It estimates the extent to which a continuous dependent variable is affected by one or more independent categorical variables.

With the ANOVA test, we compare the mean of the dependent variable across the groups (levels) of the independent categorical variable and check whether the differences between those means are larger than the variation within the groups would explain.
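To make the idea of comparing group means concrete, here is a minimal sketch that computes the one-way ANOVA F statistic by hand. The three groups and their values are hypothetical toy data, not taken from any dataset in this article. The F statistic is the ratio of the between-group mean square to the within-group mean square.

```python
from statistics import mean

# Hypothetical toy data: one list of target values per category level
groups = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 3.0, 4.0],
    "C": [5.0, 6.0, 7.0],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
)

# Within-group sum of squares: how far each value is from its own group mean
ss_within = sum(
    (v - mean(values)) ** 2 for values in groups.values() for v in values
)

df_between = len(groups) - 1               # number of groups - 1
df_within = len(all_values) - len(groups)  # number of observations - number of groups

f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_statistic:.2f}")  # F = 13.00
```

A large F value means the group means are spread out much more than the values inside each group, which is evidence against the means being equal.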

#### Hypothesis for ANOVA testing

As we all know, hypothesis claims come in two forms: the Null Hypothesis and the Alternate Hypothesis.

- In the case of the ANOVA test, our **Null hypothesis** would claim the following: “The statistical mean of all the groups/categories of the variables is the same.”
- On the other hand, the **Alternate Hypothesis** would claim as follows: “The statistical mean of all the groups/categories of the variables is not the same”, i.e. at least one group mean differs from the others.
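As a quick illustration of testing these two hypotheses, the sketch below uses `f_oneway()` from SciPy on three small groups. The group values are made up for illustration, and SciPy is an assumption here: it is a different route from the `statsmodels` approach used later in this article, though both compute a one-way ANOVA.

```python
from scipy.stats import f_oneway

# Hypothetical target values for three levels of one categorical variable
group_a = [10, 12, 11, 13]
group_b = [20, 22, 21, 23]
group_c = [30, 32, 31, 33]

ALPHA = 0.05  # significance level

# f_oneway returns the F statistic and the p-value of the one-way ANOVA
f_statistic, p_value = f_oneway(group_a, group_b, group_c)

reject_null = p_value < ALPHA
if reject_null:
    print("Reject the null hypothesis: at least one group mean differs.")
else:
    print("Fail to reject the null hypothesis: no evidence the means differ.")
```

Note that rejecting the null hypothesis only tells us that some group mean differs; it does not say which one.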

Having said this, let us now focus on the Assumptions or considerations for ANOVA testing.

#### Assumptions of ANOVA testing

- The values of the dependent variable within each group follow a (roughly) normal distribution.
- The groups share a common variance (homogeneity of variance).
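These assumptions can be checked before running the test. The sketch below is one common way to do it, not something prescribed by this article: it assumes SciPy and NumPy are installed, uses the Shapiro-Wilk test for normality and Levene's test for equal variances, and runs on synthetic data with hypothetical group names.

```python
import numpy as np
from scipy import stats

# Synthetic data (hypothetical): target values for three category levels
rng = np.random.default_rng(0)
groups = {
    "level_1": rng.normal(loc=2500, scale=900, size=60),
    "level_2": rng.normal(loc=5000, scale=900, size=60),
    "level_3": rng.normal(loc=5600, scale=900, size=60),
}

# Normality: Shapiro-Wilk test per group (null hypothesis: the data is normal)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"{name}: Shapiro-Wilk p-value = {p:.3f}")

# Common variance: Levene's test (null hypothesis: all groups have equal variance)
_, levene_p = stats.levene(*groups.values())
print(f"Levene p-value = {levene_p:.3f}")
```

In both tests a small p-value (for example below 0.05) suggests the corresponding assumption is violated, in which case the ANOVA result should be read with caution.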

## ANOVA test in Python – Simple Practical Approach!

In this example, we will be making use of the Bike Rental Count Prediction dataset wherein we are required to predict the number of customers who would opt for a rented bike based on different conditions provided.

You can find the dataset here!

So, initially, we load the dataset into the Python environment using the `read_csv()` function. We use the os module to set the working directory and the Pandas library to parse the CSV data. Then, based on exploratory data analysis (EDA), we convert the categorical variables to the string (`object`) data type and the date column to a datetime type.

```
import os
import pandas

# Changing the current working directory
os.chdir("D:/Ediwsor_Project - Bike_Rental_Count")
BIKE = pandas.read_csv("day.csv")

# Store the categorical columns as strings so they are treated as categories
BIKE['holiday'] = BIKE['holiday'].astype(str)
BIKE['weekday'] = BIKE['weekday'].astype(str)
BIKE['workingday'] = BIKE['workingday'].astype(str)
BIKE['weathersit'] = BIKE['weathersit'].astype(str)
BIKE['season'] = BIKE['season'].astype(str)
BIKE['yr'] = BIKE['yr'].astype(str)
BIKE['mnth'] = BIKE['mnth'].astype(str)

# Parse the date column into a datetime type
BIKE['dteday'] = pandas.to_datetime(BIKE['dteday'])
print(BIKE.dtypes)
```

**Output:**

```
instant int64
dteday datetime64[ns]
season object
yr object
mnth object
holiday object
weekday object
workingday object
weathersit object
temp float64
atemp float64
hum float64
windspeed float64
casual int64
registered int64
cnt int64
dtype: object
```

Now it is time to apply the ANOVA test. Python provides the `anova_lm()` function from the `statsmodels` library for this purpose.

First, we fit an **Ordinary Least Squares (OLS)** model on the data; the ANOVA test is then applied to the fitted model.

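```
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Categorical predictors (the columns converted to str above)
categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

for x in categorical_col:
    # Fit cnt ~ <predictor> using the ordinary least squares method
    model = ols('cnt ~ ' + x, data=BIKE).fit()
    # ANOVA table for the fitted model
    result_anova = sm.stats.anova_lm(model)
    print(result_anova)
```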

**Output:**

```
               df        sum_sq       mean_sq           F        PR(>F)
season        3.0  9.218466e+08  3.072822e+08  124.840203  5.433284e-65
Residual    713.0  1.754981e+09  2.461404e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
yr            1.0  8.813271e+08  8.813271e+08  350.959951  5.148657e-64
Residual    715.0  1.795501e+09  2.511190e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
mnth         11.0  1.042307e+09  9.475520e+07   40.869727  2.557743e-68
Residual    705.0  1.634521e+09  2.318469e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
holiday       1.0  1.377098e+07  1.377098e+07     3.69735      0.054896
Residual    715.0  2.663057e+09  3.724555e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
weekday       6.0  1.757122e+07  2.928537e+06    0.781896      0.584261
Residual    710.0  2.659257e+09  3.745432e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
workingday    1.0  8.494340e+06  8.494340e+06    2.276122      0.131822
Residual    715.0  2.668333e+09  3.731935e+06         NaN           NaN
               df        sum_sq       mean_sq           F        PR(>F)
weathersit    2.0  2.679982e+08  1.339991e+08   39.718604  4.408358e-17
Residual    714.0  2.408830e+09  3.373711e+06         NaN           NaN
```

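We take the significance level as 0.05. If the p-value (the `PR(>F)` column) is less than 0.05, we reject the null hypothesis and conclude that the mean of `cnt` differs across the groups formed by the levels of that categorical variable. In the output above, `season`, `yr`, `mnth` and `weathersit` have p-values far below 0.05, so we reject the null hypothesis for them. For `holiday`, `weekday` and `workingday` the p-values are above 0.05, so we fail to reject the null hypothesis.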
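The decision rule can also be applied programmatically. The sketch below copies the `PR(>F)` values from the output above into a dictionary by hand (the dictionary and variable names are only illustrative) and splits the variables by whether they fall below the 0.05 threshold.

```python
ALPHA = 0.05  # significance level

# p-values copied from the PR(>F) column of the ANOVA output above
p_values = {
    "season": 5.433284e-65,
    "yr": 5.148657e-64,
    "mnth": 2.557743e-68,
    "holiday": 0.054896,
    "weekday": 0.584261,
    "workingday": 0.131822,
    "weathersit": 4.408358e-17,
}

# Reject the null hypothesis where p < ALPHA; otherwise fail to reject it
significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("Reject null hypothesis for:", significant)
print("Fail to reject null hypothesis for:", not_significant)
```

Note that `holiday`, with a p-value of 0.054896, sits just above the threshold, so it is a borderline case rather than clear evidence of no effect.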

## Conclusion

With this, we have reached the end of this topic. Feel free to comment below if you have any questions.

*Recommended read: Chi-square test in Python*

Happy Analyzing!! 🙂