Encoding Categorical Variables in Machine Learning
Encoding categorical variables means converting categorical values into numerical values.
Encoding categorical variables is a crucial step in data preprocessing.
Most machine learning algorithms require numerical input, so categorical data must be transformed into a numerical format.
Common techniques include label encoding and one-hot encoding.
Other encoding techniques exist, but this video covers only label encoding and one-hot encoding.
Label encoding assigns a unique integer to each category.
Although it is simple, it can imply an ordinal relationship between categories that does not exist: a model may treat the category encoded as 2 as greater than the category encoded as 0.
Using the fit_transform() function from the LabelEncoder class, we convert the categorical values in a Python list into numeric values.
from sklearn.preprocessing import LabelEncoder
data = ["cat", "dog", "mouse"]
# Label encoding
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)
[0 1 2]
After running the code, each category in the list is replaced by a unique integer.
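To see which integer stands for which category, the fitted encoder can be inspected. The sketch below reuses the same three-item list; the `mapping` and `decoded_data` names are introduced only for illustration. `classes_` and `inverse_transform()` are part of scikit-learn's LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

data = ["cat", "dog", "mouse"]

label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)

# classes_ holds the categories in sorted order; the position of each
# category in this array is the integer it was assigned.
mapping = {str(category): index for index, category in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform() turns the integers back into the original labels.
decoded_data = [str(label) for label in label_encoder.inverse_transform(encoded_data)]
print(decoded_data)
```

Because the integers follow the sorted order of the categories, the encoding suggests cat < dog < mouse, which is the artificial ordering that label encoding can introduce.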
import pandas as pd
from sklearn.preprocessing import LabelEncoder
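# Load the housing data; the file is expected to contain a "location" column
df = pd.read_csv("home_price_prediction.csv")
encoder = LabelEncoder()
# Fit on the location values and convert each one to its integer code
encode_data = encoder.fit_transform(df["location"])
# Store the codes in a new column, keeping the original "location" column
df["new_location"] = encode_data
df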
The encoded values appear in the new_location column of the data frame.
One-hot encoding creates a new binary feature for each category.
This method avoids implying an order between categories and is widely used, at the cost of adding one column per category.
Using the get_dummies() function from the pandas library, we apply one-hot encoding to the animal column.
import pandas as pd
data = pd.DataFrame({"animal": ["cat", "dog", "mouse"]})
# One-hot encoding (dtype=int gives 0/1 columns; recent pandas versions default to True/False)
one_hot_encoded_data = pd.get_dummies(data, dtype=int)
print(one_hot_encoded_data)
   animal_cat  animal_dog  animal_mouse
0           1           0             0
1           0           1             0
2           0           0             1
After running the code, three columns are created, one for each value in the animal column.
In each row, the column matching that row's category is set to 1, and all other columns are set to 0.
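One way to see how this differs from label encoding is to compare the gaps between encoded categories. The following is a minimal sketch using the same three animals; the variable names and the use of NumPy for the comparison are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"animal": ["cat", "dog", "mouse"]})

# One-hot rows as a 3x3 integer array (one row per animal).
one_hot = pd.get_dummies(data, dtype=int).to_numpy()

# Label codes for the same animals, as produced by label encoding.
label_codes = np.array([0, 1, 2])

# With label codes, "mouse" looks twice as far from "cat" as "dog" does.
label_gap_cat_dog = int(abs(label_codes[0] - label_codes[1]))
label_gap_cat_mouse = int(abs(label_codes[0] - label_codes[2]))
print(label_gap_cat_dog, label_gap_cat_mouse)

# With one-hot rows, every pair of categories differs in exactly two positions.
one_hot_gap_cat_dog = int(np.abs(one_hot[0] - one_hot[1]).sum())
one_hot_gap_cat_mouse = int(np.abs(one_hot[0] - one_hot[2]).sum())
print(one_hot_gap_cat_dog, one_hot_gap_cat_mouse)
```

The label codes give unequal gaps between categories, while the one-hot rows keep every pair of categories equally far apart, so no category appears larger or closer than another.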
import pandas as pd
df = pd.read_csv("home_price_prediction.csv")
one_hot_encoded_data = pd.get_dummies(df["location"], dtype=int)
new_df = df.join(one_hot_encoded_data)
new_df
The new one-hot columns are appended to the data frame.
The result is similar: a column takes the value 1 where the corresponding location is present, and 0 otherwise.
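The CSV file itself is not shown here, so the sketch below uses a small made-up data frame in its place; the `price` column and the location values are hypothetical. It illustrates one possible alternative to join(): passing `columns=` to get_dummies() encodes the named column and replaces it in a single step.

```python
import pandas as pd

# Hypothetical stand-in for home_price_prediction.csv.
df = pd.DataFrame({
    "location": ["downtown", "suburb", "downtown", "rural"],
    "price": [350000, 220000, 410000, 150000],
})

# Encode only the "location" column; dtype=int yields 0/1 instead of True/False.
new_df = pd.get_dummies(df, columns=["location"], dtype=int)
print(new_df)
```

With this approach the original location column is dropped and each new column carries a location_ prefix, which may be preferable when the raw text column is no longer needed for training.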