Data Transformation in Machine Learning
Data transformation is used to convert data into a format suitable for modeling.
Step 1: Normalization & Standardization (scaling)
Step 2: Encoding Categorical Variables
Step 3: Log Transformation
Normalization and standardization (scaling) are common preprocessing steps in machine learning that put features on a comparable scale.
For models that are sensitive to feature scale, such as those trained with gradient descent or based on distances, this can improve model performance and speed up convergence during training.
Normalization Formula
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
Where,
X is the original value,
X_min is the minimum value of the feature, and
X_max is the maximum value of the feature.
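As a quick check of the formula, the sketch below applies min-max normalization by hand with NumPy. The array x and the variable names are illustrative assumptions, and the sketch assumes the feature is not constant (otherwise X_max - X_min is zero and the division fails).

```python
import numpy as np

# Illustrative feature values (an assumption for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max normalization: subtract the minimum, then divide by the range
x_min, x_max = x.min(), x.max()
x_norm = (x - x_min) / (x_max - x_min)

print(x_norm)  # values: 0, 0.25, 0.5, 0.75, 1
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.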
Apply normalization to a 2-D array using the fit_transform function.
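import numpy as np
from sklearn.preprocessing import MinMaxScaler
# One feature (column) with five samples (rows)
data = np.array([
    [1], [2], [3], [4], [5]
])
# Learn the minimum and maximum, then rescale the data in one call
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)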
[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]
After running, you can see that all values have been rescaled to the range 0 to 1.
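fit_transform is shorthand for calling fit and then transform on the same data. The minimal sketch below separates the two steps, assuming a hypothetical array new_values that was not seen during fitting. It illustrates that the minimum and maximum are learned once, so later values outside that range land outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1], [2], [3], [4], [5]])
new_values = np.array([[0], [3], [10]])  # hypothetical unseen values

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=1 and max=5 from train only

# Reuse the learned min and max; nothing is refitted here
scaled_new = scaler.transform(new_values)

print(scaler.data_min_, scaler.data_max_)
print(scaled_new.ravel())  # values: -0.25, 0.5, 2.25
```

Fitting on one data set and reusing the fitted scaler elsewhere keeps the scaling consistent between the two.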
You can also apply normalization to a data frame; the steps are the same as in the example above.
In this data frame, we apply normalization to the price column.
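import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Load the data set; it must contain a numeric "price" column
df = pd.read_csv("homeprices.csv")
scaler = MinMaxScaler()
# Double brackets select a 2-D DataFrame, which the scaler expects
normalized_data = scaler.fit_transform(df[["price"]])
# Store the scaled values in a new column; "price" itself is unchanged
df["new_price"] = normalized_data
# Display the data frame (as the last line of a notebook cell)
df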
After running, you can see that the new_price column contains the price values rescaled to the range 0 to 1.
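If homeprices.csv is not at hand, the same steps can be tried on a small in-memory data frame. The sketch below is self-contained; the area and price values are made up for illustration and are not taken from the original file.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for homeprices.csv; the values are made up
df = pd.DataFrame({
    "area": [1000, 1500, 2000, 2500],
    "price": [200000, 300000, 350000, 500000],
})

scaler = MinMaxScaler()
# fit_transform returns an (n, 1) array; ravel() flattens it to 1-D
df["new_price"] = scaler.fit_transform(df[["price"]]).ravel()

# The smallest price maps to about 0 and the largest to about 1
print(df["new_price"].min(), df["new_price"].max())
print(df)
```

The original price column is left untouched, which makes it easy to compare the raw and scaled values side by side.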
Standardization Formula
$$Z = \frac{X - \mu}{\sigma}$$
Where,
X is the original value,
μ (mu) is the mean of the feature, and
σ (sigma) is the standard deviation of the feature.
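To see the formula at work, the sketch below standardizes a small array by hand with NumPy. The array x and the variable names are illustrative assumptions. Note that np.std computes the population standard deviation by default (ddof=0), which is also the convention StandardScaler uses.

```python
import numpy as np

# Illustrative feature values (an assumption for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mu = x.mean()     # mean of the feature: 3.0
sigma = x.std()   # population standard deviation: about 1.41421356

# Standardization: subtract the mean, then divide by the standard deviation
z = (x - mu) / sigma

print(z)  # approx: -1.41421356, -0.70710678, 0, 0.70710678, 1.41421356
```

Values below the mean become negative, values above it become positive, and the mean itself maps to 0.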
Apply standardization to a 2-D array using the fit_transform function.
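import numpy as np
from sklearn.preprocessing import StandardScaler
# One feature (column) with five samples (rows)
data = np.array([
    [1], [2], [3], [4], [5]
])
# Learn the mean and standard deviation, then rescale the data in one call
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)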
[[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]
After running, the values are transformed so that they have a mean of 0 and a standard deviation of 1.
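The claim about the mean and standard deviation can be checked directly. The minimal sketch below inspects what the scaler learned through its mean_ and scale_ attributes, verifies the statistics of the output, and uses inverse_transform to recover the original values. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1], [2], [3], [4], [5]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# Parameters learned during fit: per-feature mean and standard deviation
print(scaler.mean_, scaler.scale_)

# The scaled output has mean approximately 0 and standard deviation approximately 1
print(scaled.mean(), scaled.std())

# inverse_transform maps the scaled values back to the original scale
restored = scaler.inverse_transform(scaled)
print(restored.ravel())
```

Because of floating-point rounding, the mean may print as a very small number rather than exactly 0.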
You can also apply standardization to a data frame; the steps are the same as in the example above.
In this data frame, we apply standardization to the price column.
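import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load the data set; it must contain a numeric "price" column
df = pd.read_csv("homeprices.csv")
scaler = StandardScaler()
# Double brackets select a 2-D DataFrame, which the scaler expects
scaled_data = scaler.fit_transform(df[["price"]])
# Store the scaled values in a new column; "price" itself is unchanged
df["new_price"] = scaled_data
# Display the data frame (as the last line of a notebook cell)
df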
After running, you can see that the new_price column contains the standardized price values, which have a mean of 0 and a standard deviation of 1.
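One detail is worth checking when verifying this in pandas. StandardScaler divides by the population standard deviation (ddof=0), while pandas' std() defaults to the sample standard deviation (ddof=1). The sketch below uses a small made-up data frame in place of homeprices.csv to show the difference; the price values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for homeprices.csv; the values are made up
df = pd.DataFrame({"price": [200000, 300000, 350000, 500000]})

scaler = StandardScaler()
# fit_transform returns an (n, 1) array; ravel() flattens it to 1-D
df["new_price"] = scaler.fit_transform(df[["price"]]).ravel()

print(df["new_price"].mean())       # approximately 0
print(df["new_price"].std(ddof=0))  # approximately 1 (population std)
print(df["new_price"].std())        # pandas default ddof=1: slightly above 1
```

With only four rows the gap between the two conventions is noticeable; it shrinks as the number of rows grows.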