Data Integration in Machine Learning


data-integration-machine-learning





Data Integration


Purpose

To combine data from multiple sources into a single coherent dataset.
Combining two data frames in Python can be done in several ways using the pandas library, depending on the desired operation.


The data integration steps are

  1. Concatenation Datasets: Stacking data frames vertically (means concatenate row-wise) or horizontally (means concatenate column-wise).
  2. Merging Datasets: Combining data frames based on a key or index, similar to SQL joins.
  3. Joining Datasets: Similar to merging but designed for combining data frames based on their indexes.

1. Concatenation Datasets

Concatenating data frames can be done using the concat() function.


Concatenate data frames vertically (row-wise)


There are two data frames with two columns and four rows each.
Now, concatenate data frames vertically (row-wise) for that we use the concat() function.

Example:
import pandas as pd

df1 = pd.DataFrame({
  "A" : ["A0", "A1", "A2", "A3"],
  "B" : ["B0", "B1", "B2", "B3"]
})

df2 = pd.DataFrame({
  "A" : ["A4", "A5", "A6", "A7"],
  "B" : ["B4", "B5", "B6", "B7"]
})

# Concatenating row-wise
result = pd.concat([df1, df2], axis=0)
result
Data Integration Concatenating Data Frames Row Wise

An axis=0 means, the concat() function considers the rows, if we put an axis=1 then it considers the columns of the data frame.

After running, the data frames concatenated vertically. The data frame 2 (df2) rows are appended to data frame 1 ((df1)).


Concatenate data frames horizontally (column-wise)

To concatenate horizontally, simply set axis=1 in the concat() function.


Example:
import pandas as pd

df1 = pd.DataFrame({
  "A" : ["A0", "A1", "A2", "A3"],
  "B" : ["B0", "B1", "B2", "B3"]
})

df2 = pd.DataFrame({
  "A" : ["A4", "A5", "A6", "A7"],
  "B" : ["B4", "B5", "B6", "B7"]
})

# Concatenating column-wise
result = pd.concat([df1, df2], axis=1)
result
Data Integration Concatenating Data Frames Column Wise

After running, the data frames are concatenated horizontally.
The data frame 2 (df2) columns are appended to the data frame 1 ((df1)) columns.


2. Merging Datasets

Merging data frames can be done using the merge() function, which is similar to SQL joins.


Inner Join

An inner join combines records from two tables (or data frames) based on a common key, and it returns only the rows with matching keys in both tables.

There are two data frames with two columns and four rows each.
The first column contains keys, and the second column contains values.

Now, merge data frames with inner join, for that we use the merge() function from the panda's library.


Example:
import pandas as pd

df1 = pd.DataFrame({
  "key" : ["K0", "K1", "K2", "K3"],
  "A" : ["A0", "A1", "A2", "A3"]
})

df2 = pd.DataFrame({
  "key" : ["K0", "K1", "K2", "K4"],
  "B" : ["B0", "B1", "B2", "B3"]
})

# Merging with inner join
result = pd.merge(df1, df2, on="key")
result
Data Integration Merging Data Frames With Inner Join

After running this code, it combines the common keys from both data frames.


Outer Join

An outer join combines records from two tables (or data frames) based on a common key, and it returns all the rows from both tables, filling in NaN (or NULL) for missing matches on either side.

For outer join, pass input parameter as how="outer" in the merge() function.


Example:
import pandas as pd

df1 = pd.DataFrame({
  "key" : ["K0", "K1", "K2", "K3"],
  "A" : ["A0", "A1", "A2", "A3"]
})

df2 = pd.DataFrame({
  "key" : ["K0", "K1", "K2", "K4"],
  "B" : ["B0", "B1", "B2", "B3"]
})

# Merging with outer join
result = pd.merge(df1, df2, on="key", how="outer")
result
Data Integration Merging Data Frames With Outer Join

After running, both data frames are merged based on the outer join condition.


3. Joining Datasets

Joining data frames is done using the join() function.

The join() function combines two data frames based on their indexes or a key column.

By default, it performs a left join, but you can specify other types of joins (inner, right, outer) as well.


There are two data frames. The first data frame has two columns and three rows, the second data frame has one column and three rows.
As you can see, both data frame keys are set using the index parameter by passing in the data frame (DataFrame()) function.
K0 and K2 keys are common in both data frames. K1 is present in data frame 1 (df1), and K3 is present in data frame 2 (df2).

Join data frames using the join() function.


Example:
import pandas as pd

df1 = pd.DataFrame({
  "A" : ["A0", "A1", "A2"],
  "B" : ["B0", "B1", "B2"]
},index=["K0", "K1", "K2"])

df2 = pd.DataFrame({
  "C" : ["C0", "C1", "C2"]
},index=["K0", "K2", "K3"])

# Joining data frames
result = df1.join(df2, how="outer")
result
Data Integration Joining Data Frames

After running, both data frames are joined and assigned null values if the key is not present in the data frames.


Summary

  1. Concatenation (concatenation is done using pd.concat() function): Concatenation is used for stacking data frames either vertically or horizontally.
  2. Merging (data frames merge using the pd.merge() function): Merging is used for combining data frames based on keys or columns, similar to SQL joins (inner, outer, left, right).
  3. Joining (Join is done using DataFrame.join() function): Joining use for joining data frames based on their indexes.

These functions allow you to combine data frames in flexible ways to suit your data analysis and manipulation needs.