Data Integration in Machine Learning
To combine data from multiple sources into a single coherent dataset.
Combining two data frames in Python can be done in several ways using the pandas library, depending on the desired operation.
Concatenating data frames can be done using the concat()
function.
There are two data frames with two columns and four rows each.
Now, concatenate data frames vertically (row-wise) for that we use the concat()
function.
import pandas as pd
df1 = pd.DataFrame({
"A" : ["A0", "A1", "A2", "A3"],
"B" : ["B0", "B1", "B2", "B3"]
})
df2 = pd.DataFrame({
"A" : ["A4", "A5", "A6", "A7"],
"B" : ["B4", "B5", "B6", "B7"]
})
# Concatenating row-wise
result = pd.concat([df1, df2], axis=0)
result
An axis=0
means, the concat()
function considers the rows, if we put an axis=1
then it considers the columns of the data frame.
After running, the data frames concatenated vertically. The data frame 2 (df2
) rows are appended to data frame 1 ((df1
)).
To concatenate horizontally, simply set axis=1
in the concat()
function.
import pandas as pd
df1 = pd.DataFrame({
"A" : ["A0", "A1", "A2", "A3"],
"B" : ["B0", "B1", "B2", "B3"]
})
df2 = pd.DataFrame({
"A" : ["A4", "A5", "A6", "A7"],
"B" : ["B4", "B5", "B6", "B7"]
})
# Concatenating column-wise
result = pd.concat([df1, df2], axis=1)
result
After running, the data frames are concatenated horizontally.
The data frame 2 (df2
) columns are appended to the data frame 1 ((df1
)) columns.
Merging data frames can be done using the merge()
function, which is similar to SQL joins.
An inner join combines records from two tables (or data frames) based on a common key, and it returns only the rows with matching keys in both tables.
There are two data frames with two columns and four rows each.
The first column contains keys, and the second column contains values.
Now, merge data frames with inner join, for that we use the merge()
function from the panda's library.
import pandas as pd
df1 = pd.DataFrame({
"key" : ["K0", "K1", "K2", "K3"],
"A" : ["A0", "A1", "A2", "A3"]
})
df2 = pd.DataFrame({
"key" : ["K0", "K1", "K2", "K4"],
"B" : ["B0", "B1", "B2", "B3"]
})
# Merging with inner join
result = pd.merge(df1, df2, on="key")
result
After running this code, it combines the common keys from both data frames.
An outer join combines records from two tables (or data frames) based on a common key, and it returns all the rows from both tables, filling in NaN (or NULL) for missing matches on either side.
For outer join, pass input parameter as how="outer"
in the merge()
function.
import pandas as pd
df1 = pd.DataFrame({
"key" : ["K0", "K1", "K2", "K3"],
"A" : ["A0", "A1", "A2", "A3"]
})
df2 = pd.DataFrame({
"key" : ["K0", "K1", "K2", "K4"],
"B" : ["B0", "B1", "B2", "B3"]
})
# Merging with outer join
result = pd.merge(df1, df2, on="key", how="outer")
result
After running, both data frames are merged based on the outer join condition.
Joining data frames is done using the join()
function.
The join()
function combines two data frames based on their indexes or a key column.
By default, it performs a left join, but you can specify other types of joins (inner, right, outer) as well.
There are two data frames. The first data frame has two columns and three rows, the second data frame has one column and three rows.
As you can see, both data frame keys are set using the index
parameter by passing in the data frame (DataFrame()
) function.
K0
and K2
keys are common in both data frames. K1
is present in data frame 1 (df1
), and K3
is present in data frame 2 (df2
).
Join data frames using the join()
function.
import pandas as pd
df1 = pd.DataFrame({
"A" : ["A0", "A1", "A2"],
"B" : ["B0", "B1", "B2"]
},index=["K0", "K1", "K2"])
df2 = pd.DataFrame({
"C" : ["C0", "C1", "C2"]
},index=["K0", "K2", "K3"])
# Joining data frames
result = df1.join(df2, how="outer")
result
After running, both data frames are joined and assigned null
values if the key is not present in the data frames.
pd.concat()
function): Concatenation is used for stacking data frames either vertically or horizontally.
pd.merge()
function): Merging is used for combining data frames based on keys or columns, similar to SQL joins (inner, outer, left, right).
DataFrame.join()
function): Joining use for joining data frames based on their indexes.
These functions allow you to combine data frames in flexible ways to suit your data analysis and manipulation needs.