Comparing Two Large Data Sets and Returning Differences

At work I’ve been tasked with finding differences between two large data sets. Mainly, I just need to have the data sets run against one another and then return any differences between the two. This is normally solved with a combination of VLOOKUP statements. But, these two data sets are too large to be handled in excel.

The data sets are pretty simple, just large. Two columns, one column labeled “ID” and another labeled “Value” for the sake of simplicity.

I was thinking that this is a job for python and that I could probably solve this with some sort of nested list comprehension. Perhaps I could compare these two data sets and then throw each row that doesn’t match into it’s own list of lists?

I think I’ve seen this done in a python lesson before, I just have a few questions.

  1. Is this the best way to go about solving this problem? Is it efficient? I may have to do this again with other data sets.

  2. Would a nested list comprehension compare every single row of data set A to every single row of data set B? These data sets may or may not be sorted so I’d have to make sure that I’m comparing every row. Ex: It’s not enough to compare row 1 to row 1 and row 2 to row 2, I have to compare row 1 of data set A to every row of data set B.

If this post isn’t appropriate for this forum, I understand. I thought this might be a good place though, because this is one of the first times applying what I’ve learned here to a real world scenario.

Thanks.

Hey, Charles.

Not necessarily the best, but probably good enough. An alternative that comes to mind is the command-line tool AWK.

I feel like this is a bit vague for the kind of detail you want. I suggest you include sample data in your question and explain what the desired output is.

It’s perfectly adequate :slight_smile:

For Clarification:
There will be two columns within each data set: ID and Value. There’s two different scenarios that might occur. There will be an ID in one that is not in the other and there will be an ID match that will have a different value. Example:

Data Set A

ID     Value
1      35
2      22
4      98
5      64

Data Set B

ID     Value
1      35
2      22
4      78
6      56

Return:

4      98
5      64
6      56

It’s not really important which of the two rows the program returns if there is a difference in the value, the most important thing is that I know that there is a difference.
The more I look at this the more familiar this seems. I’ll revisit the python function course and have another look. In the meantime, I’ll accept any and all advice.

Hi @charlesd!

Hopefully I understood your question right - you have 2 tables with IDs and values that are supposed to be the same, and you want to flag an occurrence of one ID having a different value in table 1 from the one it has in table 2?

You could possibly try an outer join, by joining on the ‘ID’ column. This should give you a table with 3 columns - ID, Value from 1st table, Value from 2nd table.

Then, assuming your table looks like this:

ID Value 1 Value 2
1 50 50
2 60 61

What you can do is make a 4th column that indicates explicitly if the columns ‘Value 1’ and ‘Value 2’ are equal.

2 Likes

@charlesd Although BBP didn’t mention it, this solution will also be able to capture lines whose ID isn’t present in both files, which I think you want to do.

This solution also suggests using some library like pandas or pandasql, it’s not clear to me whether you know them or if you want a “pure” Python approach. Let us know!

I’m open to any approach that gets the job done. I have some experience with the pandas library, but I’ll have to brush up on it. I have no experience with pandasql, but I’m familiar with SQL. I’ll look into this approach.

Thank you both!

I see 2 questions here, in this order:

  1. Find the mismatching ID, no matter what their values are
  2. Find the matching ID with mismatching values

I’ll assume the ID column contains unique values in both datasets to simplify the operations.
Please alert if this is wrong.

Some pseudo-code:

Part 1:
unique_id_A = set(A.ID)
unique_id_B = set(B.ID)
Mismatched ID = (unique_id_A - unique_id_B).union(unique_id_B - unique_id_A)

Part 2:
Matched ID = unique_id_A.intersection(unique_id_B)
Filtered A, Filtered B = A[Matched ID], B[Matched ID]
Sorted A, Sorted B = Filtered A.sort_values(by=‘ID’), Filtered B.sort_values(by=‘ID’)
Sorted A.Value - Sorted B.Value to find which ID is not 0 in differenced Value.

This worked! Thank you.

I used the outer join and I combined both dataframes into one dataframe. It appears that they’re both the same length and from what I can see there are no mismatches, but there are a lot of rows to go through. What I have now looks like:

Value A          ID       Value B
33               1        33
45               2        45
98               3        98

I was thinking about adding a fourth column and I wanted to do something like this:

df3['Matching'] = np.where((df3['Value A'] == df3['Value B']), df3['Match'], df3['No Match'])

I was under the impression that this would create a 4th column within the combined dataframe and in that fourth column it would label the row as ‘Match’ or ‘No Match’ based on whether column 1 and 3 were equal to one another, but I’m running into a pretty unusual error output. Ultimately, it says "KeyError: 'Match'". I’m not quite sure how to proceed. I was thinking that it might be because it’s not a list of list, or I might have to rethink the where statement.

@hanqi, I’ll give your method a shot.

Ok, so the way that the where statement works is actually like this:

df3['Matching'] = np.where((df3['Value A'] == df3['Value B']), 'Match', 'No Match')

This will create a new column named 'Matching' and fill that column’s row with Match if Value A and Value B in that row match and ‘No Match’ if they do not match. Now all I have to do is either manually go through the dataframe and look for 'No Match' or sort any rows that have No Match into a new data frame so that my coworkers and myself can further examine the IDs that don’t have a match.

I’d like to thank everyone for their help and advice. :smiley: I think I know how to get this last part done, but I’ll post if otherwise.

1 Like

There’s a far simpler way to look for the problem rows without combing the whole dataframe!

I’ll use this piece of code to generate the example dataframe:

df = pd.DataFrame({'ID':[1,2,3],
                  'Value 1': [50,60,70],
                  'Value 2': [50,61,70]})

df

Output:

ID Value 1 Value 2
1 50 50
2 60 61
3 70 70

Using your method, we get:

df['Matching'] = np.where(df['Value 1'] == df['Value 2'], "Matching", "Not Matching")

df

Output:

ID Value 1 Value 2 Matching
1 50 50 Matching
2 60 61 Not Matching
3 70 70 Matching

Narrowing it down to just the columns where there is a ‘Not Matching’ result is a simple matter of Boolean indexing using the Matching column we just made!

df['Matching'] = np.where(df['Value 1'] == df['Value 2'], "Matching", "Not Matching")

df2 = (df[df["Matching"] == "Not Matching"]).copy()

df2

Output:

ID Value 1 Value 2 Matching
2 60 61 Not Matching

I used the .copy method to explicitly create a new dataframe object, df2, that holds the filtered rows, otherwise it would simply be referring to a view of the original dataframe (which might still show you the right rows you want, but it might lead to errors later when trying to manipulate the dataframe with further code).

Apart from using the np.where method above, you could also apply a function across each row to compare column values within the row against each other.

My personal method of choice would have been something like the following:

df['Matching'] = df.apply(lambda x: "Matching" if x['Value 1'] == x['Value 2'] else "Not Matching", axis = 1)

df

Output:

ID Value 1 Value 2 Matching
1 50 50 Matching
2 60 61 Not Matching
3 70 70 Matching

Thanks! I’ll give this a shot as refactoring code and whatnot is a great way to get better and learn new things.

By the way, how did you create those nice looking tables in your comment?

Thanks.

You can quote someone’s comment to see how it was formatted :slight_smile:

|ID|Value 1|Value 2|Matching|
|---|---|---|---|
|1|50|50|Matching|
|2|60|61|Not Matching|
|3|70|70|Matching|