Seeking clarity on copying vs. aliasing data sets

Hi! I am working on the first guided project and am cleaning the data sets. I see that I have a choice between changing the original data and making a duplicate, and I can see why changing the original data may not be best practice. If I create the duplicate as an empty list, I can use .append and leave the original data set unchanged.
https://bit.ly/3sIetCz

app_data = [['A', 'B', 'C', 'D'], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3], [1, 2, 3, 4]]
new_data_set = []

def refiner(data_set):
    # keep only the rows whose length matches the header row
    for row in data_set:
        if len(row) == len(data_set[0]):
            new_data_set.append(row)   # appends to the global list
        
print(len(app_data))
print(len(new_data_set))
refiner(app_data)
print(len(app_data))
print(len(new_data_set))

and the output

5
0
5
4
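
For reference, the same idea can be written without relying on the global list; this is just a sketch of that variant:

def refiner(data_set):
    # build and return a fresh list instead of appending to a module-level one
    refined = []
    for row in data_set:
        if len(row) == len(data_set[0]):
            refined.append(row)
    return refined

new_data_set = refiner(app_data)   # app_data stays unchanged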

But if I try to create the duplicate by assigning the original list to a new name, the assignment only binds a second name to the same list (an alias, not a copy), so refining the duplicate changes the original data set as well.
https://bit.ly/3bTJrRk

app_data = [['A', 'B', 'C', 'D'], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3], [1, 2, 3, 4]]
new_data_set = []           # this empty list is immediately discarded by the next line
new_data_set = app_data     # binds a second name to the same list; no copy is made

def refiner(data_set):
    for row in data_set:
        if len(row) != len(data_set[0]):
            row_index = data_set.index(row)   # linear search for the row's position
            del data_set[row_index]           # deleting while iterating skips the next row
        
print(len(app_data))
print(len(new_data_set))
refiner(new_data_set)
print(len(app_data))
print(len(new_data_set))

and the output

5
5
4
4
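
What the assignment actually does becomes visible with an identity check; a minimal sketch:

app_data = [['A', 'B', 'C', 'D'], [1, 2, 3, 4]]
alias = app_data            # no copy: both names point at one list
print(alias is app_data)    # True
shallow = app_data[:]       # slicing builds a new outer list (rows still shared)
print(shallow is app_data)  # False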

So my questions are: 1) what is the effect on memory if I use the first method, and 2) if that effect is large, is there a work-around for the second method so I don't get stuck changing the original data?

Thanks everyone. Hope you’re having a great week.

You can profile memory with some of the methods described here: https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python
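
For a quick standard-library check, tracemalloc can show what building the refined list actually costs; a rough sketch:

import tracemalloc

app_data = [[i, i + 1, i + 2, i + 3] for i in range(100_000)]

tracemalloc.start()
new_data_set = [row for row in app_data if len(row) == len(app_data[0])]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# the new list only holds references to the existing rows,
# so the overhead is roughly one pointer per kept row, not a full duplicate
print(f"current: {current} bytes, peak: {peak} bytes")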

If you don’t want to generate the full refined dataset, you can just save the indexes of the rows you want to keep rather than the rows themselves. A lot of data manipulation in pandas/scikit-learn is just manipulating indexes rather than the rows themselves.
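
A sketch of that idea with plain lists:

app_data = [['A', 'B', 'C', 'D'], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3], [1, 2, 3, 4]]

# save only the indexes of the rows that pass the length check
keep = [i for i, row in enumerate(app_data) if len(row) == len(app_data[0])]
print(keep)   # [0, 1, 2, 4]

# materialize the refined rows only when you actually need them
refined = [app_data[i] for i in keep]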

The data_set.index(row) line in your 2nd code does a linear search, so each lookup is O(n) and the loop as a whole is O(n²). You can do it faster by using enumerate to generate an index on the fly that moves together with the data.
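
One way to apply that, if you still want the in-place delete (just a sketch):

def refiner(data_set):
    # enumerate supplies the index directly, so no .index() search is needed;
    # walking backwards means a delete never shifts a row we have yet to visit
    for i, row in reversed(list(enumerate(data_set))):
        if len(row) != len(data_set[0]):
            del data_set[i]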

Normally, a copy is needed when the same source data must be used for more than one type of analysis. If you don’t copy, the first analysis will have changed the data in a way that may not suit the second. If you don’t need that split path and the flow is one-way, then manipulating in place is fine. A lot of pandas methods have an inplace parameter.
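
For example, with pandas (a sketch; dropna just stands in for whatever cleaning step applies):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": [4, None, 6]})

cleaned = df.dropna()     # returns a new DataFrame; df is unchanged
df.dropna(inplace=True)   # mutates df itself and returns None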

Thanks for the great answer. I take your point on using indexes instead of complete data tables. I have used enumerate a little bit in a different course but not yet in DQ.

Thanks again.

Also thanks for the link to the memory management article.