Question about the 'while' loop in Python

Screen Link:
https://app.dataquest.io/m/1000351/cleaning-and-preparing-data-in-python-practice-problems/7/cleaning-house-listings-3

My Code:

import csv

def clean_id_col():
    # Read the CSV file
    with open('listings.csv') as f:
        rows = list(csv.reader(f))
    # Clean id column: collect the ids that already exist
    id_set = set()
    for row in rows[1:]:
        if row[0]:
            id_set.add(int(row[0]))
    cur_id = 1000
    # --- lines in question start ---
    for row in rows[1:]:
        if not row[0]:
            while cur_id in id_set:
                cur_id += 1
            row[0] = str(cur_id)
            cur_id += 1
    # --- lines in question end ---
    # Write the cleaned file (write_csv is not defined here; presumably the exercise provides it)
    write_csv(rows)

I have a question regarding the lines between the two marker comments in the code above.
The task here was to insert a random four-digit number in row[0] if row[0] was empty.

My question is why there are two “cur_id += 1” lines, one inside the while loop and the second just after row[0] is assigned.

With cur_id set to 1000, if the while condition is true (in other words, if cur_id is in id_set), then cur_id += 1 runs (the first cur_id += 1) and row[0] becomes 1001.

Even without the second cur_id += 1, wouldn’t the code work?
Without the second cur_id += 1, cur_id would remain 1001, and since the while condition is true (cur_id is in id_set), the process would repeat.
The first cur_id += 1 would change cur_id to 1002 and row[0] would become 1002.

Just consider a simple example of this with the following data -

original = [ ["1234", "something"],
             ["", "something"],
             ["1000", "something"],
             ["", "something"],
             ["1002", "something"] ]

You first create the id_set, which would be {1234, 1000, 1002} in this case.
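
As a quick check, here is a minimal sketch of that first step on the sample data. It uses a set comprehension instead of the explicit loop from the question, which is only a stylistic choice on my part.

```python
# Sample rows from the example above (there is no header row here).
original = [["1234", "something"],
            ["", "something"],
            ["1000", "something"],
            ["", "something"],
            ["1002", "something"]]

# Collect every id that is already present, skipping the empty cells.
id_set = {int(row[0]) for row in original if row[0]}

print(sorted(id_set))  # [1000, 1002, 1234]
```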

Then you have the following code without the update at the end as you suggest -

    cur_id = 1000
    for row in original:
        if not row[0]:
            while cur_id in id_set:
                cur_id += 1
            row[0] = str(cur_id)

For the 2nd row, we hit the if condition.

We then check whether cur_id is in id_set. It is (cur_id is 1000), so we increase cur_id by 1. Now cur_id is 1001, which is not in id_set.

We move onto the next code line and set that id for that row.

We continue with the loop. We reach the 4th row now (since it’s empty).

We hit the while loop and we check if cur_id is in id_set or not.

It’s not. Our cur_id is 1001 which is not in id_set. So, we don’t update cur_id.

We end up assigning row[0] the value 1001.

However, we assigned that same value to our 2nd row previously. So, we end up with duplicate values here.
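
The walkthrough above can be reproduced with a short script. This is only a sketch: the function name fill_ids_single_increment is made up for illustration, and it works on the sample data rather than on listings.csv.

```python
def fill_ids_single_increment(rows, start=1000):
    """Hypothetical helper: the loop from the question WITHOUT the final cur_id += 1."""
    id_set = {int(row[0]) for row in rows if row[0]}
    cur_id = start
    for row in rows:
        if not row[0]:
            while cur_id in id_set:
                cur_id += 1
            row[0] = str(cur_id)
            # No update to cur_id or id_set here, so the same id can be handed out again.
    return rows


sample = [["1234", "something"],
          ["", "something"],
          ["1000", "something"],
          ["", "something"],
          ["1002", "something"]]

ids = [row[0] for row in fill_ids_single_increment(sample)]
print(ids)  # ['1234', '1001', '1000', '1001', '1002']
```

The second and fourth rows both end up with 1001, which is the duplicate described above.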

The source of the confusion here, in my view, is that id_set is not being updated in this code. Updating cur_id twice makes it a tad harder to understand.
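
For comparison, here is a sketch of the loop from the question with both increments kept, run on the same sample data. The function name fill_ids_two_increments is hypothetical. The second increment does the job that updating id_set would otherwise do: it steps past the id that was just handed out.

```python
def fill_ids_two_increments(rows, start=1000):
    """Hypothetical helper: the loop from the question with both cur_id += 1 lines."""
    id_set = {int(row[0]) for row in rows if row[0]}
    cur_id = start
    for row in rows:
        if not row[0]:
            while cur_id in id_set:
                cur_id += 1      # skip ids that already exist in the file
            row[0] = str(cur_id)
            cur_id += 1          # skip the id we just assigned (id_set is never updated)
    return rows


sample = [["1234", "something"],
          ["", "something"],
          ["1000", "something"],
          ["", "something"],
          ["1002", "something"]]

ids = [row[0] for row in fill_ids_two_increments(sample)]
print(ids)  # ['1234', '1001', '1000', '1003', '1002']
```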

Alternative Solution 1

I do think that the provided solution is a bit more complex than it has to be.

I think a better solution would be -

cur_id = 1000
for row in rows[1:]:
    if not row[0]:
        while cur_id in id_set:
            cur_id += 1
        id_set.add(cur_id)
        row[0] = str(cur_id)

I added id_set.add(cur_id) to the above and removed the second update to cur_id.

The above change makes it clear that we have a new id assigned to a row and removes the confusion of having to update cur_id again.
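
Here is a small, self-contained sketch of this first alternative on the sample data from earlier. The function name fill_ids_track_set is made up for illustration, and the loop runs over all rows because the sample has no header row.

```python
def fill_ids_track_set(rows, start=1000):
    """Hypothetical helper: record each new id in id_set instead of bumping cur_id twice."""
    id_set = {int(row[0]) for row in rows if row[0]}
    cur_id = start
    for row in rows:
        if not row[0]:
            while cur_id in id_set:
                cur_id += 1
            id_set.add(cur_id)   # the new id now counts as taken
            row[0] = str(cur_id)
    return rows


sample = [["1234", "something"],
          ["", "something"],
          ["1000", "something"],
          ["", "something"],
          ["1002", "something"]]

ids = [row[0] for row in fill_ids_track_set(sample)]
print(ids)  # ['1234', '1001', '1000', '1003', '1002']
```

On this data the result matches what the two-increment version from the question produces, with every id unique.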

Alternative Solution 2

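Another approach, which to me reads as a more step-by-step breakdown, would be -

cur_id = 1000
for row in rows[1:]:
    if not row[0]:
        # Caveat: if cur_id is already taken, this row is skipped and stays empty.
        if cur_id not in id_set:
            id_set.add(cur_id)
            row[0] = str(cur_id)
        cur_id += 1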

The above has a more logical breakdown -

  1. Iterate through the rows
  2. Check if there is an id or not
  3. If there is no id, check if the current id exists in the set of existing ids.
  4. If the current id is not in the set of existing ids, then
    4.1 Add the current id to the set of existing ids
    4.2 Assign the current id to the row
  5. Update the current id by 1.
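
Tracing the steps above on the sample data from earlier is a useful check. In this sketch (the function name fill_ids_if_check is hypothetical), the second row is left without an id: cur_id is 1000 when that row is reached, 1000 is already taken, and an if does not retry the way a while does.

```python
def fill_ids_if_check(rows, start=1000):
    """Hypothetical helper following steps 1 to 5 above, using if instead of while."""
    id_set = {int(row[0]) for row in rows if row[0]}
    cur_id = start
    for row in rows:                      # step 1
        if not row[0]:                    # step 2
            if cur_id not in id_set:      # steps 3 and 4
                id_set.add(cur_id)        # step 4.1
                row[0] = str(cur_id)      # step 4.2
            cur_id += 1                   # step 5
    return rows


sample = [["1234", "something"],
          ["", "something"],
          ["1000", "something"],
          ["", "something"],
          ["1002", "something"]]

ids = [row[0] for row in fill_ids_if_check(sample)]
print(ids)  # ['1234', '', '1000', '1001', '1002']
```

So this version only fills every empty row when cur_id never collides with an existing id at the moment an empty row is reached. When collisions are possible, the while loop from Alternative Solution 1 is still needed.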

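I am unsure whether there is additional context to these problems or a ceiling on how much Python knowledge one is expected to have to solve them. The only new thing in my solutions is really the not in part, which shouldn’t be complicated to understand. One caveat on Alternative Solution 2: if cur_id is already in id_set when an empty row is reached, the if check fails, that row stays empty, and only cur_id moves on. Alternative Solution 1, with its while loop, does not have that problem, so it is the safer of the two.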

Hopefully this helps.


Yes, thank you very much for your explanation!
It clears up so many questions that I had.
Thanks for your time! :slight_smile:

Glad I could help. If you think my answer could help other students, then feel free to mark it as a solution to your question.
