How to get unique values in a cross street column

Hi guys,
I’m cleaning a dataset about vehicle collisions and there is a column with the name of streets where the accident happened.

However there are “duplicated addresses” with this format:

street A & street B
street B & street A
street A & street C
street C & street A

How can I solve this problem? I’m really stuck with this.

Example dataset:

df = pd.DataFrame({'id':['1FA','2WA','3SQ','4GA'],
                   'cross_street': ['street A & street B', 'street B & street A',
                                  'street A & street C', 'street C & street A']})

Are you trying to say that

street A & street C
street C & street A

are duplicate rows because the streets are the same in both? And you only wish to remove the duplicate row?

Hi @the_doctor, exactly those streets are the same point of intersection and I just want to keep a unique value for each point.

I’ve tried this idea and it works, you can check it out in this code:

# Dataset example
df = pd.DataFrame({'id':['1FA','2WA','3SQ','4GA'],
                   'cross_street': ['street A & street B', 'street B & street A',
                                  'street A & street C', 'street C & street A']})

# Create a status column with values 0 and 1
df['flag'] = 0

df.head()
def reverse_address(address):
  street_x = address.split('&')[0]
  street_y = address.split('&')[1]

  cross_yx = str.strip(street_y + ' & ' + street_x)

  return cross_yx

for c in df['cross_street'].unique():
  flag = df.loc[df['cross_street']==c,'flag']
  if flag.all() == 0:
    df.loc[df['cross_street']==reverse_address(c), 'flag'] = 1
    df.loc[df['cross_street']==reverse_address(c), 'cross_street'] = c

df.head()

But I think that it’s not efficient code for a dataset with 200k rows if I use a for loop and if statement…

I looked a few things up, and came across frozenset that is a built-in Python function that can be passed to Pandas’ apply() method.

So, you take df["cross_street"].str.split(' & ') and get -

0 [street A, street B]
1 [street B, street A]
2 [street A, street C]
3 [street C, street A]

Then you use frozenset -

df["cross_street"].str.split(' & ').apply(frozenset)

to get -

0 (street A, street B)
1 (street A, street B)
2 (street A, street C)
3 (street A, street C)

Notice how the order of the strings has changed above.

Then you find out which rows are duplicates using the duplicated() method in Pandas -

df["cross_street"].str.split(' & ').apply(frozenset).duplicated()

the above gives you -

0 False
1 True
2 False
3 True

As you can see, index 1 and 3 are duplicates.

And then you use normal boolean operation on the dataframe to remove those duplicates -

df[~df["cross_street"].str.split(' & ').apply(frozenset).duplicated()]

to get -

          cross_street   id
0  street A & street B  1FA
2  street A & street C  3SQ

I think this should be better than for loops.

1 Like

I was looking for something like that. First i thought set would work but it did not.

Thanks @the_doctor,
the problem is that I can’t remove any rows because each one corresponds to a different accident (time, date, victims, vehicles…) even if it occurs at the same place.

So I just need change the name of streets when it’s the same place (for this reason I added the id column in this example)

That’s completely different from what you pointed out before when I asked if you wanted to remove the duplicate rows.

I’m sorry, I explained it wrong.

I want to keep a unique place (or address) for each row but not eliminate them:

id cross_street
0 1FA street A & street B
1 2WA street A & street B
2 3SQ street A & street C
3 4GA street A & street C

Something similar to the result of my code but more efficient to use with 200k rows.

Then you can just split the string and sort it in lexicographic (alphabetical) order.

df["cross_street"] = df["cross_street"].str.split(' & ').apply(sorted)

will return -

image

If you want cross_street in the format you shared above, then you can use another apply() for that -

df["cross_street"] = df["cross_street"].str.split(' & ').apply(sorted).apply(lambda val: val[0] + " & " + val[1])

image

1 Like

Hi @the_doctor I solved it! Both solutions have been very useful :slight_smile:

Thank so much!

1 Like