# How to get unique values in a cross street column

Hi guys,
I’m cleaning a dataset about vehicle collisions, and one column holds the names of the cross streets where each accident happened.

However, there are “duplicated” addresses in this format:

street A & street B
street B & street A
street A & street C
street C & street A

How can I solve this problem? I’m really stuck with this.

Example dataset:

``````import pandas as pd

df = pd.DataFrame({'id': ['1FA', '2WA', '3SQ', '4GA'],
                   'cross_street': ['street A & street B', 'street B & street A',
                                    'street A & street C', 'street C & street A']})
``````

Are you trying to say that

street A & street C
street C & street A

are duplicate rows because the streets are the same in both? And you only wish to remove the duplicate row?

Hi @the_doctor, exactly. Those streets are the same point of intersection, and I just want to keep a single value for each point.

I’ve tried this idea and it works; you can check it out in this code:

``````# Example dataset
import pandas as pd

df = pd.DataFrame({'id': ['1FA', '2WA', '3SQ', '4GA'],
                   'cross_street': ['street A & street B', 'street B & street A',
                                    'street A & street C', 'street C & street A']})

# Create a status column with values 0 and 1
df['flag'] = 0
``````
``````def reverse_address(address):
    # 'street A & street B' -> 'street B & street A'
    street_x, street_y = address.split(' & ')
    return (street_y + ' & ' + street_x).strip()

for c in df['cross_street'].unique():
    # only touch rows that haven't been normalised yet
    mask_c = (df['cross_street'] == c) & (df['flag'] == 0)
    if mask_c.any():
        # rewrite the reversed form to match this one, then mark both as done
        mask_rev = (df['cross_street'] == reverse_address(c)) & (df['flag'] == 0)
        df.loc[mask_rev, 'cross_street'] = c
        df.loc[mask_c | mask_rev, 'flag'] = 1
``````

But I don’t think this is efficient code for a dataset with 200k rows, given the for loop and if statement…

I looked a few things up, and came across `frozenset`, a built-in Python type whose constructor can be passed to Pandas’ `apply()` method.

So, you take `df["cross_street"].str.split(' & ')` and get -

0 [street A, street B]
1 [street B, street A]
2 [street A, street C]
3 [street C, street A]

Then you use `frozenset` -

``````df["cross_street"].str.split(' & ').apply(frozenset)
``````

to get -

0 (street A, street B)
1 (street A, street B)
2 (street A, street C)
3 (street A, street C)

Notice how rows `1` and `3` now display the same value as rows `0` and `2`. A `frozenset` is unordered, so two sets containing the same streets compare equal regardless of which street came first in the original string.

Then you find out which rows are duplicates using the `duplicated()` method in Pandas -

``````df["cross_street"].str.split(' & ').apply(frozenset).duplicated()
``````

the above gives you -

0 False
1 True
2 False
3 True

As you can see, index `1` and `3` are duplicates.

And then you use normal boolean operation on the dataframe to remove those duplicates -

``````df[~df["cross_street"].str.split(' & ').apply(frozenset).duplicated()]
``````

to get -

``````    id         cross_street
0  1FA  street A & street B
2  3SQ  street A & street C
``````
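An equivalent route, if you’d rather not chain everything into one line, is to store the frozenset in a helper column (here called `key`, purely for illustration) and let `drop_duplicates()` compare on it:

```python
import pandas as pd

df = pd.DataFrame({'id': ['1FA', '2WA', '3SQ', '4GA'],
                   'cross_street': ['street A & street B', 'street B & street A',
                                    'street A & street C', 'street C & street A']})

# frozensets are hashable, so drop_duplicates() can compare them directly
df['key'] = df['cross_street'].str.split(' & ').apply(frozenset)
deduped = df.drop_duplicates(subset='key').drop(columns='key')
```

This keeps the first occurrence of each intersection, which matches the boolean-mask result above.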

I think this should be better than for loops.


I was looking for something like that. First I thought `set` would work, but it did not.
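For anyone wondering why: `duplicated()` hashes values to compare them, and a plain `set` is mutable and therefore unhashable, while `frozenset` is immutable and hashable. A quick check:

```python
pair = ['street A', 'street B']

# frozenset is hashable, and equal sets hash equally regardless of order
print(hash(frozenset(pair)) == hash(frozenset(reversed(pair))))  # True

# a plain set is mutable, so hashing it raises TypeError
try:
    hash(set(pair))
except TypeError as e:
    print('set is unhashable:', e)
```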

Thanks @the_doctor,
the problem is that I can’t remove any rows, because each one corresponds to a different accident (time, date, victims, vehicles…) even if it occurred at the same place.

So I just need to change the street names when they refer to the same place (that’s why I added the `id` column in this example).

That’s completely different from what you pointed out before when I asked if you wanted to remove the duplicate rows.

I’m sorry, I explained it wrong.

I want a single, canonical place (or address) in each row, without eliminating any rows:

id cross_street
0 1FA street A & street B
1 2WA street A & street B
2 3SQ street A & street C
3 4GA street A & street C

Something similar to the result of my code but more efficient to use with 200k rows.

Then you can just split the string and sort it in lexicographic (alphabetical) order.

``````df["cross_street"] = df["cross_street"].str.split(' & ').apply(sorted)
``````

will return -

0    [street A, street B]
1    [street A, street B]
2    [street A, street C]
3    [street A, street C]

If you want `cross_street` in the format you shared above, then you can use another `apply()` for that -

``````df["cross_street"] = df["cross_street"].str.split(' & ').apply(sorted).apply(lambda val: val[0] + " & " + val[1])
``````
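A small variant on the same idea: `Series.str.join()` can glue the sorted lists back together without a lambda, and it works for any number of cross streets, not just two:

```python
import pandas as pd

df = pd.DataFrame({'id': ['1FA', '2WA', '3SQ', '4GA'],
                   'cross_street': ['street A & street B', 'street B & street A',
                                    'street A & street C', 'street C & street A']})

# split -> sort each pair -> join back into one canonical string
df['cross_street'] = (df['cross_street']
                      .str.split(' & ')
                      .apply(sorted)
                      .str.join(' & '))
```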


Hi @the_doctor, I solved it! Both solutions have been very useful.

Thanks so much!
