Extracting domains using split instead of regex

My Code:

test_urls_clean = test_urls.str.split('/').str[2].str.split('?').str[0]

domains = hn['url'].str.split('/').str[2].str.split('?').str[0]

top_domains = domains.value_counts().head()

When I run the above code:

  1. test_urls_clean and top_domains become green and match the expected answer
  2. domains comes up red

How do I check what the answer series for this is, so I can compare my results?



Yes, this can be quite difficult to do in such a case. It might be easier outside of the DQ platform, e.g. in your own Jupyter Notebook, but I am currently unaware of a surefire way of cross-checking such results for large dataframes without an automatic grader checking them for us (cc @Sahil — possible content suggestion: how to validate our own results for large dataframes).
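One rough way to do this locally is to diff your split-based series against any alternative extraction (a regex one, say) and inspect only the rows where they disagree. A minimal sketch — the values below are invented for illustration, not DQ's answer series:

```python
import pandas as pd

# Toy stand-ins: your split-based result vs. some reference extraction
# (e.g. from a regex approach you trust more). Values are made up.
mine = pd.Series(["github.com", "ftp", "arstechnica.com"])
reference = pd.Series(["github.com", "ftp.tcl.tk", "arstechnica.com"])

mismatch = mine.ne(reference)  # True wherever the two extractions disagree
print(mine[mismatch])          # your values at the disagreeing rows
print(reference[mismatch])     # what the reference produced instead
```

This won't tell you what DQ's grader holds, but it narrows a large dataframe down to the handful of rows worth eyeballing.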

But in regards to where the issue actually lies, it's with URLs like the ones in the screenshot below -

[screenshot of example URLs omitted]

Your approach fails for edge cases like those. You can either modify your current approach to handle each individual URL, or rely on regex for this, since it's more powerful in such cases.
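As a sketch of the regex route — the URLs below are made up, and the pattern is one possible choice rather than necessarily the one DQ's solution uses — `Series.str.extract` can capture the host in a single pass:

```python
import pandas as pd

# Made-up URLs for illustration
urls = pd.Series([
    "https://www.example.com/story?id=1",
    "http://blog.example.org/post/2",
])

# One possible pattern: capture everything between '://' and the next '/' or '?'
domains = urls.str.extract(r"://([^/?]+)", expand=False)
print(domains.tolist())  # ['www.example.com', 'blog.example.org']
```

The advantage over chained `.str.split()` calls is that the pattern states the intent ("the host part") directly, and you can tighten it for whichever edge cases your data turns up.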

Do note that the third URL above is particularly nasty. As per DQ, the expected output for that one is just ftp. I would have expected ftp.tcl.tk, but that doesn't seem to be the case (cc @Sahil — could you please clarify this? It seems like a rare edge case, but it's confusing to work through, since DQ's grader accepts ftp as the correct extracted domain for it).
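For what it's worth, here is what plain string splitting and the standard library each say about such a URL (only the ftp.tcl.tk host comes from this thread; the path after it is invented):

```python
from urllib.parse import urlparse

url = "ftp://ftp.tcl.tk/pub/tcl"  # host from the thread; path is invented

# The split approach: 'scheme://host/...' puts the host at index 2
print(url.split('/')[2])     # ftp.tcl.tk

# The standard library agrees the network location is ftp.tcl.tk
print(urlparse(url).netloc)  # ftp.tcl.tk
```

Both agree on ftp.tcl.tk, which is why the grader accepting plain ftp is surprising.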
