test_urls_clean = test_urls.str.split('/').str[2].str.split('?').str[0]
domains = hn['url'].str.split('/').str[2].str.split('?').str[0]
top_domains = domains.value_counts().head()
When I run the above code:
- test_urls_clean and top_domains turn green and match the expected answer
- domains comes up red
How do I check what the expected answer series is, so I can compare it with my results?
Yes, this can be quite difficult to do in such a case. It might be easier outside of the DQ platform, for example in your own Jupyter Notebook, but I am currently unaware of a surefire way of cross-checking such results for large dataframes without an automatic grader checking them for us (cc - @Sahil possible content suggestion: how to validate our own results for large dataframes)
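One rough way to sanity-check this yourself is to compare your split-based result against an independent reference and look only at the rows where they disagree. Here is a minimal sketch using just the standard library. Note that the sample URLs are made up, split_domain is a hypothetical helper that mimics the split chain for a single URL, and urlparse is only my stand-in reference, not whatever the DQ grader actually uses.

```python
from urllib.parse import urlparse

def split_domain(url):
    """Mimic the split('/') then split('?') chain for a single URL."""
    parts = url.split('/')
    if len(parts) < 3:
        # No third piece to pick, e.g. a URL without "//"
        return None
    return parts[2].split('?')[0]

# Made-up sample URLs, not taken from the hn dataset
sample_urls = [
    "https://www.example.com/path?x=1",
    "http://example.org?ref=home",
    "https://example.net#section",
]

# Keep only the URLs where the two approaches disagree
mismatches = [
    (url, split_domain(url), urlparse(url).netloc)
    for url in sample_urls
    if split_domain(url) != urlparse(url).netloc
]

for url, got, reference in mismatches:
    print(f"{url}: split gave {got!r}, urlparse gave {reference!r}")
```

On the real column you could apply the same idea row by row and inspect only the mismatching URLs, which is usually a much shorter list than the full series.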
As for where the issue actually is, it’s with URLs like the following -
Your approach fails for edge cases like the ones above. You can either modify your current approach to handle each kind of URL individually, or you can rely on regex for this, since it’s more powerful in such cases.
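To give an idea of what the regex route could look like, here is a minimal sketch. The pattern is my own assumption (an optional scheme followed by a run of host characters) and is not necessarily the one the DQ solution or grader uses. extract_domain is a hypothetical helper and the URLs in the usage lines are made up.

```python
import re

# Assumed pattern: an optional "scheme://" prefix, then capture the run of
# word characters, dots and hyphens that follows it.
DOMAIN_PATTERN = re.compile(r"(?:[a-z][a-z0-9+.\-]*://)?([\w.\-]+)", re.IGNORECASE)

def extract_domain(url):
    """Return the captured host part of url, or None if nothing matches."""
    match = DOMAIN_PATTERN.match(url)
    return match.group(1) if match else None

# Made-up URLs covering a query string, a port, a fragment and a missing scheme
print(extract_domain("https://www.example.com/path?x=1"))
print(extract_domain("HTTP://Example.org:8080/a"))
print(extract_domain("https://example.net#section"))
print(extract_domain("example.net/page"))
```

In pandas, the same pattern string can be passed to Series.str.extract with flags=re.I and expand=False to get a series back. I would still check the result against the edge cases in your data before trusting it.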
Do note that the third URL above is particularly nasty. As per DQ, the expected output for that one is just ftp. I would have expected that to be ftp.tcl.tk, but it doesn’t seem to be the case (cc - @Sahil could you please provide clarification for this? It seems like a rare edge case, but it is confusing to work through since DQ’s grader accepts ftp as the correct extracted URL for it)