How to find the row with a problem in a big data set?

In this case, Dataquest told us that row 10472 has a problem, but in real life, how do we find the rows with problems in a big data set?

Hi @candiceliu93:

May I know which specific mission screen you are referring to? Please provide a link to the question as per these guidelines.

First, in real life you have to define what counts as a problem, much like defining what is normal and abnormal in anomaly detection.
The same data could be a problem for one analysis but completely clean for another, depending on which subset of rows and columns you use, or on how much it matters that the values have a certain property. There are also whole fields of outlier detection and unsupervised learning methods for finding problems.
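
To make "define what is a problem" concrete, here is a minimal pandas sketch. The toy data, the column names, and the rule that a rating must lie between 1 and 5 are all assumptions for illustration, not taken from the mission's data set.

```python
import pandas as pd

# Hypothetical toy data; column names and the 1-5 rule are assumptions.
df = pd.DataFrame({
    "app": ["A", "B", "C", "D"],
    "rating": [4.1, 3.8, 19.0, 4.7],
})

# State the rule for "normal" explicitly, then select every row that breaks it.
valid = df["rating"].between(1, 5)
problem_rows = df[~valid]

print(problem_rows)                  # the offending rows
print(problem_rows.index.tolist())   # their index labels
```

Note that `between` returns False for missing values, so rows with a missing rating would be flagged by this rule as well. Whether that is what you want depends on your definition of a problem.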

Data can have problems by itself, such as missing values (None, np.nan, NaT). Data can also develop problems when you join it with other tables: the join can create None values, inflate the number of rows with duplicate values (which risks overcounting in a groupby), or reveal that a metric from a downstream step in a conversion funnel is impossibly higher than a metric from an upstream step.
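
As a rough sketch of how these checks might look in pandas (the tables `orders` and `customers` and all their columns are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical tables; names and values are assumptions for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 12],
    "amount": [50.0, np.nan, 20.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 10, 11],
    "name": ["Ann", "Ann", "Bob"],
})

# 1. Problems in the data itself: rows with any missing value.
missing_rows = orders[orders.isna().any(axis=1)]

# 2. Duplicate keys in the lookup table, which would inflate a join.
dup_keys = customers[customers.duplicated("customer_id", keep=False)]

# 3. Problems created by the join: extra rows and unmatched rows.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(len(orders), len(merged))  # more rows after the join means inflation
print(unmatched)
```

If you expect each order to match at most one customer, passing `validate="many_to_one"` to `merge` makes pandas raise an error instead of silently inflating the row count.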

If you are talking about big data, I assume you're not using in-memory pandas but a database. You can search for the problematic value (assuming it's a single value and not the whole row that has a problem), find all the rows with that value using WHERE, do some analysis to zoom in on one (or a few) rows, look at the primary key value, and use it to refer to that row in future, for example to DELETE or UPDATE it.
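
The WHERE, primary key, then DELETE workflow could be sketched like this, using Python's built-in sqlite3 module with an in-memory database. The `apps` table, its columns, and the `rating > 5` condition are assumptions for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table; schema and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (id INTEGER PRIMARY KEY, name TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO apps (id, name, rating) VALUES (?, ?, ?)",
    [(1, "A", 4.1), (2, "B", 19.0), (3, "C", 4.7)],
)

# Step 1: find all rows holding the problematic value with WHERE.
bad = conn.execute(
    "SELECT id, name, rating FROM apps WHERE rating > 5"
).fetchall()
print(bad)

# Step 2: use the primary key of each bad row to target it precisely.
bad_ids = [row[0] for row in bad]
conn.executemany("DELETE FROM apps WHERE id = ?", [(i,) for i in bad_ids])

remaining = conn.execute("SELECT id FROM apps ORDER BY id").fetchall()
print(remaining)
conn.close()
```

Inspecting the SELECT result before running the DELETE is the "zoom in" step: it lets you confirm you are removing only the rows you meant to.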

If you are already aware of UPDATE, are you asking how SQL finds the rows under the hood? People usually don’t think about that, since SQL is a declarative language.

screen link: https://app.dataquest.io/m/467/communicating-results/2/the-scenario

I want to know how Dataquest finds the row with the problem in a big data set.

I want to know how to find a problematic value or row in a big data set in pandas, just like in this chapter, where Dataquest told us that row 10472 has a problem. How do I find the problematic row in a big data set?