In this case, dq told us that row 10472 has a problem, but in real life, how do we find the rows with problems in a big data set?
Hi @candiceliu93:
May I know which specific mission slide you are referring to? Please provide a question link as per these guidelines
First, in real life you have to define what counts as a problem, much like defining what is normal and abnormal in anomaly detection.
The same data could be a problem for one analysis but completely clean for another, depending on which row/column subset you use or how important it is that the values have a certain property. There is also a whole field of outlier detection and unsupervised learning methods for finding problems.
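As a rough sketch of what "defining the problem first" can look like in pandas, here are two common checks: a rule you set yourself, and a simple statistical outlier test (the 1.5 × IQR rule, which is just one of many outlier methods). The column name `rating` and the valid range of 0 to 5 are assumptions made up for this illustration, not something taken from the mission.

```python
import pandas as pd

# Hypothetical toy data; the column name and valid range are assumptions.
df = pd.DataFrame({"rating": [4.1, 3.9, 4.5, 4.0, 19.0, 4.2, 3.8]})

# Check 1: a rule you define yourself (here: ratings must be within 0-5).
rule_violations = df[~df["rating"].between(0, 5)]

# Check 2: the 1.5 * IQR rule, a simple statistical outlier test.
q1 = df["rating"].quantile(0.25)
q3 = df["rating"].quantile(0.75)
iqr = q3 - q1
iqr_outliers = df[(df["rating"] < q1 - 1.5 * iqr) | (df["rating"] > q3 + 1.5 * iqr)]

# The index labels tell you which rows to inspect.
print(rule_violations.index.tolist())
print(iqr_outliers.index.tolist())
```

Both checks return a filtered DataFrame, so the index of the result is the "row number" you would then go and look at by hand.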
Data could have problems by itself, such as `None`, `np.nan`, or `NaT` missing values. Data could also have a problem when you join it with other tables: creating `None` values, inflating the number of rows with duplicate values (causing a danger of groupby overcounting), or showing that a metric from a downstream position in a conversion funnel is impossibly higher than a metric from an upstream position in the funnel.
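The missing-value and join problems described above could be checked along these lines. This is only a minimal sketch: the `orders` and `users` tables and all of their columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical tables, invented for this illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 20, 30],
    "amount": [5.0, np.nan, 7.5],
})
users = pd.DataFrame({
    "user_id": [10, 20, 20],
    "country": ["US", "CA", "CA"],
})

# Rows with at least one missing value (None, np.nan and NaT all count).
missing_rows = orders[orders.isna().any(axis=1)]

# Keys that appear more than once in the lookup table will inflate a join.
duplicate_keys = users[users["user_id"].duplicated(keep=False)]

# indicator=True adds a "_merge" column showing where each row came from.
merged = orders.merge(users, on="user_id", how="left", indicator=True)
row_inflation = len(merged) - len(orders)
unmatched = merged[merged["_merge"] == "left_only"]

print(missing_rows.index.tolist())
print(duplicate_keys.index.tolist())
print(row_inflation)
print(unmatched["order_id"].tolist())
```

Comparing the row count before and after the merge is a cheap way to catch overcounting before it reaches a groupby.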
If you are talking about big data, I assume you're not using in-memory pandas but a database. You can search for the problematic value (assuming it's a single value and not the whole row that has a problem), find all the rows with that value using `WHERE`, do some analysis to zoom in on one (or a few) rows, look at the primary key value, and use that to identify the row in the future when you `DELETE` or `UPDATE` it.
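Here is a small sketch of that `WHERE`, then primary key, then `UPDATE`/`DELETE` workflow, using Python's built-in `sqlite3` module with an in-memory database so it can be run anywhere. The `apps` table, its columns, the 0 to 5 rating range, and the corrected value of 1.9 are all assumptions made up for the example.

```python
import sqlite3

# In-memory database with a hypothetical table, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (id INTEGER PRIMARY KEY, name TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO apps (id, name, rating) VALUES (?, ?, ?)",
    [(1, "A", 4.1), (2, "B", 19.0), (3, "C", None), (4, "D", 3.9)],
)

# Step 1: use WHERE to find every row that breaks the rule you defined.
bad_rows = conn.execute(
    "SELECT id, name, rating FROM apps "
    "WHERE rating IS NULL OR rating NOT BETWEEN 0 AND 5 "
    "ORDER BY id"
).fetchall()
print(bad_rows)

# Step 2: after inspecting them, fix or remove each row by its primary key.
conn.execute("UPDATE apps SET rating = ? WHERE id = ?", (1.9, 2))  # assumed correction
conn.execute("DELETE FROM apps WHERE id = ?", (3,))

remaining = conn.execute("SELECT id, rating FROM apps ORDER BY id").fetchall()
print(remaining)
```

Addressing rows by primary key in step 2 means the fix touches exactly the rows you inspected, even if other rows happen to share the same bad value.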
If you are already aware of `UPDATE`, are you asking how SQL finds the rows under the hood? People usually don't think about that, since SQL is a declarative language.
screen link: https://app.dataquest.io/m/467/communicating-results/2/the-scenario
I want to know how dq found the row with the problem in a big data set.
I want to know how to find a problematic value or row in a big data set in pandas, just like in this chapter. In this chapter, dq told us that row 10472 has a problem. How do I find the problematic row in a big data set?