How should a training dataset be distributed?

Sorry if this is a beginner question. Suppose I’m building a text-to-speech model. I was wondering if my training dataset should be “realistically” distributed (i.e. same distribution as the data it will be used on), or should it be uniformly distributed to make sure it performs well on all kinds of sentences. Thanks for your insight.

@JSteve: I would say it depends on your use case. If you're doing this purely for research and don't intend to put it in production, a uniform distribution might be more suitable (and you can adjust the (hyper)parameters later as needed). If it's mission-critical and/or headed for production, it's generally better to match the realistic distribution.

An additional resource you can look into: Structuring ML Projects on Coursera by Andrew Ng.

Hope this helps!


I'd say this depends on the dataset size. If you have a really small dataset (common in some domains, rare in others), you'd want to ensure that all the "important kinds of data" (whatever that means for your task) are represented, even if they're relatively rare. If the dataset is large enough that all the key scenarios would be adequately represented anyway, a realistic distribution is better.
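As a rough sketch of that idea, here's one hypothetical way to build a small dataset with a guaranteed floor per category, then fill the rest with a random draw that approximates the realistic distribution (the category names and counts here are made up):

```python
import random

# Hypothetical sketch: guarantee every category a minimum number of
# examples, then top up with a random sample of whatever remains.
def build_small_dataset(items, min_per_category, total):
    by_category = {}
    for item in items:
        by_category.setdefault(item["category"], []).append(item)
    chosen = []
    for members in by_category.values():
        chosen.extend(members[:min_per_category])  # guaranteed floor
    remaining = [i for i in items if i not in chosen]
    chosen += random.sample(remaining, min(total - len(chosen), len(remaining)))
    return chosen

random.seed(0)
# 50 plain sentences vs. only 2 rare "spoken numbers" examples
items = [{"category": "plain", "id": n} for n in range(50)]
items += [{"category": "numbers", "id": n} for n in range(2)]
dataset = build_small_dataset(items, min_per_category=2, total=10)
print(sorted({i["category"] for i in dataset}))
```

Without the floor, a random draw of 10 from 52 items could easily miss the rare category entirely.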

Also, if mistakes on certain data items are more costly than others (which is likely in some domains), then it may make sense to overrepresent them, since you're then deliberately not optimizing for the average case of the real distribution.
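One simple way to overrepresent costly examples is weighted sampling when drawing training batches. This is a hypothetical illustration (the 5x weight and the "critical" flag are arbitrary assumptions, not a recommendation):

```python
import random

# Hypothetical sketch: give high-cost examples a larger weight when
# sampling batches, so the model sees them more often than their raw
# frequency in the dataset would suggest.
examples = [
    {"text": "routine sentence", "critical": False},
    {"text": "emergency announcement", "critical": True},
    {"text": "another routine sentence", "critical": False},
]
weights = [5.0 if ex["critical"] else 1.0 for ex in examples]

random.seed(0)
batch = random.choices(examples, weights=weights, k=1000)
critical_share = sum(ex["critical"] for ex in batch) / len(batch)
print(critical_share)  # well above the 1/3 share in the raw list
```

The same effect can also be achieved by duplicating examples or by weighting the loss instead of the sampling.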

There's also targeted annotation, where you look at the errors your model is making and annotate extra data specifically to overrepresent those cases. Some types of data are both very common and trivial to solve, so adding more training data for them takes effort but barely changes the results.
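The selection step of targeted annotation can be sketched as "rank the pool by how badly the model handles each item, label the worst ones." Here the error scorer is a stand-in (sentence length pretends to be difficulty), since the real scorer depends on your model:

```python
# Hypothetical sketch of targeted annotation: rank a pool of unlabeled
# items by an error/uncertainty score and pick the worst for labeling.
def select_for_annotation(pool, error_score, budget):
    """Return the `budget` items the model handles worst (highest score)."""
    return sorted(pool, key=error_score, reverse=True)[:budget]

# Assumed stand-in scorer: pretend longer sentences are harder.
pool = [
    "hi",
    "a fairly tricky compound sentence",
    "ok",
    "numbers like 3.14 read aloud",
]
picked = select_for_annotation(pool, error_score=len, budget=2)
print(picked)
```

In practice the scorer would come from model outputs (e.g. per-utterance loss or a quality metric), not from the text itself.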
