Duplicated data in databases

Hello, all.

To my understanding, what makes relational databases efficient in storing and retrieving data is the lack of duplicated data (assuming a proper schema). However, in the Intermediate Joins In SQL mission there are two tables (track and invoice_line) that share a field: unit_price. This is different than having a primary key linking tables. I get that this one field does not significantly affect the performance or quality of the schema. However, does it suffice that unit_price is as relevant to both track and invoice_line fields for this data to appear twice? More broadly speaking, when is it okay to duplicate data?


1 Like

Here is a good discussion on 4 patterns of how a database can be designed: https://stackoverflow.com/questions/9584387/items-and-specialized-items-multiple-tables-with-duplicate-columns-main-table
(The 2 primary keys in one table in eg 1 seems impossible/wrong)

It describes more of modelling inheritance in OOP, while your question of track vs invoice_line seems closer to composition in OOP (just my observation, I don’t know them well enough to give further design advice, but you can bring this study further after reading https://realpython.com/inheritance-composition-python)

While answering another unrelated question, The difference of Union and join
I found an example (the 2nd Replacing UNION with JOIN link) showing how having duplicated non-joining columns (DateMachine) allows for easy UNION (without them, JOIN must be used).

Generalizing away from the JOIN replace UNION argument, columns are duplicated across tables so joins are not required for common analysis requiring a frequently used column (eg. unit_price) together with other columns from different tables each time. (eg. unit price and columns from track or unit price and columns from invoice_line). It’s a trade-off between speed of query and storage space.

Here is an example of someone trying to duplicate email column in 2 tables: https://dba.stackexchange.com/questions/21950/how-to-avoid-duplicated-column-in-two-different-tables, and a discussion (tangential to this question) of how columns/tables should be thrown away/merged based on cardinality of 1: 0-1/n

1 Like