I just finished lessons 1.1 to 1.7 in the DQ course and thought it would be interesting to run an independent project on a public dataset, with Kaggle being the lucky provider. The idea was to adapt the techniques I learned in those lessons to answer new types of questions that I presume I’ll encounter further down the Data Science journey.
The Jupyter Notebook showcasing my project is attached: Groceries dataset - Freeflow practice.ipynb (50.0 KB)
I’m of course open to any constructive feedback, so don’t be shy!
The dataset contains store purchase orders, where each order is a row. The source is on Kaggle: https://www.kaggle.com/heeraldedhia/groceries-dataset
The 1st question that I decided to tackle was which item types are most strongly correlated with each other, i.e. which item type x goes with item type y more than any other item type pairing in the whole order dataset.
It might be useful for a store manager to know which item is most likely to go hand in hand with another item.
I generated a correlation table for this, and of course I show the method I used to generate it.
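For anyone who wants the gist without opening the notebook, here is a minimal sketch of one way such a table could be built. It is not the exact notebook code: it assumes "correlation" simply means how many orders contain both item types, and `sample_orders` and `count_pairs` are made-up names with made-up data for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: each inner list is the item types in one order.
sample_orders = [
    ["whole milk", "rolls/buns", "tropical fruits"],
    ["whole milk", "tropical fruits"],
    ["rolls/buns", "soda"],
    ["whole milk", "rolls/buns"],
]


def count_pairs(orders):
    """Count how many orders contain each unordered pair of item types."""
    pair_counts = Counter()
    for order in orders:
        # Deduplicate and sort so (a, b) and (b, a) land on the same key.
        unique_items = sorted(set(order))
        for pair in combinations(unique_items, 2):
            pair_counts[pair] += 1
    return pair_counts


pair_counts = count_pairs(sample_orders)

# Show the pairs that appear together most often, strongest first.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Raw counts favor popular items, so dividing each count by the number of orders containing one of the two items would be a reasonable next refinement.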
The 2nd question was which item types are the most correlated with a chosen item type. So if one wanted to know which item types are most likely to appear alongside, let’s say, meat, can a table be generated that shows the most likely item types at its top?
A use case that I thought of for this question was promotions, as a store manager might want to know the best item type to promote alongside a discount on another item type.
I defined a function that generates this correlation table and provided an example of it in use, with tropical fruits as the chosen item type, so that its top correlations are easy to see.
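As a rough illustration of that kind of function (again a sketch, not the notebook's actual code), the version below counts how often every other item type shows up in orders that contain the chosen one. The name `top_partners`, its parameters, and the sample orders are all hypothetical.

```python
from collections import Counter

# Hypothetical sample data: each inner list is the item types in one order.
sample_orders = [
    ["whole milk", "rolls/buns", "tropical fruits"],
    ["whole milk", "tropical fruits"],
    ["rolls/buns", "soda"],
    ["whole milk", "rolls/buns"],
]


def top_partners(orders, target, n=5):
    """Return up to n (item type, count) tuples for item types that
    appear in the same orders as `target`, most frequent first."""
    partner_counts = Counter()
    for order in orders:
        items = set(order)  # ignore repeats within a single order
        if target in items:
            partner_counts.update(items - {target})
    return partner_counts.most_common(n)


result = top_partners(sample_orders, "tropical fruits")
for item, count in result:
    print(item, count)
```

Calling it with a different `target`, such as meat, gives the table for that item type instead.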
That about sums it up, hope you all enjoy it and have a great day or evening!