Dataframes Merge

Hello,
I’m working on a personal project.
I have two datasets:

  • Personal information from Netflix accounts
  • Netflix Originals from Kaggle

The personal information dataset doesn’t have the type of content (movie or TV show). Since both datasets have a common column (‘Title’), I’m trying to merge them using the following code, where:
Netflix Originals = netflix_processed
Netflix personal data = netflix_updated

result1 = netflix_processed.merge(netflix_updated, indicator=True, how='outer')

When I check the shape of the new dataset:
result1.shape
(85121, 15)

Since the Netflix Originals dataset contains all the data from Netflix, and the personal data only has the titles available in the accounts’ country, the result should only contain the common titles. But when I check the shape of the personal data, it is different:
(7267, 7)

What could I be doing wrong?

I’m sorry, I might not understand it fully. Are there titles in your Netflix personal dataframe that are not included in the Netflix Originals dataframe?
I think you are getting more rows than you expected because you are using an ‘outer’ merge, which keeps the rows from both dataframes whether or not they match. The extra columns come from combining the columns of both dataframes.
If you want to keep all the data from the Netflix Originals dataframe and only add the titles from the Netflix personal dataframe that are also in the Netflix Originals dataframe, then a ‘left’ merge might be more appropriate. Or use an ‘inner’ merge if you only want to keep data that is present in both dataframes.

It seems to be a fun project! Good luck!
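
To make the difference between the merge types mentioned above concrete, here is a minimal sketch on two tiny made-up dataframes. The values and the "profile" column are assumptions for illustration only, not the real Kaggle or account data, and the key column is assumed to be called "title".

```python
import pandas as pd

# Toy stand-ins for the two dataframes (made-up values, hypothetical columns).
netflix_processed = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["Movie", "TV Show", "Movie"],
})
netflix_updated = pd.DataFrame({
    "title": ["B", "C", "D"],
    "profile": ["p1", "p2", "p1"],
})

# The number of rows in the result depends on the kind of merge.
outer = netflix_processed.merge(netflix_updated, on="title", how="outer")
left = netflix_processed.merge(netflix_updated, on="title", how="left")
inner = netflix_processed.merge(netflix_updated, on="title", how="inner")

print(outer.shape)  # titles from either dataframe
print(left.shape)   # every title from netflix_processed
print(inner.shape)  # only titles present in both
```

With this toy data the outer merge has the most rows and the inner merge the fewest, which matches the behaviour described in the reply.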

Thanks for your response!
Let me try to be clearer.
The Netflix Originals dataset has information that the user data doesn’t have, which is the “Type”, and the common column is “Title”. My question is why the merged dataset has more rows than the user data, since it should only take the titles present in both datasets.

Does the Netflix Originals dataset contain all the movie/show titles from your personal data?

Yes, since it has all the Netflix titles.

Even if they are both on Netflix, they might not have the same titles. Your personal dataset might include movies/shows that are not Netflix “Originals” or that are only available in your region. It also depends on when the data for the two datasets was collected: the newer dataset might have titles that weren’t available when the first one was collected.
Have you checked if they are the same?

I’m trying to work out how to compare the names in both columns. Do you have any suggestions?
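
One possible way to compare the title columns directly is with Python sets. This is only a sketch on made-up data: the key column is assumed to be called "title", and stripping whitespace and lowercasing is an extra assumption, in case the same title is spelled slightly differently in the two files.

```python
import pandas as pd

# Toy stand-ins (made-up values); the real column may be "Title" or "title".
netflix_processed = pd.DataFrame({"title": ["A", "B", "C"]})
netflix_updated = pd.DataFrame({"title": ["B", "b ", "D"]})

# Normalize whitespace and case so "b " and "B" count as the same title.
originals = set(netflix_processed["title"].str.strip().str.lower())
personal = set(netflix_updated["title"].str.strip().str.lower())

common = originals & personal             # titles in both dataframes
only_in_personal = personal - originals   # titles missing from the Originals data
only_in_originals = originals - personal  # titles missing from the personal data

print(sorted(common))
print(sorted(only_in_personal))
print(sorted(only_in_originals))
```

If only_in_personal is not empty, the personal data contains titles that the Originals dataset does not have.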

I used this line:
merged = pd.merge(netflix_processed, netflix_updated, on=['title'])

When the merge is finished, I have a new column called _merge that shows three values:
left_only
right_only
both

I’ve attached the final CSV:
https://drive.google.com/file/d/1u9Oz04a5txNtyeqddbdK9uTBporsTSBU/view?usp=sharing
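
In pandas, the _merge column is produced when indicator=True is passed to merge, and it can be used to list the titles that did not match. The sketch below uses made-up data and assumes the key column is called "title". The outer merge is used here only so that unmatched rows from both sides show up.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (made-up values).
netflix_processed = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["Movie", "TV Show", "Movie"],
})
netflix_updated = pd.DataFrame({"title": ["B", "C", "D"]})

# indicator=True adds the _merge column: left_only, right_only or both.
merged = netflix_processed.merge(
    netflix_updated, on="title", how="outer", indicator=True
)

# Titles found only in the personal data (the right-hand dataframe).
missing_from_originals = merged.loc[
    merged["_merge"] == "right_only", "title"
].tolist()

print(merged["_merge"].value_counts())
print(missing_from_originals)
```

Rows marked right_only are titles from netflix_updated with no match in netflix_processed, and rows marked left_only are the reverse.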

I’ve just sent you a request for access so I can check your CSV.

I’m a beginner as well, so I don’t know the best way to compare for common titles.
One way I can think of is using merge :smiley: An ‘inner’ merge will return only the rows whose titles are common to both!

Access granted, but I found the problem.

There’s a title present in the user dataset that is not present in the Netflix titles dataset.

I’m glad you managed to figure it out! Good luck with your project!