Regex for cleaning show titles

Hi, I want to analyze my Netflix data. There is a column “Title” that contains the titles of the shows you have watched. These shows are movies like “Karate Kid” and series like “Cobra Kai”. Titles are printed in this pattern:
SpongeBob Schwammkopf: Staffel 4: Der allerschönste Tag / Gummilein (Folge 20)
Karate Kid IV – Die nächste Generation

I need a regex which extracts the bolded part. I have no problem doing this with a movie title, but doing this with an episode of a series is hard for me. (It also needs to work without colons.) I tried this:
Thinking I could extract everything before a possible colon and before a possible second colon. But it did not work.

Help is appreciated. Best and thank you.

I don’t understand what each string actually is. For movies, for instance, it looks like you want to extract everything.

Absolutely. I need a pattern which extracts everything from a movie title, but in the case of a series episode only the parts before the first and before the second colon. Like: TitleOfTheSeries:Season and not the rest.

I don’t understand what the dataset looks like, but the pattern [\w\s]*:[\w\s]*(?=:) works to capture SpongeBob Schwammkopf: Staffel 4 in SpongeBob Schwammkopf: Staffel 4: Der allerschönste Tag / Gummilein (Folge 20).
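To see what this pattern does on the two sample titles from the question, here is a minimal sketch using Python's re module (the variable names are only for illustration). It matches the series title and season, but finds nothing in a title that has no colon.

```python
import re

# Pattern from the post: word/space characters, a colon, more word/space
# characters, then a lookahead that requires (but does not consume) a
# second colon.
SERIES_PATTERN = re.compile(r"[\w\s]*:[\w\s]*(?=:)")

episode = "SpongeBob Schwammkopf: Staffel 4: Der allerschönste Tag / Gummilein (Folge 20)"
movie = "Karate Kid IV – Die nächste Generation"

episode_match = SERIES_PATTERN.search(episode)
movie_match = SERIES_PATTERN.search(movie)

print(episode_match.group(0))  # SpongeBob Schwammkopf: Staffel 4
print(movie_match)             # None, because the movie title has no colon
```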


Hi Bruno,
Many thanks. Yes, it works for an episode of a series with the pattern name_of_series:seasonxy.

I learned a lot from seeing how you used the square brackets, the asterisk and the positive lookahead. Thank you.

But the regex should also work for cases where there is no colon, that is, a movie. Like Karate Kid IV or Critters.

Based on your pattern I tried one which only searches for titles without a colon. I did this: r'([\w\s]*(?!:))'. But it did not work.

So I need a way to find both cases of a title: with and without a colon. And in the case of a colon, only the parts before the first and after the first colon. Not more. Thank you for your time and advice.
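A likely reason the negative-lookahead attempt falls short, sketched in Python: (?!:) only checks the single character right after the match, so the engine backtracks by one character until the next character is not a colon. It also stops at the first character that is neither a word character nor whitespace, such as the dash in the movie title. The variable names here are illustrative.

```python
import re

# The attempted pattern: word/space characters not directly followed by a colon.
NO_COLON_ATTEMPT = re.compile(r"([\w\s]*(?!:))")

episode = "SpongeBob Schwammkopf: Staffel 4: Der allerschönste Tag / Gummilein (Folge 20)"
movie = "Karate Kid IV – Die nächste Generation"

episode_attempt = NO_COLON_ATTEMPT.search(episode).group(0)
movie_attempt = NO_COLON_ATTEMPT.search(movie).group(0)

# The match gives up one character before the colon instead of failing.
print(repr(episode_attempt))  # 'SpongeBob Schwammkop'
# The match stops at the dash, which is neither \w nor \s.
print(repr(movie_attempt))    # 'Karate Kid IV '
```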



Hopefully [\w\s]*(:)?[\w\s]*(?=\1)|.*[^:].* works.
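A rough check of this combined pattern with Python's re module. One assumption worth flagging: in Python, a backreference such as \1 fails when its group did not take part in the match, which is what pushes colon-free titles into the second alternative. Other regex engines may treat that case differently, so this is a sketch for Python only.

```python
import re

# First alternative: title, colon, season, followed by another colon (via \1).
# Second alternative: fallback that takes the whole line.
COMBINED_PATTERN = re.compile(r"[\w\s]*(:)?[\w\s]*(?=\1)|.*[^:].*")

titles = [
    "SpongeBob Schwammkopf: Staffel 4: Der allerschönste Tag / Gummilein (Folge 20)",
    "Karate Kid IV – Die nächste Generation",
    "Critters",
]

combined_results = [COMBINED_PATTERN.search(title).group(0) for title in titles]

for result in combined_results:
    print(result)
# SpongeBob Schwammkopf: Staffel 4
# Karate Kid IV – Die nächste Generation
# Critters
```

Because the fallback alternative accepts any line, a series title containing characters outside [\w\s] before the first colon (a hyphen or apostrophe, say) would come back whole rather than trimmed.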

Ultimately, though, this is a game. Because you can use other tools in addition to regular expressions (e.g. Python and if/else statements), there’s no need to try to catch all cases with a single regular expression. I like to do it for fun, not to accomplish things.
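For comparison, the plain Python and if/else route could look something like the sketch below. The function name clean_title is hypothetical, and it assumes that any title with two or more colons is a series episode, which may not hold for every movie title.

```python
def clean_title(title: str) -> str:
    """Keep 'Series: Season' for episodes; keep other titles unchanged."""
    parts = title.split(":")
    if len(parts) > 2:
        # Two or more colons: assume series, season, episode and keep
        # only the first two parts.
        return ":".join(parts[:2]).strip()
    # No colon or a single colon: keep the full title.
    return title.strip()


episode = "SpongeBob Schwammkopf: Staffel 4: Der allerschönste Tag / Gummilein (Folge 20)"
movie = "Karate Kid IV – Die nächste Generation"

print(clean_title(episode))     # SpongeBob Schwammkopf: Staffel 4
print(clean_title(movie))       # Karate Kid IV – Die nächste Generation
print(clean_title("Critters"))  # Critters
```

If the titles live in a pandas DataFrame (an assumption, since the thread does not say), the function could be applied to the column with something like df["Title"].map(clean_title).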
