I'd like a hint or a solution showing how to approach the following problem.

converted.zip (2.1 MB) This is the problem:

Data Analyst Test
The data provided HERE contains information about a random selection of 5,000 players
who installed one of our apps in the last ~30 days.
The dataset provided contains one file, players.csv.
The columns in the files are defined as follows.
level_progress.csv:

  • event_datetime: The date and time at which the event was received (Local time)
  • player_id: A unique identifier for each player.
  • level_number: The level number the event corresponds to.
  • status: The outcome the event corresponds to. (start, fail, complete)
  • session_id: A unique identifier for the session in which the event was produced.
    *complete means the player was successful.

Task
Q. On which level are players most likely to fail?
You should consider the statistical significance of your answer carefully.
You should include any code used to process the data and any calculations used to compute an answer.

I would do some exploration to understand the entities and their relationships.

My guesses:

  1. Each player can have multiple sessions.
  2. Each session can include multiple levels, and each level can span multiple sessions.
  3. A lower level has to be completed before a higher level can be started.
  4. A level has to be started before it reaches one of the mutually exclusive outcomes, fail or complete.
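
Guesses 1, 2 and 4 can be checked mechanically before doing any statistics. Here is a rough sketch in plain Python. The events are made up in the column layout from the task, and `check_guesses` and the names inside it are just my own helpers. It counts sessions per player, levels per session, and fail/complete events that have no earlier unmatched start for the same player and level.

```python
from collections import defaultdict

# Made-up events: (event_datetime, player_id, level_number, status, session_id)
events = [
    ("2021-01-01 10:00:00", "p1", 1, "start", "s1"),
    ("2021-01-01 10:02:00", "p1", 1, "complete", "s1"),
    ("2021-01-01 10:03:00", "p1", 2, "start", "s1"),
    ("2021-01-02 08:00:00", "p1", 2, "fail", "s2"),
    ("2021-01-02 08:01:00", "p1", 2, "complete", "s2"),  # no start before it
]

def check_guesses(events):
    sessions_per_player = defaultdict(set)
    levels_per_session = defaultdict(set)
    open_attempts = set()  # (player, level) pairs with a start awaiting an outcome
    orphan_outcomes = 0    # fail/complete events without a preceding start

    # ISO-style timestamps sort chronologically as plain strings.
    for _, player, level, status, session in sorted(events):
        sessions_per_player[player].add(session)
        levels_per_session[session].add(level)
        key = (player, level)
        if status == "start":
            open_attempts.add(key)
        elif key in open_attempts:
            open_attempts.discard(key)
        else:
            orphan_outcomes += 1

    return {
        "max_sessions_per_player": max(len(s) for s in sessions_per_player.values()),
        "max_levels_per_session": max(len(l) for l in levels_per_session.values()),
        "orphan_outcomes": orphan_outcomes,
    }

result = check_guesses(events)
print(result)
```

If `orphan_outcomes` comes out well above zero on the real data, guess 4 does not hold as stated, and the definition of an "attempt" needs another look before computing any rates.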

Since the question is about comparing levels, level_number is the main unit of analysis. Drilling in, you can analyze at the player level or the session level ("level" here in the ordinary English sense, not the game level).

For each level_number, find the percentage of distinct players who started the level and failed it. This directly answers the question. You can go one step further and count in how many different sessions (regardless of player) each level was started and failed, which gives more granularity than the percentage of players failing each level. One relevant question is whether the start event gets saved again with a new event_datetime when a player restarts a session.

Whether you analyze at the player level only or extend it to sessions depends on which approach you think is the better proxy for difficulty, assuming you want to use the abstract concept of difficulty as a proxy for “likely to fail”. It could be that one player is just particularly bad at a level and needs 10 attempts where most people need 2, which inflates the apparent difficulty of that level. So you can get philosophical here about the means versus the ends: do you measure the result (fail/complete) or the journey (number of sessions attempted)?
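
To make the two views concrete, here is a minimal sketch using only the standard library. The rows in `SAMPLE` are made up, and `fail_rates` is a hypothetical helper, not anything from the test. I am assuming "failed" at the player level means "failed at least once", and I use fail events divided by fail plus complete events as the attempt-level rate. Both definitions are choices you would need to justify.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the column layout from the task; for the real data,
# pass csv.DictReader(open("level_progress.csv", newline="")) instead.
SAMPLE = """event_datetime,player_id,level_number,status,session_id
2021-01-01 10:00:00,p1,1,start,s1
2021-01-01 10:02:00,p1,1,complete,s1
2021-01-01 10:03:00,p1,2,start,s1
2021-01-01 10:05:00,p1,2,fail,s1
2021-01-01 10:06:00,p1,2,start,s1
2021-01-01 10:08:00,p1,2,complete,s1
2021-01-02 09:00:00,p2,1,start,s2
2021-01-02 09:01:00,p2,1,fail,s2
2021-01-02 09:02:00,p2,1,start,s2
2021-01-02 09:04:00,p2,1,complete,s2
2021-01-02 09:05:00,p2,2,start,s2
2021-01-02 09:07:00,p2,2,fail,s2
"""

def fail_rates(rows):
    started = defaultdict(set)   # level -> players who started it
    failed = defaultdict(set)    # level -> players who failed it at least once
    outcomes = defaultdict(lambda: {"fail": 0, "complete": 0})

    for row in rows:
        level = int(row["level_number"])
        status = row["status"].strip().lower()
        if status == "start":
            started[level].add(row["player_id"])
        elif status in ("fail", "complete"):
            outcomes[level][status] += 1
            if status == "fail":
                failed[level].add(row["player_id"])

    result = {}
    for level in sorted(started):
        n_started = len(started[level])
        n_failed = len(failed[level] & started[level])
        n_outcomes = outcomes[level]["fail"] + outcomes[level]["complete"]
        result[level] = {
            "players_started": n_started,
            "players_failed": n_failed,
            # Player view: share of starters who failed at least once.
            "player_fail_rate": n_failed / n_started,
            # Attempt view: share of finished attempts that ended in a fail.
            "attempt_fail_rate": (
                outcomes[level]["fail"] / n_outcomes if n_outcomes else None
            ),
        }
    return result

rates = fail_rates(csv.DictReader(io.StringIO(SAMPLE)))
for level, stats in rates.items():
    print(level, stats)
```

The two rates can rank levels differently, which is exactly the result-versus-journey choice above. Swapping player_id for session_id in the sets would give the session-level variant.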

On statistical significance, you can do a one-tailed binomial test to see how likely the observed fail percentage of each level is, with the null hypothesis being that every player is equally likely to fail or complete any level. https://www.youtube.com/watch?v=J8jNoF-K8E8&t=676s&ab_channel=StatQuestwithJoshStarmer
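
A sketch of that one-tailed binomial test, stdlib only. The counts are invented, and `binom_sf` is my own helper rather than a library call. I am reading the null hypothesis as "all levels share one fail probability" and estimating it with the pooled fail rate. If you read the null as a 50/50 chance instead, pass `p0 = 0.5`. Because one test is run per level, the sketch also applies a simple Bonferroni correction.

```python
import math

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0 if k <= n else 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)

# Made-up counts per level: (number of fails, number of attempts or players).
counts = {1: (30, 200), 2: (45, 180), 3: (80, 160)}

# Assumed null: one common fail probability, estimated from all levels pooled.
p0 = sum(f for f, _ in counts.values()) / sum(n for _, n in counts.values())

# One-tailed p-value per level: chance of seeing at least this many fails.
p_values = {lvl: binom_sf(f, n, p0) for lvl, (f, n) in counts.items()}

# Bonferroni: multiply each p-value by the number of levels tested.
adjusted = {lvl: min(1.0, p * len(counts)) for lvl, p in p_values.items()}

for lvl in counts:
    print(lvl, round(p_values[lvl], 6), round(adjusted[lvl], 6))
```

Note that the pooled rate includes the level being tested, which makes the test slightly conservative. Excluding it is another reasonable choice.
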
You can also take the two levels with the highest fail percentages and test the difference between them with something like the test described at https://www.stat.berkeley.edu/~stark/SticiGui/Text/percentageTests.htm#:~:text=To%20test%20at%20approximate%20significance,Z%20>%20z1−α.&text=Because%20the%20null%20hypothesis%20specifies,both%20sample%20percentages%20is%20p.
You probably should not stop at the top 2 levels but consider all levels. You may also want to check whether the levels are independent, since independence is an assumption of these statistical tests.
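
For comparing two levels directly, a pooled two-proportion z-test is one option. Here is a minimal sketch with invented counts. `two_proportion_z` is a hypothetical helper, and the normal approximation it relies on assumes reasonably large, independent samples, which may not hold if the same players appear in both levels.

```python
import math
from statistics import NormalDist

def two_proportion_z(fail_a, n_a, fail_b, n_b):
    """One-tailed z-test of H0: p_a == p_b against H1: p_a > p_b."""
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    # Under H0 both levels share one fail rate, so pool the two samples.
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the z statistic is undefined")
    z = (p_a - p_b) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Made-up counts: level A has 80 fails in 160, level B has 45 fails in 180.
z, p_value = two_proportion_z(80, 160, 45, 180)
print(f"z = {z:.2f}, one-tailed p = {p_value:.2g}")
```

To cover all levels rather than just the top two, you could run this for the top level against each other level and correct for the number of comparisons, for example with Bonferroni.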

Hope this helps you get started. Please share how you end up doing it.