I believe the last solution of the I/O Bound Program is wrong

Dear TAs,

I believe there is a mistake in the last exercise within the I/O Bound Program module. The related links are given here at:
Description
Instruction

The Best Fielder is defined by someone with the bigger RF score, which has a formula of RF = (A + PO) / G, where G must be > 100.

So the project wanted us to find the top 20 fielders with SQL, and here is the final solution:

    query = '''
    SELECT 
        (CAST(A AS FLOAT) + PO) / G as RF,  
        playerID
    FROM Fielding
    GROUP BY Fielding.playerID
    HAVING G > 100
    ORDER BY RF desc
    LIMIT 20;
    '''

However, I believe the answer is incorrect. This is because each player can has more than one RF scores. Therefore we need to extract the MAX(RF) score from each player. Otherwise the clause (CAST(A AS FLOAT) + PO)/G will simply return None if that person has more than one entry. So I believe the correct solution should be:

    query = '''
    SELECT 
        MAX((CAST(A AS FLOAT) + PO) / G) as RF,  
        playerID
    FROM Fielding
    GROUP BY Fielding.playerID
    HAVING G > 100
    ORDER BY RF desc
    LIMIT 20;
    '''

The answers for the Best Pitcher and Best Batter are also questionable.
I believe the answers should be:

    # best batter
    query = '''
    SELECT
        MAX((CAST(H AS FLOAT) + BB  + HBP) / (AB + BB + HBP + SF) + 
        (H + "2B" + 2*"3B" + 3*HR) / AB) as OPS, 
        playerID
    FROM Batting
    GROUP BY playerID
    HAVING AB > 100
    ORDER BY OPS desc
    LIMIT 20;
    '''

    # best pitcher
    query = '''
    SELECT 
        MIN((13 * CAST(HR AS FLOAT) + 3*BB - 2*SO) / IPOuts + 3.2) as FIP, 
        playerID
    FROM Pitching
    GROUP BY playerID
    HAVING IPOuts > 100
    ORDER BY FIP asc
    LIMIT 20;
    '''

Correct me if I am wrong.
Thanks

Hi @scoodood,

I agree with your suggestion that using a MAX makes more sense for the exercise. However, the reason why you see None results, it not that there are several rows with the same player id.

Using (CAST(A AS FLOAT) + PO) / G will not always return None if there are several players with the same id. The real reason why it returns None, in some cases, is that the last entry having that player id has either A, PO or G defined to none.

For example if you do this:

conn = sqlite3.connect("lahman2015.sqlite", check_same_thread=False)
cur = conn.cursor()
cur.execute("""
    SELECT playerID, A, PO, G
    FROM Fielding
    WHERE G > 100 AND playerID = 'aaronha01'
""")
res = cur.fetchall()
print(len(res))
for r in res:
    print(r)
    
cur.execute("""
    SELECT playerID, (CAST(A AS FLOAT) + PO) / G as RF
    FROM Fielding
    WHERE G > 100 AND playerID = 'aaronha01'
    GROUP BY playerID
""")
res = cur.fetchall()
print(len(res))
for r in res:
    print(r)

You will see the following result:

34
('aaronha01', 4, 205, 105)
('aaronha01', 5, 223, 116)
('aaronha01', 9, 254, 126)
('aaronha01', 8, 212, 104)
('aaronha01', 17, 316, 152)
('aaronha01', 17, 313, 152)
('aaronha01', 9, 346, 150)
('aaronha01', 12, 305, 153)
('aaronha01', 11, 227, 119)
('aaronha01', 12, 261, 152)
...
('aaronha01', 66, 968, 109)
('aaronha01', 5, 206, 105)
('aaronha01', None, None, 128)  # <---- this is the problem
1
('aaronha01', None)

When you don’t use an aggregate function, GROUP BY will use the last result. As you can see, the last entry for player with id 'aaronha01' has None values so that is why in this case it returns no results. If you add AND A != 'None' to the queries, you will get the result for ('aaronha01', 5, 206, 105) instead (the previous to last entry).

Hi @Francois,
You are absolutely correct that (CAST(A AS FLOAT) + PO) / G will simply return the last result of the GROUP BY, instead of None.

Thanks for confirming my suggested answers.