Splitting the dataset into train/test in Guided Project: Predicting Car Prices

Screen Link: https://app.dataquest.io/m/155/guided-project%3A-predicting-car-prices/3/univariate-model

Your Code:

def knn_train_test(df,features,target):
    np.random.seed(1)
    tmp_df=df.copy()
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(tmp_df.index)
    tmp_df = tmp_df.reindex(shuffled_index)
    #tmp_df=df.loc[np.random.permutation(len(df))]
      
    # Split full dataset into train and test sets
    train_df=tmp_df.iloc[0:int(len(tmp_df)*.5)]
    test_df=tmp_df.iloc[int(len(tmp_df*.5)):]
    
    # Instantiate model
    model=KNeighborsRegressor()
    
    # Fit a KNN model to the training data (using k=5 default)
    model.fit(train_df[features],train_df[target])
    
    # Make predictions using model
    predictions=model.predict(test_df[features])
    
    # Calculate RMSE
    mse=mean_squared_error(test_df[target],predictions)
    rmse=np.sqrt(mse)
    
    return rmse

What I expected to happen:

I want to split the dataset into two parts (50% train, 50% test), and I don’t understand why my code fails to split it. It only works if I change it to match the solution code:

    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]

What actually happened:
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

Other details:
It works with this code (same as the solution):

#train_df=tmp_df.iloc[0:int(len(tmp_df)/2)]
#test_df=tmp_df.iloc[int(len(tmp_df)/2):]

Hi @arredocana,

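The error is due to a small typo on this line:
test_df=tmp_df.iloc[int(len(tmp_df*.5)):]

int(len(tmp_df*.5)) should be changed to int(len(tmp_df)*.5). With the parenthesis misplaced, the DataFrame's values are multiplied by .5 but its row count stays the same, so the slice starts past the last row and test_df comes out empty. That empty test set is what the ValueError reports.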
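
To see the difference between the two expressions in isolation, here is a minimal sketch. The 10-row DataFrame and its column names are made up for illustration and only stand in for the cars data.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row numeric frame standing in for the cars data.
df = pd.DataFrame({
    "horsepower": np.arange(10.0),
    "price": np.arange(10.0) * 1000,
})

# Multiplying a DataFrame scales its values; the number of rows is unchanged.
buggy_start = int(len(df * .5))   # length of the scaled frame
fixed_start = int(len(df) * .5)   # half of the row count

buggy_test = df.iloc[buggy_start:]   # slice starts past the last row
fixed_test = df.iloc[fixed_start:]   # slice starts at the midpoint

print(buggy_start, len(buggy_test))
print(fixed_start, len(fixed_test))
```

The first slice contains no rows, so passing it to model.predict would trigger the "0 sample(s)" error from the original post.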

Best,
Sahil


Hi @Sahil, could we also use the train_test_split function from the scikit-learn library instead of the code used in this first mission:

    # Randomize order of rows in data frame
    np.random.seed(1)
    shuffle_index=np.random.permutation(df.index)
    df=df.reindex(shuffle_index)
       
    # Split full dataset into train and test sets
    train_df = df.iloc[0:int(len(df)*.75)]
    test_df = df.iloc[int(len(df)*.75):]

For example, using train_test_split in the knn_train_test function from this guided project would look like this:

def knn_train_test(df, features, target, k=[5]):
    
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    
    # Standardize all columns (zero mean, unit standard deviation) except the target column
    target_col = df[target]
    df = (df - df.mean()) / (df.std())
    df[target] = target_col
    
    # Split full dataset into random train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(df[features], df[target],\
                                                test_size = 0.25, random_state = 0)
    
    k_values = k
    k_rmses = dict()
    
    for k in k_values:
        
        # Instantiate model
        model = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
             
        # Fit the model to the training data
        model.fit(X_train, y_train)
    
        # Make predictions using model
        predictions = model.predict(X_test)
    
        # Calculate RMSE
        mse = mean_squared_error(y_test, predictions)
        k_rmses[k] = np.sqrt(mse)
    
    return k_rmses

It’s the same thing, right?
Thanks in advance!


Hi @arredocana,

Yes, you could use train_test_split.
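
One caveat worth checking: both approaches shuffle and then cut at 75%, but they use different random generators and seeds, so they will generally put different rows in each set, and the RMSE values need not match the ones from the solution code. The sketch below compares the two on a made-up 20-row DataFrame; the data and column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row frame; column names are placeholders for the cars data.
df = pd.DataFrame({
    "horsepower": np.arange(20.0),
    "price": np.arange(20.0) * 100,
})

# Manual split, as in the mission: shuffle the index, then cut at 75%.
np.random.seed(1)
shuffled = df.reindex(np.random.permutation(df.index))
cut = int(len(shuffled) * .75)
manual_train, manual_test = shuffled.iloc[:cut], shuffled.iloc[cut:]

# scikit-learn split with the same proportions.
X_train, X_test, y_train, y_test = train_test_split(
    df[["horsepower"]], df["price"], test_size=0.25, random_state=0
)

# Same sizes for both approaches.
print(len(manual_train), len(manual_test))
print(len(X_train), len(X_test))

# Compare which rows ended up in each test set.
print(sorted(manual_test.index))
print(sorted(X_test.index))
```

If you need results that line up with the solution notebook, keep the manual split; otherwise train_test_split with a fixed random_state is just as reproducible.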