Splitting the dataset into train/test in the Guided Project: Predicting Car Prices

Screen Link: https://app.dataquest.io/m/155/guided-project%3A-predicting-car-prices/3/univariate-model

Your Code:

def knn_train_test(df,features,target):
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(tmp_df.index)
    tmp_df = tmp_df.reindex(shuffled_index)
    # Split full dataset into train and test sets
    # Instantiate model
    # Fit a KNN model to the training data (using k=5 default)
    # Make predictions using model
    # Calculate RMSE
    return rmse

What I expected to happen:

I want to split the dataset into two parts (50% train, 50% test), and I don’t understand why my code fails to split it. It only works if I change it to match the solution code:

    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]

What actually happened:
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

Other details:
It works with this code (the same solution):


Hi @arredocana,

The error is due to a small typo:

int(len(tmp_df*.5)) should be changed to int(len(tmp_df)*.5)
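
To see why the misplaced parenthesis produces the "0 sample(s)" error, here is a minimal sketch on a made-up four-row frame (the column names and values are placeholders, not the project data). Multiplying the DataFrame by .5 halves the values but keeps the number of rows, so the split index ends up equal to the full length and the test slice is empty:

```python
import pandas as pd

# Hypothetical toy frame standing in for the car dataset.
tmp_df = pd.DataFrame({
    "horsepower": [111, 154, 102, 115],
    "price": [13495, 16500, 13950, 17450],
})

# Typo: the frame is scaled first, and len() of the result is still the row count.
wrong_split = int(len(tmp_df * .5))
# Fix: take the row count first, then halve it.
right_split = int(len(tmp_df) * .5)

wrong_test = tmp_df.iloc[wrong_split:]
right_test = tmp_df.iloc[right_split:]

print(wrong_split, len(wrong_test))  # 4 0
print(right_split, len(right_test))  # 2 2
```

With the typo, the training set gets every row and the test set gets none, which is what scikit-learn complains about when it receives the empty array.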



Hi @Sahil, could we also use the train_test_split function from the scikit-learn library instead of the code used in this first mission?

    # Randomize order of rows in data frame
    # Split full dataset into train and test sets
    train_df = df.iloc[0:int(len(df)*.75)]
    test_df = df.iloc[int(len(df)*.75):]

For example, the knn_train_test function from this guided project, rewritten to use train_test_split, would be:

def knn_train_test(df, features, target, k=(5,)):
    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    # Standardize all columns (z-score: mean 0, std 1) except the target column
    target_col = df[target]
    df = (df - df.mean()) / df.std()
    df[target] = target_col
    # Split full dataset into random train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.25, random_state=0)
    k_rmses = dict()
    # k is an iterable of neighbor counts to try
    for k_value in k:
        # Instantiate model
        model = KNeighborsRegressor(n_neighbors=k_value, algorithm='brute')
        # Fit the model to the training data
        model.fit(X_train, y_train)
        # Make predictions using model
        predictions = model.predict(X_test)
        # Calculate RMSE
        mse = mean_squared_error(y_test, predictions)
        k_rmses[k_value] = np.sqrt(mse)
    return k_rmses

It’s the same thing, right?
Thanks in advance!


Hi @arredocana,

Yes, you could use train_test_split.
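
One caveat: train_test_split shuffles the rows by default, so it replaces both the manual shuffle and the iloc slicing. A rough sketch of the comparison follows. The toy data and column values are made up for illustration, and the seed of 0 is an arbitrary choice:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the values are placeholders.
df = pd.DataFrame({
    "horsepower": np.arange(10) * 10 + 60,
    "price": np.arange(10) * 1000 + 5000,
})

# Manual approach: shuffle the row positions, then slice at the 75% mark.
rng = np.random.RandomState(0)
shuffled = df.iloc[rng.permutation(len(df))]
cut = int(len(shuffled) * .75)
manual_train, manual_test = shuffled.iloc[:cut], shuffled.iloc[cut:]

# scikit-learn approach: shuffling and splitting happen in one call.
X_train, X_test, y_train, y_test = train_test_split(
    df[["horsepower"]], df["price"], test_size=0.25, random_state=0)

print(len(manual_train), len(manual_test))  # 7 3
print(len(X_train), len(X_test))            # 7 3
```

Both approaches give a 75/25 split, but the specific rows that land in each set will generally differ, so the RMSE values may not match the solution notebook exactly.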