Student of data science here. Still learning so I am sorry if this question is too basic for this forum.
As part of an assignment on PCA and random forests, I am running a Random Forest classifier on a test dataset (the scikit-learn wine dataset), with and without PCA.
I am getting a higher accuracy (97%) without PCA. With PCA I am getting only 93%, even when keeping all the variance (13 components, the same as the number of input features).
I was sure that when keeping all the components I should get the same results (since the PCA is not actually discarding anything). Are my results OK, or is there some error in the code / something that I need to fix?
This is my code:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

# Random forest on the raw features (no random_state set on the forest)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred_rf = clf.predict(X_test)
print("Accuracy (no pca):", accuracy_score(y_test, y_pred_rf))

# Standardize, then PCA keeping all 13 components; both are fit on the training split only
pca = PCA(13)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
pca.fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Refit the same classifier object on the PCA-transformed features
clf.fit(X_train_pca, y_train)
y_pred_rf_pca = clf.predict(X_test_pca)
print("Accuracy (with pca)", accuracy_score(y_test, y_pred_rf_pca))
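
To narrow this down, here is a minimal sketch of two checks, assuming the same split and preprocessing as in the code above. The first checks whether PCA with all 13 components really leaves the data unchanged: it compares the transformed columns with the scaled inputs, and separately checks that `inverse_transform` recovers the scaled inputs. The second repeats both fits over several forest seeds, since the forest above has no `random_state` and a single run may not be representative. The helper `accuracies` and the seed range are hypothetical names chosen for illustration, and no particular accuracy values are claimed here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data and split as in the question.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# Same preprocessing as in the question, fit on the training split only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=13).fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Check 1: is full-rank PCA a no-op?
# lossless: do we get the scaled inputs back from the components?
# same_columns: are the transformed columns identical to the scaled inputs?
lossless = np.allclose(pca.inverse_transform(X_train_pca), X_train_scaled)
same_columns = np.allclose(X_train_pca, X_train_scaled)
print("Reconstruction recovers scaled data:", lossless)
print("Transformed columns equal scaled columns:", same_columns)


def accuracies(X_tr, X_te, seeds):
    """Hypothetical helper: test accuracy of a fresh forest for each seed."""
    scores = []
    for seed in seeds:
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X_tr, y_train)
        scores.append(accuracy_score(y_test, clf.predict(X_te)))
    return np.array(scores)


# Check 2: how much does accuracy move from seed to seed?
seeds = range(5)  # arbitrary choice for illustration
acc_raw = accuracies(X_train, X_test, seeds)
acc_pca = accuracies(X_train_pca, X_test_pca, seeds)
print("No PCA   mean/min/max:", acc_raw.mean(), acc_raw.min(), acc_raw.max())
print("With PCA mean/min/max:", acc_pca.mean(), acc_pca.min(), acc_pca.max())
```

If the first check reports that the reconstruction matches but the columns differ, then PCA with all components keeps all the information while still handing the forest a rotated set of features, so identical accuracy is not guaranteed. The second check shows whether the gap between the two setups is larger than the spread across seeds.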