I am a data science student and still learning, so apologies if this question is too basic for this forum.
I am running Random Forest on a test dataset, with and without PCA.
We were given a task about PCA and random forest.
I am getting a higher accuracy (97%) without PCA. With PCA I am getting only 93%, even when keeping all the variance (13 components, the same as the number of input features).
I was sure that when keeping all the components I should get the same results (since the PCA is not actually discarding anything). Are my results OK, or is there some error in the code / something that I need to fix?
This is my code:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
# Load the wine dataset (178 samples, 13 features, 3 classes)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# Baseline: random forest on the raw features
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred_rf = clf.predict(X_test)
print("Accuracy (no pca):", accuracy_score(y_test, y_pred_rf))

# Standardize, fitting the scaler on the training split only
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PCA keeping all 13 components, fitted on the training split only
pca = PCA(13)
pca.fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Refit the same forest on the PCA-transformed features
clf.fit(X_train_pca, y_train)
y_pred_rf_pca = clf.predict(X_test_pca)
print("Accuracy (with pca):", accuracy_score(y_test, y_pred_rf_pca))
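
To double-check my assumption that keeping all 13 components loses no information, a sanity check along these lines could be run. This is only a rough sketch and not part of the assignment: it rebuilds the same split and scaling as above, and the names `X_test_back` and `max_error` are just illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data and split as in the question
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# Same preprocessing: scaler and PCA fitted on the training split only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
pca = PCA(n_components=13).fit(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Map the components back to the scaled feature space and compare
X_test_back = pca.inverse_transform(X_test_pca)
max_error = np.max(np.abs(X_test_back - X_test_scaled))
print("Max reconstruction error:", max_error)
print("Total explained variance ratio:", pca.explained_variance_ratio_.sum())
```

If the reconstruction error is at floating-point level, the PCA output carries the same information as the scaled input, only expressed on different axes. Whether a random forest should score the same on those different axes is the part I am unsure about.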