Random_state after kernel restart

Does anyone know why, when using the same random_state in sklearn's RandomForestClassifier, restarting the kernel produces different results? This happens no matter whether I use a global random state via np.random.seed(1234), an integer, or a RandomState instance (used in the code below) for the random_state parameter.

To test, paste the code below, note the accuracy_score, then restart the kernel, run all cells again, and watch the score change.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
from IPython.core.display import display, HTML
display(HTML('<style>.container {width:90% !important;}</style>'))

train_samples = 100  # Samples used for training the models
X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=2)
X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]

rfc = RandomForestClassifier(n_estimators=100, random_state=np.random.RandomState(1234))

rfc.fit(X_train, y_train)
pred = rfc.predict(X_test)
accuracy_score(y_test, pred)

Hey, Han. You’ll hit yourself in the head after you understand what’s going on, so please wear a helmet before reading on :slight_smile:

This is happening simply because of variation in the dataset: make_classification draws a new random dataset on every run unless you seed it. Try using the random_state parameter of the make_classification function.

I made slight edits to your code and created a script called han.py.

Here are the contents of the aforementioned script:
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

train_samples = 100  # Samples used for training the models
X, y = datasets.make_classification(
    n_samples=100000, n_features=20,
    n_informative=2, n_redundant=2,
    random_state=1337
)

X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]

rfc = RandomForestClassifier(
    n_estimators=100,
    random_state=np.random.RandomState(1234)
)

rfc.fit(X_train, y_train)
pred = rfc.predict(X_test)
print(accuracy_score(y_test, pred))

Here are the results I get with it:

$ for _ in {1..10}; do python han.py; done
0.8772472472472472
0.8772472472472472
0.8772472472472472
0.8772472472472472
0.8772472472472472
0.8772472472472472
0.8772472472472472
0.8772472472472472
0.8772472472472472
0.8772472472472472

Oh no, that was a really bad example. I am still trying to solve my kernel restart --> changing confusion matrix values problem. Now I suspect the psycopg2.connect database read is returning rows in a different order each time, causing GridSearchCV to see different train/test sets on each run. I'm trying to ORDER BY the results from tables that don't have a schema (so no primary/unique keys), and trying to avoid ORDER BY table.*

Also, do you know how a kernel restart affects Python's unordered set()/dict() operations? E.g., if I build a set and pull out an element at a certain position, can I get different members within the same kernel run and between kernel restarts?

Since Python 3.7, “dictionaries preserve insertion order” (source).
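A quick sketch of what that guarantee means in practice:

d = {}
d["b"] = 1
d["a"] = 2
d["c"] = 3
print(list(d))  # always ['b', 'a', 'c'] on 3.7+: insertion order, not sorted order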

As for sets, my incomplete and possibly faulty understanding is that you cannot guarantee order between different runs because sets use the hash function and the behavior of this function depends on stuff that is hard to control for reproducibility (operating system, hardware, /dev/random and other stuff).

You can have some control over this stuff with the environment variable PYTHONHASHSEED, but I think this only exists on CPython, so it also depends on the implementation :laughing:
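A minimal way to see both effects (a sketch; hash_demo.py is just a hypothetical file name):

# hash_demo.py -- run a few times with and without PYTHONHASHSEED set:
#   PYTHONHASHSEED=0 python hash_demo.py  -> same output every run
#   python hash_demo.py                   -> output may differ between runs
s = {"apple", "banana", "cherry"}
print(hash("apple"))  # str hashes are randomized per process by default
print(list(s))        # so iteration order of a set of strings can vary too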

Regarding the order within the same run, I think it depends on what you do. Running
for x in a_set: print(x) twice should give you the same order, since there are no new invocations of hash in between, but the moment you modify a_set, the order can change (and most likely will).

If you need order, you’re better off using other data structures.
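For example, if you just need a deterministic pick from a set (a sketch, with made-up values):

a_set = {"x", "y", "z"}
arbitrary = next(iter(a_set))  # may differ between runs (hash randomization)
stable = min(a_set)            # deterministic choice of one element
in_order = sorted(a_set)       # deterministic order for iteration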

I just got deeper into the rabbit hole today. I learnt that Postgres does not accept the ORDER BY table.* syntax, so I manually identified the full set of columns for each table to prevent ties after ORDER BY. So I'm reasonably confident my input data is no longer changing in value or row order.
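The read pattern looks roughly like this (a sketch; the table, column names, and connection details are placeholders):

import pandas as pd
import psycopg2

conn = psycopg2.connect(dbname="mydb")  # hypothetical connection details

# Ordering by every column breaks ties, so the row order is deterministic
# even without primary/unique keys.
df = pd.read_sql("SELECT * FROM my_table ORDER BY col_a, col_b, col_c", conn)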

I am trying to fix my three input parameters to test_pipe(test_df_to_process, historical_data_for_feature_eng, train_dict). The Jupyter magics %store and %store -r seem to work well for reproducing my first two parameters across kernel restarts, so I'm trying to fix train_dict now.
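For reference, the %store pattern I'm using (a sketch with my parameter names):

# In the current session, persist the inputs:
%store test_df_to_process
%store historical_data_for_feature_eng

# After a kernel restart, restore them:
%store -r test_df_to_process
%store -r historical_data_for_feature_eng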

Now I have problems with pickle.dump --> pickle.load on train_dict. It's a dictionary whose values are class instances holding the preprocessing transformers fitted on the training data and the classifier fitted on the training data, to be applied later to preprocess the test set and predict. Inserting the dump-load step breaks test_pipe() when it runs through one of the transformers, but everything works fine without the dump-load.

I want to persist the train_dict of classifiers/transformers so I can restart the kernel and use them without training the model again, to see whether the model training code is the issue.
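The round-trip I'm inserting is essentially this (a sketch; the file name is arbitrary and train_dict is the dictionary described above):

import pickle

# Persist the dict of fitted transformers/classifiers:
with open("train_dict.pkl", "wb") as f:
    pickle.dump(train_dict, f)

# ...restart the kernel...

with open("train_dict.pkl", "rb") as f:
    train_dict = pickle.load(f)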

Do you have experience with the limitations of pickle on dictionaries, and what can I safely depend on pickle for? Trying dill gives the same error when testing, so train_dict has definitely changed after the dump+load.

I did a small experiment with pickle dump/load on a random forest model alone (not the dict I have now), and it predicts with no issues.

Nope, sorry. I’ve never used pickle much.

Thanks for the step forward!
This actually fixed the results.
I opened/closed three Anaconda prompts to test: the 1st and 3rd (same result as the 1st session) with PYTHONHASHSEED=0, and the 2nd session without setting it, where the results changed.

Can I conclude from this that there are some set(), dict(), or other hashing operations in my preprocessing that should be fixed?

I suspect so, but this comes from the point of view of someone who has only a vague understanding of what you're doing.

Now it’s completely resolved.
It’s a chain of:

  1. No ORDER BY when reading from SQL
  2. sort_values --> drop_duplicates keeping different row due to 1.
  3. next(iter(set())) giving different value during feature engineering due to set()
  4. set deduplication of columns creating dataframe columns with different order (which sklearn is blind to) even though total number of columns is the same (so it silently failed)
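For step 4, an order-preserving deduplication looks roughly like this (a sketch; variable names are hypothetical, and step 3 is covered by the min()/sorted() idea above):

# instead of list(set(all_columns)), which scrambles the order:
ordered_unique_cols = list(dict.fromkeys(all_columns))  # dict keeps insertion order on 3.7+
df = df[ordered_unique_cols]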