Comparing and validating the results of a large model


I’ve trained a neural network that takes a long time to train, so it’s not possible for me to run it multiple times. For context, it’s a MULTI-CLASS CLASSIFICATION problem whose class labels are CATEGORICAL, and I use the F1 SCORE to evaluate the model.

I want to compare my results with previous methods, so I need to know how confident I can be in my improvement. I need to validate my results.

By far I know these:

1- I can use a TEST OF SIGNIFICANCE to assess whether my improvement over previous methods is significant. (PROBLEM: many of these tests require training the model multiple times, which is not possible in my case.)
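One test I've read about that seems to avoid retraining is McNemar's test: it compares two classifiers on the same fixed test set using only their per-example predictions. (I understand it compares correct/incorrect agreement rather than F1 directly, so I'm not sure it fully fits my case.) A minimal sketch of my understanding — the function name and inputs are my own, not from any particular library:

```python
from math import comb

def mcnemar_p(correct_a, correct_b):
    """Exact two-sided McNemar test from per-example correctness flags.

    correct_a / correct_b: iterables of booleans saying whether model A
    (resp. model B) classified each test example correctly. Only the
    discordant pairs (exactly one model right) carry information; under
    H0 each discordant example is equally likely to favour either model.
    """
    pairs = list(zip(correct_a, correct_b))
    b = sum(1 for a, c in pairs if a and not c)   # A right, B wrong
    c = sum(1 for a, c in pairs if not a and c)   # A wrong, B right
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    # Exact binomial tail: P(X <= min(b, c)) with X ~ Binomial(n, 0.5),
    # doubled for a two-sided test and capped at 1.
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```

If I have the previous method's predictions on my test set, this would only need each model trained once, which is why it looked attractive to me.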

2- I could calculate a CONFIDENCE INTERVAL for my measure of interest. (PROBLEM: as before, from what I understand I need to average something here too, which requires training the model multiple times. Am I wrong?)
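One idea I've come across that seems to avoid retraining is bootstrapping the TEST SET: resample the test examples with replacement, recompute F1 on each resample, and take a percentile interval. A rough sketch of what I mean — the macro-F1 implementation and helper names are my own, and whether this is statistically sound for my setting is part of my question:

```python
import numpy as np

def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return float(np.mean(f1s))

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for macro F1 over a fixed test set.

    Resamples (true, predicted) pairs with replacement n_boot times and
    recomputes F1 each time; needs only ONE trained model's predictions.
    """
    rng = np.random.default_rng(seed)
    labels = np.unique(y_true)
    n = len(y_true)
    scores = [
        macro_f1(y_true[idx], y_pred[idx], labels)
        for idx in (rng.integers(0, n, n) for _ in range(n_boot))
    ]
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

My worry is that with a small test set (see my P.S. below) the resulting interval may be very wide or unreliable, so I'd appreciate comments on whether this is a valid substitute for retraining-based intervals.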

MAIN QUESTION: Is it right to use this formula to calculate a CI for my model’s F1? Is there a test of significance that is applicable in my case, or any other approach?
Do you have an implementation of it in R/Python?

I’m probably getting something wrong, so ANY THOUGHTS, COMMENTS OR GUIDANCE are welcome.

P.S.: My test set is small, and I also have to use all of my training set at once (meaning I can’t use something like k-fold cross-validation).