Skewed data question

Hi guys, I'm currently working on a project and I found that my target column is right skewed, which is not good
[image: histogram of the right-skewed target column]
So I used np.log() to try to transform it into a normal/Gaussian distribution, but it actually ended up as a left-skewed distribution

And now I don't know what to do. I've been reading about it, but I haven't found another way to transform the data, and I don't think I should work with this last distribution because, to me, it's like working with the original right-skewed data. So what should I do here?
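For context, this is roughly what I did (a minimal sketch with made-up data; df and 'target' are placeholders for my actual DataFrame and column):

import numpy as np
import pandas as pd

# Made-up right-skewed data standing in for my real target column
rng = np.random.default_rng(0)
df = pd.DataFrame({'target': rng.lognormal(mean=3, sigma=1, size=1000)})

log_target = np.log(df['target'])  # what I applied (assumes all values are > 0)

print(df['target'].skew())  # clearly positive -> right skew
print(log_target.skew())    # on my real data this came out negative -> left skew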

Thanks for any help

3 Likes

hey @alegiraldo666

This might be a lame question (but I will still ask!). Is the above histogram after removing the major outliers, or does it include them?

We can't expect perfect normality all the time, and since the sample size is this big, the log-transformed target column should still help you.

Apologies if you have already gone through these posts:

I am not much help though, but thanks for posting this question :slight_smile:

3 Likes

Hi @Rucha, thanks for your answer, the posts were helpful.

No, I did not remove any outliers, because I thought that since I have datetime data I shouldn't remove them.

3 Likes

Hi @alegiraldo666

I guess I have yet to reach this project, or one similar to it, so I just scanned these posts on the spot.

What have you decided about how to proceed with your project? And have you found any other helpful posts like these?

And when are you uploading your project so that I can copy and cheat, I mean learn :stuck_out_tongue_winking_eye: Ignore this part - take your time and effort on your project. I am stuck with a logistic regression one right now :stuck_out_tongue:

2 Likes

It's the bike rental project; I just realized that I was using a similar dataset.

In the end I decided to go with the result I got, and part of the research I did said that was the best thing to do. I actually found a post somewhere that said which transformation to apply depending on which distribution you have (I didn't save the link, but I'll look through my browser history for it), and after running the model I got interesting results.
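It was something along these lines (a rough sketch from memory with invented data, not the exact code from that post):

import numpy as np
import pandas as pd
from scipy import stats

# Invented right-skewed data standing in for the rental counts
rng = np.random.default_rng(42)
y = pd.Series(rng.gamma(shape=2.0, scale=50.0, size=1000))

# Compare how much skew each common transformation removes
transforms = {
    'original': y,
    'sqrt': np.sqrt(y),
    'log1p': np.log1p(y),                      # log(1 + y), safe when y has zeros
    'box-cox': pd.Series(stats.boxcox(y)[0]),  # requires strictly positive values
}

for name, values in transforms.items():
    print(f"{name:>8}: skew = {values.skew():.2f}")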

:rofl: I need to organize things and write a lot of explanations and analysis before publishing

I'm a huge fan of it, send me a DM if you need help! I just "finished" a machine learning bootcamp; I just need to do the final project with my study group (although I'm the only one who finished the syllabus before the beginning of the project phase). We chose an "easy" project, so I think it's not going to take a lot of time.

2 Likes

Hi @alegiraldo666

I do have a question.

The accuracy for the train dataset came out to be 77%, and for the test dataset it dropped to 68%.
My conclusion is that this model is not great at predicting.

I have also identified the features that add no value to the model, based on their coefficients and odds ratios.

Would it make any sense to remove these features and then try again? My thought is not really, because their odds ratios are close to 1 and their coefficients close to 0.
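For reference, this is roughly how I checked them (a minimal sketch with made-up data, not my actual notebook):

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up data standing in for my features and binary target
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Odds ratio = exp(coefficient); an odds ratio close to 1 (coefficient close to 0)
# means the feature barely changes the predicted odds
summary = pd.DataFrame({
    'coefficient': model.coef_[0],
    'odds_ratio': np.exp(model.coef_[0]),
}, index=feature_names)
print(summary)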

In case you want to check out the entire project, let me know and I will create a personal topic for it.

Edit: One more question. Have you come across anything related to modelling with duplicate records? Not subset-wise duplication; the entire row is repeated multiple times.

Thanks

1 Like

Hi @Rucha

YES! Not all features carry information; some may be noise or simply not have any pattern. Check the data types too: sometimes we make the mistake of leaving categorical data encoded as numbers, and then the model gives more weight to a category just because it has a bigger number than another, when in reality there isn't any difference.

Try running it again, and if your accuracy is still low, then upload it so I can check it out.

Not yet, but I would remove the repeated rows and leave just one of each.
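Something like this is what I mean (a quick sketch with an invented DataFrame, just to illustrate both points):

import pandas as pd

# Invented example: 'city' is categorical and the last two rows fully duplicate the first
df = pd.DataFrame({
    'city': ['Bogota', 'Medellin', 'Bogota', 'Bogota'],
    'temperature': [18, 24, 18, 18],
    'rentals': [120, 95, 120, 120],
})

# Drop rows where the entire record is repeated, keeping the first occurrence
df = df.drop_duplicates(keep='first')

# One-hot encode the categorical column instead of leaving it as arbitrary numbers,
# so the model doesn't treat one category as "bigger" than another
df = pd.get_dummies(df, columns=['city'])
print(df)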

1 Like

Hi @alegiraldo666

I'm not sure I fully understand your answer, but it looks like you are worried about dealing with imbalanced classes (a Pareto-like distribution, where a few classes have most of the occurrences).

There is a scikit-learn parameter for dealing with it: class_weight='balanced'

Example from Drivendata.co:

from sklearn.ensemble import RandomForestClassifier

# instantiate our RF classifier
rf = RandomForestClassifier(
    n_jobs=4,
    n_estimators=150,
    class_weight='balanced',  # balance classes
    max_depth=3,              # shallow tree depth to prevent overfitting
    random_state=0,           # set a seed for reproducibility
)

PS: Well, re-reading your first post, you said "target column", so sorry if I am off topic!

1 Like