
Machine Learning Personal Project - How to Split a Dataset Properly


I am working on a personal machine learning project and am wondering how to proceed.

My project involves scraping job listings from indeed.com and then attempting to predict the salary based on the text in the job description as well as the job location.

To summarize, I used a combination of 10 cities and 5 different search terms for job title (“data scientist”, “data engineer”, etc.) resulting in 50 different searches. I then concatenated them all into one dataframe. Among the columns in the dataframe are a column for city and one for the search term for job title.

My instinct is to perform K-fold cross-validation, but to make sure that each city/job title combination is proportionally represented in each fold. I have researched the StratifiedKFold class from scikit-learn, but it looks like that is for classification problems (it splits the dataset proportionally to the target classes). My project is different: it is a regression problem (predicting salary) where each of the samples belongs to a category combination (one of 10 cities and one of 5 job title search terms).

Am I correct in thinking that I need to proportionally represent each of the sample categories in each fold?
What is the best scikit-learn tool to achieve this?

Thank you for your time!

Rather than say sklearn’s stratification is for classification problems, we can think in terms of dependent vs. independent variables.
There is not much I can find on stratification of explanatory variables either. Here’s one example: Chapter 4 Stratification and summary | Stats for Data Science (Ctrl+F “The explanatory variables are used to define the groups to be used in the stratification”).

I don’t know of any sklearn tool to do this. Maybe if you dig into the source code enough, you can find out how it stratifies and tweak it to work with multiple columns. A lesson I went through on DataCamp also tried to stratify X values with custom code, so this is indeed a thing.
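One workaround sketch (my own assumption, not a built-in sklearn feature): build a single combined label out of the columns you want balanced and hand it to StratifiedKFold as if it were a class label. The labels passed to split() are only used for grouping, so they don’t have to be the regression target. Column names and data here are made up:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# toy frame standing in for the scraped listings
df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'SF', 'SF'] * 25,
    'searched_title': ['data scientist', 'data engineer'] * 50,
    'salary': range(100),
})

# one synthetic "class" per city/title combination
strata = df['city'] + '|' + df['searched_title']

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(df, strata):
    # each fold now holds the combinations in roughly equal proportions
    train, test = df.iloc[train_idx], df.iloc[test_idx]
```

The regression target (salary) is untouched; only the fold assignment is driven by the combined key.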

Also, the What-If Tool (Model Understanding with the What-If Tool Dashboard | TensorBoard) allows you to slice, but that is slicing after modelling for performance evaluation rather than slicing before to manage the train-test split. I’m guessing there are similarities between them somewhere.


Thank you @hanqi. I will research this.

Would it be incorrect to just leave the entire dataset “unstratified” (performing CV by randomly selecting all the records, so the records for a specific city or job title might not be proportionally represented)?

Thanks for your time.

I’m not sure about this. I assume the benefits of stratifying on y would apply to the X columns too: making the training set look like the test set as much as possible, and hoping both of them look as much as possible like future incoming data too.

The way I see it, if we try to stratify on too many independent variables, there must come a point where it becomes impossible to satisfy the stratification constraints on all columns at once, so this is something where we try our best to include as many columns as possible.
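A quick way to see where that point is (a sketch with assumed column names): count the rows in each column combination and compare against n_splits, since StratifiedKFold needs every “class” to have at least n_splits members.

```python
import pandas as pd

n_splits = 5

# toy frame: one dominant combination and two rare ones
df = pd.DataFrame({
    'city':  ['NYC'] * 8 + ['SF'] * 2,
    'title': ['data scientist'] * 9 + ['ml engineer'],
})

# size of each combination we'd be stratifying on
combo_counts = df.groupby(['city', 'title']).size()
too_small = combo_counts[combo_counts < n_splits]
# any combination listed in `too_small` makes 5-fold stratification impossible
```

Adding more stratification columns multiplies the number of combinations, so the rare ones show up quickly.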


Thanks again for the help @hanqi. I really appreciate it. I’ve got another question, if you don’t mind. This is the first time I am doing a fully independent project, so this is pushing the limits of my experience with scikit-learn.

How can I find out what the most important (most correlated) features of my model are?

I have 1190 rows total. My feature matrix X consists of three columns:

  1. location: categorical data, one of 10 cities.
  2. job title: categorical data, one of 5 job title search terms (‘data analyst’, ‘ml engineer’, etc.)
  3. job description: the entire text of the job description. I did some string cleaning beforehand.

The response vector y is the annual salary (numeric).

My workflow is to perform the following pre-processing steps on the feature matrix:

Use OneHotEncoder() on the location and job title columns
Use CountVectorizer() on the description column with the ngram_range parameter set to (1, 2).

Since CountVectorizer() results in a document-term matrix of over 230,000 terms, I want to do some dimensionality reduction. I first explored Principal Component Analysis (sklearn.decomposition.PCA) but then found that sklearn.decomposition.TruncatedSVD is meant to work on sparse matrices, so I decided to use that instead: 2.5. Decomposing signals in components (matrix factorization problems) — scikit-learn 0.24.2 documentation

Is this a correct practice, to do dimensionality reduction after using CountVectorizer() when you have such a large document-term matrix? I experimented with the n_components parameter of TruncatedSVD() and found that the higher the number, the lower the error my model produced (not by much though). Is this normal? I am assuming this is a trade-off between error and computational efficiency? How do you choose how many components to set it to? I tried it with 100, but that seems like an arbitrary choice.
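For what it’s worth, here is roughly how I’ve been probing n_components, using the cumulative explained_variance_ratio_ of a fitted TruncatedSVD (the 90% threshold and the random matrix standing in for my document-term matrix are arbitrary choices on my part):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# random sparse matrix standing in for the document-term matrix
X = sparse_random(200, 1000, density=0.01, format='csr', random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0).fit(X)
cum = np.cumsum(svd.explained_variance_ratio_)

# smallest number of components covering 90% of the variance,
# falling back to all fitted components if the threshold isn't reached
idx = int(np.searchsorted(cum, 0.90))
n_keep = idx + 1 if idx < len(cum) else len(cum)
```

Plotting `cum` against the component index also makes the diminishing returns visible as an elbow.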

The last step is to feed it to the model: I used LinearRegression(). For the sake of illustration, I also want to try it with Ridge() and Lasso().

After researching online, I’ve found a few ways to implement this with Pipeline() and ColumnTransformer(). I also want to use GridSearchCV() so that I can find the optimal hyperparameters.

I was able to get it to work but I’d like to be able to show which features were the most important or the most correlated to the salary: Which words in the job description, etc.?

I’ve found the following articles but I couldn’t get them to work. I think it has something to do with my combinations of Pipeline(), ColumnTransformer(), and GridSearchCV() and how I implemented them.

Here are a couple of the ways that I have tried writing the code:

# First way:
lr = LinearRegression()
ohe = OneHotEncoder()
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
ct = make_column_transformer((ohe, ['location', 'searched_title']), (vect, 'job_description'))
svd = TruncatedSVD(n_components=100)
pipe = make_pipeline(ct, svd, lr)

# Second way:
lr = LinearRegression()
ohe = OneHotEncoder()
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
ct = make_column_transformer((ohe, ['location', 'searched_title']), (vect, 'job_description'))
svd = TruncatedSVD(n_components=100)
pipe2 = Pipeline([
    ('ohe', OneHotEncoder(), ['location', 'searched_title']),
    ('cv', CountVectorizer(stop_words='english', ngram_range=(1, 2)), 'job_description'),
    ('columntransformer', ct),
    ('svd', svd),
    ('estimator', lr)
])

param_grid = {
    'svd__n_components': [5, 10, 100],
    'svd__n_iter': [5, 10]
}

grid = GridSearchCV(pipe2, param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X, y)

I am wondering what is the best way to implement combinations of Pipeline(), ColumnTransformer(), and GridSearchCV() in this case? It seems like there are several ways you could do it.

How can I extract the most impactful features?

Thank you so much for your time. If you’d like I can upload the dataset and a notebook so you can visualize what is going on.

Yes, it’s commonly done. I’m not sure of the exact math, but my impression is that doing TruncatedSVD after CountVectorizer is exactly a method called Latent Semantic Analysis (LSA).
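If it helps, the CountVectorizer → TruncatedSVD chain can be written as a tiny pipeline; to my understanding this is the usual LSA recipe (toy documents here, and in practice TfidfVectorizer is often swapped in for CountVectorizer):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    'data scientist python sql',
    'ml engineer python spark',
    'data analyst sql excel',
    'data engineer spark airflow',
]

# term counts -> low-rank "concept" space: the LSA decomposition
lsa = make_pipeline(CountVectorizer(), TruncatedSVD(n_components=2, random_state=0))
Z = lsa.fit_transform(docs)   # each document becomes 2 concept weights
```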

Yes, if my understanding is right, truncating components off always worsens metrics on the training set, but may improve test set metrics due to the regularization effect. An exception to that “always worsens training metrics” statement is when a decision tree is used downstream: the axis rotation may help trees split better, since trees cannot draw diagonal lines across raw data the way linear classifiers (SVM, logistic regression) can.

What does this mean?

You can dig into gridsearch.best_estimator_, which gives you the best pipeline object, then dig into each component of the ColumnTransformer with final_pipe['preprocessing'].transformers_, and then you get the normal transformer and estimator objects with attributes like coef_ or feature_importances_ that you can access as usual.

The 2nd way looks completely wrong because you’re doing ohe and cv twice: once by themselves and once in the ColumnTransformer.

There are a lot of examples from sklearn. Open any docs page that uses a model with coef_ or feature_importances_ and click through the gallery of short samples at the bottom to copy the formatting and plots there.
