Thanks again for the help @hanqi. I really appreciate it. I’ve got another question if you don’t mind. This is the first time I’m doing a fully independent project, so it is pushing the limits of my experience with scikit-learn.

How can I find out what the most important (most correlated) features of my model are?

I have 1190 rows total. My feature matrix `X` consists of three columns:

- location: categorical data, one of 10 cities.
- job title: categorical data, one of 5 job title search terms (‘data analyst’, ‘ml engineer’, etc.).
- job description: the entire text of the job description. I did some string cleaning beforehand.

The response vector `y` is the annual salary (numeric).

My workflow is to perform the following pre-processing steps on the feature matrix:

1. Use `OneHotEncoder()` on the location and job title columns.
2. Use `CountVectorizer()` on the description column with the `ngram_range` parameter set to `(1, 2)` (unigrams and bigrams).

Since the `CountVectorizer()` results in a document-term matrix of over 230,000 terms, I want to do some dimensionality reduction. I first explored using Principal Component Analysis (`sklearn.decomposition.PCA`) but then found that `sklearn.decomposition.TruncatedSVD` is meant to work on sparse matrices, so I decided to use that instead: 2.5. Decomposing signals in components (matrix factorization problems) — scikit-learn 0.24.2 documentation
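A tiny check I did to convince myself (a random sparse matrix standing in for the real document-term matrix): `TruncatedSVD` consumes a scipy sparse matrix directly, with no densifying needed.

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# 100 "documents" x 500 "terms", 99% zeros, stored as a CSR sparse matrix.
X = sp.random(100, 500, density=0.01, format='csr', random_state=0)

# fit_transform accepts the sparse matrix as-is and returns a dense
# (n_samples, n_components) array.
reduced = TruncatedSVD(n_components=10, random_state=0).fit_transform(X)
print(reduced.shape)  # (100, 10)
```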

Is it correct practice to do dimensionality reduction after using `CountVectorizer()` when you have such a large document-term matrix? I experimented with the `n_components` parameter of `TruncatedSVD()` and found that the higher the number, the lower the error my model produced (not by much, though). Is this normal? I am assuming this is a trade-off between error and computational efficiency? How do you choose how many components to set it to? I tried 100, but it seems like an arbitrary choice.
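To make the question concrete, here is the kind of rule I was considering instead of picking 100 by hand: keep the smallest number of components that explains some share (say 80%) of the variance. The corpus below is made up; it just stands in for my job descriptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the 1190 job descriptions.
docs = [
    "data analyst with sql and excel experience",
    "ml engineer building deep learning pipelines",
    "data engineer maintaining etl pipelines in sql",
    "analyst reporting dashboards in excel",
] * 10

dtm = CountVectorizer(stop_words='english', ngram_range=(1, 2)).fit_transform(docs)

# Fit with a generous upper bound, then inspect the cumulative variance curve.
svd = TruncatedSVD(n_components=min(50, dtm.shape[1] - 1), random_state=0)
svd.fit(dtm)
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 80%.
n_keep = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_keep)
```

Is a variance-based cutoff like this a reasonable way to choose `n_components`, or is there a better rule of thumb?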

The last step is to feed it to the model: I used `LinearRegression()`. For the sake of illustration, I also want to try it with `Ridge()` and `Lasso()`.
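Part of why I want to try `Lasso()` in particular (as I understand it, so correct me if I’m wrong): the L1 penalty drives uninformative coefficients to exactly zero, so it doubles as a rough feature filter. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 actually drive the target; the rest are noise.
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # indices of features with nonzero weight
print(kept)
```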

After researching online, I’ve found a few ways to implement this with `Pipeline()` and `ColumnTransformer()`. I also want to use `GridSearchCV()` so that I can find the optimal hyperparameters.

I was able to get it to work, but I’d like to be able to show which features were the most important or the most correlated with the salary: which words in the job description, and so on.

I’ve found the following articles, but I couldn’t get them to work. I think it has something to do with my combination of `Pipeline()`, `ColumnTransformer()`, and `GridSearchCV()` and how I implemented them.

Here are a couple of the ways I have tried writing the code:

```
lr = LinearRegression()
ohe = OneHotEncoder()
vect = CountVectorizer(stop_words='english', ngram_range=(1,2))
ct = make_column_transformer((ohe, ['location', 'searched_title']), (vect, 'job_description'))
svd = TruncatedSVD(n_components=100)
pipe = make_pipeline(ct, svd, lr)
```

```
lr = LinearRegression()
ohe = OneHotEncoder()
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
# The per-column work belongs inside the ColumnTransformer; Pipeline steps
# are plain (name, transformer) pairs with no column selectors.
ct = make_column_transformer((ohe, ['location', 'searched_title']), (vect, 'job_description'))
svd = TruncatedSVD(n_components=100)
pipe2 = Pipeline(
    [
        ('columntransformer', ct),
        ('svd', svd),
        ('estimator', lr),
    ]
)
```

```
# Grid keys use the Pipeline step name ('svd') plus a double underscore.
param_grid = {
    'svd__n_components': [5, 10, 100],
    'svd__n_iter': [5, 10],
}
grid = GridSearchCV(pipe2, param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X, y)
```
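Ideally I’d want something like the following sketch: one `Pipeline`, one grid, where the step names (`'svd'`, `'estimator'`) become the double-underscore prefixes, so that `TruncatedSVD`’s size and (for example) `Ridge()`’s `alpha` get searched together. The toy frame below stands in for my real data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented stand-in for the real job-postings frame.
X = pd.DataFrame({
    'location': ['NYC', 'Austin', 'NYC', 'Remote', 'Austin', 'Remote'] * 5,
    'searched_title': ['data analyst', 'ml engineer'] * 15,
    'job_description': ['sql excel', 'python pytorch', 'dashboards sql',
                        'tensorflow models', 'reporting excel', 'python ml'] * 5,
})
y = [70, 120, 75, 130, 68, 115] * 5

ct = make_column_transformer(
    (OneHotEncoder(), ['location', 'searched_title']),
    (CountVectorizer(), 'job_description'),
)
pipe = Pipeline([
    ('columntransformer', ct),
    ('svd', TruncatedSVD()),
    ('estimator', Ridge()),
])
param_grid = {
    'svd__n_components': [2, 4],
    'estimator__alpha': [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X, y)
print(grid.best_params_)
```

Is structuring it this way (everything inside one `Pipeline`, with the grid addressing steps by name) the right idea?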

I am wondering: what is the best way to combine `Pipeline()`, `ColumnTransformer()`, and `GridSearchCV()` in this case? It seems like there are several ways you could do it.

How can I extract the most impactful features?

Thank you so much for your time. If you’d like I can upload the dataset and a notebook so you can visualize what is going on.