Text classification using spaCy

Hi, I'm Antonio, and I'm currently trying to start a new career in data science.
I discovered Dataquest and have been studying with its tutorials. While getting acquainted with some important Python packages, I decided to follow Dataquest's text classification tutorial (this one, to be more specific).
The goal of the tutorial is to produce an accurate model that can then be used to process new user reviews and quickly determine whether they are positive or negative.
Everything was going according to plan, but towards the end I ran into an error when I tried to fit the logistic regression model.

Screen Link:

My Code:


What I expected to happen:
The logistic regression model was supposed to be fitted, but instead I got the error below.

What actually happened:

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5548/2396744894.py in <module>
      1 # model generation
----> 2 pipe.fit(x_train, y_train)

~\anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
    339         """
    340         fit_params_steps = self._check_fit_params(**fit_params)
--> 341         Xt = self._fit(X, y, **fit_params_steps)
    342         with _print_elapsed_time('Pipeline',
    343                                  self._log_message(len(self.steps) - 1)):

~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
    301                 cloned_transformer = clone(transformer)
    302             # Fit or load from cache the current transformer
--> 303             X, fitted_transformer = fit_transform_one_cached(
    304                 cloned_transformer, X, y, None,
    305                 message_clsname='Pipeline',

~\anaconda3\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    348     def __call__(self, *args, **kwargs):
--> 349         return self.func(*args, **kwargs)
    351     def call_and_shelve(self, *args, **kwargs):

~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    752     with _print_elapsed_time(message_clsname, message):
    753         if hasattr(transformer, 'fit_transform'):
--> 754             res = transformer.fit_transform(X, y, **fit_params)
    755         else:
    756             res = transformer.fit(X, y, **fit_params).transform(X)

~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1200         max_features = self.max_features
-> 1202         vocabulary, X = self._count_vocab(raw_documents,
   1203                                           self.fixed_vocabulary_)

~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1131             vocabulary = dict(vocabulary)
   1132             if not vocabulary:
-> 1133                 raise ValueError("empty vocabulary; perhaps the documents only"
   1134                                  " contain stop words")

ValueError: empty vocabulary; perhaps the documents only contain stop words

One thing I noticed when checking x_train and x_test is that the stop words were not removed, even though I created a specific function for that.
This is my first time using spaCy. I looked for an answer on Stack Overflow but couldn't find one.
It is also my first time creating a topic here in the Dataquest community, so sorry if the formatting is not appropriate.

Thanks in advance.

Hi @antonio.rocha.andrad:

I have some experience with NLP, so I'll try to help. Would you mind attaching your notebook (or code) here for more context, so that I can take a look at the preceding code?

Also, just to confirm: are you using the same dataset as the tutorial, or a different one?

You may find this article useful when attempting to share your code.
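
In the meantime, one thing that may be worth checking: the traceback shows the vectorizer step ending up with an empty vocabulary, which means no tokens at all survived tokenization. That can happen when a custom tokenizer returns nothing (for example, a missing `return`) or filters out every word. Below is a minimal sketch of a sanity check in plain Python. The names `STOP_WORDS`, `toy_tokenizer`, and `summarize_tokenizer` are hypothetical stand-ins, not taken from your notebook or the tutorial; the idea is to swap in your own tokenizer function and a few rows of `x_train`.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption, not spaCy's real list).
STOP_WORDS = {"the", "a", "is", "it", "and"}


def toy_tokenizer(text):
    """Hypothetical stand-in for the custom tokenizer given to the vectorizer.

    Splits on whitespace, strips basic punctuation, lowercases,
    and drops stop words.
    """
    tokens = (word.strip(".,!?").lower() for word in text.split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def summarize_tokenizer(tokenizer, documents):
    """Return (token counts, number of documents that yielded no tokens)."""
    vocab = Counter()
    empty_docs = 0
    for doc in documents:
        tokens = tokenizer(doc)
        if not tokens:
            # An empty list or None here means this document adds nothing
            # to the vocabulary.
            empty_docs += 1
            continue
        vocab.update(tokens)
    return vocab, empty_docs


sample = [
    "The battery is great!",
    "It is a",
    "Terrible sound and the app crashes.",
]
vocab, empty_docs = summarize_tokenizer(toy_tokenizer, sample)
print("distinct tokens:", len(vocab))
print("documents with no tokens:", empty_docs)
```

If every document comes back empty when you run this with your real tokenizer, the problem is most likely in that function rather than in the logistic regression step.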