How does Python know that this is ascii instead of just strings?

Screen Link:
Encodings And Representing Text In A Computer — ASCII Limitations | Dataquest

text = "The Swedish word for quest is sökande"
encoded = text.encode(encoding='ascii', errors='replace')
print(encoded)
print(type(encoded))

ascii and replace seem to be just strings, so I don’t understand how Python is able to tell that we are using the ASCII table and that it should replace something. How is it different than if I replaced ascii with ‘flowers’ or replace with ‘candy’?

How is your question different than. . .

In open("my_file.txt"), how does Python know that my_file.txt is the file I have on my computer and not just some string?

. . .?

I ask because the answer can be informative; it can help me give a better answer to your question.

Are you saying that Python already has “ascii” and “replace” built into its system to pull from? Using your open file example, my understanding is that when you call that function, Python will look through the directory to find where that file may be, and the string in there is just the, quote-unquote, format that Python uses to find that file.

I’m not sure what you mean by this.

Something like that. More specifically, the encoding and errors parameters of the str.encode method accept those strings as arguments, with the meaning that one would expect.

More generally, in Python everything is a Python object. Files on your hard drive do not exist in Python in the same way that they exist in your operating system.

Whatever way your operating system interprets those zeroes and ones, Python has its own way too. In the case of file names, the designers decided to use strings as the proper way to refer to them. The same thing happens for ascii and replace above: the strings are names that Python looks up.
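
To make the "names that Python looks up" idea concrete, here is a minimal sketch using the standard codecs module. It only illustrates that both strings resolve to registered objects; it is not a description of everything str.encode does internally.

```python
import codecs

# "ascii" is a key into Python's codec registry.
info = codecs.lookup("ascii")
print(info.name)  # canonical name of the codec that was found

# "replace" is a key into the registry of error handlers.
handler = codecs.lookup_error("replace")
print(callable(handler))  # the handler is an ordinary callable object

text = "The Swedish word for quest is s\u00f6kande"  # \u00f6 is the letter ö
encoded = text.encode(encoding="ascii", errors="replace")
print(encoded)  # the ö cannot be represented in ASCII, so it becomes b"?"
```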

Hello @hliu19922019
As you learn more, you’ll learn how to create functions, classes and methods.
encoding and errors are parameters of the str.encode method, and they expect known arguments (see the method information below).

This will generate an error:

encoded = text.encode(encoding="flower", errors="candy")
---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
Input In [5], in <module>
----> 1 encoded = text.encode(encoding="flower", errors="candy")

LookupError: unknown encoding: flower

As expected, "flower" is not one of the encodings that encode knows about. The same goes for errors: "candy" is not a known error handler, so it also raises a LookupError once Python hits a character it cannot encode.
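
A small sketch of both failures side by side. The helper name try_encode is made up for this example; it just catches the LookupError so that both cases can be printed in one run.

```python
text = "The Swedish word for quest is s\u00f6kande"  # \u00f6 is the letter ö


def try_encode(encoding, errors):
    """Return the encoded bytes, or a message if a name is not registered."""
    try:
        return text.encode(encoding=encoding, errors=errors)
    except LookupError as exc:
        return f"LookupError: {exc}"


# Unknown encoding name: fails when the codec is looked up.
bad_encoding = try_encode("flower", "replace")
# Known encoding, unknown error handler: fails when the ö cannot be encoded
# and Python goes looking for a handler called "candy".
bad_errors = try_encode("ascii", "candy")

print(bad_encoding)
print(bad_errors)
```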

`str.encode` full info
str.encode?
Signature: str.encode(self, /, encoding='utf-8', errors='strict')
Docstring:
Encode the string using the codec registered for encoding.

encoding
  The encoding in which to encode the string.
errors
  The error handling scheme to use for encoding errors.
  The default is 'strict' meaning that encoding errors raise a
  UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
  'xmlcharrefreplace' as well as any other name registered with
  codecs.register_error that can handle UnicodeEncodeErrors.
Type:      method_descriptor
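
To see what the error handling schemes named in the docstring actually do, here is a short sketch that encodes the same word with each of them. The variable names are only for illustration.

```python
word = "s\u00f6kande"  # "sökande", with ö written as an escape

# Encode the same word with each non-default scheme from the docstring.
results = {
    scheme: word.encode("ascii", errors=scheme)
    for scheme in ("ignore", "replace", "xmlcharrefreplace")
}
for scheme, encoded in results.items():
    print(f"{scheme:>17}: {encoded}")

# 'strict' is the default: it raises instead of returning bytes.
try:
    word.encode("ascii")
except UnicodeEncodeError as exc:
    strict_reason = exc.reason
    print(f"{'strict':>17}: UnicodeEncodeError ({strict_reason})")
```

'ignore' drops the ö, 'replace' substitutes a question mark, and 'xmlcharrefreplace' writes the character's code point as an XML reference.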

@hliu19922019: referring to the article by @info.victoromondi:

How Python does the conversion is an under-the-hood implementation detail (probably a complex one that you need not know about, since abstractions like this exist to save us programmers the trouble of re-implementing functionality).

I’m writing my answer for future students (including myself) at all levels so this may look overly technical to you.

What does each string mean?

Here’s the docs with a table of what string you type gets converted to what codec: codecs — Codec registry and base classes — Python 3.10.2 documentation
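
As a quick check of that table, this sketch asks the codec registry what a few spellings resolve to. The aliases used here are the ones the docs list for ASCII; the loop itself is just an illustration.

```python
import codecs

# Several spellings from the docs table all resolve to the same codec.
resolved = {
    alias: codecs.lookup(alias).name
    for alias in ("ascii", "ASCII", "us-ascii", "646")
}
for alias, name in resolved.items():
    print(f"{alias!r} -> {name!r}")
```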

Similar patterns in other libraries

In scikit-learn’s table of scoring parameters, the entries under the Scoring column are strings (hardcoded by the sklearn designers) that give the user a convenient way to specify which metric to use when calling a method.
The middle column, Function, is not a string but the Python function behind that name. Wrapped into a scorer object, it can substitute for the string as a method input and can be configured with more parameters, which gives more flexibility than the string. User-defined scorers are also accepted as method inputs as long as they follow the framework’s requirements.
You can also find such patterns in deep learning libraries (Keras/TensorFlow/PyTorch).
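
The general pattern can be sketched in plain Python. Everything here (METRICS, accuracy, evaluate) is a hypothetical stand-in written for this answer, not sklearn's actual API: a string is looked up in a registry, while a callable is used directly.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical registry: the "hardcoded strings" a library designer provides.
METRICS = {"accuracy": accuracy}


def evaluate(y_true, y_pred, scoring="accuracy"):
    """Accept either a registered name or a user-supplied callable."""
    if isinstance(scoring, str):
        if scoring not in METRICS:
            raise LookupError(f"unknown scoring: {scoring}")
        metric = METRICS[scoring]
    else:
        metric = scoring  # assume it follows the (y_true, y_pred) convention
    return metric(y_true, y_pred)


y_true, y_pred = [1, 0, 1, 1], [1, 1, 1, 1]
by_name = evaluate(y_true, y_pred, scoring="accuracy")
by_callable = evaluate(y_true, y_pred, scoring=lambda t, p: 1 - accuracy(t, p))
print(by_name, by_callable)
```

Passing an unregistered string such as "flower" raises a LookupError here, which mirrors what str.encode does with an unknown encoding name.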

How is the conversion implemented?

Here’s the C-API (python calls compiled C functions for faster calculations) source code that searches for the encoding based on the string you give: https://elixir.ortiz.sh/python/v3.8.6/source/Objects/unicodeobject.c#L3586

You can see how it parses your input string to control which relevant codecs to use to encode.

More general info

From Unicode Objects and Codecs — Python 3.10.2 documentation
This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.

This gives you an idea of how C interfaces with Python, which helps you find your way around the C source code (most of the Python built-ins’ source code is not written in Python).

From Unicode Objects and Codecs — Python 3.10.2 documentation
Encode a Unicode object and return the result as Python bytes object. encoding and errors have the same meaning as the parameters of the same name in the Unicode encode() method

This tells you that PyUnicode_AsEncodedString is called when you do any 'hello'.encode in python.

String (python 2) vs Unicode (python 3)

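Unicode (a standard for representing text) is what Python 3’s str type stores, so a Python 3 str matches the colloquial string (the general text datatype of most programming languages).
In Python 2, str held bytes and text had a separate unicode type; in Python 3, str is the Unicode type. That is why the C functions behind str methods are named PyString_* in the Python 2 source and PyUnicode_* in the Python 3 source.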
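
For the Python 3 side, a minimal sketch of the two types involved: str holds text as Unicode code points, and encode turns it into bytes. The variable names are only for illustration.

```python
text = "s\u00f6kande"        # a Python 3 str: 7 Unicode characters
data = text.encode("utf-8")  # a bytes object: the ö takes 2 bytes in UTF-8

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Decoding with the same codec gives back the original text.
round_trip = data.decode("utf-8")
print(round_trip == text)
```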

Here’s the python 2 api (https://svn.python.org/projects/python/trunk/Objects/stringobject.c).

You can use this to get a general idea of what calls what, then replace “string” with “unicode” (e.g. PyString_Encode → PyUnicode_Encode) to search for what you want in the Python 3 source (assuming you got lost digging around the Python 3 API like me) at https://elixir.ortiz.sh/python/v3.8.6/source/Objects/unicodeobject.c.

Tracing the flow partially

  1. PyUnicode_Type is str in python (Unicode Objects and Codecs — Python 3.10.2 documentation)
  2. That type has unicode_methods (https://github.com/python/cpython/blob/8a84aef0123bd8c13cf81fbc3b5f6d45f96c2656/Objects/unicodeobject.c#L15196), one of which is UNICODE_ENCODE_METHODDEF (https://github.com/python/cpython/blob/8a84aef0123bd8c13cf81fbc3b5f6d45f96c2656/Objects/unicodeobject.c#L13907)
  3. Jump to the header file: UNICODE_ENCODE_METHODDEF refers to unicode_encode (https://github.com/python/cpython/blob/8a84aef0123bd8c13cf81fbc3b5f6d45f96c2656/Objects/clinic/unicodeobject.c.h#L135), which calls unicode_encode_impl (https://github.com/python/cpython/blob/8a84aef0123bd8c13cf81fbc3b5f6d45f96c2656/Objects/clinic/unicodeobject.c.h#L190)
  4. Jump back to the .c file: unicode_encode_impl (https://github.com/python/cpython/blob/8a84aef0123bd8c13cf81fbc3b5f6d45f96c2656/Objects/unicodeobject.c#L11519) calls PyUnicode_AsEncodedString, the function in the source code linked above.

Elixir and Github

Make sure you use the right Python version from the left panel of Elixir. Click any symbol to search for it. “Defined in” means the definition (usually what we want); “Referenced in” means where else the symbol is used.

If you want to investigate from GitHub, note that the main branch, which shows up by default, is (at the time of writing) the future Python 3.11. Its C-API differs from 3.10 and earlier (some things added or deprecated), so you need to select the correct branch on GitHub if you want to compare against the Elixir link showing 3.8 (Elixir is much better anyway, since GitHub doesn’t let you click around variable names).

If you want to practice
My whole answer was about the encoding parameter. You can repeat the whole process for errors too.
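
If you want a head start on errors, the docstring above mentions codecs.register_error. Here is a minimal sketch that registers a handler under the name "candy" so that the string from the original question becomes valid. The handler name and its behaviour (swap each unencodable character for "*") are made up for this example.

```python
import codecs


def candy_handler(error):
    """Replace each unencodable character with '*' and continue after it."""
    if not isinstance(error, UnicodeEncodeError):
        raise error  # this sketch only handles encoding errors
    bad = error.object[error.start:error.end]
    # Return the replacement text and the position where encoding resumes.
    return ("*" * len(bad), error.end)


# After this call, "candy" is a name the errors parameter recognizes.
codecs.register_error("candy", candy_handler)

encoded = "s\u00f6kande".encode("ascii", errors="candy")  # \u00f6 is ö
print(encoded)
```

So "candy" was never forbidden as such; it simply had not been registered, which is the same reason "flower" fails as an encoding.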