For the past few months, I have been teaching Le Wagon Singapore Office’s first Data Science Bootcamp.
This inspired me to share some lessons about the difficulties beginners face, and my own learning and teaching experience. I’ll point out some common issues new programmers face at various stages, and how we can tackle them and build our learning muscle to become more independent and active learners.
A framework I would introduce here is GET, APPLY, MAINTAIN, EXPAND.
GET refers to how you get exposed to knowledge and people. Beginners usually sign up with online course providers like Dataquest, DataCamp, or Coursera. That's a great way to follow a well-planned path someone has prepared for you, and it lets you focus on getting the basic concepts down and basic tasks done. However, it does not usually show you how the same tool can be used in other contexts, because that would either be out of scope for the course, or be too big a detour to maintain the lesson flow.
In this case, subscribing to newsletters that aggregate material and push to your email folders daily/weekly is a convenient way to collect information. You then just have to scroll through the headers and see if it’s a relevant item to skim through and bookmark for later study sessions.
A possible strategy is to subscribe to many newsletters, then slowly unsubscribe from those that do not resonate with you.
Example from my mailbox:
- Deep learning
  - Deep Learning Weekly: https://www.deeplearningweekly.com/
For people, I mean following people on LinkedIn or Twitter who share valuable content. That also saves you time searching for your own resources. There are probably numerous local, and definitely quite a few international, Slack communities you can join to meet fellow learners. Looking out on Slack is how I found my first remote job at DataCamp. Having a network of friends who can help answer your queries is really motivating for learning, and building this takes time.
When facing a new challenge, it could be efficient to start with a short youtube video (especially animated ones like https://www.youtube.com/watch?v=0Om2gYU6clE&t=55s&ab_channel=Sreekanth) to get a lay of the land. Next, go to official docs/stackoverflow to find various example uses, and finally the github issues page of that library to see a tool’s limitations and workarounds others have been using.
After an introduction to basic theory, read articles discussing pitfalls, best practices, and lesser-known but still critical facts, to avoid making easy mistakes. For example, SQL has three-valued logic and non-equi joins, topics basic tutorials rarely cover.
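To see three-valued logic in action, here is a small sketch using Python's built-in sqlite3 module and a made-up single-column table: comparisons involving NULL evaluate to NULL rather than True or False, which silently drops rows from WHERE filters.

```python
import sqlite3

# A made-up in-memory table to illustrate three-valued logic:
# comparisons with NULL yield NULL (unknown), not True/False.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# NULL = NULL evaluates to NULL, which sqlite3 returns as None
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]
print(null_eq)  # None

# WHERE x <> 1 silently drops the NULL row: only (2,) survives
not_one = conn.execute("SELECT x FROM t WHERE x <> 1").fetchall()
print(not_one)  # [(2,)]

# IS NULL is the correct way to test for NULL
null_count = conn.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0]
print(null_count)  # 1
```

The surprise for most beginners is the second query: the row holding NULL matches neither `x = 1` nor `x <> 1`, because both comparisons evaluate to unknown.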
This learning style is pre-emptive, as opposed to just-in-time when you need it. I believe this depends on personal preferences. Pre-emptive saves you googling time, is suited to more difficult tasks like sklearn, and prevents you from doing the wrong things, while just-in-time suits easier tasks like pandas, where copy-pasting can easily solve a problem. In retrospect, I spent way too much time exploring the corners of pandas, a skill that has little value when anyone can just replicate the answer through a simple search.
When learning how to use a library (numpy, pandas, matplotlib, seaborn, scikit-learn), beginners usually start by running examples given by the course provider. However, they soon have to deal with official documentation when working out what other parameters are available, their possible values and types, which are compulsory or optional, and in what order they should be specified.
When I first saw the pandas.read_csv docs (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), I was overwhelmed by how many parameters there were, wondering how I was ever going to learn them all. Then I saw matplotlib.pyplot.plot and wondered what's in all those **kwargs, and what's the point of documentation that doesn't tell me what can go into **kwargs? For sklearn, there is the extra danger of code that runs but whose workflow makes no practical sense, such as preprocessing before the train-test split, causing data leakage.
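To make the leakage point concrete, here is a minimal numpy sketch with toy data (not the sklearn API): fitting scaling statistics on all the data lets the test set influence its own preprocessing, whereas fitting on the training set alone does not.

```python
import numpy as np

# Toy data; a sketch of the leakage idea, not a full sklearn pipeline
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=100)
X_train, X_test = X[:80], X[80:]

# Correct: fit scaling statistics on the TRAINING set only...
train_mean, train_std = X_train.mean(), X_train.std()
# ...then reuse them to transform both train and test
X_test_correct = (X_test - train_mean) / train_std

# Leaky: statistics computed on ALL data, so the test set has
# influenced the preprocessing it is later evaluated with
X_test_leaky = (X_test - X.mean()) / X.std()

# The two transformed test sets differ, revealing the leak
print(np.abs(X_test_correct - X_test_leaky).max() > 0)  # True
```

In sklearn terms, this is why you call `scaler.fit(X_train)` and then `scaler.transform(X_test)`, never `scaler.fit(X)` on the full dataset.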
A way to deal with this information overload/underload is to use Cheatsheets https://www.datacamp.com/community/data-science-cheatsheets. This is good first exposure to what is possible and commonly used, and especially effective for getting things to even run, before learning the theory to improve the accuracy, error-handling, maintainability or more abstract ideas. You can then slowly try to make use of more and more parameters as you repeatedly use a particular method. When dealing with a new data structure or cloud service, such cheatsheets also show you the syntax for CRUD actions (create, read, update, delete) to achieve the basic manipulations of any object.
After knowing how the syntax looks from the cheatsheet and getting it to run, read some theory to understand how the library is designed so you can use it better. For matplotlib (https://matplotlib.org/stable/tutorials/introductory/usage.html), knowing that there are 2 APIs (pyplot vs object-oriented) and the anatomy of a figure will set you up for custom plots. Similarly in seaborn (https://seaborn.pydata.org/tutorial/function_overview.html), knowing about figure-level and axes-level functions leaves you more empowered to control plot properties in axes individually vs as a whole figure.
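As a quick sketch of the two matplotlib APIs (using made-up data): the pyplot interface mutates a hidden "current" figure, while the object-oriented interface hands you explicit Figure and Axes objects to control individually.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

xs, ys = [1, 2, 3], [2, 4, 6]

# pyplot (state-based) API: matplotlib tracks the "current" figure/axes
plt.figure()
plt.plot(xs, ys)
plt.title("pyplot API")

# object-oriented API: you hold explicit Figure and Axes objects,
# which makes controlling individual panels much easier
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].plot(xs, ys)
axes[1].scatter(xs, ys)
fig.suptitle("object-oriented API")

print(len(fig.axes), len(axes[0].lines))  # 2 1
```

The pyplot style is fine for one-off plots; the moment you need two panels styled differently, holding the Axes objects yourself pays off.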
For more experienced learners, official documentation may not be enough exposure, so you may want to dig into source code to see the hidden attributes (and their hidden values) and functions that can be used. By reading seaborn source code, you can learn how extra numbers used in visualizations are calculated, such as ci=95 confidence intervals in regplot. By reading pandas source code, you can observe what are the conditions that lead to a particular error message. By reading sklearn source code, you can learn API design and how they managed to fit the complex world of machine learning into fit, transform, predict (https://arxiv.org/abs/1309.0238).
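For digging into source without leaving Python, the standard library's inspect module can retrieve the code of whatever version is installed. A small sketch on a stdlib function to stay self-contained; the same calls work on pure-Python objects from seaborn, pandas, or sklearn.

```python
import inspect
import textwrap

# getsource works on any pure-Python function or class; shown here on
# a stdlib function, but e.g. inspect.getsource(seaborn.regplot) works too
src = inspect.getsource(textwrap.dedent)
src_file = inspect.getsourcefile(textwrap.dedent)

print(src.splitlines()[0])  # the "def dedent(...)" line
print(src_file)             # path to textwrap.py on your machine
```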
APPLY refers to how you take what is presented to extract the most lessons out of it. I will explain this by breaking it up into coding and debugging.
After we have learned some concepts, it is time to code. An efficient way to start is to copy examples from the pandas documentation, which saves you a lot of formatting and DataFrame set-up time. That set-up used to be a mental burden that prevented me from doing side experiments to learn more about a pandas method.
If you have a particular question, stackoverflow does have very good mini-tutorials and usually multiple ways to achieve a data manipulation task. Websites like (https://www.programcreek.com/python/, https://python.hotexamples.com/, https://www.codegrepper.com/) scrape from open source repositories, where you can find examples more difficult and often more useful than what you can find on official docs.
A common mistake when coding is shadowing built-in names like list or sum, which makes the original function no longer callable. Also, if you work in a Jupyter Notebook, the same variable name may be used in cells far apart, so running cells out of order may overwrite variable values with undesired content.
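A minimal sketch of the shadowing mistake (variable names are made up):

```python
data = [1, 2, 3]
sum = 0                      # oops: "sum" now names the integer 0
for x in data:
    sum += x                 # works as an accumulator...

try:
    sum([4, 5, 6])           # ...but the built-in function is gone
except TypeError as exc:
    error = exc              # "'int' object is not callable"

del sum                      # deleting the shadow restores the built-in
total = sum([4, 5, 6])
print(total)  # 15
```

In a script the `del` fix is obvious; in a notebook the shadowing cell may be screens away, which is why the error feels so mysterious there.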
Productivity can be improved by using nbextensions (https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html) to give you a content page so you can jump around the notebook, variable inspector so you never have to print array.shape again, execution time label below the cell so you don’t have to use %%timeit magics.
type and dir
Ever useful, and very necessary, tools for learning any new library are type() and dir(), because we don't know how library designers structured their classes on first exposure.

type() tells you the class of any object. This gives you a clue of what to even google to learn about that object, and, once familiar, tells you what you can do with it. It also lets you know whether one class is a subtype of another, which influences whether functions will work. For example, bool is a subclass of int, and so inherits the mathematical operations of the integer class. That's why sum() can be applied to a boolean list, which provides a ton of convenience during pandas manipulations, and why patterns like [func1, func2][var > 2] work as control flow.
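These points can be verified in a few lines (the functions and threshold below are made up for illustration):

```python
# bool is a subclass of int, so True/False behave as 1/0 in arithmetic
is_sub = issubclass(bool, int)
print(is_sub)  # True

# sum() on booleans counts the True values, handy for pandas masks too
grades = [85, 40, 72, 90]
n_passed = sum(g >= 50 for g in grades)
print(n_passed)  # 3

# [func1, func2][var > 2] dispatch: the bool indexes the list as 0 or 1
def low(x): return f"{x} is small"
def high(x): return f"{x} is big"

var = 5
result = [low, high][var > 2](var)  # True -> index 1 -> high
print(result)  # 5 is big
```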
dir() gives you a list of attributes and methods on an object, which tells you what information you can get out of it and what things you can do with it. In sklearn, after fitting a model, there are usually many new attributes assigned to the object based on the training data, which you will want for analysis, plotting, and writing pipelines. For pandas, you will often be pleasantly surprised to find the same methods on slightly different objects (eg. DataFrameGroupBy vs SeriesGroupBy), so learning becomes transferable.
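A small sketch of that transferability, comparing the public names on the two groupby objects (toy data):

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 2, 3]})

df_gb = df.groupby("team")            # DataFrameGroupBy
s_gb = df.groupby("team")["score"]    # SeriesGroupBy

def public(obj):
    # dir() lists everything; filter to the public API
    return {name for name in dir(obj) if not name.startswith("_")}

shared = public(df_gb) & public(s_gb)

# The same aggregation vocabulary exists on both objects
print({"sum", "mean", "agg", "transform"} <= shared)  # True
```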
If type and dir give insufficient information, you can use the Jupyter magics %pinfo2 object to bring up the source code of the class/method, or %pfile object to see the file in which the class/method is defined, to disambiguate it from other similar methods.
Breaking down copied code
When copying code from other sources instead of writing from scratch, beginners often have problems breaking down the pipeline to understand.
A useful way to understand behaviour is to comment out the copied code line by line while keeping it runnable without error; you will then see which lines are essential and which are not. Then take the single line you don't understand and search for online examples. Ask: what does it do, what types of input can it take, and what output do I expect to get? If things are still breaking, break the code down into even smaller blocks.
Even if the code runs, we can continue tweaking it to accept more difficult inputs and try to break it. This often pushes the boundaries of knowledge and inculcates active learning habits so you learn more than just what the tutorial tells you. Who knows when’s the next time you will come across this tool again, so better learn more now while it’s still fresh in your memory and abstract out as many use cases as possible.
Often we see method chains like df.groupby([col1, col2]).sum().reset_index().sort_values() that beginners stare at helplessly. To study how such a chain works, insert a comment marker before each dot in turn, then wrap what's left to learn what you have (type) and what can be done with it (dir).
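A sketch of that breakdown on a made-up DataFrame, binding each link of the chain to a name so type() can be called on it:

```python
import pandas as pd

# Made-up data so each link of the chain can be inspected separately
df = pd.DataFrame({
    "col1": ["a", "a", "b"],
    "col2": ["x", "y", "x"],
    "val": [1, 2, 3],
})

step1 = df.groupby(["col1", "col2"])  # a DataFrameGroupBy, not yet data
step2 = step1.sum()                   # DataFrame with a (col1, col2) MultiIndex
step3 = step2.reset_index()           # MultiIndex moved back into columns
step4 = step3.sort_values("val")      # an ordinary, sorted DataFrame

# type() at each step tells you what to google next
print(type(step1).__name__, type(step2).__name__)  # DataFrameGroupBy DataFrame
print(list(step3.columns))  # ['col1', 'col2', 'val']
```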
Beginners often see a giant traceback and just freeze in fear. Here is why I love Real Python (https://realpython.com/python-traceback/): their explanations are clear. Basically, you should read the traceback from top to bottom, because the topmost error will point to your code, while the layers below it are code called by your code, which is really uninterpretable unless you are used to diving into the source. However, once you have seen enough common errors, reading the bottommost error can also let you understand immediately what's wrong.
Besides debugging broken code, we may want to debug working code. Sprinkling import ipdb; ipdb.set_trace() around can be very helpful for pausing the program exactly where you want it, so you can inspect values/types/dir. A trick is to insert that line inside your custom function before calling df.apply(custom_func), to see exactly what pandas sends to your custom function as input.
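Since ipdb is interactive, here is a non-interactive sketch of the same trick on toy data: the custom function records what pandas hands it (in practice you would put ipdb.set_trace() where the append is and inspect live):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

seen = []  # record what pandas hands to the custom function

def custom_func(x):
    # at an ipdb breakpoint you would call type(x)/dir(x); here we log it
    seen.append(type(x).__name__)
    return x.sum()

col_result = df.apply(custom_func)          # default axis=0: one Series per COLUMN
row_result = df.apply(custom_func, axis=1)  # axis=1: one Series per ROW

print(set(seen))          # {'Series'}: apply passes whole Series, not scalars
print(list(col_result))   # [3, 7]  column sums
print(list(row_result))   # [4, 6]  row sums
```

The recorded types answer the usual confusion directly: DataFrame.apply passes entire columns (or rows with axis=1) as Series, not individual cell values.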
Besides debugging your own working code, you may want to debug open source code in Visual Studio Code by setting breakpoints; just remember to set justMyCode to false, or else the debugger will not step into the open source function. I found this method useful for learning how sklearn's algorithms are implemented, something the documentation has no room to explain.
After going through an in-depth learning session, we are bound to forget syntax, or how something works. The former takes practice, while the latter can be alleviated with comments and headers in your notebook, linking to the references you copied an idea from.
Notion is a pretty and useful tool for organizing your notes and website bookmarks.
Open up new Jupyter Notebooks, one for each series of topics (eg. closures + decorators + properties, graph algorithm, dynamic programming, data structures) so you can continuously accumulate new knowledge into the same space, and revise or delete old cells once committed to memory (Leitner system for spaced repetition).
Along the way, we should definitely work through assessments to judge where we are. Some may even like going through tests before knowing anything about the subject, to begin with the end in mind when there are time constraints, as I did when preparing for Google Cloud certifications.
Here are some places with SQL, Python, Data Science tests.
- Linkedin Skills Assessment
- Workera: https://workera.ai/
EXPAND means to do something beyond your current capabilities every now and then, once you start getting comfortable. For example, I felt the need for this when looking at conditional probability. The theory made perfect sense to me and I could manipulate equations and get answers, but when it came to implementation using layered boolean indexing, I felt very unsure about the correct order of indexing. This made me realize that implementation is a fantastic marker for judging whether someone understands something, and I believe it to be almost as difficult as teaching.
I believe every beginner should go through the "writing from scratch" stage of learning to get a feel for using the IDE, looking at error tracebacks, and using autocomplete. After some time, however, when you know a tool well, it's faster to learn by exposure: reading more code and seeing how others implement an idea. This is also why strong Python skills are a prerequisite that opens many doors.
One way of expanding knowledge beyond beginner courses is to read books because these usually cover more in-depth ideas. Search for books with runnable ipynb tutorials attached (https://allendowney.github.io/ThinkBayes2/) which let you manipulate ideas gleaned from the book for fast feedback loops.
Besides self-study, working with people can be a more efficient method. For example, joining competitions in a team means more heads to bounce ideas off and more hands to run experiments. You get to practice reading teammates' or competitors' code on the same problem, which gives you multiple ideas for solving a not-so-trivial problem. Doing remote work in a big group (https://omdena.com/) can also train communication and project management.
Another avenue is employment, where you get to work on problems with a business focus, and where you see which parts of the theory you learned matter, and which don't, which helps focus future learning efforts.
Take notes and code snippets so you can refer back to them when meeting new ideas or problems, and reflect on why something you previously thought was good code is not so good anymore.
Come up with questions to cover more cases than what you’re required to handle, identify the underlying assumptions, ask whether this solution still applies in another context, or when an assumption is not met.
Tweak given code to cover more cases
- What if I give another datatype as input; will the method break or truncate its value?
- What if I set up the pandas Series index to be non-consecutive or non-unique; how does this affect loc, iloc, and shorthand indexing?
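A sketch of the second question with a deliberately awkward index (made-up values):

```python
import pandas as pd

# A Series with a non-consecutive, non-unique index
s = pd.Series([10, 20, 30], index=[7, 7, 2])

both_sevens = s.loc[7]   # label-based: BOTH rows labelled 7 come back
first_item = s.iloc[0]   # position-based: always the first element
unique_label = s.loc[2]  # label 2 is unique here, so a scalar is returned

print(list(both_sevens))  # [10, 20]
print(first_item)         # 10
print(unique_label)       # 30
```

Note the asymmetry: the same `.loc` call returns a Series for a duplicated label but a scalar for a unique one, a classic source of downstream type bugs.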
Build your own examples to test edge cases
- Setting up DataFrames was a huge mental burden, so copy from the docs
- Setting up SQL tables was painful, so copy DDL from Stack Overflow and run it using https://sqliteonline.com/ or http://sqlfiddle.com/.
By asking how something may apply to other situations, you can learn faster, or invent creative solutions. For example, python introductions usually tell you about list comprehensions, and if you learned sets and dictionaries too, you may ask whether there are set or dict comprehensions analogously too.
Think about whether you can mix up these comprehensions, such as unpacking key-value during dict comprehension and using only one of them, or iterating through lists/sets and adding your own metadata to create a dict.
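A sketch mixing these ideas on made-up words: a dict comprehension attaching metadata, unpacking during dict iteration while keeping only one side, and the analogous set comprehension.

```python
words = ["apple", "fig", "kiwi"]

# dict comprehension: iterate a list and attach our own metadata
lengths = {w: len(w) for w in words}

# unpack key-value pairs while iterating a dict, keeping only one side
long_words = [w for w, n in lengths.items() if n > 3]

# set comprehension: the analogous syntax with curly braces
first_letters = {w[0] for w in words}

print(lengths)        # {'apple': 5, 'fig': 3, 'kiwi': 4}
print(long_words)     # ['apple', 'kiwi']
print(first_letters)  # {'a', 'f', 'k'}
```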
In sklearn, when you learn that a model has a predict_proba method, you may ask: does every classification model have this? When you see SelectFromModel requiring the fitted estimator to have a coef_ attribute, you can ask which models have which of these attributes, and why.
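These questions can be answered empirically with hasattr (a sketch on toy data; the particular model choices are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# hasattr answers "which models expose which attributes"
has_lr = hasattr(LogisticRegression, "predict_proba")  # probabilistic model
has_svc = hasattr(LinearSVC, "predict_proba")          # hinge loss: no probabilities
print(has_lr, has_svc)  # True False

# coef_ only appears AFTER fitting, and only on linear models;
# tree ensembles expose feature_importances_ instead
X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
lr = LogisticRegression().fit(X, y)
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
print(hasattr(lr, "coef_"), hasattr(rf, "coef_"))  # True False
print(hasattr(rf, "feature_importances_"))         # True
```

This is why SelectFromModel's docs talk about coef_ *or* feature_importances_: the attribute a model grows after fitting depends on the family it belongs to.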
There is no need to be perfect on the first attempt. Just getting started could seem like an insurmountable task, but as you get more used to the learning cycle, it gets easier to start. You can break down the task of learning a new API into
- Can run
- Can run with correct results (eg. fit on training set, transform on training+testing set instead of fit on training+testing)
- Can run with correct results and time/space efficient (item in set vs item in list)
- Can solve the problem in multiple ways
- Can prepare the code to be easily editable for potential requirements changes in future
- Uses best practices (naming, type hinting, context managers) and shows good communication (code organized into meaningful function names, reading like a story)
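The set-vs-list point in the list above can be measured directly (sizes chosen arbitrarily): both containers give the same answer, at very different cost.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # worst case for the list: a full scan

t_list = timeit.timeit(lambda: target in as_list, number=100)
t_set = timeit.timeit(lambda: target in as_set, number=100)

found = (target in as_list, target in as_set)
print(found)                                       # (True, True): same answer
print(f"list {t_list:.4f}s vs set {t_set:.6f}s")   # O(n) scan vs O(1) hash lookup
```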
Here are some of the epiphanies I have been through in my own learning journey.
Tutorials usually begin teaching container data structures using very simple contents in the contained objects such as integers/floats/strings. Soon I discovered how those contents could become full blown classes representing something in the real world, such as stacks/queues of people, or adjacency lists. This reminds me to stay humble and to always be expecting new uses of something I thought I already knew.
I discovered that for most applications we don't really need to care whether a collection is wrapped as a list/tuple/set, because all of them are iterables and can be fed into other functions like map(func, iterable) and unpacked with for-loops. Additionally, moving from list/tuple/set to dict iteration, I discovered that we can get more than one piece of information at a time, which introduces unpacking. Finally, non-pure-Python objects like the pandas DataFrame can also be iterated through, but because a DataFrame is a much more complex object than a dictionary, we have to be aware of what the key and value are in that case.
A related concept is that we can iterate through an object using the list()/tuple()/set()/dict() constructors instead of manually writing a for-loop, so list(df)/tuple(df)/set(df) gives us the columns of a DataFrame: the constructor does a for-loop behind the scenes and extracts the keys of the df, which are the columns. Column names are returned rather than row index names because that is how the DataFrame class is defined to respond to the iteration protocol, which both for-loops and list() use. If you are not aware of this for-looping behaviour of list(), you may think that list(object) and [object] are the same. The latter does not unpack; it directly wraps the object to literally construct a one-element list.
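A sketch of the difference on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob"], "age": [30, 40]})

# list() iterates: DataFrame iteration yields its column names
columns = list(df)
print(columns)  # ['name', 'age']

# [df] does NOT iterate: it literally wraps df into a one-element list
wrapped = [df]
print(len(wrapped), type(wrapped[0]).__name__)  # 1 DataFrame
```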
When first seeing the key parameter of sorted(iterable, key=func), I was mind-blown that we can compute statistics and sort based on those statistics, rather than just sorting the raw items in the iterable. Later I realized that pd.Series is also a function that can be passed to df.apply() to expand a list of values in a single column per row into multiple columns per row, which isn't obvious because we usually use pd.Series() as a constructor rather than as a function used by another function.
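Both ideas in a few lines (made-up words and scores):

```python
import pandas as pd

# key= lets us sort by a computed statistic, not the raw items
words = ["pandas", "numpy", "sklearn"]
by_length = sorted(words, key=len)
print(by_length)  # ['numpy', 'pandas', 'sklearn']

# pd.Series passed as the function for apply(): each list in the
# column becomes a row of the result, with one column per element
df = pd.DataFrame({"scores": [[1, 2], [3, 4]]})
expanded = df["scores"].apply(pd.Series)
print(expanded.shape)  # (2, 2)
print(expanded)
#    0  1
# 0  1  2
# 1  3  4
```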
Expanding on this idea, we do not have to completely write our custom function from scratch, but make use of library functions too as part of our custom function.
The following code uses Python's reduce to merge a list of dataframes, instead of repeatedly chaining .merge() calls:

from functools import reduce
import pandas as pd

data_frames = [df1, df2, df3]
# instead of df1.merge(df2, ...).merge(df3, ...)
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['key_col'], how='outer'), data_frames)
Tutorials teach you to write from scratch, but that can prove more difficult than tweaking open source code, which already contains a lot of the behaviour you may want and which would take time to recreate. Just importing a data storage class from pandas like DataFrame and subclassing it to add my own behaviour can be very convenient. For sklearn, we can also import just the preprocessing functions for our own use, or subclass algorithms to do more than they're intended to. For example, some people use the GridSearchCV API for non-machine-learning purposes.
The subclassing point above refers to adding new methods or tweaking existing ones. This section talks about the special methods in Python wrapped with double underscores. Implementing or overriding the dunder methods of a class allows users of the class to use syntactic sugar (eg. obj[key] calls __getitem__, and len(obj) calls __len__) and interact with it more easily.
Here’s an example of how calling code can be shortened when the called class implements dunder methods (https://www.youtube.com/watch?v=wf-BqAjZb8M&ab_channel=PyCon2015)
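A minimal made-up class showing the sugar (the talk linked above works through a richer version of the same idea):

```python
class Deck:
    """A made-up class showing the syntactic sugar dunder methods enable."""

    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):         # enables len(deck)
        return len(self._cards)

    def __getitem__(self, i):  # enables deck[i], slicing, and even iteration
        return self._cards[i]

deck = Deck(["A", "K", "Q", "J"])
print(len(deck))                 # 4, via __len__
print(deck[0])                   # 'A', via __getitem__
print(deck[-2:])                 # ['Q', 'J'], slicing comes for free
print([card for card in deck])   # iteration also works through __getitem__
```

Two short methods buy indexing, slicing, iteration, and len() at once, which is exactly why library classes feel so "native" to use.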
numpy.sum() can sum across rows, columns, or the entire matrix. This tells me to watch out when using methods, because the same parameter can produce drastically different behaviour when its value changes. axis=0 vs axis=1 mistakes can be very hard to catch when both axes have the same number of elements, since no error appears when you use the results downstream. Similarly, pandas has a lot of methods whose behaviour depends on the axis specified.
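A sketch with a deliberately square matrix, where an axis mix-up would raise no error:

```python
import numpy as np

# A square matrix: both axes have 3 elements, so mixing up axis=0/1
# raises no error and silently produces wrong numbers downstream
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

per_column = m.sum(axis=0)  # collapse rows -> one sum PER COLUMN
per_row = m.sum(axis=1)     # collapse columns -> one sum PER ROW
total = m.sum()             # no axis -> the whole matrix

print(per_column)  # [12 15 18]
print(per_row)     # [ 6 15 24]
print(total)       # 45
```

A useful mnemonic: the axis you pass is the one that gets *collapsed*, not the one that survives.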
We don't have to limit ourselves to the vectorized methods given in numpy/pandas; np.frompyfunc lets us wrap our own custom Python functions so they follow the broadcasting rules of numpy, avoiding lengthy for-loops to align operands.
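A sketch with a made-up labelling function: np.frompyfunc turns it into a ufunc-like callable that follows broadcasting (note the result has dtype=object, since arbitrary Python values come back).

```python
import numpy as np

# An ordinary Python function, written for scalars
def label(price, threshold):
    return "cheap" if price < threshold else "pricey"

vectorized_label = np.frompyfunc(label, 2, 1)  # 2 inputs, 1 output

prices = np.array([3, 10, 25])          # shape (3,)
thresholds = np.array([[5], [20]])      # shape (2, 1)

# broadcasting pairs every price with every threshold: result is (2, 3),
# with no explicit nested for-loop to align the operands
result = vectorized_label(prices, thresholds)
print(result.shape)  # (2, 3)
print(result)
```

Each call still runs Python-level code per element, so this buys convenience and broadcasting, not the raw speed of true numpy ufuncs.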
A bootcamp is valuable if the materials are comprehensive and the teachers are passionate about helping students beyond the standard curriculum. Resources are collected for you and the key points of each topic are filtered, which saves a lot of your own time googling and going down rabbit holes that don't help you find a job.
Also, the alumni network is a huge plus for starting to build your network of friends with similar interests. It is not suited for everyone because the pace is extremely fast and while comprehensive, students usually do not have the time to fully understand the materials, so even completing the exercises with runnable code is an achievement.
Dataquest can be seen as a self-paced bootcamp, with less readily available help from teachers but a better question and answer forum where discussions are tracked so current learners can look at past mistakes from others, and future learners can contribute to questions from years ago.
If you have time, Dataquest is a better option for getting your feet wet. For those who are fully committed to changing careers and have a time limit, bootcamps are a better way to drive yourself, especially when studying with a group of similarly passionate people going through the same path.
I love Scott Young’s writings on learning, so ending with some pointers from him: https://www.scotthyoung.com/blog/2021/05/09/deep-learning-strategies/ , we can connect on Linkedin (https://www.linkedin.com/in/hanqi91/) if you love learning too.