Your advise on building a data warehouse / pipelines

Hi all,

I have been working as a data analyst for about 1,5 year and during that time I learned:

  • SQL (Basics, joins, subqueries, views, stored procedures) - more than enough to answer all business questions so far
  • Python (Data cleaning, data transformation) - Pandas and numpy, enough to do work more efficiently than in Microsoft Excel.
  • DAX (Power BI) - advanced. Scenario’s that I’m not able to solve in SQL and Python I can fix with DAX measures.

My manager asked me to delve into building a datawarehouse.
So far I have build a large amount of views to feed Power BI reports. But we feel like this isn’t a scalable approach. We have around 200 report users and our application landscape is growing rapidly.

What skills and approach would you advise me in order to integrate data in a data warehouse?
How would you build your first pipelines?
Where should I start? What kind of data warehouse? Shall I use Python or SQL to build it?

Any advise and experiences are more than welcome.

Is there anyone who wants to share his experiences and help me out on this topic ? Really looking forward to your replies

Hi @LuukvanVliet! I’m an infrastructure engineer at Dataquest, but I’ve worked as a data engineer, and built out data warehouses and ETL pipelines in a couple of past roles. :slightly_smiling_face:

I think that the first question you need to answer is, which data warehouse best suits your needs? Some options to consider might be (this is not an exhaustive list):

  • Azure SQL Data Warehouse
  • AWS Redshift
  • Google BigQuery
  • Snowflake
  • A standard relational database (such as Postgres, MySQL, Oracle)

Some questions you need to answer in order to make a good choice here are:

  • Which analytics tool(s) will you be using (are you sticking with Power BI, or potentially evaluating other solutions?), and which of these data warehouses does that tool easily integrate with?
  • What is the size of your data?
  • How do you want to balance cost vs. time to implement / ongoing maintenance labor? (Often there’s a trade-off here – cheaper options are usually more labor intensive to implement and maintain.)
  • How quickly do you need to implement this?

A process I’ve used for selecting tools like this in the past has been something like this:

  • Make a list of requirements, and a list of “nice to have” features. A requirement/nice to have might be “Must integrate easily with Power BI,” “Must allow analysts to use standard SQL to query,” or “must be available from our cloud provider.”
  • Make a list of candidate tools (you can probably start with my list above, but do some research and see if you can find other candidates!)
  • Create a document which lists each candidate tool, and for each tool, list each requirement. Research each tool in detail, and mark which requirements are met. Also add any notes for pros / cons / risks of that tool, and try to estimate cost.

Hopefully there will be a clear winner based on this research. Bring this document to your manager to justify your decision. (Your manager will for sure appreciate that you documented your research!)

After you select a data warehouse, the next steps will hopefully become more clear. Ultimately I think you’ll likely end up using a combination of Python and SQL in your work.

Good luck!

3 Likes

That was a very clear answer @darla.
Will certainly take this into account :slight_smile:

Are there any courses on dataquest you would recommend me for building and designing a data warehouse? So far I haven’t found any.

@LuukvanVliet At Dataquest we intentionally avoid focus on teaching very specific tools, but instead focus on teaching broader concepts and techniques which can be applied with a wide variety of tools. (Note that as a member of the Engineering team, I’m obviously not involved in setting content strategy! This is what I’ve understood from the content team, though. :slightly_smiling_face:)

Some Dataquest content relevant for this task, is the Building a Data Pipeline section of the Data Engineer path. That can help with general concepts related to ETL pipelines which will get data into the data warehouse.

I do think it would be good to add some general content about data modelling in the Data Engineer path, which would by very useful for building a data warehouse – hopefully we’ll be able to do that eventually!

1 Like