Small Big Data Platform

Hi Everybody

I'm Billy, a member of Dataquest, and I'm still working through the Python course. I currently work at an SME.

Unfortunately, my boss has asked me to create a small BIG data platform early next year to analyze company data from our internal system (SQL Server) and our email system (Linux/Postfix).

The platform needs to use some popular technologies such as Hadoop, MapReduce, etc., and support common analytics such as predictive analysis, outlier detection, etc.

So I'd like to ask any experts or classmates with experience in this area: are there any internet sources or books you can recommend for building such a platform?

I have tried searching the internet, but the information is scattered, and I haven't found complete information for building a small platform.

In addition, which data warehouse and data virtualization tools are popular these days?

I hope everybody can give me some direction/advice/suggestions. Thanks a lot!

Best Regards

Billy

Hi @billymklee. I’m an infrastructure engineer at Dataquest, but I’ve worked as a data engineer in the past. :slightly_smiling_face:

I think that the first question you need to answer is how you're going to store the data (that is, what kind of data warehouse best suits your needs).

Some questions you need to answer are:

  • Specifically which analytics tasks are most important? How will analysts need to work with the data? Will they primarily work by writing SQL queries, or do they actually need to create advanced models? Do you need to be able to query the data with SQL?
  • Do you or analysts have any analytics tools you prefer to use, or are you open to any?
  • What is the size of the data? (You said “small,” but can you quantify this in a ballpark estimate for “millions of rows”?)
  • How do you want to balance cost vs. time to implement and maintain? (Often there’s a trade-off where cheaper options require more labor to implement and maintain.)
  • How frequently will you query the data?
  • Do you have any other constraints? (A specific cloud provider you have to use, data privacy requirements, etc.)

Depending on the answers to those questions, some options might be (this is by no means an exhaustive list, just a few options off the top of my head!):

  • AWS Redshift
  • Google BigQuery
  • Snowflake
  • A standard relational database (such as Postgres, MySQL, Oracle)
  • A data lake, implemented for example as csv / parquet / json files in AWS S3

If your data size is actually relatively small, and you anticipate it will remain relatively small for the near future (1-3 years), I would strongly encourage you to avoid “big data” tools like Hadoop, Spark, MapReduce, etc. These tools typically only make sense if your data is very large – otherwise they just add needless complexity and make every task take 4 times as long as it needs to.

You can get a long way with a simple relational database, some SQL queries, and occasionally exporting data to model using scikit-learn! You may also explore tools like Mode Analytics (which we use at Dataquest), Sisense/Periscope, AWS Athena, etc., which may make your analytics work much easier.
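
To make that concrete, here is a rough sketch of the kind of export-and-model workflow I mean, assuming a Postgres warehouse and an outlier-detection task. The connection string, table, and column names are just placeholders:

```python
# A minimal sketch: pull a table from a relational database and flag outliers.
# The connection string, table, and column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.ensemble import IsolationForest

engine = create_engine("postgresql://user:password@localhost:5432/warehouse")

# Export the slice of data you want to model with a plain SQL query.
orders = pd.read_sql(
    "SELECT order_id, order_amount, items_count FROM orders "
    "WHERE order_date >= '2020-01-01'",
    engine,
)

# Fit a simple outlier detector on the numeric columns.
model = IsolationForest(contamination=0.01, random_state=42)
orders["is_outlier"] = model.fit_predict(orders[["order_amount", "items_count"]]) == -1

print(orders[orders["is_outlier"]].head())
```

The same pattern works against other databases (for SQL Server you would just swap in the appropriate SQLAlchemy connection string).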

Hi Darla,

Thanks for your reply.

I think that the first question you need to answer is how you're going to store the data (that is, what kind of data warehouse best suits your needs).

Billy - Any comments on Apache Hive? If it's not a good fit, do you have another suggestion?

Some questions you need to answer are:

Specifically which analytics tasks are most important?

Billy - For the initial stage, we may perform outlier detection and predictive analytics first; other analytics will be performed later.

How will analysts need to work with the data?

Billy - Sorry, I don’t understand the meaning. Can you give me an example?

Will they primarily work by writing SQL queries, or do they actually need to create advanced models?

Billy - We need to create a data model for analytics.

Do you need to be able to query the data with SQL?

Billy - Yes, I know SQL and am able to query the data with it.

Do you or analysts have any analytics tools you prefer to use, or are you open to any ?

Billy - I still don’t know which tools are best. You are the expert; can you recommend both open-source and commercial tools to us?

What is the size of the data? (You said “small,” but can you quantify this in a ballpark estimate for “millions of rows”?)

Billy - Different tables have different sizes. For the initial stage, I want to process two tables.

Billy - One table has around half a million rows and the other has millions of rows.

How do you want to balance cost vs. time to implement and maintain? (Often there’s a trade-off where cheaper options require more labor to implement and maintain.)

Billy - Given the current economic recession, we will definitely go with the cheaper option, even if it requires a bit more labor to implement and maintain.

How frequently will you query the data?

Billy - We query the data monthly now; that may change based on your suggestions for the new analytics platform.

Do you have any other constraints? (A specific cloud provider you have to use, data privacy requirements, etc.)

Billy - No specific cloud provider constraints.

Billy - We don’t have formal data privacy requirements in our company,

Billy - but the data belongs to the company and cannot be disclosed to outside parties.

Depending on the answers to those questions, some options might be (this is by no means an exhaustive list, just a few options off the top of my head!):

AWS Redshift

Google BigQuery

Snowflake

A standard relational database (such as Postgres, MySQL, Oracle)

A data lake, implemented for example as csv / parquet / json files in AWS S3

Billy - We are using an MS SQL Server database.

If your data size is actually relatively small, and you anticipate it will remain relatively small for the near future (1-3 years), I would strongly encourage you to avoid “big data” tools like Hadoop, Spark, MapReduce, etc. These tools typically only make sense if your data is very large –

Billy - In your experience, how large does the data need to be to justify such tools?

otherwise they just add needless complexity and make every task take 4 times as long as it needs to.

You can get a long way with a simple relational database, some SQL queries, and occasionally exporting data to model using scikit-learn!

You may also explore tools like Mode Analytics (which we use at Dataquest), Sisense/Periscope, AWS Athena, etc., which may make your analytics work much easier.

Billy - All of the above tools run in the cloud, right?

Billy - If so, our boss doesn’t like putting company data in the cloud. Are there other tools that can do this? (Open source is better.)

P.S. The epidemic is widespread now; I wish all the members of Dataquest good health.

Hi Darla,

Any news ?

Best Regards
Billy Lee

Hi Billy,
I understand to a good extent what you are looking for in pursuing a small data platform, and I would say you are lucky to have this kind of project in hand. I just happened to see your post and thought I would write down a few things I can suggest, which are open source but require a lot of effort and patience to implement.

You can start with Hive and then use Spark SQL or Spark DataFrames (I prefer DataFrames due to their performance and debugging advantages) for distributed processing of your datasets.

Phase 1:
Use HDFS to store your data and Hive to build analytical SQL queries on top of it. Maybe build Jupyter notebooks for specific use cases with frequent queries.
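
As a rough illustration (host, database, and table/column names are made up, and this assumes the PyHive package and HiveServer2 are available), a notebook cell that runs an analytical Hive query could look something like this:

```python
# Rough sketch of running an analytical Hive query from a Jupyter notebook via PyHive.
# Host, port, database, and table/column names are placeholders.
import pandas as pd
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, database="company_dw")

# Aggregate data that was loaded into HDFS and exposed as a Hive table.
monthly_sales = pd.read_sql(
    """
    SELECT year(order_date) AS yr,
           month(order_date) AS mth,
           SUM(order_amount) AS total_amount,
           COUNT(*)          AS num_orders
    FROM orders
    GROUP BY year(order_date), month(order_date)
    ORDER BY yr, mth
    """,
    conn,
)

print(monthly_sales.head())
```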

Phase 2:
Replace HiveQL queries with Spark SQL queries or Spark DataFrames once you have polished your datasets (data cleaning/cleansing; you can have Jupyter notebooks linked to the Spark cluster). This will lead to faster processing of your data, depending on the size of your cluster/private cloud.
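
For example, the equivalent Spark DataFrame code in a notebook attached to the cluster could look roughly like the sketch below; table and column names are again placeholders:

```python
# Rough sketch of the same kind of aggregation using Spark DataFrames instead of HiveQL.
# Table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("sme-analytics")
    .enableHiveSupport()   # lets Spark read the existing Hive tables
    .getOrCreate()
)

orders = spark.table("company_dw.orders")

# Basic cleaning/cleansing: drop rows with missing amounts and obvious bad values.
clean = orders.dropna(subset=["order_amount"]).filter(F.col("order_amount") > 0)

monthly_sales = (
    clean
    .groupBy(F.year("order_date").alias("yr"), F.month("order_date").alias("mth"))
    .agg(F.sum("order_amount").alias("total_amount"),
         F.count("*").alias("num_orders"))
    .orderBy("yr", "mth")
)

monthly_sales.show()
```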

Phase 3:
Build data pipelines using either Jenkins or Python programs to implement the above scenarios, with visualization using Python libraries like seaborn or similar libraries of your choice.
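
A stripped-down pipeline step with a seaborn chart might be as simple as the sketch below (file paths and column names are placeholders); you would then schedule it with cron, Jenkins, or whatever orchestrator you prefer:

```python
# Rough sketch of one pipeline step: read the aggregated output, plot it with
# seaborn, and save the chart. File paths and column names are placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def run_report(input_path="monthly_sales.csv", output_path="monthly_sales.png"):
    df = pd.read_csv(input_path)
    df["period"] = df["yr"].astype(str) + "-" + df["mth"].astype(str).str.zfill(2)

    plt.figure(figsize=(10, 5))
    sns.lineplot(data=df, x="period", y="total_amount")
    plt.xticks(rotation=45)
    plt.title("Monthly sales")
    plt.tight_layout()
    plt.savefig(output_path)

if __name__ == "__main__":
    run_report()
```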

However, I may not be able to reply to your further queries right away if you have not yet started on your project; I will try to reply, though with some delay.

Regards
Harsha

Hi Harsha,

Thanks for being the first to reply to my post.

Sorry for the late reply; I have been busy recently.

I have added some comments below. Please check. Thanks!

I understand to a good extent what you are looking for in pursuing a small data platform, and I would say you are lucky to have this kind of project in hand. I just happened to see your post and thought I would write down a few things I can suggest, which are open source but require a lot of effort and patience to implement.
Billy – No problem for me to use open-source software, because we have already been using it, such as Postfix, the Apache web server on Linux, etc.

You can start with Hive and then use Spark SQL or Spark DataFrames (I prefer DataFrames due to their performance and debugging advantages) for distributed processing of your datasets.

Phase 1:
Use HDFS to store your data and Hive to build analytical SQL queries on top of it. Maybe build Jupyter notebooks for specific use cases with frequent queries.
Billy –

  1. OK. I will install Hadoop first.
  2. Since we use SQL Server as our transactional database, do we need to export all of the data to HDFS daily, or should we use Sqoop to connect SQL Server and Hive and transfer the data into Hive?
  3. Should we use Jupyter notebooks to build the frequent queries or data models, and then save them as .py files that can be executed by a cron job, etc.?

Phase 2:
Replace HiveQL queries with Spark SQL queries or Spark DataFrames once you have polished your datasets (data cleaning/cleansing; you can have Jupyter notebooks linked to the Spark cluster). This will lead to faster processing of your data, depending on the size of your cluster/private cloud.
Billy –

  1. Of course, I will follow your suggestion to use Spark DataFrames.
  2. How do I process the data for my desired analytics, such as predictive analysis or outlier detection, etc.? Can you provide some links for me?

Phase 3:
Build data pipelines using either Jenkins or Python programs to implement the above scenarios, with visualization using Python libraries like seaborn or similar libraries of your choice.
Billy –

  1. To build data pipelines using Python to implement the above scenarios with visualization, can I follow the link below?
    https://www.dataquest.io/blog/data-pipelines-tutorial/
  2. Can I use other visualization tools, such as Power BI or QlikView, etc.?

However, I may not be able to reply to your further queries right away if you have not yet started on your project; I will try to reply, though with some delay.
Billy - I will start my project soon, but I need to know the structure first. That means I need to install and use at least the open-source systems below, right?

Must use

Apache Hadoop
Apache Hive
Apache Spark

May use

Sqoop
Power BI
QlikView

Best Regards
Billy

I can recommend InetSoft. They also have a free online version for personal use called Visualize Free.