Thanks for your reply.
I think that the first question you need to answer, is how you’re going to store the data (that is, what kind of data warehouse best suits your needs).
Billy - Any comments on Apache Hive? If it's not a good fit, do you have another suggestion?
Some questions you need to answer are:
Specifically which analytics tasks are most important?
Billy - For the initial stage, we may perform outlier detection and predictive analytics first; other analytics will be performed later.
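For the outlier detection part, here is a minimal sketch using scikit-learn's `IsolationForest` on toy data (the values and the `contamination` rate are placeholder assumptions, not tuned for your tables):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data standing in for one of your tables: mostly normal values
# plus a couple of obvious outliers appended at the end.
rng = np.random.default_rng(0)
normal = rng.normal(loc=100.0, scale=5.0, size=(500, 2))
outliers = np.array([[500.0, -200.0], [300.0, 400.0]])
X = np.vstack([normal, outliers])

# contamination is the expected fraction of outliers -- an assumption
# you would adjust based on your domain knowledge.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier

print((labels == -1).sum())  # how many rows were flagged
```

The same `fit_predict` call works unchanged on a DataFrame of numeric columns pulled from your database.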
How will analysts need to work with the data?
Billy - Sorry, I don’t understand what you mean. Can you give me an example?
Will they primarily work by writing SQL queries, or do they actually need to create advanced models?
Billy - We need to create a data model for analytics.
Do you need to be able to query the data with SQL?
Billy - Yes, I know SQL and am able to query the data with it.
Do you or the analysts have any analytics tools you prefer to use, or are you open to anything?
Billy - I still don’t know which tools are best. You are the expert; can you recommend open source and commercial tools to us?
What is the size of the data? (You said “small,” but can you quantify this in a ballpark estimate for “millions of rows”?)
Billy - Different tables have different sizes. In the initial stage, I want to process two tables.
Billy - One table is around half a million rows and the other is around a few million rows.
How do you want to balance cost vs. time to implement and maintain? (Often there’s a trade-off where cheaper options require more labor to implement and maintain.)
Billy - Given the current economic recession, we would definitely prefer the cheaper options, even if they require a bit more labor to implement and maintain.
How frequently will you query the data?
Billy - We query the data monthly now, but that may change based on your suggestions for the new analytics platform.
Do you have any other constraints? (A specific cloud provider you have to use, data privacy requirements, etc.)
Billy - No specific cloud provider constraints.
Billy - We have no additional data privacy requirements in our company,
Billy - but the data does belong to the company and cannot be disclosed outside.
Depending on the answers to those questions, some options might be (this is by no means an exhaustive list, just a few options off the top of my head!):
A standard relational database (such as Postgres, MySQL, Oracle)
A data lake, implemented for example as csv / parquet / json files in AWS S3
Billy - We are using a MS SQL Server database.
If your data size is actually relatively small, and you anticipate it will remain relatively small for the near future (1-3 years), I would strongly encourage you to avoid “big data” tools like Hadoop, Spark, MapReduce, etc. These tools typically only make sense if your data is very large –
Billy - How large does the data need to be before such tools are worthwhile, in your experience?
otherwise they just add needless complexity and make every task take 4 times as long as it needs to.
You can get a long way with a simple relational database, some SQL queries, and occasionally exporting data to model using scikit-learn!
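To make that workflow concrete, here is a hedged sketch of querying a relational database and exporting the results into a scikit-learn model. It uses an in-memory SQLite database as a stand-in so it runs anywhere; for SQL Server you would connect via pyodbc or SQLAlchemy instead, but the steps are the same. The `sales` table and its columns are invented for illustration:

```python
import sqlite3
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in database. With SQL Server you would swap this connection
# for pyodbc/SQLAlchemy; the query-then-model workflow is identical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (units INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(u, 9.5 * u + 3.0) for u in range(1, 101)],  # fabricated linear data
)

# Step 1: export query results out of the database.
rows = conn.execute("SELECT units, revenue FROM sales").fetchall()
X = np.array([[r[0]] for r in rows], dtype=float)
y = np.array([r[1] for r in rows])

# Step 2: fit a simple predictive model on the exported data.
model = LinearRegression().fit(X, y)
print(round(model.coef_[0], 2))  # prints 9.5, the slope in the toy data
```

At millions of rows this pattern still works fine on a single machine, which is part of why the "big data" stack is usually overkill at your scale.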
You may also explore tools like Mode Analytics (which we use at Dataquest), Sisense/Periscope, AWS Athena, etc., which may make your analytics work much easier.
Billy - All of the above tools run in the cloud, right?
Billy - If yes, our boss doesn’t like putting his company’s data in the cloud. Are there other tools that can do this? (Open source is better.)
P.S. The epidemic is widespread right now. Wishing all the members of Dataquest good health.