In this project I did two main things: optimized the dataset's memory usage, then analyzed it using sqlite3. The dataset we'll be exploring is startup investments from Crunchbase.com, current as of October 2013.
I would appreciate constructive feedback and would love to hear your ideas for improving it.
link - https://github.com/Abidzar16/Analyzing-Startup-Fundraising/blob/master/Exploration.ipynb
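For a quick sense of the kind of memory optimization I mean, here is a simplified sketch (the column names here are just stand-ins, not the actual Crunchbase columns):

```python
import pandas as pd

# Toy frame standing in for the Crunchbase data (columns are assumptions).
df = pd.DataFrame({
    "funding_round_type": ["series-a", "angel", "series-a", "angel"] * 1000,
    "raised_amount_usd": [1.0e6, 2.5e5, 3.0e6, 5.0e5] * 1000,
})

before = df.memory_usage(deep=True).sum()

# Low-cardinality text columns -> category dtype;
# numeric columns -> the smallest dtype that fits the values.
df["funding_round_type"] = df["funding_round_type"].astype("category")
df["raised_amount_usd"] = pd.to_numeric(df["raised_amount_usd"],
                                        downcast="float")

after = df.memory_usage(deep=True).sum()
print(before, after)  # memory footprint shrinks noticeably
```

The notebook linked above applies this idea column by column to the real dataset.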
Welcome to the DQ community and thank you for sharing your project with us.
Without diving deep into the technical aspects of the project, I wish to say:
- The project has clearly defined sections and subsections, and each section addresses a question. One addition would be a markdown cell summarizing the results, either per subsection or collectively.
- section 2. Question
  - sub-section 2.1
    - sub-subsection 2.1.1
    - sub-subsection 2.1.2
To conclude …
I didn't quite get section 2.1, but then that's the technical aspect, I guess. Basically, why do we have 10 rows per chunk? Please do let me know what exactly I am missing here (and why the data file is re-read at that point).
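For context, here is how I understand chunked reading to work in pandas (toy data; the real file and its columns may differ):

```python
import io
import pandas as pd

# Toy CSV standing in for the Crunchbase file (column names are assumptions).
csv_text = "company,raised_amount_usd\n" + "\n".join(
    f"co{i},{i * 1000}" for i in range(25)
)

# chunksize bounds how many rows sit in memory at once;
# 10 here is just a demo value, like in the notebook.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=10)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # 25 rows in chunks of 10 -> [10, 10, 5]

# The iterator is now exhausted; a second pass needs a fresh
# read_csv call -- which may be why the data file gets re-read.
```

If that's the reason for the re-read, a brief markdown note in the notebook would make it clearer.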
Moving on to the SQL queries: how did you derive the top 10% and top 1% based on the number of records? I get the sorting, but what I don't get is using a fixed row-count threshold rather than conditions or aggregate calculations.
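To illustrate the distinction I mean, here is a sketch with a toy table (table and column names are my assumptions): a row-count cut sorts and then keeps a fixed number of rows with `LIMIT`, whereas a condition-based cut filters on the values themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investments (company TEXT, raised_amount_usd REAL)")
conn.executemany(
    "INSERT INTO investments VALUES (?, ?)",
    [(f"co{i}", i * 1000.0) for i in range(1, 101)],  # 100 toy rows
)

# Row-count approach: sort, then keep the first 10% of 100 rows.
top_by_rows = conn.execute("""
    SELECT company, raised_amount_usd
    FROM investments
    ORDER BY raised_amount_usd DESC
    LIMIT 10
""").fetchall()

# Condition-based alternative: keep rows above a value threshold.
top_by_value = conn.execute("""
    SELECT company, raised_amount_usd
    FROM investments
    WHERE raised_amount_usd >= 91000
""").fetchall()

print(len(top_by_rows), len(top_by_value))  # both select 10 rows here
```

Both give the same rows on this toy data, but they answer slightly different questions, which is what prompted mine.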
One thing I would like to highlight is the SQL query formatting you have used: keywords and functions in caps, column names and the rest in lowercase.
However, the result column is left unnamed; an alias such as
SUM(raised_amount_usd) AS "Amount Raised (USD)"
would help. The scientific notation of the summed results also takes away from the presentation of the query results.
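A sketch of both fixes together (toy data; the column names are assumptions): alias the aggregate in SQL, then set a pandas display format so the sums don't render in scientific notation.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investments (category TEXT, raised_amount_usd REAL)")
conn.executemany("INSERT INTO investments VALUES (?, ?)",
                 [("software", 2.5e9), ("biotech", 1.2e9), ("software", 5.0e8)])

# AS gives the aggregate a readable column name in the result set.
df = pd.read_sql("""
    SELECT category,
           SUM(raised_amount_usd) AS "Amount Raised (USD)"
    FROM investments
    GROUP BY category
""", conn)

# Thousands separators instead of scientific notation when printing.
pd.options.display.float_format = "{:,.0f}".format
print(df)
```

The float format is a display-only setting, so it doesn't affect the underlying values or any later calculations.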
This was a learning experience for me as well. I hope to see more projects from you.