Analyzing Startup Fundraising Deals from Crunchbase

In this project, I did two main things: I optimized the dataset's memory usage, and then analyzed the dataset using sqlite3. The dataset covers startup investments from Crunchbase.com and is current as of October 2013.

I would like some constructive feedback and to hear your ideas for improving it.

link - https://github.com/Abidzar16/Analyzing-Startup-Fundraising/blob/master/Exploration.ipynb

Hi @abidzarisprivate

Welcome to the DQ community and thank you for sharing your project with us.

Without diving deep into the technical aspects of the project, I wish to say:

  • The project has clearly defined sections and subsections. :ok_hand: Each section addresses one question. One addition would be a markdown cell summarizing the results of the subsections, either individually or collectively.
    For example:
- Section 2. Question
   - Sub-section 2.1
      - Sub-subsection 2.1.1
      - Sub-subsection 2.1.2

To conclude …

I didn’t quite follow section 2.1, though that may just be the technical side of it. Basically, why do we have 10 rows per chunk, and why are we re-reading the data file here? Please let me know what I am missing.

Moving on to the SQL queries: how did you derive the top 10% and top 1% of the data from the number of records? I get the sorting, but I don’t understand why the cutoff is a number of rows rather than a condition or an aggregate calculation.
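To make the question concrete, here is a minimal sketch of the two kinds of cutoff I have in mind. It is not taken from your notebook: the table name `investments`, the column `company_name`, and the 20 made-up rows are all assumptions for illustration; only `raised_amount_usd` comes from your queries.

```python
import sqlite3

# Hypothetical in-memory table with illustrative data (not the Crunchbase file).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investments (company_name TEXT, raised_amount_usd INTEGER)"
)
rows = [(f"company_{i}", i * 1_000_000) for i in range(1, 21)]
conn.executemany("INSERT INTO investments VALUES (?, ?)", rows)

# Cutoff A: "top 10%" as a number of rows (10% of the record count).
total_rows = conn.execute("SELECT COUNT(*) FROM investments").fetchone()[0]
limit = total_rows // 10
top_by_count = conn.execute(
    """
    SELECT company_name, raised_amount_usd
      FROM investments
     ORDER BY raised_amount_usd DESC
     LIMIT ?
    """,
    (limit,),
).fetchall()

# Cutoff B: a condition built on an aggregate
# (here: at least 90% of the largest amount raised).
top_by_value = conn.execute(
    """
    SELECT company_name, raised_amount_usd
      FROM investments
     WHERE raised_amount_usd >= (SELECT MAX(raised_amount_usd) * 9 / 10
                                   FROM investments)
     ORDER BY raised_amount_usd DESC
    """
).fetchall()

print(len(top_by_count), len(top_by_value))
conn.close()
```

The two cutoffs can select different sets of rows from the same data, which is why I was curious about the choice of a row count.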

One thing I would like to highlight is the SQL query formatting you have used: keywords and functions in caps, column names etc. in lowercase. :+1:

However, leaving the result column unnamed, where an alias such as
SUM(raised_amount_usd) AS "Amount Raised (USD)"
would help, and showing the sums in scientific notation both detract from the presentation of the query results.
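As a rough sketch of what I mean, assuming a hypothetical `investments` table with a `category` column and made-up amounts (only `raised_amount_usd` and the alias come from the suggestion above), the alias becomes the column header and the totals can be formatted with thousands separators in Python:

```python
import sqlite3

# Hypothetical in-memory table with illustrative data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investments (category TEXT, raised_amount_usd REAL)")
conn.executemany(
    "INSERT INTO investments VALUES (?, ?)",
    [("web", 1.5e9), ("web", 2.5e9), ("biotech", 7.0e8)],
)

cur = conn.execute(
    """
    SELECT category,
           SUM(raised_amount_usd) AS "Amount Raised (USD)"
      FROM investments
     GROUP BY category
     ORDER BY SUM(raised_amount_usd) DESC
    """
)

# The alias shows up as the column name in the cursor description.
headers = [col[0] for col in cur.description]
results = cur.fetchall()

# Format each total with thousands separators instead of scientific notation.
formatted = [(category, f"{total:,.0f}") for category, total in results]

print(headers)
for category, total in formatted:
    print(category, total)
conn.close()
```

If the results are displayed through pandas instead, setting the `display.float_format` option should have a similar effect on how the sums are shown.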

This was a learning experience for me as well. Hope to see more projects from you. :slight_smile: