Using current data missing gross_square_feet

,

I am missing the gross_square_feet values when using current data. Either I made a mistake in data cleaning (quite possible) or somehow the current data doesn’t have these values for the condominiums

Data gathered from here and I created the csv file manually using the same steps

This occurs when I try to create scatterplots and it says removing NA values

@casey as you are the solutions author any steps that I missed?

# Filter out property exchanges between family members with a threshold of 10,000
NYC_property_sales %>% filter(sale_price > 10000) %>%
# Remove properties that have gross_square_feet of 0 
filter(gross_square_feet > 0) %>%
# Remove NA values in both sale_price and gross_square_feet
drop_na(c(gross_square_feet, sale_price))

# Arrange by borough and neighborhood
NYC_property_sales <- NYC_property_sales %>% arrange(borough, neighborhood)

It’s this filtering that causes the issue

NYC_Condos <- NYC_property_sales %>% filter(building_class_at_time_of_sale == "R4")

head(NYC_Condos$gross_square_feet)

The gross_square_feet column returns NA which means that I can’t plot anything :smiley:

As an fyi I wasn’t saving the result :smiley: but when I did all the condo information was swept away as gross_square_feet was NA in this dataset

Also my notebook is here

Hi @manandwa. I do not see any obvious issues with your code, but I have not run it from start to finish myself. How does the NYC_property_sales variable look before you filter for R4? Have you inspected the dataframe to see if there are values contained in the building_class_at_time_of_sale column?

Are you using RStudio? If so, I suggest that you restart R and clear everything in your environment and then run all of the code again. This way, if you messed anything up in a previous code run you can “start fresh” and run it from the beginning again. I recommend running code chunks one at a time, from top to bottom, and inspecting the variable along the way. See this section of our new blog post on R Markdown for guidance.

How does the dataframe look when you inspect it along the way? If you look at the dataframe after the filtering that causes the issue, what do you see in the data?

Best,
-Casey

1 Like

As a next step, I’d recommend posting the CSV file you created and I can take a look at that. Best,
-Casey

1 Like

Hi Casey,

So I’ve restarted R several times and cleared and recreated the CSV. I even started with one of the excel files and got the same result.

Here is the csv file (the dataframe only shows the issue when I remove NA values)NYC_property_sales.csv (3.3 MB)

Before I filter for R4 I have gross_square_feet column filled with actual values so I wonder did I create NA values and not realize it?

I’ve also taken your suggestion and ran each chunk one at a time. When I filter with building_class_at_time_of_sale == "R4" it returns a dataframe with nothing. This leads me to suspect that the issue is occurring when removing NA values.

I confirmed this by restarting R and clearing output in RStudio and running each chunk one at a time.
when I run this code NYC_property_sales %>% filter(building_class_at_time_of_sale == "R4") before removing NA values I get a dataframe (this is in a separate chunk). I then run the code that remove NA values along with filtering gross_square_feet and sale_price and rerun NYC_property_sales %>% filter(building_class_at_time_of_sale == "R4") in its own chunk and get nothing.

Also inspecting the values shows NA in gross_square_feet before removing NA values (Visual inspection)

Should I manually get the gross square feet for each one of these values (tedious) or should I use the tidyr fill function to fudge this? I was really trying to use the latest data and go through the process of cleaning and preparing it for analysis (as I know that is 90% of work)

This seems like a valid suspicion.

1 Like

You shouldn’t have to do it manually. What happens if you plot the data after you filter for R4 but before removing NAs? Does ggplot omit the NAs?

Maybe the NAs were removed when you filtered for gross square feet above zero with filter(gross_square_feet > 0)? Looking at your CSV, I don’t see any NAs in the gross square feet column. What happens if you skip the step of removing NAs?

ggplot does remove the NA values so much so that I get no plot :smiley:

I will try skipping the step of removing NAs and seeing what happens

So I think I’ve found the underlying issue, but I don’t have the solution yet. Thank you for providing the csv. I do not see “R4” in the data. Is it possible that NYC changed their building codes? I did not see that indicated here. So could R4 have been filtered out in an earlier step? I did not recreate all of the code in your notebook, I started by reading in the CSV you provided.

What happens if you back up in the process? Do you see R4 in the data earlier in the code? When does it dissapear?

1 Like

I see R4 in the raw data for Brooklyn. So I suggest going step-by-step to figure out where these values are removed from the data.

So here is the process I’ve done so far (and thank you for going through this with me step by step)

I removed line of code that remove NA values (I restarted R and went through chunk by chunk). My code is very similar to the solution code (I use the solutions to help me understand the instructions). When I don’t remove NA I get the data for condos in NYC but gross square feet is still NA.

I even tried filtering for R1 (CONDO; RESIDENTIAL UNIT IN 2-10 UNIT BLDG.) and still got the same results of gross square feet being NA

However, I did notice this:
Gross Square Feet: The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property. (the definitions are from here

I will go through step by step with just the raw data for brooklyn and see when they are removed but why they would be removed as I’m not touching that column

I think the issue is with the raw data itself

So I ran this as a test script for just the brooklyn data (no processing at all just filtering for R4)

This is the code I ran

library(readxl)
library(readr)
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
library(ggplot2)

# Load data
brooklyn <- read_excel("rollingsales_brooklyn.xls", skip = 4)

test <- brooklyn %>% filter(`BUILDING CLASS AT TIME OF SALE` == "R4") %>% head()

This created a 6 row by 21 column dataframe called and looking at GROSS SQUARE FEET gives NA.

So I am guessing that either somebody forgot to enter this in or this is not the raw data I am looking for

head() will only show you the first six rows, so it’s possible that the first six just happen to have NA for GROSS SQUARE FEET

What do you see if you run:

test <- brooklyn %>% filter(`BUILDING CLASS AT TIME OF SALE` == "R4")

unique(test$`BUILDING CLASS AT TIME OF SALE`)

Does R4 show up in the list of unique values?

And what if you view this dataframe in the RStudio viewer where you can see all of the values. If you scroll up and down, do you see values in the cells?

So I recreated test to show all the variables using test_overall


test_overall <- brooklyn %>% filter(`BUILDING CLASS AT TIME OF SALE` == "R4")

This is a 2335 by 21 dataframe and looking at the values of this in the RStudio viewer shows NA values for GROSS SQUARE FEET

When I run

unique(test_overall$`BUILDING CLASS AT TIME OF SALE`)

I get R4 so that checks out

When I run

gsf <- unique(test_overall$`GROSS SQUARE FEET`)

gsf returns NA_real

So I’m at a loss here

And just to verify what building codes don’t show NA I ran the following

gsf_no_na <- brooklyn %>% filter(`GROSS SQUARE FEET` != "NA")
unique(gsf_no_na$`BUILDING CLASS AT TIME OF SALE`)

The output of that unique command lead to this

[1] "S1" "A5" "A9" "A2" "A1" "B1" "B2" "B3" "S2" "B9" "C0" "V0" "C2" "C3" "C1" "S3" "K4" "K1" "K2" "K9" "E1" "I5" "A4" "A3" "C7" "D1" "S5" "S4" "S9" "O8"
 [31] "K5" "G2" "G7" "G4" "E9" "V1" "M1" "Z9" "V2" "C4" "C5" "D3" "D7" "D6" "O2" "H3" "F4" "F5" "F9" "G1" "G9" "E2" "W9" "M9" "O7" "A0" "K6" "S0" "V3" "O1"
 [61] "O5" "W2" "W6" "O9" "GU" "W8" "P3" "Z0" "RR" "E7" "P9" "N9" "F2" "G6" "N2" "G0" "I6" "K8" "F1" "P5" "P6" "A7" "Q1" "D8" "J2" "W1" "Z1" "GW" "K3" "G5"
 [91] "G8" "Z3" "I1" "Q8" "V9" "Q9" "J9" "D9" "K7" "P2" "H8" "H9" "D5" "D2" "T2" "O4"
1 Like

I’ll investigate this next week. Best,
-Casey

1 Like

Great, should I redownload the datasets if they change?

So I redownloaded the data and ran the same script and still go the same results which could mean that they changed definitions on what gross square feet means (and I even tried this on R5 building code and still got NA values for gross square feet)

1 Like