Selecting multiple proportional samples iteratively from dataframe

Hey all,

I have spent 8.5 hours trying to figure out my problem, and I come up just short!! :smiley:

I am cleaning the nyc data for a project.

What I am trying to do is fill in missing zip code values based on a string match.

  • I have about 12 different string matches, therefore 12 different groups all with NaN zip codes.
  • I am filling in the zip code for these 12 different groups based on the proportion of the string matches in the main dataframe that have existing zip codes filled in. For example:
    • I have a sub dataset that contains all string matches for “bronx” with NaN zip codes. It is 1000 records.
    • In my main dataset, I have the following distribution for the string matches bronx:
      - 11111 - 80%
      - 22222 - 12%
      - 33333 - 8%
  • Therefore, I want 80% of my sub dataset of 1000 records to have the zip code 11111, 12% to be 22222, and 8% to be 33333.
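In toy form, the proportional fill I am after looks something like this (names and data are made up; this one-shot draw with `np.random.choice` only hits the proportions approximately, which is why I want exact counts below):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical stand-in for the "bronx" sub dataset with 1000 missing zip codes
sub = pd.DataFrame({"zip_code": [np.nan] * 1000})

# Observed distribution among "bronx" matches that do have a zip code
zips = ["11111", "22222", "33333"]
props = [0.80, 0.12, 0.08]

# Draw one zip code per missing row according to the proportions
sub["zip_code"] = rng.choice(zips, size=len(sub), p=props)
```

With this approach the counts come out near 800/120/80 but vary run to run.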

Rather than doing this manually, I wanted to create a series of functions to do this. Here is the example of the issue:

Distribution of the string matches for "triborough bridge" in the main dataset that have completed zip codes. This is called df_pop_fill:

index zip_code
0 10035 0.800164
1 11102 0.134316
2 10454 0.064701
3 10007 0.000819

Example of a sub dataset (908 records) of the same string matches that have NaN zip codes. This is called rfk:

zip_code latitude longitude on_street_name
32325 NaN NaN NaN triborough bridge
41576 NaN NaN NaN triborough bridge
45937 NaN NaN NaN triborough bridge
46476 NaN NaN NaN triborough bridge
50485 NaN NaN NaN triborough bridge
dict_1 = {}
sample = rfk.copy()
for number in range(0, len(df_pop_fill)):
    n = (df_pop_fill.iloc[number,1] *len(rfk)).round()

    sample = rfk.sample(n = int(n), replace = False, random_state = 1)
    dict_1[df_pop_fill.iloc[number,0]] = sample_temp1.index.values
    # sample = sample[~sample.index.isin(sample_temp1.index)]  <-- unfortunately, does not update outside of the for loop
dict_1
--------------
# rest of the code once successful
df = pd.DataFrame.from_dict(dict_1, orient = "index").T.unstack().dropna().reset_index(level=1, drop = True).astype(int)
df = pd.DataFrame(df)
df["zip_code"] = df.index
df.set_index(0,inplace = True)
df.index.name = None

* Then update rfk with these values.

The issue is:

  • I get duplicate values in dict_1 because I am unable to update the sample dataframe after each iteration. Ultimately, I am trying to randomly sample without replacement, then keep picking the next batch of records based on the next percentage, and so on, until there are no records left.

Secondly, once I am able to get this example working correctly, it will still return an error: because of rounding in the percentages, the counts currently add up to 909 records instead of 908. I would need to code something to counteract this.
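One idea I had to counteract the rounding (a sketch with a hypothetical helper, not tested against my real data): floor every count, then hand the leftover rows to the groups with the largest fractional remainders, so the counts sum exactly to the target.

```python
import numpy as np
import pandas as pd

def proportional_counts(props: pd.Series, n_total: int) -> pd.Series:
    """Integer counts per group summing exactly to n_total (largest remainder)."""
    raw = props * n_total
    counts = np.floor(raw).astype(int)
    shortfall = n_total - counts.sum()
    # Hand the leftover rows to the groups with the largest fractional parts
    largest = (raw - counts).sort_values(ascending=False).index[:shortfall]
    counts.loc[largest] += 1
    return counts

props = pd.Series([0.800164, 0.134316, 0.064701, 0.000819],
                  index=["10035", "11102", "10454", "10007"])
print(proportional_counts(props, 908))  # counts sum to exactly 908
```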

Any advice is appreciated!
Anthony

I read this post three times, and still most of it makes no sense to me.
Here is a list of questions I have.

  1. What is a string match? Are you talking about regex? Something specific to the problem?
  2. Why specify 12 different groups? Is the number 12 important to the question? If it is, why not show all of them, or at least more than one example? The only places I see string match mentioned are
  3. Where did this distribution come from? How is it calculated? What was the previous data about?
  4. A subset of 1000 zip codes was mentioned, but later the first 5 rows of another subset of 908 were shown. How is the 1000 related to the 908?

Are the headers wrong? Should it be zip_code, probability?

Previously the discussion of string matches was about bronx; suddenly this example turned to triborough bridge. Are bronx and triborough bridge the same type of concept (e.g. in the same column of a table)? Why the jump between examples?

Could you prepare a minimal example (that your code can be run on and generate the same issues you face), so others can make changes to your code, or rewrite from scratch.

In its current state it is extremely hard to understand what you’re trying to achieve: what is important and what is not, why you are iterating rather than using vectorized methods, why you are not hardcoding range(0, len(df_pop_fill)) as a number, why dict_1 is required to exist, what the initial input data looks like, and what the output data should look like.

Hi,

I tried to make it more precise, but it seems that did not work.

Main dataframe, mvc2 (sorry, format shifts)

	date_time	borough	zip_code	latitude	longitude	on_street_name	cross_street_name	off_street_name	num_pers_injured	num_pers_killed	...	cause_veh_2	cause_veh_3	cause_veh_4	cause_veh_5	collision_id	veh_type_1	veh_type_2	veh_type_3	veh_type_4	veh_type_5
0	04/18/14 09:00 PM	NaN	NaN	NaN	NaN	FOREST AVENUE	BARD AVENUE	NaN	0	0	...	NaN	NaN	NaN	NaN	322250	PASSENGER VEHICLE	NaN	NaN	NaN	NaN
1	05/09/14 12:15 PM	QUEENS	11434	40.661663	-73.765499	181 STREET	146 DRIVE	NaN	1	0	...	Failure to Yield Right-of-Way	NaN	NaN	NaN	332671	PASSENGER VEHICLE	SPORT UTILITY / STATION WAGON	NaN	NaN	NaN
2	04/29/14 09:30 AM	BROOKLYN	11210	40.628751	-73.951106	AVENUE I	EAST 26 STREET	NaN	0	0	...	Unspecified	NaN	NaN	NaN	327772	PASSENGER VEHICLE	SPORT UTILITY / STATION WAGON	NaN	NaN	NaN
3	05/12/14 05:25 PM	MANHATTAN	10075	40.772655	-73.955552	EAST 78 STREET	2 AVENUE	NaN	0	0	...	Other Vehicular	NaN	NaN	NaN	333807	PASSENGER VEHICLE	PASSENGER VEHICLE	NaN	NaN	NaN
4	04/18/14 01:10 PM	MANHATTAN	10024	40.787002	-73.975513	WEST 85 STREET	AMSTERDAM AVENUE	NaN	1	0	...	Unspecified	NaN	NaN	NaN	321954	TAXI	BICYCLE	NaN	NaN	NaN
5 rows × 27 columns

From my main dataset mvc2 I have extracted all records that match the strings in the str.contains part and that have:

  • a null zip code,
  • a null latitude, and
  • a null longitude.

Here are four examples of the twelve:

# remove whitespace before and after; make all lower case; replace whitespace between words with one space
mvc2.on_street_name = mvc2.on_street_name.str.strip().str.lower().str.replace(r"\s+", " ", regex = True)

# Extract all records in dataframe that have null zip code, null latitude, and not null on_street_name
mvc2_null_fill_bridge = mvc2[(mvc2.zip_code.isnull()) & (mvc2.latitude.isnull()) & (mvc2.on_street_name.notnull())]

#Extract sub datasets with particular string matches

narrows = mvc2_null_fill_bridge.on_street_name.str.contains("narrows bridge|verrazzano bridge|verrizano bridge")

brooklyn = mvc2_null_fill_bridge.on_street_name.str.contains("brooklyn bridge")

koch_queens = mvc2_null_fill_bridge.on_street_name.str.contains("koch|queensboro")

rfk = mvc2_null_fill_bridge[mvc2_null_fill_bridge.on_street_name.str.contains("triborough bridge|rfk bridge|robert f. kennedy|robert f kennedy|rfk")]

Example output of the last one:

date_time borough zip_code latitude longitude on_street_name cross_street_name off_street_name num_pers_injured num_pers_killed cause_veh_2 cause_veh_3 cause_veh_4 cause_veh_5 collision_id veh_type_1 veh_type_2 veh_type_3 veh_type_4 veh_type_5
32325 2020-01-18 13:25:00 NaN NaN NaN NaN triborough bridge NaN NaN 1 0 Not applicable Not applicable Not applicable Not applicable 4277456 Station Wagon/Sport Utility Vehicle Not applicable Not applicable Not applicable Not applicable
41576 2019-12-26 00:00:00 NaN NaN NaN NaN triborough bridge NaN NaN 0 0 Unspecified Not applicable Not applicable Not applicable 4265151 Sedan Sedan Not applicable Not applicable Not applicable
45937 2020-03-18 14:06:00 NaN NaN NaN NaN triborough bridge NaN NaN 0 0 Unspecified Not applicable Not applicable Not applicable 4302591 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle Not applicable Not applicable Not applicable

Total records is 908
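For what it's worth, rather than writing twelve separate str.contains lines, I could drive the twelve sub datasets from a dict of patterns. A sketch on hypothetical miniature data (only four of the twelve patterns shown):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for mvc2_null_fill_bridge
mvc2_null_fill_bridge = pd.DataFrame({
    "zip_code": [np.nan] * 4,
    "on_street_name": ["brooklyn bridge", "triborough bridge",
                       "ed koch queensboro bridge", "verrazzano bridge"],
})

patterns = {
    "narrows": "narrows bridge|verrazzano bridge|verrizano bridge",
    "brooklyn": "brooklyn bridge",
    "koch_queens": "koch|queensboro",
    "rfk": "triborough bridge|rfk bridge|robert f. kennedy|robert f kennedy|rfk",
}

# One sub dataframe per group, all built the same way
groups = {
    name: mvc2_null_fill_bridge[
        mvc2_null_fill_bridge.on_street_name.str.contains(pat)
    ]
    for name, pat in patterns.items()
}
print({name: len(sub) for name, sub in groups.items()})
```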

What I am trying to achieve is a basic imputation problem. I want to fill in the zip code for each of the 908 records in the sub dataframe above (i.e., the records with a particular string match but a missing zip code) based on the proportions of those same string matches in my main dataframe mvc2 that do have a completed zip code.

Working example:

# remove whitespace before and after; make all lower case; replace whitespace between words with one space
mvc2.on_street_name = mvc2.on_street_name.str.strip().str.lower().str.replace(r"\s+", " ", regex = True)
mvc2_null_fill_bridge = mvc2[(mvc2.zip_code.isnull()) & (mvc2.latitude.isnull()) & (mvc2.on_street_name.notnull())]

# rfk dataframe that have all missing zip codes for every string match in on_street_name column
rfk = mvc2_null_fill_bridge[mvc2_null_fill_bridge.on_street_name.str.contains("triborough bridge|rfk bridge|robert f. kennedy|robert f kennedy|rfk")]

#Extract the proportion of the above string matches in the main dataframe mvc2 that have completed zip codes.
df_pop_fill = mvc2_values_extract[mvc2_values_extract.on_street_name.str.contains("triborough bridge|rfk bridge|robert f. kennedy|robert f kennedy|rfk")].zip_code.value_counts(normalize = True).reset_index()

Output of df_pop_fill:

index zip_code
0 10035 0.800164
1 11102 0.134316
2 10454 0.064701
3 10007 0.000819

Now, in the rfk dataframe, which has NaN values for all its zip codes, what I want to do is: randomly sample without replacement 80% of the records and give them the zip code 10035; then from the remaining records randomly sample 13% to have the zip code 11102; then from the remaining records randomly sample 6% to have 10454, etc.
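In toy form (hypothetical miniature data, a fallback rule for rounding leftovers that I made up), the sequential draw I mean would look like this: sample each group only from the still-unassigned rows, then drop those rows from the pool so the next draw cannot pick them again.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for rfk and the proportion table
rfk = pd.DataFrame({"zip_code": [np.nan] * 10})
props = pd.Series([0.80, 0.13, 0.07], index=["10035", "11102", "10454"])

remaining = rfk.copy()
for zip_code, prop in props.items():
    # Never ask for more rows than are still unassigned
    n = min(int(round(prop * len(rfk))), len(remaining))
    picked = remaining.sample(n=n, replace=False, random_state=1)
    rfk.loc[picked.index, "zip_code"] = zip_code
    remaining = remaining.drop(picked.index)  # shrink the pool each pass

# Any rows left over from rounding get the most common zip code
rfk["zip_code"] = rfk["zip_code"].fillna(props.idxmax())
```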

I want to apply this same technique iteratively to each of the string matches (see four of the twelve examples above).

Is it clearer now what I want to achieve?
Anthony

Hi there,

I am unsure if what I provided makes any more sense.

I just found a solution on my own. Even though it is not exactly as my requirements above, it will work.

I appreciate your time,
Anthony

Yes, it was a lot clearer with certain details like the 1000 removed, and I can see more clearly that the outer processing loop you want is over the 12 groups and the inner loop, for rfk specifically, is over 908 rows. I can also see your data processing DAG now, realizing that mvc2_null_fill_bridge is the common starting point for the outer loop, and the conditions on which it was created. However, in the new question there were still magic variables from nowhere (mvc2_values_extract), like sample_temp1 from nowhere in the 1st question.

Below is my understanding of what you want to achieve, how I set up fake data, and the processing steps for the inner loop to fill rfk.

Goal
Fill in a certain column in a dataframe with values from a mapping and in the proportion of the mapping.

Information sources
The mapping comes from value_counts(normalize=True)

Setting up data

from itertools import product

import numpy as np
import pandas as pd

df = pd.DataFrame(product([np.nan],['Red','Blue'],'ABCDE'),columns=['zipcode','color','letters'])
df
index zipcode color letters
0 NaN Red A
1 NaN Red B
2 NaN Red C
3 NaN Red D
4 NaN Red E
5 NaN Blue A
6 NaN Blue B
7 NaN Blue C
8 NaN Blue D
9 NaN Blue E

Setting up mapping

s = pd.Series(['A']*3+['B']*5)   # prime numbers chosen to produce non integer proportions
proportions = s.value_counts(normalize=True)
proportions
B    0.625
A    0.375
dtype: float64

Shuffle before split and fillna to introduce randomness

shuffled_df = df.sample(frac=1)
shuffled_df

Prepare split points

split_points = (proportions*len(shuffled_df)).astype(int).cumsum()  # cumulative positions for np.split; the off-by-1 from the int cast doesn't matter, as the last position is dropped anyway
split_points 

Split df

split_dfs = np.split(shuffled_df,split_points[:-1])
split_dfs

Map the filling function to multiple iterables (iter1 = index, iter2 = df), each contributing 1 positional input to lambda filling function

shuffled_df # for before after comparison
pd.concat(map(lambda zipcode,df:df.fillna(zipcode), split_points.index,split_dfs))
index zipcode color letters
2 B Red C
8 B Blue D
4 B Red E
9 B Blue E
1 B Red B
6 B Blue B
7 A Blue C
3 A Red D
0 A Red A
5 A Blue A

There you have the analogous rfk with zipcode filled. Note I didn’t make value_counts() into a dataframe because it seems unnecessary. Also, in future, when you use Series.map for feature engineering, throwing in a Series directly is more convenient.

This example mapping only used 1 split point to generate 2 splits in np.split, but the same approach works for more split points too, as long as the positions passed to np.split are cumulative. You can look at the np.split docs and try modifying the sample mapping here to add more zipcode categories.
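For instance, with three zip code categories the sketch looks like this (hypothetical data; the cumsum turns per-group counts into positions measured from the start of the frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"zipcode": [np.nan] * 10, "row": range(10)})
proportions = pd.Series([0.5, 0.3, 0.2], index=["10035", "11102", "10454"])

shuffled = df.sample(frac=1, random_state=0)
# Per-group counts [5, 3, 2] -> cumulative positions [5, 8, 10] -> split at [5, 8]
split_points = (proportions * len(shuffled)).astype(int).cumsum()
parts = np.split(shuffled, split_points[:-1])

# Fill each chunk's zipcode column with its category label
filled = pd.concat(
    part.fillna({"zipcode": zipcode})
    for zipcode, part in zip(split_points.index, parts)
)
print(filled["zipcode"].value_counts())
```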
