Data Cleaning and Extraction

Hi guys,

please, i will need a help. I have been working on a personal project, and i got stuck.

A brief summary, i have a list called crime_list, and i want to create a new column with this list with respect to each state and province.

I also created a datetime for a specified time schedule and i want to automatically compute a new dataframe from it.

the image below explains it further. Please i need guidance and help on this

The first two images explains what i did, while the last image explained the result i wanted (i did that manually in excel, but i have a large dataset and doing it manually is not an option)

question3

Please, Someone should help me out.

@Sahil
@fedepereira
@april.g
@dilarakrby

1 Like

Hi @nnabugwukelvin.chukw,
in the 1st image, you defined a list “crime” instead of “crime_list”.
Could be this the source of the issue?

1 Like

I changed it, though, nothing happened. I can share the code here if it is required.

I think i am not getting it right with my code

1 Like

sure,
please share the code and, if possible, the excel file (or simply share a zipped file). I can check it faster this way.

DRAFT REQUEST ANWOLO.ipynb (15.7 KB) DRAFT REQUEST ANWOLO - Copy.xlsx (35.6 KB)

Click here to view the jupyter notebook file in a new tab

I can clarify further if needed

I see some incompatibility when you assign values in the for loop:

First, you assign one element in a column, but you are not using the row index (df.loc[ ] or df.iloc[ ]).
By doing this, you will rewrite the values in the “Crime Categories” column on each loop.

df1['Crime Categories'] = crime

Then you assign a series (252 items) to a column for each crime:

df1['Start Date'] = pd.Series(start_date)
df1["End Date"] = pd.Series(end_date)

If I understood correctly based on the last image, you want a combination of months, crimes, station. Is this correct? You will have a big data frame. The following code creates this combination with a little help of the itetools module but misses the State/Province (you can go further and include it):

from itertools import product 

pstations = list(df1['Police station'].value_counts().index)
crimes = list(df2["Crime Categories"].value_counts().index)

perm = product(pstations, crimes)

# We will generate data as a temporal dataframe (252 items) and append all of them into a list
df_temp = pd.DataFrame()
df_temp["start date"] = ""
df_temp["end date"] = ""
df_temp["Police Station"] = ""
df_temp["Crime Category"] = ""

temp_list = []

for pstation, crime in list(perm):
    df_temp["start date"] = start_date
    df_temp["end date"] = end_date
    df_temp["Police Station"] = pstation
    df_temp["Crime Category"] = crime
    
    temp_list.append(df_temp)

# We will add all temporal dataframes to this one
df3 = pd.DataFrame()
df3 = pd.concat(temp_list)

The total length is 252 dates x 22 Crimes x 1141 Police Stations = 6.325.704 rows

You can also use multiindex in this case, if you find it useful.

1 Like

Hi @fedepereira, i am really grateful and happy seeing this.

I am having a problem assigning the Province to the structure.

How can i do this perhaps?

see my code

from itertools import product

pstations = list(df1['Police station'].value_counts().index)
crimes = list(df2['Crime Categories'].value_counts().index)
province = list(df1['Province'].value_counts().index)

perm = product(pstations, province, crimes)

# we will generate data as a temporal dataframe (252 itmes) and append all of them into a list

df_temp = pd.DataFrame()
df_temp['Start date'] = ""
df_temp['End date'] = ""
df_temp['Police Station'] = ""
df_temp['Crime Category']= ""
df_temp['Province'] = ""

temp_list = []

for pstation, crime, prov in list(perm):
    df_temp['Start date'] = start_date
    df_temp['End date'] = end_date
    df_temp["Police Station"] = pstation
    df_temp['Crime Category'] = crime
    df_temp['Province'] = prov
    
    temp_list.append(df_temp)
    
# adding the temporal dataframes to this one

df3 = pd.DataFrame()
df3 = pd.concat(temp_list)
1 Like

got this as my output… seeing displayed 251 rows, while total rows are 6.325.704 rows.

where did i get it wrong?

And i checked, and saw many null values…

1 Like

can someone please help me? :relieved: @Bruno

I solved it guys…

Later found out that all i need to do for this is;

results =

for (p_index, pol_st) in enumerate(province_station['Police station']):
    for crime in crime_cat['Crime Categories'][:len(start_date)]:
        for (index, date) in enumerate(start_date):
            results.append([province_station['Province'][p_index], pol_st, date, end_date[index], crime])
2 Likes