Can't transfer a large data file (5M+ rows) to a sqlite3 db with the pd.to_sql method

Hi everyone,
I’m writing because I’m struggling with a task at my job (food delivery business): handling a large file and converting it into a SQLite db. Basically the situation is the following:

  • I have a huge list of orders (5+ million rows)
  • I exported this data locally to my machine as a 550 MB file
  • I’d like to move it into a SQLite DB, so it’s lighter and ready for an external party (an agency) to query

I created a huge dataframe with pd.read_csv, reading it in chunks (otherwise I run into a MemoryError, so at least I can dodge the first bullet :smile:). Then I’d love to use the df.to_sql method to import everything, but guess what, a second MemoryError I can’t seem to escape.
Looking at some topics on Stack Overflow, the solution seems to be to iterate over the DataFrame with a loop and load the data little by little. I can’t understand why, but after the first iteration my loop ends.

Below there’s the whole story - not marvelous formatting but it should do the trick.

Would you be able to lend a hand or give me a hint on other ways to get this working?
Thanks!

Gory details
import pandas as pd
import sqlite3
#instantiate advice.db
conn = sqlite3.connect('advice.db')

create_table = '''
CREATE TABLE IF NOT EXISTS main (
    customer_id TEXT PRIMARY KEY,
    order_id_local TEXT,
    order_date_local TEXT,
    time_placed TEXT,
    requested_for TEXT,
    hour_requested TEXT,
    meal_type TEXT,
    payment_method_type TEXT,
    delivery_postcode TEXT,
    order_value REAL,
    is_offer INTEGER,
    has_voucher INTEGER
    )
    ;
'''

#drop table if exists and create one, then close connection
conn.execute("DROP TABLE IF EXISTS main")
conn.execute(create_table)
conn.close()
db = 'advice.db'

def run_command(c):
    with sqlite3.connect(db) as conn:
        conn.isolation_level = None #autocommit mode
        conn.execute(c)             #execute a sql statement - we'll use command to create and insert rows into the tables

def run_query(q):
    with sqlite3.connect(db) as conn:
        return pd.read_sql(q, conn)
    
def show_tables():
    q = '''SELECT name, type
    FROM sqlite_master
    WHERE type IN ("table","view")
    '''
    return run_query(q)
show_tables()

    name   type
0   main  table

q = "SELECT * FROM sqlite_master WHERE type='table';"
run_query(q)

    type  name  tbl_name  rootpage  sql
0  table  main      main         2  CREATE TABLE main (\n customer_id TEXT PRIM...

run_query('''SELECT * FROM main LIMIT 5''')

(empty result - only the column headers: customer_id, order_id_local, order_date_local, time_placed, requested_for, hour_requested, meal_type, payment_method_type, delivery_postcode, order_value, is_offer, has_voucher)
dtypes = {
    'customer_id': 'str',
    'order_id_local': 'str',
    'order_date_local': 'str',
    'time_placed': 'str',
    'requested_for': 'str',
    'hour_requested': 'str',
    'meal_type': 'str',
    'payment_method_type': 'str',
    'delivery_postcode': 'str',
    'order_value': 'float64',
    'is_offer': 'int64',
    'has_voucher': 'int64'
}
cols = ['customer_id', 'order_id_local', 'order_date_local', 'time_placed',
       'requested_for', 'hour_requested', 'meal_type', 'payment_method_type',
       'delivery_postcode', 'order_value', 'is_offer', 'has_voucher']
df_chunk = pd.read_csv(r'advice-dataset.csv', dtype = dtypes, usecols = cols, chunksize=200000)

chunk_list = []  # append each chunk df here 

# Each chunk is in df format
for chunk in df_chunk: 
    chunk_list.append(chunk)
    
# concat the list into dataframe 
df_concat = pd.concat(chunk_list)


#source - https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
df_concat.shape
(5069653, 12)
#insert rows into db in chunks - loop ends after first data input??
with sqlite3.connect(db) as conn:
    for i in range(0,5000000,10000):
        j = i+10001
        df_concat[i:j].to_sql('main', conn, if_exists='replace', index = False)#, chunksize = 10000 )
run_query('SELECT count(*) FROM main')

   count(*)
0     10001
#50 iterations of 100k
for i in range(0,5000000,100000):
    j = i+100001
    print(i,j)
0 100001
100000 200001
200000 300001
300000 400001
400000 500001
500000 600001
600000 700001
700000 800001
800000 900001
900000 1000001
1000000 1100001
1100000 1200001
1200000 1300001
1300000 1400001
1400000 1500001
1500000 1600001
1600000 1700001
1700000 1800001
1800000 1900001
1900000 2000001
2000000 2100001
2100000 2200001
2200000 2300001
2300000 2400001
2400000 2500001
2500000 2600001
2600000 2700001
2700000 2800001
2800000 2900001
2900000 3000001
3000000 3100001
3100000 3200001
3200000 3300001
3300000 3400001
3400000 3500001
3500000 3600001
3600000 3700001
3700000 3800001
3800000 3900001
3900000 4000001
4000000 4100001
4100000 4200001
4200000 4300001
4300000 4400001
4400000 4500001
4500000 4600001
4600000 4700001
4700000 4800001
4800000 4900001
4900000 5000001
#test the code
test_df = pd.read_csv(r'advice-dataset.csv', dtype = dtypes, usecols = cols, nrows = 100)

#insert rows into db
with sqlite3.connect(db) as conn:
    test_df.to_sql('main', conn, if_exists='replace', index = False )

run_query('SELECT * FROM main LIMIT 5')


Hey, Nick.

If the data is private, is there a chance you can generate some fake data and replicate this behavior so that people can easily experiment?

Hi Bruno, sure.
No data makes it hard to debug. The code below generates dummy data: by extending the range of the loop you can get a DataFrame as big as you like, though I’m pretty sure you don’t need 5 million rows to run into memory errors. Not sure this is the best way to build it; it’s very slow.

Let me know how this works out for you and if you can get into a position where you can experiment.
Thanks!

#prep dataframe columns and sample data
cols = ['customer_id', 'order_id_local', 'order_date_local', 'time_placed',
       'requested_for', 'hour_requested', 'meal_type', 'payment_method_type',
       'delivery_postcode', 'order_value', 'is_offer', 'has_voucher']

sample_data = [['AB:IT:940459', '34204380', '2019-05-05', '2019-05-05 10:19:52',
        '2019-05-05 12:50:11.410', '12', 'Lunch', 'Card', '80000', 12.5,
        0, 0],
       ['AB:IT:644203', '34204377', '2019-05-05',
        '2019-05-05 10:19:49.383', '2019-05-05 12:50:04.953', '12',
        'Lunch', 'Cash', '10151', 13.4, 0, 0],
       ['AB:IT:1421127', '34204374', '2019-05-05', '2019-05-05 10:19:47',
        '2019-05-05 13:10:05.640', '13', 'Lunch', 'Card', '12345', 22.5,
        1, 0],
       ['AB:IT:1188690', '34204371', '2019-05-05', '2019-05-05 10:19:38',
        '2019-05-05 12:52:46.920', '12', 'Lunch', 'Card', '73682', 15.2,
        1, 0],
       ['AB:IT:542683', '34204369', '2019-05-05',
        '2019-05-05 10:19:36.267', '2019-05-05 12:50:03.290', '12',
        'Lunch', 'Cash', '10123', 9.0, 0, 0]]


#Build a sample dataframe of 5 rows
sample_df = pd.DataFrame(data = sample_data, columns = cols)


#prepare a full dummy-data dataframe - each iteration appends the 5-row sample, so n iterations give 5*n rows
full_sample = pd.DataFrame(columns = cols)
n = 10000

for i in range(0,n):
    full_sample = full_sample.append(sample_df)

full_sample.shape
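
A side note for anyone running this today: DataFrame.append was removed in pandas 2.0, and appending inside a loop is slow anyway. A quicker equivalent, using the same sample_df and n as above, is to concatenate the copies in one go:

full_sample = pd.concat([sample_df] * n, ignore_index=True)  # 5*n rows built in a single call
full_sample.shape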

Thank you for this.

This is where you made a mistake in your debugging process: from the row count you concluded that the loop ends after the first iteration, but one doesn’t follow from the other.

You only have roughly 10K rows not because it is only running the first iteration, but rather because it’s only saving the values from the last one. And this happens because of if_exists = 'replace':

[screenshot of the if_exists parameter documentation]

You’ll want to use append.
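
Roughly like this (an untested sketch reusing your df_concat, db and run_query). I clear the table first, use .iloc so the slicing is unambiguously positional, and step by 10000 so consecutive slices don’t overlap: your j = i + 10001 shares one row between chunks, which is also why the count came back as 10001 rather than 10000. Also, range(0, 5000000, ...) stops short of your 5,069,653 rows, so len(df_concat) is safer. One caveat: your CREATE TABLE declares customer_id as a PRIMARY KEY, so appending will fail if any customer appears more than once; you may want to drop that constraint or key on order_id_local instead.

with sqlite3.connect(db) as conn:
    conn.execute('DELETE FROM main')   # clear rows left over from earlier attempts
    for i in range(0, len(df_concat), 10000):
        df_concat.iloc[i:i + 10000].to_sql('main', conn, if_exists='append', index=False)

run_query('SELECT count(*) FROM main')   # should now match df_concat.shape[0]

And since you already read the CSV in chunks, you can skip building df_concat altogether and append each chunk as it is read, which keeps memory usage flat:

with sqlite3.connect(db) as conn:
    conn.execute('DELETE FROM main')
    for chunk in pd.read_csv(r'advice-dataset.csv', dtype=dtypes, usecols=cols, chunksize=200000):
        chunk.to_sql('main', conn, if_exists='append', index=False)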

Man, you’re so right… :man_facepalming:
This was actually very simple. I mistakenly thought that replace would replace duplicated records (rather than append them), whereas it is applied to the whole table, so every iteration kept overwriting it until only the last one was left.

Thanks @Bruno for the help, I was a tiny bit away from getting it right. Also, no MemoryError this time.
Is it common to run into this kind of issue, in your experience?
Do you have any suggestions on how to circumvent these problems when they surface?

Thanks again!
Nick

You’re welcome.

Tunnel vision is very common and it happens to everyone.

What works for me is breaking things down more and more until I spot the step where the error occurs. Once that happens, it’s much easier to understand why it fails.

What I did in your case was, instead of running the full loop, run a couple of iterations by hand, and they worked fine.

Then I printed the iteration variable at every 100,000-row step, and the loop seemed to run through the whole range.

Then I inspected the options being used more closely and figured it out. If I hadn’t had that inspiration, I would probably have checked whether the rows in the table matched the ones from the first iteration, noticed they didn’t, and things would eventually have clicked.
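
For instance (just an illustration, reusing your db and df_concat), printing a running row count right after each insert would have exposed it immediately, because with 'replace' the count stays stuck around 10,000 instead of growing:

with sqlite3.connect(db) as conn:
    for i in range(0, len(df_concat), 10000):
        df_concat.iloc[i:i + 10000].to_sql('main', conn, if_exists='replace', index=False)
        # 'replace' recreates the table on every pass, so this prints roughly 10000 each time;
        # with 'append' it would grow by ~10000 per chunk
        print(i, conn.execute('SELECT count(*) FROM main').fetchone()[0])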

All in all: break things down. Everything is simple if you break it down enough.


Hey, my two cents on this for everyone reading: my experience with sqlite3 and pandas on big datasets taught me to use Dask and Postgres for datasets this big. Pandas can handle this amount easily if you have enough RAM, but SQLite is tricky, as it tends to lock often on big insert and delete transactions. There is a Postgres adapter for Python, so you won’t have to struggle writing bash calls.
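
For example, something along these lines streams the same CSV into Postgres chunk by chunk. This is only a sketch: the connection string, the orders database and the SQLAlchemy/psycopg2 dependencies are assumptions, not anything from this thread, while dtypes and cols are the ones defined in the first post.

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string - adjust user/password/host/database to your setup
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/orders')

for chunk in pd.read_csv('advice-dataset.csv', dtype=dtypes, usecols=cols, chunksize=200000):
    # append each chunk as it is read, so the full 5M-row frame never sits in memory
    chunk.to_sql('main', engine, if_exists='append', index=False)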

I agree with the sentiment, but I’m not quite sure what it is that SQLite can and cannot handle. Also, the question is about SQLite.

For these two reasons I didn’t share this point of view, but it’s definitely something to keep in mind. @nlong

How do you deal with updating or adding to existing data? I don’t like using “replace”, as it drops the whole table and recreates it. Is there a more efficient way of doing this?
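
One option, if the target table has a proper unique key, is to load the new batch into a staging table and then upsert into the real one with SQLite's INSERT ... ON CONFLICT (available from SQLite 3.24), so only the affected rows are touched. This is just a sketch: new_rows is a hypothetical DataFrame of fresh orders, and it assumes a UNIQUE or PRIMARY KEY constraint on order_id_local rather than the customer_id key used earlier in the thread.

import sqlite3

with sqlite3.connect('advice.db') as conn:
    # write the fresh batch into a staging table
    new_rows.to_sql('main_staging', conn, if_exists='replace', index=False)

    # upsert: insert new orders, update rows whose order_id_local already exists
    # ("WHERE true" works around SQLite's INSERT ... SELECT ... ON CONFLICT parsing ambiguity)
    conn.execute('''
        INSERT INTO main SELECT * FROM main_staging WHERE true
        ON CONFLICT(order_id_local) DO UPDATE SET
            order_value = excluded.order_value,
            is_offer    = excluded.is_offer,
            has_voucher = excluded.has_voucher;
    ''')
    conn.execute('DROP TABLE main_staging')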
