How to create a database from scratch
I used instructions from this page
import sqlite3
import pandas as pd
### create SQLite DB and create table
!touch ../data/velib.sqlite
conn = sqlite3.connect('../data/velib.sqlite')
c = conn.cursor()
c.execute("""
CREATE TABLE stations (datetime text, stationCode int, meca int, elec int, park int)
""")
### write CSV table to SQLite DB
# import table from CSV (10 seconds)
stations = pd.read_csv('../data/data-5m.csv', low_memory=False)
# clean the table with pandas (15 seconds)
stations['datetime'] = pd.to_datetime(stations.datetime, format='%Y-%m-%d %H-%M')
for col in ['stationCode', 'meca', 'elec', 'park']:
stations[col] = pd.to_numeric(stations[col], errors='coerce')
stations = stations.dropna()
for col in ['stationCode', 'meca', 'elec', 'park']:
stations[col] = stations[col].astype(int)
# write table to SQLite DB (90 seconds)
stations.to_sql('stations', conn, if_exists='append', index=False)
### test simple query with sqlite3 only (7 ms)
c.execute("""
SELECT * FROM stations
""")
c.fetchone()
### test query with pandas (< 2 seconds)
query = """
SELECT s.datetime, s.meca + s.elec AS bikes, s.park
FROM stations s
WHERE s.stationCode = 12001
"""
pd.read_sql_query(query, conn).head()
Queries
Date related queries
From this page:
SQLite does not support built-in date and/or time storage class.
Instead, it leverages some built-in date and time functions to use other storage classes such as
TEXT,REAL, orINTEGERfor storing the date and time values.
SELECT datetime('now')
-- 2022-01-01 14:30:15
SELECT datetime('now', 'localtime')
-- 2022-01-01 15:33:08
Sanitized injection
query = """
SELECT s.datetime, s.meca + s.elec AS bikes, s.park
FROM stations s
WHERE (s.stationCode = ?
AND s.datetime BETWEEN ? AND ?)
"""
params = (str(station_code), str(first_dt), str(last_dt))
df = pd.read_sql_query(query, conn, params=params, parse_dates="datetime", index_col="datetime")
Proof of concept
Let query be:
SELECT s.datetime, s.meca + s.elec AS bikes, s.park
FROM stations s
WHERE (s.stationCode = 12001 AND s.datetime BETWEEN '2021-11-01' AND '2021-11-08')
This was WAYYYY more rapid than pandas
### Import and connect (< 200 ms)
import sqlite3
import pandas as pd
conn = sqlite3.connect('../data/velib.sqlite')
c = conn.cursor()
# query --> see SQL code above
### this takes 1.40 s
df = pd.read_sql_query(query, conn, parse_dates="datetime", index_col="datetime")
### standard rendering (< 1 s)
import matplotlib.pyplot as plt
import seaborn as sns
sns.lineplot(data=df, x='datetime', y='bikes', color='green')
sns.lineplot(data=df, x='datetime', y='park', color='purple')
plt.xlabel('')
plt.ylabel('')
for label in plt.gca().get_xticklabels():
label.set_rotation(45)
label.set_ha('right')
Production
- nothing to be put in requirements.txt:
sqliteis part of the standard library
after stubbornly trying to download SQLITE3 from pip for several minutes, i remembered that python is batteries included. -- quote from this stackOF post
- manage connector:
- I had an error because of multi-threads, which can be overcome with
check_same_thread=False - use a
with sqlite3.connect(DB_PATH) as connstatement is best - more on this stackOF post
- I had an error because of multi-threads, which can be overcome with