Storing structured data without schema changes

This is a short blog post about how to store data in a SQL database without having to worry too much about continual schema changes. Most people will know the different flavours of storing data: flat files (CSV, HDF, Parquet, the last being my favourite), SQL, or NoSQL datastores. I shall skip graph stores in this article.

The benefit of using a database for storing data during data science investigations is

  • not having to worry about type conversions like datetime formats
  • the ease of moving from investigations/data trials to staging, beta, and prod
  • a single place on which to base data dictionaries (what do the table and column names mean?) and data catalogues (how much data, and of what kind, do I have?)

Compared to a few years ago, I now switch to storing my data in databases almost immediately, once I have established that a data source could be useful. Using PostgreSQL, I discovered that I can attach comments to tables, views, and columns, which gives me a natural home for the data dictionary.
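For illustration, here is a minimal sketch of setting such comments from Python via SQLAlchemy; the connection string, table name, column name, and comment texts are placeholders, not taken from a real project:

import sqlalchemy

# placeholder connection string; adjust to your own PostgreSQL instance
engine = sqlalchemy.create_engine("postgresql://user:password@localhost/mydb")
with engine.begin() as conn:
    # COMMENT ON is PostgreSQL's mechanism for attaching documentation to schema objects
    conn.execute(sqlalchemy.text("COMMENT ON TABLE t_tablename IS 'Daily observations, one row per day'"))
    conn.execute(sqlalchemy.text("COMMENT ON COLUMN t_tablename.label IS 'Human-readable label for the record'"))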

SQL and NoSQL

So, why SQL and not NoSQL? This is an opinionated post. An often-stated advantage of NoSQL is that it is schema free. For a data scientist, this sounds very attractive, at least initially. NoSQL does work really well for highly unstructured data, but there comes a time (say, when adding a dashboard tool such as Apache Superset, Tableau, or PowerBI, or when provisioning data via REST or GraphQL) when it becomes bluntly obvious that schema-free does not mean that one needs no schema. For dashboards, it needs to be clear which data exist and what they mean before you can proceed. So even with a NoSQL database one needs to capture the schema somehow; it is crucial to have an information architecture. Why not, then, use a SQL store early on (and combine it with a NoSQL store where appropriate)?

There is a reason data scientists dislike adopting SQL stores too early: during the initial phases of a data trial it is quite unclear what the schema should look like. Which is why everybody starts with flat files (and I have, in the past, always missed the right point in time at which I should have moved to a database).

Pandas to the rescue

A previous project exposed me to early use of a SQL datastore (IBM DB2 in that case). We heavily used pandas and got used to simply dumping dataframes into the SQL data store.

df.to_sql("t_tablename",con=conn,if_exists="overwrite")

This will create the table if required and write the dataframe into it without having to worry about schemas. It is not quite an upsert (it actually isn't one at all), but one can read the existing data into a dataframe, concatenate the new rows, and drop_duplicates() before writing it back (a sketch follows after the list below). It falls over, however, when

  • the data contains strings that should be queried server side with SELECT DISTINCT (e.g., country names), or
  • the dataframe schema (column names and types) changes
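A minimal sketch of that poor man's upsert, assuming conn is a SQLAlchemy engine, the table t_tablename already exists, and df_new holds the new rows (all names are placeholders):

import pandas as pd

# read what is already in the table, append the new rows, and de-duplicate client side
existing = pd.read_sql("SELECT * FROM t_tablename", con=conn)
combined = pd.concat([existing, df_new], ignore_index=True).drop_duplicates()
# index=False keeps pandas' row index out of the table
combined.to_sql("t_tablename", con=conn, if_exists="replace", index=False)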

The first issue is due to strings being mapped to SQL TEXT, which in most cases behaves like a CLOB (Character Large Object); think of an external file. While this allows the text to grow, often to GB sizes, querying aggregations over it may not be possible (PostgreSQL) or may cause performance penalties. The good news is that this can be avoided by mapping the data types like so:-

import sqlalchemy
...
df.to_sql("t_tablename", con=conn, if_exists="replace", dtype={"label": sqlalchemy.types.VARCHAR(100), ...})

This approach also helps in documenting the schema, but, yes, it can be painful to determine the maximum length of all the strings in a dataframe.
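A minimal sketch of one way to measure those lengths and build the dtype mapping, assuming df is the dataframe about to be written; the 20-character headroom is an arbitrary choice:

import sqlalchemy

# longest string per object (string) column, plus a little headroom for future values
max_lens = df.select_dtypes(include="object").apply(lambda col: col.astype(str).str.len().max())
dtype_map = {col: sqlalchemy.types.VARCHAR(int(length) + 20) for col, length in max_lens.items()}
df.to_sql("t_tablename", con=conn, if_exists="replace", dtype=dtype_map)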

This approach does not help when the dataframe layout changes, though. In that case, one needs to DROP the existing table first and then create a new one. Not for the faint of heart. Which is why data scientists dislike adopting SQL datastores too early…

Key-Value Store

One solution may be to store the data as key-value pairs. As an example, we take the NSIDC Multisensor Analyzed Sea Ice Extent – Northern Hemisphere (MASIE-NH), Version 1 dataset, which reports the ice extent for various Arctic regions (ftp link).

yyyyddd   (0) Northern_Hemisphere   (1) Beaufort_Sea   (2) Chukchi_Sea   (3) East_Siberian_Sea
2006001   13034724                  1069711            966006.2          1087103
2006002   13034724                  1069711            966006.2          1087103
2006003   13170663                  1069711            966006.2          1087103
2006004   13409716                  1069711            966006.2          1087103
Upper left corner of the G02186 dataset, first line removed
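To get this into pandas, a minimal sketch of loading the file once it has been downloaded from the FTP link above; the filename is a placeholder, and skiprows=1 assumes the descriptive first line mentioned in the caption:

import pandas as pd

# placeholder filename for the downloaded G02186 CSV; skip the descriptive first line
df = pd.read_csv("g02186_extent.csv", skiprows=1)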

Now, while this data product may never add yet another Arctic sea region to its data, it is quite obvious that querying this dataset is challenging, as the different pieces of data are stored in columns with different names.

This is where pandas’ stack() functionality comes in handy. This is the code to transform the table above into a key-value format, with the datetime as the index, the region as the key, and the ice extent data as the values, in three columns:-

import pandas as pd
# parse the yyyyddd column (year plus day-of-year) into a DatetimeIndex
df.index = pd.to_datetime(df.yyyyddd, format="%Y%j")
del df["yyyyddd"]
# stack the region columns: one row per (date, region) pair, ice extent in the values
df = df.stack().reset_index().rename(columns={"yyyyddd": "datetime_date", "level_1": "region", 0: "extent"})

The table now looks like this:

index   datetime_date   region                    extent
0       2006-01-01      (0) Northern_Hemisphere   13034724
1       2006-01-01      (1) Beaufort_Sea          1069711
2       2006-01-01      (2) Chukchi_Sea           966006.2
3       2006-01-01      (3) East_Siberian_Sea     1087103
4       2006-01-01      (4) Laptev_Sea            897773.4

This you can then nicely store in a table like so

df.to_sql("t_g02186",con=conn,if_exists="replace",dtypes={"region":slqlachemy.types.VARCHAR(50)})

By now I have got used to calling datetime columns datetime_date; this avoids clashes with reserved words in various SQL databases. With this, you can now either re-create the original layout using pandas’ pivot_table():

df.pivot_table(index="datetime_date", columns="region", values="extent")

Or query a bespoke dataset

df = pd.read_sql("SELECT * FROM t_g02186 WHERE region = '(2) Chukchi_Sea'", con=conn)
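If you prefer not to splice the region string into the SQL yourself, a minimal sketch using a bound parameter, assuming conn is a SQLAlchemy engine or connection:

import sqlalchemy

# bind the region as a parameter instead of formatting it into the SQL string
query = sqlalchemy.text("SELECT * FROM t_g02186 WHERE region = :region")
df = pd.read_sql(query, con=conn, params={"region": "(2) Chukchi_Sea"})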

Conclusion

Structuring the data you work on early in the process is a good thing, but spending too much time changing schemas for SQL databases takes the fun out of it. The combination of pandas dataframes, pandas.stack(), SQL WHERE, and pandas.pivot_table() takes a lot of the pain out of storing your data in a structured and reusable manner rather early in the process.