The FREEZE Data Pipeline

For the Arctic Risk data repository I decided to use a standard architecture often referred to as a data lake.

  • we ingest live, soft real-time data using a suite of ingestion scripts (Continual or Hot), and
  • we do one-off imports (Batch or Cold)

The tool suite is centred around available open-source tools. While one objective is a scalable solution that can handle a large number of requests, during alpha and beta we focus on basic functionality while keeping an eye on scalability.

Once we have identified a potential data source, we assess it, plot it, and check its data quality, coverage, and potential for the Arctic Risk platform. Good data sources are added to the Awesome Arctic Data GitHub repository. The tools of the trade are generally Jupyter, pandas, requests, matplotlib, seaborn, bokeh, and geopandas. For remote sensing/satellite data, we use wget, with login credentials stored in .netrc where required.
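The quality and coverage check can be sketched as a small pandas helper that summarises each column; the function name, the toy data, and the sentinel value below are illustrative, not part of our actual tooling:

```python
import pandas as pd

def coverage_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise per-column coverage: non-null fraction, cardinality, value range."""
    report = pd.DataFrame({
        "non_null_fraction": df.notna().mean(),
        "n_unique": df.nunique(),
    })
    numeric = df.select_dtypes("number")
    report["min"] = numeric.min()
    report["max"] = numeric.max()
    return report

# toy dataset with a gap and a suspicious sentinel-style value
df = pd.DataFrame({
    "station": ["A", "B", "C", None],
    "temp_c": [-12.5, -9.1, 999.0, -15.3],  # 999.0 flags an out-of-range entry
})
print(coverage_report(df))
```

A quick glance at the `max` column of such a report is often enough to spot sentinel values like the 999.0 above.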

Batch (Cold) Ingestion

For cold ingestion, i.e. one-off data ingestion, we stay with the Jupyter notebook. We generally import data into PostgreSQL with the PostGIS extension enabled; this way we keep geospatial information intact. While pandas offers a good way of creating and updating tables, our preference is to specify the data types of the various columns explicitly, except for floating-point values.
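A minimal sketch of such an import, with explicit types for all but the floating-point columns; the table name, columns, and the in-memory SQLite stand-in are illustrative, whereas in production the engine would point at PostgreSQL/PostGIS:

```python
import pandas as pd
from sqlalchemy import create_engine, types

df = pd.DataFrame({
    "site_name": ["Samoylov", "Bayelva"],   # made-up sample rows
    "site_id": [101, 102],
    "depth_m": [1.5, 2.25],                 # floating point: left for inference
})

# explicit SQL types for everything except floats
dtypes = {"site_name": types.Text(), "site_id": types.Integer()}

engine = create_engine("sqlite://")  # stand-in; normally postgresql+psycopg2://...
df.to_sql("t_sites", engine, if_exists="replace", index=False, dtype=dtypes)
```

The `dtype` mapping pins down the column types in the generated DDL, so a column of postcodes, say, cannot silently become an integer.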

A little-known fact is that one can ingest HTML directly into pandas; the return value, though, is a list of DataFrames. Example:-

import pandas as pd

adf = pd.read_html("https://en.wikipedia.org/wiki/Rideau_Canal")

This returns a list of tables on this page. As we are after the open/close times of the ice skating season to study climate change effects, we need to search for the correct table:

# keep the first table whose columns mention the skating season
for dfCanal in adf:
  if "Season" in dfCanal.columns:
    break

Also, we try to document our ingestion directly inside PostgreSQL:-

comments = {'subsite_name': 'Name of the subsite',
 'subsite_id': 'ID of subsite as generated by gtnpdatabase.org',
 'view_url': 'path portion of subsite webpage',
 ...
}

table = "t_permafrost_sites"

for column, comment in comments.items():
    stmt = f"COMMENT ON COLUMN {table}.{column} is '{comment}';"
    conn.execute(text(stmt).execution_options(autocommit=True))

Continual (Hot) Ingestion

For hot ingestion we use Python and Apache Airflow. Airflow scripts can be tricky to write and debug, and I found that its logging lacks a key piece of functionality I am keen on: statistics on the number of records before and after a script runs. It has happened 😎 that upserts were not quite successful after a glitch in the source changed something, which resulted in data loss.

While such events can be tricky to detect, let alone avoid, I find it useful to at least be informed about them. So, for every script run, we count records before and after the run to ensure the numbers are not shrinking.
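The guard itself can be sketched in a few lines of plain Python; the names are illustrative, and in Airflow this logic would wrap the upsert task:

```python
def run_with_count_guard(count_records, run_ingest):
    """Run an ingestion step, refusing to accept a shrinking record count."""
    before = count_records()
    run_ingest()
    after = count_records()
    if after < before:
        raise RuntimeError(
            f"record count dropped from {before} to {after}; "
            "possible data loss, please investigate"
        )
    return before, after

# demo with a plain list standing in for a database table
store = [1, 2, 3]
before, after = run_with_count_guard(lambda: len(store),
                                     lambda: store.append(4))
print(before, after)  # 3 4
```

In practice `count_records` would be a `SELECT count(*)` against the target table, and the raised error would surface as a failed Airflow task rather than a silent loss.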

Storage

Most data out there comes as tabular, CSV-style datasets. Many contain descriptive headers, multi-row column names with non-standard characters, and inconsistent out-of-range values. In addition, there are more and more RESTful services. Remote sensing data is typically available as netCDF or HDF5 files; these often hide critical metadata (such as geographic coordinate references) inside metadata strings. Geo-boundary data are available as shapefiles, sometimes GeoTIFF, or GeoJSON.

We try to store all data in a PostgreSQL datastore with a defined schema. My data format of choice outside this tool would be Parquet, an easy-to-handle, highly efficient binary format that can readily be used by tools such as Apache Spark.

Satellite data will generally arrive as netCDF or HDF5 files, and we keep the originals. Geospatial data generally come as shapefiles; again, we leave these in their original format.

Mapping, Cataloging, and Provisioning

We find Hasura to be a wonderful tool that enables provisioning selected PostgreSQL tables as GraphQL and, since version 2 (which is in alpha right now), also as a RESTful interface. We will closely follow their features around data dictionaries and catalogues; metadata management is on their roadmap.
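A GraphQL call against Hasura is ultimately a plain JSON POST; a sketch of assembling one in Python, reusing the permafrost table from above (the query shape and field names are assumptions about how the table would be exposed):

```python
import json

# hypothetical query against a Hasura-exposed table
query = """
query PermafrostSites($limit: Int!) {
  t_permafrost_sites(limit: $limit) {
    subsite_name
    subsite_id
  }
}
"""

payload = json.dumps({"query": query, "variables": {"limit": 10}})
# this payload would be POSTed to https://<hasura-host>/v1/graphql
# with Content-Type: application/json plus an admin secret or JWT
print(payload[:40])
```

The attraction is that no bespoke API code is needed on our side; Hasura derives the schema from the tables themselves.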

In parallel, we also monitor progress around LinkedIn's DataHub, a very promising development that excels at cataloguing data sources. At the time of this writing, some features are very hard to use, but development progresses at an impressive pace.

Presenting and Visualisation

Some of our insights can be readily presented using dashboard plots with limited interaction; our tool of choice here is Apache ECharts, which, much like Vega and Vega-Lite, describes visualisations using a structured, declarative grammar. Some of these visualisations we create as HTML/JavaScript files; others we assemble dynamically using Python's flask/flask-cors and gunicorn.
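An ECharts visualisation is, in the end, a JSON "option" document, which makes it easy to assemble server-side; a minimal sketch with made-up season-length numbers (the flask endpoint serving it is implied, not shown):

```python
import json

# a minimal ECharts "option": skating-season length per year (invented data)
option = {
    "title": {"text": "Rideau Canal skating season length"},
    "xAxis": {"type": "category", "data": ["2018", "2019", "2020"]},
    "yAxis": {"type": "value", "name": "days"},
    "series": [{"type": "bar", "data": [50, 42, 38]}],
}

# the JSON a flask endpoint would return for the browser side
# to hand to the chart's setOption(...)
print(json.dumps(option)[:50])
```

Because the whole chart is one serialisable document, it can be stored, versioned, and templated like any other piece of data.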

We also look at Apache Superset for dashboards that follow the common cross-filter approach, but find it hard to use at times unless the data are generated and stored in a way Superset can cope with; this phenomenon also applies to similar tools such as Tableau or PowerBI.

Some applications require more ad hoc functionality, which, for the moment, we provision using Streamlit apps. In addition, we turn to Bokeh server for more complex visual interactions, and keep an eye on H2O's Wave.

Provisioning

This is done via nginx and/or OpenResty.
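A minimal sketch of the nginx side, assuming a gunicorn-served flask app on port 8000 and a Streamlit app on port 8501; the server name, paths, and ports are illustrative:

```nginx
server {
    listen 80;
    server_name data.example.org;

    # flask/gunicorn-backed API
    location /api/ {
        proxy_pass http://127.0.0.1:8000/;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # streamlit needs websocket upgrades for its live connection
    location /app/ {
        proxy_pass http://127.0.0.1:8501/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

OpenResty accepts the same configuration, with the option of adding Lua logic (e.g. for auth) inside the proxy layer.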

Data Version Control

This will become a pressing issue once we move into the analytics space in anger. At the moment, we do not version data just yet.

Additional Tools

We also have an installation of Elasticsearch and its impressive Enterprise Search web crawler, which we plan to use for information and knowledge retrieval and for generating author connection graphs.

For graph databases, we are investigating Grakn to store relationships and knowledge flows.

Machine Learning and Analytics

This post is about the data engineering toolchain; for machine learning and analytics, watch this space.