So far, fingers crossed, I have managed to avoid a COVID-19 infection. While working from home helped a lot, I also took care when meeting other people. During the pandemic, I was taking the train to visit family, though, and when preparing for a trip, I found myself browsing through seven-day incidence data across Germany, manually overlaying it with train stops.

That was really inefficient, I found, and it seemed pointless to use a computer to look at the rail schedule on one page and the incidence rates on another, only to combine the data in my head. I do like computers helping me; one of my passions is assistants that help avoid unnecessary and error-prone tasks.

So, time to build a tool. Actually, visualisations.

Data Sources

COVID-19 Case Data

There are numerous sources for COVID-19 case data for Germany, the “official” one being the Robert Koch Institute (RKI). https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland seems to be the authoritative source, together with the COVID-19 datahub at https://npgeo-corona-npgeo-de.hub.arcgis.com/.

Rail Schedule Data

Data for German Rail (Deutsche Bahn) are accessible via their Open Data API at https://api.deutschebahn.com/store/. You need to register to get an API Key. This site is in the process of being migrated to a new offering (DB API Marketplace).

Geo Data

We need to fuse data based on geolocation, i.e. determine which Landkreis (district, or county) a train stop lies in, and which Landkreise neighbour it. Data are available e.g. from https://gdz.bkg.bund.de/index.php/default/digitale-geodaten/verwaltungsgebiete.html, or https://www.arcgis.com/home/item.html?id=b2e6d8854d9744ca88144d30bef06a76.
For some reason, Berlin’s districts are not part of the Landkreise structure, and one needs to replace the German geo data entry for Berlin with the Berlin district data (e.g. from https://daten.odis-berlin.de/de/dataset/bezirksgrenzen/). Another approach would be to aggregate the RKI COVID data to sum up data for the whole of Berlin.
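To illustrate the stop-to-district mapping described at the start of this section, here is a minimal sketch in plain Python using the classic ray-casting test. It is not the code used for this project: the district outlines, ids, and stop names are made up, and real Landkreis geometries are far more complex (multi-polygons, holes). With geopandas, a spatial join such as `geopandas.sjoin` would typically do this job instead.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count how often a ray going east from the point crosses the outline."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def district_for_stop(stop: Point, districts: Dict[int, Sequence[Point]]) -> Optional[int]:
    """Return the id of the first district whose outline contains the stop, else None."""
    for district_id, outline in districts.items():
        if point_in_polygon(stop, outline):
            return district_id
    return None


# Hypothetical, heavily simplified outlines (rectangles) keyed by made-up ids.
districts = {
    1: [(9.0, 53.0), (10.5, 53.0), (10.5, 54.0), (9.0, 54.0)],
    2: [(11.0, 47.8), (12.0, 47.8), (12.0, 48.5), (11.0, 48.5)],
}
# Hypothetical stops with (longitude, latitude) coordinates.
stops = {"Stop A": (10.0, 53.55), "Stop B": (11.56, 48.14), "Stop C": (13.0, 52.0)}

stop_to_district = {name: district_for_stop(pos, districts) for name, pos in stops.items()}
```

A stop that falls outside every outline maps to `None`, which is a useful signal that the geo data and the stop coordinates do not line up (for example because of differing coordinate systems).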

Fuse Data

Rail Schedule

The flow of logic is shown below

Some logic is required to select origin and destination train stops (such as “Berlin Südkreuz”). The IDs of these stops can then be used to query train connections. Each connection then yields a list of train stops for the train's complete journey (e.g. Hamburg to Munich), not just for the queried connection.

RKI Data

RKI data are reported by date and by LandkreisId, which is the Amtlicher Gemeindeschlüssel (AGS, the official municipality key) as per https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland. As described there, the data for Berlin are reported at Bezirk level.
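The alternative mentioned earlier, summing the Berlin Bezirk rows into one entry for the whole city, could look roughly like the sketch below. The sample rows are made up; the column names (`IdLandkreis`, `Meldedatum`, `AnzahlFall`), the Bezirk id range 11001 to 11012, and the city-wide id 11000 are assumptions about the RKI file that should be checked against the actual data.

```python
import pandas as pd

# Made-up sample rows; column names are assumed to follow the RKI CSV.
cases = pd.DataFrame({
    "IdLandkreis": [11001, 11002, 11001, 2000, 11012],
    "Meldedatum": ["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-01", "2021-03-02"],
    "AnzahlFall": [5, 7, 3, 11, 2],
})

BERLIN_ID = 11000  # assumed single id for the whole city

# Assumption: the Berlin Bezirke carry the ids 11001..11012.
is_berlin = cases["IdLandkreis"].between(11001, 11012)
cases.loc[is_berlin, "IdLandkreis"] = BERLIN_ID

# One row per district and reporting date, with the Bezirk cases summed up.
per_district = (
    cases.groupby(["IdLandkreis", "Meldedatum"], as_index=False)["AnzahlFall"].sum()
)
```

With this variant the single Berlin entry in the geo data can stay as it is, at the price of losing the Bezirk-level detail.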

Geo Data

The data from the German geo portals contain the LandkreisId (AGS) as well. As mentioned above, the Germany data need to be amended by adding the Berlin information. First we drop the existing entry:

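import geopandas as gpd
import pandas as pd  # needed further down for read_excel and concat
# ...
basedata = gpd.read_file("Kreisgrenzen_2017_mit_Einwohnerzahl.shp")
# Store the regional key (RS) as an integer IdLandkreis
basedata["IdLandkreis"] = basedata.RS.astype(int)
# Drop the single entry that covers the whole of Berlin
basedata.drop(basedata[basedata.GEN == "Berlin"].index, inplace=True)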

Then we load the Berlin shapefile, convert it to the same geo-coordinate system, derive the LandkreisId, and rename its columns:

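# Load the Berlin Bezirke and reproject them to EPSG:4326
berlindata = gpd.read_file("Berlin_Bezirke.shp").to_crs("EPSG:4326")
# Build IdLandkreis from Schluessel: everything but the last five characters, plus the last two
berlindata["IdLandkreis"] = berlindata.Schluessel.str[:-5] + berlindata.Schluessel.str[-2:]
berlindata.rename(columns={"Land_name": "Bundesland", "Gemeinde_n": "GEN"}, inplace=True)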
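The string slicing on `Schluessel` is easier to follow with a concrete value. The sketch below applies the same slicing to a single string; the sample keys are an assumption about the shapefile's format (an eight-character key made up of state, padding, and Bezirk number), not values taken from the file.

```python
def bezirk_key_to_landkreis_id(schluessel: str) -> int:
    """Keep everything but the last five characters, then append the last two."""
    return int(schluessel[:-5] + schluessel[-2:])


# Assumed key format: "11" (Berlin) + "000" + three-digit Bezirk number.
sample = "11000001"
landkreis_id = bezirk_key_to_landkreis_id(sample)  # "110" + "01"
```

If the keys in the shapefile had a different length, the slices would pick up the wrong digits, so a quick look at `berlindata.Schluessel` before the conversion is worthwhile.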

While the Germany data include population figures, those for Berlin need to be integrated from an Excel file:

# Read the twelve Bezirk names and population figures from sheet "T5"
berlinpopulation = pd.read_excel("SB_A01-05-00_2019h02_BE.xlsx",
                                 sheet_name="T5", usecols=[0, 1], skiprows=range(7), nrows=12,
                                 header=None, names=["GEN", "EWZ"])
berlindata = berlindata.merge(berlinpopulation, on="GEN")
berlindata["IdLandkreis"] = berlindata["IdLandkreis"].astype(int)

Ultimately, we fuse the latest boundaries with all of the data above:

geodata = gpd.read_file("Kreisgrenzen_2019.shp")
geodata["IdLandkreis"] = geodata.RS.astype(int)
geodata = geodata.merge(basedata[["IdLandkreis","EWZ"]],on="IdLandkreis")
geodata = pd.concat([geodata,berlindata[["IdLandkreis","EWZ","geometry","GEN"]]])

In summary: the Berlin geo shapes and population numbers are joined on the Bezirk names; the German county population data (from 2017) are joined to the geo shapes (from 2019) on LandkreisId; and the Berlin data are appended with their own LandkreisId, which is the same key used by the RKI COVID numbers.
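With population (EWZ) and case numbers sharing one key, the seven-day incidence follows from the usual definition: cases reported over seven days, divided by population, times 100,000. The sketch below shows this for a single district with made-up numbers; the column names are assumptions, and it presumes one row per day without gaps.

```python
import pandas as pd

# Made-up daily case counts for one district; column names are assumptions.
daily = pd.DataFrame({
    "Meldedatum": pd.date_range("2021-03-01", periods=10, freq="D"),
    "AnzahlFall": [10, 12, 8, 15, 20, 18, 17, 25, 30, 22],
})
population = 200_000  # EWZ of the district, made up

daily = daily.sort_values("Meldedatum")
# Sum of the cases over the current and the six preceding days.
cases_7d = daily["AnzahlFall"].rolling(window=7).sum()
# Seven-day incidence per 100,000 inhabitants.
daily["incidence_7d"] = cases_7d / population * 100_000
```

The first six days have no value because a full seven-day window is not yet available for them.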

Analytics

It is impossible, really, to accurately forecast COVID-19 infection numbers. One would have to know how many people interact with each other and under what circumstances (wearing masks, indoors or outdoors). To plan a journey in the foreseeable future, it may be useful, though, to project infection numbers using rather simple methods, such as linear regression, autoregression, and seasonal ARIMA with exogenous regressors (SARIMAX). All models are taken from statsmodels.

This produces a series of forecasts which can be used as a min/max envelope estimate for the future.
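The envelope idea can be sketched without the statsmodels models. The example below deliberately swaps in three much simpler stand-ins (a fitted straight line, a persistence forecast, and a continuation of last week's average change) and takes the per-day minimum and maximum across them. The incidence history is made up, and the choice of stand-in forecasts is an assumption for illustration only.

```python
import numpy as np

# Made-up seven-day incidence values for the last 14 days of one district.
history = np.array([62, 64, 63, 67, 70, 72, 71, 75, 78, 80, 79, 83, 86, 88], dtype=float)
horizon = 7  # number of days to project

days = np.arange(len(history))
future_days = np.arange(len(history), len(history) + horizon)

# Forecast 1: straight line fitted to the whole history (least squares).
slope, intercept = np.polyfit(days, history, deg=1)
linear = slope * future_days + intercept

# Forecast 2: persistence, i.e. the last observed value stays constant.
persistence = np.full(horizon, history[-1])

# Forecast 3: continue the average daily change of the last seven days.
recent_change = (history[-1] - history[-8]) / 7
recent_trend = history[-1] + recent_change * np.arange(1, horizon + 1)

# Stack the forecasts and take the per-day minimum and maximum as the envelope.
forecasts = np.clip(np.vstack([linear, persistence, recent_trend]), 0, None)
lower = forecasts.min(axis=0)
upper = forecasts.max(axis=0)
```

The same stacking step works unchanged when the rows come from fitted statsmodels models instead of these stand-ins.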

To repeat the initial statement: these forecasts will be inaccurate. They are still useful, however, if one wants to understand what the incidence situation could look like for a trip coming up next week.

InfoViz

All of the above was only done to be able to show an indication of expected seven-day incidences along the train stops of my planned trip. So I also developed a few visualisations that could come in handy for answering my question “for those people boarding the train at stop <x>, do they come from a hotspot region?”

Incidence Bands

This shows bands of possible seven-day infection numbers at a point in the future for the planned train ride (ICE 702 Hamburg to Munich, on March 29). It also includes a reference line showing the incidence rate at my current location, so that I can read the values as “higher” or “lower”.

Better/Worse Bubbles

Similar to the above, but showing a “most likely” value together with an arrow indicating whether the incidence is increasing or decreasing. The y position indicates the magnitude.

Better/Worse String of Pearls

A simplified version focussing on the expected development of the seven-day incidence.

Train Stop and its Surroundings Donuts

This adds a bit more detail on the train stops by also showing the situation in the surrounding counties. The length of each donut arc reflects the length of the shared border. See https://emergentalliance.org/applying-geospatial-knowledge-to-the-covid-19-assessments/, or https://github.com/emergent-analytics/workstreams/blob/master/Geospatial%20-%20WS1/WS3_kp_Knowledge_Graph.ipynb for a description of how to compute neighbourhood relations from geospatial data.
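How border lengths translate into arcs can be sketched as follows, assuming each arc's share of the full circle is strictly proportional to its share of the total shared border. The neighbour names and lengths are made up, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def donut_arcs(border_lengths_km: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Map each neighbour to a (start, end) angle in degrees, proportional to border length."""
    total = sum(border_lengths_km.values())
    if total == 0:
        return {}
    arcs = {}
    start = 0.0
    for neighbour, length in border_lengths_km.items():
        sweep = 360.0 * length / total
        arcs[neighbour] = (start, start + sweep)
        start += sweep
    return arcs


# Made-up shared border lengths between a stop's county and its neighbours.
shared_borders = {"Neighbour A": 30.0, "Neighbour B": 10.0, "Neighbour C": 20.0}
arcs = donut_arcs(shared_borders)
```

Each (start, end) pair can then be drawn as one wedge of the donut and coloured by the neighbour's incidence.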

Takeaway

These were InfoViz trials I applied to datasets that had been created to understand the COVID-19 situation in Germany. I was actually using them before travelling, sometimes taking a screenshot on my phone, so I could understand where passengers boarding my coach in, say, Erfurt, would come from in the context of COVID-19 incidences. Understanding this made me feel safer as I was more informed.

Of all the above visualisations, the String of Pearls could most easily be added to existing apps that show the planned journey. Similar kinds of InfoViz could also be useful in other contexts when planning a rail trip, such as the risk of adverse weather or flooding.
