Geographical Referentials: Data Quality in Stages
If you’ve spent time roaming through our Data hub, data.opendatasoft.com, you know how vast and comprehensive our catalog of datasets is. You could easily get lost, just as you could get lost in the aisles of a bookstore, in search of a much-talked-about book on witches that you’d like to give your little brother for Christmas.
On the Data hub, you won’t find a bookstore assistant with superpowers who is ready to search the store and the basement inventory to help you spot rare gems. At Opendatasoft, we provide you with something much better: a data team specifically trained to provide our customers and the Data Network’s visitors with reliable geographical referential data that can be found in no time.
To find the georeferentials, use a magic formula: type georef-countryname (with the country name in English) into the search engine on data.opendatasoft.com, then filter on the Geographical Referentials theme and the Public portal name.
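As a rough illustration, the search text can be built from a country name in a few lines of Python. This is only a sketch: the slug rule (lowercase, spaces turned into hyphens), the explore page URL, and the `q` query parameter are assumptions made for the example, not documented behavior.

```python
from urllib.parse import urlencode

# Assumed address of the catalog search page on the Data hub.
HUB_EXPLORE_URL = "https://data.opendatasoft.com/explore/"


def georef_search_term(country_name: str) -> str:
    """Build the 'georef-countryname' search text from an English country name."""
    # Assumption: the country name is lowercased and spaces become hyphens.
    slug = "-".join(country_name.strip().lower().split())
    return f"georef-{slug}"


def georef_search_url(country_name: str) -> str:
    """Build a search link for the term (assumes a `q` query parameter)."""
    query = urlencode({"q": georef_search_term(country_name)})
    return f"{HUB_EXPLORE_URL}?{query}"


print(georef_search_term("France"))
print(georef_search_url("France"))
```

The theme and portal filters are left out on purpose, since their exact parameter names are not given here; apply them in the search interface.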
This effort to streamline and improve the quality of geographical information is the first component of a larger project, spearheaded by Opendatasoft data hunters, to offer up-to-date, multi-thematic datasets based on a unified and reproducible structure for different countries. Let’s take a look behind the scenes of this important project for the Data Network and all the data portals created with the Opendatasoft platform.
Why Start With Geographical Information?
Geography: An All-Time Favorite
In 2019, our team conducted an internal audit of the Data hub and found that the datasets with a geographical dimension (such as geolocated points or visible contours on a map) were the most frequently reused. Here, “reuse” refers to the number of downloads counted for the dataset and/or the addition of the dataset into another portal’s catalog using the federation functionality. It also reflects the amount of geographical processing carried out by users within the platform itself.
Some geographical data is heavily used by our customers (national and local government bodies and private companies) to supplement their own business data with spatial and statistical information. For example, administrative boundaries are very popular because they make it possible to outline the different areas of a country, such as regions, states, municipalities, and districts.
This heavy use of geographical data is one key reason our team prioritizes spatial data. Another is that the efficient functioning of the Opendatasoft platform itself depends on the quality and freshness of geographical information.
Georeferentials: The Mechanics behind the Platform
There are three internal services the platform relies on in order to function:
- Geographic join: This processor retrieves the geographical shapes corresponding to a country’s administrative divisions using an official join key that is compatible with data produced by governments and other organizations. Examples include a region code or a municipality code.
- Retrieve administrative divisions: Some datasets may lack fields for territorial levels (municipalities, districts, regions, etc.), but may still contain geographical coordinates. In this situation, the processor can retrieve the name and shape of the missing administrative divisions from the coordinates.
- Catalog navigation by territory: This feature lets you browse a catalog by filtering datasets by territory. When a data portal covers multiple territorial levels, it is then possible to move up and down between territorial levels.
For this type of navigation to be useful, the geographical coverage associated with each dataset must be based on a set of reference administrative divisions that takes into account the breakdown and specifics of the country, such as the “Provinces” in Canada. The same applies to how the two processors mentioned above work.
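To make the two processors more concrete, here is a minimal Python sketch of the ideas behind them: a join on an official code, and a lookup of the division that contains a pair of coordinates. It is not the platform's implementation. The division codes, square shapes, field names, and function names are all hypothetical, and the point-in-polygon test is a plain ray-casting check on simple rings.

```python
from typing import Optional

Point = tuple[float, float]  # (longitude, latitude)

# Hypothetical reference divisions: code -> name and a simple polygon ring.
DIVISIONS = {
    "A": {"name": "Region A", "shape": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]},
    "B": {"name": "Region B", "shape": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)]},
}


def join_by_code(records: list[dict], divisions: dict, key: str = "region_code") -> list[dict]:
    """Attach the division name and shape to each record via its join key."""
    joined = []
    for record in records:
        division = divisions.get(record.get(key))
        enriched = dict(record)
        # Records with an unknown code are kept, with empty geography.
        enriched["division_name"] = division["name"] if division else None
        enriched["division_shape"] = division["shape"] if division else None
        joined.append(enriched)
    return joined


def point_in_polygon(point: Point, ring: list[Point]) -> bool:
    """Ray casting: count edge crossings of a ray going right from the point."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_division(point: Point, divisions: dict) -> Optional[str]:
    """Return the code of the first division whose shape contains the point."""
    for code, division in divisions.items():
        if point_in_polygon(point, division["shape"]):
            return code
    return None


sales = [{"region_code": "A", "sales": 10}, {"region_code": "Z", "sales": 3}]
print([row["division_name"] for row in join_by_code(sales, DIVISIONS)])
print(find_division((3.0, 0.5), DIVISIONS))
```

Both functions depend on the same reference set of divisions, which is why the quality of that reference data matters so much.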
How Did the ODS Team Update the Georeferentials?
Over the years, geographic data has been added repeatedly to the Data hub. As a result, the georepository stack has grown considerably and become too overloaded to remain relevant. For example, there are more than 80 datasets related to French administrative boundaries.
For people looking for reusable, quality geographical data, including our customers, this makes it difficult to pick out the most reliable datasets from a vast catalog with multiple sources, various versions, and sometimes conflicting names.
This was a challenge at the heart of data quality, freshness, and reusability. Our data team decided to tackle the problem by setting the following objectives:
- Structure country layers based on the same strategy.
- For each level, consolidate into two datasets—one vintage dataset and one for the most recent year of the vintage dataset—from multiple official sources. For example, the dataset on French municipalities was created by consolidating data from INSEE, IGN, and Natural Earth.
- Follow a naming standard for the repositories so that they can be correctly identified anywhere, whether on the Data Network or in the relevant Opendatasoft processor interface.
- Within each level, use consistent naming for attributes and fields so that there are no disparities in the format or spelling across administrative divisions. Whether we’re in the German Gemeinde (Municipalities) repository or the German Kreise (Districts) repository, the Kreis code and Land code fields are named consistently: krs_code and lan_code. You can check the Information Tab > Data schema for matches in the two datasets.
- Always keep this set of geographical referentials up to date to ensure its reliability.
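The naming objective lends itself to a simple automated check. The sketch below compares the field names of two repositories and reports the ones they share. Only krs_code and lan_code come from the example above; the other field names and the helper functions are hypothetical.

```python
# Hypothetical field lists; only krs_code and lan_code are taken from the article.
GEMEINDE_FIELDS = {"gem_code", "gem_name", "krs_code", "lan_code"}
KREIS_FIELDS = {"krs_code", "krs_name", "lan_code"}


def shared_fields(schema_a: set[str], schema_b: set[str]) -> list[str]:
    """Fields spelled identically in both schemas, in alphabetical order."""
    return sorted(schema_a & schema_b)


def missing_fields(schema: set[str], expected: list[str]) -> list[str]:
    """Expected field names that the schema does not contain."""
    return sorted(set(expected) - set(schema))


print(shared_fields(GEMEINDE_FIELDS, KREIS_FIELDS))
print(missing_fields(KREIS_FIELDS, ["krs_code", "lan_code"]))
```

A check like this would catch a repository that spelled the same identifier differently, for example kreis_code instead of krs_code.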
A Strategy Built on Automation
With these objectives in mind, our team chose to set up a data pipeline to store and prepare the reference data prior to publication. This processing plant, named DataSeed, is a platform that automates a cycle of operations that is particularly tedious and time-consuming when carried out manually on an ever-expanding volume of data.
DataSeed is currently capable of automating the following operations:
- Retrieval from multiple data sources
- Orchestrated processing of source data
- Data quality verification: This step checks for consistency between different territorial levels and different data sources and simplifies geographical shapes for better performance.
- Creation of consolidated repositories from processed and verified source data
- Delivery of repositories to sites where they are listed and reused
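One part of the data quality step, the consistency check between territorial levels, can be pictured as follows. This is an illustrative sketch and not DataSeed's actual code: the function name and sample records are hypothetical, and it only verifies that every child division points to a parent code that exists in the level above.

```python
def orphan_children(children: list[dict], parents: list[dict],
                    parent_key: str, code_key: str) -> list[dict]:
    """Return child records whose parent code is absent from the parent level."""
    parent_codes = {parent[code_key] for parent in parents}
    return [child for child in children if child.get(parent_key) not in parent_codes]


# Hypothetical sample data for two territorial levels.
districts = [{"krs_code": "01001"}, {"krs_code": "01002"}]
municipalities = [
    {"gem_code": "m1", "krs_code": "01001"},
    {"gem_code": "m2", "krs_code": "99999"},  # points to a district that does not exist
]

orphans = orphan_children(municipalities, districts, parent_key="krs_code", code_key="krs_code")
print([record["gem_code"] for record in orphans])
```

An empty result means the two levels agree. Any orphan would need to be resolved before the consolidated repository is built.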
After leaving the internal DataSeed platform, the reference datasets no longer require edits or cleanups on the fly. All of the operations are automated upstream, so that the data repositories loaded onto the Opendatasoft platform and on the Data Network are ready for use.
How Do I Use Geographical Referentials?
There are several possible ways to use them:
- Download a repository from the Data Network or use it in an external service through the application programming interface available in the dataset’s API tab.
- Display the repository in your catalog (filtered, for example, by the administrative divisions of the relevant region) using the federation feature. This feature fetches the dataset directly from the Data Network catalog using your Opendatasoft portal’s administration interface. There’s no need to download it and then manually import it into your own catalog. This way, whenever the federated dataset is updated, your catalog is also updated automatically.
- Enhance an existing dataset on your portal with a repository, using the processors described earlier (geographic join and retrieve administrative divisions).
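For the first option, calling the API from an external service, a request URL might be assembled like this. Treat it as a sketch under assumptions: the path follows the layout of the Opendatasoft Explore API v2.1, the dataset identifier is a placeholder, and the filter expression is only an example. The dataset’s API tab remains the reference for the exact URL.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def records_url(domain: str, dataset_id: str,
                where: Optional[str] = None, limit: int = 10) -> str:
    """Build a records request URL (assumes the Explore API v2.1 path layout)."""
    params = {"limit": limit}
    if where:
        # Optional filter expression, passed through as-is.
        params["where"] = where
    path = f"/api/explore/v2.1/catalog/datasets/{quote(dataset_id, safe='')}/records"
    return f"https://{domain}{path}?{urlencode(params)}"


# "georef-example" is a placeholder identifier, not a real dataset.
print(records_url("data.opendatasoft.com", "georef-example", limit=5))
```

The function only builds the URL. Fetching it and parsing the JSON response is left to whichever HTTP client the external service already uses.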
What’s Next?
Currently, data producers and reusers in Germany, Canada, France, and Mexico can use up-to-date geographical referentials. Other countries, including Australia, Belgium, and the United States, will be added to this list in the coming months.
While keeping its focus on geographical referentials, the data team is already working on creating sets for other useful themes. We plan to include georeferenced data on demographics, housing, and even employment.
Easier Access to Repositories
Currently, repositories can be found on the Data Network by searching with the text georef-countryname, plus a filter on the portal name (Public) and on the theme (Geographical Referentials). Although this approach has proven successful, it requires some prior knowledge. To make them more directly and immediately discoverable, geographical referentials will soon be added to the Data hub’s Repositories page.
This repository visibility issue is part of the process of redesigning the Data hub, whose exploratory phase began in the fourth quarter of 2020. The entire Opendatasoft team is poised to build and maintain a system to encourage more practical data use and gather a community of data enthusiasts who produce, enhance, share, and use data on a daily basis.