January 05, 2021
Reading time: 9 min
Opendatasoft customers can now use a set of reliable geographical referentials to enrich their own data. Let's discover how our team worked to enhance the quality of these reference datasets.
If you’ve spent time roaming through our Data Network, data.opendatasoft.com, you know how vast and comprehensive our catalog of datasets is. You could easily get lost, just as you could get lost in the aisles of a bookstore, in search of a much-talked-about reference book on witches that you’d like to give your little brother for Christmas.
On the Data Network, you won’t find a bookstore assistant with superpowers who is ready to search the store and the basement inventory to help you spot rare gems. At Opendatasoft, we provide you with something much better: a data team specifically trained to provide our customers and Data Network visitors with reliable geographical referential data that can be found in no time.
To find the georeferentials, use a magical formula: type georef-countryname (with the country name in English) directly into the search engine on data.opendatasoft.com, then apply the Geographical Referentials filter for the theme and the Public filter for the portal name.
This effort to streamline and improve the quality of geographical information is the first component of a larger project, spearheaded by Opendatasoft data hunters, to offer up-to-date, multi-thematic datasets based on a unified and reproducible structure for different countries. Let’s take a look behind the scenes of this important project for the Data Network and all the data portals created with the Opendatasoft platform.
In 2019, our team conducted an internal audit of the Data Network and found that the datasets with a geographical dimension (such as geolocated points or visible contours on a map) were the most frequently reused. Here, “reuse” refers to the number of downloads counted for the dataset and/or the addition of the dataset into another portal’s catalog using the federation functionality. It also reflects the amount of geographical processing carried out by users within the platform itself.
Some geographical data is heavily used by our customers (national and local government bodies and private companies) to supplement their own business data with spatial and statistical information. For example, administrative boundaries are very popular because they outline the different areas of a country, such as regions, states, municipalities, and districts.
This heavy use is one of the main reasons our team prioritizes spatial data. The efficient functioning of the Opendatasoft platform itself also depends on the quality and freshness of geographical information.
There are three internal services the platform relies on in order to function:
The geographic join processor retrieves the geographical shapes corresponding to a country’s administrative divisions using an official join key that is compatible with data produced by governments and other organizations. Examples of such keys include a region code or a municipality code.
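To make the idea of a join key concrete, here is a minimal sketch in plain Python. It is not the platform's implementation: the function name `join_on_code`, the field names, and the tiny reference table are all hypothetical and made up for illustration.

```python
# Illustrative reference data: region code -> name and shape.
# Codes, names, and coordinates are invented for this example.
REFERENCE_REGIONS = {
    "R1": {"name": "Region One", "shape": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]},
    "R2": {"name": "Region Two", "shape": [(4.0, 0.0), (8.0, 0.0), (8.0, 4.0), (4.0, 4.0)]},
}


def join_on_code(records, reference, key="region_code"):
    """Attach the reference name and shape to each record via the join key."""
    enriched = []
    for record in records:
        match = reference.get(record.get(key))
        enriched.append({
            **record,
            # Records whose code is unknown are kept, with empty geography.
            "region_name": match["name"] if match else None,
            "region_shape": match["shape"] if match else None,
        })
    return enriched


business_rows = [
    {"store": "A", "region_code": "R1"},
    {"store": "B", "region_code": "R9"},  # code absent from the reference
]
enriched_rows = join_on_code(business_rows, REFERENCE_REGIONS)
```

The sketch keeps unmatched rows rather than dropping them, so a wrong or outdated code shows up as a missing shape instead of silently disappearing.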
Some datasets may lack fields for territorial levels (municipalities, districts, regions, etc.), but may still contain geographical coordinates. In this situation, the “retrieve administrative divisions” processor can derive the name and shape of the missing administrative divisions from the coordinates.
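One common way to go from coordinates to a division is a point-in-polygon test. The sketch below uses the classic ray-casting approach on simple polygons; it is an illustration only, not a description of how the platform does it, and the names `point_in_polygon`, `find_division`, and the two square "districts" are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many edges a ray going right from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_division(lon, lat, divisions):
    """Return (code, name) of the first division containing the point, else None."""
    for code, division in divisions.items():
        if point_in_polygon(lon, lat, division["shape"]):
            return code, division["name"]
    return None


# Invented divisions: two adjacent squares.
DIVISIONS = {
    "D1": {"name": "District One", "shape": [(0, 0), (2, 0), (2, 2), (0, 2)]},
    "D2": {"name": "District Two", "shape": [(2, 0), (4, 0), (4, 2), (2, 2)]},
}

first_hit = find_division(1.0, 1.0, DIVISIONS)
second_hit = find_division(3.0, 0.5, DIVISIONS)
no_hit = find_division(5.0, 5.0, DIVISIONS)
```

Real administrative boundaries have holes, multiple parts, and far more vertices, so a production system would normally rely on a spatial index and a geometry library rather than a linear scan like this one.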
This feature lets you browse a catalog by filtering datasets by territory. When a data portal covers multiple territorial levels, it is then possible to move up and down between territorial levels.
For this type of navigation to be useful, the geographical coverage associated with each dataset must be based on a set of reference administrative divisions that takes into account the breakdown and specifics of the country, such as the “Provinces” in Canada. The same applies to how the two processors mentioned previously work.
Over the years, geographic data has been added repeatedly to the Data Network. As a result, the georepository stack has grown considerably and become too overloaded to remain relevant. For example, there are more than 80 datasets related to French administrative boundaries.
For people looking for reusable, quality geographical data, including our customers, it is difficult to identify the most reliable datasets in a vast catalog with multiple sources, various versions, and sometimes conflicting names.
With these objectives in mind, our team chose to set up a data pipeline to store and prepare the reference data prior to publication. This processing plant, named DataSeed 🌱, is a platform that automates a cycle of operations that is particularly tedious and time-consuming when carried out manually on an ever-expanding volume of data.
DataSeed is currently capable of automating the following operations:
After leaving the internal DataSeed platform, the reference datasets no longer require edits or cleanups on the fly. All of the operations are automated upstream, so that the data repositories loaded onto the Opendatasoft platform and on the Data Network are ready for use.
There are several possible ways to use them:
👉 Download a repository from the Data Network or use it in an external service through the application programming interface available in the dataset’s API tab.
👉 Display the repository in your catalog, filtered by the appropriate region’s administrative divisions, for example, using the federation feature. This feature fetches the dataset directly from the Data Network catalog using your Opendatasoft portal’s administration interface. There’s no need to download it and then manually import it into your own catalog. This way, whenever the federated dataset is updated, your catalog is also updated automatically.
👉 Enhance an existing dataset on your portal with a repository, using the processors described earlier (geographic join and retrieve administrative divisions).
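For the first option, calling the API from an external service, the sketch below only builds a records URL and does not send any request. The path layout is an assumption modelled on the Opendatasoft Explore API, and georef-countryname is the placeholder used in this article, not a real dataset identifier. Copy the exact URL from the dataset’s API tab rather than relying on this helper, whose name `records_url` is hypothetical.

```python
from urllib.parse import quote, urlencode

# Data Network domain named in the article.
BASE_URL = "https://data.opendatasoft.com"


def records_url(dataset_id, limit=10, base_url=BASE_URL):
    """Build a records URL for a dataset (assumed Explore API path layout)."""
    # Assumption: the path below mirrors the Explore API; verify it in the API tab.
    path = f"/api/explore/v2.1/catalog/datasets/{quote(dataset_id, safe='')}/records"
    return f"{base_url}{path}?{urlencode({'limit': limit})}"


# "georef-countryname" is a placeholder; replace it with a real dataset identifier.
url = records_url("georef-countryname")
```

Keeping the base URL as a parameter makes it easy to point the same helper at your own portal once the repository has been federated into your catalog.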
Currently, data producers and reusers in Germany, Canada, France, and Mexico can use up-to-date geographical referentials. Other countries, including Australia, Belgium, and the United States, will be added to this list in the coming months.
With a continued focus on geographical referentials, the data team is already working on creating sets for other useful themes. We plan to include georeferenced data on demographics, housing, and even employment.
Currently, repositories can be found on the Data Network by searching for the text georef-countryname, plus a filter on the portal name (Public) and on the theme (Geographical Referentials). Although this works well, accessing repositories in this way requires some prior knowledge. To make them more directly discoverable, geographical referentials will soon be added to the Data Network’s Repositories page.
This repository visibility issue is part of the process of redesigning the Data Network, whose exploratory phase began in the fourth quarter of 2020. The entire Opendatasoft team is poised to build and maintain a system to encourage more practical data use and gather a community of data enthusiasts who produce, enhance, share, and use data on a daily basis.