Last week, we introduced several new features on the front-end and API side of our platform. But that was only half the fun: what is the use of clustering that can handle a million points if it cannot be run on large datasets that have been formatted to be useful to everyone, and kept up to date over time without redoing the same work over and over?
Today, we are introducing our brand-new data processing layer, which provides a rich set of features to prepare and enrich data before publishing it.
It is built around a set of “processors” that are easy to use out of the box, but that also support advanced and complex expressions (e.g. Excel-like formulas, regular expressions…). Each processor can work with the result of another, so they can be combined into a powerful transformation pipeline. And, as usual with our publishing interface, a real-time preview of the end result is displayed, making it safe to experiment without any unwanted consequences for already-published data.
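To make the pipeline idea concrete, here is a minimal sketch of how chainable processors can work, with each processor taking a record and returning a transformed one so that steps compose in order. The function and field names here are invented for illustration and are not the OpenDataSoft API.

```python
# Hypothetical sketch: each "processor" is a function that takes a record
# (a dict) plus a target field and returns the transformed record, so the
# output of one step is the input of the next.

def trim_spaces(record, field):
    """Remove leading/trailing whitespace from one field."""
    record[field] = record[field].strip()
    return record

def capitalize(record, field):
    """Capitalize each word in one field."""
    record[field] = record[field].title()
    return record

def run_pipeline(record, steps):
    """Apply each (processor, field) step in order."""
    for processor, field in steps:
        record = processor(record, field)
    return record

city = {"name": "  new york  "}
result = run_pipeline(city, [(trim_spaces, "name"), (capitalize, "name")])
print(result)  # {'name': 'New York'}
```

Because every step has the same shape, reordering, adding, or removing a transformation is just editing the list of steps.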
The goal of these features is to cover enough requirements to remove any need for an old-school ETL before publishing data with our platform.
Here are a few examples of data preparation and enrichment that can be done using the processors set:
- directly insert geographical coordinates, based on simple text addresses (a.k.a. geocoding), so that the dataset can be displayed on a map
- apply text transformations that make values consistent (normalize, capitalize, trim spaces…)
- calculate numeric values using mathematical expressions
- split, join, replace text values; use regular expressions to extract parts from an expression
- format and normalize date values
- apply transformations on geographical data to make it consistent
- skip lines based on certain criteria (e.g. in order not to publish cities with fewer than 100,000 inhabitants)
- create new lines based on a single one with transposition formulas
- join lines with another dataset (like a database table join), for example to enrich lines with data from a reference dataset (this also includes crossing data with public datasets that are already hosted on the platform).
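Two of the steps above, skipping lines below a threshold and joining against a reference dataset, can be sketched as follows. The datasets and field names here are invented for illustration; the platform performs the equivalent operations through its processor interface rather than in code.

```python
# Hedged sketch: filter out small cities, then enrich each remaining line
# by joining it with a reference dataset on a shared key.

cities = [
    {"name": "Lyon", "population": 513_000, "country_code": "FR"},
    {"name": "Giverny", "population": 500, "country_code": "FR"},
    {"name": "Turin", "population": 870_000, "country_code": "IT"},
]

# Reference dataset used for the join, indexed by its key.
countries = {
    "FR": {"country": "France"},
    "IT": {"country": "Italy"},
}

# Skip lines: keep only cities with at least 100,000 inhabitants.
kept = [c for c in cities if c["population"] >= 100_000]

# Join: merge the matching reference record into each remaining line.
enriched = [{**c, **countries[c["country_code"]]} for c in kept]

for row in enriched:
    print(row["name"], row["country"])  # Lyon France / Turin Italy
```

The same two-step logic (filter, then enrich) applies whatever the criteria or the reference dataset.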
And the good news is that all of these steps are saved, so there is no need to redo them each time the data change.
In the following short video, you will see three different examples of how OpenDataSoft processors can be used to enrich data before publishing it:
We are gradually rolling out these new features to our customers, so if you want us to guide you through them with a hands-on session, feel free to ask; we will be happy to help 🙂
And since we will keep expanding the list of processors, don’t hesitate to tell us if you feel one is missing.