What is a dataset?
A dataset (or data set) is a collection of related data points, stored in the same location, such as a table.
Each data point can be text, numbers, geographical information, or multimedia (such as an image or video).
For example, a simple, tabular dataset created by a retailer might include columns representing variables (the type of clothes, color, and stock level), while each row holds the values for one item, as shown in the example below:
When describing data within a dataset, a hierarchy is followed, going from smallest to largest:
- Data point: The smallest element of data that cannot be further subdivided. “Shirt”, “Black” or “2” are all data points in the table above.
- Data object: A collection of grouped, related data points that fit together. For example, “Blue shirt with 4 in stock” is a data object.
- Dataset: All of the data within the table.
Each data point within the dataset can be accessed individually, and all of them share the same theme: in the example above, all data points relate to clothes inventory.
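The hierarchy can be sketched in a few lines of Python. This is only an illustration: the rows are built from the values quoted in the text, the pairing of "Black" with a stock of 2 is an assumption, and the names `inventory` and `describe` are hypothetical.

```python
# Dataset: the whole table, modeled here as a list of rows.
# The values come from the text; pairing "Black" with 2 is an assumption.
inventory = [
    {"type": "Shirt", "color": "Black", "stock": 2},
    {"type": "Shirt", "color": "Blue", "stock": 4},
]

# Data object: one row, i.e. a group of related data points.
blue_shirt = inventory[1]

# Data point: a single value that cannot be subdivided further.
stock_level = blue_shirt["stock"]


def describe(item):
    """Render a data object as a short human-readable phrase."""
    return f"{item['color']} {item['type'].lower()} with {item['stock']} in stock"


print(describe(blue_shirt))  # Blue shirt with 4 in stock
```

Each level stays individually addressable: the dataset by name, the data object by row, and the data point by column within that row.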
Different datasets might be related, with these relationships described through data schemas. In our example, a second dataset might include the date and price of each sale of an item of clothing listed in the first dataset. The data schema explains how the two datasets interrelate.
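As a rough sketch of such a relationship, the example below links a sales dataset to the inventory dataset through a shared key. The key name `item_id`, the identifiers, and the sale date and price are all invented for illustration; the text does not specify how the two datasets are actually linked.

```python
from datetime import date

# Inventory dataset, keyed by a hypothetical item identifier.
inventory_by_id = {
    "SH-BLK": {"type": "Shirt", "color": "Black", "stock": 2},
    "SH-BLU": {"type": "Shirt", "color": "Blue", "stock": 4},
}

# Sales dataset: each sale points at an inventory item through item_id.
# The date and price are made-up illustrative values.
sales = [
    {"item_id": "SH-BLU", "sold_on": date(2024, 1, 15), "price": 19.99},
]


def join_sales_to_inventory(sales_rows, items):
    """Attach the matching inventory fields to every sale."""
    joined = []
    for sale in sales_rows:
        item = items.get(sale["item_id"])
        if item is None:
            # The schema says every sale must reference a known item.
            raise KeyError(f"unknown item_id: {sale['item_id']}")
        joined.append({**sale, **item})
    return joined


for row in join_sales_to_inventory(sales, inventory_by_id):
    print(row["sold_on"], row["color"], row["type"], row["price"])
```

The schema here is simply the rule that `item_id` in one dataset must match a key in the other. A real schema would usually also document field types and units.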
How can you reuse a dataset?
Datasets are intended to be shared, whether internally or externally. They therefore require supporting elements and tools to allow their reuse.
Metadata is all the information about the dataset: license, creation/modification date, producer, data model used, etc. This information reassures the reuser about the reliability and quality of the dataset. Some business sectors require the use of specific metadata to meet interoperability needs.
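A minimal sketch of a metadata record might look like the following. The field names mirror the items listed above but are not taken from any particular metadata standard, and `DatasetMetadata` and `missing_fields` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetMetadata:
    """Illustrative metadata record; fields are assumptions, not a standard."""
    title: Optional[str] = None
    license: Optional[str] = None
    created: Optional[str] = None      # e.g. an ISO 8601 date string
    modified: Optional[str] = None
    producer: Optional[str] = None
    data_model: Optional[str] = None


def missing_fields(meta):
    """List the metadata fields that have not been filled in."""
    return [f.name for f in fields(meta) if getattr(meta, f.name) is None]


meta = DatasetMetadata(title="Clothes inventory", license="CC BY 4.0")
print(missing_fields(meta))
```

A check like this can flag incomplete records before a dataset is published, which supports the reliability that reusers look for.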
In its raw form, a dataset can be difficult to analyze. That’s why most datasets shared by organizations are accompanied by data visualizations, or at least by tools to create them. These can be simple views like maps or graphs, or more advanced formats like dashboards or data stories.
APIs are essential when retrieving large datasets in real time, and are generally provided by the producer of the dataset. Once connected, they allow you to automate the retrieval of information that is always up to date.
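Automated retrieval often means paging through the records. The sketch below assumes an API that returns a JSON array and accepts `limit` and `offset` query parameters. Those parameter names, and the URL in the usage comment, are placeholders rather than any real producer's API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_fetch_page(base_url, limit, offset):
    """Fetch one page of records; assumes the API returns a JSON array."""
    query = urlencode({"limit": limit, "offset": offset})
    with urlopen(f"{base_url}?{query}", timeout=10) as response:
        return json.load(response)


def fetch_all(fetch_page, page_size=100):
    """Call fetch_page(limit, offset) repeatedly until a short page arrives."""
    records = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        records.extend(page)
        if len(page) < page_size:
            # A page shorter than requested means there is nothing left.
            return records
        offset += page_size


# Hypothetical usage against a placeholder endpoint:
# records = fetch_all(
#     lambda limit, offset: http_fetch_page(
#         "https://example.com/api/records", limit, offset
#     )
# )
```

Keeping the paging loop separate from the HTTP call lets the same loop work with any API that exposes a limit and an offset, and makes it easy to test without network access.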
What can datasets be used for?
Datasets are essential to creating value from data. Consequently, the number and size of the datasets that an organization has collected and made available, internally and externally, indicate how advanced its data sharing strategy is.
Internal uses to improve efficiency
- by data experts: Datasets can be collected in data warehouses or data lakes and then analyzed and queried using business intelligence tools.
- through self-service: Datasets can be made available through a central data catalog to everyone within the organization, enabling them to be used for better decision-making and improved operations.
- for training AI: Artificial intelligence algorithms learn by understanding the relationships between data points within datasets, allowing them to make more informed decisions. Training them therefore requires access to very large volumes of data, from one or more datasets.
External uses to increase transparency
- through open data: Public open data portals typically contain a large number of datasets, grouped into specific areas or themes. For example, UK Power Networks’ open data portal contains 39 datasets. These vary in size – one contains a complete list of its electricity distribution pylons (containing over 47,000 data points), and another is a list of all local authorities in its distribution area (116 records).
- for hackathons/competitions: Sharing datasets with the wider community not only increases transparency but also opens up opportunities for innovation. Releasing specific datasets for use in hackathons or competitions invites new ideas from inside or outside the organization.
External uses to create new services
- with a specific ecosystem: Datasets can be shared externally, either with a specific partner or with a wider but closed ecosystem. Schneider Electric’s Exchange data marketplace shares 195 energy-related datasets with 540 users from 200 companies, enabling it to increase value for its partners and to launch new data services.