Interview with Emmanuel Letouzé – On Open Data, Big Data, Data Science, and Algorithms

Open Data, Big Data, Data Science, and Algorithms: what's the connection between these four very different subjects?

Brand content manager, Opendatasoft

Back in December, we took part in the Open Government Partnership 2016 Global Summit. While attending, we had the opportunity to speak with Data-Pop Alliance co-founder and Director Emmanuel Letouzé about Open Data, Big Data, Data Science, and Algorithms, four very different subjects that nonetheless seem to be connected. Data-Pop Alliance is a global coalition on Big Data and development created by the Harvard Humanitarian Initiative (HHI), MIT Media Lab, the Overseas Development Institute (ODI), and, more recently, the Flowminder Foundation. The Alliance works to “promote a people-centered Big Data Revolution through collaborative research, capacity building, and community engagement.”

Emmanuel Letouzé is the author of UN Global Pulse’s White Paper “Big Data for Development.” We spoke with him to get deeper insights into what Open Data and Big Data mean, the role of Data Literacy today, and the connection between Open Data and Open Algorithms.

Question: In your own words, what is the connection between Open Data, Big Data, and Data Science?

Let’s say they’re all parts of what’s been referred to as the Data Revolution: Big Data and Open Data I would say are two of the major components of the larger data ecosystem, which suggests that it’s not always about the data. This is easy to understand when people talk about Open Data, that this refers to a movement and not just the fact that the data is open. These are not just open data. We are talking about Open Data as a concept and a movement with actors, stakeholders, objectives, standards, and principles; there’s a theory of change behind it that transparency will change incentives and behaviors, and therefore is going to change outcomes.

I think the same applies to Big Data. The same way that Open Data is not just open data, Big Data is not just big data. I distinguish the two: I think and write about Big Data as a socio-technological phenomenon, or a movement if you will, with a capital B and capital D, as with Open Data, and I contrast that with big data, small b and small d, when I talk about the data themselves. In that case, I would say, big data are large datasets.

When it comes to their intersection and their overlap, there are many differences, the first being that we don’t want big data to be open data. That’s where you have privacy, security risks, and other consequences. But at the same time, increasingly, the Big Data and Open Data movements, or these communities, are merging, meeting, mingling, and the theory and spirit of change behind the Open Data movement has been influencing Big Data, in the sense of advocating for greater participation of data subjects and greater transparency. But transparency of what? Again, it’s not transparency of raw data, but transparency of the processes, so algorithms, and also of the objectives: why do you want to apply so-called “Big Data” techniques, and what are you trying to achieve? These are principles that are at the heart of the Open Data movement, and they have diffused to Big Data, which was always, or at least initially, kind of a secretive, elite activity, along the lines of Minority Report.

Last, when it comes to Data Science, there’s a famous Venn diagram from 2010 that says Data Science is the intersection of math, hacking skills, statistics, and expert knowledge. But in a sense, it is the science of analyzing data.

I refer to Big Data, to go back to the distinction, in terms of three C’s: the Crumbs, the Capacities and the Communities, and in a sense, Data Science pretty much falls under Capacities. Which capacities, human or technological, are needed to make sense of Big Data as the crumbs? It’s a subset of Big Data as an ecosystem.

Question: Bearing in mind the intricacies of French and English, you differentiate how you express Big Data and Open Data, using the different French ways of saying 'the Big Data' or 'the Open Data' to make your above distinction clear. Is there an equivalent way of doing this in English?**

To some extent, yes. Let’s use the analogy of how we refer to the United States.

We say ‘The United States is’ because the United States is a concept. 150 years ago, however, people would say ‘The United States are’ to refer to the states themselves.

The analogy is the same. If you say “big data are,” or “open data are,” you talk about the data, which happen to be qualified as ‘big’ or ‘open.’ When you say “Open Data is,” then you talk about the movement, the concept, something larger than its core. This also goes for Big Data.

When people try to translate Big Data into French or Spanish, it’s always been very hard. In French they’ll say ‘les données massives,’ literally ‘the massive data.’ They’re back to talking about the data only; they’re not talking about data science as well, for instance. In Spanish, it’s the same: they’ll say ‘datos grandes,’ but again, why only focus on the data? It’s very misleading!

In what I’ve seen and read in France, for example in literature or in newspapers, some people talk about Big Data as ‘les big data’ (the big data, with data being a plural noun) or as ‘la big data’ (the big data, with data being a singular feminine noun, as indicated by the feminine definite article, thus referring to the word ‘donnée,’ a single piece of data), which gets closer to a concept. But I think the cleanest one, the one closest to my idea, is ‘le big data’ (here using the masculine definite article, le), referring to the concept, the phenomenon, and the movement. We could say that it’s very similar for Open Data.

Question: Going to a new question about data and transparency, you talk a lot about literacy in the era of data. What do you mean by that?

Everyone likes data literacy. Who wouldn’t? But what is it, really?

If you think about it quickly and therefore very superficially, it’s the ability to use, understand, and analyze data. Something like that. Put otherwise, it’s an ability or a set of skills.

But then when you think about it more, you start to think ‘alright, so what are these skills?’ They’re the ability to collect, understand, and crunch.

So then, you start asking ‘what does a data literate society look like?’ If you’re using this skills-based definition, you could say that it’s a society full of Amazon Data Scientists. This would be a highly data literate society, just like a society of junior NSA analysts. But then you’d likely start thinking that there’s something off; there has to be something more than just a set of skills. What is this more? How is it different?

So to answer that question, I went back to the definition of literacy and its historical role. The key point, when we talk about data literacy, is to try to understand what is meant and expected from it. What do we mean by data literacy, and why do we think it’s a good thing? Both are correlates of one another.

In a nutshell, if you go back to the writings of Claude Lévi-Strauss, the French anthropologist, in Tristes Tropiques he has a whole page on the historical role of writing and literacy. He says that historically, writing is supposed to be this sort of marvelous, empowering invention where people could become literate and free themselves from the dark ages, and from there, democracy would ensue.

Well, he says, it’s exactly the opposite that happened. Writing was invented as a means of control, allowing the powers that be to assert and exert their power over the masses, because when you have writing, you can keep track of things; you can organize cities, armies, and entire economies. The same is true with literacy.

We think literacy was invented for the greater good of the people, and would have a simple mechanistic, positive effect on societies, and to help build more democracy, etc. He says it’s actually the opposite. He says, in the late 19th century when the young nation-states of Europe were building themselves, when the states were building the nations, literacy campaigns were organized as a means of and for entrenching power of the young nations; when you have the masses that are literate enough, then you can create factories complete with foremen and workers; people pay taxes, everyone needs to know the law, and everyone needs to follow it. It’s truly a means of control.

So now the question was, is the same thing happening for data science and with data literacy? With the caveat that when literacy became about more than just being able to read and write, when it was expanded and thickened in its meaning and objective to also be about being a free agent able to critically assess not just text but also radio shows, and able to engage in discussions, only then did it become a source of very deep, powerful social change.

And actually when you look at the definition of literacy by UNESCO, there is no mention of reading or writing. So when people have this notion that literacy is about reading and writing, it’s wrong. You can be literate without being able to read and write, as is the case in many societies. So literacy is more of a range of capabilities that allow you to reach your goals in life, where you can create opportunities for yourself.

So getting back to data literacy, if we have a skills-based conception of data literacy, we fall back into the same traps as when we thought that literacy was only about reading and writing at a basic level.

So I prefer to talk about literacy in the age of data, which means I don’t really care if you’re data literate; what matters is that people are free agents who are able to discuss, argue, make decisions, or ask their representatives, in whatever shape or form they may come, to be accountable. Increasingly that means being able to understand data, being able to question a graph, and using data in your daily life (even to use something like a Google map, you need to be somewhat data literate), but this goes above and beyond a purely skills-based conceptualization of what it means to be data literate.

Question: So then, do data visualizations play a role in enabling data literacy?

That’s one way, but as we know, data visualizations can still be very misleading. There are whole theories about how to lie with statistics, so arguably that can be done with data visualizations as well. Someone who is data literate would not be fooled by a graph that doesn’t start with a zero on the Y-axis, for instance.
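The truncated-axis example above comes down to simple arithmetic, sketched here in Python. This is an illustration added for clarity, not something from the interview; the function name `apparent_ratio` and the sample values 50 and 52 are assumptions chosen to make the effect visible.

```python
def apparent_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights for values a and b.

    A bar is drawn from axis_start up to its value, so the visible
    height is (value - axis_start), not the value itself.
    """
    if min(a, b) <= axis_start:
        raise ValueError("both values must lie above the start of the axis")
    return (b - axis_start) / (a - axis_start)


# Hypothetical values: 52 is only 4% larger than 50.
print(apparent_ratio(50, 52))                 # axis starts at zero
print(apparent_ratio(50, 52, axis_start=49))  # axis starts at 49
```

With the axis starting at zero, the second bar is drawn 1.04 times as tall as the first, which matches the real difference. With the axis starting at 49, the same two values are drawn with heights 1 and 3, so the second bar looks three times as tall.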

As for what it means and requires for societies to be literate in the age of data, we have defined it in our paper entitled ‘Beyond Data Literacy’: getting into things like empowerment and engagement, it is the willingness and ability to constructively engage in society through and about data. This could be in many different ways: by being a data advocate or an activist asking private companies to release their data, it can be about reading the news, it can even be about asking the question of what data is! Everything is data. Your shirt could be sending me information, so in a sense, what is not data?

So when you expand this definition to almost everything, it becomes a watery concept, essentially requiring literacy in every subject. You have to go back to a more concrete level to say what the building blocks of data literacy would be. And so we developed curricula, trainings, and toolkits for the building blocks of data literacy, as we defined it, and we think that if you master the four skill sets we’ve built, you’re on a pretty good path to being literate in the age of data.

Question: So during the panel, the jury decided that algorithms were not democratic by a 3-2 vote. What did you think about that decision?

I wasn’t surprised. I was actually more surprised that it was not a 5-0 vote. The question itself is moot; it’s like asking if a table is democratic. There are not a lot of things that are democratic or non-democratic in and of themselves. There are things that are by their nature undemocratic, but I don’t think that algorithms fit into this category because democracy is based on algorithms; it’s a set of rules! The notion that you vote, you count the votes, and whoever gets the most votes is the winner, this is an algorithm! So I don’t see how an algorithm by its nature can be undemocratic, since democratic societies have been upheld by them for so long.
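The counting rule described above (vote, count the votes, most votes wins) can indeed be written down in a few lines. The following is a minimal sketch in Python added for illustration, not something from the interview; the function name `plurality_winner` and the choice to return `None` on a tie are assumptions.

```python
from collections import Counter


def plurality_winner(ballots):
    """Return the candidate with the most votes.

    Returns None if there are no ballots or if the top two
    candidates are tied (tie handling is an assumption here).
    """
    tally = Counter(ballots)          # count the votes per candidate
    if not tally:
        return None
    ranked = tally.most_common()      # candidates sorted by vote count, highest first
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                   # no single candidate has the most votes
    return ranked[0][0]


# Hypothetical ballots: A gets 3 votes, B gets 2, C gets 1.
print(plurality_winner(["A", "B", "A", "C", "A", "B"]))  # A
```

Even this tiny rule embeds a decision that somebody had to make, namely what happens on a tie, which is the kind of design choice the rest of the answer argues should be open to discussion.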

So now if you’re asking about these new types of algorithms that run and feed on very different kinds of data other than votes, that feed on crunched personal data, that have functions or features like classification, prediction, and finding patterns and correlations that humans are not able to find; if we are saying these are inherently undemocratic, then I would disagree. There are many ways that they can become democratic in their design. They can be written and designed in a participatory manner; we can imagine ways that people would vote on their algorithms in the same way that they design their laws, so I think it’s more a question of process.

They can serve democratic principles beyond being governed by democratic processes; they can be (and there are caveats!) more fair than many human decisions. For instance, you can think about getting a boat permit in a harbor. There are wait lists, and in a corrupt system, the person who will get it first is usually the mayor’s friend. There are lots of discriminatory behaviors. So if you put a rule in place, and this happens to be an algorithm saying ‘you’ve waited this long, your boat is this long, you get the permit (or not)!’, I think there would be no problem with these types of decisions being made by algorithms.
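A rule of the kind described above could look like the following minimal sketch in Python. It is an illustration added here, not a rule from the interview: the `Applicant` fields, the `allocate_permits` function, the maximum boat length, and the sample applicants are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Applicant:
    name: str
    days_waited: int
    boat_length_m: float


def allocate_permits(applicants, permits_available, max_boat_length_m):
    """Give permits to eligible applicants in order of time waited.

    Assumed rule: a boat is eligible if it is no longer than
    max_boat_length_m, and the longest-waiting applicants go first.
    """
    eligible = [a for a in applicants if a.boat_length_m <= max_boat_length_m]
    # Longest wait first; ties broken by name so the result is reproducible.
    ranked = sorted(eligible, key=lambda a: (-a.days_waited, a.name))
    return [a.name for a in ranked[:permits_available]]


# Hypothetical wait list.
applicants = [
    Applicant("Ana", 400, 9.0),
    Applicant("Ben", 700, 14.0),   # boat longer than the assumed 12 m limit
    Applicant("Chloe", 650, 8.5),
    Applicant("Dev", 120, 7.0),
]
print(allocate_permits(applicants, permits_available=2, max_boat_length_m=12.0))
# ['Chloe', 'Ana']
```

Nothing in the rule looks at who the applicant is, only at how long they have waited and how long their boat is, and because the rule is written out, anyone can check a decision against it or argue that the criteria should change.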

So then the algorithms would be public good algorithms about assigning public goods for housing, unemployment, public investment, education, etc. They have to be open, transparent, and subject to discussion and redress if the outcome turns out not to be the expected one. But I think they could be a powerful tool to enhance citizen oversight and participation, if citizens have a say in how they are made, which I think right now is just as much a black box as algorithms are said to be.

Question: How does the open algorithm contribute to opening more data?

All algorithms contribute to producing more data because the outcome of all algorithms is or are data, in the sense that it’s a yes or a no, left or right, 10 or 100, an R². There’s some type of output that is data. In a sense, an algorithm is something that consumes a lot of data and turns it into an outcome or decision, which then spits out more data. But that’s only one kind of data that the algorithm produces, and this data can then create new information, knowledge, new outcomes; once you produce new data, you produce new insights saying ‘oh this is what it means’, so then it becomes information that spreads and grows and can be further opened up.

I’d also say that algorithms create new types of data, if we expand our definition of data to beyond just a number. The information, knowledge, and outcomes can change behaviors and incentives in the process, likely able to be measured as data. This would be indirect data production via algorithms.

Question: Do you have any final thoughts that you'd like to conclude with?

The fundamental question that we should be asking is this: what are we saying that data is, in its various shapes and forms (and it comes in very many different shapes and forms)? Again, it can be even as simple as a pixel or a post-it on a wall; these are data. Just a piece of paper that is colored is data, and can mean something.

Even if we take a more restricted definition of data, like numbers or responses to a survey, what are we saying that data and information have done for and in the world historically, and especially what good, for whom, by whom, and how? So it really gets to the theory of change that we are promoting when we say ‘data for good,’ ‘data for development,’ or even ‘data for democracy.’ We realize in many ways that it’s not as simple as one thinks; lots of bad things are happening with full information, such as climate change, smoking, and genocides. Not everyone, but a sufficiently large number of people seem to know what is happening, and yet there is almost no response.

I think that what frustrates me in some of these discussions is the notion that governments are necessarily well-intended and well-meaning. We talk about the benevolent policy maker who makes rational decisions based on data, that if only they had more and better data they would be able to make better decisions. This isn’t really the way laws are made. The decisions are not necessarily irrational, but people are driven by politics as much as, if not more than, by facts, so if we want to improve the system, it’s more about changing the machinery itself. This is where citizen empowerment comes in. Data is (not are), as a phenomenon, going to be much more disruptive than any citizens, governments, or even corporations think.

**

Emmanuel differentiates in French three different ways of saying Big Data, each corresponding to the three different ways of expressing the definite article (the – le, la, les) in French.

***

During the Open Government Partnership Summit, a panel debated in courtroom style (with a jury made up of audience members and experts answering questions asked by ‘lawyers’) whether or not algorithms were democratic. Mr. Letouzé was one of the experts defending algorithms.
