The only sustainable strategy is adaptation
Data is all the rage. There’s a proliferation of data analysts, data scientists, data architects, data engineers, data storytellers. Everywhere you look, businesses, organizations, policies, decisions, and processes are all touted as data-driven. One starts to get the sense of the old advertising tropes – “New & Improved! Now with XYZ…!” – as though being perceived as data-driven were itself an improvement. Were you not basing anything on data before and just making it all up as you went?
Well, obviously not. Data is not new, but our ability to handle it and derive insights from it has arguably been vastly improved. We now have incredibly sophisticated methods and far more computing power to wield them. There is, however, a subtle and often missed question lurking in the background. What, exactly, is data?
It may seem a specious question at first, but how we define what is and (especially) is not data plays a large part in how successful our attempts to derive insights from it will be. This is so important that I’m actually going to skip right to the punchline:
It isn’t data until you have a question.
Almost anything can be data, but what actually is data ultimately depends on what question you are trying to answer. Consequently, for it to be good data there also has to be a good question behind it. We all know that arriving at the right answers requires asking the right questions, but identifying the right data for a given question can be surprisingly difficult.
To give an illustration of why data needs a question – a why to go with the what – I’ll use what may seem to be an odd example. Franz Boas is generally regarded as the father of modern anthropology in the United States. A fascinating and somewhat controversial historical figure, Boas nonetheless had an enormous influence on the study of human history and behavior.
Boas’ early career was in natural history as a museum curator, which may have contributed to his later methods for a scientific anthropology. In short, he collected everything related to a group or culture – artifacts, stories, languages, art, folklore… all of it. There are notebooks, tomes, and encyclopedic volumes just for the catalogs of all the things they collected. Boas and his contemporaries filled museums, and entire careers were made just from cataloguing it all – over decades!
The basic intuition was that a science of human behavior had to be empirical. If they gathered all the empirical data possible, the universal patterns they were looking for should become obvious once enough of it was available. It certainly seemed a reasonable approach. That’s not quite what happened, though.
Instead, most of that material ended up gathering dust, and anthropologists looked for a different way to go about being a science. Why? In no small part because there was too much of it to even begin looking for patterns. Since no real question was ever formulated beyond a broad “why are people different,” there was no way to know what counted as a pattern. They had a lot of potential data, but with no question they had no way to treat any of it as data.
In an age where everyone is trying to hoover up as much data as possible in hopes that it will somehow drive value, understanding this difference becomes exceedingly important on multiple levels. It’s critical to know what questions are being asked, and whether these volumes of data are really worth the costs. Not just economically, but also ethically – should we be collecting data without a clear idea of why? Are there collective benefits to go with those costs?
For potential data to be useful data, there needs to be a question. Data scientists may sometimes refer to this as feature selection or feature engineering, but what it really means is that for a measurement or observation to be data it has to be directly related to an outcome or question.
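To make that concrete, here is a minimal sketch in Python of how the same pile of observations sorts itself differently depending on the question. Everything in it is an assumption for illustration: the column names, the numbers, and the choice of a plain Pearson correlation as a crude stand-in for feature selection. None of it comes from a real dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical observations about six customers, invented for illustration.
observations = {
    "store_visits":  [1, 2, 3, 4, 5, 6],
    "support_calls": [5, 1, 4, 2, 6, 3],
    "shoe_size":     [8, 10, 9, 9, 10, 8],
}

def rank_for_question(outcome, observations):
    """Rank columns by how strongly each one relates to the outcome."""
    scores = {
        name: abs(pearson(values, outcome))
        for name, values in observations.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Two different questions, expressed as two different (hypothetical) outcomes.
spend = [10, 22, 29, 41, 52, 58]   # "What goes along with higher spending?"
churned = [1, 0, 1, 0, 1, 0]       # "What goes along with customers leaving?"

ranking_spend = rank_for_question(spend, observations)
ranking_churn = rank_for_question(churned, observations)

print("For the spending question:", ranking_spend)
print("For the churn question:   ", ranking_churn)
```

With these made-up numbers, store visits come out on top for the spending question and support calls for the churn question, while shoe size stays at the bottom for both. Until one of those questions is asked, all three columns are just observations.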
There has to be a context to make observations into data. All too often in the rush to chase value, that context is an afterthought. In reality, to be data-driven means being question-driven from the outset. It’s the question that determines whether data can become information, which ultimately is the goal.