Physicists tell us that only a small percentage of the universe can be seen directly. In fact, the vast majority of the universe is what they call “dark” — composed of matter and energy that we know is there but that we can’t observe, identify or analyze directly.
The concept of “dark matter,” first described in 1933 by physicist Fritz Zwicky, has since evolved into one of the cornerstones of modern physics. In fact, scientists have posited that up to 96 percent of what’s “out there” in the universe is either dark matter or dark energy. More interesting than that: In the 75 years since its discovery, we have yet to figure out just what exactly it is.
Sounds pretty fantastical, right?
Fantastical sounding though it is, there's actually a lot of good evidence for why scientists believe this. They can, for example, measure the gravitational interaction between two observable bodies and account, fairly precisely, for the amount of mass that must be there, even though they can't see it directly. In other words, the indirect evidence lets them infer the existence, and some of the properties, of dark matter and dark energy.
It makes perfect sense. It's kind of like a stopped drain: You can't see what's causing the blockage, but you know for a fact that something is (because otherwise your sink wouldn't be overflowing), and you can tell something about it (like how thick the clog is) based on how slowly or quickly water moves through it.
‘Dark’ Data
Interesting though this concept is on its own merits, it's not one we normally encounter in IT. But there's a very practical reason I'm bringing it up: just as the vast majority of the universe is "dark," so is the vast majority of the data in our enterprises.
What I mean by this is that we all know that our networks and other infrastructure process a tremendous amount of data. Some of this data we know pretty well — compliance activities might have charted some of it out, some of it might be associated with a particular highly visible application set that we’re intimately familiar with, and some of it might be so business-critical that we always have one eye on it to make sure everything runs smoothly. But how much of the total data are we aware of? Definitely not 100 percent. Probably not 50 percent. Twenty percent? Ten?
In fact, when it comes down to brass tacks, most of us are in the unfortunate situation that the vast majority of what travels over our networks is, for lack of a better word, "dark." We know it's there; it has to be in order for our businesses to run smoothly. We see it move from place to place when we chart out things like bandwidth utilization or overall traffic patterns. But we don't know, with any degree of certainty, what it is, where it's going, where it came from, or why. It's dark data.
For the security organization, this can make for a particularly stressful state of affairs. At best, this dark data is related to legitimate business activity (meaning, of course, that we need to protect and safeguard it). At worst, it can be any number of things that we don’t want: malware, unauthorized/illegal activity, inappropriate user traffic (e.g. pornography, gambling), etc. Being chartered with safeguarding something that we have no knowledge of is never good — particularly when there’s so much of it. And preventing something that we have no knowledge of is even worse. Unfortunately, however, that’s where we are.
Step 1: Quantify the Problem
Given this set of circumstances, the challenge should be obvious: This unknown data poses a risk to the firm, we are chartered with reducing risk, and therefore we must minimize the amount of unknown data. In other words, we need to maximize what we do know and minimize what we don't.
Easy to say, hard to do.
However, it's important to realize that this isn't a completely unsolvable problem, at least if we approach it from a practical point of view. The temptation is to become overwhelmed by the problem and either a) ignore it or b) spawn an ineffectual, expensive, and overly complicated initiative that's destined to fail from the get-go. Neither of those strategies works, at least not usually.
Instead, a much more practical approach is to borrow a page from the physicists' playbook. The physicists who study dark matter can tell you quite precisely what percentage of the universe they don't understand; they can say with a high degree of certainty that 96 percent of it is dark. In IT, most of the time, we aren't even there yet. Put another way, we don't even know what we don't know.
So the first step is to understand the scope of the problem. Start with an inventory of what you do know about your data. If you're in a large organization, there are already people working overtime on exactly this problem (albeit for different reasons); they just probably won't share their results with you unless you ask. Individuals in the compliance arena, for example, are working hard to catalog exactly where your organization's regulated data is. SOX, HIPAA, or PCI compliance initiatives are often a good repository of metadata about what information is stored, and where, for their particular area of interest. Folks working on business impact analysis (part of disaster recovery) probably have a pretty good idea of what data critical business applications use and how.
If you consolidate these sources of information into a "known universe" of your enterprise's data, you get a much broader view than you would otherwise have. Take the metadata (data about the data) that already exists and consolidate it. The goal here is not to get to 100 percent; it's to create a foundation you can build on in the future. Plenty of opportunities will arise to learn more about the data in your firm, but unless you have a place to record what you learn, it's not going to be useful.
So what you’re doing is creating a place for that metadata to go so that when you’re given the opportunity to learn about where more of your data lives, you can record it — and ultimately put it into its proper perspective as you begin to see how data sources and pathways relate to each other.
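What that "place" looks like matters far less than the fact that it exists; a spreadsheet is plenty. Purely for illustration, here is a minimal Python sketch of what a consolidated inventory record might look like. The field names, sample entries, and file name are my own assumptions, not a prescribed schema; record whatever your compliance, BIA, and application teams can actually tell you.

```python
from dataclasses import dataclass, asdict
import csv

@dataclass
class DataAssetRecord:
    """One entry in the 'known universe' of enterprise data (illustrative fields only)."""
    name: str            # e.g. "Cardholder database"
    location: str        # system, share, or network segment where it lives
    owner: str           # business or technical owner, if known
    classification: str  # e.g. "PCI", "HIPAA", "internal", "unknown"
    source: str          # where this knowledge came from (audit, BIA, project)
    notes: str = ""      # anything else worth keeping

# Seed the inventory from what you already know, e.g. a PCI scoping
# exercise and a disaster-recovery business impact analysis.
inventory = [
    DataAssetRecord("Cardholder database", "payments DB cluster",
                    "Payments team", "PCI", "PCI scoping exercise"),
    DataAssetRecord("HR personnel files", "HR file share",
                    "HR operations", "internal", "BIA interviews"),
]

# Persist it somewhere everyone can add to; a simple CSV is enough to start.
with open("data_inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=asdict(inventory[0]).keys())
    writer.writeheader()
    for record in inventory:
        writer.writerow(asdict(record))
```

The point of the sketch isn't the tooling; it's that every new thing you learn has a consistent, low-friction place to land.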
Step 2: Baby Steps
Once you’ve cataloged the data that your organization does know about, you can start to look for other areas from which to glean information. At this point, if you’ve followed the first step above, you’ll have a place to record additional information as the opportunity arises. Building a new application? Map out what data it will process, where it gets its data from, and where it will send it to. Keep a record of that in your living, breathing “data inventory.” Got an audit going on? Keep an eye on what the auditors sample, and ask them to work with you on moving your data inventory forward. Just about any task that tells you something new about the data in your firm is fair game.
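To make that concrete, the knowledge gained from a new application build might be captured as a handful of flow entries rather than a formal architecture document. The sketch below, again in Python and again using hypothetical field and file names of my own choosing, shows one way those entries could be appended to a living inventory as they're learned.

```python
import csv
import os
from datetime import date

# Each row records one thing we learned about where data comes from and goes.
new_flows = [
    {
        "application": "Customer portal (new build)",
        "data": "customer contact records",
        "source": "CRM database",
        "destination": "portal front end",
        "learned_via": "design review",
        "recorded_on": date.today().isoformat(),
    },
    {
        "application": "Customer portal (new build)",
        "data": "authentication logs",
        "source": "portal front end",
        "destination": "central log server",
        "learned_via": "design review",
        "recorded_on": date.today().isoformat(),
    },
]

# Append to the running flow inventory; write a header only if the file
# is new, so repeated runs keep adding rows rather than overwriting them.
write_header = not os.path.exists("data_flows.csv")
with open("data_flows.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=new_flows[0].keys())
    if write_header:
        writer.writeheader()
    writer.writerows(new_flows)
```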
This doesn't have to be a fancy, heavily funded initiative, and it doesn't have to be farmed out. Be very skeptical of "data charting" initiatives, vendors that claim to be able to categorize all your data for you, and consultants who want to sell you data classification or mapping services. In fact, I've seen more organizations succeed by starting small, keeping an eye open for opportunities to add to their knowledge, and recording what they learn than I've seen succeed by trying to formally audit and catalog all of their data.
The first approach works because it's a "living" activity: when it's transparent and well explained, folks understand what you're trying to accomplish and actively look for ways to further the goal when they have the opportunity. It's grass-roots. The second approach fails because it's like trying to boil the ocean; it's just too big and too complex.
Hopefully, if you’ve done these two things, you can increase your awareness of where your data is, what it’s for, and who’s using it. Even if you only go from understanding 10 percent of the data to 30 percent, you’re farther along than you were when you started.
Ed Moyle is currently a manager with CTG's information security solutions practice, providing strategy, consulting and solutions to clients worldwide, as well as a founding partner of Security Curve. His extensive background in computer security includes experience in forensics, application penetration testing, information security audit and secure solutions development.