How to Stay Safe on the Internet, Part 2: Take Canaries Into the Data Mine

The preface to this security guide series, Part 1, outlines the basic elements that comprise a threat model, and offers guidance on creating your own. After evaluating the asset and adversary expressions of the threat model equation, you likely will have determined the danger level of your adversary — and by extension, the caliber of its tools.

This installment begins our exploration of the core substance of the series: how to identify the adversary’s means of assailing your asset, and the countermeasures you can deploy. This piece addresses what Part 1 classifies as a “category 1” adversary: the operator of a service that catalogs the data users supply.

While tailored to the threats associated with category 1 foes, everything covered in this article forms a foundation for resisting higher category adversaries. With that in mind, I recommend that anyone who wants to know how to think about potential sources of compromise should read this. However, the techniques covered here will not be sufficient to defend against higher-level adversaries.

Sharp FAANGs Take a Megabyte Out of Data

More than any other factor, it is our asset that determines the kind of adversary we face. Those of us who take aim at a category 1 adversary are in this position because our asset is the corpus of sensitive personal details consequent to online transactions.

This all comes down to how much data an adversary can glean from you, and how thoroughly it can analyze it. If your data passes through some software or hardware, its developer or maintainer enjoys some measure of control. The reality of the Internet’s infrastructure is that we cannot vet every device or piece of code that interacts with our data, so we should assume that any node that can retain our data has done just that.

The ubiquitous technology services Facebook, Amazon, Apple, Netflix and Google, often referred to as “FAANGs,” are perhaps the greatest user data hoarders, though they’re not alone.

When online services affirmatively collect our data, it is generally for either or both of two purposes: First, a service genuinely may wish to improve your experience. A service that anticipates your wants and needs is more likely to retain your usership, and the only way to do that with any accuracy is to learn from your wants and needs as you actually express them.

The second and more common rationale for data mining is aggregating and selling user profiles for advertising purposes. If a platform can’t derive revenue from figuring out what you like, it passes along its careful observations to a company that can. User data might get sold multiple times, with more data blended in along the way, but it usually ends up with an advertiser, which then uses that data to show you ads for products you’re most likely to buy.

In theory, user profiles in mined datasets are anonymized. Ad companies don’t care who you are, just what ads you want to see. Still, if the data contains enough “classifiers” (columns in a table in which profiles are rows), every profile will express a unique combination, making users identifiable. It is easy, then, to understand people’s reluctance to let this data accumulate.
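
To make the point about classifiers concrete, here is a minimal sketch in Python. The profiles, column names, and the unique_fraction helper are all invented for illustration; real datasets are far larger, but the effect is the same: each added column splits the crowd into smaller groups until every row stands alone.

```python
from collections import Counter

# Hypothetical "anonymized" profiles: no names, only classifier columns.
profiles = [
    {"zip": "60614", "birth_year": 1985, "gender": "F"},
    {"zip": "60614", "birth_year": 1985, "gender": "M"},
    {"zip": "60614", "birth_year": 1990, "gender": "F"},
    {"zip": "60615", "birth_year": 1985, "gender": "F"},
    {"zip": "60615", "birth_year": 1990, "gender": "M"},
    {"zip": "60614", "birth_year": 1990, "gender": "M"},
]


def unique_fraction(rows, columns):
    """Return the share of rows whose values in `columns` match no other row."""
    combos = Counter(tuple(row[c] for c in columns) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[c] for c in columns)] == 1)
    return unique / len(rows)


# One classifier alone singles out nobody in this sample...
one_column = unique_fraction(profiles, ["zip"])
# ...but three classifiers together single out everybody.
all_columns = unique_fraction(profiles, ["zip", "birth_year", "gender"])
print(one_column, all_columns)
```

In this toy sample, ZIP code alone leaves everyone hidden in a group, while the three columns combined leave no two profiles alike.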

It’s not just data that category 1 defenders should worry about, though, but also metadata, which often is more revealing. A notoriously tricky concept to grasp, metadata can be thought of as information that is generated inherently as a consequence of the creation or existence of data.

Consider sending one email. The semantic content of the email would be the data, while the metadata would consist of the timestamp of when it was sent, the sender and recipient email addresses, their respective IP addresses, the email’s size, and countless other details.
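
As a rough illustration of that split, the sketch below uses Python’s standard email module to pull apart a made-up message. The addresses, the IP address, and the simplified Received line are placeholders I chose for the example, not anything a real provider emits verbatim.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message; headers are metadata, the body is the data.
raw = """\
From: alice@example.com
To: bob@example.org
Date: Tue, 05 Mar 2019 08:14:27 -0600
Received: from mail.example.com (mail.example.com [203.0.113.7])
Subject: Lunch?

Are you free at noon?
"""

msg = message_from_string(raw)

# The data: what the sender actually wrote.
data = msg.get_payload()

# The metadata: everything generated around the act of sending it.
metadata = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "sent_at": parsedate_to_datetime(msg["Date"]),
    "relay": msg["Received"],
    "size_bytes": len(raw.encode("utf-8")),
}

for field, value in metadata.items():
    print(f"{field}: {value}")
```

Note that the provider never has to read the body to learn who wrote to whom, when, and from which network.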

One transaction that precipitates metadata is revealing enough, but metadata exposes significantly more as it is observed over time. To continue with our example, this would mean crunching all of one user’s sent and received emails, which an email provider easily can do. By correlating the timestamps with the user’s IP address, which provides a geolocation and is reassigned as the user changes networks, the email provider can infer the user’s spatial movement patterns and waking hours. Thus, metadata magnifies the value of data — your asset — exponentially.
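
The kind of correlation described above takes very little code. This is only a sketch under simplifying assumptions: the log entries are invented, and the "home" and "office" labels stand in for the locations a provider would derive from IP addresses.

```python
from datetime import datetime

# Hypothetical send log: (timestamp, network inferred from the sender's IP).
log = [
    ("2019-03-04 07:42", "home"),
    ("2019-03-04 09:15", "office"),
    ("2019-03-04 17:58", "office"),
    ("2019-03-04 22:31", "home"),
    ("2019-03-05 07:55", "home"),
    ("2019-03-05 12:10", "office"),
    ("2019-03-05 23:02", "home"),
]


def summarize(entries):
    """Return (earliest hour, latest hour) and the hours seen on each network."""
    hours_by_network = {}
    for stamp, network in entries:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour
        hours_by_network.setdefault(network, []).append(hour)
    all_hours = [h for hours in hours_by_network.values() for h in hours]
    window = (min(all_hours), max(all_hours))
    return window, {net: sorted(hours) for net, hours in hours_by_network.items()}


waking_window, movement = summarize(log)
print(waking_window)
print(movement)
```

Seven timestamps are enough to guess when this user is awake and when they commute. A provider holding years of mail has far more to work with.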

Before we can begin equipping ourselves to fend off category 1 adversaries, let’s get a better sense of who they are and what they’re capable of doing. Actors that fit into category 1 can range from Internet service providers (ISPs) to the online services you use, and even to others that piggyback on the ones you use. The common thread is their privileged position with respect to your communications: They serve, carry or mediate them in one way or another.

The full implications of this position are clearer after assessing a few points that a category 1 adversary might occupy. The entities below are given in order of how fundamental they are to facilitating your online communication. Furthermore, they are cumulative: When addressing any one of these, we also must handle everything listed before it.

One party that always plays a role in your communications and everything else on a computer is the developer of the device’s operating system. This is because the OS is responsible for interfacing with all of your device’s networking hardware, and passing data back and forth between programs and the network.

All of this data interchange is driven by low-level OS processes that most people don’t look at and don’t know how to interpret. Practically speaking, addressing the OS’ access to your data is difficult. It’s also overkill for this threat model, but I cite it here for the sake of completeness, and to introduce the concept for later discussion.

The other entity that always occupies a link in the chain between your device and the network is your ISP. This is the company that assigns you an IP address on the public Internet, and permits access to the Internet backbone over its infrastructure.

Because everything you send includes the geolocatable IP address for the sender and recipient, and the ISP is responsible for delivering it between the two, the ISP knows where on the Internet (and in the real world) you are at all times.

Because packets can be logged with timestamps, the ISP can align them with IP address records to divine user browsing patterns. ISPs are not only in a position to snoop on your traffic, but also have every incentive to do just that. In the U.S., Congress repealed the FCC’s broadband privacy rules in 2017, which allows ISPs to sell your browsing habits. All of that makes them one of the biggest category 1 players.

Since so much of our digital communication is transmitted over the Web, Web browsers figure into most threat models. Nearly every service you can think of likely is accessed through a browser, and odds are it is your single most-used piece of software.

As you would expect for proper Web navigation, a browser knows your IP address and that of each destination website. So your browser knows as much about your online habits as your ISP does, but only for Web traffic (that is, HTTP and HTTPS).

Browsers also tend to gather diagnostic data — records cataloging potential page load failure conditions — and send it to the browser’s developer. In itself this is useful, but there is a risk that this data traverses the sphere of influence of a particular entity that relies on data mining for its profit: Google.

Except for Mozilla Firefox and Apple’s Safari, most mainstream browsers are based on Chromium, the project at the heart of Google’s Chrome browser, and over which Google exerts some influence.

Browsers are responsible not only for establishing connections to websites but also, crucially, for managing cookies and other background processes. A browser cookie is a small piece of data that a site you visit deposits into your browser to support some task, like keeping you logged into a site you’ve logged into already. However, by default cookies persist regardless of where you browse later, until they expire or your browser deletes them. In many cases, that is effectively never.
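
To see what a cookie actually contains, here is a short sketch using Python’s standard http.cookies module. The header value is a hypothetical example of what a site might send; the names and the one-year lifetime are my own assumptions.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value from a site you just logged into.
header = (
    "session_id=abc123; Domain=example.com; Path=/; "
    "Max-Age=31536000; Secure; HttpOnly"
)

cookie = SimpleCookie()
cookie.load(header)
morsel = cookie["session_id"]

# The cookie is a name, a value, and a handful of attributes.
value = morsel.value
domain = morsel["domain"]
lifetime_days = int(morsel["max-age"]) // 86400

print(f"{morsel.key}={value} for {domain}, kept for {lifetime_days} days")
```

Unless the browser is told otherwise, it will keep presenting this value to the site for the full lifetime the site requested.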

Cookies simultaneously produce the Web experience that we’ve grown accustomed to and the data mining that underpins it. For example, tracking cookies snitch about your browsing habits — such as which tabs you have open together at what times — to the entities that installed them on your browser.

Thus, browsers end up serving as the gatekeepers to your data, and your choice of browser and configuration decides how well-locked the gate is.

Email providers also are in a uniquely lucrative position to scour your data, since email is the de facto gateway to all Web services. You’ve likely seen enough account verification emails to corroborate this.

What’s more, your email provider retains all your email content, both incoming and sent, which intrinsically cuts a wide swath through your life. A scan through messages from retailers, colleagues, and friends can paint a shockingly vivid portrait of you. In other words, email providers reap the benefit of how prolific email is as a communication channel.

Social media presents another novel lens into sensitive data about you. Although social media is not as central to digital communication as email is, its intended use case allows platforms to derive a lot of information through correlation.

Beyond that, it offers operators data that you may not express over any other medium, especially if the platform promotes rich media formats like photos or includes affinity-expression features such as likes. Status updates encourage regular activity, photos are often geotagged and increasingly rich fodder for image recognition AI, and a constellation of likes assembles a demographic profile.

Of course, social media platforms’ ability to organize users by interconnected webs of friends and followers, or through direct messages, reveals a “social graph” — an org chart of who fraternizes with whom.

Choose Your Weapons!

Now that you know what our adversaries are armed with, how can you defend against them? One thing that may seem rudimentary merits mention for how overlooked it is: The surest defense of your information is to not store it digitally in the first place.

Granted, this isn’t an option for some records. Still, certain personal details can be withheld from digital devices and platforms. For instance, don’t indicate where you live or what your birthday is. You can leave things like social media profiles coyly void of deeply personal interests or interpersonal associations.

Assuming that you have data that you can’t keep off the network, end-to-end encryption is the single most effective instrument you have. Cryptography (the study of encryption) is far too complicated a discipline to dissect here, but in a nutshell, encryption is the use of mathematical codes that can’t be deciphered except by the intended sender and receiver.

The trick with end-to-end encryption is ensuring that your definition of “intended receiver” matches your service’s definition. Although a service may encrypt your message from you to its servers, decrypt it, and newly encrypt your message from its server to your interlocutor (who decrypts it), that is not end-to-end. Encryption is end-to-end only if your message is encrypted so that only your correspondent can decrypt it, denying interceding servers a peek.
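
The difference is easier to see side by side. The sketch below is a toy: xor_bytes is a stand-in cipher I made up for illustration (XOR against a random key of the same length) and must not be used to protect anything real. What matters is who holds which key.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy cipher for illustration only: XOR data with a same-length key."""
    return bytes(d ^ k for d, k in zip(data, key))


message = b"meet me at noon"

# Hypothetical keys, each as long as the message.
key_sender_server = secrets.token_bytes(len(message))
key_server_recipient = secrets.token_bytes(len(message))
key_sender_recipient = secrets.token_bytes(len(message))  # the server never has this

# Hop-by-hop: the server decrypts, reads, and re-encrypts for the next hop.
on_wire_1 = xor_bytes(message, key_sender_server)
seen_by_server_hop = xor_bytes(on_wire_1, key_sender_server)
on_wire_2 = xor_bytes(seen_by_server_hop, key_server_recipient)
received_hop = xor_bytes(on_wire_2, key_server_recipient)

# End-to-end: the server relays bytes it has no key for.
seen_by_server_e2e = xor_bytes(message, key_sender_recipient)
received_e2e = xor_bytes(seen_by_server_e2e, key_sender_recipient)

print("hop-by-hop server sees:", seen_by_server_hop)
print("end-to-end server sees:", seen_by_server_e2e)
```

In both cases the recipient gets the message, but only in the first does the service in the middle get it too.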

With this in mind, use end-to-end encryption whenever it is possible but still pragmatic. When there is an encrypted alternative that is no more (or only slightly more) difficult to use than your current option, make the switch.

There are a few places where this likely will be viable for you. To start, you should avoid using open wireless networks (i.e., those not protected with a password). If implemented with discretion, a virtual private network (VPN) affords you a robust general-purpose safeguard. Even easier to configure is the HTTPS Everywhere extension, a free add-on to your browser. It’s not easy to enable encryption directly with email, but you can choose a provider that promises end-to-end encryption to your message’s recipient.

As for other countermeasures, it’s best to tackle them by specific adversary.

A VPN is the ideal tool for thwarting nosy ISPs. To understand why, envision the same browsing scenario with and without one. When you connect to a website without a VPN, your ISP sees a connection going directly from your IP address to the site. This holds for every site you visit.

If you browse through a VPN connection, your ISP will see only a connection between your IP address and the VPN server address, regardless of how many sites you go to. With a VPN, your computer establishes an encrypted channel from you to the VPN server, passes all your Internet traffic through it, and has the VPN server forward it to wherever it is headed. In other words, the VPN browses on your behalf, passing back its connections through a tunnel that observers (including your ISP) can’t penetrate.
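
The two scenarios can be modeled in a few lines. This is a deliberately simplified sketch: isp_view is a hypothetical helper, the addresses come from ranges reserved for documentation, and it ignores details such as DNS lookups and traffic timing.

```python
def isp_view(client_ip, site_ips, vpn_ip=None):
    """Return the set of (source, destination) pairs visible to the ISP."""
    if vpn_ip is None:
        # No VPN: every site shows up as its own connection.
        return {(client_ip, site) for site in site_ips}
    # With a VPN: all traffic rides one encrypted tunnel to the VPN server.
    return {(client_ip, vpn_ip)}


client = "203.0.113.10"
sites = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]
vpn = "192.0.2.50"

without_vpn = isp_view(client, sites)
with_vpn = isp_view(client, sites, vpn_ip=vpn)

print(len(without_vpn), "connections visible without a VPN")
print(len(with_vpn), "connection visible with a VPN")
```

However many sites you add to the list, the second view never grows beyond the single tunnel.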

There’s a catch, though: With a VPN, you’re obscuring your data from one entity by passing it through another. So, if you can’t trust the latter, you haven’t rendered your data any more secure. Be sure to read reviews and privacy policies for VPN services carefully.

Since the lion’s share of online communication is Web-based, retooling your browser to lock down your data is critical. Your first choice should be an open source browser, meaning one with code that is publicly available so it can be audited independently. Only Firefox fits the bill — Chromium, the basis for Chrome, is open source, but Chrome is not. Fortunately, Firefox is an excellent browser that long has blazed trails on the Web.

You will need to change some settings. First, rig your browser to delete all cookies and caches every time it closes. This will force you to log into your accounts every time, but that’s a better security posture anyway, since the cookies that keep you logged in across browser sessions can be stolen.

Next, you should trawl the settings for tracking options and turn off all of them.

Finally, tack on a few security-enhancing extensions. Believe it or not, ad blockers serve a security function, as many ads slurp up sensitive data about you and beam it to their mothership unencrypted.

You also should install the Electronic Frontier Foundation’s Privacy Badger and HTTPS Everywhere extensions. The former kills tracking scripts that try to insert themselves on every page you load, while the latter encrypts some otherwise unencrypted Web connections.

Fending off eavesdropping email providers is tricky, but it’s possible for the dedicated among you. Along with being dedicated, you also must be willing to pay a subscription fee, so your provider can draw the revenue to maintain its service.

This isn’t a sure thing, as some paid services double-dip to sell your data. However, if your email service is free, it’s almost certain to be monetizing the demographic profile built from your correspondence. Know that your emails are always at the mercy of the service provider, so send them judiciously.

Lastly, reconsider the apps you use. If there are apps or services you can function without, dump them. Every piece of software you use, whether installed on your device or accessed via the cloud, is another entity that has data from you.

If it’s not feasible to discard a tool, replace it with one that retains less data. Favor open source alternatives whenever possible, as they’re open to more scrutiny. You also should favor software that requires fewer and less invasive permissions. If you see no good reason why an app that does X needs permission Y, skip it.

Now for Your Assignments

Those bound to fight off category 2 and 3 threats certainly should not put too much stock in these techniques. In fact, that cautionary advice applies to category 1 defenders as well, to an extent.

This article doesn’t give you everything you need to take the test, but it should supply enough instruction for you to do your homework, find future lessons, and identify your own preferred mindset for consuming the material.

The real final exam will be proctored by your adversary, but I am available for office hours.

Jonathan Terrasi

Jonathan Terrasi has been an ECT News Network columnist since 2017. His main interests are computer security (particularly with the Linux desktop), encryption, and analysis of politics and current affairs. He is a full-time freelance writer and musician. His background includes providing technical commentaries and analyses in articles published by the Chicago Committee to Defend the Bill of Rights.

2 Comments

  • I was responsible for computer security in a law enforcement agency for decades and taught information security. I believe that this series of articles is the most valuable that I have ever read on TechNewsWorld.

    • Thank you. I am truly flattered that you think so highly of this series. Equipping people with the kind of knowledge I impart in these pieces is why I started writing about tech in the first place, so it’s always good to see that I’m headed in the right direction.
