[MUSIC PLAYING] DAVID J. MALAN: All right. This is CS50's Introduction
to Cybersecurity. My name is David Malan. And this week, let's focus
on preserving privacy. Indeed, over the past
several weeks, we've focused on securing your accounts,
your data, your systems, your software. And all of that is really about keeping
communications between points A and B, for instance, secure, so
that no one in between can actually access the
information you're trying to share. But what if you, A, don't even want
B to have some of that information? So indeed, today, let's focus
on some of the technologies that you and I use every day
and some of the technologies that underlie the
software, and applications, and more that you and I are
going to use tomorrow and beyond and consider exactly what
information we're sharing now, perhaps, even without
our knowledge and also empower you with certain mechanisms
via which you can perhaps restrict all the more
of this information if you, indeed, do not want
to share it beyond yourself. So let's consider first
some of the obvious features that you and I probably use every
day, like your web browsing history. Whether you're on a laptop,
or desktop, or mobile device, odds are you know by now
that your browser tends to keep track of pretty much everywhere
you go on the World Wide Web. That is to say, if you
click on your URL bar, you can sometimes browse through
the past few URLs that you visited. If you go up to your browser's
history via some menu, you can probably see
everything you've done earlier today, yesterday,
last week, last year, or perhaps, even the
entirety of your history, particularly, if you're logging
into your Google account, Microsoft account, or something else. So the web browsing history is sort
of both a concern when it comes to your privacy, but also a feature. Well, let's first consider the feature. Well, why is that useful? Well, one, I mean, even
I occasionally go back through my history
trying to find some web page that I know I was
looking at earlier in the day, or yesterday, or some
previous time in the past because it just helps me find
information more quickly. And so in that sense, it
might solve a problem for me. Moreover, you probably have noticed
that your web browsing history is often used for features like autocomplete. So when you start typing
a URL or maybe even a keyword that was in
the name of a page, your browser might remember much more
quickly what it is you're looking for. So you can just hit Enter or click. And voila, you're at that same page. But of course, this is a concern,
potentially, for your privacy, whereby, you might not want someone else
who has physical access to your device to start poking through
where it is you've gone. You might not want someone else to
have access if you just so happen to visit that website or those
websites on maybe a computer in a lab environment, or an
internet cafe, or the like. So you can imagine quite
a few scenarios in which this is, yes, a feature, but quite a
few other scenarios in which this is not really a desirable
feature because it invades your privacy in some sense,
or at least, puts it at risk for being invaded by someone else. So we'll consider how we might
at least sanitize this history or remove it altogether in ways
that you might already know about. For instance, you're probably
already familiar with some option in your browser, whereby, you
can clear your browser history. And that forgets,
therefore, all of the places that you've been, all of the cookies
that you might have accumulated, all of the usernames
and passwords that might have been remembered by your browser. Although, that tends to be a
fairly heavy-handed solution because when you clear your
browser history, assuming you check all of those
boxes, all of it is gone. And that might mean negatively that
you're now logged out of Google, you're now logged out of
Outlook, or some other account that you actually
still want to use, even if you just wanted to clear your
history from something else altogether. So we'll consider then
what else might be a concern when it comes to your
privacy beyond your own browser. And in fact, it doesn't even matter
if you sanitize your own web browsing history and delete the entirety
of it because it turns out that typically, any website
you visit is, itself, on the server side also keeping track
of a lot of that same information. That is to say that servers
typically have logs. And these are not only
for diagnostic purposes. In case anything goes
wrong, the IT staff can use those logs to
reconstruct history and figure it out, figure
out who was doing what and when and how that
might explain some problem. It might be used for
auditing purposes if they want to keep track of exactly
what was accessed on a system. It might be used for
advertising purposes or analytical purposes
more generally to mine or analyze that data to figure out how
we might monetize it or do something else with that same information. But what do we mean
concretely when we say that information is logged on a server? Well, it's very similar to
your own web browsing history, but it's even more detailed. So here, for instance, is a
representative piece of configuration that captures what is a very
common convention for information that servers log, web
servers, specifically, when you visit them with a browser. And I'll highlight
just a subset thereof. This log format, the
so-called combined format, indicates to me that it's
very common for a server when you visit some web page on it that
the server will log, that is, remember your remote address, otherwise
known as your IP address. It will remember the day and time
at which you accessed that page. It will remember exactly
what you requested, so the name of the file
or folder on the server specifically that you sought
to download or look at. It'll remember the referrer, that
is, the URL from which you came. And it will even remember
the user agent that you use, that is to say your browser. So perhaps, unbeknownst
to you, every time you use your browser to
visit some website, inside of that virtual envelope is quite
a bit more than just the request that you're making of the browser-- of the server, rather. It includes, yes, your IP
address on the outside of the envelope, as we've
described it in the past. It includes some number of HTTP
headers, as we've discussed in the past. But in particular, it
includes information that you might not want being
stored on servers in perpetuity and you have no control
over deleting necessarily. Unless there is some regulatory
requirement or law that requires that the server delete
it for you on some schedule, you have much, much less
control over this information. So let's consider in a
bit more technical detail what some of this
information is and how you might at least exert some control
over just how much of that information is being shared. So let's revisit first this
building block of HTTP headers that we keep coming back to if only
because in the world of systems and software nowadays, things
on the web are just so common. Using HTML, CSS, JavaScript, using
web browsers and web servers, that's driving a lot
of today's interactions with technology, whether
it's in native applications or whether it's with mobile websites,
or desktop websites, or the like. So HTTP headers, recall, are
just like key value pairs that are inside of those
virtual envelopes that indicate some kind of setting or
some kind of piece of information that the browser is
sending to the server or that the server is
sending to the browser. So for instance, if I go
on google.com and I search, as I often do, for cats, well, what
might be going on underneath the hood? Well, in that web page that Google
gives me with 10, or 20, or 30, or six billion, 240 million cats, there
might be HTML that looks like this. And recall that this
HTML, which I'm proposing exists somewhere in Google search
results, is an anchor tag for a link. There's the a tag over there. The hyper-reference for
this link, or href attribute has a value of
https://example.com, for instance. And the word that the human will
see is cats, literally in this case. Now, I'm assuming for the
sake of discussion that today, example.com is a website full of cats. And that's why it might be appearing
among Google search results when I search for cats as my keyword. But when a user like you or me
clicks on that link on google.com, because that is literally where
you're looking at the search results in this story, it turns out that
your browser not only goes and requests that web page, your browser
includes an HTTP header like this in that virtual envelope. That's specifically called Referer--
that's the key in this case-- the value of which is the
URL from which you came. So for instance, if I have just gone to
google.com, and I've searched for cats, and I've hit Enter,
recall, as in past classes, I proposed that the shortest version
of the URL that you might see in your browser upon
searching for cats is this, https://www.google.com/search?q=cats. Now, that is what you'd
see in your URL bar. Below that, you'd see the 10, or the
20, or the 30, or the six billion, 240 million cats, each of which
has a link that when clicked, leads you to a search result. But the implication of this
HTTP header is that by default, perhaps, unbeknownst to
you, indeed, your browser is telling the whole
world from which web page you came when you visited some
other web page via a link. Now, why in the world
is this compelling? Well, it's actually useful for the
website at which you end because it might be useful for their analytics. They might want to know, well,
how are people finding my website? How are people finding my
business on the internet? Oh. It looks like I'm
getting a lot of users, a lot of customers,
perhaps, from google.com, specifically, when someone searches
for cats, not dogs, not something else, but cats. So you can imagine, especially
in the world of commerce, that just being useful information
to know how people are finding you, or conversely, how people are
not apparently finding you. But this is very invasive
because now this website, even though it's arguably
none of their business, they know I use Google instead of Bing
or some other search engine, perhaps. And you can imagine that there could
be links on CS50's own website, on any number of other
websites in the world. And just because you happened to
visit them and you clicked a link, now they're broadcasting your
business to whatever website you're ending up on by revealing
where you came from, from where you were referred, so to speak. But this has long been a feature of HTTP. And this has long been a feature that's
enabled by default, unless the website, or perhaps, you, as the user, turn this
off or somehow moderate its response. Now, some of you might be noticing that
there's a bit of a typo on the screen. And I promise, this isn't actually mine. In English, at least, this is not
typically how you spell the word, referrer. And this is actually a fun fact. In referrer, there should
be four R's in total. It should be R-E-F-E-R-R-E-R.
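To make that header concrete, here is a minimal Python sketch, not something shown in the lecture, of what the site at the other end of the link could read out of it. The raw request text and the helper names `parse_headers` and `describe_referer` are assumptions made up for illustration; the header name is spelled the way it appears on the screen.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical raw request, as example.com's server might receive it after
# a user clicks the "cats" link on a Google results page.
RAW_REQUEST = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Referer: https://www.google.com/search?q=cats\r\n"
    "\r\n"
)


def parse_headers(raw):
    """Return the request's headers as a dict with lowercased names."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")[1:]  # skip the request line
    headers = {}
    for line in lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def describe_referer(headers):
    """Summarize what the destination site can learn from the Referer header."""
    referer = headers.get("referer")  # spelled as on the screen
    if referer is None:
        return None
    parts = urlsplit(referer)
    return {
        "site": parts.netloc,            # which site the user came from
        "path": parts.path,              # which page on that site
        "query": parse_qs(parts.query),  # what the user searched for
    }


info = describe_referer(parse_headers(RAW_REQUEST))
print(info)
```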
However, fun fact, years ago, when the specification for this
standard was being written, the poor individual who wrote the
specification made a typo that has been immortalized in history
for years to come. And so this is what browsers and
servers have been using and expecting for years. There are other variants of this header
in which this typographical error has been fixed. But it's sort of a fun fact
from our internet history. But this is, indeed, what you might see
going from your browser to your server. So ideally, we'd send less
information, at least. I'd be a little more
comfortable if example.com, which is this website for
cats, told them, OK, fine. I came from Google. That's not a big deal. But I'd rather you not
know what I was looking for if only because that seems unnecessary. It seems invasive. And who knows what kinds
of cats I was looking for? Maybe I don't want you to know exactly
what my preferences are in cats, or dogs, or whatever types
of breeds there might be in this case of searching for animals. So it just feels like it's
unnecessary information to share. But better still, I
dare say, would not be to even tell example.com
where I'm coming from and essentially just get
rid of this altogether. So how might a website
go about moderating just how much output comes from the
browsers at the server's request? Or maybe, how might you
with special software suppress some of this information to
preserve all the more of your privacy and what it is you're doing online? Well, for instance, this is
a common tag that web pages can put in their own HTML code
that indicates to the browser that, yes, you may send the referring
address, but only send the origin, that is, https://www.google.com/. And that's it, no
/search path, no ?q=cats. Tell them the website you came from,
but not the specific page or not the specific search
query or search results. Notice here in the world of HTML, the
typographical error has been fixed. There's two R's in the middle there. But otherwise, this is an
HTML solution to the problem. The browser, assuming
it respects this HTML, will, therefore, send
a Referer HTTP header, but with less information, not the whole
URL, but just the origin, so really, the domain name, itself,
and a bit more, the protocol. If you don't want any of that to be
sent for your users, for your customers you could do this instead. Now, Google does not do this. Google currently actually sends
origin, so part of the URL. But if you want to be
an even better citizen and not make it easy for browsers to
send more information than they need to, you can include
this HTML in your page instead, informing the
browser that you can send-- don't send a referrer at all because
the value of this meta tag, so to speak, is actually none instead of origin. And there are other values
as well that allow you a bit of range of opportunities
when it comes to these settings. But these are, perhaps, the
most common or ones to consider. There's an alternative too. If you happen to be a
little more technical and you have control over the web server
and not just the HTML on the server, you can actually configure a
Referrer-Policy HTTP header that goes from the
server to the browser. So in this case, the
referrer policy can indicate that you only want the origin to
be sent, for instance, the shorter form of the URL. Or you can actually indicate
that no referrer should actually be sent in this particular
case, so a second mechanism for actually controlling the same. Let me pause here and see if there's
not only some concerns, perhaps, now that you understand
better, hopefully, how the web works, at least, by
default or how we might mitigate this concern with your privacy. AUDIENCE: Is there a
way that is easy enough for us to delete those
traces as a client in case that we don't want to be
tracked or something like that? DAVID J. MALAN: A really good question. We'll refer you to some URLs outside
of the context of class, itself. But yes, there is actually
client-side software that you can install
on your own Mac or PC, typically, that will scrub
some of this information, so that when your HTTP requests
go from your browser to servers, you can ensure that this
third-party software removes a lot of that information automatically
for you because in that way, you don't have to trust that the
website, like the Googles of the world will actually reduce the
amount of information for you. You can instead do that for yourself
through client-side software. And we'll provide a few links online. Other questions on the same? AUDIENCE: By using a
private browser such as Tor, for example, or using a temporary
operating system like Tails, does this remove all of
our traces on the internet? Or does it leave some on the
client side or the server side? DAVID J. MALAN: A good question. Short answer is that it does leave
some evidence on both the server side and the client side. But we'll come back to Tor
in just a little bit as well. All right. How about one final question? AUDIENCE: You said previously about
the third-party software that's supposed to be used in order to scrub
the information from being submitted to the server side. What if that program,
itself, is used to eavesdrop on what we do on the computer? DAVID J. MALAN: That is
a very valid concern. It is absolutely possible. In general, what is
working in your favor is either open-source software,
where if you're using software that other people can see the
source code of, presumably, it's less likely that it's
doing something malicious. Capitalism often helps
you here too, whereby, it is often not in a
company's own interest to be violating the privacy of
their users because presumably, that would create some form of backlash,
which would not be good for business. But beyond that, there is
a lot of trust on your part and my part whenever it
comes to installing software. So that is, indeed, very much a risk. Now, it turns out
there's other information that your browser might be
sharing without your realizing that it's making it available. And that information is enough
via which servers can even fingerprint you, so to speak. That is to say there's this
technique generally called fingerprinting that in
the context of the web means to take as input a
whole bunch of characteristics of the request from the
internet that's coming in and see if you can use those
characteristics to create a profile of sorts for the
user via which you can uniquely identify that user. Now, that doesn't mean you'll know
specifically that user is David Malan. But you will know,
according to this system, if it's the same user
today, as you see tomorrow, as you see the next day because
you can use this information to infer with high probability that,
OK, we saw that exact same browser configuration again,
and again, and again. Odds are it's the same
person and not some twin on the internet who just happens
to have precisely those settings. Now, how might this be implemented
or achieved technologically? Well, the simplest
mechanism, perhaps, is just to rely on something
like your IP address. Recall that any time you're
doing something on the internet, those virtual envelopes
we keep talking about have your IP address on
the outside, so to speak, as well as the IP address of
the destination to which you're trying to send information. Your IP, in that case,
is the return address, which means you're literally
telling the remote server when using certain protocols
where you are in the world, or at least, what your IP address is. Now, that IP address might not
alone uniquely identify you because it turns out on campuses,
in homes, in corporate networks, you might actually share one IP
address with many other people, but at least narrows the
scope of whose IP it might be, even if it's shared among a few people. But your browser inside
of that virtual envelope is sharing other information as well. Another HTTP header that is
typically sent by browsers to servers is called User-Agent. And this is just a string of
text that typically identifies the browser that you're using and
the version thereof and the operating system that you're using
and the version thereof. So for instance, a standard format
might look a little something like this. And it's deliberately overwhelming. And it's just meant to
capture how much detail might be leaked in this header's value. But within this big string
of text that doesn't even fit onto one line-- it's wrapping
here under three lines-- is some indication of what browser you're
using, be it, Chrome or something else and what operating system
you're using, be it, Android or something else on a
phone, a laptop, or a desktop. Now, of course, a lot
of people in the world presumably have the
same browser installed. So that, too, even with
IP address, might not be enough information to
uniquely identify you, at least, with high probability. So what else can servers do? Well, if the server has the ability
to send some code to your computer, for instance, some HTML, some
CSS, and some JavaScript, servers can effectively interrogate the
browser and ask it certain questions. For instance, a server could figure out
what the resolution is of your screen. Now, this might be practically
useful, so they know how to render information on the screen. But that alone might be enough. Especially if you're in the habit
of full screening your browser and you always use the same
resolution on your monitor, that might be another ingredient with
which to identify or fingerprint you. The server might be able
to figure out what fonts you have installed on your system. The server might be able
to figure out what time zone you are in because that's also
a value available within the context of a browser. And there's yet other values still
that collectively with high probability can be used to fingerprint you and me. So even if you're not even
logged in, even if you're using various privacy enhancing
software products to try to remove some of these
HTTP headers and the like, you're still leaking other information,
including the extensions or plug-ins, sometimes, that your browser
might have installed. So if you're in the habit of using
the same computer again and again and you're in the habit of not
changing a lot of these settings, that alone might be enough for a
website to effectively track you. Now, it might be innocuous. They might just use this
for statistical purposes to get a sense of how many
users or customers they have. But it could be for
more invasive purposes, like serving you targeted
advertising, based on your behavior of these websites,
or really, just tracking you, specifically. And the catch is that if you
ever log in to this server just once, if the server has been
logging all of your traffic based on that fingerprint for days, for
months, for years, at that point, retroactively, with high probability,
they can infer, oh, wait a minute. If the user on this day
was David and we think it was the same user on all of these
previous days, now by transitivity, they know a lot more about
your browser history as well. So even unbeknownst to you, and
even without explicit header values being sent that identify
you, the collection of attributes or characteristics
that our browsers have and our browsing behavior has can
still be enough to uniquely identify most of us quite a bit of the time. Let me pause here and see if there's
any questions on fingerprinting or these implications for privacy. AUDIENCE: Will using a VPN
prevent browser fingerprinting? DAVID J. MALAN: A good question. And we'll talk about
VPNs a bit more soon. Short answer, no. So VPNs will typically mask your
IP address, but that's about it. If you still use your browser as
usual with your user account as usual, all of that same information
is going to be leaked. It's just going to
change one piece of it. A good question. Other questions on
fingerprinting and privacy? AUDIENCE: Is it possible that a
hacker can steal a fingerprint and use it for their own
purposes and everything will look like it was my computer
that performed certain actions? So it's like stealing an identity. DAVID J. MALAN: A short
answer, yes, if the hacker has access to the same information. Though if we rewind
to our focus on encryption a couple of classes ago, if
you are accessing websites only via HTTPS and nothing is
unencrypted, then it's going to be a lot harder for a
hacker in between you and that server to glean any of the same information
because almost all of it is encrypted. IP address is not. But anything inside of the envelope
is, including these headers, the HTML, the JavaScript, and the CSS. If, though, the hacker
has somehow infiltrated your own laptop, or desktop, or phone,
or the server, then all bets are off. And they could absolutely
identify you, according to these same pieces of information. Other questions? AUDIENCE: I was just curious to
understand the difference, perhaps, when you are on mobile. My understanding is that they can
even get much more information when you are on mobile. DAVID J. MALAN: That's a fair question. I don't think I would
answer yes to that. I'm hard pressed to imagine what more
your phone is doing than the browser is doing, except that there are-- I suppose I could argue
that your phone tends to have additional features nowadays,
like GPS, like accelerometers, gyroscope, perhaps, so other
hardware features that theoretically can be interrogated by JavaScript
code, typically, on an opt-in basis. So you, the user, could deny access
to these pieces of information. But those characteristics,
I suspect could be used to identify you a bit more
uniquely because laptops, at least, today, have less of that functionality. Other questions? AUDIENCE: When storing and
retrieving data on the front end, is it more secure to use cookies,
local storage, or another alternative? DAVID J. MALAN: A really good question. And we will come to this subject
literally in one slide, cookies. In general, local
storage because cookies, by design, are meant to be sent
back and forth, back and forth between browser and server. Theoretically, that should not be a
concern if everything is encrypted. But we've talked in the past
already how mistakes can be made. You might start on HTTP,
be redirected to HTTPS. So in general, storing things
in local storage, at least, prevents things from accidentally
leaking out over the browser connection. That said, if you're storing
things in local storage, they are literally available locally. So if you have a colleague,
a friend, a sibling who gains physical access to that
device, let alone, an adversary, then they could see all of the
information and not only your cookies, but also local storage. So at that point, physical
access, generally, all bets are off when it
comes to your privacy. All right. How about one other question? AUDIENCE: There were calls being made
from people's local phone numbers on cell phones to other local numbers. Obviously, the people
weren't making the calls. And it had happened to me too. And I was wondering
how that kind of works or if it's related to this at all. DAVID J. MALAN: It is. We weren't planning to
talk about it today. But in a nutshell, it is very
easy to spoof telephone numbers. And this is how a lot of spam
calls are sent, particularly, internationally or abroad,
where they might not be regulated in the same way
as someone's home country. It's very common, too, for
if your number starts-- your own phone number starts with
555, for instance, very often, you'll get fake calls from
other numbers that also start with 555 because the
presumption by the adversary is that, oh, Sabrina's probably
more likely to pick this up if she thinks it's a neighbor with
a similar looking phone number. But unfortunately,
with the phone system, it's all too easy to fake phone numbers. And this is yet another reason
why using phones, using SMS, is not a recommended approach
for our earlier topic about multi-factor authentication. It's just not a secure network. That's not how Edison and others
designed it 100-plus years ago. This is why systems that use
cryptography in some form are much safer when it
comes to that information. All right. So beyond this user
agent header, there's other headers that your browser is often
sending back and forth with the server. And one of these we've talked
about, and one of these you probably came into the course
knowing about, namely, cookies. But there are different
types of cookies. But recall that in general, a
cookie is a piece of information that a server puts on your computer
to help remember who you are. So in the absence of these
fingerprints and the absence of specific headers
like these, it can just put a small random value
with numbers and letters or the like on your computer
or maybe even a bigger value if it has lots of users. And it uses that value
to uniquely identify you if you return again and
again to the website. It doesn't necessarily
know that I am David, unless I log in at some
point, at which point, then it can realize, oh, wait a minute. David's cookie is this value. Now I know who this user is. But in general, there
are different types of cookies and different
settings for cookies that are worth knowing a little something about. So we talked previously about what we'd
more properly call session cookies. So session cookies are used
by servers to maintain state, so to speak, between the
server and the browser. That is to say, without getting too
technical, HTTP is typically stateless, whereby, when you visit a page, the
browser icon might spin for a bit. And then it stops because the
transaction between the browser and the server is complete. But if you want to remember
who the user is, therefore, the second, the third, the fourth
time the browser contacts the server, the browser had better
remind the server who it is. And this is why we use the metaphor
of the virtual handstamp, whereby, that handstamp is the browser's
way of reminding the server, you've seen me before. Don't make me log in again. I am David. I am David. --even though it's just relying on
this virtual handstamp or really some unique identifier that's going
in the cookie header from browser to server. So a session cookie allows
browsers and servers to maintain sessions,
this kind of state. A little more concretely, it allows them
to maintain things like shopping carts. So if you're shopping on
an amazon.com or the like, the session cookie is what
remembers who you are, or at least, that you're the same
person, so that every time you poke around on the website, Amazon shows
you the same contents of your shopping cart again and again,
so that they don't lose your business by accidentally deleting
it when you simply change the page. So how do session cookies work? Well, when you first
visit a website that wants to plant a cookie
on your computer, the response might look a
little something like this. HTTP 200 is the status code,
which, recall, means OK. All is well. It's not something like 404,
which would mean file not found. So 200 is OK. But the server might also respond with
this key value pair, this HTTP header, Set-Cookie. So that's the key, the value of which is session=1234abcd. And that's the same
value we used previously when we talked about
cookies in this context. And the point here is that the
name of this cookie is session. And its value equals,
in this case, 1234abcd. Now, if you visit the same
website and you, and you, and you, we would all have different seemingly
random values for those cookies. And so this number, this
sequence of letters and numbers, would be different for each of us. That is to say we have
different handstamps that we're presenting each time. Now, this is a session cookie. And it's a session cookie
in the sense that it is supposed to expire when
you close the browser, when you quit for the night, when
you reboot or anything else. Now, with that said, that's
a bit of an overstatement because browsers nowadays will
frequently preserve your tabs for you. They might go to sleep. You might have to wake them back up. But increasingly, sessions are
living longer than they once did. But the idea is that this is not
meant to last for a year or forever. It has a much shorter
lifetime by design. When your browser has received that
cookie and you click on some other page, you visit some other
product on amazon.com, your browser might say something
like this, GET / and then Cookie:, with that exact same value. So recall from our
previous class, this is how the browser just reminds
the server what its handstamp is or what its cookie value is. But again, the idea is that
when the browser is closed, you reboot for the night,
then you should not have the same session cookie
tomorrow, at least, in this model. That's not true for all websites,
but it holds for session cookies as we are currently describing them. Now, that's pretty good for your
privacy because if the cookie is by design meant to be a session cookie
and it expires pretty soon when you're done with that browser tab or done
using the browser for the day, then that's pretty good because it
means if you go back to the same website tomorrow, that cookie
might not exist anymore, so you might as well look
like or be a brand new user. So they can't correlate, perhaps, by
default as much information about you. But these are the cookies that
you read about being bad for you and bad for your privacy,
tracking cookies, which are the exact same idea,
key value pairs that are sent from server to browser to
remember who you are, or at least, that you're the same person,
even if we don't know that you're David Malan specifically just yet. But as per the name,
tracking cookies are really designed to track you and me. Why? Well maybe analytical purposes,
maybe debugging purposes, so that they know where users
were in case something breaks, maybe advertising
purposes, so that you get served different ads from me, so
that they can maximize their revenue by serving up ads
that you and I are each more individually likely to click on. So tracking cookies are the ones
that get a bad rep and rightfully so. So let's consider an
example of a cookie that's designed to track your behavior
on a particular website. Here, for instance,
is a set-cookie header that Google, specifically,
might send to your browser. In fact, they use a cookie that by
convention is called _ga for Google Analytics, which they use
for analytical purposes. And its value looks a
little something like this. And the point of this value
is that it's generated on a per website basis if that
website is using Google Analytics. And Google Analytics is a tool that
allows website designers to track who is clicking on what, what browsers
they're using, what operating systems they're using, and generally giving
them a sense of the demographics of their user base. But unlike session cookies, which
are meant to expire after a day, after the browser closes or the
like, Google's analytical cookie here has a maximum age of this many
seconds, which if you do out the math is by default two
years, which is to say, if you visit some website that is
using Google Analytics by embedding a bit of Google's JavaScript code in
their website, whenever that Google code is pulled from
Google's website, Google has an opportunity to plant
this cookie on your computer. And you'll get a unique ID based
on you visiting for the first time, based on the specific website that
is embedding Google Analytics. And that cookie is going
to live in your computer, according to this HTTP header,
for as long as two years. Now, that's useful for Google. It's perhaps, useful for the website. It's perhaps, a little more
invasive for me and you. Now, Google has many other
cookies that they use too. But this is, perhaps, one that
you should keep an eye out for. And indeed, in the
coming weeks or months if you poke around some of
your own browser settings, you might very well
see values like this. But what else might servers
use to keep track of us, especially if you and I are in
the habit of deleting our cookies or clearing our history, which
would be counterproductive for Google or websites that are trying
to track us in this way, but a plus for your privacy and mine
if we're behaving in this way? But it turns out there are
other ways servers can track us, including through HTTP
parameters, tracking parameters. So parameters are the
key value pairs that often appear in URLs that are
sent via GET requests typically. So we've seen one of these. If you recall when we searched
for cats on Google before, you might recall that
the URL was something like
https://www.google.com/search?q=cats. Anything after a question mark in a
URL is, indeed, an HTTP parameter. But it could be used not for innocuous
helpful purposes, like searching for cats, but also, to track you. And in fact, if you
see ampersands in URLs, that might mean that you have a second,
or a third, or more parameter up there. And sometimes the purpose
of these parameters is simply to track you as some person. So for instance, here
is a representative URL. It's a long one. And this is taken from example.com
having a path of ad_engagement?. And then I'll highlight here click_id=
and then this long seemingly random string. But there's a second HTTP
parameter in this particular URL. &campaign_id=23. So the campaign ID, certainly
with such a small number, is not meant to track you. That's meant to be sufficient input to
the website to know what types of ads should be served to you. What campaign should be served up? But this click_id, which is sort
of a euphemism for tracking cookie or tracking parameter in this case, is
what's actually keeping track of you, specifically, because
different users are going to find that
whatever link they click on has a slightly different
value for click_id. So recall that a tracking
cookie is something that's sent via an HTTP header. And so it's harder for
you and me to see it, unless we're more
comfortable with our browsers and can poke around some
underlying settings. But these tracking parameters are
right there in front of you, at least, if you click on the URL in your browser
and take a look at its entirety. Now, wonderfully, at least for us end
users who are concerned about privacy, browsers and even third-party
software are increasingly removing values like this for us. As soon as the browser
manufacturer or as soon as the third-party software
developer knows that, wait a minute, click ID has no good purpose
other than tracking our users, they can simply automatically
remove it for you. After all, when you
visit a web page and you get the HTML that
represents that web page, the browser could
certainly poke around there before you even have a
chance to click on anything. And it could scrub or sanitize
these kinds of tracking parameters. Now, to be fair, if the browser
manufacturer doesn't necessarily know what the tracking
parameter is called or if maybe the website is
constantly changing the name or trying to mix things up,
this might not work so well. But it's at least an
attempt to try to put downward pressure on this very
commonplace technique of keeping track of you and me. Now, why is this parameter
able to track us? Well, this, too, can end up in those
server logs because this would be, for instance, the web
page that I am requesting, /ad_engagement?click_id= dot, dot,
dot, that could very well be logged by the server, stored
in a database, even. And they could use that
information to know exactly which pages I have clicked
on, because I visited those links, and even what ads I have seen. And maybe that's a
good thing commercially because now they know what
types of ads I'm clicking on. Now they can serve even
more of them to me. And that might be great for
them, but probably not so great, if not, annoying or
invasive for me and you. So something else to keep an eye
out for and something else that might guide your decision
making in the days and the years to come when it comes
to picking your browser. You don't necessarily have to nowadays
use the one that comes with your phone, comes with your laptop or desktop. You can, if more comfortable,
install something else. And increasingly, you and I are
having more and more options. Questions now on these
tracking parameters or anything prior with respect to our privacy? AUDIENCE: Are the cookies
the ones that track or are the ones that are being tracked? DAVID J. MALAN: The cookies are values
that are being used to track you. So recall that-- a metaphor for the
cookies is like that virtual handstamp. And so if all of these web servers are
putting ink on your hand and on my hand and because of HTTP,
you and I, our browsers are in the habit of
presenting these cookies, these handstamps to
every website we visit, that value is being used to track us. So cookies in and of themselves
are just a technology. It's a very simple idea storing a big
random value on your computer and mine just to uniquely identify us. They are necessary to give us
features like logging into websites, maintaining shopping carts. But very quickly, especially
since the internet from the get go has been largely free to use-- or rather, a lot of
the internet has been free to use once you have
a connection, at least-- they've been used, or
in some views, abused by the advertisers, the Facebooks,
and the others of the world. So another way to think
about tracking cookies is to consider them to be
third-party cookies because, indeed, even in the Google example,
that's how they're being used. If a website like example.com
is embedding Google Analytics, and therefore, some kind of HTML tag
that mentions google.com, well then, example.com is the first-party
in that story, so to speak. And google.com is the
third party in that story. What that means is that your
browser might get cookies from both example.com and google.com. But the most important ones,
presumably, are the first-party ones from example.com because that is
the website you chose to go to and whose functionality you want to use. The third-party functionality,
like tracking your clicks and your internet behavior on that
site via Google, that's third party. And so browsers nowadays
very commonly offer options via which
you can disable third-party cookies. And that tends to be
good for privacy's sake because it means you're blocking
third parties like Google from keeping track of you via cookies. But, but, but that doesn't necessarily
mean the website isn't still using tracking parameters in some way. And you would only know that by actually
looking more closely at the URLs you're clicking on or that are
embedded in the web page itself. And that's where now browsers
and third-party software are additionally helping by helping us
remove not only those cookies, but even those parameters. But let's consider a
more concrete scenario of what third-party cookies are
and why they allow companies not only like Google to track
your behavior on one website, but even companies like
Google or other advertisers to track your behavior
on multiple websites. And in this sense, third
parties have increasingly been more powerful, more omniscient, for
instance, than the first-party websites that you and I are actually visiting. Why? Well, if there's a lot
of popular third parties out there, Google being one of them
for advertisements and for analytics, well, if lots of different
websites are using them-- maybe Harvard's using them. Yale's using them. Stanford's using them-- then
that third party very quickly becomes more powerful than even any
of those individual parties alone. Why? Because that third party, if it is
being embedded at Harvard, Yale, and Stanford, that third
party Google, for instance, kind of has eyes into
all three websites. And if it sends the same cookie
to you on all three websites, Google might actually know that you're
poking around Harvard's, and Yale's and Stanford's website when
Harvard might have no idea you're checking out Yale and Stanford. And Stanford might have no idea
you're checking out Yale and Harvard. So what does this mean concretely? Well, consider some HTML here,
such as we've seen before. And I've highlighted a couple
of salient characteristics in this particular example. Notice that I've given in this web
page not only a body, which contains the body, the bulk of the web page. I've also included a head for
the web page, inside of which is another tag called Title. And I'm doing this just
to, one, demonstrate there are more tags than we have
seen in this language thus far. And specifically, this I claim is meant
to represent harvard.edu's own website, the title of which would
be Harvard, like in the tab along the top of the screen. And inside of the body of
this page for simplicity, let's assume that for now, there's
just one big advertisement. There's no content for
the sake of discussion. There's just one advertisement. Well, where is that
advertisement coming from? It's coming from, in this case,
example.com, or our friends at Google, specifically, a file called ad.gif. And this particular URL is being
used as the value of the source attribute of an image tag. So what do I mean by this? Well, if you visit harvard.edu
in the story, what you are seeing is a big advertisement,
a big GIF, a graphic that is coming from example.com. Now, what is the implication of that? Well, suppose that Yale
is doing the same thing. So here now, for the sake of
discussion, is the exact same HTML, except it lives at yale.edu. So the title of the page
has now changed to Yale. And moreover, just to make
things really interesting, let's add Stanford to the mix. Same exact page. So the point of this story is that
Harvard, and Yale, and Stanford are all using the same third party,
example.com in this case or maybe someone like
Google in the real world. And they're requesting
moreover the same GIF. And so the same file is being accessed. But that even alone isn't
a strict requirement. The same website is being accessed
by all three of these first parties. So what does that mean? Suppose that you open up your browser
and you first visit harvard.edu, your browser is going to download
the HTML for Harvard's website. It's going to see that, oh,
there's an image tag in there. And that image tag wants to show
this ad.gif from example.com. So your browser is
automatically, by nature of how browsers work,
going to send a second HTTP request, this time requesting
ad.gif from the host, example.com. And just to tie today's
stories together, it's going to include, probably,
a Referer HTTP header that specifies where I'm coming from. And that's useful for
our purposes because it puts these requests into context. Now that server, example.com, or
Google, in the case of the real world, is going to probably respond with 200
OK like, OK, here is the advertisement. And it's going to include not
only the image, but also an HTTP header of its own. And this is our old
friend set-cookie, where in this case, for the
sake of discussion, I'm going to propose that it's setting
a cookie on my computer called ID because this is going to be my
unique identifier for example.com. Its value is going to be the one I keep
using for discussion's sake, 1234abcd. But that would be some big
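As a rough sketch of what a header like that could look like in code, here is a small Python example built on the standard library's http.cookies module. The function names make_set_cookie_header and new_visitor_id are made up for illustration, and nothing here is meant to reflect how example.com or any real ad server actually generates its cookies.

```python
import secrets
from http.cookies import SimpleCookie


def make_set_cookie_header(value: str) -> str:
    """Build a Set-Cookie header line for a cookie named "id" (hypothetical helper)."""
    cookie = SimpleCookie()
    cookie["id"] = value
    # output() renders the cookie as a "Set-Cookie: name=value" header line
    return cookie.output()


def new_visitor_id() -> str:
    """Return a hard-to-guess identifier of 32 hex characters (hypothetical helper)."""
    return secrets.token_hex(16)


# The placeholder value used for discussion's sake
print(make_set_cookie_header("1234abcd"))

# What a server might plausibly send instead: a fresh value per visitor
print(make_set_cookie_header(new_visitor_id()))
```

The first line printed is Set-Cookie: id=1234abcd, while the second differs every time the script runs.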
random value for each of us. And my gosh. This thing is going to last a year. That's the number of
seconds in 365 days. So this cookie is being
planted on my computer by example.com because
I visited harvard.edu. So Harvard is the first party. Example.com is the third
party in this case. But here now is the concern. When I visit yale.edu
with that same browser, my hand has been stamped
by example.com already. And so what happens is that
my browser now presents that handstamp to example.com, sending
the same ID and the same value, that is, the same handstamp. The host is as before example.com. But this time, the referrer
happens to be Yale. So in other words, after I
visited Harvard and my hand has been stamped with this tracking
cookie, this third-party cookie from example.com, my browser,
when I visit yale.edu, is going to present that same
handstamp again, this time, to example.com with this referrer. The next time I use my
browser to visit stanford.edu, the same message is going
to be sent from my browser to example.com to request
that same ad, this time now from stanford.edu's website. Now, what's the implication? Via these three HTTP
requests, example.com knows that I'm visiting
Stanford, and before that, Yale and before that, Harvard. And none of Harvard,
or Yale, or Stanford necessarily know that I'm visiting
any of those other websites. The third party is the more powerful. It's the more all seeing, simply because
example.com, or in the real world, Google, is just so
darn popular, that it's embedded in so many darn websites,
Google and others know almost everything, dare I say, about what you
and I are doing on the web because these ads are all
over the place in this way. So we've seen a very simple example. But it's simple because cookies
and HTTP really are relatively simple. It's once you realize
how they work, that you can use them not only to solve
compelling problems for all of us, sessions, and
shopping carts, and the like, but also can be used to
monetize the internet and has been used historically to
monetize the internet, or even worse, perhaps, for us, to track our
individual clicks and behavior. So let me pause here and see
if there's any questions now on third-party cookies
and why, therefore, it's perhaps so compelling for you or
me to opt in to disabling them, or better yet, to use browsers that
are starting to block them for us. AUDIENCE: What browsers are
more secure among others considering tracking parameters? DAVID J. MALAN: Sure. A quick tweak. I wouldn't say that
some browsers are more secure than others in this context. I would say that you want browsers
that are more privacy conscious or privacy preserving because that's
what we're talking about today. Hopefully, all of them
are just as secure when it comes to HTTPS and the encryption
that's just keeping our data protected between points A and B. So generally, Safari has been pretty
good when it comes to privacy. And they are the ones that very
recently announced that they're going to start
giving people the feature of removing tracking parameters from URLs. In fact, the sample
URL I gave was actually from Apple's recent
announcement about exactly that. DuckDuckGo is probably the most
popular third-party browser that is very privacy
conscious and tries to disable a lot of these tracking behaviors. Another one is Brave. Perhaps, the worst offender
is probably Chrome, even though I, myself, am guilty
of using it because it's so integrated into Google's ecosystem. But Google, of course,
has made their business on monetizing your behavior and mine. So that is, perhaps, one to put
toward the bottom of the list if you're concerned about this. So that's kind of how
I would rank things. And there's yet others. But I think those are
some of the most popular. And then, of course, in
the Microsoft ecosystem, there is Edge. And there's Firefox too. I should have put them
higher on the list. They are more privacy conscious,
I do believe, than Google. So with all of these mechanisms
for tracking in mind, what can we do to protect
all the more of our privacy? Well, you might already know of
this feature, private browsing. So you don't necessarily have to
delete all of your browser history and delete all of your cookies. You can instead, on occasion,
open up a special type of window, which most
of today's browsers support that puts you into
private mode or incognito mode. And you can think of this as giving
you just a different chunk of memory in the computer that doesn't know any of
your past browser history, that doesn't have any of your past cookies,
that doesn't remember any of your past usernames and passwords. You're sort of starting fresh, so
that everything you do in that window is brand new. The catch, though, is
that everything you do in that window still works exactly as the
web works as we have been describing. So you still might
have tracking parameters. You still might have tracking cookies. You still might have server logs. But when you close that private window
or you close that incognito mode, at least, the information is
discarded from your computer, so that if tomorrow, you
do the exact same thing and open up an incognito
window again, then it's as though you're starting
fresh with that server, except for the reality, as
per our past discussion, that fingerprinting is
still a possibility. Your IP address can still be factored
in as can be other information that your browser
might still be leaking. But what you're not doing is
contaminating, so to speak, your general browsing
history with specifically what you're using that window for. What you should realize, too, is that
private browsing or incognito mode is entirely client side. So particularly, those
logs that we have mentioned are still being stored by the server. They might be storing, perhaps, a
different tracking cookie or parameter for you because it doesn't
necessarily recognize you when you're in private or incognito mode. But it doesn't mean that your tracks
are completely absent from the internet. Rather, it's really just scrubbing
them from your local computer and decreasing the probability,
but not eliminating the probability that a server still knows that it's you. So I would use with care. But with that said, if you take
a course in web development or you already design your own websites,
using private browsing or incognito mode can also be useful
for development purposes because it's a way of opening
a brand new window that has no recollection of
maybe past bugs that you had or past web pages that you clicked on. And it's very commonly used
as part of development tools to actually facilitate and mimic the
idea of starting fresh with some site. Super cookies, though,
these sound delicious, but these two are kind of the worst of
cookies that we've discussed already. We saw session cookies
for maintaining state. We saw tracking cookies
for tracking you. Super cookies are not so super, really. These are cookies that
are typically injected by a third party, like your
company, your university, or your internet service provider
into your HTTP request, which is to say, if you, from your
browser, visit some website, that traffic, of course, goes
from your laptop or phone through some internet service provider,
whether it's on campus, or home, or wirelessly in the real world. And if whoever is providing
you with that internet service can see the contents of
that virtual envelope, there's technically
nothing stopping them from opening up the
envelope, so to speak, and adding one or more
HTTP headers of their own. And so mobile phone carriers,
for instance, in the past have been known to do this,
whereby, if you are just requesting a website, like
example.com from your phone, they might-- halfway
between you and that server, they might inject a cookie of their own. For the sake of discussion, I'm going to
use the same name and value as before. id=1234abcd. But what's noteworthy here is that that
value is not coming from your phone. It is not coming from your browser. You can clear all of your cookies. You can clear all of your history. You can use incognito or
private mode on your phone. You're not going to see any
trace of that client side because the darn thing is being
injected into your traffic between you, point A, and the server, point B. So this is sort of a canonical example
of a machine in the middle attack. But your internet service provider
in this telling of the story is doing it because
they want to track you. Or they want-- because of
advertising relationships they might have with
some websites, they want to make sure that you can
be tracked by that website, even if you have opted out or
have been clearing proactively your very own cookies. So suffice it to say, these have
been particularly controversial. And thankfully, you and I do
have a pretty good defense here. Just never use HTTP without encryption. If URLs are always https:// and
then something, theoretically, this attack or this "feature" of your
mobile phone carrier should not be possible. Why? Because if the contents of
the envelope are encrypted, not only can't they see
what's actually inside, they can't add anything to the mix
because they don't have the key that's being used to encrypt that information. So simply always using HTTPS is
one solution to this problem. And also, at least, in the US,
some of the mobile phone carriers got a lot of backlash for this. But so, you can occasionally log into
your cell phone provider's website, go through a bunch of menus, find an
option to opt out of this feature. But I will say from experience, that
they typically bury these options too. And so it's not necessarily
even the easiest thing to find. But again, this is just a natural
result of the underlying technologies being used, or if you prefer,
abused, for alternative purposes. All right. Let me pause here and see
if there's any questions now on these super cookies, which indeed,
are not so super or anything prior. AUDIENCE: Given that cookies
store passwords and emails, can the adversary impersonate
another person by copying that cookie and pasting it into his own
computer and visiting that website? DAVID J. MALAN: A good question. So cookies can be used to store user
names, email addresses, even passwords, though, I would generally
not recommend doing this. But they theoretically
should be secure, even if you're storing
those values in cookies because they're going back and forth
between the browser and the server using encryption if
HTTPS is, indeed, in use. A danger, though, is that if someone
has physical access to your computer, it's very easy to poke around your
own browser's cookies, at which point, they're going to see your password,
which is probably not a good thing. So on an alternative
would be, for instance, for a browser to encrypt the cookie
or minimally digitally sign it, so that it can be identified as
belonging to that same server. But even better, I dare
say, would be for servers to only plant big random values
as cookies on your computer, like this virtual
handstamp, and then store recollection of your username,
email, and/or password on the server. So stamp my hand to remember
who I am and that I'm logged in, but don't bother expecting my browser
to send my username, my email address, my password again and again. It should suffice to
send that just once. Other questions here? AUDIENCE: I've heard
that it's possible-- for example, if I'm
writing a text to someone, it's possible to intercept, to alter
my text and send it on my behalf. So it's going to be a different
message, so it's possible to ask, maybe, for sensitive information. So I was wondering, don't those
messengers use something like cookies? How can this be possible? DAVID J. MALAN: A good question. So SMS, or traditional
texting, is generally insecure. It is very easy for someone
to forge your phone number. And in fact, if you've gotten
a lot of spam via text, that might be exactly what is happening. Or worse, it's also possible, recall, to
steal your SIM card essentially or port it to another carrier, so that someone
can intercept all of your actual texts. So in general, nowadays, you
should be reducing, if not, eliminating your usage of SMS, at least,
for anything important or anything you want to keep private. When it comes to other messaging tools,
like iMessage, like WhatsApp, Signal, Telegram, there's a lot of products
nowadays, third-party or otherwise, that use end-to-end
encryption, which recall, we discussed a couple of classes ago. And in that case, even though the
data is going through a company like Facebook, theoretically, assuming
they're behaving honorably and have implemented end-to-end
encryption properly, then even they cannot see the
message going between their servers. And that is independent of cookies. Cookies have no part of that solution. That solution is entirely thanks
to cryptography and encryption with digital signatures. All right. So let's consider one other
threat to your privacy that you might not
necessarily have thought about that isn't relate just to the web,
but really, your use of the internet more generally, namely,
DNS, the Domain Name System. Thankfully, even though computers on
the internet all have IP addresses, these unique numeric addresses
that we've discussed, you and I don't have to remember
what server's IP addresses are because servers typically
have domain names, something like harvard.edu, yale.edu,
stanford.edu, google.com, amazon.com, and others. But how then-- when you type in any of
those domain names into your browser or into any piece of
software on the internet, how does your browser or your computer
know what IP address to contact? Well, it turns out that there's a
domain name system in the world. And this is a system
deployed throughout the world on the internet whose purpose in life
is to translate domain names to IP addresses, so that on the
outside of those envelopes can, indeed, go the IP addresses
of source and destination. But you and I, as humans,
don't need to know or remember exactly what those IP addresses are. You can think about this
back in the day of when we were in the habit of
typing in phone numbers to actual analog landline telephones. It was actually pretty hard to
remember lots of people's numbers. And you might even have
had an address book that you looked up people's numbers in. Or there were certain mnemonics. For instance, in the United States,
there was a number, 1-800-COLLECT, C-O-L-L-E-C-T, which was just much
easier to remember than the actual numbers for making a collect call. The equivalent on the
internet is DNS, which just automates this process for us,
so that every website, every service can have its own unique name,
but it's translated automatically for us via DNS servers throughout the
world to the corresponding IP address. But why is this problematic? Well, it turns out that DNS servers are
typically in a few different places. One, you probably have one in your home,
or your company, or your university. And it probably is built into, if
in your home, the router, the device that you're using just to
connect to the internet. But your internet service provider
also tends to have a DNS server. And that DNS server
probably knows about way more IP addresses than your
own home does because why would your own home network know about all
of the IP addresses in the world? But with that said, why would
your internet service provider know about all of the possible
IP addresses and domain names in the world? Well, suffice it to
say for our purposes, there's a hierarchical system. So even if your home
router doesn't know, even if your internet service
provider doesn't know, there's some other server on
the internet that can eventually give you the answer to a question
like, what is harvard.edu's IP address? What is yale.edu's IP
address and so forth? And for efficiency, once that answer
has been figured out somewhere, then your internet service provider
might remember, or cache, the answer. And even your home router, and heck,
even your device or your browser might remember the same
answer for efficiency, so we don't have to keep
asking the same question. And it turns out by convention, DNS uses
port 53, if you recall our discussion of also using unique numbers to identify
things like HTTP, or 80, HTTPS, or 443, or 22 for SSH. DNS tends to use 53. But the catch is that
the traffic used for DNS is typically unencrypted, which means
that when your phone, or your laptop, or your desktop is asking your home
device, or maybe your internet service provider, or someone else, what
is the IP address for harvard.edu, or yale.edu, or the like, you're
actually announcing to the world what website you are about to visit. Why? Because you're waiting for
a response from the DNS server to actually tell you
the corresponding IP address. So this isn't great. And moreover, your internet
service provider, therefore, knows all of this information
about you because every time you ask for a new website that you've
never been to before, your home network probably doesn't
know the IP address, so you have to ask your
internet service provider. And again, they might ask someone else. But the internet service provider
is going to know now that you asked. So your internet service provider,
be it for your home network or for your cellular
phone, pretty much knows every website you've
ever been to, assuming they're logging this information,
which they probably are, unless there are regulatory or legal
requirements that say they can't or they can't for very long. Now, why is this the case? Well, the domain name
system essentially requires that we ask these very questions. And if the internet service
providers remember these answers, well, they can keep track of everywhere
we've been, at least, at a high level. DNS only gives them back a translation
from the domain name to the IP address. What it does not include is the
specific page that you're looking at, the specific URL, the folder,
the file that you're looking at. So your internet service
provider might know you're visiting somewhere
on harvard.edu because you asked, of course, for its IP address. But they don't know what
department you were looking for or what course you were
looking at or the like. But there's still a decent amount of
invasion, therefore, of your privacy if you'd rather that your ISP or someone
else just not know that information. So increasingly, there are
alternatives to the standard DNS functionality, one of which is called
DNS over HTTPS, or DoH for short. This means exactly that. Instead of just sending out DNS
requests unencrypted on port 53 to the local DNS server, now they're
sent, potentially if you enable this, over HTTPS. And what this means is that they will
be sent using the HTTP protocol, which we've talked about endlessly
in these virtual envelopes, but securely using TLS, which is
the encryption protocol that ensures that no one else can see what's
going on inside of that envelope, including your internet
service provider. Now, someone is going to still
know what domain name you're looking up because after all, to
whom are you sending this request? Maybe you're sending it to Google. Maybe you're sending
it to some third party. But you are sending it to someone. But at least, goes the thinking, it's
not your internet service provider, who really doesn't need
to know this information. So that's one way of thinking about it. And there are alternatives to this. There's actually something
called DNS over TLS, DoT, which is very similar in spirit, but it
doesn't even bother using HTTP. But it is still using encryption. So this is something
that's increasingly common. It's not necessarily the
default on a lot of systems. But it's yet another feature
of today's technology that you can increasingly look
for, seek out, enable proactively if this, too, is a concern, that is, if you
don't necessarily want a third party, like your ISP, to know what
it is you're accessing. And it might not even be your own ISP. If you're on the road, in a
coffee shop that gives Wi-Fi, or an airport that gives
Wi-Fi, at that point, your internet service
provider is effectively that coffee shop or that airport. And do you really want them
knowing everywhere you're going? Depending on
your comfort level, you might prefer that, at
least, all of your DNS requests go to some other central party that
you do trust for whatever reason, so you're not just informing every
different Wi-Fi hotspot that you might be using around the world. Let me pause here and see
if there's any questions now about DNS and this concern with respect
to your privacy or these solutions thereto. AUDIENCE: Can DNS [INAUDIBLE]
used to deceive users and steal information, which is sensitive? DAVID J. MALAN: Absolutely. So DNS, itself, can also
be used for evil purposes. If you control the DNS server, you
don't have to give an honest answer. If someone asks you for the
IP address of harvard.edu, you could give them the IP address
of some completely malicious server that you control. However, if the user, like Ryan,
in this case, is using HTTPS, the whole point of HTTPS is to encrypt
the data between browser and server. And presumably, the
browser is going to try to request the TLS certificate
of harvard.edu in this case. But if the server at that IP address
returns the wrong certificate, one that wasn't issued for the right
website, then the connection might fail. And you'll be given a warning that you
can typically ignore in your browser. But this should be preventable
because you should at least be warned that that is not working correctly. And ISPs actually do this quite often. If you make a typographical error
sometimes on home networks, or coffee shops, or airports, you might actually
still see a website of search results, or worse, advertisements. And that's because even if you made
a typo in the domain name, the coffee shop's or the airport's
DNS server is still going to return to you an
IP address of their server, so they can at least
push some content at you. So let's consider some of
the mechanisms via which we can push back on some of these
more invasive privacy practices. And one is something we've
talked about before, namely, a virtual private network, or VPN, which
is an increasingly familiar technology. But it's worth knowing
exactly what problems it is solving for you and
exactly which problems it is not, particularly, if
you're using such a service to protect your own privacy. Well, what is a VPN? It allows us, recall,
to connect from point A to another point B using a
completely encrypted tunnel. So it doesn't matter if there
are machines in the middle, as indeed, there will
be on the internet. All of the traffic between A and B
on a VPN is encrypted or scrambled. So what does this do? This allows you to access sometimes
a corporate network or a university network that might have
servers or services that are only accessible if
you are physically or virtually on
that particular network. This ensures that even if you're at
home, or in a cafe, or an airport, at least, you have an encrypted,
more secure connection to the campus or the corporate network,
at which point, the campus or company might be more comfortable with
you accessing those services. Now, this does not prevent
you still from being hacked because if you're running malware
on your own computer accidentally, it doesn't matter if you have an
encrypted connection to the company or campus. You might very well have an infected
connection now to the company or campus if you, yourselves, are infected. VPNs can also be used to create
the illusion that you're actually in one country and not another. Why? Well, if point A is where you are and
point B is somewhere abroad, well, to the rest of the
world, if you start using this VPN, this virtual
private network, you will appear to have an IP address
that is in that foreign country because all of your internet traffic
for chatting, video conferencing, the web will be sent
through that VPN by design. That's what a VPN is for. And it will come out the other
end in that foreign country and then continue on its way to the
chat service, the email service, the web service, or the like. So each of those services
will think that you live or are physically in that foreign
country, even if you are not actually. So what's the implication of this? A virtual private
network only guarantees that the connection between you
and that point B is encrypted. It doesn't necessarily mean that
once you're out of that VPN, it's going to stay encrypted,
especially if you're still using HTTP and not HTTPS. But it does, at least, encrypt
everything between points A and B. It also does change what your
IP address appears to be, so that you will, indeed, appear
to have an IP address that's from that foreign country and
not your domestic IP address, which might have some value
in covering your tracks or decreasing the probability
that you'll be identified. But again, we've seen so
many other mechanisms today, whereby, your browser
can be fingerprinted in the context of the web,
that someone might still be able to realize that, OK,
your IP is different today, but this still looks like you,
even if they don't necessarily know that you are David
Malan or you, yourself. But it does solve at least
one problem, which is encrypting all of your traffic between points A and B. Well, there's another
piece of software that's been popular for some time
called Tor, The Onion Router. So this is a piece of software that
you can install on your own Mac, or PC, or other device. And this solves
the problem a different way, using additional layers of encryption to try to give
you a higher probability of privacy. And here's a picture that Tor,
themselves, put on their website. And it depicts here
you on a very old-school PC connecting to a whole
bunch of nodes inside of this Tor network
connected to ultimately maybe the websites that you're visiting. And what happens here is that when your
computer is running the Tor software, the Tor software first figures
out, OK, who else in the world is using the Tor software? Because it's going to use those other
computers to route your traffic, up, down, left, and right,
kind of like the movies or TV, where you see a map of the world and
the traffic is bouncing back and forth and back and forth. That's kind of the spirit of Tor. And what happens is
if your computer here wants to send a request to a
website that's maybe over here and it decides, for instance,
to route it through one, two, three computers, what
the Tor software will do is encrypt the request at
least three different times. Whatever web request you are
sending, whatever email you are sending, whatever chat message
you're sending, whatever service you are using between point A on the
left and point B on the right is going to be encrypted
with this node's public key, with this node's public key,
with this node's public key. And so here's the onion
in Tor, The Onion Router. You are encrypting layer, upon
layer, upon layer of data, so that mathematically, recall per our
discussion of public key cryptography, only this node can peel
off one layer, only this node can peel off one layer, only
this node can peel off one layer using their own respective private keys,
which undoes the effect of your having encrypted your traffic. So what you're really
doing here by choosing, perhaps, a different path every
request, a different path every day, is, with Tor, effectively
covering your tracks in some sense. And by design, the Tor software doesn't
remember much information at all, so it doesn't have the sorts of logs
that I proposed can be worrisome, at least, in the context of web servers. By design, Tor is meant to preserve
your privacy with higher probability. And so by design, it just doesn't keep
nearly as much information around. Now, this isn't to say that, if you're
doing this for malicious purposes, trying to evade the
authorities, this computer,
this computer, this computer couldn't be subpoenaed, so to
speak, by some government entity, which could then try to reconstruct
the path that your data took. But the point is that it's
generally quite laborious. By that time, much of that data has
disappeared from those interior nodes. And so they don't have
much information to share. And so increasingly, it does provide you
with some higher probability of privacy by layering your requests with
encryption, encryption, encryption and sort of trusting that
these interior nodes are going to relay it to the
final endpoint, so something to consider if of interest. But realize, too, that because of how
the internet works with IP addresses, because of how the internet
works with port numbers, it's still possible on a network to
know who is using Tor, for instance. So if you happen to be the only person
at home, the only person on a company or on a university network
who's using Tor at the moment it's being used for
malicious purposes, odds are, you could be targeted as
the source of that attack. And so realize, in particular, that
this just raises the bar to detection. It raises the bar to your
privacy being invaded. But it does not, nor do any of
the technologies we've discussed, give you absolute protection
of these same properties. So there's one final mechanism when
it comes to preserving one's privacy that's thankfully
increasingly available to us on devices, on desktops and
laptops, and especially, on phones. And that's this notion of
permissions, which isn't anything new. But as iOS, and Android,
and other operating systems have evolved, increasingly, you and
I are being asked by our operating systems, do you want to allow this? Not only do you want to
allow this program to run, but do you want to allow this program
to access your camera, for instance? Do you want this program to access
your microphone, for instance? Do you want this application to
access your contacts, for instance? So on the one hand, we're being
given much more fine-grained control, which is a good thing, presumably. At the same, time, it's also just
pushing the decision onto you and me. And very often, with these applications,
as you've probably found, well, if you don't enable the camera
and give access to the app, it just might not work because they
have some code in their application that says if camera's not on,
then do not do anything useful. So there's this tension between
usability and privacy in this case. But thankfully, there are
finer-grained controls too. On iOS, for instance,
you might be prompted, do you want to give this
app access to this feature always, or only while using
the application, or never? And that's certainly a
good thing for something like the camera or the
microphone, where it would be nice to trust
that when you close the app and put your phone in your
pocket, that it's not still listening to or trying to watch
you from this built-in hardware. Now, there are some
features that might need to run all of the time, which include
location-based services, which is to say that our phones,
especially nowadays, can pretty effectively track
our location using GPS, or Wi-Fi, or some other technology. Now, that's of course, useful, if
not, necessary for using mapping applications, like Maps, or
Google Maps, or the like that help us get physically from point
A to point B. But very commonly, these applications,
at least, by default, ask for access to your
geographic location always, which means that just
by walking down the street, even if you're not following
a map on your phone, the app can still
be tracking where you're going. And certainly, among the
Googles and the Apples of the world nowadays
or other manufacturers, they certainly know
pretty much everywhere you and I are going if we leave these
location-based services on by default. So this is an example of something of
which you should be mindful if only because here is yet another
example of information that logically, when you think about
it, OK, obviously, that makes sense. They must be keeping track
of my location, otherwise, how could they provide
me with mapping services? But pause and think now, perhaps, about
exactly what the implications are for you, for your privacy,
of just walking around 24/7 with these radios now in our pockets. So even though there are quite
a few threats to our privacy, online especially, at least,
there are these mechanisms that you and I can enable to at
least preserve some of the same. Well, what have we done
over the past few weeks? We began with a look at how we
can secure our accounts, then our data, then our systems, then
our software, and today, of course, focusing on preserving our privacy. And by way of the various
technologies we've looked at, the stories we've told, the
principles that we've introduced, we hope that in the days,
the weeks, and the years to come, you can use all
of these first principles, and these ideas, and
these building blocks to extrapolate to how
new technologies work, to how new threats might affect you, and
to what questions you should be asking of either the software you use or
the software you develop to ensure not only that your communications
are secure, but also that it has these privacy-preserving
properties that you, and your users, and your customers might want. This then was CS50's
Introduction to Cybersecurity. And this was CS50.