Hypertext '08: Jon Kleinberg, Link Structures, Information Flow, and Social Process

John Kleinberg

This year’s conference emphasizes social linking and its relation to information linking.

A striking slide illustrated the tangled interconnections of online friendships, as opposed to the red and blue nodes that characterize political blogs (with some neutral interconnections).  (Blackstorm-Huttenlocher-Kleinberg-Lan 2006) and (Adamic and Glance 2005)

Bridging levels of scale; Zacharly 1977 studying a university karate club in the process of splitting in two.  34 nodes in the social network, 2 years researching each of these nodes, big chunk of a PhD work to investigate the conflict.

Compare the 30-year-old study of 34 nodes to the kind of information we can get out of massive network datasets — both more and less information. It’s easier to measure, hard to pose nuanced questions about what the connections mean. We can hand-code small subsets, but not on the large scale.

Notes that the goal is not to accumulate huge amounts of data, but rather to find the point where these two lies of research converge.

Intrinsically missing from large-scale studies — we need to enrich our notion of link structure, so as to be able to talk about more complex, subtle questions.

Social networks aren’t static structures — they are circulatory systems for ideas and information. Tension between the global reach of diffusion and the localized views that most datasets (even massive ones) provide.  You can watch one person adopt something and see who’s influenced by that, but you can’t really watch the first 1000 people to adopt iPhones. How to find a dataset that will let us watch something spread, not just locally in the neighborhood of a node, but on a larger scale.

Example: chain-letter petitions.

Quoted from Vannevar Bush on the associated trails running through documents, and people as trailblazers. A person blazes a trail through a network of documents… a chain letter is a document that blazes a trail through a network of people. The document follows the social structure.

Shows the traditional picture of information propagating along a neat tree.  But the full tree is unobservable. But a chain letter petition includes traces — a signature of all people that this particular incarnation of the chain letter has passed through.

People change the list, changing spellings, or adding joke names, or truncating sections. The analogy to mutational events is computationally very useful.  Every kind of genomic mutation that you can imagine.

Showed 20,000 nodes of the Iraq chain letter — looks like hair.  More linear propagation than lateral propagation. Really few branches. (Jon has clarified that we’re reconstructing the chain based on what we find… we don’t have the whole tree.)  Trees for other chain letters have similar structure.

Timing gets you closer to the answer. Viral spreading is implicitly synchronous — that each branch propagates at basically the same rate.  Biological viral infections impose internal schedules, but people don’t forward information at the same rate, so different branches will progress at different rates.  Notes that there’s been productive recent research in the time people take to respond to emails.

Because people within social networks will likely get multiple copies of a chain letter, the relative dates at which people woke up and found multiple copies of the chain letter in their in box means that the chain latter followed a depth chain — only one copy was produced from each related group of friends who received it at the same time and got a copy from someone else before they acted on it. LiveJournal friendship networks produce a similar tree shape.

Spatial clusters… even in online social networks, there’s a large amount of homophilia (you are friends with people who are similar to you).

Opposing influence — sometimes people act only when they have multiple stimuli (multiple requests to send on the chain letter).

Altough you may hear about the chain letter from a remote link, you are not likely to participate until you’ve heard about it from enough of your closer relationships… so the chain letter only spreads locally, it doesn’t jump a great distance to bridge the gap between you and a distant connection who shares few of your friends.

Can we use this information to predict which viral events will lead to cascades? Columbia University “Music Lab” has a leader board. They ran eight parallel versions of the site, letting the universes evolve independently. The leader board feedback leads to inequality. Random symmetry breaking in the beginning gets amplified. Genuinely bad songs didn’t get propagated, and good songs weren’t totally published; nevertheless there may be an element of inherent unpredictability in the efforts to predict viral success.

Protection of anonymity. Handing data over to researchers with anonymization is “weak to the point of being dangerous.”  When we have detailed data about people, anonymization runs into trouble.  Someone who posts under multiple identities can readily be identified as the same person.  [I love the word — “de-anonymize.”]

The implications — we may not object to our use trails being published without names attached, but as Jon notes it’s possible to correlate the anonymized private data (such as Netfilx rentals) with public data (such as IMDB ratings).

The attacker of an anonymous network can have more power if they are part of the system. It’s not hard to plant yourself in a large network, and leave some kind of privacy-breeching Trojan horse. If you knew the information would be released, it’s surprisingly easy to compromise privacy by exploiting the pattern of links. The pattern will be so rich that anything  you create will be highly distinguishing and highly findable.

A network of 12 nodes is sufficient to identify within a dataset of 100M nodes. Would it be computationally hard to locate a particular node of 12?

In LiveJournal, you and 6 friends chosen at random can carry out an attack compromising (within the domain of an anonymized data set) 10 users. This can happen even unintentionally — you and a group of friends likely already have a unique linking structure that compromise a set of people you’ve already linked to.

 A clique is a set of nodes that are all interconnected; an independent set has no links.  In a large set, you’ve got to have either a large clique or a large independent set.

Takeaway — one way to prove that something exists is to show that there’s a high probability that it will occur in a random structure.

The IM graph is not random, but we’re randomizing the subgraph we’re targeting. It’s very unlikely that your subgraph is in the set of data released. We don’t need to randomize to get that signature — friendship structures already lay the groundwork for this kind of attack.

Reflections on privacy — social networks are skeletal structures, but they are something that needs to be treated with care. Anonymization of data doesn’t really protect users — only our contractual agreement with the service provider is protecting our information.

 t ^ -1.5 describes the likelihood that you will answer a given e-mail in t days.  (Barabasi 2005)

Toronto Globe and Mail — MySpace “doesn’t just create social network, it anatomizes them. It spreads them out like a digestive tract on the autopsy table.”  Do we want to know all that we can learn when the guts of our social networks spill out?  These relationships have been implicit, but social networking makes those relationships explicit.