# Hypertext '08: Session 3: Social Linking II: Analysis and Modeling

Chair: Andreas Hotho (Universität Kassel, Germany)

The Very Small World of the Well-Connected (Long Paper)
and Anna Gilbert

Matt Bonner is the presenter; this paper won a “best paper” award at the banquet last night.

Vertex Importance Graph Synopsis. (Promises a definition!)

Opening
image — “Network or Hairball?” Huge networks are difficult to study
and share. To shrink or summarize a network, you create a subgraph of
vertices you decide are important. Study these important vertices, and
compare their behavior to the rest of the graph.

Degree,
betweenness, closeness, PageRank.  He’s spending time describing in
detail all the importance measures. Assortativity vs
disassortativity… [The graphs of “betweenness” and “PageRank” look
algorithm on betweenness.]
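
To make the idea of "importance measures" concrete, here is a minimal sketch, not the presenters' code. It computes two of the measures named above (degree and closeness) on a made-up toy graph and keeps the top-scoring vertices as a subgraph. The graph, the function names, and the choice of k = 3 are all assumptions for illustration.

```python
from collections import deque

# Toy undirected graph (hypothetical): two triangles joined through vertex "d".
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"},
    "f": {"e", "g"},
    "g": {"e", "f"},
}

def degree(graph):
    """Degree importance: number of neighbours of each vertex."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

def closeness(graph):
    """Closeness importance: (n - 1) / sum of shortest-path distances.

    Assumes the graph is connected, so every vertex is reachable.
    """
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source vertex
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        scores[source] = (len(graph) - 1) / sum(dist.values())
    return scores

def top_k_subgraph(graph, scores, k):
    """Keep the k highest-scoring vertices and only the edges among them."""
    ranked = sorted(scores, key=lambda v: (-scores[v], v))  # ties broken by name
    keep = set(ranked[:k])
    return {v: graph[v] & keep for v in keep}

by_degree = top_k_subgraph(GRAPH, degree(GRAPH), 3)
by_closeness = top_k_subgraph(GRAPH, closeness(GRAPH), 3)
print(sorted(by_degree), sorted(by_closeness))
```

On this toy graph the two measures pick different vertices: degree keeps the two triangle corners plus one more triangle vertex, while closeness keeps the bridge vertex "d" and its two neighbours. That is the kind of difference between synopses the talk goes on to show.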

Demonstrated how subgraphs can
differ greatly based on which importance factor your subgraph selects
for. [To translate into my discipline, I suppose this is a close
reading of a small portion of a text, by blocking out all the
information you don’t want to find.]

Less than 10% of the nodes are needed before the subgraph results look very close to the overall dataset.

This
is the third time the presenter has apologized because a dotted red
line is invisible in the slides being projected. His oral explanation
of what the lines should signify is clear enough, I gather, though he’s
referring to special kinds of graphs that network researchers (not I)
would be expected to understand.

A blog aggregator in his
dataset was connecting to many, many nodes, not just the most important
nodes, so there’s a notable anomaly in degree. [I’m actually interested
in this, though he’s mentioned it only as a footnote and offered to say
more on it during the Q & A if desired.]

[“NP-complete:
reducible to Steiner…” nope, the slide changed even before I could
write this down as an example of a statement that’s perfectly
straightforward but requires background knowledge that I would need to
look up. I missed what “Keep One” and “Keep All” mean.]

[What
I’m getting out of this is a mathematical illustration of the thesis
that it is possible to gain useful, accurate information from a study
of a subset within a dataset, and that other factors (which are over my
head) seem to control for whatever bias you introduce when you select a
set of nodes for a particular characteristic. Unlike the “close
reading” metaphor I used earlier, if you know what you’re doing, you
can select a “synopsis” that emphasizes the features you wish to study,
and identify how closely your subset matches the overall dataset’s
similar features.]

Klaas
presented, beginning with a basic introduction of tagging [probably
not necessary for this audience, but it was brief and… and I don’t
mind an introduction that gives me a minute or so to adjust to a new
speaker’s cadence, accent, and relationship to the visuals.]

How do users influence each other in tagging systems?

User
interface, user brain, background knowledge, and “something else” all
influence tagging behavior.  We can only observe the tagging behavior
[well, we can learn something about the user interface, too, but I
think that comment is intended to contrast with the inscrutability of
the brain.]

Klaas introduces folksonomy, [again I would assume
most in the audience already know the term, but I’m always interested in
how individual scholars define their terms… this is necessary in the
humanities, since we don’t always have empirical evidence to help us
define our terms.]

Presented sample of coding mechanism
“co-occurrence stream”. Measured how many occurrences, how many users,
how many different texts, and how many resources.

After 100
occurrences of a term, you have about 50 texts.  The number of distinct
texts is growing but the rate of growth decreases… it’s a nice
logarithmic scale showing that each time the tag appears again, it is
more likely that the tag is used on an existing text, rather than a new
text.

The frequency of texts is inverted — The most often used
text has a 4-5% probability, text #100 was 0.1% probability. [Did he
describe that as “fat tail distribution”?  I assume that’s the
opposite of the “long tail”?]  Not quite as even as the tag graph, but

Resource streams [I missed something… does this mean he’s
graphing the channel by which the document is delivered, or by which
the tag entered into his dataset, or the timeline of each entry in his
dataset regardless of source or other characteristics? He briefly went
through some related research, so obviously this is a common concept
that I would know about if I knew the literature.  The whole idea of
measuring the time between the creation of links is an exciting new
concept for me, since it blends well with composition pedagogy that
considers writing as process rather than product.]

[Equation
that I can’t type fast enough] Probability of selecting from background
knowledge — probability of selecting word w for topic t modeled by —
[drat the slide changed.]

[Another equation] probability of repeating a previous tag assignment.
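
The equations themselves went by too fast to capture, so the following is only a rough sketch of the two ingredients mentioned: with some probability a tag is drawn from "background knowledge", otherwise an earlier tag assignment is repeated. The parameter name `p_background`, the 1/rank weighting of the background vocabulary, and all the numbers are assumptions for illustration, not the speaker's model.

```python
import random

def simulate_stream(n_assignments, p_background, vocabulary_size, seed=0):
    """Simulate a tag stream that mixes background knowledge with imitation."""
    rng = random.Random(seed)
    # Hypothetical background knowledge: word at rank r has weight 1 / (r + 1).
    words = list(range(vocabulary_size))
    weights = [1.0 / (rank + 1) for rank in words]

    stream = []            # every tag assignment, in order
    distinct_counts = []   # number of distinct tags seen after each assignment
    seen = set()
    for _ in range(n_assignments):
        if not stream or rng.random() < p_background:
            # Select a word from background knowledge.
            tag = rng.choices(words, weights=weights, k=1)[0]
        else:
            # Repeat a previous tag assignment, chosen uniformly from the stream.
            tag = rng.choice(stream)
        stream.append(tag)
        seen.add(tag)
        distinct_counts.append(len(seen))
    return stream, distinct_counts

stream, distinct_counts = simulate_stream(
    1000, p_background=0.3, vocabulary_size=5000
)
print(distinct_counts[99], distinct_counts[-1])
```

Plotting `distinct_counts` against the assignment number gives a growth curve of the same general kind as the one described earlier, where new tags keep appearing but repeats become increasingly common.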

Now he’s finished presenting the model, and he’s moving on to compare the observed behavior with the behavior he modeled.

[I’m
starting to pick up on the biological metaphors — mutating of tag
assignments, limited number of texts in the pool.  I hadn’t realized
just how deeply the “virus” metaphor is rooted in this subject
matter… Does the metaphor drive the science? Are there patterns that
don’t already have known biological instances that aren’t pursued
because there isn’t already a vocabulary to describe them?  I’m going
to have to re-read The Selfish Gene for all the parts that I skimmed
over when I was reading it simply to understand “meme”.]

After
more than 1000 tag assignments, the frequency of assignments is more or
less stable. There’s a drop around tag number 7 in resource streams.
Cited http://isweb.uni-koblenz.de/Research/Tagdataset and
http://tagora-project.edu

First question, from the conference
host, was about whether the user interface affects tagging behavior.
[Here I am, thinking that I’m an idiot, and a big-shot followed up on
something that dimly appeared to my Humanities brain. I confess I
didn’t follow the answer but that’s my fault — I was high-fiving
myself for having wondered about that earlier.]

Todd
presented. Organization of knowledge… the printing press was “a
start” at organizing human knowledge. [Well… the codex, the classical
form of the argument, rhyme, abstract thought… but I digress.]

Great–
Todd notes that the audience doesn’t need an introduction to tagging,
but notes the value of communicating the way he frames tagging. We all
have individualistic motivations for tagging, but the result is a
global knowledge map; the research posits that tags create a
lower-dimensional representation of the material the tags describe.

This
is a distributive process with hundreds and thousands of users, and
trying to infer high-level behavior is challenging.  If we sample
therefore don’t know whether we’re sampling the data accurately. Noted
the Measure/Model/Innovate cycle, and indicated roadblocks when you’re
measuring a dataset that you don’t own.

Information theory — how much information do tags tell us about an underlying document.

Entropy
measures the uncertainty in a dataset. A heterogeneous set has maximum
entropy at log n, minimum of 0. You can increase entropy by increasing
the number of things in your dataset; or you can keep n constant and
make data more uniform [did he mean to say less uniform?]
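
A small worked example may help with the question of direction. For Shannon entropy, spreading the probability more evenly over the same n items raises the entropy toward log n, and concentrating it on one item lowers it toward 0. The sketch below uses base-2 logarithms and made-up item lists; both are assumptions, since the slide's log base was not noted.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

uniform = ["a", "b", "c", "d"]                      # four equally likely items
skewed = ["a", "a", "a", "a", "a", "b", "c", "d"]   # same four items, one dominates
single = ["a", "a", "a", "a"]                       # only one distinct item

print(entropy(uniform), entropy(skewed), entropy(single))
```

With n = 4 distinct items the uniform list reaches the maximum of log2(4) = 2 bits, the skewed list falls below that, and the single-item list has zero entropy.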

[I lost track for a while because my blog was having trouble saving…  picking back up]

Overview of the data collected from del.icio.us over about 120 weeks. The number of documents increases rapidly… there have to be many tags for every individual document that comes into the system.

Entropy of the document set is rising… the diversity of the document set is increasing over time. People are creating new content for the system, rather than reinforcing opinions that have already been entered into the system. This is good for del.icio.us in the long term.

Tagging in delicious is both encoding and retrieving. Encoding: users have some notion of document space and future use; how likely are you to use a document in the future? We’re sending a message to ourselves in the future that we’re going to have to retrieve. [How interesting: this concept might be very useful in getting students to think about the value of taking notes in class or annotating their journey through a challenging academic or literary text.]

Around week 80, there’s a dramatic stabilization of the entropy curve. The number of tags is increasing over time. People are becoming more and more likely to set up a tag that’s already in the document space. [What happened around that time to create that change? Did some blogging tool or some content site like Wired add “delicious” tags to its content?]

Tag “efficiency” is decreasing; it crested around week 40.  Tags are becoming less and less significant. The descriptive power of English is limited.
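
These notes do not record how the speaker defined tag “efficiency”. One standard information-theoretic way to ask “how much do tags tell us about a document” is the mutual information I(D;T) = H(D) + H(T) - H(D,T), computed over observed (document, tag) pairs. The sketch below illustrates that quantity under that assumption; the document names and tags are invented.

```python
import math
from collections import Counter

def entropy_of(counts):
    """Shannon entropy (in bits) of a Counter of observations."""
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def mutual_information(pairs):
    """I(D;T) = H(D) + H(T) - H(D,T) over observed (document, tag) pairs."""
    docs = Counter(d for d, _ in pairs)
    tags = Counter(t for _, t in pairs)
    joint = Counter(pairs)
    return entropy_of(docs) + entropy_of(tags) - entropy_of(joint)

# Hypothetical data: every document has its own tag, so the tag pins down the document.
specific = [("doc1", "python"), ("doc2", "cooking"),
            ("doc3", "travel"), ("doc4", "music")]
# Hypothetical data: one overused tag on everything, so the tag says nothing.
generic = [("doc1", "web"), ("doc2", "web"),
           ("doc3", "web"), ("doc4", "web")]

print(mutual_information(specific), mutual_information(generic))
```

In the first list each tag identifies its document, so the tags carry the full 2 bits. In the second list the single shared tag carries no information about which document was tagged, which matches the intuition that heavily reused tags become less significant.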

Do people change their behaviors to re-capture some of that efficiency? People are starting to add more tags to each item in response to the navigation pressure. Zipf’s principle of least effort. If you are just archiving a document, you probably won’t put much effort into the tagging.  Putting the notion of tagging into the framework of information theory.

[My observation… the delicious interface changes over time, so the nature of tagging changes over time.]

Rather than delicious suggesting tags others have used, try asking the user to come up with new tags [as Google does in its picture-tagging game].

[During the Q and A I asked whether the drop at week 80 was related to a change in the del.icio.us interface, but Todd says not. They did notice other phenomena that corresponded with changes in the user interface.]