Hypertext '08: Session 3: Social Linking II: Analysis and Modeling
The Very Small World of the Well-Connected (Long Paper)
Xiaolin Shi, Matthew Bonner, Lada Adamic and Anna Gilbert
Vertex-Importance Graph Synopsis. (Promises a definition!)
Opening image -- "Network or Hairball?" Huge networks are difficult to study and share. To shrink or summarize a network, you create a subgraph of vertices you decide are important. Study these important vertices, and compare their behavior to the rest of the graph.
Degree, betweenness, closeness, PageRank. He's spending time describing in detail all the importance measures. Assortativity vs disassortativity... [The graphs of "betweenness" and "PageRank" look very similar, which I gather is because Google bases its linking algorithm on betweenness.]
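For readers who, like me, need the importance measures made concrete, here is a minimal sketch of two of them (degree and closeness) in plain Python. The toy graph and the function names are my own assumptions for illustration, not anything shown on the slides, and betweenness and PageRank are left out to keep it short.

```python
from collections import deque

def degree_centrality(graph):
    """Degree: the number of neighbours each vertex has."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """Closeness: reachable vertices divided by the sum of shortest-path distances."""
    scores = {}
    for source in graph:
        # Breadth-first search gives shortest hop counts in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return scores

# Hypothetical toy graph: hub "a" with leaves "b" and "c", plus a tail d-e.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d"},
}
degree = degree_centrality(toy_graph)
closeness = closeness_centrality(toy_graph)
```

In this toy graph the hub "a" scores highest on both measures, but the two measures need not agree in general, which is why the choice of measure changes the subgraph you end up with.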
Demonstrated how subgraphs can differ greatly based on which importance factor your subgraph selects for. [To translate into my discipline, I suppose this is a close reading of a small portion of a text, by blocking out all the information you don't want to find.]
Fewer than 10% of the nodes are needed before the subgraph results look very close to the overall dataset.
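As I understand it, the "synopsis" idea amounts to ranking the vertices by some importance score and keeping only the top slice, along with the links among the kept vertices. A rough sketch follows, assuming degree as the importance measure; the function name, the toy graph, and the 20% cutoff are all hypothetical choices of mine, not the presenters' method.

```python
def importance_synopsis(graph, importance, fraction):
    """Return the induced subgraph on the top `fraction` of vertices by importance."""
    keep_count = max(1, int(len(graph) * fraction))
    # Sort by descending importance; break ties by vertex name for repeatability.
    ranked = sorted(graph, key=lambda v: (-importance[v], v))
    kept = set(ranked[:keep_count])
    # Keep only the edges whose two endpoints both survive.
    return {v: graph[v] & kept for v in kept}

# Hypothetical toy graph with two hubs and eight low-degree vertices.
toy_graph = {
    "h1": {"h2", "a", "b", "c", "d"},
    "h2": {"h1", "e", "f", "g"},
    "a": {"h1"}, "b": {"h1"}, "c": {"h1"}, "d": {"h1"},
    "e": {"h2"}, "f": {"h2"},
    "g": {"h2", "i"},
    "i": {"g"},
}
degree = {v: len(nbrs) for v, nbrs in toy_graph.items()}
synopsis = importance_synopsis(toy_graph, degree, fraction=0.2)
```

Swapping in a different importance dictionary (closeness, say) would select a different set of vertices, which is the effect the talk demonstrated.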
This is the third time the presenter has apologized because a dotted red line is invisible in the slides being projected. His oral explanation of what the lines should signify is clear enough, I gather, though he's referring to special kinds of graphs that network researchers (not I) would be expected to understand.
A blog aggregator in his dataset was connecting to many, many nodes, not just the most important nodes, so there's a notable anomaly in degree. [I'm actually interested in this, though he's mentioned it only as a footnote and offered to say more on it during the Q & A if desired.]
["NP-complete: reducible to Steiner..." nope, the slide changed even before I could write this down as an example of a statement that's perfectly straightforward but requires background knowledge that I would need to look up. I missed what "Keep One" and "Keep All" mean.]
[What I'm getting out of this is a mathematical illustration of the thesis that it is possible to gain useful, accurate information from a study of a subset within a dataset, and that other factors (which are over my head) seem to control for whatever bias you introduce when you select a set of nodes for a particular characteristic. Unlike the "close reading" metaphor I used earlier, if you know what you're doing, you can select a "synopsis" that emphasizes the features you wish to study, and identify how closely your subset matches the overall dataset's similar features.]
An Epistemic Dynamic Model for Tagging Systems (Long Paper)
Klaas Dellschaft and Steffen Staab
Klaas presented, beginning with a basic introduction of tagging [probably not necessary for this audience, but it was brief and... and I don't mind an introduction that gives me a minute or so to adjust to a new speaker's cadence, accent, and relationship to the visuals.]
How do users influence each other in tagging systems?
User interface, user brain, background knowledge, and "something else" all influence tagging behavior. We can only observe the tagging behavior [well, we can learn something about the user interface, too, but I think that comment is intended to contrast with the inscrutability of the brain.]
Klaas introduces folksonomy, [again I would assume most in the audience already know the term, but I'm always interested in how individual scholars define their terms... this is necessary in the humanities, since we don't always have empirical evidence to help us define our terms.]
Presented a sample of the coding mechanism, the "co-occurrence stream". Measured how many occurrences, how many users, how many different texts, and how many resources.
After 100 occurrences of a term, you have about 50 texts. The number of distinct texts is growing but the rate of growth decreases... it's a nice logarithmic scale showing that each time the tag appears again, it is more likely that the tag is used on an existing text, rather than a new text.
The frequency of texts is inverted -- The most often used text has a 4-5% probability, text #100 was 0.1% probability. [Did he describe that as "fat tail distribution"? I assume that's the opposite of the "long tail"?] Not quite as even as the tag graph, but
Resource streams [I missed something... does this mean he's graphing the channel by which the document is delivered, or by which the tag entered into his dataset, or the timeline of each entry in his dataset regardless of source or other characteristics? He briefly went through some related research, so obviously this is a common concept that I would know about if I knew the literature. The whole idea of measuring the time between the creation of links is an exciting new concept for me, since it blends well with composition pedagogy that considers writing as process rather than product.]
[Equation that I can't type fast enough] Probability of selecting from background knowledge -- probability of selecting word w for topic t modeled by -- [drat the slide changed.]
[Another equation] probability of repeating a previous tag assignment.
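Since I couldn't copy the equations, here is a deliberately simplified sketch of the general idea as I understood it: each new tag is either drawn from the tagger's background knowledge or copied from an earlier assignment. This is not the authors' model. The parameter names, the vocabulary, and the weights are all assumptions of mine, and the real equations (which I missed) surely differ in the details.

```python
import random

def simulate_tag_stream(vocabulary, weights, p_background, length, seed=0):
    """Toy tag stream: each tag is a fresh draw or a copy of an earlier tag."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    stream = []
    for _ in range(length):
        if not stream or rng.random() < p_background:
            # Draw from "background knowledge" (a weighted vocabulary).
            tag = rng.choices(vocabulary, weights=weights, k=1)[0]
        else:
            # Imitate: repeat a previous assignment chosen uniformly at random.
            tag = rng.choice(stream)
        stream.append(tag)
    return stream

def distinct_growth(stream):
    """Number of distinct tags seen after each position in the stream."""
    seen = set()
    growth = []
    for tag in stream:
        seen.add(tag)
        growth.append(len(seen))
    return growth

# Hypothetical vocabulary with made-up weights.
vocabulary = ["web", "design", "css", "python", "recipes", "travel"]
weights = [6, 5, 4, 3, 2, 1]
stream = simulate_tag_stream(vocabulary, weights, p_background=0.3, length=200)
growth = distinct_growth(stream)
```

Lowering `p_background` makes copying dominate, so tags that appear early tend to keep being repeated; setting it to 0 makes every tag a copy of the first one.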
Now he's finished presenting the model, and he's moving on to compare the observed behavior with the behavior he modeled.
[I'm starting to pick up on the biological metaphors -- mutating of tag assignments, limited number of texts in the pool. I hadn't realized just how deeply the "virus" metaphor is rooted in this subject matter... Does the metaphor drive the science? Are there patterns that don't already have known biological instances that aren't pursued because there isn't already a vocabulary to describe them? I'm going to have to re-read The Selfish Gene for all the parts that I skimmed over when I was reading it simply to understand "meme".]
After more than 1000 tag assignments, the frequency of assignments is more or less stable. There's a drop around tag number 7 in resource streams. Cited http://isweb.uni-koblenz.de/Research/Tagdataset and http://tagora-project.edu
First question, from the conference host, was about whether the user interface affects tagging behavior? [Here I am, thinking that I'm an idiot, and a big-shot followed up on something that dimly appeared to my Humanities brain. I confess I didn't follow the answer but that's my fault -- I was high-fiving myself for having wondered about that earlier.]
Understanding the Efficiency of Social Tagging Systems using Information Theory (Long Paper)
Ed H. Chi and Todd Mytkowicz
Todd presented. Organization of knowledge... the printing press was "a start" at organizing human knowledge. [Well... the codex, the classical form of the argument, rhyme, abstract thought... but I digress.]
Great-- Todd notes that the audience doesn't need an introduction to tagging, but notes the value of communicating the way he frames tagging. We all have individualistic motivations for tagging, but the result is a global knowledge map; the research posits that tags create a lower-dimensional representation of the material the tags describe.
This is a distributive process with hundreds and thousands of users, and trying to infer high-level behavior is challenging. If we sample del.icio.us data, we don't own access to the entire dataset, and therefore don't know whether we're sampling the data accurately. Noted the Measure/Model/Innovate cycle, and indicated roadblocks when you're measuring a dataset that you don't own.
Information theory -- how much information do tags tell us about an underlying document.
Entropy measures the uncertainty in a dataset. A heterogeneous set has maximum entropy at log n, minimum of 0. You can increase entropy by increasing the number of things in your dataset; or you can keep n constant and make data more uniform [did he mean to say less uniform?]
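To check my own question about "more uniform", here is a small sketch of Shannon entropy in Python (natural log, so the maximum for n equally likely items is log n). The sample lists are made up for illustration.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy (in nats) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    # Sum p * log(1/p) over every distinct item.
    return sum((c / total) * math.log(total / c) for c in counts.values())

uniform = ["a", "b", "c", "d"]    # four items, all equally frequent
skewed = ["a", "a", "a", "b"]     # same size, dominated by one item
single = ["a", "a", "a", "a"]     # no uncertainty at all

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)
h_single = shannon_entropy(single)
```

The evenly spread list reaches the maximum, log 4, the skewed list falls below it, and the single-item list gives 0. So "more uniform" does seem to be right, if uniform means the items are spread evenly rather than all alike.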
[I lost track for a while because my blog was having trouble saving... picking back up]
Overview of the data collected from del.icio.us over about 120 weeks. The number of documents increases rapidly... there have to be many tags for every individual document that comes into the system.
Entropy of the document set is rising... the diversity of the document set is increasing over time. People are creating new content for the system, rather than reinforcing opinions that have already been entered into the system. This is good for del.icio.us in the long term.
Tagging in delicious is both encoding and retrieving. Encoding -- users have some notion of document space and future use; how likely are you to use a document in the future? We're sending a message to ourselves in the future that we're going to have to retrieve. [How interesting -- this concept might be very useful in getting students to think about the value of taking notes in class or annotating their journey through a challenging academic or literary text.]
Around week 80, there's a dramatic stabilization of the entropy curve. The number of tags is increasing over time. People are becoming more and more likely to set up a tag that's already in the document space. [What happened around that time to create that change? Did some blogging tool or some content site like Wired add "delicious" tags to its content?]
Tag "efficiency" is decreasing -- crested around week 40. Tags are becoming less and less significant. The descriptive power of English is limited.
Do people change their behaviors to re-capture some of that efficiency? People are starting to add more tags to each item in response to the navigation pressure. Zipf's principle of least effort. If you are just archiving a document, you probably won't put much effort into the tagging. Putting the notion of tagging into the framework of information theory.
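One way to put a number on how much a tag narrows down the documents is conditional entropy, H(D|T): the uncertainty left about which document is meant once you know the tag. Whether this is exactly the quantity plotted in the talk is my assumption; the sketch below only illustrates the concept, and the (document, tag) pairs are invented.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(pairs):
    """H(D | T) in nats, estimated from a list of (document, tag) pairs."""
    tag_counts = Counter(tag for _, tag in pairs)
    docs_by_tag = defaultdict(Counter)
    for doc, tag in pairs:
        docs_by_tag[tag][doc] += 1
    total = len(pairs)
    h = 0.0
    for tag, doc_counts in docs_by_tag.items():
        p_tag = tag_counts[tag] / total
        for count in doc_counts.values():
            p_doc_given_tag = count / tag_counts[tag]
            # Accumulate p(t) * p(d|t) * log(1 / p(d|t)).
            h += p_tag * p_doc_given_tag * math.log(1 / p_doc_given_tag)
    return h

# Hypothetical data: each tag picks out one document...
precise = [("d1", "python"), ("d2", "cooking"), ("d3", "travel")]
# ...versus one overused tag spread across every document.
vague = [("d1", "web"), ("d2", "web"), ("d3", "web")]

h_precise = conditional_entropy(precise)
h_vague = conditional_entropy(vague)
```

The precise tags leave no uncertainty (0), while the overused tag leaves log 3: knowing the tag tells you nothing about which of the three documents is meant. A tag that everyone applies to everything is, in this sense, inefficient.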
[My observation... the delicious interface changes over time, so the nature of tagging changes over time.]
Rather than delicious suggesting tags others have used, try asking the user to come up with new tags [as Google does in its picture-tagging game].
[During the Q and A I asked whether the drop at week 80 was related to a change in the del.icio.us interface, but Todd says not. They did notice other phenomena that corresponded with changes in the user interface.]
