What Google’s book settlement means

From Robert Darnton’s “Google & the Future of Books” (The New York Review of Books: 12 February 2009):

As the Enlightenment faded in the early nineteenth century, professionalization set in. You can follow the process by comparing the Encyclopédie of Diderot, which organized knowledge into an organic whole dominated by the faculty of reason, with its successor from the end of the eighteenth century, the Encyclopédie méthodique, which divided knowledge into fields that we can recognize today: chemistry, physics, history, mathematics, and the rest. In the nineteenth century, those fields turned into professions, certified by Ph.D.s and guarded by professional associations. They metamorphosed into departments of universities, and by the twentieth century they had left their mark on campuses…

Along the way, professional journals sprouted throughout the fields, subfields, and sub-subfields. The learned societies produced them, and the libraries bought them. This system worked well for about a hundred years. Then commercial publishers discovered that they could make a fortune by selling subscriptions to the journals. Once a university library subscribed, the students and professors came to expect an uninterrupted flow of issues. The price could be ratcheted up without causing cancellations, because the libraries paid for the subscriptions and the professors did not. Best of all, the professors provided free or nearly free labor. They wrote the articles, refereed submissions, and served on editorial boards, partly to spread knowledge in the Enlightenment fashion, but mainly to advance their own careers.

The result stands out on the acquisitions budget of every research library: the Journal of Comparative Neurology now costs $25,910 for a year’s subscription; Tetrahedron costs $17,969 (or $39,739, if bundled with related publications as a Tetrahedron package); the average price of a chemistry journal is $3,490; and the ripple effects have damaged intellectual life throughout the world of learning. Owing to the skyrocketing cost of serials, libraries that used to spend 50 percent of their acquisitions budget on monographs now spend 25 percent or less. University presses, which depend on sales to libraries, cannot cover their costs by publishing monographs. And young scholars who depend on publishing to advance their careers are now in danger of perishing.

The eighteenth-century Republic of Letters had been transformed into a professional Republic of Learning, and it is now open to amateurs—amateurs in the best sense of the word, lovers of learning among the general citizenry. Openness is operating everywhere, thanks to “open access” repositories of digitized articles available free of charge, the Open Content Alliance, the Open Knowledge Commons, OpenCourseWare, the Internet Archive, and openly amateur enterprises like Wikipedia. The democratization of knowledge now seems to be at our fingertips. We can make the Enlightenment ideal come to life in reality.

What provoked these jeremianic-utopian reflections? Google. Four years ago, Google began digitizing books from research libraries, providing full-text searching and making books in the public domain available on the Internet at no cost to the viewer. For example, it is now possible for anyone, anywhere to view and download a digital copy of the 1871 first edition of Middlemarch that is in the collection of the Bodleian Library at Oxford. Everyone profited, including Google, which collected revenue from some discreet advertising attached to the service, Google Book Search. Google also digitized an ever-increasing number of library books that were protected by copyright in order to provide search services that displayed small snippets of the text. In September and October 2005, a group of authors and publishers brought a class action suit against Google, alleging violation of copyright. Last October 28, after lengthy negotiations, the opposing parties announced agreement on a settlement, which is subject to approval by the US District Court for the Southern District of New York.[2]

The settlement creates an enterprise known as the Book Rights Registry to represent the interests of the copyright holders. Google will sell access to a gigantic data bank composed primarily of copyrighted, out-of-print books digitized from the research libraries. Colleges, universities, and other organizations will be able to subscribe by paying for an “institutional license” providing access to the data bank. A “public access license” will make this material available to public libraries, where Google will provide free viewing of the digitized books on one computer terminal. And individuals also will be able to access and print out digitized versions of the books by purchasing a “consumer license” from Google, which will cooperate with the registry for the distribution of all the revenue to copyright holders. Google will retain 37 percent, and the registry will distribute 63 percent among the rightsholders.

Meanwhile, Google will continue to make books in the public domain available for users to read, download, and print, free of charge. Of the seven million books that Google reportedly had digitized by November 2008, one million are works in the public domain; one million are in copyright and in print; and five million are in copyright but out of print. It is this last category that will furnish the bulk of the books to be made available through the institutional license.

Many of the in-copyright and in-print books will not be available in the data bank unless the copyright owners opt to include them. They will continue to be sold in the normal fashion as printed books and also could be marketed to individual customers as digitized copies, accessible through the consumer license for downloading and reading, perhaps eventually on e-book readers such as Amazon’s Kindle.

After reading the settlement and letting its terms sink in—no easy task, as it runs to 134 pages and 15 appendices of legalese—one is likely to be dumbfounded: here is a proposal that could result in the world’s largest library. It would, to be sure, be a digital library, but it could dwarf the Library of Congress and all the national libraries of Europe. Moreover, in pursuing the terms of the settlement with the authors and publishers, Google could also become the world’s largest book business—not a chain of stores but an electronic supply service that could out-Amazon Amazon.

An enterprise on such a scale is bound to elicit reactions of the two kinds that I have been discussing: on the one hand, utopian enthusiasm; on the other, jeremiads about the danger of concentrating power to control access to information.

Google is not a guild, and it did not set out to create a monopoly. On the contrary, it has pursued a laudable goal: promoting access to information. But the class action character of the settlement makes Google invulnerable to competition. Most book authors and publishers who own US copyrights are automatically covered by the settlement. They can opt out of it; but whatever they do, no new digitizing enterprise can get off the ground without winning their assent one by one, a practical impossibility, or without becoming mired down in another class action suit. If approved by the court—a process that could take as much as two years—the settlement will give Google control over the digitizing of virtually all books covered by copyright in the United States.

Google alone has the wealth to digitize on a massive scale. And having settled with the authors and publishers, it can exploit its financial power from within a protective legal barrier; for the class action suit covers the entire class of authors and publishers. No new entrepreneurs will be able to digitize books within that fenced-off territory, even if they could afford it, because they would have to fight the copyright battles all over again. If the settlement is upheld by the court, only Google will be protected from copyright liability.

Google’s record suggests that it will not abuse its double-barreled fiscal-legal power. But what will happen if its current leaders sell the company or retire? The public will discover the answer from the prices that the future Google charges, especially the price of the institutional subscription licenses. The settlement leaves Google free to negotiate deals with each of its clients, although it announces two guiding principles: “(1) the realization of revenue at market rates for each Book and license on behalf of the Rightsholders and (2) the realization of broad access to the Books by the public, including institutions of higher education.”

What will happen if Google favors profitability over access? Nothing, if I read the terms of the settlement correctly. Only the registry, acting for the copyright holders, has the power to force a change in the subscription prices charged by Google, and there is no reason to expect the registry to object if the prices are too high. Google may choose to be generous in its pricing, and I have reason to hope it may do so; but it could also employ a strategy comparable to the one that proved to be so effective in pushing up the price of scholarly journals: first, entice subscribers with low initial rates, and then, once they are hooked, ratchet up the rates as high as the traffic will bear.

A fix for Apple Mail’s inability to search Entire Message

When using Apple Mail, you should be able to search for a term in From, To, Subject, & Entire Message. However, today I could no longer search Entire Message. It was grayed out & completely unavailable.

I found interesting info on the following pages, with the last being the most helpful:

  • http://discussions.apple.com/message.jspa?messageID=6653445#6653445
  • http://www.bronzefinger.com/archives/2006/04/apple_mail_sear.html
  • http://discussions.apple.com/message.jspa?messageID=5934412#5934412
  • http://forums.macworld.com/message/425508
  • http://www.macosxhints.com/article.php?story=20080201111317585

I closed Mail and tried this, which re-indexes the entire hard drive in Spotlight:

sudo mdutil -E /

But it did nothing. Then I did this, which re-indexes just the Mail folders in Spotlight:

mdimport ~/Library/Mail

That fixed it.

More on Google’s server farms

From Joel Hruska’s “The Beast unveiled: inside a Google server” (Ars Technica: 2 April 2009):

Each Google server is hooked to an independent 12V battery to keep the units running in the event of a power outage. Data centers themselves are built and housed in shipping containers (we’ve seen Sun pushing this trend as well), a practice that went into effect after the brownouts of 2005. Each container holds a total of 1,160 servers and can theoretically draw up to 250kW. Those numbers might seem a bit high for a data center optimized for energy efficiency—it breaks down to around 216W per system—but there are added cooling costs to be considered in any type of server deployment. These sorts of units were built for parking under trees (or at sea, per Google’s patent application).

By using individual batteries hooked to each server (instead of a UPS), the company is able to use the available energy much more efficiently (99.9 percent efficiency vs. 92-95 percent efficiency for a typical UPS), and the rack-mounted servers are 2U with 8 DIMM slots. Ironically, for a company talking about power efficiency, the server box in question is scarcely a power sipper. The GA-9IVDP is a custom-built motherboard—I couldn’t find any information about it on Gigabyte’s website—but online research and a scan of Gigabyte’s similarly named products imply that this is a Socket 604 dual-Xeon board running dual Nocona (Prescott) P4 processors.
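
To put rough numbers on those claims, here is a quick back-of-the-envelope sketch in Python. The 250 kW and 1,160-server figures are the ones quoted above; the 93.5 percent UPS efficiency is simply my midpoint of the quoted 92-95 percent range, not a Google number.

# Rough arithmetic on the figures quoted above. The UPS efficiency is an
# assumed midpoint of the quoted 92-95 percent range, not a Google figure.

container_draw_w = 250_000          # up to 250 kW per container (quoted)
servers_per_container = 1_160       # servers per container (quoted)

print(f"~{container_draw_w / servers_per_container:.0f} W per server")   # ~216 W

for label, efficiency in [("per-server 12V battery", 0.999),
                          ("typical UPS (assumed midpoint)", 0.935)]:
    wasted_kw = container_draw_w * (1 - efficiency) / 1000
    print(f"{label}: ~{wasted_kw:.1f} kW of conversion loss per container")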

Google’s server farm revealed

From Nicholas Carr’s “Google lifts its skirts” (Rough Type: 2 April 2009):

I was particularly surprised to learn that Google rented all its data-center space until 2005, when it built its first center. That implies that The Dalles, Oregon, plant (shown in the photo above) was the company’s first official data smelter. Each of Google’s containers holds 1,160 servers, and the facility’s original server building had 45 containers, which means that it probably was running a total of around 52,000 servers. Since The Dalles plant has three server buildings, that means – and here I’m drawing a speculative conclusion – that it might be running around 150,000 servers altogether.
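
Carr's arithmetic is easy to reproduce; these are his speculative assumptions, not confirmed Google figures.

# Reproducing Carr's speculative estimate from the figures he cites above.
servers_per_container = 1_160
containers_per_building = 45
server_buildings = 3

per_building = servers_per_container * containers_per_building
print(per_building)                      # 52,200, i.e. "around 52,000 servers"
print(per_building * server_buildings)   # 156,600, i.e. "around 150,000 altogether"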

Here are some more details, from Rich Miller’s report:

The Google facility features a “container hanger” filled with 45 containers, with some housed on a second-story balcony. Each shipping container can hold up to 1,160 servers, and uses 250 kilowatts of power, giving the container a power density of more than 780 watts per square foot. Google’s design allows the containers to operate at a temperature of 81 degrees in the hot aisle. Those specs are seen in some advanced designs today, but were rare indeed in 2005 when the facility was built.

Google’s design focused on “power above, water below,” according to [Jimmy] Clidaras, and the racks are actually suspended from the ceiling of the container. The below-floor cooling is pumped into the hot aisle through a raised floor, passes through the racks and is returned via a plenum behind the racks. The cooling fans are variable speed and tightly managed, allowing the fans to run at the lowest speed required to cool the rack at that moment …

[Urs] Holzle said today that Google opted for containers from the start, beginning its prototype work in 2003. At the time, Google housed all of its servers in third-party data centers. “Once we saw that the commercial data center market was going to dry up, it was a natural step to ask whether we should build one,” said Holzle.

Social software: 5 properties & 3 dynamics

From danah boyd’s “Social Media is Here to Stay… Now What?” at the Microsoft Research Tech Fest, Redmond, Washington (danah: 26 February 2009):

A great deal of sociality is about engaging with publics, but we take for granted certain structural aspects of those publics. Certain properties are core to social media in a combination that alters how people engage with one another. I want to discuss five properties of social media and three dynamics. These are the crux of what makes the phenomena we’re seeing so different from unmediated phenomena.

1. Persistence. What you say sticks around. This is great for asynchronicity, not so great when everything you’ve ever said has gone down on your permanent record. …

2. Replicability. You can copy and paste a conversation from one medium to another, adding to the persistent nature of it. This is great for being able to share information, but it is also at the crux of rumor-spreading. Worse: while you can replicate a conversation, it’s much easier to alter what’s been said than to confirm that it’s an accurate portrayal of the original conversation.

3. Searchability. My mother would’ve loved to scream search into the air and figure out where I’d run off with friends. She couldn’t; I’m quite thankful. But with social media, it’s quite easy to track someone down or to find someone as a result of searching for content. Search changes the landscape, making information available at our fingertips. This is great in some circumstances, but when trying to avoid those who hold power over you, it may be less than ideal.

4. Scalability. Social media scales things in new ways. Conversations that were intended for just a friend or two might spiral out of control and scale to the entire school or, if it is especially embarrassing, the whole world. …

5. (de)locatability. With the mobile, you are dislocated from any particular point in space, but at the same time, location-based technologies make location much more relevant. This paradox means that we are simultaneously more and less connected to physical space.

Those five properties are intertwined, but their implications have to do with the ways in which they alter social dynamics. Let’s look at three different dynamics that have been reconfigured as a result of social media.

1. Invisible Audiences. We are used to being able to assess the people around us when we’re speaking. We adjust what we’re saying to account for the audience. Social media introduces all sorts of invisible audiences. There are lurkers who are present at the moment but whom we cannot see, but there are also visitors who access our content at a later date or in a different environment than where we first produced them. As a result, we are having to present ourselves and communicate without fully understanding the potential or actual audience. The potential invisible audiences can be stifling. Of course, there’s plenty of room to put your head in the sand and pretend like those people don’t really exist.

2. Collapsed Contexts. Connected to this is the collapsing of contexts. In choosing what to say when, we account for both the audience and the context more generally. Some behaviors are appropriate in one context but not another, in front of one audience but not others. Social media brings all of these contexts crashing into one another and it’s often difficult to figure out what’s appropriate, let alone what can be understood.

3. Blurring of Public and Private. Finally, there’s the blurring of public and private. These distinctions are normally structured around audience and context with certain places or conversations being “public” or “private.” These distinctions are much harder to manage when you have to contend with the shifts in how the environment is organized.

All of this means that we’re forced to contend with a society in which things are being truly reconfigured. So what does this mean? As we are already starting to see, this creates all new questions about context and privacy, about our relationship to space and to the people around us.

Kids & adults use social networking sites differently

From danah boyd’s “Social Media is Here to Stay… Now What?” at the Microsoft Research Tech Fest, Redmond, Washington (danah: 26 February 2009):

For American teenagers, social network sites became a social hangout space, not unlike the malls in which I grew up or the dance halls of yesteryears. This was a place to gather with friends from school and church when in-person encounters were not viable. Unlike many adults, teenagers were never really networking. They were socializing in pre-existing groups.

Social network sites became critically important to them because this was where they sat and gossiped, jockeyed for status, and functioned as digital flaneurs. They used these tools to see and be seen. …

Teen conversations may appear completely irrational, or pointless at best. “Yo, wazzup?” “Not much, how you?” may not seem like much to an outsider, but this is a form of social grooming. It’s a way of checking in, confirming friendships, and negotiating social waters.

Adults have approached Facebook in very different ways. Adults are not hanging out on Facebook. They are more likely to respond to status messages than start a conversation on someone’s wall (unless it’s their birthday of course). Adults aren’t really decorating their profiles or making sure that their About Me’s are up-to-date. Adults, far more than teens, are using Facebook for its intended purpose as a social utility. For example, it is a tool for communicating with the past.

Adults may giggle about having run-ins with mates from high school, but underneath it all, many of them are curious. This isn’t that different than the school reunion. … Teenagers craft quizzes for themselves and their friends. Adults are crafting them to show-off to people from the past and connect the dots between different audiences as a way of coping with the awkwardness of collapsed contexts.

The importance of network effects to social software

From danah boyd’s “Social Media is Here to Stay… Now What?” at the Microsoft Research Tech Fest, Redmond, Washington (danah: 26 February 2009):

Many who build technology think that a technology’s feature set is the key to its adoption and popularity. With social media, this is often not the case. There are triggers that drive early adopters to a site, but the single most important factor in determining whether or not a person will adopt one of these sites is whether or not it is the place where their friends hang out. In each of these cases, network effects played a significant role in the spread and adoption of the site.

The uptake of social media is quite different than the uptake of non-social technologies. For the most part, you don’t need your friends to use Word to find the tool useful. You do need your friends to use email for it to be useful, but, thanks to properties of that medium, you don’t need them to be using Outlook or Hotmail to write to them. Many of the new genres of social media are walled gardens, requiring your friends to use that exact site to be valuable. This has its advantages for the companies who build it – that’s the whole attitude behind lock-in. But it also has its costs. Consider for example the fact that working class and upper class kids can’t talk to one another if they are on different SNSs.

Friendster didn’t understand network effects. In kicking off users who weren’t conforming to their standards, they pissed off more than those users; they pissed off those users’ friends who were left with little purpose to use the site. The popularity of Friendster unraveled as fast as it picked up, but the company never realized what hit them. All of their metrics were based on number of users. While only a few users deleted their accounts, the impact of those lost accounts was huge. The friends of those who departed slowly stopped using the site. At first, they went from logging in every hour to logging in every day, never affecting the metrics. But as nothing new came in and as the collective interest waned, their attention went elsewhere. Today, Friendster is succeeding because of its popularity in other countries, but in the US, it’s a graveyard of hipsters stuck in 2003.

Defining social media, social software, & Web 2.0

From danah boyd’s “Social Media is Here to Stay… Now What?” at the Microsoft Research Tech Fest, Redmond, Washington (danah: 26 February 2009):

Social media is the latest buzzword in a long line of buzzwords. It is often used to describe the collection of software that enables individuals and communities to gather, communicate, share, and in some cases collaborate or play. In tech circles, social media has replaced the earlier fave “social software.” Academics still tend to prefer terms like “computer-mediated communication” or “computer-supported cooperative work” to describe the practices that emerge from these tools and the old skool academics might even categorize these tools as “groupwork” tools. Social media is driven by another buzzword: “user-generated content” or content that is contributed by participants rather than editors.

… These tools are part of a broader notion of “Web2.0.” Yet-another-buzzword, Web2.0 means different things to different people.

For the technology crowd, Web2.0 was about a shift in development and deployment. Rather than producing a product, testing it, and shipping it to be consumed by an audience that was disconnected from the developer, Web2.0 was about the perpetual beta. This concept makes all of us giggle, but what this means is that, for technologists, Web2.0 was about constantly iterating the technology as people interacted with it and learning from what they were doing. To make this happen, we saw the rise of technologies that supported real-time interactions, user-generated content, remixing and mashups, APIs and open-source software that allowed mass collaboration in the development cycle. …

For the business crowd, Web2.0 can be understood as hope. Web2.0 emerged out of the ashes of the fallen tech bubble and bust. Scars ran deep throughout Silicon Valley and venture capitalists and entrepreneurs wanted to party like it was 1999. Web2.0 brought energy to this forlorn crowd. At first they were skeptical, but slowly they bought in. As a result, we’ve seen a resurgence of startups, venture capitalists, and conferences. At this point, Web2.0 is sometimes referred to as Bubble2.0, but there’s something to say about “hope” even when the VCs start co-opting that term because they want four more years.

For users, Web2.0 was all about reorganizing web-based practices around Friends. For many users, direct communication tools like email and IM were used to communicate with one’s closest and dearest while online communities were tools for connecting with strangers around shared interests. Web2.0 reworked all of that by allowing users to connect in new ways. While many of the tools may have been designed to help people find others, what Web2.0 showed was that people really wanted a way to connect with those that they already knew in new ways. Even tools like MySpace and Facebook which are typically labeled social networkING sites were never really about networking for most users. They were about socializing inside of pre-existing networks.

A single medium, with a single search engine, & a single info source

From Nicholas Carr’s “All hail the information triumvirate!” (Rough Type: 22 January 2009):

Today, another year having passed, I did the searches [on Google] again. And guess what:

World War II: #1
Israel: #1
George Washington: #1
Genome: #1
Agriculture: #1
Herman Melville: #1
Internet: #1
Magna Carta: #1
Evolution: #1
Epilepsy: #1

Yes, it’s a clean sweep for Wikipedia.

The first thing to be said is: Congratulations, Wikipedians. You rule. Seriously, it’s a remarkable achievement. Who would have thought that a rag-tag band of anonymous volunteers could achieve what amounts to hegemony over the results of the most popular search engine, at least when it comes to searches for common topics.

The next thing to be said is: what we seem to have here is evidence of a fundamental failure of the Web as an information-delivery service. Three things have happened, in a blink of history’s eye: (1) a single medium, the Web, has come to dominate the storage and supply of information, (2) a single search engine, Google, has come to dominate the navigation of that medium, and (3) a single information source, Wikipedia, has come to dominate the results served up by that search engine. Even if you adore the Web, Google, and Wikipedia – and I admit there’s much to adore – you have to wonder if the transformation of the Net from a radically heterogeneous information source to a radically homogeneous one is a good thing. Is culture best served by an information triumvirate?

It’s hard to imagine that Wikipedia articles are actually the very best source of information for all of the many thousands of topics on which they now appear as the top Google search result. What’s much more likely is that the Web, through its links, and Google, through its search algorithms, have inadvertently set into motion a very strong feedback loop that amplifies popularity and, in the end, leads us all, lemminglike, down the same well-trod path – the path of least resistance. You might call this the triumph of the wisdom of the crowd. I would suggest that it would be more accurately described as the triumph of the wisdom of the mob. The former sounds benign; the latter, less so.
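
Carr's "feedback loop" is essentially a rich-get-richer process, and a toy simulation shows how quickly a few early leaders can capture most of the links. This is purely illustrative; nothing here models Google's actual algorithms or real web data.

# Toy rich-get-richer simulation (illustrative only). Each new link goes to
# a site with probability proportional to the links it already has, with an
# occasional random pick standing in for genuine discovery.
import random

random.seed(1)
links = [1] * 50                    # 50 sites, one starting link each

for _ in range(10_000):
    if random.random() < 0.1:       # occasional fresh discovery
        winner = random.randrange(len(links))
    else:                           # otherwise, popularity attracts the link
        winner = random.choices(range(len(links)), weights=links)[0]
    links[winner] += 1

print(sorted(links, reverse=True)[:5], "links at the top five sites")
print(sum(links), "links handed out in total")
# A handful of early leaders end up with a large share of all links, even
# though no site in this toy model is intrinsically better than any other.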

Social networking and “friendship”

From danah boyd’s “Friends, Friendsters, and MySpace Top 8: Writing Community Into Being on Social Network Sites” (First Monday: December 2006):

John’s reference to “gateway Friends” concerns a specific technological affordance unique to Friendster. Because the company felt it would make the site more intimate, Friendster limits users from surfing to Profiles beyond four degrees (Friends of Friends of Friends of Friends). When people login, they can see how many Profiles are “in their network” where the network is defined by the four degrees. For users seeking to meet new people, growing this number matters. For those who wanted it to be intimate, keeping the number smaller was more important. In either case, the number of people in one’s network was perceived as directly related to the number of friends one had.

“I am happy with the number of friends I have. I can access over 26,000 profiles, which is enough for me!” — Abby

The number of Friends one has definitely affects the size of one’s network but connecting to Collectors plays a much more significant role. Because these “gateway friends” (a.k.a. social network hubs) have lots of Friends who are not connected to each other, they expand the network pretty rapidly. Thus, connecting to Collectors or connecting to people who connect to Collectors opens you up to a large network rather quickly.
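
As a rough illustration of why Collectors matter, here is a toy calculation of how many Profiles sit within four degrees when each hop adds about b new people. This is my own crude tree model, not Friendster's math; real Friend networks overlap heavily.

# Toy estimate: Profiles reachable within four degrees if each person
# contributes roughly b new contacts per hop (crude tree model, no overlap).

def reach(branching: float, degrees: int = 4) -> int:
    total, frontier = 0.0, 1.0
    for _ in range(degrees):
        frontier *= branching
        total += frontier
    return round(total)

for b in (10, 13, 20, 50):
    print(f"b = {b:>2}: about {reach(b):,} profiles within four degrees")
# b = 13 gives roughly 30,000, in the ballpark of Abby's 26,000; linking to a
# single Collector with hundreds of Friends inflates the count dramatically.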

While Collectors could be anyone interested in amassing many Friends, fake Profiles were developed to aid in this process. These Fakesters included characters, celebrities, objects, icons, institutions, and ideas. For example, Homer Simpson had a Profile alongside Jesus and Brown University. By connecting people with shared interests or affiliations, Fakesters supported networking between like-minded individuals. Because play and connecting were primary incentives for many Fakesters, they welcomed any and all Friends. Likewise, people who wanted access to more people connected to Fakesters. Fakesters helped centralize the network and two Fakesters — Burning Man and Ali G — reached mass popularity with over 10,000 Friends each before the Web site’s creators put an end to their collecting and deleted both accounts. This began the deletion of all Fakesters in what was eventually termed the Fakester Genocide [8].

While Friendster was irritated by fake Profiles, MySpace embraced this practice. One of MySpace’s early strategies was to provide a place for everyone who was rejected from Friendster or who didn’t want to be on a dating site [9]. Bands who had been kicked off of Friendster were some of the earliest MySpace users. Over time, movie stars, politicians, porn divas, comedians, and other celebrities joined the fray. Often, the person behind these Profiles was not the celebrity but a manager. Corporations began creating Profiles for their products and brands. While Friendster eventually began allowing such fake Profiles for a fee, MySpace never charged people for their commercial uses.

Investigating Friendship in LiveJournal, Kate Raynes-Goldie and Fono (2005) found that there was tremendous inconsistency in why people Friended others. They primarily found that Friendship stood for: content, offline facilitator, online community, trust, courtesy, declaration, or nothing. When I asked participants about their practices on Friendster and MySpace, I found very similar incentives. The most common reasons for Friendship that I heard from users [11] were:

1. Actual friends
2. Acquaintances, family members, colleagues
3. It would be socially inappropriate to say no because you know them
4. Having lots of Friends makes you look popular
5. It’s a way of indicating that you are a fan (of that person, band, product, etc.)
6. Your list of Friends reveals who you are
7. Their Profile is cool so being Friends makes you look cool
8. Collecting Friends lets you see more people (Friendster)
9. It’s the only way to see a private Profile (MySpace)
10. Being Friends lets you see someone’s bulletins and their Friends-only blog posts (MySpace)
11. You want them to see your bulletins, private Profile, private blog (MySpace)
12. You can use your Friends list to find someone later
13. It’s easier to say yes than no

These incentives account for a variety of different connections. While the first three reasons all concern people that you know, the rest can explain why people connect to a lot of people that they do not know. Most reveal how technical affordances affect people’s incentives to connect.

Raynes-Goldie and Fono (2005) also found that there is a great deal of social anxiety and drama provoked by Friending in LiveJournal (LJ). In LJ, Friendship does not require reciprocity. Anyone can list anyone else as a Friend; this articulation is public but there is no notification. The value of Friendship on LJ is deeply connected to the privacy settings and subscription processes. The norm on LJ is to read others’ entries through a “Friends page.” This page is an aggregation of all of an individual’s Friends’ posts. When someone posts an LJ entry, they have a choice as to whether the post should be public, private, Friends-only, or available to subgroups of Friends. In this way, it is necessary to be someone’s Friend to have access to Friends-only posts. To locate how the multiple and conflicting views of Friendship cause tremendous conflict and misunderstanding on LJ, Raynes-Goldie and Fono speak of “hyperfriending.” This process is quite similar to what takes place on other social network sites, but there are some differences. Because Friends-only posts are commonplace, not being someone’s Friend is a huge limitation to information access. Furthermore, because reciprocity is not structurally required, there’s a much greater social weight to recognizing someone’s Friendship and reciprocating intentionally. On MySpace and Friendster, there is little to lose by being loose with Friendship and more to gain; the perception is that there is much more to lose on LJ.
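
To make the asymmetry concrete, here is a minimal sketch of the access rule described above; it is my own illustration, not LiveJournal's actual code, and it leaves out subgroup filters. Friendship is one-directional, and it is the author's Friend list, not reciprocity, that decides who can read a Friends-only post.

# Minimal sketch (not LiveJournal code) of one-directional Friendship:
# visibility of a post is decided by the author's own Friend list.
from dataclasses import dataclass, field

@dataclass
class Account:
    name: str
    friends: set = field(default_factory=set)   # accounts this user lists

def can_read(author: Account, viewer: Account, visibility: str) -> bool:
    if visibility == "public":
        return True
    if visibility == "private":
        return viewer.name == author.name
    if visibility == "friends-only":
        return viewer.name == author.name or viewer.name in author.friends
    return False   # subgroup filters omitted from this sketch

alice, bob = Account("alice"), Account("bob")
bob.friends.add("alice")       # Bob lists Alice; Alice never reciprocates

print(can_read(bob, alice, "friends-only"))   # True: Alice can read Bob's posts
print(can_read(alice, bob, "friends-only"))   # False: Bob cannot read Alice's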

While users can scroll through their list of Friends, not all Friends are displayed on the participant’s Profile. Most social network sites display Friends in the order in which their account was created or their last login date. By implementing a “Top 8” feature, MySpace changed the social dynamics around the ordering of Friends. Initially, “Top 8” allowed users to select eight Friends to display on their Profile. More recently, that feature was changed to “Top Friends” as users have more options in how many people they could list [12]. Many users will only list people that they know and celebrities that they admire in their Top Friends, often as a way to both demarcate their identity and signal meaningful relationships with others.

There are many advantages to the Top Friends feature. It allows people to show connections that really say something about who they are. It also serves as a bookmark to the people that matter. By choosing to list the people who one visits the most frequently, simply going to one’s Profile provides a set of valuable links.

“As a kid, you used your birthday party guest list as leverage on the playground. ‘If you let me play I’ll invite you to my birthday party.’ Then, as you grew up and got your own phone, it was all about someone being on your speed dial. Well today it’s the MySpace Top 8. It’s the new dangling carrot for gaining superficial acceptance. Taking someone off your Top 8 is your new passive aggressive power play when someone pisses you off.” — Nadine

There are a handful of social norms that pervade Top 8 culture. Often, the person in the upper left (“1st” position) is a significant other, dear friend, or close family member. Reciprocity is another salient component of Top Friends dynamics. If Susan lists Mary on her Top 8, she expects Mary to reciprocate. To acknowledge this, Mary adds a Comment to Susan’s page saying, “Thanx for puttin me on ur Top 8! I put you on mine 2.” By publicly acknowledging this addition, Mary is making certain Susan’s viewers recognize Mary’s status on Susan’s list. Of course, just being in someone’s list is not always enough. As Samantha explains, “Friends get into fights because they’re not 1st on someone’s Top 8, or somebody else is before them.” While some people are ecstatic to be added, there are many more that are frustrated because they are removed or simply not listed.

The Top Friends feature requires participants to actively signal their relationship with others. Such a system makes it difficult to be vague about who matters the most, although some tried by explaining on their bulletins what theme they are using to choose their Top 8 this week: “my Sagittarius friends,” “my basketball team,” and “people whose initials are BR.” Still others relied on fake Profiles for their Top 8.

The networked nature of impressions does not only affect the viewer — this is how newcomers decided what to present in the first place. When people first joined Friendster, they took cues from the people who invited them. Three specific subcultures dominated the early adopters — bloggers, attendees of the Burning Man [14] festival, and gay men mostly living in New York. If the invitee was a Burner, their Profile would probably be filled with references to the event with images full of half-naked, costumed people running around the desert. As such, newcomers would get the impression that it was a site for Burners and they would create a Profile that displayed that facet of their identity. In deciding who to invite, newcomers would perpetuate the framing by only inviting people who were part of the Burning Man subculture.

Interestingly, because of this process, Burners believed that the site was for Burners, gay men thought it was a gay dating site, and bloggers were ecstatic to have a geek socializing tool. The reason each group got this impression had to do with the way in which context was created on these systems. Rather than having the context dictated by the environment itself, context emerged through Friends networks. As a result, being socialized into Friendster meant connecting to Friends that reinforced the contextual information of early adopters.

The growth of MySpace followed a similar curve. One of the key early adopter groups were hipsters living in the Silverlake neighborhood of Los Angeles. They were passionate about indie rock music and many were musicians, promoters, club goers, etc. As MySpace took hold, long before any press was covering the site, MySpace took off amongst 20/30-something urban socializers, musicians, and teenagers. The latter group may not appear obvious, but teenagers are some of the most active music consumers — they follow music culture avidly, even when they are unable to see the bands play live due to age restrictions. As the site grew, the teenagers and 20/30-somethings pretty much left each other alone, although bands bridged these groups. It was not until the site was sold to News Corp. for US$580 million in the summer of 2005 that the press began covering the phenomenon. The massive press helped it grow larger, penetrating those three demographics more deeply but also attracting new populations, namely adults who are interested in teenagers (parents, teachers, pedophiles, marketers).

When context is defined by whom one Friends, and addressing multiple audiences simultaneously complicates all relationships, people must make hard choices. Joshua Meyrowitz (1985) highlights this problem in reference to television. In the early 1960s, Stokely Carmichael regularly addressed segregated black and white audiences about the values of Black Power. Depending on his audience, he used very different rhetorical styles. As his popularity grew, he began to attract media attention and was invited to speak on TV and radio. Unfortunately, this was more of a curse than a blessing because the audiences he would reach through these mediums included both black and white communities. With no way to reconcile the two different rhetorical styles, he had to choose. In choosing to maintain his roots in front of white listeners, Carmichael permanently alienated white society from the messages of Black Power.

Notes

10. Friendster originally limited users to 150 Friends. It is no accident that they chose 150, as this is the “Dunbar number.” In his research on gossip and grooming, Robin Dunbar argues that there is a cognitive limit to the number of relations that one can maintain. People can only keep gossip with 150 people at any given time (Dunbar, 1998). By capping Friends at 150, Friendster either misunderstood Dunbar or did not realize that their users were actually connecting to friends from the past with whom they are not currently engaging.

12. Eight was the maximum number of Friends that the system initially let people have. Some users figured out how to hack the system to display more Friends; there are entire bulletin boards dedicated to teaching others how to hack this. Consistently, upping the limit was the number one request that the company received. In the spring of 2006, MySpace launched an ad campaign for X-Men. In return for Friending X-Men, users were given the option to have 12, 16, 20, or 24 Friends in their Top Friends section. Millions of users did exactly that. In late June, this feature was introduced to everyone, regardless of Friending X-Men. While eight is no longer the limit, people move between calling it Top 8 or Top Friends. I will use both terms interchangeably, even when the number of Friends might be greater than eight.

An analysis of Google’s technology, 2005

From Stephen E. Arnold’s The Google Legacy: How Google’s Internet Search is Transforming Application Software (Infonortics: September 2005):

The figure Google’s Fusion: Hardware and Software Engineering shows that Google’s technology framework has two areas of activity. There is the software engineering effort that focuses on PageRank and other applications. Software engineering, as used here, means writing code and thinking about how computer systems operate in order to get work done quickly. Quickly means the sub one-second response times that Google is able to maintain despite its surging growth in usage, applications and data processing.

Google is hardware plus software

The other effort focuses on hardware. Google has refined server racks, cable placement, cooling devices, and data center layout. The payoff is lower operating costs and the ability to scale as demand for computing resources increases. With faster turnaround and the elimination of such troublesome jobs as backing up data, Google’s hardware innovations give it a competitive advantage few of its rivals can equal as of mid-2005.

How Google Is Different from MSN and Yahoo

Google’s technology is simultaneously just like other online companies’ technology, and very different. A data center is usually a facility owned and operated by a third party where customers place their servers. The staff of the data center manage the power, air conditioning and routine maintenance. The customer specifies the computers and components. When a data center must expand, the staff of the facility may handle virtually all routine chores and may work with the customer’s engineers for certain more specialized tasks.

Before looking at some significant engineering differences between Google and two of its major competitors, review this list of characteristics for a Google data center.

1. Google data centers – now numbering about two dozen, although no one outside Google knows the exact number or their locations. They come online and, under the direction of the Google File System, automatically start getting work from other data centers. These facilities, sometimes filled with 10,000 or more Google computers, find one another and configure themselves with minimal human intervention.

2. The hardware in a Google data center can be bought at a local computer store. Google uses the same types of memory, disc drives, fans and power supplies as those in a standard desktop PC.

3. Each Google server comes in a standard case called a pizza box with one important change: the plugs and ports are at the front of the box to make access faster and easier.

4. Google racks are assembled for Google to hold servers on their front and back sides. This effectively allows a standard rack, normally holding 40 pizza box servers, to hold 80.

5. A Google data center can go from a stack of parts to online operation in as little as 72 hours, unlike more typical data centers that can require a week or even a month to get additional resources online.

6. Each server, rack and data center works in a way that is similar to what is called “plug and play.” Like a mouse plugged into the USB port on a laptop, Google’s network of data centers knows when more resources have been connected. These resources, for the most part, go into operation without human intervention.

Several of these factors are dependent on software. This overlap between the hardware and software competencies at Google, as previously noted, illustrates the symbiotic relationship between these two different engineering approaches. At Google, from its inception, Google software and Google hardware have been tightly coupled. Google is not a software company nor is it a hardware company. Google is, like IBM, a company that owes its existence to both hardware and software. Unlike IBM, Google has a business model that is advertiser supported. Technically, Google is conceptually closer to IBM (at one time a hardware and software company) than it is to Microsoft (primarily a software company) or Yahoo! (an integrator of multiple softwares).

Software and hardware engineering cannot be easily segregated at Google. At MSN and Yahoo hardware and software are more loosely-coupled. Two examples will illustrate these differences.

Microsoft – with some minor excursions into the Xbox game machine and peripherals – develops operating systems and traditional applications. Microsoft has multiple operating systems, and its engineers are hard at work on the company’s next-generation of operating systems.

Several observations are warranted:

1. Unlike Google, Microsoft does not focus on performance as an end in itself. As a result, Microsoft gets performance the way most computer users do. Microsoft buys or upgrades machines. Microsoft does not fiddle with its operating systems and their subfunctions to get that extra time slice or two out of the hardware.

2. Unlike Google, Microsoft has to support many operating systems and invest time and energy in making certain that important legacy applications such as Microsoft Office or SQLServer can run on these new operating systems. Microsoft has a boat anchor tied to its engineers’ ankles. The boat anchor is the need to ensure that legacy code works in Microsoft’s latest and greatest operating systems.

3. Unlike Google, Microsoft has no significant track record in designing and building hardware for distributed, massively parallelised computing. The mice and keyboards were a success. Microsoft has continued to lose money on the Xbox, and the sudden demise of Microsoft’s entry into the home network hardware market provides more evidence that Microsoft does not have a hardware competency equal to Google’s.

Yahoo! operates differently from both Google and Microsoft. Yahoo! is in mid-2005 a direct competitor to Google for advertising dollars. Yahoo! has grown through acquisitions. In search, for example, Yahoo acquired 3721.com to handle Chinese language search and retrieval. Yahoo bought Inktomi to provide Web search. Yahoo bought Stata Labs in order to provide users with search and retrieval of their Yahoo! mail. Yahoo! also owns AllTheWeb.com, a Web search site created by FAST Search & Transfer. Yahoo! owns the Overture search technology used by advertisers to locate key words to bid on. Yahoo! owns Alta Vista, the Web search system developed by Digital Equipment Corp. Yahoo! licenses InQuira search for customer support functions. Yahoo has a jumble of search technology; Google has one search technology.

Historically Yahoo has acquired technology companies and allowed each company to operate its technology in a silo. Integration of these different technologies is a time-consuming, expensive activity for Yahoo. Each of these software applications requires servers and systems particular to each technology. The result is that Yahoo has a mosaic of operating systems, hardware and systems. Yahoo!’s problem is different from Microsoft’s legacy boat-anchor problem. Yahoo! faces a Balkan-states problem.

There are many voices, many needs, and many opposing interests. Yahoo! must invest in management resources to keep the peace. Yahoo! does not have a core competency in hardware engineering for performance and consistency. Yahoo! may well have considerable competency in supporting a crazy-quilt of hardware and operating systems, however. Yahoo! is not a software engineering company. Its engineers make functions from disparate systems available via a portal.

The figure below provides an overview of the mid-2005 technical orientation of Google, Microsoft and Yahoo.

[Figure: 2005 focuses of Google, MSN, and Yahoo]

The Technology Precepts

… five precepts thread through Google’s technical papers and presentations. The following snapshots are extreme simplifications of complex, yet extremely fundamental, aspects of the Googleplex.

Cheap Hardware and Smart Software

Google approaches the problem of reducing the costs of hardware, set up, burn-in and maintenance pragmatically. A large number of cheap devices using off-the-shelf commodity controllers, cables and memory reduces costs. But cheap hardware fails.

In order to minimize the “cost” of failure, Google conceived of smart software that would perform whatever tasks were needed when hardware devices fail. A single device or an entire rack of devices could crash, and the overall system would not fail. More important, when such a crash occurs, no full-time systems engineering team has to perform technical triage at 3 a.m.

The focus on low-cost, commodity hardware and smart software is part of the Google culture.

Logical Architecture

Google’s technical papers do not describe the architecture of the Googleplex as self-similar. Google’s technical papers provide tantalizing glimpses of an approach to online systems that makes a single server share features and functions of a cluster of servers, a complete data center, and a group of Google’s data centers.

The collection of servers running Google applications on the Google version of Linux is a supercomputer. The Googleplex can perform mundane computing chores like taking a user’s query and matching it to documents Google has indexed. Furthermore, the Googleplex can perform side calculations needed to embed ads in the results pages shown to users, execute parallelized, high-speed data transfers like computers running state-of-the-art storage devices, and handle necessary housekeeping chores for usage tracking and billing.

When Google needs to add processing capacity or additional storage, Google’s engineers plug in the needed resources. Due to self-similarity, the Googleplex can recognize, configure and use the new resource. Google has an almost unlimited flexibility with regard to scaling and accessing the capabilities of the Googleplex.

In Google’s self-similar architecture, the loss of an individual device is irrelevant. In fact, a rack or a data center can fail without data loss or taking the Googleplex down. The Google operating system ensures that each file is written three to six times to different storage devices. When a copy of that file is not available, the Googleplex consults a log for the location of the copies of the needed file. The application then uses that replica of the needed file and continues with the job’s processing.
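
Here is a minimal sketch of that behavior: write several replicas, log where they went, and fall back to another copy when a device disappears. It is my own illustration of the idea, not Google File System code, and it uses three replicas, the low end of the "three to six" quoted above.

# Minimal sketch (not GFS code) of replicated writes with fallback reads.
import random

REPLICAS_PER_FILE = 3
devices = {f"disk-{i}": {} for i in range(10)}     # device name -> {path: data}
replica_log = {}                                   # path -> devices holding copies

def write(path: str, data: bytes) -> None:
    chosen = random.sample(list(devices), REPLICAS_PER_FILE)
    for name in chosen:
        devices[name][path] = data
    replica_log[path] = chosen                     # remember where the copies live

def read(path: str) -> bytes:
    for name in replica_log[path]:                 # consult the log of copies
        if name in devices and path in devices[name]:
            return devices[name][path]
    raise IOError(f"all replicas of {path} lost")

write("/crawl/chunk-0001", b"page data")
del devices[replica_log["/crawl/chunk-0001"][0]]   # a disk (or a rack) dies
print(read("/crawl/chunk-0001"))                   # the job continues from a replica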

Speed and Then More Speed

Google uses commodity pizza box servers organized in a cluster. A cluster is a group of computers that are joined together to create a more robust system. Instead of using exotic servers with eight or more processors, Google generally uses servers that have two processors similar to those found in a typical home computer.

Through proprietary changes to Linux and other engineering innovations, Google is able to achieve supercomputer performance from components that are cheap and widely available.

… engineers familiar with Google believe that read rates may in some clusters approach 2,000 megabytes a second. When commodity hardware gets better, Google runs faster without paying a premium for that performance gain.

Another key notion of speed at Google concerns writing computer programs to deploy to Google users. Google has developed short cuts to programming. An example is Google’s creating a library of canned functions to make it easy for a programmer to optimize a program to run on the Googleplex computer. At Microsoft or Yahoo, a programmer must write some code or fiddle with code to get different pieces of a program to execute simultaneously using multiple processors. Not at Google. A programmer writes a program, uses a function from a Google bundle of canned routines, and lets the Googleplex handle the details. Google’s programmers are freed from much of the tedium associated with writing software for a distributed, parallel computer.
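
The "canned routines" Arnold describes are the same idea behind MapReduce-style libraries. As a stand-in, here is a sketch using Python's standard multiprocessing pool; it is not Google's internal library, just the shape of the programming model: the programmer writes the per-item logic and lets the canned routine spread the work across processors.

# Sketch of the programming model described above, with multiprocessing.Pool
# standing in for Google's internal "canned" parallel routines.
from multiprocessing import Pool

def count_words(document: str) -> int:
    # The programmer writes only the per-document logic...
    return len(document.split())

if __name__ == "__main__":
    documents = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    with Pool() as pool:
        # ...and the canned routine handles the parallelism.
        counts = pool.map(count_words, documents)
    print(sum(counts))   # 9,000 words counted across all worker processes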

Eliminate or Reduce Certain System Expenses

Some lucky investors jumped on the Google bandwagon early. Nevertheless, Google was frugal, partly by necessity and partly by design. The focus on frugality influenced many hardware and software engineering decisions at the company.

Drawbacks of the Googleplex

The Laws of Physics: Heat and Power 101

In reality, no one knows. Google has a rapidly expanding number of data centers. The data center near Atlanta, Georgia, is one of the newest deployed. This state-of-the-art facility reflects what Google engineers have learned about heat and power issues in its other data centers. Within the last 12 months, Google has shifted from concentrating its servers at about a dozen data centers, each with 10,000 or more servers, to about 60 data centers, each with fewer machines. The change is a response to the heat and power issues associated with larger concentrations of Google servers.

The most failure prone components are:

  • Fans.
  • IDE drives which fail at the rate of one per 1,000 drives per day.
  • Power supplies which fail at a lower rate.
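
That drive figure is worth working out: one failure per 1,000 drives per day compounds to a striking annual rate. The arithmetic below is mine and assumes each day is independent.

# Implied annual failure rate for the quoted figure of one failure per
# 1,000 drives per day, assuming independent days.
daily_rate = 1 / 1000
annual_failure = 1 - (1 - daily_rate) ** 365
print(f"about {annual_failure:.0%} of drives fail within a year")   # roughly 31%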

Leveraging the Googleplex

Google’s technology is one major challenge to Microsoft and Yahoo. So to conclude this cursory and vastly simplified look at Google technology, consider these items:

1. Google is fast anywhere in the world.

2. Google learns. When the heat and power problems at dense data centers surfaced, Google introduced cooling and power conservation innovations to its two dozen data centers.

3. Programmers want to work at Google. “Google has cachet,” said one recent University of Washington graduate.

4. Google’s operating and scaling costs are lower than most other firms offering similar businesses.

5. Google squeezes more work out of programmers and engineers by design.

6. Google does not break down, or at least it has not gone offline since 2000.

7. Google’s Googleplex can deliver desktop-server applications now.

8. Google’s applications install and update without burdening the user with gory details and messy crashes.

9. Google’s patents provide basic technology insight pertinent to Google’s core functionality.

An analysis of splogs: spam blogs

From Charles C. Mann’s “Spam + Blogs = Trouble” (Wired: September 2006):

Some 56 percent of active English-language blogs are spam, according to a study released in May by Tim Finin, a researcher at the University of Maryland, Baltimore County, and two of his students. “The blogosphere is growing fast,” Finin says. “But the splogosphere is now growing faster.”

A recent survey by Mitesh Vasa, a Virginia-based software engineer and splog researcher, found that in December 2005, Blogger was hosting more than 100,000 sploggers. (Many of these are likely pseudonyms for the same people.)

Some Title, the splog that commandeered my name, was created by Dan Goggins, the proud possessor of a 2005 master’s degree in computer science from Brigham Young University. Working out of his home in a leafy subdivision in Springville, Utah, Goggins, his BYU friend and partner, John Jonas, and their handful of employees operate “a few thousand” splogs. “It’s not that many,” Goggins says modestly. “Some people have a lot of sites.” Trolling the Net, I came across a PowerPoint presentation for a kind of spammers’ conference that details some of the earnings of the Goggins-Jonas partnership. Between August and October of 2005, they made at least $71,136.89.

In addition to creating massive numbers of phony blogs, sploggers sometimes take over abandoned real blogs. More than 10 million of the 12.9 million profiles on Blogger surveyed by splog researcher Vasa in June were inactive, either because the bloggers had stopped blogging or because they never got started.

Not only do sploggers create fake blogs or take over abandoned ones, they use robo-software to flood real blogs with bogus comments that link back to the splog. (“Great post! For more on this subject, click here!”) Statistics compiled by Akismet, a system put together by WordPress developer Mullenweg that tries to filter out blog spam, suggest that more than nine out of 10 comments in the blogosphere are spam.

Maryland researcher Finin and his students found that splogs produce about three-quarters of the pings from English-language blogs. Another way of saying this is that the legitimate blogosphere generates about 300,000 posts a day, but the splogosphere emits 900,000, inundating the ping servers.
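
The “three-quarters” figure follows directly from those daily post counts; a quick arithmetic check:

```python
# Quick check of the figures quoted above.
legit_posts_per_day, splog_posts_per_day = 300_000, 900_000
share = splog_posts_per_day / (legit_posts_per_day + splog_posts_per_day)
print(share)   # -> 0.75, i.e. about three-quarters of all pings
```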

Another giveaway: Both Some Title and the grave-robbing page it links to had Web addresses in the .info domain. Spammers flock to .info, which was created as an alternative to the crowded .com, because its domain names are cheaper – registrars often let people use them gratis for the first year – which is helpful for those, like sploggers, who buy Internet addresses in bulk. Splogs so commonly have .info addresses that many experts simply assume all blogs from that domain are fake.


Google PageRank explained

From Danny Sullivan’s “What Is Google PageRank? A Guide For Searchers & Webmasters” (Search Engine Land: 26 April 2007):

Let’s start with what Google says. In a nutshell, it considers links to be like votes. In addition, it considers that some votes are more important than others. PageRank is Google’s system of counting link votes and determining which pages are most important based on them. These scores are then used along with many other things to determine if a page will rank well in a search.
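
To make the “links as votes” idea concrete, here is a minimal sketch of the textbook PageRank power iteration. It illustrates the principle only; it is not Google’s production system, which, as the article notes, combines PageRank with many other signals.

```python
# A toy PageRank: iteratively redistribute each page's score to the pages it
# links to, with a damping factor. Links act as weighted "votes."
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: spread evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:           # each link is a vote
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Page C is linked to by both A and B, so it ends up with the highest score.
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))
```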

PageRank is only a score that represents the importance of a page, as Google estimates it. (By the way, that estimate of importance is considered to be Google’s opinion and is protected in the US by the First Amendment. When Google was once sued over altering PageRank scores for some sites, a US court ruled: “PageRanks are opinions–opinions of the significance of particular Web sites as they correspond to a search query… the court concludes Google’s PageRanks are entitled to full constitutional protection.”)


My new book – Linux Phrasebook – is out!

I’m really proud to announce that my 3rd book is now out & available for purchase: Linux Phrasebook. My first book – Don’t Click on the Blue E!: Switching to Firefox – was for general readers (really!) who wanted to learn how to move to and use the fantastic Firefox web browser. I included a lot of great information for more technical users as well, but the focus was your average Joe. My second book – Hacking Knoppix – was for the more advanced user who wanted to take advantage of Knoppix, a version of Linux that runs entirely off of a CD. You don’t need to be super-technical to use and enjoy Hacking Knoppix, but the more technical you are, the more you’ll enjoy the book. Linux Phrasebook is all about the Linux command line, and it’s perfect for both Linux newbies and experienced users. In fact, when I was asked to write the book, I responded, “Write it? I can’t wait to buy it!”

The idea behind Linux Phrasebook is to give practical examples of Linux commands and their myriad options, with examples for everything. Too often a Linux user will look up a command in order to discover how it works, and while the command and its many options will be detailed, something vitally important will be left out: examples. That’s where Linux Phrasebook comes in. I cover a huge number of different commands and their options, and for every single one, I give an example of usage and results that makes it clear how to use it.

Here’s the table of contents; in parentheses I’ve included some (just some) of the commands I cover in each chapter:

  1. Things to Know About Your Command Line
  2. The Basics (ls, cd, mkdir, cp, mv, rm)
  3. Learning About Commands (man, info, whereis, apropos)
  4. Building Blocks (;, &&, |, >, >>)
  5. Viewing Files (cat, less, head, tail)
  6. Printing and Managing Print Jobs (lpr, lpq, lprm)
  7. Ownerships and Permissions (chgrp, chown, chmod)
  8. Archiving and Compression (zip, gzip, bzip2, tar)
  9. Finding Stuff: Easy (grep, locate)
  10. The find command (find)
  11. Your Shell (history, alias, set)
  12. Monitoring System Resources (ps, lsof, free, df, du)
  13. Installing Software (rpm, dpkg, apt-get, yum)
  14. Connectivity (ping, traceroute, route, ifconfig, iwconfig)
  15. Working on the Network (ssh, sftp, scp, rsync, wget)
  16. Windows Networking (nmblookup, smbclient, smbmount)

I’m really proud of the whole book, but the chapter on the super-powerful and useful find command is a standout, along with the material on ssh and its descendants sftp and scp. But really, the whole book is great, and I will definitely be keeping a copy on my desk as a reference. If you want to know more about the Linux command line and how to use it, then I know you’ll enjoy and learn from Linux Phrasebook.

You can read about and buy the book at Amazon (http://www.amazon.com/gp/product/0672328380/) for $10.19. If you have any questions or comments, don’t hesitate to contact me at scott at granneman dot com.


Search for “high score” told them who stole the PC

From Robert Alberti’s “more on Supposedly Destroyed Hard Drive Purchased In Chicago” (Interesting People mailing list: 3 June 2006):

It would be interesting to analyze that drive to see if anyone else was using it during the period between when it went to Best Buy, and when it turned up at the garage sale. We once discovered who stole, and then returned, a Macintosh from a department at the University of Minnesota with its drive erased. We did a hex search of the drive surface for the words “high score”. There was the name of the thief, one of the janitors, who confessed when presented with evidence.
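
The underlying technique is simple to reproduce. Below is a hypothetical sketch, not the tool the Minnesota staff used: it scans a raw disk image in chunks for an ASCII string, which is the same idea as their hex search.

```python
# Scan a disk image for a byte string, reading in chunks so the whole image
# never has to fit in memory. Hypothetical sketch; assumes the drive has
# already been imaged to a file (e.g. with dd).
def find_string(image_path, needle=b"high score", chunk_size=1 << 20):
    """Return the byte offset of the first match, or -1 if not found."""
    with open(image_path, "rb") as img:
        pos, tail = 0, b""
        while True:
            chunk = img.read(chunk_size)
            if not chunk:
                return -1
            data = tail + chunk
            idx = data.find(needle)
            if idx != -1:
                return pos - len(tail) + idx
            tail = data[-(len(needle) - 1):]  # overlap, so matches spanning chunks aren't missed
            pos += len(chunk)

# print(find_string("disk.img"))
```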


Killer search terms

From The Inquirer’s “Killer phrase will fill your PC with spam”:

THERE IS ONE phrase which, if you type it into any search engine, will expose your PC to shed-loads of spam, according to a new report.

Researchers Ben Edelman and Hannah Rosenbaum reckon that typing the phrase “Free Screensavers” into any search engine is the equivalent of lighting a blue touch paper and standing well back. …

More than 64 per cent of sites that are linked to this phrase will cause you some trouble, either with spyware or adware. The report found 1,394 popular keyword searches via Google, Yahoo, MSN, AOL and Ask that were linked to spyware or adware, and the list is quite amusing. Do not type the following words into any search engine:

Bearshare
Screensavers
Winmx
Limewire
Download Yahoo messenger
Lime wire
Free ringtones


Thoughts on tagging/folksonomy

From Ulises Ali Mejias’ “A del.icio.us study: Bookmark, Classify and Share: A mini-ethnography of social practices in a distributed classification community”:

This principle of distribution is at work in socio-technical systems that allow users to collaboratively organize a shared set of resources by assigning classifiers, or tags, to each item. The practice is coming to be known as free tagging, open tagging, ethnoclassification, folksonomy, or faceted hierarchy (henceforth referred to in this study as distributed classification) …

One important feature of systems such as these is that they do not impose a rigid taxonomy. Instead, they allow users to assign whatever classifiers they choose. Although this might sound counter-productive to the ultimate goal of organizing content, in practice it seems to work rather well, although it does present some drawbacks. For example, most people will probably classify pictures of cats by using the tag ‘cats.’ But what happens when some individuals use ‘cat’ or ‘feline’ or ‘meowmeow’ …

It seems that while most people might not be motivated to contribute to a pre-established system of classification that may not meet their needs, or to devise new and complex taxonomies of their own, they are quite happy to use distributed systems of classification that are quick and able to accommodate their personal (and ever changing) systems of classification. …

But distributed classification does not accrue benefits only to the individual. It is a very social endeavor in which the community as a whole can benefit. Jon Udell describes some of the individual and social possibilities of this method of classification:

These systems offer lots of ways to visualize and refine the tag space. It’s easy to know whether a tag you’ve used is unique or, conversely, popular. It’s easy to rename a tag across a set of items. It’s easy to perform queries that combine tags. Armed with such powerful tools, people can collectively enrich shared data. (Udell 2004) …

Set this [an imposed taxonomy] against the idea of allowing a user to add tags to any given document in the corpus. Like Del.icio.us, there needn’t be a pre-defined hierarchy or lexicon of terms to use; one can simply lean on the power of ethnoclassification to build that lexicon dynamically. As such, it will dynamically evolve as usages change and shift, even as needs change and shift. (Williams, 2004)

The primary benefit of free tagging is that we know the classification makes sense to users… For a content creator who is uploading information into such a system, being able to freely list subjects, instead of choosing from a pre-approved “pick list,” makes tagging content much easier. This, in turn, makes it more likely that users will take time to classify their contributions. (Merholz, 2004)

Folksonomies work best when a number of users all describe the same piece of information. For instance, on del.icio.us, many people have bookmarked wikipedia (http://del.icio.us/url/bca8b85b54a7e6c01a1bcfaf15be1df5), each with a different set of words to describe it. Among the various tags used, del.icio.us shows that reference, wiki, and encyclopedia are the most popular. (Wikipedia entry for folksonomy, retrieved December 15, 2004 from http://en.wikipedia.org/wiki/Folksonomy)
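
The aggregation these quotes describe is easy to picture. The sketch below uses invented data and is not del.icio.us's implementation; it only shows the counting idea: many users tag the same URL, and tallying the tags surfaces the most popular descriptions.

```python
# Count how often each tag is applied to one shared resource (invented data).
from collections import Counter

bookmarks = {                       # user -> tags they applied to one URL
    "alice": ["reference", "wiki", "encyclopedia"],
    "bob":   ["wiki", "research"],
    "carol": ["encyclopedia", "reference", "wiki"],
}

tag_counts = Counter(tag for tags in bookmarks.values() for tag in tags)
print(tag_counts.most_common(3))
# -> [('wiki', 3), ('reference', 2), ('encyclopedia', 2)]
```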

Of course, this approach is not without its potential problems:

With no one controlling the vocabulary, users develop multiple terms for identical concepts. For example, if you want to find all references to New York City on Del.icio.us, you’ll have to look through “nyc,” “newyork,” and “newyorkcity.” You may also encounter the inverse problem — users employing the same term for disparate concepts. (Merholz, 2004) …

But as Clay Shirky remarks, this solution (controlling synonyms) might diminish some of the benefits that we can derive from folksonomies:

Synonym control is not as wonderful as is often supposed, because synonyms often aren’t. Even closely related terms like movies, films, flicks, and cinema cannot be trivially collapsed into a single word without loss of meaning, and of social context … (Shirky, 2004) …

The choice of tags [in the entire del.icio.us system] follows something resembling the Zipf or power law curve often seen in web-related traffic. Just six tags (python, delicious/del.icio.us, programming, hacks, tools, and web) account for 80% of all the tags chosen, and a long tail of 58 other tags make up the remaining 20%, with most occurring just once or twice … In the del.icio.us community, the rich get richer and the poor stay poor via http://del.icio.us/popular. Links noted by enough users within a short space of time get listed here, and many del.icio.us users use it to keep up with the zeitgeist. (Biddulph, 2004) …


Tracking terrorists with Unintended Information Revelation

From “New search engine to help thwart terrorists”:

With news that the London bombers were British citizens, radicalised on the streets of England and with squeaky-clean police records, comes the realisation that new mechanisms for hunting terrorists before they strike must be developed.

Researchers at the University at Buffalo, US, believe they have discovered a technique that will reveal information on public web sites that was not intended to be published.

The United States Federal Aviation Administration (FAA) and the National Science Foundation (NSF) are supporting the development of a new search engine based on Unintended Information Revelation (UIR), and designed for anti-terrorism applications.

UIR supposes that snippets of information – each of which appears innocent by itself – may be linked together to reveal highly sensitive data.

… “A concept chain graph will show you what’s common between two seemingly unconnected things,” said Srihari. “With regular searches, the input is a set of key words, the search produces a ranked list of documents, any one of which could satisfy the query.

“UIR, on the other hand, is a composite query, not a keyword query. It is designed to find the best path, the best chain of associations between two or more ideas. It returns to you an evidence trail that says, ‘This is how these pieces are connected.'”
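
The “best chain of associations” idea can be illustrated with a toy concept graph. The sketch below is not the UIR engine; it simply links concepts that co-occur in the same document and then finds a shortest chain between two of them with a breadth-first search. The documents and concepts are invented.

```python
# Build a co-occurrence graph from invented document snippets, then find a
# chain of associations between two concepts via breadth-first search.
from collections import deque

documents = {
    "doc1": {"flight school", "student visa"},
    "doc2": {"student visa", "wire transfer"},
    "doc3": {"wire transfer", "charity front"},
}

graph = {}
for concepts in documents.values():
    for c in concepts:
        graph.setdefault(c, set()).update(concepts - {c})

def concept_chain(start, goal):
    """Shortest chain of associated concepts from start to goal, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(concept_chain("flight school", "charity front"))
# -> ['flight school', 'student visa', 'wire transfer', 'charity front']
```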


Crack Windows passwords in seconds

This is an oldie but still a goodie – or a baddie, if you use or depend on Windows. Back in 2003, researchers released tools that enable the cracking of Windows passwords in an average of 13.6 seconds. Not bad, not bad at all. CNET has a nice writeup titled Cracking Windows passwords in seconds, which explains that the best way to guard against the attack is to create passwords that use more than just alphanumeric items. In other words, read my SecurityFocus column from May 2004, Pass the Chocolate, which contains this advice: “… you should use a mix of at least three of these four things: small letters, capital letters, numbers, and symbols. If you can use all four, great, but at least use three of them.”
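
The advice about mixing character classes comes down to simple arithmetic: each extra class enlarges the alphabet an attacker must cover. The numbers below (an eight-character password and roughly 32 printable symbols) are my own illustrative assumptions, not figures from the researchers.

```python
# Rough keyspace sizes for an 8-character password as the alphabet grows.
LOWER, UPPER, DIGITS, SYMBOLS = 26, 26, 10, 32   # symbol count is approximate
LENGTH = 8

def keyspace(alphabet_size, length=LENGTH):
    return alphabet_size ** length

print(keyspace(LOWER))                            # lowercase only:   ~2.1e11
print(keyspace(LOWER + DIGITS))                   # + digits:         ~2.8e12
print(keyspace(LOWER + UPPER + DIGITS + SYMBOLS)) # all four classes: ~6.1e15
```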

If you want to download and test the security of your Windows passwords, you can grab the software at Ophcrack. You can get source, as well as binaries for Windows and Linux. There’s even an online demo of the software, in which you can paste a hash of the password you’d like to crack and get back the actual password. Nice!
