storage

David Pogue’s insights about tech over time

From David Pogue’s “The Lessons of 10 Years of Talking Tech” (The New York Times: 25 November 2010):

As tech decades go, this one has been a jaw-dropper. Since my first column in 2000, the tech world has not so much blossomed as exploded. Think of all the commonplace tech that didn’t even exist 10 years ago: HDTV, Blu-ray, GPS, Wi-Fi, Gmail, YouTube, iPod, iPhone, Kindle, Xbox, Wii, Facebook, Twitter, Android, online music stores, streaming movies and on and on.

With the turkey cooking, this seems like a good moment to review, to reminisce — and to distill some insight from the first decade in the new tech millennium.

Things don’t replace things; they just splinter. I can’t tell you how exhausting it is to keep hearing pundits say that some product is the “iPhone killer” or the “Kindle killer.” Listen, dudes: the history of consumer tech is branching, not replacing.

Things don’t replace things; they just add on. Sooner or later, everything goes on-demand. The last 10 years have brought a sweeping switch from tape and paper storage to digital downloads. Music, TV shows, movies, photos and now books and newspapers. We want instant access. We want it easy.

Some people’s gadgets determine their self-esteem. … Today’s gadgets are intensely personal. Your phone or camera or music player makes a statement, reflects your style and character. No wonder some people interpret criticisms of a product as a criticism of their choices. By extension, it’s a critique of them.

Everybody reads with a lens. … feelings run just as strongly in the tech realm. You can’t use the word “Apple,” “Microsoft” or “Google” in a sentence these days without stirring up emotion.

It’s not that hard to tell the winners from the losers. … There was the Microsoft Spot Watch (2003). This was a wireless wristwatch that could display your appointments and messages — but cost $10 a month, had to be recharged nightly and wouldn’t work outside your home city unless you filled out a Web form in advance.

Some concepts’ time may never come. The same “breakthrough” ideas keep surfacing — and bombing, year after year. For the love of Mike, people, nobody wants videophones!

Teenagers do not want “communicators” that do nothing but send text messages, either (AT&T Ogo, Sony Mylo, Motorola V200). People do not want to surf the Internet on their TV screens (WebTV, AOLTV, Google TV). And give it up on the stripped-down kitchen “Internet appliances” (3Com Audrey, Netpliance i-Opener, Virgin Webplayer). Nobody has ever bought one, and nobody ever will.

Forget about forever — nothing lasts a year. Of the thousands of products I’ve reviewed in 10 years, only a handful are still on the market. Oh, you can find some gadgets whose descendants are still around: iPod, BlackBerry, Internet Explorer and so on.

But it’s mind-frying to contemplate the millions of dollars and person-years that were spent on products and services that now fill the Great Tech Graveyard: Olympus M-Robe. PocketPC. Smart Display. MicroMV. MSN Explorer. Aibo. All those PlaysForSure music players, all those Palm organizers, all those GPS units you had to load up with maps from your computer.

Everybody knows that’s the way tech goes. The trick is to accept your
gadget’s obsolescence at the time you buy it…

Nobody can keep up. Everywhere I go, I meet people who express the same reaction to consumer tech today: there’s too much stuff coming too fast. It’s impossible to keep up with trends, to know what to buy, to avoid feeling left behind. They’re right. There’s never been a period of greater technological change. You couldn’t keep up with all of it if you tried.

David Pogue’s insights about tech over time Read More »

Unix: An Oral History

From ‘s “Unix: An Oral History” (: ):

Multics

Gordon M. Brown

[Multics] was designed to include fault-free continuous operation capabilities, convenient remote terminal access and selective information sharing. One of the most important features of Multics was to follow the trend towards integrated multi-tasking and permit multiple programming environments and different human interfaces under one operating system.

Moreover, two key concepts had been picked up on in the development of Multics that would later serve to define Unix. These were that the less important features of the system introduced more complexity and, conversely, that the most important property of algorithms was simplicity. Ritchie explained this to Mahoney, articulating that:

The relationship of Multics to [the development of Unix] is actually interesting and fairly complicated. There were a lot of cultural things that were sort of taken over wholesale. And these include important things, [such as] the hierarchical file system and tree-structure file system – which incidentally did not get into the first version of Unix on the PDP-7. This is an example of why the whole thing is complicated. But any rate, things like the hierarchical file system, choices of simple things like the characters you use to edit lines as you’re typing, erasing characters were the same as those we had. I guess the most important fundamental thing is just the notion that the basic style of interaction with the machine, the fact that there was the notion of a command line, the notion was an explicit shell program. In fact the name shell came from Multics. A lot of extremely important things were completely internalized, and of course this is the way it is. A lot of that came through from Multics.

The Beginning

Michael Errecart and Cameron Jones

Files to Share

The Unix file system was based almost entirely on the file system for the failed Multics project. The idea was for file sharing to take place with explicitly separated file systems for each user, so that there would be no locking of file tables.

A major part of the answer to this question is that the file system had to be open. The needs of the group dictated that every user had access to every other user’s files, so the Unix system had to be extremely open. This openness is even seen in the password storage, which is not hidden at all but is encrypted. Any user can see all the encrypted passwords, but can only test one solution per second, which makes it extremely time consuming to try to break into the system.

The idea of standard input and output for devices eventually found its way into Unix as pipes. Pipes enabled users and programmers to send one function’s output into another function by simply placing a vertical line, a ‘|’ between the two functions. Piping is one of the most distinct features of Unix …

Language from B to C

… Thompson was intent on having Unix be portable, and the creation of a portable language was intrinsic to this. …

Finding a Machine

Darice Wong & Jake Knerr

… Thompson devoted a month apiece to the shell, editor, assembler, and other software tools. …

Use of Unix started in the patent office of Bell Labs, but by 1972 there were a number of non-research organizations at Bell Labs that were beginning to use Unix for software development. Morgan recalls the importance of text processing in the establishment of Unix. …

Building Unix

Jason Aughenbaugh, Jonathan Jessup, & Nicholas Spicher

The Origin of Pipes

The first edition of Thompson and Ritchie’s The Unix Programmer’s Manual was dated November 3, 1971; however, the idea of pipes is not mentioned until the Version 3 Unix manual, published in February 1973. …

Software Tools

grep was, in fact, one of the first programs that could be classified as a software tool. Thompson designed it at the request of McIlroy, as McIlroy explains:

One afternoon I asked Ken Thompson if he could lift the regular expression recognizer out of the editor and make a one-pass program to do it. He said yes. The next morning I found a note in my mail announcing a program named grep. It worked like a charm. When asked what that funny name meant, Ken said it was obvious. It stood for the editor command that it simulated, g/re/p (global regular expression print)….From that special-purpose beginning, grep soon became a household word. (Something I had to stop myself from writing in the first paragraph above shows how firmly naturalized the idea now is: ‘I used ed to grep out words from the dictionary.’) More than any other single program, grep focused the viewpoint that Kernighan and Plauger christened and formalized in Software Tools: make programs that do one thing and do it well, with as few preconceptions about input syntax as possible.

eqn and grep are illustrative of the Unix toolbox philosophy that McIlroy phrases as, “Write programs that do one thing and do it well. Write programs to work together. Write programs that handle text streams, because that is a universal interface.” This philosophy was enshrined in Kernighan and Plauger’s 1976 book, Software Tools, and reiterated in the “Foreword” to the issue of The Bell Systems Technical Journal that also introduced pipes.

Ethos

Robert Murray-Rust & Malika Seth

McIlroy says,

This is the Unix philosophy. Write programs that do one thing and do it well. Write programs to work together. Write programs that handle text streams because, that is a universal interface.

The dissemination of Unix, with a focus on what went on within Bell Labs

Steve Chen

In 1973, the first Unix applications were installed on a system involved in updating directory information and intercepting calls to numbers that had been changed. This was the first time Unix had been used in supporting an actual, ongoing operating business. Soon, Unix was being used to automate the operations systems at Bell Laboratories. It was automating the monitoring, involved in measurement, and helping to rout calls and ensure the quality of the calls.

There were numerous reasons for the friendliness the academic society, especially the academic Computer Science community, showed towards Unix. John Stoneback relates a few of these:

Unix came into many CS departments largely because it was the only powerful interactive system that could run on the sort of hardware (PDP-11s) that universities could afford in the mid ’70s. In addition, Unix itself was also very inexpensive. Since source code was provided, it was a system that could be shaped to the requirements of a particular installation. It was written in a language considerably more attractive than assembly, and it was small enough to be studied and understood by individuals. (John Stoneback, “The Collegiate Community,” Unix Review, October 1985, p. 27.)

The key features and characteristics of Unix that held it above other operating systems at the time were its software tools, its portability, its flexibility, and the fact that it was simple, compact, and efficient. The development of Unix in Bell Labs was carried on under a set of principles that the researchers had developed to guide their work. These principles included:

(i) Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.

(ii) Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input.

(iii) Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.

(iv) Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.”

(M.D. McIlroy, E.N. Pinson, and B.A. Tague “Unix Time-Sharing System Forward,” The Bell System Technical Journal, July-Aug 1088 vol 57, number 6 part 2. P. 1902)

Unix: An Oral History Read More »

An analysis of Google’s technology, 2005

From Stephen E. Arnold’s The Google Legacy: How Google’s Internet Search is Transforming Application Software (Infonortics: September 2005):

The figure Google’s Fusion: Hardware and Software Engineering shows that Google’s technology framework has two areas of activity. There is the software engineering effort that focuses on PageRank and other applications. Software engineering, as used here, means writing code and thinking about how computer systems operate in order to get work done quickly. Quickly means the sub one-second response times that Google is able to maintain despite its surging growth in usage, applications and data processing.

Google is hardware plus software

The other effort focuses on hardware. Google has refined server racks, cable placement, cooling devices, and data center layout. The payoff is lower operating costs and the ability to scale as demand for computing resources increases. With faster turnaround and the elimination of such troublesome jobs as backing up data, Google’s hardware innovations give it a competitive advantage few of its rivals can equal as of mid-2005.

How Google Is Different from MSN and Yahoo

Google’s technologyis simultaneously just like other online companies’ technology, and very different. A data center is usually a facility owned and operated by a third party where customers place their servers. The staff of the data center manage the power, air conditioning and routine maintenance. The customer specifies the computers and components. When a data center must expand, the staff of the facility may handle virtually all routine chores and may work with the customer’s engineers for certain more specialized tasks.

Before looking at some significant engineering differences between Google and two of its major competitors, review this list of characteristics for a Google data center.

1. Google data centers – now numbering about two dozen, although no one outside Google knows the exact number or their locations. They come online and automatically, under the direction of the Google File System, start getting work from other data centers. These facilities, sometimes filled with 10,000 or more Google computers, find one another and configure themselves with minimal human intervention.

2. The hardware in a Google data center can be bought at a local computer store. Google uses the same types of memory, disc drives, fans and power supplies as those in a standard desktop PC.

3. Each Google server comes in a standard case called a pizza box with one important change: the plugs and ports are at the front of the box to make access faster and easier.

4. Google racks are assembled for Google to hold servers on their front and back sides. This effectively allows a standard rack, normally holding 40 pizza box servers, to hold 80.

5. A Google data center can go from a stack of parts to online operation in as little as 72 hours, unlike more typical data centers that can require a week or even a month to get additional resources online.

6. Each server, rack and data center works in a way that is similar to what is called “plug and play.” Like a mouse plugged into the USB port on a laptop, Google’s network of data centers knows when more resources have been connected. These resources, for the most part, go into operation without human intervention.

Several of these factors are dependent on software. This overlap between the hardware and software competencies at Google, as previously noted, illustrates the symbiotic relationship between these two different engineering approaches. At Google, from its inception, Google software and Google hardware have been tightly coupled. Google is not a software company nor is it a hardware company. Google is, like IBM, a company that owes its existence to both hardware and software. Unlike IBM, Google has a business model that is advertiser supported. Technically, Google is conceptually closer to IBM (at one time a hardware and software company) than it is to Microsoft (primarily a software company) or Yahoo! (an integrator of multiple softwares).

Software and hardware engineering cannot be easily segregated at Google. At MSN and Yahoo hardware and software are more loosely-coupled. Two examples will illustrate these differences.

Microsoft – with some minor excursions into the Xbox game machine and peripherals – develops operating systems and traditional applications. Microsoft has multiple operating systems, and its engineers are hard at work on the company’s next-generation of operating systems.

Several observations are warranted:

1. Unlike Google, Microsoft does not focus on performance as an end in itself. As a result, Microsoft gets performance the way most computer users do. Microsoft buys or upgrades machines. Microsoft does not fiddle with its operating systems and their subfunctions to get that extra time slice or two out of the hardware.

2. Unlike Google, Microsoft has to support many operating systems and invest time and energy in making certain that important legacy applications such as Microsoft Office or SQLServer can run on these new operating systems. Microsoft has a boat anchor tied to its engineer’s ankles. The boat anchor is the need to ensure that legacy code works in Microsoft’s latest and greatest operating systems.

3. Unlike Google, Microsoft has no significant track record in designing and building hardware for distributed, massively parallelised computing. The mice and keyboards were a success. Microsoft has continued to lose money on the Xbox, and the sudden demise of Microsoft’s entry into the home network hardware market provides more evidence that Microsoft does not have a hardware competency equal to Google’s.

Yahoo! operates differently from both Google and Microsoft. Yahoo! is in mid-2005 a direct competitor to Google for advertising dollars. Yahoo! has grown through acquisitions. In search, for example, Yahoo acquired 3721.com to handle Chinese language search and retrieval. Yahoo bought Inktomi to provide Web search. Yahoo bought Stata Labs in order to provide users with search and retrieval of their Yahoo! mail. Yahoo! also owns AllTheWeb.com, a Web search site created by FAST Search & Transfer. Yahoo! owns the Overture search technology used by advertisers to locate key words to bid on. Yahoo! owns Alta Vista, the Web search system developed by Digital Equipment Corp. Yahoo! licenses InQuira search for customer support functions. Yahoo has a jumble of search technology; Google has one search technology.

Historically Yahoo has acquired technology companies and allowed each company to operate its technology in a silo. Integration of these different technologies is a time-consuming, expensive activity for Yahoo. Each of these software applications requires servers and systems particular to each technology. The result is that Yahoo has a mosaic of operating systems, hardware and systems. Yahoo!’s problem is different from Microsoft’s legacy boat-anchor problem. Yahoo! faces a Balkan-states problem.

There are many voices, many needs, and many opposing interests. Yahoo! must invest in management resources to keep the peace. Yahoo! does not have a core competency in hardware engineering for performance and consistency. Yahoo! may well have considerable competency in supporting a crazy-quilt of hardware and operating systems, however. Yahoo! is not a software engineering company. Its engineers make functions from disparate systems available via a portal.

The figure below provides an overview of the mid-2005 technical orientation of Google, Microsoft and Yahoo.

2005 focuses of Google, MSN, and Yahoo

The Technology Precepts

… five precepts thread through Google’s technical papers and presentations. The following snapshots are extreme simplifications of complex, yet extremely fundamental, aspects of the Googleplex.

Cheap Hardware and Smart Software

Google approaches the problem of reducing the costs of hardware, set up, burn-in and maintenance pragmatically. A large number of cheap devices using off-the-shelf commodity controllers, cables and memory reduces costs. But cheap hardware fails.

In order to minimize the “cost” of failure, Google conceived of smart software that would perform whatever tasks were needed when hardware devices fail. A single device or an entire rack of devices could crash, and the overall system would not fail. More important, when such a crash occurs, no full-time systems engineering team has to perform technical triage at 3 a.m.

The focus on low-cost, commodity hardware and smart software is part of the Google culture.

Logical Architecture

Google’s technical papers do not describe the architecture of the Googleplex as self-similar. Google’s technical papers provide tantalizing glimpses of an approach to online systems that makes a single server share features and functions of a cluster of servers, a complete data center, and a group of Google’s data centers.

The collections of servers running Google applications on the Google version of Linux is a supercomputer. The Googleplex can perform mundane computing chores like taking a user’s query and matching it to documents Google has indexed. Further more, the Googleplex can perform side calculations needed to embed ads in the results pages shown to user, execute parallelized, high-speed data transfers like computers running state-of-the-art storage devices, and handle necessary housekeeping chores for usage tracking and billing.

When Google needs to add processing capacity or additional storage, Google’s engineers plug in the needed resources. Due to self-similarity, the Googleplex can recognize, configure and use the new resource. Google has an almost unlimited flexibility with regard to scaling and accessing the capabilities of the Googleplex.

In Google’s self-similar architecture, the loss of an individual device is irrelevant. In fact, a rack or a data center can fail without data loss or taking the Googleplex down. The Google operating system ensures that each file is written three to six times to different storage devices. When a copy of that file is not available, the Googleplex consults a log for the location of the copies of the needed file. The application then uses that replica of the needed file and continues with the job’s processing.

Speed and Then More Speed

Google uses commodity pizza box servers organized in a cluster. A cluster is group of computers that are joined together to create a more robust system. Instead of using exotic servers with eight or more processors, Google generally uses servers that have two processors similar to those found in a typical home computer.

Through proprietary changes to Linux and other engineering innovations, Google is able to achieve supercomputer performance from components that are cheap and widely available.

… engineers familiar with Google believe that read rates may in some clusters approach 2,000 megabytes a second. When commodity hardware gets better, Google runs faster without paying a premium for that performance gain.

Another key notion of speed at Google concerns writing computer programs to deploy to Google users. Google has developed short cuts to programming. An example is Google’s creating a library of canned functions to make it easy for a programmer to optimize a program to run on the Googleplex computer. At Microsoft or Yahoo, a programmer must write some code or fiddle with code to get different pieces of a program to execute simultaneously using multiple processors. Not at Google. A programmer writes a program, uses a function from a Google bundle of canned routines, and lets the Googleplex handle the details. Google’s programmers are freed from much of the tedium associated with writing software for a distributed, parallel computer.

Eliminate or Reduce Certain System Expenses

Some lucky investors jumped on the Google bandwagon early. Nevertheless, Google was frugal, partly by necessity and partly by design. The focus on frugality influenced many hardware and software engineering decisions at the company.

Drawbacks of the Googleplex

The Laws of Physics: Heat and Power 101

In reality, no one knows. Google has a rapidly expanding number of data centers. The data center near Atlanta, Georgia, is one of the newest deployed. This state-of-the-art facility reflects what Google engineers have learned about heat and power issues in its other data centers. Within the last 12 months, Google has shifted from concentrating its servers at about a dozen data centers, each with 10,000 or more servers, to about 60 data centers, each with fewer machines. The change is a response to the heat and power issues associated with larger concentrations of Google servers.

The most failure prone components are:

  • Fans.
  • IDE drives which fail at the rate of one per 1,000 drives per day.
  • Power supplies which fail at a lower rate.

Leveraging the Googleplex

Google’s technology is one major challenge to Microsoft and Yahoo. So to conclude this cursory and vastly simplified look at Google technology, consider these items:

1. Google is fast anywhere in the world.

2. Google learns. When the heat and power problems at dense data centers surfaced, Google introduced cooling and power conservation innovations to its two dozen data centers.

3. Programmers want to work at Google. “Google has cachet,” said one recent University of Washington graduate.

4. Google’s operating and scaling costs are lower than most other firms offering similar businesses.

5. Google squeezes more work out of programmers and engineers by design.

6. Google does not break down, or at least it has not gone offline since 2000.

7. Google’s Googleplex can deliver desktop-server applications now.

8. Google’s applications install and update without burdening the user with gory details and messy crashes.

9. Google’s patents provide basic technology insight pertinent to Google’s core functionality.

An analysis of Google’s technology, 2005 Read More »

Lots of good info about the FBI’s far-reaching wiretapping of US phone systems

From Ryan Singel’s “Point, Click … Eavesdrop: How the FBI Wiretap Net Operates” (Wired News: 29 August 2007):

The FBI has quietly built a sophisticated, point-and-click surveillance system that performs instant wiretaps on almost any communications device, according to nearly a thousand pages of restricted documents newly released under the Freedom of Information Act.

The surveillance system, called DCSNet, for Digital Collection System Network, connects FBI wiretapping rooms to switches controlled by traditional land-line operators, internet-telephony providers and cellular companies. It is far more intricately woven into the nation’s telecom infrastructure than observers suspected.

It’s a “comprehensive wiretap system that intercepts wire-line phones, cellular phones, SMS and push-to-talk systems,” says Steven Bellovin, a Columbia University computer science professor and longtime surveillance expert.

DCSNet is a suite of software that collects, sifts and stores phone numbers, phone calls and text messages. The system directly connects FBI wiretapping outposts around the country to a far-reaching private communications network.

The $10 million DCS-3000 client, also known as Red Hook, handles pen-registers and trap-and-traces, a type of surveillance that collects signaling information — primarily the numbers dialed from a telephone — but no communications content. (Pen registers record outgoing calls; trap-and-traces record incoming calls.)

DCS-6000, known as Digital Storm, captures and collects the content of phone calls and text messages for full wiretap orders.

A third, classified system, called DCS-5000, is used for wiretaps targeting spies or terrorists.

What DCSNet Can Do

Together, the surveillance systems let FBI agents play back recordings even as they are being captured (like TiVo), create master wiretap files, send digital recordings to translators, track the rough location of targets in real time using cell-tower information, and even stream intercepts outward to mobile surveillance vans.

FBI wiretapping rooms in field offices and undercover locations around the country are connected through a private, encrypted backbone that is separated from the internet. Sprint runs it on the government’s behalf.

The network allows an FBI agent in New York, for example, to remotely set up a wiretap on a cell phone based in Sacramento, California, and immediately learn the phone’s location, then begin receiving conversations, text messages and voicemail pass codes in New York. With a few keystrokes, the agent can route the recordings to language specialists for translation.

The numbers dialed are automatically sent to FBI analysts trained to interpret phone-call patterns, and are transferred nightly, by external storage devices, to the bureau’s Telephone Application Database, where they’re subjected to a type of data mining called link analysis.

The numerical scope of DCSNet surveillance is still guarded. But we do know that as telecoms have become more wiretap-friendly, the number of criminal wiretaps alone has climbed from 1,150 in 1996 to 1,839 in 2006. That’s a 60 percent jump. And in 2005, 92 percent of those criminal wiretaps targeted cell phones, according to a report published last year.

These figures include both state and federal wiretaps, and do not include antiterrorism wiretaps, which dramatically expanded after 9/11. They also don’t count the DCS-3000’s collection of incoming and outgoing phone numbers dialed. Far more common than full-blown wiretaps, this level of surveillance requires only that investigators certify that the phone numbers are relevant to an investigation.

In the 1990s, the Justice Department began complaining to Congress that digital technology, cellular phones and features like call forwarding would make it difficult for investigators to continue to conduct wiretaps. Congress responded by passing the Communications Assistance for Law Enforcement Act, or CALEA, in 1994, mandating backdoors in U.S. telephone switches.

CALEA requires telecommunications companies to install only telephone-switching equipment that meets detailed wiretapping standards. Prior to CALEA, the FBI would get a court order for a wiretap and present it to a phone company, which would then create a physical tap of the phone system.

With new CALEA-compliant digital switches, the FBI now logs directly into the telecom’s network. Once a court order has been sent to a carrier and the carrier turns on the wiretap, the communications data on a surveillance target streams into the FBI’s computers in real time.

The released documents suggest that the FBI’s wiretapping engineers are struggling with peer-to-peer telephony provider Skype, which offers no central location to wiretap, and with innovations like caller-ID spoofing and phone-number portability.

Despite its ease of use, the new technology is proving more expensive than a traditional wiretap. Telecoms charge the government an average of $2,200 for a 30-day CALEA wiretap, while a traditional intercept costs only $250, according to the Justice Department inspector general. A federal wiretap order in 2006 cost taxpayers $67,000 on average, according to the most recent U.S. Court wiretap report.

What’s more, under CALEA, the government had to pay to make pre-1995 phone switches wiretap-friendly. The FBI has spent almost $500 million on that effort, but many traditional wire-line switches still aren’t compliant.

Processing all the phone calls sucked in by DCSNet is also costly. At the backend of the data collection, the conversations and phone numbers are transferred to the FBI’s Electronic Surveillance Data Management System, an Oracle SQL database that’s seen a 62 percent growth in wiretap volume over the last three years — and more than 3,000 percent growth in digital files like e-mail. Through 2007, the FBI has spent $39 million on the system, which indexes and analyzes data for agents, translators and intelligence analysts.

Lots of good info about the FBI’s far-reaching wiretapping of US phone systems Read More »

Tim O’Reilly defines cloud computing

From Tim O’Reilly’s “Web 2.0 and Cloud Computing” (O’Reilly Radar: 26 October 2008):

Since “cloud” seems to mean a lot of different things, let me start with some definitions of what I see as three very distinct types of cloud computing:

1. Utility computing. Amazon’s success in providing virtual machine instances, storage, and computation at pay-as-you-go utility pricing was the breakthrough in this category, and now everyone wants to play. Developers, not end-users, are the target of this kind of cloud computing.

This is the layer at which I don’t presently see any strong network effect benefits (yet). Other than a rise in Amazon’s commitment to the business, neither early adopter Smugmug nor any of its users get any benefit from the fact that thousands of other application developers have their work now hosted on AWS. If anything, they may be competing for the same resources.

That being said, to the extent that developers become committed to the platform, there is the possibility of the kind of developer ecosystem advantages that once accrued to Microsoft. More developers have the skills to build AWS applications, so more talent is available. But take note: Microsoft took charge of this developer ecosystem by building tools that both created a revenue stream for Microsoft and made developers more reliant on them. In addition, they built a deep — very deep — well of complex APIs that bound developers ever-tighter to their platform.

So far, most of the tools and higher level APIs for AWS are being developed by third-parties. In the offerings of companies like Heroku, Rightscale, and EngineYard (not based on AWS, but on their own hosting platform, while sharing the RoR approach to managing cloud infrastructure), we see the beginnings of one significant toolchain. And you can already see that many of these companies are building into their promise the idea of independence from any cloud infrastructure vendor.

In short, if Amazon intends to gain lock-in and true competitive advantage (other than the aforementioned advantage of being the low-cost provider), expect to see them roll out their own more advanced APIs and developer tools, or acquire promising startups building such tools. Alternatively, if current trends continue, I expect to see Amazon as a kind of foundation for a Linux-like aggregation of applications, tools and services not controlled by Amazon, rather than for a Microsoft Windows-like API and tools play. There will be many providers of commodity infrastructure, and a constellation of competing, but largely compatible, tools vendors. Given the momentum towards open source and cloud computing, this is a likely future.

2. Platform as a Service. One step up from pure utility computing are platforms like Google AppEngine and Salesforce’s force.com, which hide machine instances behind higher-level APIs. Porting an application from one of these platforms to another is more like porting from Mac to Windows than from one Linux distribution to another.

The key question at this level remains: are there advantages to developers in one of these platforms from other developers being on the same platform? force.com seems to me to have some ecosystem benefits, which means that the more developers are there, the better it is for both Salesforce and other application developers. I don’t see that with AppEngine. What’s more, many of the applications being deployed there seem trivial compared to the substantial applications being deployed on the Amazon and force.com platforms. One question is whether that’s because developers are afraid of Google, or because the APIs that Google has provided don’t give enough control and ownership for serious applications. I’d love your thoughts on this subject.

3. Cloud-based end-user applications. Any web application is a cloud application in the sense that it resides in the cloud. Google, Amazon, Facebook, twitter, flickr, and virtually every other Web 2.0 application is a cloud application in this sense. However, it seems to me that people use the term “cloud” more specifically in describing web applications that were formerly delivered locally on a PC, like spreadsheets, word processing, databases, and even email. Thus even though they may reside on the same server farm, people tend to think of gmail or Google docs and spreadsheets as “cloud applications” in a way that they don’t think of Google search or Google maps.

This common usage points up a meaningful difference: people tend to think differently about cloud applications when they host individual user data. The prospect of “my” data disappearing or being unavailable is far more alarming than, for example, the disappearance of a service that merely hosts an aggregated view of data that is available elsewhere (say Yahoo! search or Microsoft live maps.) And that, of course, points us squarely back into the center of the Web 2.0 proposition: that users add value to the application by their use of it. Take that away, and you’re a step back in the direction of commodity computing.

Ideally, the user’s data becomes more valuable because it is in the same space as other users’ data. This is why a listing on craigslist or ebay is more powerful than a listing on an individual blog, why a listing on amazon is more powerful than a listing on Joe’s bookstore, why a listing on the first results page of Google’s search engine, or an ad placed into the Google ad auction, is more valuable than similar placement on Microsoft or Yahoo!. This is also why every social network is competing to build its own social graph rather than relying on a shared social graph utility.

This top level of cloud computing definitely has network effects. If I had to place a bet, it would be that the application-level developer ecosystems eventually work their way back down the stack towards the infrastructure level, and the two meet in the middle. In fact, you can argue that that’s what force.com has already done, and thus represents the shape of things. It’s a platform I have a strong feeling I (and anyone else interested in the evolution of the cloud platform) ought to be paying more attention to.

Tim O’Reilly defines cloud computing Read More »

Dropbox for Linux is coming soon

According to this announcement, a Linux client for Dropbox should be coming out in a week or so:

http://forums.getdropbox.com/topic.php?id=2371&replies=1

I’ve been using Dropbox for several months, and it’s really, really great.

What is it? Watch this video:

http://www.getdropbox.com/screencast

It’s backup and auto-syncing done REALLY well. Best of all, you can sync between more than one computer, even if one is owned by someone else. So I could create a folder then share it with Robert. It shows up on his machine. If either of us changes files in the folder, those changes are auto-synced with each other.

Very nice.

So check it out when you get a chance. 2 GB are free. After that, you pay a small fee.

Dropbox for Linux is coming soon Read More »

Info about the Internet Archive

From The Internet Archive’s “Orphan Works Reply Comments” (9 May 2005):

The Internet Archive stores over 500 terabytes of ephemeral web pages, book and moving images, adding an additional twenty-five terabytes each month. The short life span and immense quantity of these works prompts a solution that provides immediate and efficient preservation and access to orphaned ephemeral works. For instance, the average lifespan of a webpage is 100 days before it undergoes alteration or permanent deletion, and there are an average of fifteen links on a webpage.

Info about the Internet Archive Read More »

The growth in data & the problem of storage

From Technology Review‘s “The Fading Memory of the State“:

Tom Hawk, general manager for enterprise storage at IBM, says that in the next three years, humanity will generate more data–from websites to digital photos and video–than it generated in the previous 1,000 years. … In 1996, companies spent 11 percent of their IT budgets on storage, but that figure will likely double to 22 percent in 2007, according to International Technology Group of Los Altos, CA.

… the Pentagon generates tens of millions of images from personnel files each year; the Clinton White House generated 38 million e-mail messages (and the current Bush White House is expected to generate triple that number); and the 2000 census returns were converted into more than 600 million TIFF-format image files, some 40 terabytes of data. A single patent application can contain a million pages, plus complex files like 3-D models of proteins or CAD drawings of aircraft parts. All told, NARA expects to receive 347 petabytes … of electronic records by 2022.

Currently, the Archives holds only a trivial number of electronic records. Stored on steel racks in NARA’s [National Archives and Records Administration] 11-year-old facility in College Park, the digital collection adds up to just five terabytes. Most of it consists of magnetic tapes of varying ages, many of them holding a mere 200 megabytes apiece–about the size of 10 high-resolution digital photographs.

The growth in data & the problem of storage Read More »