Thursday, October 26, 2017

The Upspin manifesto: On the ownership and sharing of data

Here follows the original "manifesto" from late 2014 proposing the idea for what became Upspin. The text has been lightly edited to remove a couple of references to Google-internal systems, with no loss of meaning.

I'd like to thank Eduardo Pinheiro, Eric Grosse, Dave Presotto and Andrew Gerrand for helping me turn this into a working system, in retrospect remarkably close to the original vision.

Augie image Copyright © 2017 Renee French


Manifesto



Outside our laptops, most of us today have no shared file system at work. (There was a time when we did, but it's long gone.) The world took away our /home folders and moved us to databases, which are not file systems. Thus I can no longer (at least not without clumsy hackery) make my true home directory be where my files are. Instead, I am expected to work on some local machine using some web app talking to some database or other external repository to do my actual work. This is mobile phone user interfaces brought to engineering workstations, which has its advantages but also some deep flaws. Most important is the philosophy it represents.

You don't own your data any more. One argument is that companies own it, but from a strict user perspective, your "apps" own it. Each item you use in the modern mobile world is coupled to the program that manipulates it. Your Twitter, Facebook, or Google+ program is the only way to access the corresponding feed. Sometimes the boundary is softened within a company—photos in Google+ are available in other Google products—but that is the exception that proves the rule. You don't control your data, the programs do.

Yet there are many reasons to want to access data from multiple programs. That is, almost by definition, the Unix model. Unix's model is largely irrelevant today, but there are still legitimate ways to think about data that are made much too hard by today's way of working. It's not necessarily impossible to share data among programs (although it's often very difficult), but it's never natural. There are workarounds like plugin architectures and documented workflows but they just demonstrate that the fundamental idea—sharing data among programs—is not the way we work any more.

This is backwards. It's a reversal of the old way we used to work, with a file system that all programs could access equally. The very notion of "download" and "upload" is plain stupid. Data should simply be available to any program with authenticated access rights. And of course for any person with those rights. Sharing between programs and people can be, technologically at least, the same problem, and a solved one.

This document proposes a modern way to achieve the good properties of the old world: consistent access, understandable data flow, and easy sharing without workarounds. To do this, we go back to the old notion of a file system and make it uniform and global. The result should be a data space in which all people can use all their programs, sharing and collaborating at will in a consistent, intuitive way.

Not downloading, uploading, mailing, tarring, gzipping, plugging in and copying around. Just using. Conceptually: If I want to post a picture on Twitter, I just name the file that holds it. If I want to edit a picture on Twitter using Photoshop, I use the File>Open menu in Photoshop and name the file stored on Twitter, which is easy to discover or even know a priori. (There are security and access questions here, and we'll come back to those.)

Working in a file system.

I want my home directory to be where all my data is. Not just my local files, not just my source code, not just my photos, not just my mail. All my data. My "phone" should be able to access the same data as my laptop, which should be able to access the same data as the servers. (Ignore access control for the moment.) $HOME should be my home for everything: work, play, life; toy, phone, work, cluster.

This was how things worked in the old single-machine days but we lost sight of that when networking became universally available. There were network file systems and some research systems used them to provide basically this model, but the arrival of consumer devices, portable computing, and then smartphones eroded the model until every device is its own fiefdom and every program manages its own data through networking. We have a network connecting devices instead of a network composed of devices.

The knowledge of how to achieve the old way still exists, and networks are fast and ubiquitous enough to restore the model. From a human point of view, the data is all we care about: my pictures, my mail, my documents. Put those into a globally addressable file system and I can see all my data with equal facility from any device I control. And then, when I want to share with another person, I can name the file (or files or directory) that holds the information I want to share, grant access, and the other person can access it.

The essence here is that the data (if it's in a single file) has one name that is globally usable to anyone who knows the name and has the permission to evaluate it. Those names might be long and clumsy, but simple name space techniques can make the data work smoothly using local aliasing so that I live in "my" world, you live in your world (also called "my" world from your machines), and the longer, globally unique names only arise when we share, which can be done with a trivial, transparent, easy to use file-system interface.

Note that the goal here is not a new file system to use alongside the existing world. Its purpose is to be the only file system to use. Obviously there will be underlying storage systems, but from the user's point of view all access is through this system. I edit a source file, compile it, find a problem, point a friend at it; she accesses the same file, not a copy, to help me understand it. (If she wants a copy, there's always cp!)

This is not a simple thing to do, but I believe it is possible. Here is how I see it being assembled. This discussion will be idealized and skate over a lot of hard stuff. That's OK; this is a vision, not a design document.

Everyone has a name.

Each person is identified by a name. To make things simple here, let's just use an e-mail address. There may be a better idea, but this is sufficient for discussion. It is not a problem to have multiple names (addresses) in this model, since the sharing and access rights will treat the names as distinct users with whatever sharing rights they choose to use.

Everyone has stable storage in the network.

Each person needs a way to make data accessible to the network, so the storage must live in the network. The easiest way to think of this is like the old network file systems, with per-user storage in the server farm. At a high level, it doesn't matter what that storage is or how it is structured, as long as it can be used to provide the storage layer for a file-system-like API.

Everyone's storage server has a name identified by the user's name.

The storage in the server farm is identified by the user's name.

Everyone has local storage, but it's just a cache.

It's too expensive to send all file access to the servers, so the local device, whatever it is—laptop, desktop, phone, watch—caches what it needs and goes to the server as needed. Cache protocols are an important part of the implementation; for the point of this discussion, let's just say they can be designed to work well. That is a critical piece and I have ideas, but put that point aside for now.

The server always knows what devices have cached copies of the files on local storage. 

The cache always knows what the associated server is for each directory or file in its cache and maintains consistency within reasonable time boundaries.

The cache implements the API of a full file system. The user lives in this file system for all the user's own files. As the user moves between devices, caching protocols keep things working.
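
To make the caching idea concrete, here is a minimal sketch in Go of a read-through cache: reads are served locally when possible and go to the server only on a miss. Everything named here (Server, Cache, Invalidate, the in-memory mapServer) is hypothetical and invented for illustration; real cache protocols would also need consistency, eviction, and write handling, which this sketch ignores.

```go
package main

import (
	"errors"
	"fmt"
)

// Server is a hypothetical stand-in for the remote storage service.
type Server interface {
	Fetch(path string) ([]byte, error)
}

// mapServer is a toy in-memory server that counts how often it is asked.
type mapServer struct {
	files   map[string][]byte
	fetches int
}

func (s *mapServer) Fetch(path string) ([]byte, error) {
	s.fetches++
	data, ok := s.files[path]
	if !ok {
		return nil, errors.New("no such file: " + path)
	}
	return data, nil
}

// Cache serves reads locally when it can and goes to the server otherwise.
type Cache struct {
	server Server
	local  map[string][]byte
}

func NewCache(s Server) *Cache {
	return &Cache{server: s, local: make(map[string][]byte)}
}

// Read returns the file contents, fetching from the server only on a miss.
func (c *Cache) Read(path string) ([]byte, error) {
	if data, ok := c.local[path]; ok {
		return data, nil
	}
	data, err := c.server.Fetch(path)
	if err != nil {
		return nil, err
	}
	c.local[path] = data
	return data, nil
}

// Invalidate drops a cached copy, as a server that tracks cached copies
// might request after the file changes elsewhere (an assumption).
func (c *Cache) Invalidate(path string) {
	delete(c.local, path)
}

func main() {
	srv := &mapServer{files: map[string][]byte{"/notes.txt": []byte("hello")}}
	cache := NewCache(srv)
	cache.Read("/notes.txt") // miss: goes to the server
	cache.Read("/notes.txt") // hit: served from local storage
	fmt.Println("server fetches:", srv.fetches)
}
```

The second read never reaches the server, which is the whole point: the local device holds only what it needs, and the server remains the place where the data lives.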

Everyone's cache can talk to multiple servers.

A user may have multiple servers, perhaps from multiple providers. The same cache and therefore same file system API refers to them all equivalently. Similarly, if a user accesses a different user's files, the exact same protocol is used, and the result is cached in the same cache the same way. This is federation as architecture.

Every file has a globally unique name.

Every file is named by this triple: (host address, user name, file path). Access rights aside, any user can address any other user's file by evaluating the triple. The real access method will be nicer in practice, of course, but this is the gist.
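
As an illustration only, the triple could be modeled in Go as a small comparable value. The type name, the NewName helper, the host!user!path text form, and the example.com addresses are all made up for this sketch; the text above says only that the real access method will be nicer.

```go
package main

import (
	"fmt"
	"path"
)

// Name is the (host address, user name, file path) triple from the text.
type Name struct {
	Host string // address of the storage server
	User string // the user's name, here an e-mail address
	Path string // path within that user's tree
}

// NewName builds a triple with a cleaned, rooted path so that
// equivalent spellings of the same path compare equal.
func NewName(host, user, p string) Name {
	return Name{Host: host, User: user, Path: path.Clean("/" + p)}
}

// String renders the triple in a made-up text form, host!user!path.
func (n Name) String() string {
	return n.Host + "!" + n.User + "!" + n.Path
}

func main() {
	a := NewName("store.example.com", "jane@example.com", "photos/../photos/cat.jpg")
	b := NewName("store.example.com", "jane@example.com", "/photos/cat.jpg")
	fmt.Println(a)
	fmt.Println(a == b) // both spellings name the same file
}
```

Because the struct holds only strings, two names can be compared with ==, which is convenient when a name is used as a map key in a cache or an access table.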

Every file has a potentially unique ACL.

Although the user interface for access control needs to be very easy, the effect is that each file or directory has an access control list (ACL) that mediates all access to its contents. It will need to be very fine-grained with respect to each of users, files, and rights.
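
One way to picture such an ACL is the following Go sketch. The rights (Read, Write, List), the Tree type, and in particular the rule that a file without its own ACL falls back to the nearest enclosing directory's ACL are assumptions made for the example; the text requires only that access be mediated per file or directory, per user, and per right.

```go
package main

import (
	"fmt"
	"path"
)

// Right is a bit set of permissions; the particular rights are illustrative.
type Right uint8

const (
	Read Right = 1 << iota
	Write
	List
)

// ACL maps a user name to the rights that user holds.
type ACL map[string]Right

// Tree holds ACLs keyed by file or directory path.
type Tree struct {
	acls map[string]ACL
}

// Allowed reports whether user holds all the wanted rights on p.
// It uses the ACL attached to p if there is one, otherwise the ACL of the
// nearest enclosing directory (an assumed inheritance rule).
func (t *Tree) Allowed(user, p string, want Right) bool {
	for {
		if acl, ok := t.acls[p]; ok {
			return acl[user]&want == want
		}
		if p == "/" || p == "." {
			return false // no ACL anywhere on the way up: deny
		}
		p = path.Dir(p)
	}
}

func main() {
	t := &Tree{acls: map[string]ACL{
		"/home/jane": {"jane@example.com": Read | Write | List},
		"/home/jane/shared/report.txt": {
			"jane@example.com": Read | Write,
			"joe@example.com":  Read,
		},
	}}
	fmt.Println(t.Allowed("joe@example.com", "/home/jane/shared/report.txt", Read))
	fmt.Println(t.Allowed("joe@example.com", "/home/jane/shared/report.txt", Write))
	fmt.Println(t.Allowed("joe@example.com", "/home/jane/private.txt", Read))
}
```

Here Joe can read the one shared report but cannot write it, and he sees nothing else of Jane's, which is the kind of granularity the paragraph above asks for.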

Every user has a local name space.

The cache/file-system layer contains functionality to bind things, typically directories, identified by such triples into locally nicer-to-use names. An obvious way to think about this is like an NFS mount point for /home, where the remote binding attaches to /home/XXX the component or components in the network that the local user wishes to identify by XXX. For example, Joe might establish /home/jane as a place to see all the (accessible to Joe) pieces of Jane's world. But it can be much finer-grained than that, to the level of pulling in a single file.

The NFS analogy only goes so far. First, the binding is a lazily-evaluated, multi-valued recipe, not a Unix-like mount. Also, the binding may itself be programmatic, so that there is an element of auto-discovery. Perhaps most important, one can ask any file in the cached local system what its triple is and get its unique name, so when a user wishes to share an item, the triple can be exposed and the remote user can use her locally-defined recipe to construct the renaming to make the item locally accessible. This is not as mysterious or as confusing in practice as it sounds; Plan 9 pretty much worked like this, although not as dynamically.
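
A rough Go sketch of the binding idea follows. A local prefix is bound to one or more remote triples, and nothing is contacted until a local name is resolved, at which point the longest matching prefix yields the candidate remote names. The types (Remote, NameSpace), the methods, and the example.com hosts are hypothetical; programmatic bindings and auto-discovery are left out.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Remote is a (host, user, path) triple naming something in the network.
type Remote struct {
	Host, User, Path string
}

// NameSpace maps local prefixes to one or more remote triples.
type NameSpace struct {
	binds map[string][]Remote
}

func NewNameSpace() *NameSpace {
	return &NameSpace{binds: make(map[string][]Remote)}
}

// Bind attaches a remote triple to a local prefix. Binding the same prefix
// again adds another candidate, so a binding can be multi-valued.
func (ns *NameSpace) Bind(local string, r Remote) {
	ns.binds[local] = append(ns.binds[local], r)
}

// Resolve finds the longest bound prefix of local and returns the candidate
// remote names, in binding order. No server is contacted; the caller decides
// which candidate to try.
func (ns *NameSpace) Resolve(local string) []Remote {
	best := ""
	for prefix := range ns.binds {
		matches := local == prefix || strings.HasPrefix(local, prefix+"/")
		if matches && len(prefix) > len(best) {
			best = prefix
		}
	}
	if best == "" {
		return nil
	}
	rest := strings.TrimPrefix(local, best)
	var out []Remote
	for _, r := range ns.binds[best] {
		out = append(out, Remote{r.Host, r.User, path.Join(r.Path, rest)})
	}
	return out
}

func main() {
	ns := NewNameSpace()
	ns.Bind("/home/jane", Remote{"a.example.com", "jane@example.com", "/public"})
	ns.Bind("/home/jane", Remote{"b.example.com", "jane@example.com", "/"})
	for _, r := range ns.Resolve("/home/jane/pics/cat.jpg") {
		fmt.Println(r.Host, r.User, r.Path)
	}
}
```

Going the other way, from a local file back to its triple for sharing, amounts to the same table read in reverse: the cache already knows which server each entry came from.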

Everyone's data becomes addressable.

Twitter gives you (or anyone you permit) access to your Twitter data by implementing the API, just as the regular, more file-like servers do. The same story applies to any entity that has data it wants to make usable. At some scaling point, it becomes wrong not to play.

Everyone's data is secure.

It remains to be figured out how to do that, I admit, but with a simple, coherent data model that should be achievable.

Is this a product?

The protocols and some of the pieces, particularly what runs on local devices, should certainly be open source, as should a reference server implementation. Companies should be free to provide proprietary implementations to access their data, and should also be free to charge for hosting. A cloud provider could charge hosting fees for the service, perhaps with some free or very cheap tier that would satisfy the common user. There's money in this space.

What is this again?

What Google Drive should be. What Dropbox should be. What file systems can be. The way we unify our data access across companies, services, programs, and people. The way I want to live and work.

Never again should someone need to manually copy/upload/download/plugin/workflow/transfer data from one machine to another. 

Thursday, September 21, 2017

Go: Ten years and climbing



Drawing Copyright ©2017 Renee French


This week marks the 10th anniversary of the creation of Go.


The initial discussion was on the afternoon of Thursday, the 20th of September, 2007. That led to an organized meeting between Robert Griesemer, Rob Pike, and Ken Thompson at 2PM the next day in the conference room called Yaounde in Building 43 on Google's Mountain View campus. The name for the language arose on the 25th, several messages into the first mail thread about the design:

Subject: Re: prog lang discussion
From: Rob 'Commander' Pike
Date: Tue, Sep 25, 2007 at 3:12 PM
To: Robert Griesemer, Ken Thompson

i had a couple of thoughts on the drive home.

1. name

'go'. you can invent reasons for this name but it has nice properties. it's short, easy to type. tools: goc, gol, goa. if there's an interactive debugger/interpreter it could just be called 'go'. the suffix is .go
...

(It's worth stating that the language is called Go; "golang" comes from the web site address (go.com was already a Disney web site) but is not the proper name of the language.)


The Go project counts its birthday as November 10, 2009, the day it launched as open source, originally on code.google.com before migrating to GitHub a few years later. But for now let's date the language from its conception, two years earlier, which allows us to reach further back, take a longer view, and witness some of the earlier events in its history.


The first big surprise in Go's development was the receipt of this mail message:

Subject: A gcc frontend for Go
From: Ian Lance Taylor
Date: Sat, Jun 7, 2008 at 7:06 PM
To: Robert Griesemer, Rob Pike, Ken Thompson

One of my office-mates pointed me at http://.../go_lang.html . It seems like an interesting language, and I threw together a gcc frontend for it. It's missing a lot of features, of course, but it does compile the prime sieve code on the web page.

The shocking yet delightful arrival of an ally (Ian) and a second compiler (gccgo) was not only encouraging, it was enabling. Having a second implementation of the language was vital to the process of locking down the specification and libraries, helping guarantee the high portability that is part of Go's promise.


Even though his office was not far away, none of us had even met Ian before that mail, but he has been a central player in the design and implementation of the language and its tools ever since.


Russ Cox joined the nascent Go team in 2008 as well, bringing his own bag of tricks. Russ discovered—that's the right word—that the generality of Go's methods meant that a function could have methods, leading to the http.HandlerFunc idea, which was an unexpected result for all of us. Russ promoted more general ideas too, like the io.Reader and io.Writer interfaces, which informed the structure of all the I/O libraries.


Jini Kim, who was our product manager for the launch, recruited the security expert Adam Langley to help us get Go out the door. Adam did a lot of things for us that are not widely known, including creating the original golang.org web page and the build dashboard, but of course his biggest contribution was in the cryptographic libraries. At first, they seemed disproportionate in both size and complexity, at least to some of us, but they enabled so much important networking and security software later that they became a crucial part of the Go story. Network infrastructure companies like Cloudflare lean heavily on Adam's work in Go, and the internet is better for it. So is Go, and we thank him.


In fact a number of companies started to play with Go early on, particularly startups. Some of those became powerhouses of cloud computing. One such startup, now called Docker, used Go and catalyzed the container industry for computing, which then led to other efforts such as Kubernetes. Today it's fair to say that Go is the language of containers, another completely unexpected result.


Go's role in cloud computing is even bigger, though. In March of 2014 Donnie Berkholz, writing for RedMonk, claimed that Go was "the emerging language of cloud infrastructure". Around the same time, Derek Collison of Apcera stated that Go was already the language of the cloud. That might not have been quite true then, but as the word "emerging" used by Berkholz implied, it was becoming true.


Today, Go is the language of the cloud, and to think that a language only ten years old has come to dominate such a large and growing industry is the kind of success one can only dream of. And if you think "dominate" is too strong a word, take a look at the internet inside China. For a while, the huge usage of Go in China signaled to us by the Google Trends graph seemed to be some sort of mistake, but as anyone who has been to the Go conferences in China can attest, the measurements are real. Go is huge in China.


In short, ten years of travel with the language have brought us past many milestones. The most astonishing is at our current position: a conservative estimate suggests there are at least half a million Go programmers. When the mail message naming Go was sent, the idea of there being half a million gophers would have sounded preposterous. Yet here we are, and the number continues to grow.


Speaking of gophers, it's been fun to watch how Renee French's idea for a mascot, the Go gopher, became not only a much loved creation but also a symbol for Go programmers everywhere. Many of the biggest Go conferences are called GopherCons as they gather together gophers from all over the world.


Gopher conferences are taking off. The first one was only three years ago, yet today there are many, all around the world, plus countless smaller local "meetups". On any given day, there is more likely than not a group of gophers meeting somewhere in the world to share ideas.


Looking back over ten years of Go design and development, it is astounding to reflect on the growth of the Go community. The number of conferences and meetups, the long and ever-increasing list of contributors to the Go project, the profusion of open source repositories hosting Go code, the number of companies using Go, some exclusively: these are all astonishing to contemplate.


For the three of us, Robert, Rob, and Ken, who just wanted to make our programming lives easier, it's incredibly gratifying to witness what our work has started.

What will the next ten years bring?

- Rob Pike, with Robert Griesemer and Ken Thompson

Saturday, February 25, 2017

The power of role models

I spent a few days a while back in a board meeting for a national astronomy organization and noticed a property of the population in that room: Out of about 40 people, about a third were women. And these were powerful women, too: professors, observatory directors and the like. Nor were they wallflowers. Their contributions to the meeting exceeded their proportion.

In my long career, I had never before been in a room like that, and the difference in tone, conversation, respect, and professionalism was unlike any I have experienced. I can't prove it was the presence of women that made the difference - it could just be that astronomers are better people all around, a possibility I cannot really refute - but it seemed to me that the difference stemmed from the demographics.

The meeting was one-third women, but of course in private conversation, when pressed, the women I spoke to complained that things weren't equal yet. We all have our reference points.

But let's back up for a moment and think about the main point: In a room responsible for overseeing the budget and operation of major astronomical observatories, including things like the Hubble telescope, women played a major role. The contrast with computing is stark.

It really got me thinking. At dinner I asked some of the women to speak to me about this, how astronomy became so (relatively) egalitarian. And one topic became clear: role models. Astronomy has a long history of women active in the field, going all the way back to Caroline Herschel in the early 19th century. Women have made huge contributions to the field. Dava Sobel just wrote a book about the women who laid the foundations for the discovery of the expansion of the universe. Just a couple of weeks ago, papers ran obituaries of Vera Rubin, the remarkable observational astronomer who discovered the evidence for dark matter. I could mention Jocelyn Bell, whose discovery of pulsars got her advisor a Nobel (sic).

The most famous astronomer I met growing up was Helen Hogg, the (adopted) Canadian astronomer at David Dunlap Observatory outside Toronto, who also did a fair bit of what we now call outreach.

The women at the meeting spoke of this, a history of women contributing, of role models to look up to, of proof that women can make major contributions to the field.

What can computing learn from this?

It seems we're doing it wrong. The best way to improve the representation of women in the field is not to recruit them, important though that is, but to promote them. To create role models. To push them into positions of influence. Women leave computing in large numbers because they don't see a path up, or because the culture makes them unwelcome. More women excelling in the field, famous women, brilliant women, would be inspiring.

Men have the power to help fix those things, but they also should have the courage to cede the stage to women more often, to fight the stupid bias that keeps women from excelling in the field. It may take proactive behavior, like choosing a woman over a man when growing your team, just because, or promoting women more freely.

But as I see it, without something being done to promote female role models, the way things are going computing will still be backwards a hundred years from now.