John Andrew Kunze – Biography

John Andrew Kunze is a computer scientist most known for his contributions to information systems and standards, and to the theory and practice of digital libraries. From his earliest work experiences at the University of California Berkeley (UCB), he developed an affinity for public service and a taste for what would come to be known twenty years later as “open source”, and so he remained with the University of California for most of his career.

He was born in 1958 in Boston, Massachusetts, the fourth of the five children of Alice Caroline Gaar and Ray Alden Kunze. His father was a math professor and his mother a PhD candidate in comparative literature. He attended public schools in St. Louis, Missouri and Corona del Mar, California, played tennis for his high school team, studied piano and organ, and gave several church organ performances starting at age 16. After a year in Nancy, France during his father’s sabbatical, John returned to enroll at UCB in 1977, where he became a double major in computer science and mathematics, and was two courses short of a third major in French and German comparative literature. At UCB he also picked up squash, played for the university team, and remains an active member of the Berkeley squash community. Except for his first term, he paid all his university and living expenses working a half-time job at the UCB Computer Center. There his first boss had been hired as a result of successfully hacking the student database using a punched card deck to change one of his computer science class grades from an A- to an A.

John started contributing to the Berkeley Software Distribution (BSD) of Unix (the basis of Mac OS) by fixing bugs as a freshman in 1977. It began when his copious emailed bug reports started to be returned with the note “here’s the code – fix them yourself”. Coinciding with his user support job, he wrote a general Unix “help” program, revived the broken Unix “learn” program (an early script-driven computer-aided-instruction system), maintained the “termcap” database, and, inspired by APL operators, created the “jot”, “lam”, and “rs” utilities found on modern Mac computers. In 1985 he became a co-author of the highly cited “A Trace-Driven Analysis of the UNIX 4.2 BSD File System”, supplying graphs scripted using S, the pre-cursor of R.

In 1988 he became the principal author of the 899-page book, “Common Lisp: the Reference” (he also implemented automated testing for all its coding examples). His oldest brother, Fritz, was a founder of the Lisp support company, Franz, Inc. Early experience with “help” and “learn” taught John that “content was king” and led in 1989 to his proposing and creating “Infocal” (pronounced info-CAL), UCB’s first Campus-Wide Information System (CWIS). Written in source-available C and Tcl, the system was an early rival of the WWW, had its own hypertext (modeled on “learn” scripts), a custom search engine, and the brand new Z39.50 search and retrieval protocol.

That work brought him in contact with Tim Berners-Lee (inventor of the WWW) and Brewster Kahle (founder of WAIS and the Internet Archive) and, by 1992, Infocal went from being a rival of the WWW to an early user of its multi-protocol client library. Infocal was offered free to the public from 200 campus kiosks and from remote Telnet clients, and it participated in the first Z39.50 interoperability demonstration (at the Net ‘92 conference in Washington, DC, the other two participants being the Penn State University Library and the University of California’s Division of Library Automation). It provided the Berkeley campus’ first experience of online access to each of four major UCB datasets: the schedule of classes, course catalog, job vacancy listings, and faculty funding opportunities. All technical, protocol, policy, data, and user interface work was done with only 2.5 full time staff positions. Organizing an open-source code release of Infocal was impossible given the burdens of running a production service, maintaining datasets, and adapting Z39.50 and working with its standards bodies to accommodate non-bibliographic applications.

For better or worse, Z39.50 had plunged John into the world of standards (ANSI, NISO, ISO, IEEE, IETF) where he was drawn by the IETF’s innovative culture and lightweight, high impact specifications. It was clear that Berners-Lee’s combination of automated FTP plus hypertext – minus search – was the winning approach, and that the URL and possibly the URN would be key. Kahle’s Wide Area Information Service (WAIS, based on Z39.50-1988) and John’s Infocal would not have long futures. It soon became a priority to finish standardizing the URL/URI since the spec had come to an impasse. To get things moving, the URI working group, led by Alan Emtage, asked John to write up the URL requirements (RFC1736) and to take over from Berners-Lee as main editor of the spec. John only had time to accept the former commitment, and offered the latter to Larry Masinter. RFC1736 succeeded in getting the URI group unstuck, largely by explicitly allowing links to be “impermanent”, and a consensus on the URL standard was achieved shortly thereafter.

The cost of the hard-won, drawn-out consensus battle was high. Burnout caused the IETF URI working group to lose many key people, along with their collective memory, while critical work was unfinished in resource description (URC) and its persistent identifier hypothesis (URN). In an earlier encounter with the former, John coined the term URC (Uniform Resource Citation) in 1992 as a cluster of descriptive elements that could be displayed when a user’s mouse hovered over a hyperlink. He was wrapping up his involvement in Z39.50 (RFC1625, RFC2056) and saw a way to leverage the unique Z39.50 notion of standardized “search access points”, or data elements supporting discovery. In 1993 he brought this perspective to the IETF IIIR working group as it took up the topic of “Data Elements”, described then as “metainformation” for internet resources. The once-obscure term, “metadata”, suddenly came into wide use in information systems starting with the Dublin Core initiative in 1995.

John drafted the Dublin Core consensus and shepherded the first three Dublin Core specifications through the standards processes: RFC2413, RFC2731, and ANSI/NISO Z39.85. Dublin Core is arguably the most influential internet metadata standard, with countless descendant ontologies and very broad support among digital repositories, harvesters, and search indexes. Aware of its shortcomings (nothing is perfect), he proposed Dublin Kernel metadata to add teeth to the “all elements are optional” Dublin Core. Along with Kernel metadata, he created a pre-YAML-like syntax (ERC/ANVL) and a date format (TEMPER) that would support ranges and approximate dates. In 1996 he wrote up a vision of a crowd-sourced cross-domain superset vocabulary – a sort of “Dublin Mantle” – that would be realized much later as yamz.net. In 1994 he organized the 8th San Francisco Bay Area SIGWEB meeting, a 400-person, all-day conference held at UCB. He then left Berkeley to take up a one-year fellowship with the National Library of Medicine (NLM) after a large cross-campus committee formally recommended that Berkeley’s future lay with the Gopher system (the Infocal team managed to append a minority recommendation in favor of the WWW). If content was king, the place to be was the library world.

In 1995 John handed off termcap maintenance to Eric Raymond (author of The Cathedral and the Bazaar) and joined the Center for Knowledge Management (CKM) in the Library of the University of California San Francisco (UCSF). The CKM had launched the SIGWEB meeting series and quite recently had been at the heart of two controversies. Three CKM staff members had departed to create Eolas, a firm that many saw as a patent troll and which later won a patent judgement of over $550 million. Unrelated, the CKM had also just exploited the then new image delivery capabilities of the web to mount online scanned images of stolen tobacco company documents, an act that would become a landmark in tobacco litigation, academic freedom, government policy, and the fight against corporate control of information. Taking up the latter, he created systems and protocols for digitizing, searching, and identifying industry documents that would help UCSF build a multi-million-page document library that has been a crucial resource in the global regulation of tobacco.

In 1998 John resumed a fellowship with the NLM, where he was asked to evaluate existing approaches to the problem of persistently referencing things that URLs linked to: URN, Handle, PURL, and DOI. In his view these approaches had all accepted a flawed hypothesis that, in this instance, reduced a complex, socio-organizational service-sustainability problem to one of simple “indirection”. He suspected the blame lay with the dazzling success that DNS had recently achieved in using indirection to make hostnames persistent. He argued that even if one accepted the hypothesis, none of the existing approaches were needed since anyone who had a web server and a carefully chosen hostname already possessed a “solution” (unless one also wanted to hire a junior developer to create a simple admin interface to maintain a two-column indirection table). In 2000 he co-authored a set of NLM “permanence ratings” to specify the multi-dimensional ways in which a web resource might change – the object itself, its content, its identifier – within an institution’s commitment to that resource. He finished the NLM work with a recommendation not to bother with existing persistent identifier (PID) systems when creating links, and just to assign and manage URLs carefully according to a short list of guiding principles.

John thought he was done with PIDs, but that wasn’t to be when he joined the California Digital Library (CDL) in 2001. In devising a strategy for link persistence, his NLM work didn’t help one gauge the persistence of URLs received from elsewhere. The vast majority of URLs in the world are not meant to persist, so it seems useful to be able to distinguish those that are meant to persist. This led to his creating the Archival Resource Key (ARK), a scheme for a distinguishable URL that can reference a resource of any type, without fees, and that is globally unique independent of hostname or HTTP. To gauge its persistence, a user can “inflect” the ARK by adding “??” to the end in order to request metadata and a nuanced (non-binary) commitment statement. It combined his guiding principles, the NLM permanence ratings, the CDL use case, ERC/ANVL Kernel metadata, and his new scheme agnostic protocol (THUMP) for inflecting URLs. For ARK implementers, he created the open source Noid (Nice Opaque Identifier) software. Noid was a fast scalable tool that could mint, bind, and resolve identifiers of any kind. It was not purely an ARK tool, since he had conceived the ARK partly out of opposition to the exclusionary practices adopted by other identifier systems that sought to capture markets and lock users in. So while most people use Noid with ARKs, some use it with Handles.

In 2003 John began working with the CDL’s new preservation program. The CDL adopted ARKs along with other University of California campuses. He gave presentations on ARKs once or twice a year and within a decade 168 institutions had registered to use them, including heavy adopters such as the National Library of France and the Internet Archive. While working on a Library of Congress grant, he wrote the BagIt packaging format (RFC8493) and drafted the first version of the web archiving (WARC) standard, both of which are widely used in digital repository and web harvesting systems. The latter uses an early version of his ANVL metadata format. During that time he also specified and implemented a handful of digital repository microservices used at CDL and other digital libraries across the world: Pairtree, Namaste, ReDD, oxum, Checkm. Several of these were re-implemented independently based on his specifications and are used in such places as the HathiTrust and OCFL-based systems. Although ARKs are decentralized by design, he created N2T.net in response to demand for a centralized ARK resolver. He also conceived the EZID system in response to demand for a front-end identifier management microservice that could be used with multiple storage repositories. N2T and EZID were unique in the world for their scheme-agnostic approach to managing and resolving individual identifiers. Neither system was purely about ARKs for the same reason that Noid is not purely an ARK tool.

In 2005 John resurrected the Digital Library Federation’s Developers Forum and chaired it for three years. In 2006 he went on a six-week fellowship with the University of Edinburgh to consider the emerging topic of PIDs and citations for scientific data. That fellowship resulted in CDL being invited to join the 5-year NSF DataONE grant, where he chaired the Preservation and Metadata Working Group starting in 2009. During this period he was also technical lead on collaborations with Microsoft Research and the Moore Foundation.

The DataONE involvement led to his creating yamz.net, which was the first realization of his 1996 vision for a crowdsourced “metadictionary” vocabulary builder. Around this time John created the “suffix passthrough” capability that permits N2T to resolve millions of identifiers on the basis of one registered identifier (similar to PURL’s “partial redirect”). In 2017 a formal collaboration with identifiers.org resulted in their adopting N2T’s “compact identifier” syntax and N2T’s adopting their curated ruleset of 600+ identifier schemes. That same year saw registration of the 419th ARK institution and his publication of a “persistence” vocabulary that considerably expanded on the NLM permanence ratings from 2000. In 2018 the National Library of France held a two-day, 350-person French-language ARK Summit in Paris, for which he gave the keynote address.

By the end of 2018, with CDL’s support to focus on ARKs, John spearheaded creation of what would become the ARK Alliance (arks.org). Over 500 institutions had registered and created over 8 billion ARKs, and for the first time ARKs had their own organization, including an advisory group and outreach, technical, and sustainability working groups. With the registration rate roughly doubling each year, four years later there were over 1000 ARK institutions. In 2022 John left the University of California to work with partners to keep the ARK Alliance moving towards self-sufficiency and to pursue his interest in the Decentralized Web (DWeb).