This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Introduction
Technical Abstract
A new graph data model is proposed as the single central invariant of future networks and software systems. Content-based (secure hash) identity is used for all public data entities, making them immutable by reference. Entities contain structured data and metadata, with chosen restrictions that promote design patterns suited for collaborative information and decentralized internet architecture. Managed collections of known references between entities are then used to support composition and versioning. To best facilitate this data model, new software and network architectures are required, and these will evolve independently over time. As an initial exploration and proof-of-concept, an idealized unified model for decentralized, distributable information systems is proposed. Networked graph data repositories collect, filter, and propagate references surrounding known or hosted immutable data entities. Public and private compositions, revisions, and annotations can thereby be independently layered from unlimited sources. This supports global-scale distributed collaborative media and is foundational for a new general-purpose software architecture that prioritizes maximum composability and re-use of both code and data. The repository serves as a universal interface for managing persistent data, remotely and locally. Likewise, shared graph data itself serves as the universal interface for user-space software components. Interaction patterns upon the graph, described by declarative code and contracts, replace traditional APIs as composition and communication mechanisms. In support, repositories propagate data entities in response to provided directives. Software components themselves exist within a graph-native programming environment that is suitable to both human and AI users. Finally, human user interfaces are dynamically rendered based on context, environment, interactive needs, preferences, history, and customizable hints.
The InfoCentral project draws inspiration from academic research in many computer science subfields, including distributed systems and databases, knowledge representation, programming languages, human-computer interaction, and networking. The Semantic Web effort toward universal standards for graph-structured information systems provides much technical inspiration and background theory for this work. The research area of Information-Centric Networking (ICN) has also been highly influential, and related aspects of InfoCentral represent a competing entry.
In contrast to similar academic and open source Internet architecture projects, InfoCentral has a wider scope that allows more assumptions to be challenged. For instance, most ICN projects do not consider alternative software architectures and thus make design assumptions based upon supporting current application needs and market demands. Likewise, the Semantic Web / Linked Data effort has largely left existing web data and network architecture unchallenged.
Author's Preface
The software and internet technology landscape is primed for another major revision. As an industry, we have achieved great things, but we can do far better. With so many revolutionary advances on the practicality horizon, this is an ideal time to revisit foundations and first principles, taking into account the lessons learned over the past five decades. We have too long been stuck in a rut of incrementalism, building upon foundations that no longer properly support our ambitions. The next phase of the information revolution awaits, concurrent with an accelerated transition from frenzied exploration to mature engineering. The resulting quality uniformity will not only make our jobs more enjoyable, but tear down digital divides and improve societies globally. These are ideals shared by academic and startup enterprises alike, following a coalescence of related ideas and research efforts. Many architecture-level projects and proposals have surfaced in the last decade, bearing strong similarity in objectives and principles. They herald an era of computing dominated by distributed and decentralized systems, ubiquitous cryptography, practical forms of artificial intelligence, declarative and functional programming, increasingly verifiable code, fully-dynamic and multi-modal human interfaces, socially and semantically rich data, and universal interoperability. Unfortunately, existing projects seem to lack an overarching vision to bring together their best contributions. The primary aspiration of the InfoCentral project is to discover a truly unifying yet evolvable foundation that everyone can build upon. I hope that aspects of this design proposal will inspire fresh thinking and new collaborative explorations. Everything presented in this early publication is a work-in-progress. However, I believe that enough pieces of the puzzle are in place to begin prototype implementations that demonstrate the power of the chosen architectural principles.
Overview
InfoCentral is an open engineering effort toward a new, unifying software and communications architecture. Besides immediate practical benefits, it seeks to help close the massive sophistication gap between current personal and business IT and the needs of tomorrow's intelligent machines and pervasive computing environments. InfoCentral is also motivated by social objectives such as improving collaborative processes, community interaction, perspective building, productive debate, rational economics, trust management, and lifelong education. These features follow naturally from good general-purpose information and software architecture.
InfoCentral has a long-range futuristic vision, with some ideals that will admittedly be hard to realize quickly. Recognizing this, the project aims for layered, modular research and development, starting with new data and network models, building toward stronger semantic and logic models, and culminating in new user interface and social computing paradigms. This approach allows the long-range goals to inform lower-level architecture, while not imposing unrealistic expectations on pragmatic development toward earlier real-world applications.
Infocentric Design
The core philosophy of InfoCentral—and its namesake—is that the design of information should be the absolute central focus and priority when creating software systems. However, information design should proceed independently of software design, with no assumptions about how information will be used. Software may then come alongside the neutral information space, bringing functionality appropriate to different contexts. This contrasts with the common software-as-product philosophy, which views application-specific functionality and interfaces as the focus of design and the data model as implementation details. This top-down approach encourages production of self-contained, limited-purpose systems that define in advance what information they work with and what interactions are permitted. Composition using APIs within or between such systems is inherently fragile, due to the difficulty of precisely specifying expected behavior and side effects. This complexity is then compounded by continuous revision and the emergence of interwoven dependencies. In the end, high maintenance costs compel toward greater siloing of data and functionality rather than integration.
Semantically-rich, highly-normalized information, coupled with intuitively-programmable user environments, will someday yield a degree of integration and fluidity that obsoletes self-contained software. Writing applications will be replaced by creating and connecting small, functional modules that operate within a vast sea of standardized shared information. Most interaction patterns among users and software components will be able to be captured without specialized coding. As new functions are required, a focus on elegant, re-usable modules will ensure longevity of solutions. Meanwhile, information itself will become disambiguated, machine-friendly, and future-proof.
Infocentric design allows information to survive all changes to networks and software around it. The networking and programming models of the InfoCentral proposal are fully abstracted from the data model. In the past, we have generally designed information, networks and software with human users, developers, and maintainers in mind. This has affected everything from naming schemes and data structures to trust and authority mechanisms. To allow for unhindered evolution of AI, we must instead abstract as much as possible from the core data model, such that it will be independently useful to machines.
Graph-Structured Data
The InfoCentral proposal is an alternate vision for the widely-desired “web of data,” in which links are between structured information rather than hypertext. The design properties and economic structures that have worked well for the hypertext and applications web conflict with the needs of a pure data web. For example, there are fewer places to insert advertisements into raw data that is consumed by local software under the user's active control. Service-based business models must typically be used instead. Likewise, there is little business incentive to give users access to the raw data behind current proprietary web and mobile applications, as this would largely reduce switching costs and enable direct competition.
While sharing the same ultimate goals and theoretical underpinning as the Semantic Web effort, InfoCentral diverges from certain entrenched architectural tenets. The proposed changes aim to improve integration of other research areas and make the resulting platform more economically feasible and accessible for developers and content creators. Because the resulting architecture will differ substantially from the current web, a new name should be considered. The casual term “Graph” seems fitting, with a possible formal title of the “Global Information Graph.” (“Giant Global Graph” has formerly been proposed, but is redundant sounding and has an unpronounceable acronym.)
In the public information space, InfoCentral proposes a minimalistic, fully-distributable data / network model in which everyone effectively has write access to publish new data, nobody has write access to modify existing data, cryptography is used to control visibility, and layered social/trust networks (both public and private) are used to shape retainment, prioritization, and propagation. Such a model tends to be democratic, self-regulating, and freedom preserving. Though many nodes will not allow global write access, user-facing systems will perpetually source data from many independent repositories, creating a customizable information layering effect. Write access to any repository participating in a popular overlay is sufficient. There will be many, and any person or group will be able to easily start their own. This promotes both competition and censorship-resistance.
Standards powering the Global Information Graph will eventually subsume all standalone database technology, even for fully-private instances. InfoCentral designs comprise the minimal subset of primitive features needed to support all database architectures via layering higher-level structures and protocols. Meanwhile, compliant implementations are universally cross-compatible at a base level. When faced with unavoidable engineering trade-offs, InfoCentral designs prioritize independence and flexibility first, scalability second, and efficiency third. The selection of certain mandatory primitives guarantees that InfoCentral-style repositories will never be as efficient as a traditional relational database or column store. Keys of at least 256 bits (hash values) are required, and highly-normalized, strongly-typed data is the default. With time, increasingly smart engines will close the gap. However, the InfoCentral philosophy is about making information more fluid and machine friendly. Employee hours are expensive. Personal time is valuable. Machines are cheap – and becoming ever cheaper.
The User Experience
InfoCentral does not propose a specific, standardized user experience, but rather common metaphors that should eventually form the basis of all user experiences around graph-structured information. Some of these include:
- fluid and fully-integrated information
- pervasive contextual awareness (including social aspects)
- non-destructive operations (default unlimited undo and history)
- dynamically composable and reconfigurable software components
- multi-modal, task-aware, environment-aware human user interfaces
- standardized drill-down / zoomable navigation across UI modes
- dynamic level-of-detail and automatic summarization
- automated, customizable, per-user spatial layout
- separation of software functionality and shared data (no exclusive-control boundaries)
- separation of hardware and network concerns from user / information space concerns
- transparent data flows (full visibility / traceability of software component interactions)
The ideal concept of a unified Information Environment replaces standalone applications and all forms of web pages and services. All data and surrounding software functionality is fluid, with no hard boundaries, usage limitations, or mandatory presentations. Everything is interwoven on-the-fly to meet the users' present information and interaction needs. There are no applications or pages to switch between, though users will typically assemble workspaces related to the scopes of current tasks. The user brings their private IE across any devices they interact with. It is their singular digital nexus – unique, all-inclusive, and personally optimized.
The everyday user experience of an Information Environment is whatever it needs to be, in the moment, to interact with whatever information is relevant to a task or situation. It is not defined by certain UI paradigms or modalities. InfoCentral envisions a practical replacement for application functionality in the form of captured interaction patterns around information, rather than pre-designed user experiences. Interaction Patterns are declarative rulesets that define multi-party, multi-component data management and computational orchestrations. (A simple, high-level example is a business process workflow.) They are rendered by a local framework, based upon the current mode of interaction, available UI devices, and user preferences. Because patterns do not encapsulate data or custom business logic, they do not limit information re-use, as contemporary software applications usually do. Neither do they hide the flow of data behind the scenes. Likewise, patterns do not even assume that the user is a human, thus serving as integration points for automation and AI.
The ultimate expression of the Information Environment concept is the yet-unrealized vision of Ubiquitous Computing. This term may be defined, most simply, as the turning point at which most computing technology has become fully and effortlessly interoperable, such that complete integration is the default. All of the futuristic academic goals for UC follow from this basic property – from a safe and effective Internet of Things to advanced AI applications. The web has brought us to a certain level of interoperability, in terms of providing a standard human-facing UI framework through browsers, but it has failed to create interoperability at the data model, semantic, and logic levels. The fact that proprietary, self-contained applications have returned so strongly with the rise of mobile computing is a sobering demonstration of this. We must find a different solution that is secure, consumer-friendly, non-proprietary, and yet still market-driven. For the sake of privacy, it is imperative that users are in full control of the technology that will soon deeply pervade their lives.
Decentralized Social Computing
InfoCentral's proposed architecture provides a substrate for new mediums of communication and commerce designed to encourage rationalism, civility, and creativity. All aspects of social computing will become first-class features, woven into the core software architecture, rather than depending on myriad incompatible third-party internet services. This will increase the default level of expressiveness, as all public information becomes interactive and socially networkable by default. Decentralization will guarantee that it is not even possible to strictly limit the interaction around published information, leaving filtering and prioritization up to end-users and communities.
Improved collaborative information spaces will revolutionize how we manage and interact in our increasingly globalized and hyper-connected world. We desperately need better integration and contextualization – the ability for all assertions and ideas to be understood and engaged in a holistic context. Greater civility and novel expression will result if all parties are given a voice, not only to share and collaboratively refine their ideas but to engage other sides formally, in a manner of point and counterpoint, statement and retraction. Layered annotations and continuous, automated, socially-aware discovery can be used to keep information fresh with the latest discourse and relevant facts. Even in the absence of consensus, the best information and arguments can rise to the top, with all sides presenting refined positions and well-examined claims. This contrasts with the traditional broadcast mode of competing channels, controlled by a single person or group and inevitably biased accordingly, with no reliable mechanisms of feedback or review. Likewise, it contrasts with the chaotic multi-cast mode of microblogging, where interaction is like an unstructured shouting match and has limited mechanisms for refinement or consolidation. By making contextualization default, echo chambers of isolated thinking and ideology can be virtually eliminated. As in science, refined consensus is more powerful than any individual authority claim. Faulty ideas can be more quickly eradicated by exposure to engaged communities and open public discourse. This includes encouraging greater internal debate that harnesses diversity within groups. Meanwhile, among commercial applications, traditional advertising can be replaced by customer-driven propagation of reliable information.
Decentralized social networks have inherently different properties than those of current-generation centralized solutions, all of which depend upon a single, large, trusted third-party and its economic realities. Decentralized designs can support features that are either technically or practically impossible to realize or guarantee in centralized designs. These include:
- direct, independent, offline-verifiable trust mechanisms
- full transparency and history of all data involved in interactions
- inability to monitor users' private interactions
- inability to force advertising
- inability to shape viewership via hidden algorithms
- inability to impose access or interoperability restrictions
- standardized open data and functionality that survives industry competition
- stable platform for development not subject to shifting business interests
- seamless layering of specialized public and private social networks for myriad purposes
- independence from reliable internet connectivity
These distinctions do not, however, imply that centralized services cannot be provided as optional private layers on top of decentralized public networks. Such services may include all manner of indexing and analysis derived from public information, typically at scales that favor dense infrastructure. In addition, not all decentralized designs have all of the listed features, as tradeoffs sometimes exist with convenience and QoS objectives. InfoCentral promotes flexibility here, allowing users and communities to discover which are most useful in different contexts.
The Developer Perspective
To understand how the proposed InfoCentral designs affect developers, it is necessary to make a high-level first pass at the software architecture. Within the Information Environment model, users and public software components interact solely by appending new information to a shared graph of immutable data entities. Software components watching the graph are notified and respond accordingly, typically based on formally-defined interaction patterns, which codify expected behaviors. This entirely replaces use of traditional APIs and application-level protocols in user-space and across networks. For example, to send a message, a user will publish a "message" data entity that references a known “mailbox” data entity by a hash value. (Replication toward the recipient(s) happens automatically behind the scenes.) To update a document or database entity, users or software components push revised entities that link to the previous revisions and/or a “root” entity. To implement a business workflow, users chain annotations of tasks, statuses, requests, approvals, fulfillments, etc. onto relevant business data entities. This form of information and interaction modeling naturally lends itself to declarative programming paradigms and clean, modular designs.
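To make this interaction style concrete, the following minimal sketch (in Python) shows a message being "sent" by appending a new immutable entity that references a known mailbox entity by hash. The entity layout, field names, and helper functions are hypothetical illustrations only, not the Standard Data Entity format specified later in this proposal.

```python
# Hypothetical sketch: the entity layout and helpers below are stand-ins,
# not the Standard Data Entity specification defined later in this proposal.
import hashlib
import json

def hid(canonical_bytes: bytes) -> str:
    """Hash Identity (HID): a secure hash of an entity's canonical bytes."""
    return "sha3-256:" + hashlib.sha3_256(canonical_bytes).hexdigest()

def make_entity(content: dict, references: list) -> tuple:
    """Build an immutable entity; its identity derives solely from its bytes."""
    canonical = json.dumps({"refs": references, "content": content},
                           sort_keys=True).encode("utf-8")
    return hid(canonical), canonical

# "Sending a message" is simply appending a new entity that references the
# recipient's known mailbox entity by hash -- no API call, no mutation.
mailbox_hid = "sha3-256:<previously learned HID of the mailbox entity>"
msg_hid, msg_bytes = make_entity(
    {"type": "message", "body": "Meeting moved to 3pm."},
    references=[mailbox_hid],
)

# A repository tracking references to mailbox_hid notifies the recipient's
# software components; replication toward them happens behind the scenes.
print(msg_hid)
```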
The proposed level of integration is made possible by not giving exclusive control of any information to particular application code – or even a standardized object or event model. An Information Environment hosts software components that are infinitely composable but not coupled to each other or the information they work with. This compares closely to the essence of the original Unix philosophy. As the universal interface of Unix is the text stream, the universal interface of the Information Environment model is the graph of immutable data entities. Compositions between public components have patterns or contracts that describe allowed behaviors. Interactions must also take into account the possibility of conflicts, given the default concurrent nature of operation. This is made tractable by the immutability and multi-versioning features of the data model and is typically handled by lower-level patterns that deal with behavior under conflict conditions. Most of these can be shared across interactions and even be standardized close above the ontology level.
It's important to note that public software components are distinguished from private code that implements them. Most simply, public components must communicate over standardized graph data while private code may use direct local interfacing. The IE model intentionally specifies fewer private implementation requirements, to allow for adequate flexibility and also re-use of existing codebases. Again, a parallel can be found with the old Unix model. Public IE software components are roughly analogous to small composable Unix programs that have a design imperative to be simple, well-documented, and do one thing well. Implementation of public components strongly favors functional paradigms and should itself focus on maximum re-use. The appropriate level of public vs. private granularity is a software engineering exercise. Regardless, all constituent code must be contained in graph data entities and all code references must be grounded in hash identity.
Developing software within the InfoCentral paradigm is entirely unlike contemporary application development. There is a stark absence of boilerplate coding like data wrangling, networking, and lifecycle management. Likewise, data schema and ontology design is lifted out of individual projects and becomes a global collaborative effort. Any code deliverables are usually small and independent of specific local project objectives. Integration work, the high-level declarative weaving of modular functionality into specific interaction patterns, typically outweighs writing new component code. Interaction patterns themselves should often be open collaborative works, as many users will need similar patterns. Unlike contemporary software, there is little room or motivation for redundant implementations of everyday schemas and interactions. The resulting consolidation will bring a long overdue simplification and standardization to users and developers alike.
Because the separable work units are so small and well-defined, the natural economic model for most development is purely labor-based, involving contract work to build upon global, open source codebases. Typically, a project contract would not be considered fulfilled until any new public schemas, interaction patterns, and components have been approved by the oversight communities involved. This provides default quality-assurance and ongoing maintenance, regardless of the nature of the local project.
InfoCentral-style architecture has many other economic ramifications. In contrast to the web, it will result in a shift in revenue streams from advertising to direct service-oriented schemes. While the public data graph is fully open and nominally free, the opportunity for market-driven value-added services is enormous – network QoS, indexing and search, views and filtering, subscription, large-scale analytics, mediated interactions, compute services, etc. Private services built upon the public data graph are fully orthogonal and competitive because no party controls the graph itself. Consumers may easily change service providers because all underlying data is portable and uniformly accessible by default. Even if some services use proprietary code behind the scenes, the services are consumed over open data and the user is still in full control of how the results are used.
Methodology
InfoCentral is intended to be an open and unifying collaborative project. The project is primarily focused on building architectural foundations, not a complete software stack or “platform.” Any implementation work seeks only to help establish a new software ecosystem that will rapidly outgrow the project. Likewise, there is no intention to name any design or specification using “InfoCentral” as though a brand. To do so would undermine the nature of the project's work toward open industry standards.
The first priority of the InfoCentral project is to design and promote a universal persistent data model that can be shared among the dozens of projects working in the space of distributed systems. This is highly strategic in that it provides a neutral, minimal-infrastructure point of unification. It is also a practical initial goal with immediate benefits and early applications not dependent on the more challenging software layers. Most other projects have started with new languages and/or development environments, forcing adoption of the most unstable and research-heavy aspects of their designs from day one. The InfoCentral approach creates a wide bridge between all new and existing systems, separating what needs to evolve from what can be agreed upon today. This allows both cross-pollination between projects and a pragmatic migration path for legacy systems. It creates something that will outlast the fun tech demos and toy implementations, without hindering their progress in any way.
The new software ecosystem envisioned will require a bootstrapping phase, but a critical advantage over other “grand-unification” proposals is that InfoCentral seeks not to over-specify designs for the sake of expediency. InfoCentral is not a pedantically clean-slate project. While some absolute architectural guidelines are drawn, any existing technology that fits is permissible to reuse – even if temporarily, through an interface that serves as an adaptor to the new world of fluid, graph-structured information. Conversion may proceed gradually, as dependencies are pulled into the new model. It will be possible to start with low-hanging fruit. There are many simple use cases that derive from the basic architectural features but do not require immediate adoption of a complete stack. (For example, the data and network model are useful independent of the more nascent Information Environment research area.) However, as developers collaborate globally to rebuild common functionality under the new model, an avalanche of conversion should begin. As the benefits become obvious and dependency trees are filled out, it will quickly become cheaper and easier to convert legacy systems than to maintain them.
The InfoCentral philosophy toward technological and downstream social change is to provide tools rather than prescription and pursue undeniable excellence rather than advocacy. The only reliable and ethical method of convincing is to demonstrate elegance and effectiveness. When adequately expressed, good ideas sell themselves. In practice, most people expect new technology to provide immediate benefits, have a reasonable learning curve, and not get in the way of expression or productivity. Any violation of these expectations is a deal-breaker for mainstream acceptance and a severe annoyance even for enthusiasts. Likewise, social technology must not require significant life adjustments to suit the technology itself. In most cases, technology must be nearly transparent to be acceptable.
In the long term, the InfoCentral model contains enough paradigm-shifting design choices that substantial relearning will be required. However, this need not occur at once. The key is to make any new system intuitive enough that self-guided discovery is sufficient to learn it. With InfoCentral, the most challenging new concepts are graph-structured immutable data and the fluidity of software functionality. These are shifts toward more intuitive computing, yet they may temporarily be harder for seasoned users and developers to grasp. It will be critical to build many bridges to help with the transition.
InfoCentral will require a strong and diverse base of early adopters, and early contribution must be rewarding. As soon as possible, implementations should feel like a place for fun exploration, like the early days of the web. InfoCentral offers a haven for academics, visionaries, and tinkerers to experiment within. It likewise offers cutting edge technology for consultants who want to differentiate themselves from the masses of web software developers. It offers businesses a competitive advantage if they can find ways to use the technology to improve their efficiency before competitors. For some developers, InfoCentral will represent a departure from the disposable-startup culture and an opportunity to build things that will last and permanently improve the world.
Just for Fun
Most of us chose the field of computing as a career because we somehow discovered the joys of dissecting complex machines, creating new ones, and solving hard puzzles. After extended time in industry and/or pursuing academic directions, this lighthearted enthusiasm can be lost. I hope that projects like this can help many of us regain our original fervor. I believe that a new software revolution is just around the corner, with a creative modus that will satiate the fiercest nostalgia for the long-past golden age. Very little of what now exists will remain untouched, and fresh ideas will be overwhelmingly welcome once again. The means of novel expression will again feel powerful and rewarding, unhindered by decades of legacy baggage and boilerplate.
The InfoCentral Design Proposal
This initial design proposal is intended to be relatively informal and not fully exhaustive. It is a work-in-progress and is intended to inspire collaboration. Many of the topics contained within will be treated elsewhere with greater depth. Likewise, formal research will need to be conducted to refine specifics of the design, particularly those left to implementation flexibility.
The proposal is presented here in a static document form. It represents merely a seed and a historic artifact, as the InfoCentral project is launched publicly. It is the work of one author and by no means the final authority on design matters. Henceforth, this content should be transitioned to a collaborative medium.
What follows is a dense summary outline of the architectural features, characteristics, and rationales of core InfoCentral designs. It has largely functioned as an organizational artifact, but now serves as a way to introduce key concepts and provide a project overview. The outline structure was chosen to avoid continual maintenance of paragraph flow and extraneous verbiage while developing raw ideas. It also makes the relation of points, subpoints, and comments explicit. For new readers, it will be helpful to make multiple passes on this material, as certain subtleties will become apparent only after broad perspective is gained. The design and writing process itself has consisted of countless refining passes. Each section has some overlap, sometimes using generalized concepts that are elaborated elsewhere.
Unlike a formal academic paper, this exploration's varying scope and long timeframe have made references difficult to manage, so no complete list will be attempted. This publication should be seen as a practical engineering proposal and vision-casting effort, with academic rigor to be added later.
Architecture for Adaptation
General
- The InfoCentral design proposal is intended to specify as little as possible, favoring a layered architecture that can evolve over time. This is in honor of time-tested internet engineering philosophy.
- The high adaptability of the InfoCentral architecture is provided by the base data model being the only invariant. Everything above it (information and software) and below it (repositories and networks) can change arbitrarily. The architecture proposed for these areas ultimately serves to make application of the data model feasible, but it can also be discarded if better approaches emerge.
Scalability
- InfoCentral designs have no inherently centralized components that could constrain scalability.
- Global scalability is supported by appropriate choice of distributed data stores and distributed compute resource management.
- Proposed designs are lightweight enough for trivial but compliant implementations, suitable for embedded and sensor applications, the Internet of Things, etc. Nodes have no connectivity assumptions and may operate fully independently. This contrasts to distributed systems designs that require some level of initial or periodic connectivity for operation.
Data store and network abstraction
- The same data model is used for all storage schemes, security models, and network topologies.
- All aspects of InfoCentral designs are implementable in distributed and decentralized fashion, but there is also no requirement to do so.
- The base data model is not tied to any protocol, network or data structure (ex. DHT, blockchain), or any other external artifact.
- The InfoCentral architecture contrasts to projects that mandate a blockchain or Merkle DAG as a network and data model primitive. However, InfoCentral designs can fully support these as higher-order features.
- ex.) Usage of a Merkle DAG to direct distributed replication is an optimization for the network layer. It would be unclean to leak this usage into the data model, though it may be confined as Repository Metadata.
- The generalized InfoCentral network model is a dynamic mesh of layered public and private networks, rather than a single hierarchical network that attempts to approximate universal routing, resolving, and dereferencing capabilities. This is the most fundamental departure from traditional web and internet architecture.
- InfoCentral architecture is natively suited for Information-Centric Networking (ICN) schemes, with convenient transition and bridging strategies for host-based networks.
- The general network model does not impose a single namespace or dereferencing authority. A variety of layered dereferencing approaches will be used, customizable for different needs and usage patterns.
- Nothing prohibits data entities from being tagged with hierarchical routing metadata, internally or externally, for networks that employ centralized or blockchain-based routing authorities. This may prove useful as a disposable transition technology to help with early routing scalability concerns until full-scale ICN is deployed. However, routing metadata may not be used to reference entities within the Persistent Data Model.
- In the hourglass network component analogy, the InfoCentral architecture “thin neck” is the generalized graph of immutable data entities, rather than any network or software artifact. This makes graph data itself future-proof, permitting unlimited evolution of all technology surrounding it.
- Management of reference collections is the center of all data propagation and aggregation strategies. The general model is to collect reference metadata (knowledge about relationships between entities), filter on provenance, and disseminate it to other nodes.
- New data and reference metadata can be pushed, gossiped, broadcast, etc. depending on the network scheme. The base architecture purposely does not specify this.
- The InfoCentral data model is suited for layering multi-party information and managing this at the network level, something that neither Merkle DAGs nor namespaces support. First-class references in InfoCentral always point “up” toward existing entities rather than “down” toward children.
- Merkle DAGs can be implemented by creating second-class references, within data entities, that point the other direction. Network participants are welcome to use these to optimize replication accordingly, but this is tentatively not a baseline feature because it is too subject to changes in design and usefulness. For large data, more natural decompositions will eventually replace the practice of splitting large binary objects. Other uses of Merkle DAGs tend to be application-specific (tree data structures) and should typically be handled at a higher level.
- Filesystems and Git-style DVCS can easily be implemented on top of the base data model, but hierarchical naming and associated data structures are considered a degenerate use case in contrast to unrestricted graph semantics. Likewise, arbitrary-string naming is metadata for human consumption only and is not supported as a first-class feature of the data model. (ie. there are no name fields in the base standards)
- Cryptographic namespaces can be developed on top of the InfoCentral model, as part of various optional routing mechanisms. However, these are logically centralized, typically around hash-identified public keys, and thus have a single point of failure. They are probably best reserved for managing temporary pointers to trusted network information, rather than data identities. (see Permanent Domains)
High-level programming abstractions
- The graph data model serves as a unified persistence standard and a neutral public interface among all software components, while remaining independent of software artifacts. Immutability of graph data entities means that the graph is effectively append-only. This ensures that interactions are conflict-free at the persistence / network level, eliminating the need for complex coordination mechanisms and enabling unrelated software to independently operate on the same data without supervision. Each participant must, of course, have the ability to handle conflicts at the information / semantics level.
- Declarative and functional programming paradigms are the most naturally suited for working with immutable graph data and higher level structures built upon it. Functions themselves are captured within immutable data entities, making re-use and composition trivial.
- The focus on working with shared graph data without artificial boundaries is embodied in the concept of Information Environments, localized computing workspaces that users (both human and AI) operate within. IEs are entirely fluid and programmable workspaces for processing data selected from the global data graph. They are also the nexus of physical interaction, via human-computer interfaces and other devices.
- Although interactions via appending to the global data graph are ultimately unrestrained, users and software components may agree upon declarative Interaction Patterns that codify precise behaviors and allow for contractual agreement on how to use graph data to accomplish meaningful tasks and communication. Collaborative catalogs of generic patterns will emerge, analogous to object-oriented design patterns. Interaction Patterns also serve as the replacement for application-specific protocols and interfaces. (A sketch of one such pattern follows this list.)
- Use of Interaction Patterns over shared graph data yields transparent data flows, allowing any system to be openly inspected and analyzed. While particular software components may nominally represent black boxes, the high level of decomposition and sharing is a disincentive to hide internals of publicly consumed software. Any code module must have a strict contract by which its operation may be understood and independently verified.
- Graph-based visualizations of software data flows will become meaningful to even less technical users. This is analogous to a spreadsheet, wherein the formulas for every cell are visible to users.
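As a purely illustrative sketch of the Interaction Pattern concept above, a simple pattern might be captured as declarative data along the following lines. Every field name shown is invented for the example; real patterns would use standardized, hash-identified schemas agreed upon collaboratively.

```python
# Hypothetical sketch of a declarative Interaction Pattern expressed as plain
# data. All field names are invented for illustration; real patterns would use
# standardized, hash-identified schemas rather than ad hoc dictionaries.
document_approval_pattern = {
    "description": "Two-party document approval",
    "roles": ["author", "approver"],
    "trigger": {
        "on": "new-reference",            # a new entity references the target
        "entity-type": "revision",
        "target-role": "document-root",
    },
    "obligations": [
        {
            "role": "approver",
            "must-append": {"type": "approval", "referencing": "revision"},
            "within": "P3D",              # ISO 8601 duration: three days
        }
    ],
    "conflict-policy": "latest-approved-revision",
}
```

Because such a pattern is itself just graph data, its execution by a local framework leaves a fully transparent, inspectable trail of appended entities.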
Persistent Data Model
Summary
The Persistent Data Model defines how units of first-class persistent data are contained and referenced. It is the only invariant component of the InfoCentral architecture, around which all other components are freely designed and may evolve with time. To avoid contention and the need for future revision, applicable standards will be as minimal as possible.
Standardized mutability and reference semantics are the foundation of the InfoCentral data architecture. The Persistent Data Model necessarily defines the only model of persistent data allowed for user-space software and the only form of publicly shareable and referenceable data allowed among networked data repositories. In promotion of Information-Centric Networking, it serves as the “thin neck” in the hourglass network model.
Discussion
Standard Data Entities
The Standard Data Entity is the data container for all storage and transport. It specifies the data structure but does not mandate the final encoding schemes used by data stores and transport mechanisms. However, any encoding of an entity must yield the same binary data when resolved to the canonical format. (A schematic sketch of the block structure follows the list below.)
- Standard Data Entities are always treated as immutable.
- Standard Data Entities do not have names or other assigned identities. Their identity only exists as or is derived from the intrinsic properties of their content.
- Entity immutability refers only to the permanent relationship between entity data and derived reference identities. No technical rules forbid the arbitrary deletion of entity instances.
- There is no authoritative network location of an entity. Multiple copies of an immutable entity will typically exist within and among networks, driven by usage popularity. Conceptually, a repository is a transient host of immutable data.
- The Standard Data Entity specification defines header, content and metadata payload, and signature table components of an entity, each of which are contained in standardized blocks.
- While header, metadata, and signature blocks have a standard structure and encoding, content payload block(s) are raw data with no restrictions on encoding, language, or typing. Aggressive standardization should proceed toward the best available designs, however, and no weakly-typed encoding scheme should be endorsed. This would be antithetical to the overall programming regime proposed.
- The header is always plaintext.
- Other than the header, all components of a Standard Data Entity may be individually encrypted. Payload blocks and the entries in the signature block are enumerated in canonical order, such that they may be referenced externally by entities providing keys to decrypt them. This allows a single entity to contain private data for multiple parties and allows for flexible access control.
- ex.) A common access control pattern will be to allow partially-trusted users / systems to read certain metadata blocks while fully-trusted users are given keys to read all payload blocks.
- ex.) Forward secrecy messaging protocols often require a message to be separately encrypted to multiple parties, using pre-keys, ratchets, etc. By containing all copies within a single entity, via multiple payloads, the message gains a single permanent representation for external hash reference and re-use.
- To promote uniqueness, a nonce may be included within an entity's header. This avoids reference hash collisions for entities having unencrypted, unsigned payloads containing common messages or values. The Persistent Data Model specifies required nonce usage for certain cases. (ex. empty root entities)
- Revisions are represented by new immutable data entities, inherently having new HIDs. Normally, a revision should directly reference its parent entities. Systems that track references may thereby associate them.
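The schematic sketch below illustrates the intended block structure. It is not the specification itself; the field names, types, and the placeholder serialization are assumptions made only for illustration, and the canonical encoding is left to the formal standard.

```python
# Schematic sketch only: the Standard Data Entity specification defines the
# real canonical encoding; field names, types, and the length-prefixed
# serialization below are illustrative assumptions.
from dataclasses import dataclass
from typing import List
import hashlib

@dataclass(frozen=True)                  # immutability by construction
class StandardDataEntity:
    header: bytes                        # always plaintext; may carry a nonce
    payload_blocks: List[bytes]          # content/metadata blocks, individually
                                         #   encryptable, in canonical order
    signature_block: List[bytes]         # signature table entries, canonical order

    def canonical_bytes(self) -> bytes:
        # Placeholder serialization; a real implementation follows the
        # specified canonical format exactly.
        parts = [self.header] + list(self.payload_blocks) + list(self.signature_block)
        return b"".join(len(p).to_bytes(8, "big") + p for p in parts)

    def hid(self, algorithm: str = "sha3_256") -> str:
        """Any secure hash of the canonical bytes can serve as a Hash Identity."""
        return algorithm + ":" + hashlib.new(algorithm, self.canonical_bytes()).hexdigest()

entity = StandardDataEntity(
    header=b"nonce=42",
    payload_blocks=[b"content block", b"metadata block (possibly encrypted)"],
    signature_block=[],
)
print(entity.hid())
```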
References
All references to other entities within Standard Data Entities must use secure hash values of the canonical serialized data of referenced entities. This enforces the immutability property among Persistent Data and ensures that references cannot go stale.
A secure hash value used in a reference may conventionally be called a Hash Identity (HID) of the referenced entity because it is based solely upon the totality of the entity's intrinsic properties. (ie. the entity's content self-identity) However, as there are unlimited calculable hash values for a piece of data, there is no particular HID that constitutes the canonical identity for an entity. References always specify which hash function was used.
- Hash values are only used for referencing. Authenticity must be provided through other means. (signatures, Message Integrity Codes, etc.)
- References to the same entity using different hash functions are considered to be equivalent in the data model. (This shall also extend to graph semantics in higher layers.)
- When building reference collections, repositories should typically consolidate equivalent references to the same entities.
- In practice, most references to public entities should employ a common hash function, to maximize aggregation.
- Multiple hashes may be used in a single reference, for greater security. (See the sketch following this list.)
- Each hash function in a multi-hash must use a different cryptographic construction, as do SHA-2 (Merkle–Damgård) and SHA-3 (sponge function).
- In practice, an entity may be requested by any known hash value (or truncation thereof) that a repository is aware of, then verified locally using all HIDs contained in the reference, along with other metadata.
- Repositories will typically maintain indexes of HIDs generated using multiple popular hash functions, to allow convenient lookup.
- At the time of this writing, SHA2-512/256 and SHA3-256 are good defaults for public references. However, we may instead favor always using truncated 512-bit hashes, to make upgrading easier later.
- All URI schemes are considered deprecated within the Persistent Data Model.
- Dereferencing of URIs is disallowed, due to identity semantics that are incompatible with the mandate of intrinsic immutable identity.
- External legacy data must first be sampled into Standard Data Entities before it can be referenced.
- URI support may exist external to the Persistent Data Model.
- ex.) When snapshotting or archiving existing web content, it is useful to capture URIs and timestamps of where and when data was retrieved, as a bibliographical record.
A reference may contain metadata for the entity being referenced. This metadata must always pertain to publicly-visible intrinsic properties of an entity. (i.e. It must be able to be derived solely from plaintext (or undecrypted ciphertext) information contained in the entity referenced.) Inclusion of metadata in a reference is always optional and does not change the equivalence of references to the same entity.
- Though it cannot add unique information, intrinsic entity metadata is immutable relative to the supplied hash values and may be useful in software and network orchestration.
- Examples of intrinsic properties suitable as metadata include an entity's canonical data size, an included nonce value or signature, a set of flags of contained metadata types, a hash of a signer's public key (only for included plaintext signatures), etc.
- Required metadata field support for references is covered by the Standard Data Entity specification. This will take the form of reserved keys and value data types.
- As a hedge against the possibility that the Standard Data Entity specification would need to be revised in a way that modifies the canonical serialization format, we should require support for an SDE version metadata field, which defaults to 0 if not specified.
- Arbitrary non-standard metadata fields are allowed but generally discouraged for public systems.
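The following sketch suggests how multi-hash references and equivalence checking could work. For portability of the example it pairs SHA-256 with SHA3-256 rather than the SHA2-512/256 default suggested above, and the reference layout is an assumption, not a specified format.

```python
# Illustrative sketch: the reference layout is assumed, not a specified format.
# SHA-256 (a Merkle-Damgard construction) and SHA3-256 (a sponge construction)
# stand in for the suggested SHA2-512/256 + SHA3-256 pairing, for portability.
import hashlib

_ALGS = {"sha2-256": "sha256", "sha3-256": "sha3_256"}   # label -> hashlib name

def multi_hash_reference(canonical_bytes: bytes) -> dict:
    """Reference an entity by two HIDs built from distinct hash constructions."""
    return {label: hashlib.new(name, canonical_bytes).hexdigest()
            for label, name in _ALGS.items()}

def verify(candidate_bytes: bytes, reference: dict) -> bool:
    """An entity fetched by any one known hash is verified against all of them."""
    return all(hashlib.new(_ALGS[label], candidate_bytes).hexdigest() == digest
               for label, digest in reference.items())

# References to the same entity made with different hash functions are
# equivalent; repositories consolidate them within reference collections.
entity_bytes = b"canonical bytes of some Standard Data Entity"
ref = multi_hash_reference(entity_bytes)
assert verify(entity_bytes, ref)
```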
Standard Data Entity Specification Development
Because it is the absolute foundation, Standard Data Entity specification revision should be avoided at all cost. Once a public standard is ratified, it must be supported forever. Standard Data Entity public specifications should mirror the historical stability of simple clean standards like HTTP. Extensions must be preferred before any changes that break compatibility.
Entity standardization is largely defined by enforcement at the Repository interface. Any new public Standard Data Entity specification would coincide with a new Repository interface version, which would support all previous versions. Existing hash-based references would therefore be unaffected. A request by-hash over the network is generally content neutral, in that it can proceed without awareness of the entity specification it applies to and what Repository interface will respond to the request. However, if useful, reference metadata could contain a field for the SDE revision of the referenced entity.
For the sake of prototyping, we can build a temporary, unstable, non-public entity specification and associated Repository interface. As a precaution, it would probably be wise to include a tiny header field that indicates the prototype nature and perhaps a pre-ratification version number. Once a public standard is ratified, prototype data can be converted, but re-establishing references among widespread multi-party data will be difficult, especially where cryptographic signatures were involved. Thus, prototype implementations should strictly be used for either non-valuable data or closed-group private applications, wherein coordination of wholesale conversion is feasible.
Large data handling
Large data may be broken into chunks, each held in its own first-class entity. Merkle hash trees can be used to efficiently aggregate and validate these data chunk entities. (A brief sketch follows the list below.)
- The root of a Merkle hash tree is the source of hash identity suitable as a reference target. Data chunk entities are not considered to have any independent value when separated from a tree, unless clean decomposition points are explicitly used in place of arbitrary size chunks. (ex. media frame or sample boundaries)
- Large data hash trees are useful for efficiency of transmission and data recovery. They also support efficient revision of large data sets that cannot otherwise be meaningfully decomposed. The general approach is to modify only the needed chunks and create a new hash tree representing the revision.
- Stream data may be supported by growing the hash tree accordingly.
- Repositories and networks may support optimizations for large data handling
- Local optimizations include storing chunks contiguously.
- Typically, hash-tree child HIDs should be stored unencrypted in tree node entities' headers. (or perhaps an unencrypted metadata payload block, TBD) This allows even simple, cache-like repositories to perform large data optimizations.
- Cooperating systems may opportunistically perform Bittorrent-like replication for very large data. This would be entirely transparent to users.
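The sketch below builds a simple binary Merkle hash tree over fixed-size chunks and then revises one chunk to produce a new root. Chunk size, node layout, and tree shape are assumptions for illustration; actual chunking and tree encodings are left to higher-level standards, and each chunk and tree node would be held in its own first-class entity.

```python
# Illustrative sketch: chunk size, node layout, and tree shape are assumptions,
# not a specified encoding. In practice each chunk and tree node is held in its
# own first-class entity; here a node is simply the hash of its children.
import hashlib

CHUNK = 256 * 1024                          # illustrative chunk size

def h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()

def chunk_hashes(blob: bytes) -> list:
    return [h(blob[i:i + CHUNK]) for i in range(0, len(blob), CHUNK)]

def merkle_root(hashes: list) -> bytes:
    """Pairwise-combine hashes until a single root remains."""
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

original = bytes(3 * CHUNK)                 # some large binary object
root_v1 = merkle_root(chunk_hashes(original))

# Revision: modify only the affected chunk, then publish a new tree whose root
# identifies the revision; unchanged chunk entities are simply reused.
revised = original[:CHUNK] + b"patch" + original[CHUNK + 5:]
root_v2 = merkle_root(chunk_hashes(revised))
assert root_v1 != root_v2
```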
Design Rationales for the Persistent Data Model
- promotes permanence of data identity, for reliable present and future reference
- treats dereferencing as a separate concern from identity, promoting independent network evolution
- hash-based references do not reveal anything about the referenced data, allowing private encrypted entities to be managed across public networks
- simplifies support for diverse versioning schemes and distributed architectures
- provides a default data integrity mechanism
- supports efficient caching and default copy-on-write architecture
- immutability semantics are a natural fit for functional programming
- purely mathematical data identity replaces fragile authority mechanisms
Temporal Data Model
Summary
The Temporal Data Model defines a general model for arbitrary, non-first-class data that is not required to be standardized but may be useful behind the scenes for managing the state of physical interfaces, networking, and data management facilities. It is not related to memory models for user-space software environments.
Discussion
- Temporal Data is subject to arbitrary change and does not have an immutable identity-value relationship. (ie. similar to traditional files or database records)
- Temporal Data is not part of the global persistent data graph. Used properly, there should be no information lost if it is destroyed. It can be regenerated from persistent data or else only has real-time or run-time value.
- Common uses will include local and runtime data, network and resource management, raw streams and sensor data, etc.
- Temporal Data is not first class:
- It is not accessible via standard repository interfaces facing the public network or user-space software environments.
- It is unable to be referenced as the subject or object of a statement or be the target of first-class metadata.
- Temporal Data identity is not globally unique, but rather subordinate to particular repositories or federated networks thereof. Repositories may choose how to privately name and index mutable data.
- Support for temporal data is not mandatory. Developers may choose to exclusively use the Persistent Data Model, even for information considered highly disposable.
- Temporal Data entities are effectively internal data structures of a repository or its physical environment, never publicly distributed between non-federated repositories.
- Temporal Data is not directly accessible by user-space (ex. Information Environment) software components, as these only work with Persistent Data externally. However, indirect access is possible by first sampling temporal data into first-class persistent entities.
- At first glance, an exception seems tempting for real-time interaction, such as games and some multimedia applications. However, the Standard Data Entity model is lightweight enough that even disposable content like operational transform data makes sense over Persistent Data. In addition, there are always security tradeoffs in exposing more public network-facing interfaces and data structures. Information Environment design principles thus strictly prohibit this. If real-time remote protocols are necessary, this falls within the domain of private IE interfaces, which are used strictly among trusted instances and over a secure channel.
- Temporal data must always be linked to persistent entities, via repository-local references. It need not use HID references and may use any form of local addressing.
- Unlike Standard Data Entities, the containers and data structures for mutable data are not defined. Practical standards will emerge to aid repository and system-level developers.
Continuous data sources
Temporal Data may be used to hold temporary continuous data from sensors, human input devices, and other stream data sources.
- Continuous sources are always configured via Persistent Data, such as sample periods and sample pools. Standard configuration schemas will be developed for this.
- On-demand samples can be obtained via Persistent Data, by creating a request entity and waiting for it to be fulfilled. (See the sketch following this list.)
- Sources may periodically publish a list of samples as Persistent Data.
- Data should often be signed and timestamped by the source itself, such as specialized sensor hardware.
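As a rough sketch under assumed schemas, an on-demand sample request and its fulfillment might look like the following. The field names, the linkage via references, and the signature placeholder are all illustrative, not standardized.

```python
# Rough sketch under assumed schemas: field names, linkage conventions, and the
# signature placeholder are illustrative, not standardized.

# 1. A consumer publishes a request entity referencing the sensor's root entity.
sample_request = {
    "type": "sample-request",
    "refs": ["sha3-256:<known HID of the sensor's root entity>"],
    "quantity": "temperature",
}

# 2. The sensor, watching references to its root, publishes a fulfillment
#    entity referencing the request -- signed and timestamped at the source.
sample_fulfillment = {
    "type": "sample",
    "refs": ["sha3-256:<HID of the request entity above>"],
    "value-celsius": 21.4,
    "timestamp": "<time of sampling>",
    "signature": "<signature by the sensor hardware's key>",
}
```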
Network orchestration data
The hash-based references of the Persistent Data Model do not provide any information about where to locate copies of data entities. Temporal Data may be used to annotate first-class references with dereferencing hints. However, if this information is to be cryptographically signed or contain non-trivial metadata, it is preferable to use Persistent Entity Metadata rather than inventing a redundant container model. This way, it can also be shared between repositories or networks. Persistent Data may simply be allowed to expire if a historical record is unneeded.
Design Rationales for Temporal Data
- provides a separation of components and data that are subject to arbitrary change and do not need strong identity
- promotes flexibility in the design of hardware and system-level software components that need not be globally standardized
- provides a space for low-level management and performance optimizations not easily possible under the stricter design constraints of the Persistent Data Model
- Many repository internals do not need strong identity or versioning semantics.
Summary
Although Standard Data Entities are immutable, repositories collect and propagate metadata about which entities reference each other and, if visible, for what purpose. These per-entity reference collections are the sole point of mutability in the data architecture and allow for practical management of data for revisions, graph relationships, external signatures / keys / access controls, and annotations of all kinds. The collection-based model supports arbitrary layering of public and private third-party information across networks. Synchronization of collections may occur across repositories and networks thereof, to propagate revisions and new metadata, manage transactions, etc.
Discussion
- Repositories track known references between persistent entities, by maintaining reference metadata collections per entity. Because hash-based references are made within immutable, hash-identified entities, the resulting graph of entities is inherently directed and acyclic. Reference collections point in the opposite direction, from an entity to known entities that reference it. This allows for top-down traversal patterns.
- Example: A certain entity representing an instance of concept “human person” may be described by a “birth date” property via another entity. The entity with the property references the person entity by a HID. Having a copy of the person entity, in isolation, does not reveal the existence of the entity containing the birth date property. However, a repository that knows about this particular person entity also collects known references to it – perhaps, in this case, including the entity with the birth date property. The reference collection thus allows known related data to be traversed. Of course, it is up to smart repositories and networks to propagate useful information to appropriate locations.
- As with web links, which are also unidirectional, hash-based references do not require multi-party coordination at creation time. Likewise, this comes at the expense of separate work to propagate knowledge of these links. On the web, referenced-by indexing is performed by search engines, but the collected reference graph is usually not publicly exposed. In contrast, the InfoCentral model treats the reference graph as a native feature and promotes this as a pillar of Information-Centric Networking. Unlike web URIs, hash-based references never go stale, thus reducing the overall workload.
- Collectible Reference Metadata may include anything a repository can ascertain about a referencing entity. This includes Intrinsic Entity Metadata. If a repository has keys allowing it access to encrypted entity data, this is also suitable as a source of reference metadata. Care must be taken here. Reference metadata gleaned from private data must not leak onto public networks.
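The following minimal sketch illustrates the referenced-by indexing described above: a per-entity collection that maps each known HID to the entities that reference it. Class and method names are hypothetical; a real repository would also track reference and metadata types, as discussed later.

```python
from collections import defaultdict

class ReferenceCollections:
    def __init__(self):
        # target HID -> {(referencing HID, reference type)}
        self._refs = defaultdict(set)

    def record_reference(self, source_hid: str, target_hid: str, ref_type: str):
        # ref_type would be one of: "hard", "soft", "virtual", "local"
        self._refs[target_hid].add((source_hid, ref_type))

    def referenced_by(self, target_hid: str):
        # Top-down traversal: given an entity, discover known referencing entities.
        return sorted(self._refs[target_hid])

index = ReferenceCollections()
index.record_reference("sha256:birthdate-entity", "sha256:person-root", "hard")
print(index.referenced_by("sha256:person-root"))
```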
Collection Management
Reference metadata may be arbitrarily collected and disposed of for any persistent entity. Relative to the abstract global data graph, no information is lost if reference metadata is destroyed, because it can always be regenerated from Persistent Entities. It is local knowledge and may use the Temporal Data Model. Schemes for managing reference metadata are repository implementation specific.
- There is no requirement that an entity must exist in a repository managing reference metadata for it. This allows separation of original information from 3rd-party annotations, hosted in separate repositories.
- Care must be taken to discover equivalent HID aliases, which arise when collected references to the same entity use different hash functions. This provides motivation for reasonable default hash standards, especially for public information that will most commonly involve many 3rd-party references.
- If a repository does not have a copy of a particular entity's data, for the sake of HID index building, trusted metadata that supplies externally-calculated hash values may be used. This may also be used as a performance optimization, to avoid hash recalculation among trusted or federated repositories.
- Example: If entity A makes a reference to HID X, entity B makes a reference to HID Y, and entity C makes a multi-hash reference to HIDs X and Y, a repository should consider entities A, B, and C to refer to the same unknown entity having HIDs X and Y, assuming that there is reason to trust the source of entity C. (A minimal sketch of this aliasing logic follows this list.)
- As a corollary, a repository need not hold a copy of either entity involved in a reference it records; it may even hold nothing but reference metadata.
- In some networking schemes, when a repository does not store its own copy of a known entity, it may collect repository IDs for locations where that entity may be fetched.
- Each copy (or null stub) of a persistent entity, across various repositories, has its own reference metadata collection.
- Any reference metadata collection synchronization between repositories is optional and implementation specific, though standards will exist for all common cases.
- It is inherently impossible to achieve a globally consistent view of all available reference metadata for an entity.
- Some entities and associated reference metadata may intentionally not be shared outside a repository, such as private revisions or annotations. A repository's access control rules determine what is exposed over its public Persistent Data interface.
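The following sketch, referenced from the multi-hash example above, shows one way a repository might merge HID aliases using a union-find structure. The class and its methods are hypothetical illustrations, not a specified algorithm.

```python
class HidAliases:
    def __init__(self):
        self._parent = {}

    def _find(self, hid: str) -> str:
        self._parent.setdefault(hid, hid)
        while self._parent[hid] != hid:
            self._parent[hid] = self._parent[self._parent[hid]]  # path halving
            hid = self._parent[hid]
        return hid

    def alias(self, hid_a: str, hid_b: str):
        # Called when a trusted source (e.g. a multi-hash reference) links HIDs.
        self._parent[self._find(hid_a)] = self._find(hid_b)

    def same_entity(self, hid_a: str, hid_b: str) -> bool:
        return self._find(hid_a) == self._find(hid_b)

aliases = HidAliases()
aliases.alias("X", "Y")               # from entity C's multi-hash reference
print(aliases.same_entity("X", "Y"))  # True: A and B refer to the same entity
```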
Entity Reference Types
Collectible Reference Metadata always includes the type of a reference and often includes the type of metadata provided by the referencing entity, if visible. Knowledge of entity reference and metadata types can be useful for repository management purposes, such as allowing subscription to events for desired types. Formally, Entity Reference Type refers to the manner in which a reference is made between entities.
- Hard References
- Definition: A hard reference is a reference to another entity held in an entity's plaintext header, using a field called the Hard Reference Anchor.
- It is always a singular entity reference. The hard anchor is considered the subject of data or metadata contained in the referencing entity.
- Indexing of hard reference metadata is typically not discretionary. If a repository allows an entity to be stored, it must index its hard anchor. Likewise, it should index discovered hard references from trusted sources, whether or not the entities are stored.
- The Hard Reference Anchor, as with all header fields, may not be encrypted. However, both the referenced and referencing entities may otherwise have encrypted payloads. In cases where access patterns are considered a side-channel information risk, network-level protections must also be employed.
- The rationale for only allowing a singular entity reference in the hard reference anchor is that multiple references (ie. multiple subjects) represent bad information architecture (insufficient decomposition or specificity), would require complex set-revision support for header data, and could be more easily abused to cause excessive network churn.
- Soft References
- Definition: A soft reference is any reference within an entity that doesn't use the Hard Reference Anchor.
- Repository indexing of soft reference metadata is optional and requires access to and understanding of an entity's metadata and/or payload content blocks. Entities may simultaneously provide a hard reference and any number of soft references.
- Having a large number of soft references in one entity is acceptable because their indexing is discretionary. For instance, if such an entity is revised, a repository may remove some or all of the previous revision's soft references from applicable collections. Likewise, soft references from an untrusted or unapproved entity author will commonly not be indexed.
- Virtual References
- Definition: A virtual reference is a reference that does not explicitly exist in the entity for which it is being associated.
- Virtual references are not part of the first-class data model, because they are not grounded in hash-identified entities. They can be considered association hints generated by a local repository or higher-level software components.
- A repository may report virtual reference metadata via its public interface. The reference type must be marked as such.
- A virtual reference may exist toward many entities in a certain collection. In the case where identical metadata is to be applied to all entities, this eliminates the need to store many redundant copies – where each entity would have the same metadata but merely a different entity hard reference.
- ex.) If every entity in a certain collection is encrypted with the same key or has the same ACL, each entity can have a virtual reference pointing to the relevant permission entity. Thus it is not necessary to produce many copies of a permission entity that each reference a single entity in the collection. Because a virtual reference may be more easily lost, this should be used with caution and only in well-defined private repository scenarios where the information is unlikely to ever be shared or distributed externally.
- Annotations from older revisions can be virtually referenced, if determined to cleanly apply to more recent revisions. Ideally, such annotations would always be updated in coordination with the annotated entities, but the nature of distributed, global, multi-party interactions obviously makes this impossible.
- Virtual references can be used to aggregate or highlight related information. These are typically generated automatically, as an optional service of a repository.
- A virtual reference may be materialized by creating equivalent persistent entities, whose references are first-class. This is necessary for public export to a non-federated repository or system.
- Local References
- Definition: A local reference is used to attach temporal data to an entity. It is not publicly sharable outside the repository that created it.
- Local references may use any suitable reference scheme. There is typically no reason to use public HIDs if a repository-local index key can be used.
- Local references are most commonly used to attach repository or network data management metadata.
- allows easy management of multi-source layered data and metadata
- supports external annotation
- supports private and/or unofficial revisions
- supports repository and network management artifacts, without interfering with first-class persistent data
- eliminates the need for conflict resolution in maintaining collections of references and associated metadata
- Conflict resolution is always an information-space concern, not a raw data management concern.
- dramatically simplifies core server architecture
- A typical simple repository will internally consist of a lightweight graph database, to store HIDs and reference metadata, and a key-value storage back end, for bulk entity data.
- suitable for data-streams and broadcast / gossip-based protocols
- lends itself to CRDT structures and protocols
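As a rough illustration of the reference types above, the sketch below shows how a repository might classify references while ingesting an entity. The entity field names are hypothetical; only hard and soft references appear, since virtual and local references are generated repository-side rather than carried in the entity.

```python
def classify_references(entity: dict):
    refs = []
    anchor = entity.get("header", {}).get("hard_anchor")
    if anchor:
        # Hard references are indexed unconditionally if the entity is stored.
        refs.append(("hard", anchor))
    for hid in entity.get("payload", {}).get("references", []):
        # Soft references: indexing is discretionary and may depend on trust.
        refs.append(("soft", hid))
    return refs

entity = {
    "header": {"hard_anchor": "sha256:document-root"},
    "payload": {"references": ["sha256:prior-revision", "sha256:ontology-node"]},
}
print(classify_references(entity))
```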
Summary
The Standard Data Entity design provides all entities with the ability to hold both “self subject” metadata about their own payloads and external metadata about other entities. The subject of external metadata is always the hard reference anchor.
As a point of terminology clarification, Intrinsic Entity Metadata refers to the raw data properties of a Standard Data Entity in its canonical encoded form, absent of any external decoding, decryption, context, and interpretation. Persistent Entity Metadata refers to metadata stored within an entity that relates to the realm of Persistent Data at large and is subject to external context and interpretation. However, the term “metadata entity” is inappropriate, because there is no such distinction. All entities may contain data and/or metadata.
There are five standardized categories of entity metadata: General, Repository, Anchors, Signatures, and Permissions – with the convenient mnemonic “GRASP”. Within each category, all common types of metadata shall be standardized. These should cover nearly all possible applications of metadata. Non-standard custom types are allowed but discouraged for public data. These must be explicitly designated as such, to avoid collision with future extensions of standardized types.
General Metadata is basic data about an entity, such as types or timestamps. It does not include any intrinsic information about entity data, as this can always be derived.
Repository Metadata is used to provide history related to an entity and its context or to give directives on how an entity should be managed. It primarily involves revision and retraction data, but may also include other related state that needs to be persistent, like transaction contexts or expiration conditions.
- Revisions are normally hard-anchored to a root entity, which serves as a collecting point for revision data. As a fallback, a common ancestor may be used if the root is unknown. (This should be considered a temporary revision branch.) A root entity typically contains original data, but may also be an empty stub.
- Specification of a revision's parent(s) is part of its (self) Repository Metadata rather than using the hard anchor. These may be used as soft references. (ie. for each parent)
- Disposable, repository-local data structures such as vector clocks, metadata collection synchronization states, active subscriptions, etc. are typically managed as Temporal Data, outside the Persistent Data Model.
Anchor Metadata connects an entity to specific data within other entities. It is used to link comments, markup, links, entity and property graph composition relationships, discussions and other interactions, and all other forms of annotation.
- All annotations have a hard reference to the entity being annotated. They may also contain one or more media-specific anchors to the annotated data, such as text or audio positions. This constitutes anchor metadata. Annotation content itself is stored in a content payload block.
- Direct Annotations are entities that contain annotation content in their own payload. (ex. a footnote, markup, or other very small data that is unlikely to be independently useful in a different context)
- Link Annotations are entities that reference external annotation content. (ex. an external entity providing a comment) This is a soft reference. A Link Annotation frequently has anchor metadata (such as media-specific anchors) for both the subject being annotated and the target content.
- Both Direct and Link annotations may need to be updated if the entity annotated is updated, to maintain consistent reference and anchor locations.
- Some software components may attempt to automatically apply old annotations to new revisions, by comparing contexts. Virtual metadata may be used here. Typically, the user should be advised when this occurs, in case the approximations are inaccurate.
- In most cases, it is best to create Link Annotations that directly apply old annotation content to new revisions, making the relation first-class and sharable. This can be done by a third party and does not require the original annotation author's involvement. Of course, that author's signature truly only applies to the original context and the new link is the work of the third party. Annotations known to apply strictly to a particular revision should be marked as such, at creation time. (ex. editor's notes on a particular draft revision)
- Annotations and other anchors should point to the most abstract entity that logically makes sense. For example, a general annotation about a document should point to the document root, not one of the textual revisions. Referencing that which is least likely to change is usually optimal.
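A hypothetical Direct Annotation, following the conventions above, might be structured as sketched below: the hard anchor names the annotated entity, Anchor Metadata narrows the subject to a media-specific position, and the annotation content lives in the payload. All field names are illustrative assumptions.

```python
annotation = {
    "header": {"hard_anchor": "sha256:text-revision-3"},  # entity being annotated
    "metadata": {
        "anchors": [
            # Media-specific anchor: a character range within the text payload.
            {"kind": "text-range", "start": 120, "end": 168},
        ],
    },
    "payload": {"type": "Comment", "text": "This sentence needs a citation."},
}
```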
Signature Metadata is used to make various attribution, approval, and authentication assertions about entities, whether self or external. It is also used to parameterize cryptographic aspects, such as required signing key IDs. This metadata is then expected to be covered by appropriate cryptographic signatures appended to the containing entity.
- Because an entity's cryptographic signatures cover any Signature Metadata, as part of the entity payload, there is no need for separate internal and external signature schemes. An external signature is simply a signed entity whose subject (hard reference anchor) is the external entity it makes assertions about.
- Obviously, external signatures of encrypted entity data cannot provide meaningful claims of authorship without proof of knowledge of the plaintext, such as an HMAC. Typically, such claims are instead made internally.
- Cryptographic signatures themselves are always contained in an entity's signature block, a simple table data structure appended at the end of an entity in canonical form. Signatures provide authentication of the primary data of the entity, not including the signature block itself. (Obviously, a signature cannot sign itself.) Signatures themselves may be encrypted.
- Signatures always cover the canonical representation of header and payload data, in whatever final encoded and/or encrypted form. A plaintext signature is thus able to be checked without first decrypting or decoding payload data.
- Authentication of encrypted data requires expected signing key ID(s) in an encrypted metadata block. This prevents another party from trivially signing the same encrypted data. This authentication extends to all payload blocks using the same key.
- When an entity contains only public plaintext data, attribution claims must be strengthened using notarization, shared ledgers, and ad-hoc weaving into the global graph of hash-referenced entities. (similar to how a blockchain transaction is strengthened as new blocks are added past its own) Obviously, cryptography alone cannot prevent multiple parties from claiming authorship of the same plaintext and then debating who created it first.
- For messaging use cases, surreptitious forwarding is avoided if a signed entity contains any secure reference to the intended recipient(s). This may include a hard anchor reference to a private mailbox or group chat root, as the header is covered by signatures.
- It is nominally possible to create derivative entities by arbitrarily adding, removing, or re-ordering rows in the signature block. Such derivative entities would necessarily hash to different IDs. A malicious party could use these legal permutations in an attempt to fork metadata collections for otherwise identical, trusted data. To prevent this, the intended signing key IDs and table entry order should usually be provided among the Signature Metadata. It may also specify whether the key IDs are a closed set, such that an entity with additional signatures would be considered invalid.
- An alternative mitigation is to have repositories alias any derivative entities, since their header and payload data is identical. Derivatives with appended untrusted signatures could be dropped or heuristics could be used to detect aberrant behavior here.
- Signatures may be chained, such that a signature may cover not only the entity header and payload data but also preceding signature(s) thereof. Signature table entries have a 'chained' field that specifies a set of previous signatures by simple numerical index.
- Chained signatures sign a hash of entity data concatenated with previous signature(s) in their canonical form, encrypted if applicable. This allows for various interesting security properties, such as public notarization of a privately signed entity.
- In order to construct an entity with chained signatures, intermediate hashes of previously-signed data are sent in a signature request interaction. (Such as with a trusted public notarization service.)
- Multiple independent signature requests may be processed simultaneously, such as two notarizations of the author's signed data. The entity author will create the final signature table once all relevant signatures have been collected.
- Signature Metadata will typically specify all signature chains.
- An entity is only considered valid if signatures from all keys specified in Signature Metadata are included in the signature block. If some signature table entries or Signature Metadata are encrypted, it may not be possible for a public repository to tell whether an entity is valid. However, if chained signatures are used, a trusted plaintext signature can effectively vouch for those which are not visible. This scenario should typically be specified in plaintext signature metadata, so that it is expected and unambiguous. This allows public partial validation of entities having encrypted signer IDs.
- Because signature checking is somewhat expensive, some large public repositories (eg. ISP-level caches) may not perform this validation step. It is far more important that entity consumers fully validate received data, to avoid the metadata collection forking attack discussed.
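The chaining rule above can be sketched as follows. HMAC is used purely as a stand-in for a real public-key signature scheme, and the signature-block layout is an illustrative assumption rather than the canonical format.

```python
import hashlib, hmac

def sign(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()  # placeholder signer

entity_bytes = b"<canonical header + payload encoding>"

# Ordinary signature: covers the canonical header + payload bytes.
author_sig = sign(b"author-key", entity_bytes)

# Chained signature: the author computes the intermediate hash of entity data
# concatenated with the signature(s) to be covered and sends it in a signature
# request; the notary signs without ever seeing the entity plaintext.
chained_input = hashlib.sha256(entity_bytes + author_sig).digest()
notary_sig = sign(b"notary-key", chained_input)

signature_block = [
    {"key_id": "author", "sig": author_sig, "chained": []},
    {"key_id": "notary", "sig": notary_sig, "chained": [0]},  # covers entry 0
]
```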
Signature Aggregates
Signature Aggregates are lists (or Merkle DAGs) of entity references that are signed at once, within one entity, rather than creating multiple entities that each hard-reference a single entity to be signed. In appropriate situations, this is an alternative and potentially more efficient means of external signature.
- Aggregates themselves are typically hard anchored to a well-known entity that represents a user's official publication collection.
- Signing is a one-time interaction that is unlikely to cause churn in normal usage. Aggregates are not treated as versioned sets. A retraction would be published separately, as a single-entry revision upon the aggregate.
- Disadvantages:
- Signature aggregates are published after amassing a large number of entities to sign. They are rarely useful for signature applications where greater immediacy is needed.
- References in a signature aggregate are indexed as soft references, because they are not the hard anchor. Propagation is typically not as straightforward, as soft reference indexing is discretionary. In addition, prior awareness of the user's publishing node is required.
- Applications:
- Aggregates are most appropriate for signing a large number of related proximate entities, where data consumers are already aware of the signing party. (ex. among an academic, business, or community network)
Permission Metadata provides access and authorization security artifacts such as cryptographic keys, MACs, and parameters, Access Control Lists (ACLs), roles, management rules, etc.
- Permissions are always external metadata, to separate access management from entity creation.
- Symmetric keys for entity payloads and/or signatures are typically encrypted to particular users' or repositories' public or pre-shared keys. These are contained in standard entities provided adjacent to the entities they protect.
- There is no requirement to use long-term keys. Permission metadata shall include standardized support for ephemeral key schemes and ratchets, for use cases needing forward secrecy and/or repudiability. Appropriate key management protocols are outside of the base Persistent Data Model spec, but implementations can be orchestrated over Persistent Data.
- For efficiency, Permission Metadata is ideally distributed only to those able to use it. However, this is not always easy, especially if key IDs are not visible.
- ACL and role metadata designs will be standardized, to ensure portability across repositories.
- While specification of user / group identity must be standard, authentication strategies are left to repository / network implementations.
- Repository directives are network or persistence level rules for entities and collections. Active directives may be public or hidden.
- access times: specific time periods when an entity may be fetched
- expiration times: a time or limit, after which an entity will become inaccessible or be automatically deleted
- download limits: rate controls, maximum download count, etc.
- challenges (CAPTCHAs, proof-of-work, etc.) required for read access or publishing new entities
- This can be used to protect against DoS, spamming, etc. by moderating write access to publicly visible reference collection points (ex. for message inboxes)
- Challenges are orchestrated over Interaction Patterns, parameterized by the repository.
- available subscription / interest scopes for an entity (ex. “revision metadata for only this entity”, “all metadata for adjacent entities, up to some maximum traversal depth”)
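A hypothetical external permission entity might be shaped as below; every field name and the key-wrapping representation are illustrative assumptions, not a standardized schema.

```python
permission_entity = {
    # Hard-anchored to the entity whose payload key it distributes.
    "header": {"hard_anchor": "sha256:protected-entity"},
    "metadata": {
        "permissions": [
            # Each entry wraps the symmetric payload key to one recipient key ID.
            {"recipient_key_id": "key:alice", "wrapped_key": b"<key encrypted to alice>"},
            {"recipient_key_id": "key:bob",   "wrapped_key": b"<key encrypted to bob>"},
        ],
    },
}
```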
Notes
Metadata surrounding encrypted yet publicly accessible data must be used carefully and should usually be encrypted itself. The final link of the security chain is hiding user key IDs, to avoid exposing participation and communication patterns. However, users must be able to efficiently find and retrieve Permission Metadata keys that grant them access to particular data. A variety of options exist to provide this link, such as classic user authentication and ACLs on a private repository. An option more suitable to open public repositories is to index key IDs to Permission Metadata entity HID references and encrypt this index with a final group key known only to the appropriate participants. A further option is to privately push keys to specific user repositories after publishing the related entities publicly. If the overhead is acceptable for the use case, a variety of double ratchet key management algorithms may be used. Permission and Signature metadata shall include standardized support for the appropriate fields needed.
Root Entity Anchoring
Summary
Root entities serve as anchoring points within the global graph. They represent particular persons, organizations, objects, ideas, topics, revisioned documents, ontology classes, etc. and are widely referenced by statements, metadata, etc. Anchoring root entities can be difficult, however. Anyone can create a root intended to anchor some concept or instance, and this can easily lead to unintentional duplicates and divergent data. Consensus must evolve over time, usually through organized community processes. For instance, expert groups may form around the management of different ontologies and eventually become de facto authorities for generating these, ideally aided by data-driven NLP / ML techniques since the work is so vast. In other cases, root entities anchor artifacts with a limited scope. These roots must often be created by individuals, who do not have the weight of a trusted community's signature behind them. On the other hand, limited scope reduces potential for ambiguity and incentive for abuse.
Discussion
- Root entities themselves should usually be anchored in some way that unambiguously and uniquely establishes their context. A truly empty root entity (with a nonce only) could be abused by using it in a different context than the original. This could result in nonsensical or hazardously conflicting metadata and statements should the collections used in different contexts be merged.
- Including type information in a root entity is helpful but insufficient because a malicious user could simply re-use a root entity for the same type of entity elsewhere.
- Including a natural language description in a root entity is helpful, if humans can assist in the verification, but it is best to avoid unreliable manual processes. Still, it is not bad as a fallback for particularly important root entities, and NLP advances may someday grant this approach more utility.
- Inclusion of initial data in a root entity, such as a first document revision, typically falls under the same category as natural language. Automatic derivation of context is less straight-forward when it relies upon specific higher-level processing of entity data.
- There are comparatively few cases where new independent roots are needed. Normally, large trees of entities (including many other roots) will branch out from each strongly-anchored root entity. Context also becomes self-reinforcing as graphs of information grow and weave together. A global body of trusted metadata around even a weakly-anchored entity provides abundant context, so long as this metadata remains widely available. Thus, to an extent, root anchoring is most necessary as a bootstrapping measure. Public-key signatures may often be sufficient for this reason. For high-profile information, it is wise to use multiple anchoring techniques, in case one fails.
- Root entities that do not contain any original data, besides a nonce, may be referred to as “stub root entities” or simply "stub entities."
Anchoring Methods
There are currently five known methods of unambiguously anchoring an entity, listed below in decreasing order of preference. These may not be individually sufficient in all situations, but they may be combined. It should be strongly noted that only the first two currently reside entirely within the Persistent Data Model. The third and fourth could be implemented within Persistent Data in the future. The fifth is the only approach that definitively requires out-of-band validation.
- Include a hard or soft reference to another entity that is already well anchored and that provides sufficient context by association. (Multiple references increase context.)
- Include one or more cryptographic signatures that establish authorship context. If the signing key is ever retracted, the anchoring is weakened or lost. Key expiration is not a problem, so long as the signature was made before the expiration date.
- Register a hash or signature within a distributed public ledger or similar network. (ex. those using cryptographic block-chains for distributed consensus) An address within this network or ledger would then need to be provided within the root entity. Ideally, such a system could be implemented on top of Persistent Data, making it a first class identity anchoring scheme that needs no out-of-band validation.
- Register with a centralized identity authority or digital notary service. Then, include a token or signature with the root entity that can be externally verified accordingly. Such a service could also be built upon Persistent Data.
- Reference an existing, authoritative, permanent identity created by a traditional centralized naming authority. (ex. government-issued official person or place names / numbers, ISBN, DOI, etc.) This should primarily be done as a transitional measure, as part of importing existing information or migrating legacy systems.
Property Graph Data Model
Preface
The Semantic Web / Linked Data vision remains deeply inspirational, despite relatively low industry uptake. Many hurdles, both technological and economic, have prevented its full realization. Most involve added development costs and/or market disincentive to publish open data. We believe that some of these issues can be fixed by small improvements to the data model and supporting network models, making it easier and cheaper to produce open data and supporting software.
Contemporary software architecture is also an impediment to the Semantic Web. Object-oriented encapsulation and rigid, long-lived class hierarchies are a poor fit for working with independent, multi-sourced data having flexible classes and ontologies. Likewise, composition via endless custom APIs and adapters is fragile and negates the benefits of native semantic data. We believe that the Semantic Web ultimately needs a new software architecture that takes full advantage of the data model and enables new development methodologies and economics. The InfoCentral PGDM is intended to be used within such a regime, as will be elaborated later in this proposal.
Retooling of the Semantic Web for content-based addressing would benefit Information-Centric Networking and decentralized internet efforts. Existing standards are designed around the classic web model of centralized publishing authorities and mutable resources with human-meaningful names. While not strictly tied to these assumptions (ex. alternate URI schemes), their allowance could be an impediment to the evolution of fully decentralized, machine-friendly architectures. Some aspects of Semantic Web designs are explicitly incompatible with the InfoCentral architectural guidelines. The greatest mismatch involves the InfoCentral mandate of immutable data via hash referencing. This has sprawling implications. Obviously, it conflicts with the Linked Data recommendation of HTTP URIs. Likewise, the use of networkable reference metadata collections conflicts with some assumptions around RDF data stores and related notification APIs and vocabularies.
Due to the immense complexity of knowledge representation, this aspect of the InfoCentral Initial Design Proposal is expected to be the most subject to change. Likewise, the Semantic Web effort by no means covers the extent of classical graph theory or the engineering of practical graph databases. Superior data models may emerge in the future and there should be no problem using these within the InfoCentral architecture. This highlights why we have pressed for such strict separation of concerns.
Introduction
The InfoCentral Property Graph Data Model is a recommended set of specifications for persisting semantic graph-structured data on top of the Persistent Data Model. It borrows heavily from the Semantic Web effort toward universal knowledge representation standards but aims to be both simpler to understand and better aligned to decentralized networks. When complete, PGDM will support a superset of the expressive power of RDF and its existing semantic extensions. Designs borrow from Notation3 (another superset of RDF) and propose an enhanced n-ary relation extension. The choice of name "Property Graph" vs. "RDF statement graph" or similar is complex. Technically, what is being modeled is a labeled property graph, though entities with PGDM content are often expected to be ingested into RDF-style graph stores in higher software layers. As such, PGDM is something of a hybrid design.
Some features of RDF (and dependent standards) are effectively factored into the Persistent Data Model, aligning them to the supporting storage and network management layers. Besides simplifying some optimizations, this factoring lets other data models benefit from the low-level (non-semantic) graph-like features of the Persistent Data and Collectible Metadata models. Likewise, a single mandatory binary serialization ensures canonical representations match and eliminates needless wrangling of competing languages and formats. (Human readability is a UI concern that does not belong in data architecture.)
Some of RDF's features and special cases are discarded, such as labeled blank nodes. Where these features carry over into standards built upon RDF, we will likewise propose amendments. (ex. It will be necessary to create a derivative of the OWL RDF-Based Semantics, ie. "OWL PGDM-Based Semantics") However, a goal is to enable direct translation of everything built with current tools to the InfoCentral model.
Discussion
InfoCentral promotes extensive decomposition of information into graphs of small, immutable, uniquely-identified entities. Entity payloads may hold statement triples (subject, predicate, object/value), strictly using HID-based references and strongly-typed values. This promotes a number of critical properties:
- improved re-use and composability of information, across past and future schemas
- high precision and stability of reference, for clean annotation and composition across data sets
- default versioning support
- complete separation of property graph data from storage and network artifacts
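As a rough sketch of this decomposition, a single statement entity might look like the following, where the hard anchor supplies the one external subject and each statement carries only a predicate and an object. The encoding shown is an illustrative stand-in for the mandatory binary serialization, and all names are hypothetical.

```python
statement_entity = {
    "header": {"hard_anchor": "sha256:person-root"},  # subject of all statements
    "payload": {
        "statements": [
            # Predicates are HID references to ontology-defined property entities;
            # objects are HID references or strongly typed literal values.
            {"predicate": "sha256:pred-birth-date",
             "object": {"type": "date", "value": "1970-01-01"}},
            {"predicate": "sha256:pred-knows",
             "object": {"type": "ref", "value": "sha256:other-person-root"}},
        ],
    },
}
```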
Root Entity Usage
Root entities often serve as the logical identities of graphs of versioned and composited information. For example, a “text document” root is an entity around which text components and their revisions are collected. Likewise, a “human person” root entity would establish a permanent identity around which a vast amount of information is collected about an individual person. A HID reference to the root entity is the subject of many statements, as with URIs in the Semantic Web. At initial retrieval time, the entire known graph adjacent to a root entity is typically transferred.
- Finer decomposition of information reduces contention during multi-party editing and interaction.
- Because it is often undesirable to fetch the entire version history dataset of an entity or subgraph, views should be used to consolidate recent version data. Consumers of a view may drill down later, following references to relevant predecessor data, etc.
Identity and Reference Semantics
- Standard Data Entities containing property graph statements may only have a single external subject, specified by the hard reference anchor.
- Entities containing arbitrary RDF documents are not forbidden, as there are no rules about payload content. However, this data is not considered first class, as it does not follow the InfoCentral Property Graph Data Model. Such usage should be reserved for legacy data import and archival.
- Standard Data Entities may use themselves as the subject of statements, such as within a root entity that holds the original versions of statements.
- Standard Data Entities may have self metadata, wherein the subject of statements is the entity container itself, not property graph vertices. Self metadata is limited to the General or Repository categories, such as typing and provenance data. Internal Signature metadata is also considered self metadata, from the property graph perspective, though it is handled specially.
- The 'self' subject has special identity semantics, because an entity obviously cannot contain its own hash values in the same way that an RDF document on the web can contain its own HTTP URI. In a concrete graph representation, any valid reference HID may be considered an alias subject of self statements contained in an entity.
- If there are statements with different subjects that actually refer to the same real-world object, an alias may be annotated across redundant root entities and left for future resolution. Care should be taken to avoid this by attempting to locate an existing subject identifier when creating statements. If a provisional root entity is locally created as a subject, say during disconnected operation, rebasing should occur later as soon as an established root can be located.
- This is part of a general theme of non-isolated database systems, wherein record creation is factored across many responsible parties, inverting the control of most information. For example, instead of a business managing personal information for a customer record, it would simply link to this information elsewhere -- perhaps sourced from customers' own repositories or a third party identity management service.
- The Semantic Web uses Named Graphs as the unit of (context / data container) identity above RDF triple self-identity. This allows a form of annotation that does not require reification of individual statements, although only at the granularity of Named Graphs. The InfoCentral model goes further by disallowing all explicit reification syntax. To make a statement about a statement or collection thereof, a HID reference must always be used.
- Standard Data Entities containing statements are equivalent to tiny RDF Named Graphs that are restricted to holding statements about a single subject.
- Individual statements within a Standard Data Entity may be referenced using an index number, via an Anchor Metadata field in the referencing entity. This does not give statements first class identity in the Persistent Data Model, but does guarantee an unambiguous, immutable subjugate identity, since it is grounded by a HID reference. It is therefore considered first-class identity within the Property Graph Data Model. This is similar to RDF reification shorthand, whereby statements may be given a resolvable URI using the rdf:ID property, which is externally referenceable using a fragment identifier.
- The immutability of containing entities makes statement identity implicit via a numerical index. Unlike RDF documents (real or notional), there is no need or ability to manually assign IDs to statements, since the index suffices. This also means that there is no need to rely upon original authors to add ID properties. Precise third party annotation of statements is possible by default.
- Multiple entailment regimes are possible regarding statement triples and reifications thereof, as implied by subjugate statement identity. Following existing RDF standards, the default is to treat index-referenced (implicitly reified) statements as particular concrete instances of abstract statements, such that collected metadata may increase their context / effective dimensionality. This ought not replace proper information modeling, however.
- Anchor metadata may also specify a set of statements, thereby allowing for multiple statement subjects within the hard-referenced entity. However, these will all share the same predicate/object values in the referencing entity's payload data. Combined with the single-entity subject limitation, this should be sufficient to avoid excessive revision churn.
- Statement component paths are also supported in anchor metadata, allowing references to statement predicates and values. (the statement subject is always the hard reference)
- These take the form: [statement #]/[predicate|value]
- Alternatively, we may simply count these while indexing (ie. 3 per statement)
- The ability to give statements (and predicate instances) their own properties renders this implementation a property graph by definition. However, this feature is currently outside of the RDF data model and tentatively is not to be preferred. It may be restricted to certain metadata in the final design.
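The statement component path convention above might be resolved as in the following sketch, which assumes a hypothetical in-memory representation of a statement entity.

```python
def resolve_statement_path(entity: dict, path: str):
    # Paths take the form "[statement #]/[predicate|value]".
    index_str, component = path.split("/")
    statement = entity["payload"]["statements"][int(index_str)]
    # "value" in the path maps to the statement's object/value slot here.
    return statement["predicate" if component == "predicate" else "object"]

example = {"payload": {"statements": [
    {"predicate": "sha256:pred-birth-date", "object": "1970-01-01"},
]}}
print(resolve_statement_path(example, "0/predicate"))
```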
Blank Nodes
NOTICE: This section is highly subject to revision. Blank node handling is a very complex topic that will need careful review by an expert in this area. Readers are directed to the exhaustive academic review of the topic here.
While implicit blank nodes are sometimes permitted with restrictions, labeled blank nodes are disallowed in the InfoCentral Property Graph Data Model. Where possible, every resource should have a concrete first-class identity, so that greater interlinking of datasets is possible. This follows the pragmatic thinking of many in the Linked Data community. A HID reference to even a stub root entity will suffice.
Justifications for restricting usage of blank nodes
- Blank nodes force versioning to be coarser and more costly to compute deltas for, especially with complex nested structures. This greatly complicates support of concurrent and possibly conflicting modifications, which are highly necessary for the collaborative public interaction model that InfoCentral promotes as default.
- Blank nodes in different subgraphs can only be combined using heuristics that are prone to failure. While stub root entities may be aliased later, to similar effect, this combination is external and cannot lose information in the process.
- Naming and reference of blank nodes within a document allows cycles in the graph. This introduces greater than polynomial time complexity (graph isomorphism problem) for comparing graphs and doing entailment checking.
- In contrast to HTTP URI minting and maintenance (domain registration, server configuration, etc.), use of hash-based identity effectively makes skolemization free. A stub entity may be generated, hashed, and added to the graph as required.
- Conversion is easy and low-cost. Blank nodes in existing Semantic Web data are trivial to skolemize during import to the InfoCentral model, by creating stub entities.
- Blank nodes are allowed for statement objects. This permits existential statements with grounded subject identity. For example, contrast “A hasChild blank” with “blank isParentOf B”. To express the same semantics, instead say “B hasParent blank”. Rules languages make bidirectional mapping trivial.
- Fully abstract existential statements (and all universal statements) should probably be stored in logic representations that are outside of the Property Graph Data Model. These are commonly used for creating inference rules.
- ex.) hasParent(?x1,?x2) ∧ hasBrother(?x2,?x3) ⇒ hasUncle(?x1,?x3)
- Interest in storing incomplete facts about anonymous subjects should presumably be less common in the intended collaborative data regime. However, it is possible that PGDM could support a form of anonymous resources by still using skolemization. The stub root entity would be tagged with its intended purpose as an abstract identity and treated with different identity semantics. (ie. it is part of an existential statement, not an instance)
- ex.) {John}.hasChild(?x1); isMale(?x1); (ie. "John has a child that is male.")
- The labeled blank node ?x1 would become an abstract stub root using this feature.
- It would be invalid to later alias an abstract root to a concrete instance, as there is no way to determine whether the abstract statements were in fact about the same subject. For instance, "John has a 10-year-old son named Steve" does not imply that a concrete individual "name(Steve), hasParent(John), age(10)" matches. Amusingly, John may have two sons named Steve who are identical twins and go by the suffixes Jr. and Sr. even though they were born minutes apart.
- This feature is still being debated. It could replace some usage of implicit blank nodes, thereby increasing decomposition.
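A sketch of the skolemization described above: a labeled blank node is replaced by a stub root entity containing only a nonce, whose hash yields a concrete, globally referenceable HID. The encoding is again a JSON stand-in for the canonical form.

```python
import hashlib, json, os

def make_stub_root():
    stub = {"header": {}, "payload": {"nonce": os.urandom(16).hex()}}
    hid = "sha256:" + hashlib.sha256(
        json.dumps(stub, sort_keys=True).encode()).hexdigest()
    return stub, hid

stub_entity, stub_hid = make_stub_root()
# The former blank node is now referenceable by stub_hid in any statement.
```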
Wider perspective
The Semantic Web envisions a world of predominantly independent publishers of incomplete and often duplicate graph data. This is intended to be aggregated ad hoc, often using logical inference to fill the gaps. In contrast, InfoCentral envisions a world of predominantly socially-networked collaborative publishers, who actively improve and consolidate information in a manner similar to a Wiki. Collections of information grow around well-known concept roots. Given the significant cost to create and maintain a stable HTTP URI, the appeal of blank nodes in Semantic Web architecture is understandable, even if highly undesired by data consumers. When using only hash-based identities, there is no such motivation to compromise.
Alternatives for uses of blank nodes
- Multi-component structures
- Use a root entity as the parent of any collection. This ensures that multiple parties can externally contribute to the contents, via "member of" predicates or negations thereof.
- Reification
- Reference existing statements using the HID of containing entities, optionally with an index to particular statement(s).
- Complex attributes
- Refactor the attribute as a proper type, having well-defined properties. An instance of a complex attribute, such as a coordinate or postal address, should have a typed root entity that its properties are contained in or appended to externally.
- Protection of inner information
- Use cryptography and/or access controls to secure entity contents. Surrogate identities may be used to protect references to root identities of private information.
- ex.) multiple temporary customer IDs used when shopping vs. a more permanent personal identifier
Collections and Containers
The concrete RDF syntax for Persistent Data uses root entities instead of blank nodes for collection and container vocabulary. This supports multi-party contribution. The collection or container root becomes the subject for membership statements, contained in entities that reference it. Thanks to network-visible reference metadata collection, even simple passive repositories can aggregate data for such collections.
There are generally three collection patterns to choose from:
- A single versioned “collection” entity that contains a list of references or values (membership statements)
- This method is suited to collaborations wherein multiple parties may be modifying the collection as a unit, adding or removing elements over time. Revisions may conflict and require resolution.
- Semantics should indicate whether this is considered to be an open or closed set.
- This collection pattern is structurally identical to RDF-style collections that use blank nodes. The membership list is syntactic sugar. In verbose form, the containing entity is the implicit subject of “member of” statements for list entries.
- Since it is referenceable, a collection entity can still be the subject of external membership statements, either indexed or unordered relative to the existing list.
- A tree of collection entities that contains a set of references or values in its leaf nodes
- With a sufficiently large collection, this method may be more efficient and/or produce fewer conflicts than a single collection list entity.
- If the collection itself is to be treated as a logical unit, a Merkle hash tree must be recalculated upon every modification. This may represent a significant trade-off.
- Independent references to a collection stub entity
- This method is best suited to collect unordered independent submissions of new information, wherein the parties involved have neither bearing on each others' submissions to the collection, nor any right to remove each others' submissions. A common example is a messaging inbox. Each message will have a hard reference anchor to the inbox root entity.
- Semantics for duplicate membership must be specified, ie.) whether the collection is a multiset (bag)
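The third pattern can be sketched as follows, using a messaging inbox: each message hard-anchors the inbox root, so the repository's reference metadata collection for that root effectively is the inbox listing. Field names are illustrative assumptions.

```python
# Stub root that serves only as a collecting point; it carries no content.
inbox_root = {"header": {}, "payload": {"type": "Inbox", "nonce": "<random nonce>"}}

message = {
    "header": {"hard_anchor": "sha256:inbox-root"},  # HID of inbox_root
    "payload": {"type": "Message", "text": "Hello",
                "sent": "2024-01-01T12:00:00Z"},
}
```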
Ontologies
Data schemas for InfoCentral data are envisioned as components of globally-collaborative ontologies, themselves composed of immutable entities.
- To provide typing information for their contents, entities reference ontology nodes.
- References for typing purposes must point to concrete revisions, not ambiguous roots under which revisions are collected.
- To save space, statements may use indexes instead of full HIDs for specifying predicates contained within a referenced ontology node. Toward this end, some ontology nodes may also have lists of useful related predicate references contained in other nodes. These would be indexed along with its own predicates.
- Ontologies may have private branches, as with any versioned information.
- Globally popular ontologies will evolve over time, ideally guided by community development rules that guarantee forward and backward compatibility. Ratified revisions should be accompanied by bidirectional mappings and/or logic, such that new software can automatically use old data and vice-versa. Degraded operation is permissible, but it must be handled gracefully. For example, a missing attribute may cause a new feature to become unavailable, but the software must not cease to function otherwise.
- Strict normalization and other ontology engineering rules should greatly reduce the need to periodically re-factor schemas for commonplace concepts in business and personal domains.
- Specialized domain concepts are often notoriously complex and difficult to formalize, especially amid active research. (ex. classification criteria in medical and biosciences) Stub root entities, with their globally immutable identity, can assist useful ontology design patterns here. (ex. View Inheritance)
- In the worst cases, the only feasible option is to apply data conversions in one direction. Nevertheless, it should be considered a good pattern to reference all past data and conversion logic in metadata, such that a complete history is maintained. Provenance should be able to be traced at any time.
- Typing is always included within a data entity, not added as an annotation. If the ontology changes later, future data revisions will embed new typing as needed. Old schemas should be kept permanently, ensuring reliable typing for old data.
- Ontologies may be further layered with annotations of rules and logic useful for composing and validating interaction patterns amongst graph data. This serves as an intermediate, declarative layer, building up to high-level software components.
Encoding
A compact, strongly typed, extensible, binary serialization will be the only standard and will be treated as a module of the Persistent Data Entity encoding scheme. Unlike the classic web, there is no preference for immediate human readability of data at storage and transport levels. This legacy design was a beneficial concession in the era of rudimentary text-based tools but is no longer valid with modern tooling. Human readability is a UI concern that does not belong in the data model.
Embedded language-native serializations allow for richer abstract data types (ex. sum types) and programmatic encodings. However, code and data should always reside in separate entities. For instance, a value that was generated using some particular run-length encoder may reference the applicable code module by HID.
Canonical binary serialization is required only for consistent hashing. A wide range of optimizations are expected for storage and transmission, such as differential updates and compact batch encodings. Likewise, entities' property graph data will often be ingested into local data structures for efficient indexing and querying.
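The consistency requirement can be illustrated with a toy hashing routine: any two repositories that produce the same canonical bytes for an entity necessarily derive the same HID. JSON with sorted keys stands in here for the mandated binary serialization.

```python
import hashlib, json

def canonical_bytes(entity: dict) -> bytes:
    # Deterministic encoding: key order and separators are fixed.
    return json.dumps(entity, sort_keys=True, separators=(",", ":")).encode()

def entity_hid(entity: dict) -> str:
    return "sha256:" + hashlib.sha256(canonical_bytes(entity)).hexdigest()

print(entity_hid({"header": {}, "payload": {"greeting": "hello"}}))
```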
The InfoCentral graph data model is expected to have greater total overhead compared to many existing Semantic Web systems. Most of this is due to finer decomposition of data, along with the mandatory versioning scheme of the Persistent Data Model. However, the overriding design concerns of InfoCentral are interoperability, decentralization, and future-proofing. Persistent information should be archival-quality by default, even for personal sharing and interaction.
Practical performance can be regained through local indexing, caching, denormalization, and views. In the proposed model, any relational engine or graph store is ultimately a view, backed by the Persistent Data layer. There is local work to continuously map any relevant new data to views currently in use. However, this overhead is well worth the advantages of a universalized data / identity model and the network models it enables.
Data Store Standardization
Summary
A Standard Repository is a data store or network thereof that supports the Persistent Data Model and at least the base Repository public interface. Repositories may participate in any number of networks, and these networks are themselves logical repositories if they expose a public Repository interface. Data entities and reference metadata are propagated within repository networks, as well as between networks and standalone repositories. The Repository interface serves as the sole public data exchange interface in the InfoCentral architecture. However, repository networks may internally use their own private interfaces, such as designs optimized for local clusters, cloud services, or meeting QoS criteria in a large P2P system. This allows wide room for innovation in Information-Centric Networking research while ensuring baseline compatibility.
Discussion
General
- The most basic repository implementation is a passive store of Standard Data Entities, having add-only collections of known reference metadata for each. Such a repository is effectively an unmanaged cache. Besides the plaintext entity header, it is unaware of the content of data entities and makes no effort at access control.
- Repositories may support varied data management strategies for multiple internal data collections.
- Given the immutability of entities, a log-structured storage backend will often make the most sense for raw entity storage. Periodic log sweeping can efficiently handle expired and deleted entities.
Interfaces
- The base public Repository interface is minimalistic. Proposed mandatory standard operations include:
- store(EntityData) – stores a data entity
- retrieve(EntityReference) – fetches a data entity by HID reference
- retrieveMetadata(EntityReference) – retrieves the collection of known reference metadata for an entity, optionally including network hints and other disposable metadata used only for repository and network management
- getRepositoryInfo() – retrieves a reference to the root entity of a particular repository's or network's information, including QoS attributes, usage and storage contracts, etc.
- Standard but optional public interface extensions will include differential transfers, a durable store operation, multi-entity transactions, etc.
- Optional but standardized capabilities involving management, security, indexing, and query are generally accessed over Persistent Data, although there are no hard restrictions against private Repository interface extensions.
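A minimal sketch of the base interface, with the four proposed mandatory operations rendered as Python-style method names, is given below. Parameter and return types are simplified assumptions.

```python
from abc import ABC, abstractmethod

class Repository(ABC):
    @abstractmethod
    def store(self, entity_data: bytes) -> None:
        """Store a data entity (canonical encoded form)."""

    @abstractmethod
    def retrieve(self, entity_reference: str) -> bytes:
        """Fetch a data entity by HID reference."""

    @abstractmethod
    def retrieve_metadata(self, entity_reference: str,
                          include_hints: bool = False) -> list:
        """Return known reference metadata for an entity, optionally including
        disposable network hints used only for repository management."""

    @abstractmethod
    def get_repository_info(self) -> str:
        """Return a HID reference to this repository's own info root entity."""
```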
Subscription
Subscriptions to reference metadata collections or published views use interaction over Persistent Data rather than an additional repository interface. Updates are pushed to requested locations under parameters agreed upon during subscription request and acceptance. (A hypothetical subscription-request entity is sketched after the list below.)
- This model reduces interface complexity and eliminates the need for session state and/or active connections. It is lower-overhead and easily recoverable from all types of failures.
- Repositories may advertise all manner of contracts related to subscription services, from limited open public sessions to paid priority access.
- Because public subscriptions are exposed as open graph data interactions, public networks may intelligently aggregate updates for popular subscriptions. For instance, this may be used to support streaming data propagation from an originating source to millions of clients, without multi-cast modes of transmission or specialized CDNs. This sort of optimization is always entirely transparent to subscribers.
- Private repository interfaces may expose active subscription support. This will commonly be used internally by repository networks or by clients watching for updates arriving locally.
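The following is a minimal sketch, under stated assumptions, of how a subscription request and acceptance might be expressed as Persistent Data rather than as an interface call. Every field name (subject, pushTo, maxRateHz, and so on) is hypothetical; the point is only that the agreement itself is ordinary graph data.

```typescript
// Hypothetical shape of a subscription request expressed as graph data.
// Everything here is illustrative; no such schema has been standardized.
interface SubscriptionRequest {
  subject: string;            // HID of the entity or view whose metadata is watched
  pushTo: string;             // HID of an entity describing the destination repository
  filter?: string;            // e.g. "metadata signed by trusted party PID1" (assumed)
  maxRateHz?: number;         // requested ceiling on update rate
  expires?: string;           // ISO-8601 expiration of the subscription
  requesterSignature: string; // signature over the request payload
}

// The hosting repository would answer by publishing an acceptance entity that
// references the request, after which updates are pushed under the agreed terms.
interface SubscriptionAcceptance {
  request: string;            // HID of the SubscriptionRequest entity
  terms: { maxRateHz: number; expires: string };
  repositorySignature: string;
}
```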
Application to the Persistent Data Model
- Repositories and networks thereof are considered ephemeral hosts of Persistent Data, which is conceptualized as being immutable and self-existent.
- Only Standard Data Entities may be shared between Standard Repositories, as first-class data.
- Standard Repositories and networks are expected to check compliance of entities against applicable standards and to refuse storage or transmission of any that are malformed or that do not match the expected hash value(s).
- The repository interface will support transmission and storage of differential updates among revisions. Implementations may vary.
- While paths are not supported at the network level, reference and entity metadata may provide for alternate lookup and network management schemes.
- So long as data entity immutability and a mapping to the original reference HIDs are maintained, other forms of entity reference may use truncated or alternate local identifiers, within a private address space.
- Federated repository networks or even large-scale Information Centric Networks may use their own internal address space, such as 64-bits of a hash value. Full reference HIDs are always used for verification purposes on entity arrival, ensuring intentional partial collisions are not a security risk.
Application to Software Architecture
The software layers directly above the Repository interface hide all transport, storage, encoding, access, and security concerns from user-space software components. This package of capabilities is known as the Data Management Foundation (DMF). It ensures that user-space Information Environment (IE) software components can interact purely within the abstraction of global graph data, ignoring physical concerns. DMF and IE represent the basic division of roles within the InfoCentral software architecture.
Permanent Domains
Summary
Domains are an optional facility for repository or network identity, never data identity. They are managed via entities holding domain records, which are signed by the keys associated with the domain. The root entity for a domain's records provides the permanent Domain ID, via its hash values. The root entity must contain a list of public-key IDs that may sign its records. Domain record entities reference the root and describe variable aspects of the domain, such as associated repositories or network addresses. Any system that implements a Standard Repository interface may be referenced by a domain record as a member data source. Domain metadata may be used amidst the global data graph to annotate recommended data sources by domain IDs.
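As an informal illustration, the sketch below shows what a domain root entity and a domain record might contain. All field names are hypothetical; only the relationships described above (the root lists signer keys, records reference the root and are signed by those keys) are taken from the text.

```typescript
// Hypothetical contents of a domain root entity. Its hash values become the Domain ID.
interface DomainRoot {
  signerKeyIds: string[];        // public-key IDs permitted to sign domain records
  contract?: string;             // HID of a permanent policy contract, if any
  label?: string;                // human-meaningful but non-unique name
}

// Hypothetical domain record, published as a separate entity referencing the root.
interface DomainRecord {
  domainRoot: string;            // HID of the DomainRoot entity (the Domain ID)
  repositories: string[];        // network addresses or repository-info HIDs
  issued: string;                // ISO-8601 timestamp
  signature: string;             // must verify against one of signerKeyIds
}
```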
Discussion
- Domain IDs are established by root entities holding a public key, signature, or hash reference thereof. These should not be the personal keys of the domain owners, but rather new keys generated specifically for the domain. Domain IDs are valid as long as the underlying public keys are valid.
- Domains may be anonymous, in the sense that the identity of users who hold corresponding private keys is not revealed by default.
- Human-meaningful naming metadata may be used for domain labels but is not unique and may not be used for domain lookup in the style of DNS.
- Domains may have a permanent public contract that specifies policy aspects of information hosting associated with the domain. This may include permanence and licensing policies, etc. (ex. archival (no deletion), Creative Commons license only) Contracts are always established at creation time via inclusion in the domain's root entity.
- Repositories may optionally manage information per-domain. Commonly, there will be one or more officially-designated repositories or networks for each domain. While this may suggest a notion of first-party information sources, it does not confer identity or authenticity to the information hosted within the domain. Domains exist as a quality-of-service mechanism only, not an authority mechanism.
- Standard Data Entities may have domain metadata, to designate recommended sources of data and metadata for themselves and perhaps adjacent entities. Domain metadata must be signed by trusted parties to be considered useful, as any user may create and attempt to propagate it.
- Not every entity in a graph needs domain metadata. It is usually sufficient for only a handful of root entities to be annotated. Domain metadata is by no means required for entity lookup in most Information-Centric Networking strategies, but is a more direct routing avenue for information that has a primary source. In no way do domains restrict 3rd party data sources.
Adaptable, Multi-Paradigm Security Model
Summary
The primary focus of the InfoCentral security model is to enable secure interactions over the global-scale public data graph, while making the details as invisible as possible for users working at the information-space level of abstraction. The lack of traditional application and service boundaries in the software architecture requires that user end-points be largely responsible for their own security needs. Public distributed systems must almost entirely rely on cryptography. Almost all content is signed; all private content is encrypted. Private networks and traditional access controls are still supported, while providing the same security abstractions. Likewise, distributed network designs may use their own access and propagation schemes to guide data visibility.
Discussion
- InfoCentral largely treats security as an orthogonal concern to the Persistent Data Model, providing a generalized framework but few mandates. Security models and primitives can easily be changed in the future, allowing adaptation to new research, tools, and threats.
- InfoCentral promotes resiliency as much as security, recognizing that attacks and failures will inevitably occur in any system. Lack of dependency upon centralized infrastructure, the ability to downgrade and operate at best-effort, and the inherent stability of immutable data structures are characteristics of this design.
- Through use of secure-hash IDs, extensive cryptographic signing, and promotion of globally-available schemas / ontologies, InfoCentral promotes ubiquitous data authentication and validation, eliminating a wide range of attack vectors. Unlike in most contemporary systems, existing document or database content cannot be trivially modified without detection, even after a complete access control breach or key compromise. Undesired new content may be added, but this can be passively logged and later revoked.
Entity Encryption
Because the InfoCentral architecture lends itself to a shift toward client-side computing, the costs of strong cryptography are largely offloaded to end-users. What could be economically prohibitive for narrow-margin cloud services supporting billions of homogeneous users is trivial for local hardware operating upon a much smaller yet more diverse data set. Hardware cryptography acceleration is also likely to assist here, especially for energy-constrained mobile devices.
- Encryption of the payload content of private data entities is promoted as the first line of defense, though it is not mandatory. Access Control Lists, private repositories / networks, and smart-contract distributed replication schemes are supported as an alternative or supplement to entity encryption.
- Entity payload encryption supports use of untrusted storage and untrusted networks for distribution and redundancy of private but less sensitive information. Access pattern analysis may reveal clues about the content of encrypted entities and the nature of private user interactions. If this is considered a meaningful risk, transport encryption and/or routing indirection must be used as well. It would be possible to avoid double encryption by using a new transfer protocol that only encrypts the command stream, entity headers, and any payload data not already individually encrypted to specific users. On the other hand, a full layer of TLS encryption provides an extra measure of protection against interception, should a particular user's private key ever be compromised.
- Data backup is made extremely simple by encrypted entities. Traditional databases and filesystems must have encryption carefully applied to exported data as part of backup protocols. This becomes far more complex if it is unacceptable to use a single key for the whole backup. (different users / security levels / etc.) When all private data is contained in encrypted entities, these complications are non-existent. Incidental metadata may, of course, have other security properties and need to be handled separately.
- Any encrypted payload data is necessarily included in calculating HIDs of an entity. As such, content key(s) cannot be changed after an entity is created, as this would change its hash identity. Compromise of a content key is therefore unrecoverable, beyond attempting to revoke the affected entity from circulation. This may be easier on some networks than others.
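A minimal sketch follows, assuming Node.js crypto with AES-256-GCM and SHA-256, of why a content key is fixed for the life of an entity: the HID is computed over the canonical bytes, ciphertext included. The serialization shown is a placeholder, not the actual Standard Data Entity encoding.

```typescript
// Minimal sketch (Node.js crypto assumed): the payload is encrypted with a
// per-entity content key, and the HID is a hash over the final bytes,
// ciphertext included. Changing the content key would change the HID.
import { createCipheriv, createHash, randomBytes } from "crypto";

function buildEncryptedEntity(plaintext: Buffer, header: Buffer) {
  const contentKey = randomBytes(32);            // per-entity symmetric key
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", contentKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const tag = cipher.getAuthTag();

  // Placeholder canonical serialization: header, then the encrypted payload block.
  const canonicalBytes = Buffer.concat([header, iv, tag, ciphertext]);

  // The entity's HID is derived from the canonical bytes, so the ciphertext
  // (and therefore the content key) is fixed for the life of the entity.
  const hid = createHash("sha256").update(canonicalBytes).digest("hex");

  return { hid, canonicalBytes, contentKey };
}
```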
Access Controls
Other access control methods are supported via optional data repository features.
- Traditional ACLs are suitable for large, shared data sets where use of PKI among all parties is intractable and secrecy needs are comparatively low.
- Even in open public networks, some forms of access control are necessary to avoid denial of service attacks and to fairly allocate resources. Propagation of new entities should typically be limited by network or user trust, proof of resources for distributed storage systems, cryptographic challenges, or micropayment schemes. In the long term, AI activity monitors among Information-Centric Networks should be able to screen out abusive usage without explicit access control mechanisms.
Public-Key Infrastructure
Open distributed systems lend themselves to extensive use of public-key infrastructure (PKI) because it allows convenient establishment of trust chains among otherwise untrusted users and systems.
- Certain Permissions Metadata, such as content keys encrypted to appropriate parties, are always stored externally to the entities they protect. This allows them to be replaced later without disruption of the original content. In the case of a compromise, entities with Permissions Metadata for affected keys should be purged. If key IDs are visible, this may be done automatically by cooperative repositories. The same would apply in the case of a defeated cryptosystem, if an appropriate identifier is visible. (A hypothetical external Permissions Metadata entity is sketched after this list.)
- Signatures may be stored either internally or externally to the entities whose content they authenticate. In the case of a compromise, existing entities that rely on affected internal signatures would need to be re-signed externally by an appropriate party.
- This provides some motivation for alternative or supplemental signature schemes, such as Merkle-based or blockchain-based schemes, for internal attribution signatures of the most critical entities.
- Private encrypted entities released into an open public network could be archived by an adversary, with hope that a future cryptanalytic breakthrough or successful key intercept will render their plaintext. The nature of this risk must be considered on a case-by-case basis.
- The simplest mitigation against compromise of a user's private key is to use transport encryption between nodes directly exchanging sensitive entities, instead of using open public distribution or a more efficient transport protocol that does not re-encrypt encrypted entity payloads.
- Some scenarios may be able to use ephemeral key schemes to provide forward secrecy for communication participants. This applies to Permission Metadata protecting the symmetric content keys for entity payloads.
- Future advancements in routing management may allow more direct paths for data tagged private. While not robust against an adversary that controls the network, this may suffice for many applications.
- While there is no publicly-known reason to believe that the math behind current public-key algorithms is at imminent risk, many reputable experts expect it to fall to quantum computing breakthroughs soon enough to demand prioritized research on replacements. It thus seems preferable to avoid including PKI as part of any architectural foundation that cannot be changed later, should no direct, quantum-safe replacement become available in time. With all InfoCentral designs, it is possible to swap in a different (albeit far less convenient) security model, without any changes to the data model.
- For example, a provisional suite could be based entirely on symmetric key cryptography and Merkle tree signature schemes. Both are currently considered quantum-safe.
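The sketch below illustrates the externally stored Permissions Metadata idea referenced earlier in this list: the content key is wrapped to each recipient and kept in a separate entity that references the protected entity. Field names are assumptions.

```typescript
// Hypothetical external Permissions Metadata entity: it references the protected
// entity and carries the content key wrapped to each recipient's public key.
// Field names are assumptions; no such schema is specified here.
interface WrappedKey {
  recipientKeyId: string;    // public-key ID of the party who may decrypt
  wrappedContentKey: string; // content key encrypted to that recipient
}

interface PermissionsMetadata {
  protectedEntity: string;   // HID of the encrypted entity
  keys: WrappedKey[];
  issued: string;            // ISO-8601 timestamp
  issuerSignature: string;   // external signature by the granting party
}
// Because this lives outside the protected entity, it can be purged and
// re-issued with new recipients without disturbing the original content.
```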
Layered Data Security
- InfoCentral designs allow for various layered security schemes, with divisions of roles among networked systems.
- Outer layers feature untrusted systems that do not possess any decryption keys. They effectively serve as passive network nodes. Infrastructure and service providers may someday run these to reduce trunk bandwidth for public data, replacing the need for specialized Content Distribution Networks. They are also highly suitable for local mesh networks.
- Middle layers feature systems of varying trust, which possess keys to decrypt medium security data. This allows them to perform indexing, some database operations, and more intelligent orchestration of replication within and across repositories, etc.
- Entities may have multiple payload blocks encrypted with different keys. This could allow a trusted search service to access low-risk data for indexing, while only the end user(s) may decrypt the entire record. (See the sketch after this list.)
- Metadata may be encrypted with different keys. For example, an entity holding image data may be encrypted only to end users, while annotated tags and photo metadata are also encrypted to the public key of a search service.
- Inner layers feature fully-trusted end user devices, which possess keys to decrypt the highest security data. Typically, there is no security segregation necessary among user-space code in the Information Environment.
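As a rough sketch of the multi-key payload idea above, the following hypothetical structure gives each payload block its own key ID, so a middle-layer service can decrypt only the blocks intended for it. The shape and the example key IDs are illustrative assumptions.

```typescript
// Sketch of an entity payload carrying multiple independently encrypted blocks.
// All names are illustrative.
interface EncryptedBlock {
  keyId: string;              // identifies which key or recipient group can decrypt
  iv: string;
  ciphertext: string;
}

interface LayeredEntityPayload {
  blocks: EncryptedBlock[];
}

// Example: a photo entity whose image data is readable only by end users, while
// its tags and capture metadata are also readable by a trusted search service.
const photoPayload: LayeredEntityPayload = {
  blocks: [
    { keyId: "end-user-group", iv: "…", ciphertext: "…(image data)…" },
    { keyId: "search-service", iv: "…", ciphertext: "…(tags, capture metadata)…" },
  ],
};
```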
Public Interfaces
Systems fully exposed to the public internet (outer layer) have a very minimalistic interface / protocol, similar to HTTP. Ideally, the standard repository interface for public graph data will become the dominant public protocol used on the internet. In combination with formally-defined interaction patterns over graph data, this will eliminate the need for application-specific network protocols.
While the basic Repository interface is a public interoperability requirement, repository networks are free to privately extend it in any way that does not impact the data model. In contrast to public network operation, such intra-repository communication, among distributed topology systems, usually happens over a separate private secure channel.
- Repository operations not covered by the public interface are accessible only from a private interface for authenticated DMF clients and components.
- In conjunction with DMF data validation, the outer layer acts as a sort of rudimentary content-based firewall. In no case may user-space / Information Environment code communicate directly over the network. This reduces the exterior attack surface to a codebase simple enough to be proven correct.
- Rate-limiting and access-pattern detection can be used to inhibit read-based denial-of-service attacks.
- Signature-based network trust of data entities may be used to combat abusive write patterns within open public networks, via trusted-source validation (for authenticated writers) and proof-of-work protocols (for unauthenticated writers, a common usage pattern).
Social technology philosophy and security implications
- In open free societies, the importance of authenticating public records and interactions is often equal to or greater than that of fully hiding private information. As we increasingly rely upon intelligent machines in the future, this will likely become even more critical. Accurate and highly durable historical records are needed to prevent malicious uses of AI. Default data authentication and validation are powerful tools against such abuse, especially when following explicit interaction patterns.
- Transparent activity in the public data space should be promoted, as a mirror of how robust free societies operate in physical venues. What someone does and says in public is not hidden from anyone, and they should be willing to accept credit or blame. The InfoCentral metadata model allows users to conveniently retract prior statements, though not fully hide them. (A parallel cultural acceptance of this will be necessary.) Security in this arena means ensuring proper attribution, along with preventing denial-of-service in timely replication.
- Removal of content that infringes upon legal rights is supported through revocation metadata signed by the original author and/or trusted legal entities. While acting upon this metadata is still the decision of repositories, this nonetheless provides a mechanism for operators to maintain compliance with applicable laws in their jurisdiction.
- Removal of content for other reasons is left to negotiation and contracts.
- Some systems may arbitrarily allow a user to revoke entities they previously created, particularly if they have never been read by another user. (ex. This would allow “un-sending” a message entity referenced to a private inbox.) Obviously, there often can be no absolute guarantee that further replication has not occurred, such as in storage network scenarios with loose consistency.
- Some systems may explicitly disallow revocation, such as for entities related to public discussion. Making such rules known is often a good case for using Permanent Domains records.
Software Architecture
Overview
InfoCentral's proposed software architecture can be divided into two top-level role categories: the Data Management Foundation (DMF) and the Information Environment (IE). Most generally, the DMF is responsible for managing raw data spaces (storage, transport, encoding, synchronization, and security concerns), while the IE is responsible for managing information spaces (semantics, language, and knowledge) and the composable, user-facing code modules that operate within them. All aspects of DMF and IE are distributable.
DMF components typically reside on both dedicated servers / network hardware and local user devices. IE components predominantly reside on local user devices, but may also live in trusted systems, such as personal, community, and business systems that serve as private information and automation hubs. Large public DMF instances may also have an adjacent IE to support auxiliary services like indexing.
The reason for distinguishing DMF and IE is to strictly separate certain data management and processing concerns. This will be particularly critical with the shift toward Information Centric Networking. Whether using ICN or host-based networks behind the scenes, IE components should never have to worry about data persistence, networking, or security concerns. The IE should be able to interact with the global graph as if it were a secure, local data space.
- Certain aspects of DMF will be visible to the IE as Persistent Data, no different than any other published source of information. Likewise, the DMF will be configured via Persistent Data. Components in IE will be used to administrate individual repositories or networks thereof, configure federation or participation in P2P networks, manage QoS and payments, etc. However, the IE is never involved with the mechanics of DMF concerns.
- IE components that work with DMF management data do not need to be separated since everything within the IE is considered a single trusted user context.
- At the DMF level, typical access control strategies will be used for management data. For example, directives or configuration changes must be signed by a trusted user.
As a final note, this proposed software architecture is not the only possible design for working with the Persistent Data Model. The data model is truly independent of software artifacts.
Development considerations
DMF components could be developed using any of today's popular languages and tools, including system-level implementations. The IE requires new high-level languages and tools to fit the paradigms of fully-integrated information spaces without application boundaries. Assuming that adequate role separations can be achieved, there is no specification of where and how DMF and IE components are implemented. Eventually, a hardened, minimal OS / VM is envisioned as the ideal host for both IE and DMF local instances.
While it would theoretically be possible to write traditional applications that directly use the DMF, this would largely defeat the purpose of the new architecture, while still requiring its increased overhead relative to traditional databases. Instead, there are plans for special IE implementations that serve as adaptors for legacy systems during transition.
Major component categories
- Data Management Foundation components
- Repository and Networking
- Encoding and Cryptography
- Low-level Database Features (collections, filters, transactions)
- Information Environment components
- Data Types, Semantics, and Ontologies
- Code Module Management
- Interaction Patterns
- Human User Interfaces
- Agents / AI
- Legacy support and onramps
- Applet Information Environments
- early developer tools and proof of concept demos
- embeddable onramps from the web to the Graph (GIG)
- developed using current Javascript in-browser frameworks, to remain client-side
- Web Portal Information Environments
- host users from the web, strictly via generated interfaces
- Adaptor Information Environments
- provide application-specific APIs via legacy languages and protocols, not a full environment
- function as limited portals between the InfoCentral information spaces and specific legacy systems
Data Management Foundation
Summary
The Data Management Foundation provides all necessary functionality and interfaces for persisting, exchanging and managing Persistent Data, while hiding network, storage, and security details that could result in inappropriate dependencies and assumptions. The central abstraction of DMF is to make the global data space appear local to components in the Information Environment, with the exception of generalized Quality-of-Service attributes. Real-world DMF capabilities are often layered across trusted and untrusted systems, with entity and transport cryptography, smart routing, and access controls used to safeguard content traversing mixed environments.
Repository and Networking
- Repository and network management responsibilities are intermingled because networking will evolve dramatically in the coming decades, from passive host-based to smart content-based. Repositories will become integral to network operation, rather than merely being devices on the edge of the network as today's internet servers are.
- InfoCentral will start with OSI layer-7 implementations, taking a similar form to other internet applications, like P2P networks and most CDNs. As immutable, graph-structured, hash-referenced data comes to dominate computing, however, some of the intelligence can eventually be pushed down to lower layers.
- Since DMF provides a unifying generalization of Information-Centric Networks, it can bring together existing distributed systems / storage projects without risk of locking in any aspect of their designs. Practical differences will be represented as standardized QoS and economic attributes within the Persistent Data Model. This will allow each to fairly compete as service providers under the new architecture.
- Some examples at the time of writing include IPFS, MaidSafe, Ethereum, and Synereo.
- Projects that directly embed a significant amount of distributed system logic into the storage and network layers (eg. Synereo) may not initially seem to fit cleanly with the DMF / IE role divisions. However, if these projects' data units are captured in Persistent Entities, an IE can host the code used to manage their data in accordance with blockchain mechanisms, smart contracts, etc. This would add some overhead compared to a tightly integrated implementation, but it strictly decouples all application data from particular network architectures, allowing future migration.
- Combined with the IE software regime, DMF supports use cases where disconnected or highly latent network operation is required. This includes operations like pre-caching, guided by usage pattern hints, as well as priority queues for remote data propagation.
- DMF provides facilities for IE to subscribe to / register interest in newly available metadata surrounding existing entities and/or new entities matching type or signature filters. The mechanics of meeting interest among local and global networks are hidden from IE, but priority, interest duration, and cost parameters may be established.
- Repository and networking concerns also include optimizations for versioned data (differential updates, etc.), compression, caching strategies, etc.
Encoding and Cryptography
- Encoding means packaging of data for wire transfer and storage. DMF implements the Standard Data Entity specification for these purposes and validates received entities against it.
- Cryptography must be handled before IE components can make use of data. Components handling cryptography must reside on systems with appropriate trust levels for the data involved. As a general rule, they should be located as close as possible to the end users of data, to minimize exposure. (ie. the most local DMF instance)
- While nothing absolutely prohibits IE components from doing custom cryptography, say for research and prototyping, all standard features are the exclusive responsibility of DMF. Raw ciphertext and signatures from Standard Data Entities will not be available to IE – rather only the user identities of valid keys and signatures. Likewise, because there is too much room for error, IE components have no control of the generation of nonces, IVs, etc.
- There is no mandate to repeatedly decrypt individual entities every time their data is accessed. However, at end-points, decrypted entity data that is ingested into local DMF or IE data structures must be stored securely. (Disk encryption, etc.) These local data structures should normally be treated as a cache, wherein infrequently used data is periodically purged.
- Some DMF implementations may choose not to store the original canonical data of ingested Standard Data Entities. They may instead regenerate the canonical form of entities upon request or cache expiration. (ie. removal from cache to archival storage) This requires careful storage of all relevant encryption keys and metadata pertaining to the original entities, as well as any deltas, if applicable.
- DMF has full visibility and understanding of Repository metadata, but may have varied understanding of the payloads of versioned entities, thereby affecting the availability of possible delta encodings. Support for a general-purpose binary delta encoding scheme will be mandatory. VCDIFF (RFC 3284) is a likely candidate.
- For exchange across public systems, deltas must always be calculated using the canonical entity encoding for any plaintext data blocks. Deltas of encrypted blocks must be protected by the same key as the original. Obviously, a DMF instance without the appropriate keys cannot generate these.
- For internal storage and among private repository networks, any delta encoding may be used.
- While DMF performs all cryptographic checks of entity data, this does not imply trust of the value or validity of contained information itself. All user and information trust is within the domain of IE.
Low-Level Database Features
- Basic queries – support filtering by coarse data entity characteristics such as types of data and metadata, internal and external signatures, and fields from the plaintext header.
- Simple views – allow clients to work with filtered data sets matching basic query criteria
- Collections – allow clients to operate upon entity collections without directly managing the necessary metadata and filters used behind the scenes
- Entity transactions – allow clients to write new data entities with certain repository / network guarantees, such as multiple-entity write / replication consistency or durability
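The sketch below restates these low-level features as hypothetical TypeScript interfaces: a coarse query, a simple view over it, and an entity transaction with basic guarantees. None of these names or shapes are standardized; they only illustrate the intended granularity of DMF-level database features.

```typescript
// Illustrative sketch of DMF-level database features: coarse filtering only,
// by entity characteristics rather than payload semantics. Names are assumptions.
type EntityRef = string;              // a HID, abbreviated to a string here

interface BasicQuery {
  payloadTypes?: string[];            // values from the plaintext entity header
  metadataTypes?: string[];
  signedBy?: string[];                // acceptable signer key IDs (internal or external)
}

interface SimpleView {
  query: BasicQuery;
  current(): Promise<EntityRef[]>;                      // entities matching now
  subscribe(onMatch: (ref: EntityRef) => void): void;   // notification of future matches
}

interface EntityTransaction {
  entities: Uint8Array[];             // canonical entity bytes to be written together
  durability: "best-effort" | "durable";
  replicas?: number;                  // requested write / replication consistency
}
```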
DMF Capabilities and Specializations
DMF capabilities should be limited to networking and data management, providing clients with a coarse view of entity data. Features such as statement graph or relational semantics are only supported within IE.
Some features like indexing and query are ultimately split between layers. For example, DMF supports coarse indexing by payload and metadata types, provenance, etc. whereas the IE supports fine indexing involving data types, attribute values, entity relationships, and all aspects of higher-level data structures, etc.
While DMF capabilities are standardized, there is wide room for specialization to suit varied needs.
- It is intended that almost every physical device will have a local repository, for the sake of performance, local context, and disconnected operations. However, sharing large repositories will be commonplace, both publicly and privately. In a compute cluster, many nodes may share a single logical repository.
- Managed repositories with varied specializations are an opportunity for market-driven quality-of-service differentiation, without sacrificing interoperability goals.
- Smarter repositories may require more access to data entity content and therefore more security. Layered approaches are possible, such as untrusted simple repositories in the public cloud combined with hierarchies of trusted smart repositories among private servers that hold applicable decryption keys.
- Public repositories and P2P storage networks may experiment with a wide variety of access control strategies and associated economic models. Block chains, credit systems, and other mechanisms are very much in consideration.
- An InfoCentral repository interface could be implemented on top of many existing storage networks, to leverage the unique characteristics of each while unifying upon the standard data model.
- In cases like IPFS's BitSwap protocol, a P2P network may be used only for raw entity storage, with additional repository functionality provided separately.
- Attention economy systems like Synereo's Dendronet can be used to drive some of the ICN propagation for certain types of content, such as the social media data it was specifically designed for.
Low-latency operations
In some multi-player games and multi-media interactions, real-time interactive data must be shared, perhaps even involving hundreds of small entities per second. Admittedly, the exclusion of application-specific, low-latency network protocols makes this more challenging. However, a wide variety of DMF specialization techniques can make this more feasible in practice.
- Not every entity needs its own user signature. Signature aggregates can be pushed out periodically to validate rapidly streamed updates. In this scenario, IE is potentially informed after the fact when data it has already received fails signature checks. Therefore, this technique must not be used for security-critical information received over an untrusted channel. It is acceptable, however, for low-value data like game and media-streaming interactions, wherein latency is far more important and validation can occur out-of-band to prevent cheating or spoofing.
- Real-time interacting users may agree to use the same repository or federated network for a given interaction. Alternatively, DMF directives can be used to immediately propagate certain data directly to particular remote repositories.
- Within a single repository or federated network, DMF extensions can allow optimizations for low-latency data streams. These do not in any way change the data model.
- Entities' own HIDs do not always need to be transmitted and/or processed immediately, and their computation may easily be parallelized and hardware-accelerated. The IE does not need public HIDs to begin with, due to private ID mapping.
- As with differential updates, transmission of full canonical Persistent Entities is not required if an extension interface allows for equivalent representations. For example, a data stream may involve thousands of tiny entities referenced to a particular HID. The reference need not be repeatedly transmitted per entity if a DMF extension allows a session whereby the reference is explicitly specified first. Other duplicate header fields and metadata may also be handled in this manner. (A hypothetical session extension of this kind is sketched after this list.)
- Regardless of optimizations, it is critical that DMF continue to perform its validation duties, as a first line of defense to shield IE from malformed data.
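As a purely illustrative sketch of the session idea mentioned above, a private DMF extension might fix the shared reference and common header fields once per session and validate batches with periodic aggregate signatures. The interface below is an assumption, not a proposed standard.

```typescript
// Hypothetical private DMF extension for low-latency streams. A session fixes
// the shared reference HID and common header fields once, so that subsequent
// tiny entities need only carry their distinct payload bytes.
interface StreamSession {
  sharedReference: string;            // HID that every streamed entity references
  sharedHeaderFields: Record<string, string>;
}

interface LowLatencyExtension {
  openSession(session: StreamSession): Promise<string>;  // returns a session ID
  streamEntity(sessionId: string, payload: Uint8Array): Promise<void>;
  // Periodic aggregate signatures validate batches of prior updates after the fact.
  submitSignatureAggregate(sessionId: string, signature: Uint8Array): Promise<void>;
  closeSession(sessionId: string): Promise<void>;
}
```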
DMF to IE Scope and Pairing
- An Information Environment instance may only be paired to a single DMF instance, which will typically be device-local. This paired DMF instance will then communicate with any number of other local or remote DMF instances, orchestrating interactions among them without exposing this complexity to IE. As such, an IE's paired DMF serves as its sole connection to the outside world of Persistent Data.
- A DMF-IE pair is considered a unique, user-oriented security context, within which there are normally no further segregations. All data that is decrypted by a DMF instance is available to the paired IE.
- IE may be used to configure its paired DMF but is not involved in operational mechanics.
- A DMF instance may not have multiple IE clients. If multiple IE instances are needed in the same physical device or context, each must be paired with its own isolated DMF instance. Software components residing therein must use the Persistent Data Model to interact. This does not preclude optimized communication between local DMF instances.
- For example, a device may host a standard, UI-oriented IE client as well as a specialized IE client related to a P2P network that the device uses to share resources. The DMF instance paired with the UI-oriented IE may treat the P2P-related DMF as one of its networking partners, no different than an ISP or cloud-hosted DMF instance.
- A P2P-related IE instance may consist of a single logical IE spread across many devices, with a private interface used to directly communicate with other nodes across the network. While this presents many complications, it does not violate any architectural principles because the scope and embodiment of an IE instance is unspecified. (In contrast, separate IEs cannot directly communicate.) In this approach, each P2P node would be paired with its own local DMF node for Persistent Data operations, while other communication between nodes would occur out-of-band. It is a matter for debate whether this level of IE specialization brings justifiable advantages over P2P systems that operate entirely over Persistent Data.
- One possible design is to limit P2P node interaction to pure functional interfaces with no available I/O side effects. By limiting I/O to a local concern, focus is purely upon distribution of computation. If many nodes wish to share the result of a distributed computation, this must occur over Persistent Data, with the I/O performed singularly by the initiating party's DMF.
- In general, allowing multiple IE-DMF pairs on the same device is considered a pragmatic concession. Early development efforts may benefit from this flexibility. IE instances may use competing VM designs, have special developer features, or be used to encapsulate or adapt existing software systems.
- By temporarily segregating P2P network idiosyncrasies, development convergence may be hastened. The Persistent Data Model serves as the common ground.
- A DMF instance may be distributed, typically across a server farm. All of the architectural guidelines and role boundaries still apply to the logical DMF instance as a whole.
- A single locally-distributed IE instance paired with a single locally-distributed DMF instance will be a common implementation scenario for large-scale service and infrastructure providers. This does not carry as many challenges as widely-distributed P2P-style IE instances.
Treatment of Persistent Entity IDs in IE and DMF
- The Information Environment consumes Persistent Data but does not ultimately need to use first-class public reference HIDs internally. Using much smaller private entity IDs may provide efficiency benefits during implementation. IE merely needs unique entity IDs that are bound to HIDs in its paired DMF instance.
- It is up to DMF how to provide data entity identifiers to the IE. IE components are agnostic to ID schemes, so long as they adhere to the underlying entity immutability semantics.
- Private IDs may either be derived from public HIDs or generated in a manner that ensures uniqueness within the local IE-DMF pair context. Private IDs cannot be based upon entity data because a reference HID may point to an entity that does not locally exist.
- When IE components create new entities that reference existing entities, perhaps known to them only by private IDs, DMF will translate the references back to one or more full HIDs when generating the Standard Data Entities.
- Private ID mapping must always occur in the DMF instance paired to an IE. Most of the time, this will simply be the device-local DMF instance.
- Regardless of the ID scheme, there may be value in use of temporary private IDs within IE for new entities before they have been fully generated by DMF.
- IE is also abstracted from any reference metadata used to validate an entity, as this is a security concern outside its role. Depending on local rules and the directives given, DMF may generate reference metadata when creating references within new entities.
- DMF will attempt to alias all known HIDs for an entity to a single local ID, hiding this practical reality from IE when possible. Unfortunately, if different hash functions are used to reference an entity that does not exist locally, the equivalence of the references cannot be known. This will only be discovered later when the entity is fetched from the network and locally hashed with various functions or when a trusted reference arrives that includes multiple involved hash values. Thus, the IE must be informed by DMF and take appropriate action when an alias is later discovered, assuming it has already traversed entities containing the involved equivalent references. This is unavoidable, whether a private ID scheme or full public HIDs are used as local entity IDs.
- Unknown reference equivalence can be minimized by popular use of a common hash function in most references, variously bolstered by a second hash as desired for extra security.
- Private ID generation methods have numerous trade-offs to weigh, particularly for distributed implementations. These must also be considered against the option of simply using full HIDs throughout, removing the need to maintain a public-private ID mapping between IE and DMF.
- Some common concerns include non-blocking ID generation, maintenance of bi-directional public-private ID mapping across DMF nodes, and reliable IE notification of discovered ID aliases.
- Example: In the case of a distributed DMF instance paired with a distributed IE instance, a coordination issue can arise with some methods of private ID generation and mapping. If two different IE nodes simultaneously request an entity from two different DMF nodes, contained references may result in different private IDs being issued. In order for an IE node to load-balance or failover to a different DMF node, either duplicate private IDs generated for the same references must be aliased, the ID generation method must be node-independent and deterministic, or inter-node coordination must ensure that this cannot occur.
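A minimal sketch of a DMF-side private ID map follows, assuming simple numeric local IDs. The merge method models alias discovery, returning the retired ID so the paired IE can be notified; coordination across distributed DMF nodes is out of scope here.

```typescript
// Minimal sketch of a private ID map maintained by DMF for its paired IE.
// All details (numeric IDs, in-memory maps) are illustrative assumptions.
class PrivateIdMap {
  private nextId = 1n;
  private byHid = new Map<string, bigint>();
  private byId = new Map<bigint, string[]>();

  // Returns the existing private ID for a HID, or issues a new one.
  idFor(hid: string): bigint {
    const existing = this.byHid.get(hid);
    if (existing !== undefined) return existing;
    const id = this.nextId++;
    this.byHid.set(hid, id);
    this.byId.set(id, [hid]);
    return id;
  }

  // Called when two HIDs are discovered to reference the same entity.
  // Returns the retired private ID, if any, so the IE can be notified of the alias.
  merge(keepHid: string, aliasHid: string): bigint | undefined {
    const keepId = this.idFor(keepHid);
    const aliasId = this.byHid.get(aliasHid);
    if (aliasId === keepId) return undefined;        // already merged
    if (aliasId === undefined) {                     // alias HID not seen before
      this.byHid.set(aliasHid, keepId);
      this.byId.get(keepId)!.push(aliasHid);
      return undefined;
    }
    // Re-point every HID of the retired ID at the kept ID.
    const aliasHids = this.byId.get(aliasId) ?? [];
    for (const h of aliasHids) this.byHid.set(h, keepId);
    this.byId.get(keepId)!.push(...aliasHids);
    this.byId.delete(aliasId);
    return aliasId;  // the IE must be told that aliasId now maps to keepId
  }

  hidsFor(id: bigint): string[] {
    return this.byId.get(id) ?? [];
  }
}
```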
Implementing DMF
- There are no definitive restrictions on how DMF is implemented, so long as all standards requirements are met. Repositories and surrounding DMF components may be developed using any system or language platform. Alternatively, local DMF components may exist within the same runtime environment hosting the paired IE.
- A pure functional programming environment that natively distributes code and data as hash-identified entities can also serve as a de facto repository and communicate with other nodes via the Repository interface / protocol, perhaps exclusively. This is considered an abstract long-range recommendation. Under this regime, abstract IE and DMF roles remain, but concrete components may be more tightly interwoven in practice.
- The software interface between DMF and IE is undefined at the architectural level. It does not in any way affect abstract operations over Persistent Data or the roles that DMF and IE play independently of how their concrete implementations interact. Within these roles, certain required features are covered by applicable standards, but interfaces to these features are not. Pragmatically, higher-level standards (more recommendation than mandatory) may someday evolve to cover such interfaces, in order to maximize pairing compatibility between DMF and IE implementations or components thereof.
Database Functionality
Summary
In the InfoCentral model, database functionality is a layer on top of the global, persistent data space. It can be used by the local software environment as a basis of what is currently known, trusted, and deemed useful from the global space. Database features are provided exclusively as views, signifying query-oriented derivation from the base entity data. Hierarchies of views can be used to divide and share database management responsibilities among software components. Low-level views, such as those provided by DMF, provide simplistic selection and filtering of entity data. High-level views, the province of IE, provide convenient data structures built using entity data and allow selection and aggregation based on complex attributes and semantics. They may implement any database semantics needed by high-level software components, whether document, relational, or graph-based. Views may also meet additional local / federated-system requirements, such as particular consistency models, that cannot be imposed upon the global data space.
Discussion
- Views under the InfoCentral model do not hide underlying entity data. It is always possible to drill down through views to the backing persistent entities. Inclusion of selected attributes and calculated aggregate data is permissible so long as the entities supplying the involved data are referenced.
- Views are nominally local, ephemeral, and private. However, in many cases it is worth periodically materializing aspects of views under the Persistent Data Model. This allows the work to produce them to be easily shared and is another quality-of-service opportunity. Common examples include indexes, statistics, and summaries.
- A materialized aspect of a view may not contain a copy or subset of an entity's data without a reference to the entity it is sourced from. In many cases, a view provides only entity references and metadata.
- Calculated aggregate data is likewise allowed, so long as the functions and entity data sources used are unambiguously specified.
- ex.) the sum of a numerical attribute, among all entities referenced
- ex.) the number of (cryptographically signed) “like” annotations for a comment
- Views may be used to capture common access patterns and data structures backed by Persistent Data. More generic views should be widely reused and built upon. A suggested pattern is to stack a specialized view onto a general-purpose view or stack thereof (a minimal sketch of this stacking appears after the examples below). The highest-level software components generally obtain data through top-level IE views, to fully decouple from data access and selection logic. (Such logic is still commonly present in lower-level general IE views, and certainly in DMF views.) Said otherwise, the highest-level components trust the data in the views they consume and do not care how it was obtained.
- General-purpose IE views perform tasks such as:
- Validation and ranking of data sources (accepted signatures, trust networks, etc.)
- DMF still performs all cryptographic operations, such as checking signature validity
- Validation of data against trusted schemas (sane attribute values, valid semantics, etc.)
- Best-version assessment (both automatic and interactive conflict resolution, heuristics)
- Mapping of entity data onto semantic-graph data structures
- Specialized, top-level IE views might provide:
- a categorized messaging inbox
- a date-ordered list of publications by an author
- the latest sensor readings, averaged from a number of stations
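The sketch below, referenced earlier in this list, stacks a hypothetical categorized-inbox view on top of a general-purpose validated-graph view. The interfaces, the "message" type, and the "category" attribute are assumptions; the point is that the specialized view trusts data handed up by the general view and adds only selection and grouping logic.

```typescript
// Rough sketch of view stacking. Interfaces and names are assumptions.
interface ValidatedGraphView {
  // Entities that already passed signature, schema, and best-version checks.
  entitiesOfType(type: string): Promise<string[]>;               // returns HIDs
  attribute(hid: string, name: string): Promise<string | undefined>;
}

// The specialized view trusts the data handed up by the general view and only
// adds its own selection and grouping logic.
class CategorizedInboxView {
  constructor(private base: ValidatedGraphView) {}

  async byCategory(): Promise<Map<string, string[]>> {
    const messages = await this.base.entitiesOfType("message");
    const out = new Map<string, string[]>();
    for (const hid of messages) {
      const category = (await this.base.attribute(hid, "category")) ?? "uncategorized";
      const bucket = out.get(category) ?? [];
      bucket.push(hid);
      out.set(category, bucket);
    }
    return out;
  }
}
```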
- A view could provide standard RDF backed by the Persistent Data Model, similar to building RDF middleware on top of native graph databases.
- Because they hide subscription and update concerns, IE-level views represent the most abstract mechanism for software modules to communicate with each other via shared graph data.
- This abstraction does not prevent IE users and components under their purview from observing how lower-level concerns are affecting overall system operation. It merely defends the highest-level software components from these details.
- Some IE-level views may only pertain to data with no relevance outside the scope of local components using it. The DMF becomes involved only when data must be persisted and/or propagated. It serves this role independently of the IE, decoupling these data management concerns. IE-level views need only indicate a priority level for this management.
Implementation
- Database functionality involves components among both the Data Management Foundation and Information Environment roles. DMF should mostly be involved with low-level concerns, like durability and synchronization, but will also participate in coarse filtering (types, metadata, cryptographic trust, etc.) and related optimizations. The IE is always used for complex query capabilities and any provision of graph or relational semantics.
- It is always architecturally preferable to implement database functionality via transparent code within the Information Environment. This avoids the hazards of optimization black boxes and code/data “impedance mismatches” seen with traditional database engines. It also allows for rapid evolution of data types, query execution strategies, etc. within a native code environment. As a rule, anything not directly useful for entity-level data management or networking should exist within the IE. Components in IE may still provide hints to DMF to guide low-level optimizations.
- IE views and related features consist entirely of code that is driven by popularity, not standardization. This is intentional, within the wider software development paradigm, to avoid design lock-in. Any query languages will always be implemented in IE, for instance.
- While DMF views are treated categorically, IE views are not considered a special type of software component and can be thought of as recommended patterns.
- Persistent Data is always used to configure a view, as well as advertise its capabilities, semantics, and active subscription end-points. (i.e. entity metadata collections belonging to a view for which subscription may be requested, under some defined scope and access ruleset)
- Example of subscription to a DMF view for an entity with HID1 (restated as a brief code-style sketch after this example):
- Query backing the view: “repository and anchor metadata for HID1 that is signed by trusted parties PID1 or PID2”
- Initial response: “repository metadata: HID2, HID3; anchor metadata: HID4”
- Later update: “new repository metadata: HID5”
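The same example, restated as a hypothetical declarative view specification plus the responses a subscriber would observe. Field names are assumptions.

```typescript
// Hypothetical declarative specification of the DMF view backing the subscription.
const viewSpec = {
  subject: "HID1",
  select: ["repository-metadata", "anchor-metadata"],
  signedBy: ["PID1", "PID2"],
};

// Initial response delivered once the subscription is accepted.
const initialResponse = {
  repositoryMetadata: ["HID2", "HID3"],
  anchorMetadata: ["HID4"],
};

// Later update, pushed when matching new metadata arrives.
const laterUpdate = {
  repositoryMetadata: ["HID5"],
};
```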
- As a design pattern, views present a definitive starting point for converting existing systems to the InfoCentral model. An IE is used as an adaptor and old application code is treated as a user. The IE thereby serves as a bridge to rescue raw data from legacy systems that confound it with application logic and arbitrarily mutable state. The general concept is that everything in the InfoCentral model must remain pure (no mutable state, strong typing, rich semantics, etc.) and any interactions with legacy systems must follow strictly-defined patterns that do not permit impurity under any circumstance. (ex. Any hard errors would be seen only in legacy systems.) This is strongly analogous to when systems using Event Sourcing interface with external systems that do not. (In this analogy, InfoCentral views are similar to Application State calculated from replaying events from a clean slate.) However, the overall conversion story is not to provide permanent two-way interoperability, as with an API, but rather to create onramps for escaping legacy architecture. (This topic will be treated in great depth elsewhere.)
- The DMF – IE view interface, as with the general interface, is not defined architecturally. Ultimately, this interface is only concerned with IE notification of data availability. As previously discussed, DMF views themselves are always specified and configured over Persistent Data. (ie. declaratively) This allows unlimited flexibility of implementation. However, user-space IE components must be fully insulated from such details, by consuming only a standard notification interface. It should be trivial to adapt between low-level backends, but we expect movement toward integrated local IE / DMF designs for efficiency.
Provision of ACID properties
Traditional ACID transaction properties can be provided by layering optional capabilities upon the base public DMF standards, in conjunction with appropriate logic within IE.
- The necessary capabilities may be provided using either private DMF interface extensions or entirely using the Persistent Data model, via metadata provided and consumed by DMF instances. The latter is more costly, due to entity overhead for small metadata, but allows for fully public, standard operation. Much of this entity data may also be virtualized.
- A transaction itself is represented using a root entity that is referenced by all of the other entities involved with the transaction. This ensures that the context of transaction elements cannot be lost. The transaction root should also be annotated with an entity containing a list of all involved entities. These root and list entities should be time / TXid stamped and signed. As such, the version history has no ambiguity of intent, authority, or ordering, when a need for conflict resolution arises. If any transaction element entity is missing, the entire transaction is considered incomplete. This provides the means for supplying the Atomicity property. (A minimal sketch of this structure follows this list.)
- Consistency and Isolation are handled using multi-versioning, since this is a default feature of the Persistent Data model. DMF must only assure that transaction component entities have been received and accepted by the appropriate parties. This concern does not require understanding of transaction payloads themselves. It is up to IE components to process the contents of successful DMF-level transactions.
- There is inherently no such thing as global consistency because many repositories are independent. However, flexible view-based consistency semantics are possible, either at the level of a single repository or a federated set of repositories. This typically requires synchronization features that extend the base repository interface but are abstractly presented as QoS parameters at higher levels. DMF-level consistency constraints can be placed on metadata collections within a single repository or network that serves as the authoritative location of a transaction manager. Trusted code in an adjacent IE will then sign submitted transactions, acting as the official arbiter.
- Durability necessarily involves the repository layer, via an extended interface that provides guarantees about when a new entity has been written to non-volatile memory.
- Support for consensus protocols at the data management level (Paxos family, etc.) requires extensions to the standard repository interface or a private interface among federated repositories. This would typically be used for ensuring adequate replication of transaction data via quorums. In contrast, information-space consensus protocols are implemented as Interaction Patterns entirely within Persistent Data and among IE components.
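A minimal sketch of the transaction structure described above follows: a signed root entity, element entities referencing it, and a signed list entity enumerating the elements, with atomicity judged by completeness of the list. All shapes are illustrative.

```typescript
// Illustrative shapes only; field names are assumptions.
interface TransactionRoot {
  txId: string;
  timestamp: string;              // ISO-8601
  coordinatorSignature: string;
}

interface TransactionElement {
  transactionRoot: string;        // HID of the TransactionRoot entity
  payload: unknown;               // the actual data being written
}

interface TransactionList {
  transactionRoot: string;        // HID of the TransactionRoot entity
  elements: string[];             // HIDs of every TransactionElement entity
  timestamp: string;
  coordinatorSignature: string;
}

// Atomicity check: the transaction is complete only if every listed element is present.
function isComplete(list: TransactionList, available: Set<string>): boolean {
  return list.elements.every((hid) => available.has(hid));
}
```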
Introduction
Working with distributed, collaborative, graph-structured information spaces will require new development tools, paradigms, and methodologies. To benefit from the dynamic layering and composability of information, user-space software functionality must itself be fluid and composable rather than contained in traditional static applications. To ensure safe, reliable composition of arbitrary components, we need a reasonable long-term path toward formally verifiable code. In the InfoCentral architectural vision, the framework to support these goals is known as the Information Environment (IE).
Definition and Scope
- An Information Environment is a data processing, manipulation, and presentation workspace within which users and software components interact with the global information space and thereby with each other.
- An Information Environment may or may not involve human user interfaces or even human users. Machine users will also be hosted by the IE. This provides additional motivation for formally verified software, but AI will itself eventually provide tools to help achieve this.
- There is no specified scope for the host of an IE. A local network of diverse UI devices unified as a single user experience may be part of a single logical IE. Likewise, a computation-intensive task may operate within an IE hosted across a large cloud infrastructure or P2P network. Boundaries of an IE are ultimately determined by task and trust.
- A single device may host multiple IEs with strict separation. One instance may provide private services to a local user while another may allow sharing of spare CPU time with millions of users on a P2P network based instance.
Base Data Types, Semantics, and Ontologies
- These must be standardized in a language-neutral manner, comparable to Semantic Web, as these concerns are universal.
- Because these are irreducible and mandatory, implementation should probably be within the local IE framework itself, not its hosted components.
Code Module Management
- This category is largely IE implementation specific, but general functions will include fetching entities containing code, local compilation and/or analysis, and various debugging features.
Interaction Patterns
- Interaction Patterns build upon ontologies, specifying valid patterns of adding new data to the graph, for communication among multiple parties.
- Because this is a new concept to explore, the implementation of Interaction Patterns will initially be loosely specified. They should eventually be language-neutral declarative specifications, but this will take time to evolve. Code-heavy implementations are expected during the early stages.
- IE-level distributed algorithms and protocols over Persistent Data may be described via Interaction Patterns. At this level, we are abstracted from messaging and persistence concerns and everything is natively concurrent.
- Formal reasoning methods like pi calculus can be used to validate algorithm implementations against the specification in a pattern.
Human User Interfaces
- The basic UI design concept of the Information Environment is that appropriate user-facing software functionality automatically comes alongside the information currently being worked with. The lack of “application” boundaries means that information is never trapped within particular software or task contexts and new functionality can be overlaid at any time, resulting in a completely fluid workspace. This does not mean that contexts are non-existent, rather that they are independent and do not own data. Each context / Interaction Pattern is responsible for deciding what data it considers useful and how conflicts will be handled.
- Many everyday IE usage patterns involve ad-hoc data management, using generic viewing and editing widgets available for different data types and device / environmental contexts. Widgets for interacting with various data types and graph structured information are ultimately sourced from the global collection of software components.
- General interactive modes for a data type / structure, such as “text editing” are abstracted from context-specific renderings of those modes, such as graphical vs. voice-oriented text editing.
- Data management and editing widgets may be composites of functionality from many components, resulting in practical replacements for contemporary text and multi-media editing applications. Such composited widgets never define new higher-order data types for their own purposes but rather compose simpler data types into graphs of annotations, metadata, and related elements.
- ex.) A simple raw text editor has no annotation or formatting capability. These capabilities are layered upon the raw editor but do not define a new, richer type of text in the fashion of markup, etc. Instead, raw text data entities are referenced by various metadata-holding entities. For these to be used, a composite viewing widget must support the surrounding metadata types so that it can render a useful representation for the user's current context. As a trivial example, consider a visual vs. audible rendering of text emphasis. (A minimal sketch of this layering follows this list.)
- There are many approaches to compositing functionality. Card-based UIs are a currently popular technique that may be a useful starting point.
- For specifically-designed user interactions, typically those having a foundation of Interaction Patterns at the data level, local IE frameworks may attach to graphs of human-UI-related Persistent Data entities and render a concrete UI in a device and environment appropriate manner.
- Persistent Data related to UIs is usually limited to requests for information from the user and the results of high-level queries suitable for rendering as visualizations. It need not be used for specifying and configuring UI widgets, which are ultimately a device-local concern. Nevertheless, some UI frameworks may do so in order to persist user-custom UI state across sessions or devices.
- Data conflicts are embraced as an inevitable side effect of independent interactions over shared graph data. Most resolution should occur automatically, within the domain of views and lower level rules-based logic. However, some conflicts will need to surface to users. UI design for conflict resolution scenarios is no different than any other interaction and local frameworks will render appropriate final UIs.
- Zoomable User Interface (ZUI) metaphors
- Graph-structured information lends itself naturally to zoomable UIs, which visualize the default breadth-first traversal pattern and make excellent use of available summary metadata. ZUIs also help to visualize relations between data.
- The zoomable-cards approach envisions distinct facets of data around an interaction that can be independently drilled down upon for enhanced detail and context.
- Navigation breadcrumbs provide an alternative to page-oriented forward and back navigation, which loses branch histories. Navigation actions form new path segments in a user's history graph, which can also be annotated and further imbued with context.
- Dynamic level-of-detail (LoD)
- Summarization may include calculated aggregates, trusted subsets, abbreviations, etc. Natural language summaries will be a good use for AI.
- LoD will in some cases depend upon network quality, making best use of what is available at the time. Low-resource devices can also offload processing of summarizations and wider network fetches to personal servers or cloud services. This will be especially critical for real-world mobile usage, given the latencies and power consumption patterns involved.
- This will provide yet another QoS market-differentiation opportunity.
- Spatial memory assistance will be critical for human users, to promote data workspace familiarity and fast transitioning between tasks. Spatial data will be stored as a local, personal annotation to the history graph and surrounding context, typically involving precise graphical layouts and positioning. Aural versions may also be possible, using 3D sound cues, etc.
- Implementing video games within the IE involves some substantial leaps in abstraction over traditional game engine design. It is more appropriate to think of a video game interaction as a graphical rendering of a limited subset of a virtual world, with a complex assortment of Interaction Patterns and human UI input stream mappings applied.
- It is considered unsafe and architecturally impure to directly expose any form of local graphics drawing interface to external software components, i.e. those existing in the global graph. (compare WebGL) Instead, scene description is used as a declarative abstraction, whereby a local renderer decides how to perform drawing of the scene. (compare SVG and VRML)
- Bitmap graphics operations should be directly supportable, as there is no safety or API dependency concern. In this case, a local renderer need only periodically refresh a display area with a constructed buffer.
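To make the layering idea from the raw-text example above concrete, here is a minimal Python sketch (all entity and field names are hypothetical) in which an immutable raw text entity is referenced by a separate emphasis annotation, and two context-specific renderers interpret the same small graph differently.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class RawText:
        hid: str            # content-hash identity of the immutable entity
        text: str

    @dataclass(frozen=True)
    class EmphasisAnnotation:
        hid: str
        target: str         # HID of the raw text entity being annotated
        start: int          # character range receiving emphasis
        end: int

    def render_visual(text: RawText, ann: EmphasisAnnotation) -> str:
        # A graphical widget might map emphasis to bold styling (asterisks here).
        t = text.text
        return t[:ann.start] + "*" + t[ann.start:ann.end] + "*" + t[ann.end:]

    def render_audible(text: RawText, ann: EmphasisAnnotation) -> str:
        # A voice-oriented widget might instead insert prosody cues.
        t = text.text
        return t[:ann.start] + "<emphasis>" + t[ann.start:ann.end] + "</emphasis>" + t[ann.end:]

    note = RawText(hid="hid:abc", text="Ship the report today.")
    emph = EmphasisAnnotation(hid="hid:def", target="hid:abc", start=16, end=21)
    # Both renderings derive from the same underlying entities; neither defines
    # a new "rich text" type.

The raw text remains usable by any component that ignores the annotation, which is exactly the property that markup-style rich types would destroy.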
Software Agents and Artificial Intelligence
- The IE will serve as the interface for software agents and increasingly-general AI users.
- Because the IE represents a view of a filterable subset of the global data graph, it serves as a useful abstraction for AI testing, development, and containment.
- AI users themselves may exist within the IE, though this is not a requirement, given the extreme performance demands of neural networks. Software agents are a more likely candidate to be directly hosted by the IE in the near future.
- The IE will, at the very least, be the gateway for crystallized knowledge ingested and produced by AI users. Deep neural nets characterized by low-precision or probabilistic calculations and massive working memory do not currently seem like a fit for the formality and overhead of the IE architecture.
Software Architecture Principles
General
Simplicity should be an absolute guiding preference, with a goal of minimal dependencies and maximal re-use of code and data. The Information Environment concept is intended as the modern successor to the Unix philosophy.
Data Storage and Sharing
All persistent and shared data must use the Data Management Foundation. There is no provision for the IE to directly access data storage or networking facilities.
- IE components may give the DMF hints for how to treat specific data sets, using standardized metadata. This allows for tagging of replication, QoS, and security priorities, though the DMF is ultimately responsible for choosing among suitable optimizations toward these ends. (A sketch of such a hint follows this list.)
- Unlike centralized internet services, the localization of most user-space processing means that a reliable network connection is not required for continuous local functionality. Hints may be used to promote the most critical data being locally available during latent periods and replicated first upon reconnection.
- System-level DMF tools will allow local storage and hardware management, including immediate replication of chosen data to a particular device or network repository. Management operations will be configured and orchestrated over Persistent Data and thereby have visibility to appropriate IE components. This does not violate role boundaries because the IE does not actually perform any of the work.
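The following Python sketch illustrates the kind of loosely coupled hint an IE component might attach when handing data to the DMF. The field names are hypothetical, not a standardized vocabulary; the point is only that the IE expresses intent while the DMF retains full control over the concrete optimization.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class DmfHint:
        entity_hid: str                 # the Persistent Data entity the hint describes
        replication_priority: int       # e.g. 0 = background, 9 = replicate first upon reconnection
        time_sensitive: bool            # prefer delivery during latent periods if at all possible
        private: bool                   # restrict propagation to trusted repositories
        preferred_repository: Optional[str] = None   # advisory only; the DMF decides

    # The IE merely states what matters about the data; the DMF chooses how to honor it.
    hint = DmfHint(
        entity_hid="hid:outbound-message",
        replication_priority=9,
        time_sensitive=True,
        private=True,
    )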
Software Component Interoperation
Information Environment software components may only interoperate publicly using Persistent Data. There is no provision for user-space remote APIs, which are an artifact of the application model of software. Private interoperation, such as function composition, is allowed within an IE instance. This influences the scope and granularity of private vs. public software compositions and interactions. The general rule is that Persistent Data should be used for any data that may later be useful in other contexts or when shared.
Computation may be distributed through an implementation-specific IE protocol. The most obvious applications are local, real-time processing tasks and large scale processing across a distributed IE instance that spans a cloud service or P2P network. In both of these cases, intermediate data does not leave a single logical IE instance. To be accessed externally of that IE, it would need to use Persistent Data.
Services
Services are formalized, automated remote interactions over shared graphs of Persistent Data entities. Interaction Patterns are used to specify service behaviors and contracts, and metadata is used to advertise patterns, end-points, and network QoS factors.
- As with all graph interactions, services do not own the data. By leaving external users as much control as possible, excluding only that which could allow abuse of a service, the responsibility of providers is minimized while 3rd-party innovation is unhindered. (After all, most enterprises do not specialize in IT.)
- If blockchains and smart contracts are used to orchestrate aspects of service Interaction Patterns, any data involved will itself use the Persistent Data Model. Consensus mechanisms are merely another QoS concern relative to the shared data graph. This promotes flexibility, continuous innovation, and hybrid approaches.
Examples
- General: A client anchors a request entity to a service's known “request root.” The request contains a reference to or description of some desired task, relevant data, and a form of payment, if required. After the service accepts and processes the request, it annotates the request entity with the result, notifying the client via a prior-agreed callback mechanism. (The DMF hides this final detail, as it is QoS related.) A minimal sketch of this pattern follows the examples below.
- General, multi-party: Many participants operate upon a shared public graph of information, using Interaction Patterns mediated by one or more trusted parties. (or a quorum of participants in a fully-distributed orchestration)
- Concrete: A business may provide a customer wait queue by advertising that a local service (human or automated) will manage reservation requests that are anchored to a specified request root. Customers' IEs also subscribe to the wider graph of information around the wait queue root, including other customers' requests and current resource data. (ex. repair shop bays, restaurant tables, etc.) This information can be used to independently determine place in line, estimated wait time, etc. Typically, algorithms and parameters suggested by the business would be used, but since client-side software is under full user control, it may provide custom functionality not originally intended by the service operator. Within a reservation request, a customer's name or ID should be encrypted to the public key of the business. Yet customers in wait could still anonymously interact, such as playing trivia games or even auctioning their position with other customers, assuming the business's queue service advertises that it will honor such agreements.
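A minimal sketch of the general request pattern above, in Python. The toy repository interface (anchor, anchored_to) and the entity names are hypothetical stand-ins for whatever operations a real DMF implementation exposes; callback and payment details are omitted.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass(frozen=True)
    class Entity:
        hid: str
        payload: Dict[str, str]

    class Repository:
        """Toy in-memory stand-in for a DMF repository."""
        def __init__(self) -> None:
            self._anchors: Dict[str, List[Entity]] = {}

        def anchor(self, target_hid: str, entity: Entity) -> None:
            self._anchors.setdefault(target_hid, []).append(entity)

        def anchored_to(self, target_hid: str) -> List[Entity]:
            return list(self._anchors.get(target_hid, []))

    repo = Repository()
    REQUEST_ROOT = "hid:service-request-root"

    # Client side: anchor a request entity to the service's known request root.
    request = Entity(hid="hid:req-1", payload={"task": "hid:task-desc", "payment": "hid:payment"})
    repo.anchor(REQUEST_ROOT, request)

    # Service side: accept each pending request and annotate it with a result entity.
    for req in repo.anchored_to(REQUEST_ROOT):
        result = Entity(hid="hid:result-1", payload={"result_for": req.hid, "status": "done"})
        repo.anchor(req.hid, result)

    # Client side: discover the result by watching annotations on its own request.
    results = repo.anchored_to("hid:req-1")

Note that neither party ever calls the other directly; all interaction state is captured as graph data, which is what makes replay, auditing, and third-party tooling straightforward.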
Other Boundaries
- No direct interactions between IE instances are allowed. Architecturally, this discourages hard dependencies on interfaces backed by hidden remote code and favors local processing whenever possible. While services accessed via requests over Persistent Data are arguably similar in practice, all of the data surrounding interactions is captured in a standardized way, and services themselves are required to be defined by strict contracts and robust data semantics. Transparent logging of incoming Persistent Entities makes replay trivial in the case of troubleshooting or abuse analysis. Hidden code is still possible behind a service's Interaction Patterns, but these restrictions make dependencies as minimal as possible. Because it will be so easy to replace services, market forces will usually provide pressure toward driving them with shared open code. Proprietary business data used to parameterize service functionality may certainly still be hidden regardless.
- Example: An insurance provider operates a service that generates quotes based upon submitted client information and coverage needs. The formulas for calculating these quotes may involve proprietary business data, hidden by the service, but the interaction pattern for using the quote service is fully open. As captured by the Persistent Data Model, the user has full transparency and a historic record of the interaction with the service.
- If a private IE wanted to invoke the services of a P2P network IE, even with a node on the same device, it would still need to use Persistent Data.
- Interaction Pattern related software modules do not participate in generating UIs, as this is the responsibility of the components chosen by the local IE. Patterns may only request information, which may involve interactions with a user.
- ex.) “Ask the human user for their body height” is interpreted by one IE as a voice dialogue and by another as a GUI dialogue, with language and units appropriate for the current user. (Obviously, the request itself uses a native semantic encoding, not natural language.) A request may also be answered automatically, if suitable data is available in the graph of the user's personal information. (A sketch follows.)
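A sketch of how such an information request might be handled, with the semantic encoding reduced to a hypothetical property identifier. The two modality interpreters and the auto-answer lookup are illustrative only.

    from typing import Dict, Optional

    # Hypothetical semantic request: "ask the user for their body height"
    request = {"requested_property": "hid:person/bodyHeight", "unit_hint": "hid:unit/metre"}

    personal_graph: Dict[str, str] = {"hid:person/bodyHeight": "1.78"}

    def answer(req: Dict[str, str], graph: Dict[str, str]) -> Optional[str]:
        # Answer automatically when suitable data already exists in the user's graph ...
        return graph.get(req["requested_property"])

    def render_prompt(req: Dict[str, str], modality: str) -> str:
        # ... otherwise the local IE renders a modality-appropriate prompt.
        if modality == "voice":
            return "What is your height, in metres?"
        return "Height (m): [______]"

    value = answer(request, personal_graph) or render_prompt(request, modality="voice")

The requesting Interaction Pattern never sees the prompt itself, only the resulting semantic value.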
Programming Paradigms
- The Information Environment model will motivate development of new programming languages to better meet its unique requirements and development paradigms. Especially during earlier stages, some new languages may simply be subsets of current popular languages, with particular features removed to allow architectural purity.
- Declarative programming styles are strongly preferred over structured and imperative styles. They are far more suitable to creating fluid information environments, where functionality is dynamically orchestrated and data is rarely homogeneous.
- Structured programming lends itself to designs with complex interdependencies, which reduce code reusability. It tends to favor top-down rather than organic organization, in turn requiring traditionally-structured development teams and economics.
- Object-oriented programming clashes with the intentional orthogonality of data and logic throughout the InfoCentral information architecture. The notion of an object encapsulating a persistent data structure is a poor abstraction for distributed systems, often incompatible with even local concurrency needs among loosely-coupled components. Likewise, native data semantics replace object classes as the means of modeling the real world. With this concern out of the way, it becomes possible to focus upon lightweight, easily re-usable units of code, supported by robust type systems.
- Among declarative paradigms, description (constraints, pure logic, and declarative DSLs) is heavily favored over code. Said otherwise, when given a choice, one should favor complex data and simple, generalized code.
- Pure-functional style should be used almost exclusively for code.
- Software components (down to the level of functions) must be fully independent of each other. They may only depend upon type signature + contract pairs (by HID reference), never implementations thereof. The execution environment will automatically choose trusted implementation code units, though annotations of recommendations, or even local requirements, are allowed. (A sketch of this arrangement follows this list.)
- Contracts allow implementations matching a function type signature to be tested and verified. Their programmatic aspects should allow them to serve as basic unit tests.
- Implementations may have various runtime tradeoffs. These will be explicitly annotated, to assist dynamic selection for the task and environment at hand.
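A sketch of the signature-plus-contract idea: components reference the contract entity by HID, implementations carry explicit tradeoff annotations, and the environment verifies and selects among them. All names, the check format, and the selection rule are hypothetical assumptions, not a defined mechanism.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass(frozen=True)
    class Contract:
        hid: str
        signature: str                               # e.g. "sort(list[int]) -> list[int]"
        checks: List[Callable[[Callable], bool]]     # programmatic aspects double as unit tests

    @dataclass(frozen=True)
    class Implementation:
        hid: str
        contract_hid: str
        tradeoffs: Dict[str, str]                    # explicit runtime annotations
        fn: Callable

    sort_contract = Contract(
        hid="hid:contract/sort",
        signature="sort(list[int]) -> list[int]",
        checks=[lambda f: f([3, 1, 2]) == [1, 2, 3], lambda f: f([]) == []],
    )

    # Function bodies below are placeholders; only the annotations matter here.
    candidates = [
        Implementation("hid:impl/merge-sort", "hid:contract/sort",
                       {"time": "O(n log n)", "space": "O(n)"}, fn=sorted),
        Implementation("hid:impl/simple-sort", "hid:contract/sort",
                       {"time": "O(n^2)", "space": "O(1)"}, fn=lambda xs: sorted(xs)),
    ]

    def select(contract: Contract, impls: List[Implementation]) -> Implementation:
        # The environment verifies each candidate against the contract's checks
        # before it becomes eligible; the final choice (trust, tradeoffs for the
        # task and environment at hand) is environment-specific policy.
        verified = [i for i in impls if all(chk(i.fn) for chk in contract.checks)]
        return verified[0]

    chosen = select(sort_contract, candidates)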
General Language Requirements
Although declarative coding paradigms generally make more sense given the full architectural perspective, there are no hard requirements. With the exception of a few generalized constraints, language design will be driven by popularity.
Code containers
- Units of code, whether source or compiled, always exist within Standard Data Entities.
- Source code is not stored as human-written plain text but rather as a data structure produced by a semantic editor. Code comments exist only as annotations.
Naming
- All referenceable units of code, such as functions, classes, and modules, must be hash-identifiable, rather than relying on an authoritative symbol namespace. (A sketch follows this list.)
- This contrasts with the style of domain-rooted namespace used by Java and others.
- Metadata may be used to provide human-readable labels, as visible within code editors.
- Arbitrary internal naming may be used within a containerized code unit as long as external hash references to required subunits remain possible. For example, this may be provided using a similar Anchor Metadata pattern as subjugate statement IDs. Every code language must provide an indexing scheme to enable this. For the most part, however, it is preferable to factor code into irreducible units per entity. Some language code container standards may choose to mandate a particular level of granularity, such as per-function.
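A sketch of a hash-identified code container with human-readable labels supplied as separate metadata rather than an authoritative namespace. The structured "semantic editor output", the canonicalization step, and the label format shown here are illustrative assumptions.

    import hashlib
    import json

    # A function stored as a structured entity (not plain text); this dictionary
    # is a crude stand-in for a semantic editor's output.
    function_entity = {
        "kind": "function",
        "parameters": ["x", "y"],
        "body": {"op": "add", "args": ["x", "y"]},
    }

    # Content-hash identity: the code unit is referenced by its digest, never by name.
    canonical = json.dumps(function_entity, sort_keys=True).encode("utf-8")
    hid = "hid:sha256:" + hashlib.sha256(canonical).hexdigest()

    # Human-readable labels and comments live in separate metadata entities that
    # reference the code unit by HID; editors may display them, tooling ignores them.
    label_metadata = {"target": hid, "label": "add", "language": "en"}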
Code Management
- There are never user-visible compile, build, or install steps in either production or deployment workflows. The IE implementation may choose its manners of optimization, such as pre-fetching and pre-compiling new or likely-related code during idle periods.
- Import of closed-source compiled code into an IE is not allowed, as this limits inspection, formal verification, and dynamic modification features integral to the core architecture. Pre-compiled code may be imported from trusted parties as long as any requisite source and debugging information remains available. (ex. mobile IEs might grab compiled code units from a home server.)
- This requirement is purely technical and is not rooted in philosophical purism. A direct comparison is the necessary inability to hide JavaScript code from users of web applications. This, of course, has nothing to do with matters of licensing.
Interaction with Data Management Foundation
- Information Environments may provide various management hints to the DMF. However, this is intended to be a very loose coupling in most cases. In specialized situations, an IE may refuse to work with DMF components that do not advertise certain capabilities.
- End-to-end communications will typically pass through multiple DMF instances before reaching a user on another IE. This makes it more difficult to provide certain guarantees available in traditional centralized messaging solutions. Various strategies exist to promote desired behavior in the average case, without resorting to federation.
- Example: Sending a message to an inbox implies a specific metadata replication intention. The message cannot simply be replicated to a local repository. It must somehow find its way, across myriad networks, to a repository where the recipient is known to be watching an inbox collection entity for new anchor metadata references. The IE code which produces a message is entirely uninvolved. It is the DMF's responsibility to handle this replication, such that IE components may remain naive to network operation. However, the IE may hint to the DMF the nature of this metadata: that it is private and time-sensitive. The DMF will examine metadata for the inbox collection entity being anchored to and see that it has specific repository (or permanent domain) hosting metadata, with a high trust ranking. The DMF will attempt replication to one or more repositories automatically, without involving the IE further. This does not imply that the IE cannot see the repository hosting metadata for the inbox collection entity, only that these concerns are fully separated. Another IE component may handle the information space of known private inboxes, perhaps including access control mechanisms, but will still not take part in network operations. (A sketch of this flow follows this list.)
- In some cases, there is not a single authoritative repository to replicate toward. Consider open publication of comments or annotations for a popular journal. The journal publisher may not accept anchor metadata in their own repository. Even if so, they may subsequently censor third-party content. This highlights the importance of popular repositories with clear policies for this sort of “overlay” content, an excellent case for community-run P2P networks. Obviously, there is no method to force users to overlay 3rd-party content, even from highly popular repositories. However, systems should be designed to promote continuous discovery of useful sources by default, employing both social networking and AI. Few users will choose to turn off a feature that is consistently beneficial to being well informed. Likewise, any choice to live in a “filter bubble” should be made consciously.
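A sketch of the inbox walkthrough above. The hosting-metadata lookup and the replication call are hypothetical placeholders for whatever a concrete DMF implementation provides; the point is only that the IE hands over the anchor metadata plus hints and goes no further.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass(frozen=True)
    class AnchorMetadata:
        hid: str
        anchor_to: str          # HID of the recipient's inbox collection entity
        payload_hid: str        # HID of the (encrypted) message entity
        private: bool
        time_sensitive: bool

    # Hosting metadata published for the inbox collection entity, ranked by trust.
    hosting_metadata: Dict[str, List[str]] = {
        "hid:inbox/alice": ["repo://alice-home", "repo://community-p2p"],
    }

    def dmf_replicate(msg: AnchorMetadata) -> List[str]:
        # The DMF, not the IE, consults hosting metadata and pushes the entity
        # toward repositories the recipient is known to watch.
        targets = hosting_metadata.get(msg.anchor_to, [])
        return [f"replicated {msg.payload_hid} -> {repo}" for repo in targets]

    message = AnchorMetadata(
        hid="hid:anchor/msg-1",
        anchor_to="hid:inbox/alice",
        payload_hid="hid:msg-1",
        private=True,
        time_sensitive=True,
    )
    actions = dmf_replicate(message)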