Many of the answers below are taken from the Hypercore protocol discussion forums (currently on Discord and the older forum on Gitter). All interpretations are ours, and so are the possible mistakes and misunderstandings. Please send corrections as pull requests, or request commit rights. Or just reach out to us over the email posted on github or on @tradles. Questions with partial answers are marked with Need help with this. Hypercore needs your help, please help us make it fully documented so that all P2P projects start benefiting from it.
This section is for general questions. See other sections for questions specific to individual Hypercore modules.
Hypercore is an open source P2P technology, but we have seen other P2P technologies before, like BitTorrent and Bitcoin. What is unique about Hypercore?
Hypercore’s key USP is streaming. You can think of it as video streaming, but extended beyond video to filesystems, databases, messages, IoT signals, and any other structured data constructs. With streaming, you get:
This streaming point needs to be repeated again and again, as streaming data, just by itself, without any other wonderful Hypercore capabilities, may create a new class of applications, much like Netflix re-invented movie watching. This paradigm shift is one reason why Hypercore is hard to grok for app developers: it requires a full rethinking of our current architectures.
Note, when reading Hypercore docs you will find many references to sparse replication. This is the capability used for streaming, allowing a peer to efficiently request individual blocks from remote peers, instead of loading the whole remote dataset, be it a video file or a database.
Another USP of Hypercore is that it implements essential patterns of distributed systems in a reusable way, so that systems and applications are not forced to re-invent the wheel.
WAL. All distributed systems need a Write Ahead Log (WAL), be it databases, orchestration engines, like Zookeeper or etcd, or event streaming systems like Kafka. Every system implements its own WAL today. Hypercore generalized this pattern as an append-only-log and consistently uses it in its higher-level data structures such as Hypertrie, Hyperbee, Hyperdrive.
Time travel. Hypercore provides universal history for all data structures that work on top of it, giving them universal undo-redo and rewind-replay. This capability is common in editors, but now it can be used in any application, including retracing actions in a filesystem or a database. Use cases are plentiful:
Recovery. The same history capability can be used for classic database point-in-time recovery, versioning for data integrity assurance, VM snapshots, filesystem and DB snapshots, container image layers, etc.
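The WAL and time-travel pattern above can be sketched as a minimal in-memory log with point-in-time checkout. This is a conceptual stand-in only, not Hypercore's actual on-disk format, which also hashes and signs every block:

```python
class AppendOnlyLog:
    """Minimal WAL-style log with point-in-time checkout (no hashing/signing)."""

    def __init__(self):
        self._entries = []

    def append(self, entry: bytes) -> int:
        self._entries.append(entry)
        return len(self._entries)      # new length doubles as the version / seq

    def get(self, index: int) -> bytes:
        return self._entries[index]

    def checkout(self, version: int) -> "AppendOnlyLog":
        """Read-only snapshot of the log as it was at `version`."""
        snapshot = AppendOnlyLog()
        snapshot._entries = self._entries[:version]
        return snapshot
```

Because entries are never mutated, any prior version is just a prefix of the log, which is exactly what makes undo-redo, rewind-replay, and point-in-time recovery cheap.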
We have gotten used to how Gmail automatically refreshes its UI when new email arrives, or how, when you view an email on mobile, it is immediately marked as read everywhere. This capability is pervasively used by Google in Docs, Slides, and other apps. Distributed apps need to match that, and Hypercore’s real-time updates, including on changes made on other devices or by our teammates, provide this capability.
Many applications (like chat, group chat, photo apps, social media apps and collaborative apps) need the ability to handle file sync, especially for very large media files. Those providers hold large files in a central holding area, and we never really know if they delete them afterwards, potentially violating our privacy. They also have significant limits on file sizes. Hypercore is geared really well to enable these scenarios without a central server / provider and without limits. And on top of this it adds partial sync, or streaming, so data does not need to be loaded fully to be viewed / explored.
Distributed apps need these and therefore apps using hypercore become simpler to write.
Need help with this: How would one implement in Hypercore forwarding a message with a large video from one chat to another (both in a one-on-one and a group chat).
Take a look at Jared Forsyth’s criteria for the above and the various products he reviewed using these criteria. The framework for this type of software is very hard to create, but it is a fresh direction away from the massive aggregation of our personal data, and it is close to becoming a reality. Isn’t that why you are here?
Hypercore is built to give you full control of your data. This means it continues to work even when you have no connectivity, when your other peers are offline, and when a hosting / cloud provider closes your account. It also allows portability to other machines or hosting providers.
Note, for example, that Google Docs only supports offline work in Chrome, but not in other browsers. This is how these services lock you in. A newer phenomenon is de-platforming, when an app provider like Twitter closes your account, or an app store blocks your app downloads. Facebook and Twitter famously killed rich ecosystems of apps on their platforms.
By relying on providers without data mobility, we give up self-sovereignty and core freedoms and become slaves of the platforms, morphing our behavior to their demands and facilitating the creation of mass surveillance systems, as happened with WeChat and other platforms in China.
There is a hot new area in the Big Data world for querying static databases. In AWS it is Athena, based on the Apache Presto engine and SQL SELECT. CSV files (or files in JSON, Parquet, ORC, or Avro formats) are shoved into S3 and then queried by serverless applications and by Business Intelligence packages like Tableau.
This requires no database servers and shows where Hyperbee can be very useful.
A database with its data stored in S3 (or the like) is very tricky to make performant. AWS Athena uses the specific formats above, helps split large files into smaller chunks, has tools to prepare columnar indexes for each S3 object to enable search queries, adds extra metadata to help decide which S3 object contains the right subset of data for a query, and provides a farm of servers that queue and execute your queries. S3 Select provides a simpler alternative, avoiding the server farm, but with decreased functionality.
Why is this important to understand in view of Hyperbee streaming? Because Hyperbee provides a similar capability, implemented in a completely different way. Not only does it not require servers, like S3 Select; its underlying protocol is optimized to load byte ranges from an offset, and it efficiently traverses the btree (in Hyperbee) or the trie (in Hypertrie) to avoid extra round trips when executing queries against remote data (S3 in this case). This is the so-called sparse mode in Hypercore.
What other applications can we think of that would be enabled by such a serverless DB, a DB that redefines how querying is done (via sparse data propagation), a DB that embeds a replication mechanism?
Some pointers to possible answers can be found when we look at Git, a P2P source control system that replaced SVN and CVS, which relied on a central server. Entrepreneurs wondering “Is it possible to build a big business on this?” should note that Microsoft bought GitHub for $7.5B.
BitTorrent. Hypercore can do what BitTorrent does and more. Hypercore can do discovery and accelerated file download with bandwidth-sharing like BitTorrent, that is, the more viewers watch, the better it works.
But Hypercore can do more - it is built as a data and communications framework for modern decentralized applications.
WebTorrent. WebTorrent is awesome; it pioneered BitTorrent in browsers and it is a great success, but its mission statement was just that: a BitTorrent for the Web.
Note that WebTorrent’s tech can be helpful to Hypercore, as it perfected peer discovery (via DHT) on the Web and it allowed a number of innovative streaming clients to emerge, which could be helpful for Hypercore applications, like Beaker Browser.
Secure Scuttlebutt (SSB) is a peer-to peer communication protocol, mesh network, and self-hosted social media ecosystem. Hypercore has been growing towards a more generic model of structured data (file systems, databases) synchronized over many devices.
Both are cool open source P2P data projects that have existed for roughly the same 5-7 years.
You can review Reddit discussion that makes some good points.
Some key differences, described here, are:
Feature | IPFS | Hypercore |
---|---|---|
Addressing | Content-based addressing | Public-key based addressing |
Addressing stability | Dynamic address, changing as block changes | Static stable address |
Addressing granularity | each block has unique address (block’s hash) | Address (pub key) corresponds to a collection of objects / files, each consisting of many blocks |
Filesystem | Low. Files are composed of lists of linked blocks, with no notion of file directories or metadata | Great. Files are managed as a full-blown filesystem, with a stable API and a daemon that provides a REST API and a POSIX-compliant OS extension, so it appears to users as a regular folder |
Storage size efficiency | Great. The same block can be reused on your machine, even between files. This is usually called deduplication, or dedup. But when a block changes, the old block remains in storage. | Low. No block-level dedup. A change in one byte creates a new version of the file. File-level dedup can be achieved with an additional management layer, called corestore. |
Mutability | Low. Originally developed for static content, but is being re-designed now. Specifically, the IPNS component attempts to provide stable hash-based address for the file (the tip of the list of blocks), but it needs to be refreshed periodically. New project IPLD is in development and aims for structured data, such as adding primitive data types, maps, lists | In Hypercore editable content was a prime design objective, supported by the internal data structures, its protocol, Change Data Capture system, APIs, etc. Structured data are supported by Hypertrie and Hyperbee |
Hyper-linking | Yes. Formal specification, called CID. | Partial. URLs work in Beaker, Agregore and Gateway browsers, but a formal definition of links for structured data is still in the works |
Human-friendly naming | No. IPNS is not human-friendly. | No. Hypercore’s hyper:// URLs include a hash and are not human-friendly |
Databases | TBD. OrbitDB, ThreadDB, AvionDB | Two variants: a key-value store (Hypertrie) and a LevelUP-compatible DB (Hyperbee). Embedded DBs that do not require a DB server. Provides a unique streaming capability to greatly improve storage size and startup time. Using Hypercore data structures, the community has produced the replicated databases KappaDB and multi-hyperbee |
Availability | Fairly high. The IPFS community sponsors hosting of content, there are also commercial providers like Infura and Textile, and the Filecoin protocol rewards hosting | Low. See for example the Our Networks page referring to both IPFS and Dat URLs, where the Dat URL does not open. Same here. Homebase attempted to provide hosting. The Datdot project is developing a blockchain-based reward system. Perhaps we need a breakthrough hosting model here; could VPS-made-simple be it? |
Startup speed | Slow | Great. Hypercore provides sparse replication for any type of data, that allows immediate streaming of both structured data and media content. |
Discovery | Uses a DHT. Avoids dependency on the more centralized DNS system. IPFS uses a DHT entry for every single data chunk globally, which works great for dedup. However, this architecture creates an enormous overhead of DHT traffic compared to other protocols. It also fails to benefit from the knowledge that peers who have one chunk of the repository you’re interested in are likely to also have more chunks you’re interested in. | Uses a DHT (via the Hyperswarm service). To avoid overloading the DHT, a topic in the DHT is usually a whole Hyperdrive with potentially millions of files; to find a file via the DHT, the URL becomes drive/file. In addition, Hyperswarm provides a flexible mechanism to design your own DHT-based discovery system, e.g. to discover communities, teams, people, etc. |
Directory structures and file metadata | Somewhat. IPFS simulates directories by creating files with links to other files. | Full. Hypercore does full POSIX-compliant file system emulation and therefore can be mounted natively (via FUSE) to be viewed in File Explorer, Finder and to be used from the command line as a normal file system. |
Some notes on IPFS goodies:
Review these and structure their content for the summary below: https://medium.com/decentralized-web/comparing-ipfs-and-dat-8f3891d3a603 https://blog.cloudflare.com/e2e-integrity/ https://docs.ipfs.io/concepts/usage-ideas-examples/#usage-ideas-and-examples
Rough outline:
Both blockchain and Hypercore provide verifiable data structures. But blockchain is limited to very small storage, and Hypercore is limited by the absence of public timestamping for its data history and the absence of verifiable computations. See further on this in the section “Can Hypercore’s author change history”.
Here are the cases where blockchain and Hypercore already help each other:
This use case is theoretical but is plausible:
A light client to a blockchain full node, using streaming Hyperbee, to avoid dependence on centralized infrastructure like Infura. Most blockchain full nodes support a very limited number of queries by default. Many apps need a lot more, and for that infrastructure players, like Infura, have created a SaaS infrastructure. But this re-centralizes the decentralized P2P setup and re-inserts trust in a specific company.
Yes. Community is very active and helps newcomers and developers building on Hypercore. Join it on Discord, open issues on Github, and follow core developers on Twitter @mafintosh, @pfrazee, and @andrewosh.
It is a fact that Hypercore is 7 years old and still has no runaway apps built on it. So what gives, if it is so amazing (and it is)? Here is my take, aside from the general statement that making a P2P framework work smoothly is super-hard:
Many P2P apps struggle because they lack availability and durability and must work in unforgiving networking environments.
Availability. For example, in a P2P collaborative editing app competing with Google Docs, once you close your laptop, your collaborators can’t get your latest content, unless they were online when you made the edits. With Google Docs, if you had a connection at the time of the last edit, the changes are available to others even if you went offline right after. This is especially important for teamwork across time zones. So some master nodes that “seed” the content are always needed in P2P applications (e.g. Hashbase), but these so-called super-nodes often re-centralize things and introduce challenges for permissioning, data sovereignty, and data privacy. The availability problem remains unsolved.
Durability. We are spoiled by Google (and others) taking care of preserving our content. We pay a steep price, giving them everything about us, but this convenience is very hard to achieve in the P2P world. Your peers may be good friends, but there is no guarantee they will not lose your precious content. Many solutions are being tried, including those using cryptocurrencies to incentivize users to keep content, but they all have technical and convenience frictions. Besides, who wants to be responsible for disseminating potentially illegal content? The durability problem remains unsolved.
Networking. The current Internet, with its routing and firewalling systems, is just hostile to P2P connections. Although Hypercore’s Hyperswarm offers ingenious NAT hole punching, there are too many edge cases: it does not work on mobiles, needs workarounds in browsers, and is often blocked by VPNs and corporate firewalls. This does not mean it can’t be used; we just need a fallback to a trusted server acting as a proxy. But this comes at the price of decentralization. Besides, if a user wants to send data to someone else, both devices need to be online simultaneously.
Reliable P2P networking remains unsolved.
Is there an answer to these perpetual problems of P2P? We believe there is. In the crypto world, the answer was found with the notion of miners. This is why some P2P projects are attempting to repeat this approach by introducing their own blockchains. IPFS team’s Filecoin, Storj, Theta.tv and a number of others are examples. But they are all focused on data storage.
Hypercore is so much more. It is a foundation for apps, it is made for storage, content distribution, messaging, decentralized databases, etc. And it feels like it is a good match for analytics and AI as well (more on that later).
Perhaps the answer to the perpetual P2P reliability problems is not in copying blockchain’s mining model or just offering crypto-incentives to host files. Maybe the answer is orthogonal: instead of looking to incentivize third parties to keep our files, we could do it ourselves, with an always-available cloud peer, a companion to the sometimes-available personal devices we already own.
Viewed this way, the cloud peer is not a hosting provider; it is just a different type of personal device. It does not have a screen, but it is capable in a different way: it complements our other personal devices with its 100% availability, durable storage, and elastic / expandable compute and data store.
Taking it a step further, the cloud peer could be a place to run many Hypercore apps that can’t run on personal devices. Think cloud app store, unencumbered by the domineering Apple and Google app stores.
This will make Hypercore shine!
Both classes of P2P systems, blockchains and P2P data are nudging towards mass market adoption. Cryptocurrencies made huge strides in creating a new foundation for global financial system, especially evidenced by the rise of DeFi in 2020.
In addition, many new projects are employing tokens as incentive mechanisms to avoid the points of centralization that exist today. Examples are VPN (Orchid), routing (PKT + CJDNS), social media (Steem), live streaming (Theta.tv), Web browsing (Brave), and storage (Storj, IPFS Filecoin). Several high-profile projects were shut down (Telegram TON) or have hit high resistance from governments (Facebook Libra).
Crypto-currency P2P projects are a huge step ahead of data-centric P2P projects like Hypercore and IPFS, as they have found their native hosting model, in the form of miners. This makes them independent of the Cloud providers, which is essential for their survivability in the face of regulatory scrutiny.
P2P data projects do not experience such resistance, but they have not invented their own sustainable infrastructure. IPFS is looking to Filecoin to incentivize storage providers, while subsidizing hosting via ipfs.io in the meantime.
Hypercore community has produced Hashbase, Homebase, DatDot and other hosting solutions, but they have not reached significant maturity and adoption. Personal cloud peer could be an alternative to the aggregated hosting model.
Almost every P2P project is still largely held back by numerous overlapping infrastructure needs for this novel tech to hit a wider market. E.g. one common problem is the management of the ownership keys.
One difference between Hypercore and IPFS is that Hypercore has moved relatively slowly with its marketing. The core team chose to patiently and somewhat stealthily build the foundational technology and avoid starting the “hype cycle” until it is ready for prime time. Initial releases of Hypercore (then called “Dat”) in 2016-2018 had scaling issues, which have been addressed by subsequent releases.
Specifically, the Hypercore team focused on the performance of its unique “streaming database” design (see below). The team is preparing for a marketing push at the end of this year (2020), starting with the Beaker ecosystem (Beaker is the Web and P2P browser and authoring platform built on Hypercore).
Each project building on Hypercore stretches Hypercore’s flexibility and contributes back solutions that are not yet available in the core. The Hypercore team then generalizes them and makes them available to everyone. See some of the projects and their notable contributions:
Hypercore goes to great lengths to provide data integrity. For that it uses a Merkle tree, hashing into it each block that is added to the append-only log. On every change, the root of the Merkle tree is signed by the private key of this Hypercore (note that this also creates the single-writer limitation; see later how it is overcome). When a Hypercore is shared with another peer, Merkle branches make it possible to prove the authenticity and integrity of a subset of blocks without sharing the whole Hypercore. This allows accepting partial data from untrusted peers (as they can’t fudge the data). This capability supports a number of potential use cases, like distributed caching and CDNs, bandwidth sharing, distributed file systems, streaming databases, audit trails and supervision protocols, etc. See an interesting discussion in which Hypercore integrity guarantees were challenged and defended.
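To make the Merkle-branch idea concrete, here is a minimal Python sketch: a peer holding only one block, plus the sibling hashes along its branch, can recompute the root and verify the block against it. It uses BLAKE2b, as Hypercore does, but this is a generic binary Merkle tree, not Hypercore's actual flat-tree layout, so treat it as a conceptual stand-in:

```python
import hashlib

def leaf_hash(block: bytes) -> bytes:
    # Domain-separate leaves and parents, as real Merkle trees do
    return hashlib.blake2b(b"\x00" + block, digest_size=32).digest()

def parent_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()

def merkle_root(blocks) -> bytes:
    level = [leaf_hash(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])        # pad an odd level
        level = [parent_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(blocks, index: int):
    """Sibling hashes needed to recompute the root from block `index`."""
    level = [leaf_hash(b) for b in blocks]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))   # (hash, is_left)
        level = [parent_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_block(block: bytes, proof, root: bytes) -> bool:
    h = leaf_hash(block)
    for sibling, is_left in proof:
        h = parent_hash(sibling, h) if is_left else parent_hash(h, sibling)
    return h == root
```

In Hypercore the root is additionally signed with the writer's private key, so a verified branch authenticates not just the data but also its author.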
The append-only log also allows recovering the state of a Hypercore at any prior point in time, a highly desirable function in databases. It allows preserving Hypercore backup snapshots at a particular point in time.
In addition, Hypercore supports versioning of data elements, a capability highly sought after in enterprise systems. Versioning helps protect data from accidental overwrite by a human or by a broken or malicious program. It also provides auditability and regulatory compliance.
An actor could decide to revert a Hypercore to a previous state and share this fork. This could also be used in an attack in which the attacker aims for the initial data to be deleted or overwritten in backups. Another possibility is for the author to rewind and serve different versions of the history to different peers. See a community discussion on this subject.
The required protection can be achieved by sealing Hypercore’s root on a public blockchain, utilizing its immutability and secure timestamping properties. Hypercore also does not guarantee long-term write-once storage. See the explanation of how audit trails benefit from such services added on top.
When the Hypercore signing key is rotated with multi-key, we need a proof that the new key is a valid successor of the old one. Different applications might use different algorithms for such a transition, and recipients of the hypercore need a way to verify that the code for this algorithm was not altered and was executed properly, without running the code themselves. Smart contracts are one way of doing this, and Zero Knowledge provable computation is an emerging new option.
Why can’t recipients run the code themselves, like they do when verifying Merkle tree and signature in Hypercore today? Because the key rotation algorithm may involve processes that recipient can’t repeat, like contacting a 3rd party for key recovery, or not having access to some private data that was used by the algorithm’s code, but can’t be shared. For reference on Zero Knowledge see this question on StackExchange.
Need help with this.
No. But a community solution and other open source projects exist that can possibly be adapted.
Note that corestore makes this easier, as it introduces a master key (and deterministically generates the keypairs for the Hypercores it manages). It is much easier to manage one key than many, one per Hypercore.
Key recovery is an essential need for any P2P application, as it is for Bitcoin, since the user can rely only on themselves for key management.
Community solution: splits a secret into N parts and allows restoring it with M of N parts
A full-blown framework for this exists, called Dark Crystal
A number of implementations of Shamir secret sharing in JS exist
For reference, see how open source app Consento does it.
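The M-of-N splitting that the solutions above implement is Shamir's secret sharing: the secret becomes the constant term of a random polynomial of degree M-1, each share is a point on that polynomial, and any M points recover it by interpolation. A minimal Python sketch over a prime field (illustrative only; real implementations like Dark Crystal split byte strings and use hardened, constant-time arithmetic):

```python
import random

# A Mersenne prime large enough for small integer secrets
P = 2 ** 127 - 1

def split_secret(secret: int, n: int, m: int):
    """Split `secret` into n shares; any m of them recover it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(m - 1)]
    def poly(x):
        acc = 0
        for c in reversed(coeffs):        # Horner evaluation mod P
            acc = (acc * x + c) % P
        return acc
    return [(x, poly(x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation at x = 0 gives back the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret
```

With fewer than M shares, the interpolated constant term is uniformly random, which is why losing a few shares leaks nothing about the key.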
Yes, for ephemeral session encryption keys. No, for the Hypercore log, but it can be added on top with the help of the Hypercore-multi-key module, which allows switching to a new keypair. It is your responsibility to sign the new key with the old one to establish secure continuity, and to verify this signature on receiving nodes to prove the legitimacy of the key rotation. Perhaps this can be added as a hypercore extension?
Yes. Each Hypercore feed has a corresponding public / private key pair.
Yes, but it has limitations:
The level of granularity is a hypercore. For example, if you would like to give access to a file to one person and not to another, you need to put that file in a separate hypercore. This works for a small number of files and is the approach used by Hypermerge for Pushpin. Note that you start hitting performance limits when too many hypercore file handles are in use, as well as limits on replicating many hypercores between nodes.
Flexibility. Access control is based on revealing the public key of your hypercore to a peer who needs access. Since the public key is baked directly into the hypercore data structure, once it is revealed, there is no taking back the access.
See a community discussion on this subject, with one idea being to use per-file encryption. You can replicate all of the hypercore but have separate keys for individual records or files. This fits project management apps and small-team collaboration with lightweight documents, but is not suitable for large files.
Need help on this.
Yes. Managed by Corestore, or by the community-provided multifeed.
Need help with this.
The URL looks like this: `hyper://<public-key>[+<version>]/[<path>][?<query>][#<fragment>]`
where `public-key` is the address of the hypercore feed, `version` is an optional numeric identifier of a specific revision of the feed (also called index or seq, a block number in the append-only log), and `path`, `query`, and `fragment` are akin to HTTP URLs (though `query` has no defined interpretation). The formal schema is defined by this specification. The Beaker browser is the primary place such URLs are used.
There is a proposal for Strong linking, which would add a `hash` to the URL. This is the hash of the hypercore at the specified `version`, locking down the history of the hypercore at that version.
To understand why this link is strong, you need to know that every time a new block is added to a hypercore feed, a new root hash covering all the data in the hypercore up to the current position is calculated and saved in the hypercore (this hash is also referred to as the Merkle tree root hash, or just the tree-hash). Now, should the author, by mistake or intentionally, rewind their hypercore to `version - 1` or earlier and fill it with different data, the hash of the hypercore at `version` will be different and will no longer match the hash given in the URL.
Note that the hypercore can and will continue to be appended to after the point referred to in the URL. This is normal and fine, and the URL will continue to work, referring to a particular point in time in the data history. But this does not mean that the strong URL creates an obligation for the author not to rewind the hypercore. It only means that such a change can be discovered by whoever has received the strong URL.
Applications can use this module to construct and verify strong links.
Use cases for strong links include listing files in a module’s manifest, cross-linking between JSON attributes in Hyperbee object and a file in Hyperdrive (e.g. for a file attached to an email).
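The detection logic behind strong links can be sketched in a few lines. Here `tree_hash` is a chained-hash stand-in for Hypercore's real Merkle tree root, and the link carries only a version and hash (the public key is omitted for brevity):

```python
import hashlib

def tree_hash(blocks) -> str:
    """Chained-hash stand-in for the Merkle root over the given blocks."""
    h = hashlib.blake2b(digest_size=32)
    for block in blocks:
        h.update(hashlib.blake2b(block, digest_size=32).digest())
    return h.hexdigest()

def strong_link(log, version: int) -> dict:
    """Pin both a position and the content up to that position."""
    return {"version": version, "hash": tree_hash(log[:version])}

def check_strong_link(log, link) -> bool:
    v = link["version"]
    return len(log) >= v and tree_hash(log[:v]) == link["hash"]
```

Appending new blocks after the linked version keeps the check passing, while any rewrite of the history before it is detected, which is exactly the guarantee described above.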
You can create conflicting forks of a hypercore log by first copying the hypercore feed directory to another machine along with its private key, and then writing into the hypercore copy while making updates in the original. There is an idea of how to address this with self-healing hypercores.
If you rewind the feed that is replicated, replicas will stop syncing.
Yes. See Hypercore archiver as a starting point, but more work is needed:
Backup to S3 is not supported yet. The underlying module does not have the write method implemented yet. This is work in progress, tracked by this issue.
Hypercore requires the underlying transport to provide the following guarantees:
These are satisfied by TCP, WebSockets, WebRTC, QUIC, and uTP.
Hypercore itself adds:
1) Yes for Hypercore and 2) no for Hyperswarm.
Hyperswarm uses uTP over UDP to connect to DHT nodes and for NAT traversal (hole punching). It can use the Noise protocol, but doesn’t today, as it would cause extra round trips (RTTs). Hyperswarm also does not authenticate peers. Note that TLS 1.3 achieves 0-RTT on re-connections, and QUIC achieves 0-RTT on the first connection. See CloudFlare’s analysis and mitigation of replay attacks for these 0-RTT protocols. Cloudflare open-sourced a QUIC / HTTP/3 implementation in Rust (so it may be able to run in WebAssembly in Node and the browser). See the alternative 0-RTT in Noise, which removes the dependency on SSL certificates.
Hypercore uses the Noise protocol for authentication and encryption. Noise is a protocol designed as part of Signal Messenger and is now used by WhatsApp, WireGuard, Lightning, I2P, etc.
A new channel is opened for each Hypercore, and multiple channels share the same connection, which is great. What is cool is that, with the help of Noise, each channel gets its own encryption, and keys are rotated to achieve forward secrecy (an attacker who cracked this session’s key will have to start over for the next session).
Note, as always with end-to-end encryption, you need to watch out for the cases when you introduce a proxy in the middle, for example to deal with overly restrictive firewalls. The best approach is for the Proxy to be blind, just passing encrypted streams between peers.
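The forward-secrecy idea, rotating keys so an old session key cannot be derived from the current one, can be illustrated with a simple one-way hash ratchet. This is a toy sketch of the general principle; Hypercore's actual rekeying comes from the Noise handshake, not this construction:

```python
import hashlib

def ratchet(session_key: bytes) -> bytes:
    """One-way step: derive the next session key, then discard the old one.
    Recovering an earlier key would require inverting BLAKE2b."""
    return hashlib.blake2b(session_key, digest_size=32).digest()

k0 = bytes(32)        # initial session key (toy value)
k1 = ratchet(k0)      # key for the next session
k2 = ratchet(k1)      # an attacker holding k2 learns nothing about k1 or k0
```

Each peer deletes the previous key after stepping the ratchet, so compromising the current key does not expose past traffic.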
Normally, updates are pulled by the peers. The protocol supports pushing data as well, but this is not exposed in the API today.
Hypercore is not like Kafka, which is one big log. With Hypercore you usually have many Hypercore logs. So you need a way to manage them and discover what hypercores other people have shared with you.
The bootstrapping mechanism for this is to find peers via Hyperswarm. But that is not enough, so several discovery systems were designed; the main one is corestore. A simpler one is multifeed, created by the community, but it assumes all feeds are public.
Yes. Hypercore is transport-independent. One can use TCP/IP, WebRTC to peers, WebSockets to server.
Not directly, but community solutions exist. See the issue for this.
It is a hard problem, note that WebTorrent works in the browser, but “DHT in browser”, even after 7 years of discussions, is still not realized.
Current solution, advised by Hypercore team, uses 2 servers for signaling.
Summary of a problem and an alternative solution:
No UDP in browsers. Other transport protocols create connection establishment delays (extra round-trips), which make DHT too slow to be practical. Delegation to signaling servers challenges privacy.
Corporate firewalls may block UDP. Although the hope arises with QUIC / HTTP/3 gaining traction as it is using UDP on TLS port 443.
No peer discovery on Cell Phone networks. Cellphone networks employ symmetric firewalls that block direct P2P connections (although UDP works, NAT hole punching does not). This affects mobile apps and PCs on HotSpots. With 5G proliferation, more applications operate on cell networks, making progress for direct P2P connections unlikely.
DHT state needs stability. Peers that come and go (browser tabs) lose DHT state and need to recreate it (although this can be overcome with caching state in browser’s database). Peers that change their IP address too often, destabilize DHT. This is the case of cell networks.
Porting to Web and mobile environments. See a number of issues still pending resolution to make Hyperswarm and Hypercore work in react-native. These problems can be solved.
Yes. The protocol is formalized with protobuf and supports defining extensions.
See community video that explains the Extensions system. Community projects like Cobox and others are using it already.
Possibly useful are abstract-extension and hypercore-extension-rpc.
Yes, offered by community solutions. You will need to explore their limitations. See some below:
hypercore-encrypted, a wrapper around hypercore.
Need help with this.
Hypercore uses the Noise protocol, implementing the Noise_*_25519_ChaChaPoly_BLAKE2b handshake, meaning Curve25519 for Diffie-Hellman key exchange, ChaCha20-Poly1305 for AEAD encryption, and BLAKE2b for hashing.
Handshake and transport encryption are documented by @Frando as part of his implementation of Hypercore in Rust.
Yes, offered by community solutions. The above terms refer to an encrypted replica kept by a friend or a service provider, like SpiderOak, which can’t be read by them.
Current solutions are provided by the community:
No. Erasure coding recovers data from a subset of the overall set of replicas, and is especially important when hosting across many providers, where none of them holds a full copy of the data, so even if the encryption is cracked they cannot read the data.
Open source S3-compatible object storage, e.g. Min.io, has erasure coding. Cloud providers sometimes offer a virtualized file system over multiple replicas: open source Ceph offers it, and so does AWS with EFS (note that Ceph is not easy to manage).
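For intuition, here is the simplest possible erasure code: a single XOR parity block over two data blocks. Production systems like Min.io use Reed-Solomon codes that tolerate many simultaneous losses; this toy tolerates the loss of any one block out of three.

```javascript
// XOR two equal-length buffers byte by byte.
function xorBlocks (a, b) {
  const out = Buffer.alloc(a.length)
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i]
  return out
}

const block1 = Buffer.from('hello wo')
const block2 = Buffer.from('rld....!')
const parity = xorBlocks(block1, block2) // stored on a third provider

// The provider holding block1 disappears; recover it from the other two.
const recovered = xorBlocks(parity, block2)
console.log(recovered.toString()) // 'hello wo'
```

Note that no single provider holding one of the three blocks can reconstruct the full data on its own.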
Somewhat: you can clear() your content locally, but if someone has already replicated it, you can’t force them to clear it. Internal data-integrity records are still kept, but they do not leak any data (the Merkle tree hashes are retained, so you can keep appending data to your log even after clearing the contents). Use cases:
Use cases for embedded replicated streaming DB are plentiful.
A database that syncs automatically between all personal devices, but without the help of Apple, Google or any other central provider. For the Cloud this could be a serverless personal-use replacement for AWS DynamoDB (Azure Cosmos, etc.), while providing complete isolation of data in a multi-tenant execution environment. Some use cases:
Note what Bitfinex is doing with trading data and signals (see above), and extrapolate it to other types of structured data.
Hyperbee could very well be a killer app for accessing blockchains.
One of the persistent problems with blockchains is that mobile and web applications have to rely on full-node servers as trusted gateways, a contradiction to blockchains’ trustless value proposition. SPV wallets were designed to solve this problem, but they are not so lightweight and are anemic, as they can’t answer all the questions client apps have (see https://www.reddit.com/r/ethereum/comments/avk7ew/is_spv_of_eth_value_transfers_possible/ehg5wud/).
This led to the emergence of services like Infura. Hyperbee could be more lightweight and more flexible than SPV. The SPV protocol is usually confined to specific proofs, e.g. that a transaction was included in the blockchain, but it can’t answer queries like ‘show me all transactions involving a specific blockchain address’. Hyperbee can run arbitrary queries against the blockchain node (provided indexes were added). To avoid trusting one Hyperbee, imagine a number of independent Hyperbee providers that all return chunks of data for the same query, a core capability of Hyperbee. With this we have restored trustless access to blockchains.
Hyperbee uses Hypercore as an underlying storage and a replication mechanism. The cool thing is that one replication stream can carry many Hypercores, which can carry Hyperbees, Hypertries, and Hyperdrives.
To manage multiple hypercore feeds, with permissions, use the corestore.
Hyperbee, like any other Hypercore-based data structure, is single-writer. That means when it is replicated, it is replicated as-is, and eventually reaches the same state. See multi-hyperbee, which builds on top of hyperbee and offers consistency in a multi-master scenario (each node / peer making changes to objects in the same multi-hyperbee, even simultaneous changes to the same object).
Yep. It needs to be wrapped in Hyperbeedown and fed into LevelUP.
This is awesome as there are many databases that work on top of the LevelUP API exposed by LevelDB. One example is AWS DynamoDB emulation on top of LevelDB. See its replacement with Hyperbee. Some tests are still failing, but it is getting there.
Hyperbee is still in Alpha, but perhaps we can stress-test it on loading the whole of the Ethereum blockchain and indexing it in different ways. This Hyperbee could provide a valuable service to the community. We could even put its snapshots in S3, or IPFS for that matter, and let it be streamed. Note that Google BigTable provides this service.
There is also a number of benchmarks for LevelDB (e.g. here) that the community can help run against Hyperbee, since Hyperbee is LevelUP-compatible.
We need your help!!
Hyperswarm is a key element of the Hypercore system: it discovers the network addresses of peers by topic name using a DHT.
Hyperswarm also allows a peer’s network address to be discovered on the local network (LAN) via mDNS broadcasts. mDNS is the protocol used by Apple Bonjour for AirDrop and is standardized in RFC 6762.
Ideas that fit Hyperswarm’s mission to help discover peers and connect to them without using any servers:
To be precise, the DNS system has another function which Hyperswarm does not replace: providing a friendly, recognizable name for an IP address. Today we register this so-called domain name via a domain registrar, a commercial entity that works with the root registry (.com, .io) to rent out domain names. This part is very hard to decentralize, and a DHT does not help here. Namecoin was the first to solve the Zooko’s triangle puzzle, and the Ethereum ENS smart contract is now well on the way to being adopted as a decentralized solution to this problem.
Avoid central signaling servers. For example, a video chat over WebRTC requires a STUN server; with Hyperswarm it is avoided, increasing privacy and removing the dependency on service providers.
Connect to peers sitting behind firewalls, such as home routers, which otherwise can’t connect directly to each other. This can be used for video chats or any other P2P traffic (Hyperswarm’s huge value here is so-called NAT hole punching; the algorithm is in the DHT-RPC package). Keep in mind this does not work on mobiles (and behind some corporate firewalls), and requires a fallback to a relaying proxy (e.g. this post says 30% of P2P connections need a TURN proxy).
Relaying proxy is a potential loss of privacy point. What if we could use a personal cloud peer, not a 3rd party service as such a proxy?
Server-less Contact Tracing on DHT. See this idea described in detail in this paper.
Hyperswarm is also a Publish Subscribe system, in a way. Need help on this.
Need help with this.
No. Once you write your first Hyperswarm app and print all the new peers joining it, you will likely notice all kinds of peers that have nothing to do with you. Who are they? For a website and public media like Twitter or a YouTube-style application, this is totally fine. But in a security-focused application you might get concerned.
Any hypercore-savvy person will argue that this is ok, as you will not be sharing any data: to access the Hypercore feeds one still needs to know their publicKey. But the fact is, you still need to connect to all peers to figure out whether you even want them. This is not efficient and can present surveillance challenges.
This can be very useful:
To know if peers are readers or writers, or filter them out with some cryptography-based primitives, and avoid connecting to those that you do not trust.
Load balancing between peers. It would be ridiculous for the Router design to expect the Router to connect to peers just to determine whom to forward a request to.
Can Sybil attacks and DDOS on the DHT, mentioned in the Hyperswarm blog, be prevented if the DHT itself could be selective about peers?
No. Let’s explore what is revealed. Hyperswarm announces the IP and port of the peer to allow other peers in the P2P network to connect to it. Hyperswarm’s DHT holds that data, so any observer could simply collect this information. The observer will also learn the topic this peer is advertising. Aside from that, no other information is leaked. Is it worse than DNS? In DNS, servers also announce their name and address to the world, but clients do not, while in Hyperswarm they do. On the other hand, the topic name is more private than in DNS: it is just a hash, not a human-readable name.
So what can be done to protect IP addresses in DHT?
Hyperswarm can be improved to encrypt data in DHT, and this way only the peers that know some shared secret could find each other.
Potentially I2P can be used in the future.
Yes. One potential approach is to have Hyperswarm peers sign data in the DHT, and refuse to accept unsigned data. Other measures could include the approach used by Bitcoin: proving that you have spent some CPU time (e.g. 3-5) when announcing a topic in the DHT (a crypto-puzzle). This is an area of active research.
Resilience to DDOS could be enhanced by creating a large network of provably legitimate DHT nodes.
Hyperdrive provides many of the hard-to-create components needed to replicate the functionality of Dropbox and Google Drive. Beaker Browser adds a UI on top.
Hyperdrive is a library but can also run as a service that is accessible via an API and can show up as a normal directory on your disk (this part works on MacOS and Linux, with Windows in the works).
Dropbox, Google Drive, etc. alternative without a central server. These systems are used by millions of teams and everyone’s privacy is compromised. In addition, Hyperdrive adds magic powers of media streaming and bandwidth sharing with peers (Hyperdrive is helped by a companion Hyperspace service (daemon), which runs like a Dropbox service in the background).
Distributed, replicated file system: an alternative to NFS, Samba/CIFS or sshfs. A distributed file system is an essential component of Cloud services; e.g. many serverless applications can’t be built without one. Hyperdrive could provide better isolation of personal data in a multi-tenant Cloud environment.
A building block to create a real alternative to Object Storage (S3).
Distribution of software and large datasets to / from / between Data Centers, as described in older eBay paper, and any case of Big Data fan-out.
The underlying mechanism is built into Hypercore and works for all data structures that use it: Hyperdrive, Hypertrie and Hyperbee. You can share a read-only version of your whole hypercore with others by giving them the public key of the Hypercore.
Hyperdrive itself is actually 2 hypercores: one for the directory structure and metadata, and one for file content. So to share it you use the above URL (need confirmation for that).
Hyperdrive also supports mounts, which allow you to include other people’s drives under your own Hyperdrive as folders. Mounts are still read-only, but this allows people to continue editing files on their own Hyperdrives, and all the people who mounted them will see the updates in real time.
Hypertrie also supports mounts, which enables a key-value store shared by a whole team. Mountable Hypertrie is actually what Hyperdrive uses underneath for mounts.
There are no inherent size limits. As a demo, the Hypercore team put a complete Wikipedia mirror with tens of millions of files on Hyperdrive, and it reads very fast.
The whole of Wikipedia was loaded into Hyperdrive and it provides decent speed for finding articles. This also stressed Hypertrie, as Hyperdrive uses Hypertrie for managing the file system directory structure and file metadata.
Need help with this: what other public datasets would be good to load into Hyperdrive? How about the whole of the Web?
Hyperdrive provides many key primitives (lego blocks) needed in distributed systems. But it lacks certain others that you will need to build yourself for a full P2P application, and to avoid frustration it is better to be aware of them upfront.
To reach the same state, peers in distributed systems need to synchronize clocks in the face of computer time drift and network disconnects from other peers. Typical solutions include generation / causal clocks, and newer hybrid clocks like HLC (Hybrid Logical Clock).
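A minimal HLC can be sketched in a few lines. This hypothetical class is for illustration only: the timestamp is a (wallTime, counter) pair that never goes backwards, even if the machine’s clock drifts, and that folds in timestamps received from remote peers.

```javascript
// A minimal Hybrid Logical Clock (HLC).
class HLC {
  constructor (now = () => Date.now()) {
    this.now = now
    this.wall = 0
    this.counter = 0
  }
  // Called for a local event or before sending a message.
  tick () {
    const phys = this.now()
    if (phys > this.wall) { this.wall = phys; this.counter = 0 }
    else this.counter++ // clock stalled or moved back: bump the counter
    return { wall: this.wall, counter: this.counter }
  }
  // Called when receiving a remote timestamp.
  receive (remote) {
    const phys = this.now()
    const wall = Math.max(phys, this.wall, remote.wall)
    if (wall === this.wall && wall === remote.wall) {
      this.counter = Math.max(this.counter, remote.counter) + 1
    } else if (wall === this.wall) this.counter++
    else if (wall === remote.wall) this.counter = remote.counter + 1
    else this.counter = 0
    this.wall = wall
    return { wall: this.wall, counter: this.counter }
  }
}

// Even with a frozen physical clock, timestamps keep strictly increasing:
const clock = new HLC(() => 1000)
const a = clock.tick() // { wall: 1000, counter: 0 }
const b = clock.tick() // { wall: 1000, counter: 1 }
console.log(b.counter > a.counter) // true
```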
Any volunteers to help us build it?
Devices have different storage capacities (cloud vs mobile), storage durability (e.g. browser vs desktop app vs cloud), and networks (fast, metered, capped, etc.). CPU and RAM capacity might also be factors. Replication and storage-management algorithms might take all of the above into account. For example, when sharing media from a mobile device, the replication algorithm should upload each block only once, to the peers with a better connection.
Apps have a need to understand each other’s data. Data modeling emerges as a necessity as automation needs arise, as AI needs to know what data it is trained on, and as searching in a database needs a guiding UI. If that does not happen, data models get buried inside the apps. Data models become a top priority in systems that allow users to interact with the data directly. Hypercore leaves this area to what it calls “userland”.
Full apps will need some form of identity management. Hypercore provides the basic elements, a keypair per core (and in corestore a master key, with generated keys per core), but the identity of a peer is much more than the identity of a core.
Hypercore is engineered as a set of small single-purpose primitives (lego blocks) meant to be highly composable. This is the methodology used by the Linux community: it allows a simple mental model of the building blocks, and makes it possible to create purpose-built systems. This is the opposite of systems that attempt to serve many use cases upfront and over time become very hard to manage and secure. An example is OpenVPN, which is now being replaced by Wireguard and a family of single-purpose modules built around it. In Hypercore this approach is especially evident in the case of multi-writer.
Hypercore, and data structures on top (Hypertrie, Hyperbee and Hyperdrive) are single-writer primitives. This means only one private key can have access to write into each.
But when a hypercore is replicated to personal devices (phone, tablet, PC, cloud peers), each device needs to have its own private key, which means multiple writers now need to write into hypercore data structures. The same need arises when you want to collaborate with peers on shared documents, files and databases, as you want peers to edit the same objects and search across them.
To support such use cases multi-writer modules can be composed on top.
*Note that supporting multi-writer in core modules has been requested many times, but it turns out one size does not fit all. HyperDB is an abandoned multi-writer database that became too complex as it tried to provide discovery, networking, authorization, conflict resolution, etc. in one package, serving many masters and satisfying none.*
So simple compositions that are themselves composable are a better approach; see below:
Allows several nodes to share and write to the same drive. This is useful for multi-device support or in a team. It supports not just one, but a set of shared drives. At the moment it provides simple last-write-wins (LWW) conflict resolution. It scales well on writes, same as hyperdrive, and adds a fairly small performance penalty on reads (which grows O(n) with the number of drives). Drives are added to the shared set via an API at start. Multi-hyperdrive is network-agnostic. No authorization mechanism for individual files is provided.
Note the difference with Hyperdrive mounts. Mounts allow read-only access to a peer’s drives, while multi-hyperdrive allows both sides to write. With mounts, the path to a file changes, with the mount point prepended to it: e.g. a drive with path `/parlor` mounted at `/fred` will need to be accessed via `/fred/parlor`. Multi-hyperdrive keeps the path the same, which is more natural, but it has a downside: if you are sharing between your own devices this is perfect, but if you are sharing in a team, the directory path `fred` may already be used by someone.
Builds on top of multi-hyperdrive and allows adding / removing shared drives (writeable bidirectionally) to / from the set.
A single replicated hyperbee (not a set), with no discovery, network-agnostic, and with no authorization mechanism. Provides convergence to the same state with automatic conflict resolution (CRDT), effectively creating a leaderless multi-master. Scales well on reads (same as hyperbee).
A union of Hyperbees can be easily constructed utilizing another lego block, a streaming sort-merge.
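A sketch of that idea, with a hypothetical `mergeSorted` helper rather than the actual module: given two iterators that each yield entries in sorted key order, as a hyperbee read stream would, emit the union in sorted order while only ever holding one entry per source in memory.

```javascript
// Merge two sorted iterators of { key, value } entries into one
// sorted sequence. Because both inputs are already sorted, we only
// ever compare the current head of each stream.
function * mergeSorted (a, b) {
  let x = a.next(); let y = b.next()
  while (!x.done && !y.done) {
    if (x.value.key <= y.value.key) { yield x.value; x = a.next() }
    else { yield y.value; y = b.next() }
  }
  while (!x.done) { yield x.value; x = a.next() }
  while (!y.done) { yield y.value; y = b.next() }
}

// Stand-ins for two hyperbee read streams:
const beeA = [{ key: 'a', value: 1 }, { key: 'c', value: 3 }][Symbol.iterator]()
const beeB = [{ key: 'b', value: 2 }, { key: 'd', value: 4 }][Symbol.iterator]()
const merged = [...mergeSorted(beeA, beeB)].map(e => e.key)
console.log(merged) // [ 'a', 'b', 'c', 'd' ]
```

This is composable: the merged output is itself a sorted stream, so more than two hyperbees can be unioned by chaining merges.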
Need help with this:
Cobox community has created a number of compositions:
Normally a single person will not be using 2 devices simultaneously. Yet, because of connectivity losses, changes made on each device may need to be merged. This includes documents, filesystems and databases. It becomes much more difficult in multi-user scenarios.
In distributed systems, of which P2P is a subclass, reaching the same state is a hard problem with a long history. The reason it is hard was only recently formally described by the CAP theorem. The holy grail of distributed systems is to reach the ACID guarantees of SQL databases: Atomicity, Consistency, Isolation and Durability. But SQL databases mostly operated on a single machine or on a closely managed cluster. Over the Internet, connectivity can be spotty and malicious actors abound.
Handling bad actors became a specialty of blockchains, and that was a huge win for the P2P movement. Yet, since blockchains serve as shared databases for the whole world, they come with limitations: high transaction costs, low throughput, the ability to store only minuscule amounts of data, and the inability to hold or process private data. To overcome these limitations some applications re-centralize, adding web servers, application servers and DB servers. Others try to remain pure P2P by using IPFS or Hypercore.
Algorithms that tackle bad connections and compute / storage reliability issues, but not bad actors, have evolved from the highly complex Paxos to the simpler Raft, to PBFT, and finally, in the last 5-7 years, to CRDTs. A CRDT is very lightweight and allows leaderless multi-master operation, letting each master merge edits on the edge without any central coordination. This means no operators running a central service (Zookeeper, etcd, etc.) and no complex cluster failure modes to handle. CRDTs, combined with HLC clocks, increase throughput with wait-free transaction ordering by avoiding any coordination between masters.
Note that CRDT is quietly being used by AWS DynamoDB and Azure Cosmos - and if it is good enough for those web-scale databases, it is good enough for P2P.
Here is a great introductory talk on CRDT (here is another one) and an advanced one.
For NodeJS the prime candidate is Automerge, but there are others like YJS and Delta-CRDT (please share if you know a better one). CRDT is implemented and used by OrbitDB that runs on top of IPFS.
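For intuition, here is one of the simplest CRDTs, a last-write-wins (LWW) map. This is a toy sketch, not how Automerge or YJS work internally (they implement far richer types): every entry carries a timestamp and a stable peer id, and merging two replicas keeps, per key, the entry with the higher timestamp. Because the merge is commutative, associative and idempotent, replicas converge no matter in which order updates arrive.

```javascript
// Deterministic winner: higher timestamp wins; peer id breaks ties.
function wins (a, b) {
  return a.ts !== b.ts ? a.ts > b.ts : a.peer > b.peer
}

// Merge two replicas of the map, keeping the winning entry per key.
function merge (replicaA, replicaB) {
  const out = { ...replicaA }
  for (const [key, entry] of Object.entries(replicaB)) {
    if (!out[key] || wins(entry, out[key])) out[key] = entry
  }
  return out
}

// Two peers edit the same key while disconnected:
const phone = { title: { value: 'Draft v2', ts: 5, peer: 'phone' } }
const laptop = { title: { value: 'Draft v3', ts: 7, peer: 'laptop' } }

// Whichever direction we merge, both sides converge on the same state:
console.log(merge(phone, laptop).title.value) // 'Draft v3'
console.log(merge(laptop, phone).title.value) // 'Draft v3'
```

The timestamps here are where hybrid clocks like HLC come in: LWW is only as good as the ordering of its timestamps.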
CRDT perfectly matches the multi-device and collaborative-editing use cases of P2P:
CRDTs provide new data types with magic properties that allow automatic merging of independent edits. Anyone who has sent Word documents by email to their teammates or lawyers knows the “joys” of redlining. Any developer knows the chore of merging conflicts that Git could not auto-merge. Good news for humankind: no more conflicts with CRDTs. But it also means that we need to adapt our databases to keep a history of changes, sequence them properly, use stable IDs for our peers, and coordinate clocks. Hypercore multi-writer modules will incorporate CRDTs for this by the end of 2020.
P2P needs a mechanism to match Google Docs (and Slides, Sheets, Diagrams, etc.), which allows multiple people to edit the same document simultaneously. Google Docs uses an older Operational Transforms algorithm that is highly complex and allows only 2 concurrent edits (which Google overcomes by having a central server quietly merge in the background). A special branch of CRDTs for sequences (LSEQ is one of them) was developed recently. Hypercore multi-writer modules will incorporate CRDTs for this by Q1 2021.
A personal cloud “device” is always on. This resolves a common P2P issue: you edit a document, then close your laptop or the app on your phone. A cloud peer can make your changes available to others. But consensus still needs to be reached, and without a Google in the middle.
Note that CRDT resolution works more smoothly when clocks between machines are well synchronized. NTP has existed for years, and now there is a new iteration, NTS, published by Cloudflare. Normal clocks are not enough though; causal clocks are needed too. See more on that later.
We know it is single-writer. But can the same writer accidentally corrupt the Hypercore while it is accessed from a second process on the same machine, like from a Nodejs cluster process or a Nodejs worker thread? If so, it will present a significant design challenge in a serverless environment.
For reference, note that LevelDB is not multi-process safe, but LMDB is.
Need help with this.
Module | Author | Description |
---|---|---|
Hypercore-peer-auth | Franz Heinzmann (@Frando) | Verifies that the remote hypercore is an original author (is in possession of its secretKey). Also tells you which remote hypercore has just connected to you. This comes in useful when you join a hyperswarm topic and get connections from many peers; Hyperswarm itself does not tell you the identity of the peer. Note that since each hypercore has its own identity (keypair), you can designate one of the peer’s hypercores to represent the identity of the peer. The Hypercore-peer-auth module implements an extension to the hypercore protocol, and is an example of how you can create your own extensions. Note that there are channel extensions and stream extensions; a channel is encrypted, so its extensions are secure, which is not the case with stream extensions. |
Multi-key | Mathias Buus (@mafintosh) | Allows rotating a keypair for a hypercore |
Streaming sort-merge | Mathias Buus (@mafintosh) | This works great with several hyperbees. It is composable, so you can use more than 2 hyperbees |
Visit the Hypercore Protocol site.
In the summer of 2020 there was a Dat Conference (Dat was renamed to Hypercore this year). You can see the breadth of discussions that took place, both on tech and the opportunities.
Workshop at the 2020 summer Hypercore / Dat Conference with sources and video.
Hyperbee ‘P2P indexing and search’ workshop at the 2020 fall Nodeconf conference
Kappa workshop is a great basic intro, we forked it to update to new materials and shift focus to core Hypercore modules.
Read old FAQ (before project was renamed from Dat to Hypercore).