author: pukkamustard <pukkamustard@posteo.net> 2020-11-23 08:42:52 +0100
committer: pukkamustard <pukkamustard@posteo.net> 2020-11-23 08:42:52 +0100
commit: bcfaec61621479974f0aa8f76d86f610968a4e3b
tree: ae1ff9bc7cfedf5874ca34292cf38784a7c95a32 /doc
parent: 08babc22fb54ab20c38d88394f7cf5b6e2bd3abd
eris.adoc: minor fixes
Diffstat (limited to 'doc')
-rw-r--r-- doc/eris.adoc 87
1 files changed, 47 insertions, 40 deletions
diff --git a/doc/eris.adoc b/doc/eris.adoc
index 6feb7cd..b49e0e4 100644
--- a/doc/eris.adoc
+++ b/doc/eris.adoc
@@ -7,7 +7,7 @@ pukkamustard <pukkamustard@posteo.net>
:sectanchors:
[abstract]
-The Encoding for Robust Immutable Storage (ERIS) is an encoding of arbitrary content into a set of uniformly sized, encrypted and content-addressed blocks as well as a short identifier - the _read capability_. The content can be reassembled from the encrypted blocks only with the read capability. The encoding is defined independent of any storage or transport layer. Together with content-addressable storage, ERIS can be used as a building block for robust and decentralized applications.
+This document describes the Encoding for Robust Immutable Storage (ERIS). ERIS is an encoding of arbitrary content into a set of uniformly sized, encrypted and content-addressed blocks as well as a short identifier (a URN). The content can be reassembled from the encrypted blocks only with this identifier. The encoding is defined independent of any storage and transport layer or any specific application. We illustrate how ERIS can be used as a building block for robust and decentralized applications.
== Introduction
@@ -17,16 +17,14 @@ Availability can be increased by caching content on multiple peers. However most
An alternative to identifying content by its location is to identify content by its content itself. This is called content-addressing. The hash of some content is computed and used as an unique identifier for the content.
-Content-addressed content is much easier to cache as the content is completely decoupled from any physical location. It is much easier to ensure availability of content-addressed content than it is for location-addressed content.
-
-Authenticity of content is automatically ensured with content-addressing (when using a cryptographic hash) as the identifier of the content can be computed and be checked to match the requested identifier.
+Caching content-addressed content and making it available redundantly is much easier as the content is completely decoupled from any physical location. Authenticity of content is automatically ensured with content-addressing (when using a cryptographic hash) as the identifier of the content can be computed to check that the content matches the requested identifier.
However, naive content-addressing has certain drawbacks:
-- Large content is stored as a large blob. In order to optimize storage and network operations it is better to split up content into smaller uniformly sized blocks and reassemble blocks when needed.
+- Large content is stored as a large chunk of data. In order to optimize storage and network operations it is better to split up content into smaller uniformly sized blocks and reassemble blocks when needed.
- Confidentiality: Content is readable by all peers involved in transporting, caching and storing content.
-ERIS is an encoding that addresses these issues by splitting blocks into small uniformly sized blocks and encrypting blocks.
+ERIS addresses these issues by splitting content into small uniformly sized and encrypted blocks.
=== Objectives
@@ -56,20 +54,20 @@ ERIS is inspired and based on the encoding used in the file-sharing application
ERIS differs from ECRS in following points:
Cryptographic primitives :: ECRS itself does not specify any cryptographic primitives but the GNUNet implementation uses the SHA-512 hash and AES cipher. ERIS uses the Blake2b-256 cryptographic hash <<RFC7693>> and the ChaCha20 stream cipher <<RFC8439>>. This improves performance, storage efficiency (as hash references are smaller) and allows a convergence secret to be used (via Blake2b keyed hashing; see <<_convergence_secret>>).
-Block size :: ECRS uses a fixed block size of 32 KiB. This is inefficient when encoding small content. ERIS allows a block size of 1 KiB or 32 KiB, allowing efficient encoding of small and large content (see <<_block_size>>).
-URN :: ECRS does not specify an URN for referring to encoded content (this is specified as part of the GNUNet file-sharing application). ERIS specifies an URN for encoded content regardless of encoding application or storage and transport layer.
+Block size :: ECRS uses a fixed block size of 32 KiB. This can be very inefficient when encoding many small pieces of content. ERIS allows a block size of 1 KiB or 32 KiB, allowing efficient encoding of small and large content (see <<_block_size>>).
+URN :: ECRS does not specify an URN for referring to encoded content (this is specified as part of the GNUNet file-sharing application). ERIS specifies an URN for encoded content regardless of encoding application or storage and transport layer (see <<_urn>>).
Namespaces :: ECRS defines two mechanisms for grouping and discovering encoded content (SBlock and KBlock). ERIS does not specify any such mechanisms (see <<_namespaces>>).
Other related projects include Tahoe-LAFS and Freenet. The reader is referred to the ECRS paper <<ECRS>> for an in-depth explanation and comparison of related projects.
+ERIS is being developed in close collaboration with the https://datashards.net/[Datashards] initiative.
+
=== Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 <<RFC2119>>.
We use binary prefixes for multiples of bytes, i.e: 1024 bytes is 1 kibibyte (KiB), 1024 kibibytes is 1 mebibyte (MiB) and 1024 mebibytes is 1 gigibytes (GiB).
-TODO a glossary of terms used.
-
== Specification of ERIS
=== Cryptographic Primitives
@@ -80,7 +78,7 @@ The cryptographic primitives used by ERIS are a cryptographic hash funciton, a s
Blake2b <<RFC7693>> with output size of 256 bit (32 byte). We use the keying feature and refer to the key used for keying Blake2b as the _hashing key_.
-Provides the functions `Blake2b-256(INPUT,HASHING-KEY)` for keyed hashing and `Blake2b-256(INPUT)` for unkeyed hashing.
+Provides the functions `Blake2b-256(INPUT, HASHING-KEY)` for keyed hashing and `Blake2b-256(INPUT)` for unkeyed hashing.
==== Symmetric Key Cipher
ChaCha20 (IETF variant) <<RFC8439>>. Provides `ChaCha20(INPUT, KEY)`, where `INPUT` is an arbirtarty length byte sequence and `KEY` is the 256 bit encryption key. The output is the encrypted byte sequence.
@@ -96,7 +94,7 @@ We use a byte padding scheme to ensure that input content size is a multiple of
`PAD(INPUT,BLOCK-SIZE)` :: For `INPUT` of size `n` adds a mandatory byte valued `0x80` (hexadecimal) to `INPUT` followed by `m < BLOCK-SIZE - 1` bytes valued `0x00` such that `n + m + 1` is a multiple of `BLOCK-SIZE`.
`UNPAD(INPUT,BLOCK-SIZE)` :: Starts reading bytes from the end of `INPUT` until a `0x80` is read and then returns bytes of `INPUT` before the `0x80`. Throws an error if a value other than `0x00` is read before reading `0x80` or if no `0x80` is read after reading `BLOCK-SIZE - 1` bytes from the end.
-This is the padding algorithm implemented in https://libsodium.gitbook.io/doc/padding[libsodium]footnote:[This padding algorithm is apparently also specified in ISO/IEC 7816-4. However, the speicifcation is not openly available. Fuck you ISO.].
+This is the padding algorithm implemented in https://libsodium.gitbook.io/doc/padding[libsodium]footnote:[This padding algorithm is apparently also specified in ISO/IEC 7816-4. However, the speicifcation is not openly available. So, fuck you ISO.].
=== Block Size
@@ -121,7 +119,7 @@ Using the hash of the content as key is called _convergent encryption_.
Because the hash of the content is deterministically computed from the content, the key will be the same when the same content is encoded twice. This results in de-duplication of content. Convergent encryption suffers from two known attacks: The Confirmation Of A File Attack and The Learn-The-Remaining-Information Attack <<Zooko2008>>. A defense against both attacks is to use a _convergence secret_. This results in different encoding of the same content with different convergence secret.
-If no convergence secret is specified a null convergence secret is used (32 bytes of zeroes).
+If no convergence secret is specified a null convergence secret MUST be used (32 bytes of zeroes).
The convergence secret is implemented as the keying feature of the Blake2 cryptographic hash <<RFC7693>>.
@@ -147,11 +145,11 @@ An encoding of a content that is split into eight blocks is depicted in <<figure
.Encoding of content as tree. Solid edges are concatenations of reference-key pairs as described in <<_collect_reference_key_pairs_in_nodes>>. Dotted edges are encryption and computation of reference-key pairs as described in <<_encrypt_block_and_compute_reference_key_pair>>.
image::eris-merkle-tree.svg[Merkle Tree,opts=inline]
-The block-size, the level of the root reference and the root reference-key pair itself are the necessary pieces of information required to decode content. The tuple consisting of block size, level, root reference and key is called the _read capability_.
+The block size, the level of the root reference and the root reference-key pair itself are the necessary pieces of information required to decode content. The tuple consisting of block size, level, root reference and key is called the _read capability_.
The encrypted blocks and the read capability are the outputs of the encoding process.
-A pseudo-code implementation of the encoding process:
+A pseudo-code implementation of the encoding process is provided in the following. Note that the pseudo-code implementation is naive and given for illustration purposes only. It is RECOMMENDED that imlementations use a streaming encoding process (as described in <<_streaming>>) which allows encoding of content larger than the available memory.
[source,pseudocode]
----
@@ -186,6 +184,7 @@ ERIS-Encode(CONTENT, CONVERGENCE-SECRET, BLOCK-SIZE):
The sub-process `Split-Content` and `Collect-RK-Pairs` are explained in the following sections.
+
==== Splitting Input Content into Blocks
Input content is split into blocks of size at most block size such that only the last content block may be smaller than block size.
@@ -317,7 +316,9 @@ Where the block-storage can be accessed as follows:
A streaming decoding procedure can be implemented where the content can be output block wise and does not need to be kept in memory for unpadding. For an example, see https://gitlab.com/openengiadina/eris/-/raw/main/eris/decode.scm[the reference Guile implementation].
-Random access is possible by only decoding selected sub-trees.
+==== Random Access
+
+A decoder that allows random access to the encoded content can be implemented by decoding selected sub-trees.
=== Binary Encoding of Read Capability
@@ -336,7 +337,7 @@ We specify an binary encoding of the read-capability 66 bytes:
The initial field (block size) also encodes the ERIS version. Future versions of ERIS MUST use different codes to encode block sizes.
-TODO using 1 byte to encode level limits size of content that can be encoded. Add a comment on that.
+Note that using a single byte to encode the level limits the size of content that can be encoded with ERIS. However, the size of the largest encodable content is approximately 1e300 TiB, which seems to be sufficient for any conceivable practical applications (including an index of all atoms in the universe).
=== URN
@@ -352,20 +353,6 @@ For example the ERIS URN of the UTF-8 encoded string "Hello world!" (with block
=== Namespaces
-== Implementations
-
-A list of known implementations that satisify the test vectors:
-
-|===
-| Name | Programming language | License | Notes | Homepage
-
-| `guile-eris` | Guile | GPL-3.0-or-later | Reference implementation | https://gitlab.com/openengiadina/eris/
-| `elixir-eris` | Elixir | GPL-3.0-or-later | | https://gitlab.com/openengiadina/elixir-eris/
-|===
-
-== Acknowledgments
-
-[appendix]
== Test Vectors
=== Machine Readable
@@ -412,7 +399,6 @@ Implementations MUST verify that the content encodes to the URN given the specif
=== Large content
-
In order to verify implementations that encode content by streaming (see <<_streaming>>) URNs of large contents that are generated in a specified way are provided:
|===
@@ -424,7 +410,21 @@ In order to verify implementations that encode content by streaming (see <<_stre
Content is the ChaCha20 stream using a null nonce and the key which is the Blake2b hash of the UTF-8 encoded test name (e.g. `KEY := Blake2b-256("100MiB (block size 1KiB)")`). The ChaCha20 stream can be computed by encoding a null byte sequence (e.g. `CHACHA20_STREAM := ChaCha20(NULL, KEY)`).
-[appendix]
+== Implementations
+
+A list of known implementations that satisify the test vectors:
+
+|===
+| Name | Programming language | License | Notes | Homepage
+
+| `guile-eris` | Guile | GPL-3.0-or-later | Reference implementation | https://gitlab.com/openengiadina/eris/
+| `elixir-eris` | Elixir | GPL-3.0-or-later | | https://gitlab.com/openengiadina/elixir-eris/
+|===
+
+== Acknowledgments
+
+
+:sectnums!:
== Changelog
[discrete]
@@ -438,21 +438,28 @@ Initial version.
Major update of encoding that removes the _verification capability_ - ability to verify integrity of content without reading content.
-[appendix]
== Copyright
This work is licensed under a http://creativecommons.org/licenses/by-sa/4.0/[Creative Commons Attribution-ShareAlike 4.0 International License].
-[bibliography]
== References
-- [[[content-addressable-rdf]]] openEngiadina. https://openengiadina.net/papers/content-addressable-rdf.html[Content-addressable RDF]. 2020
-- [[[rdf-signify]]] openEngiadina. https://openengiadina.net/papers/rdf-signify.html[RDF Signify]. 2020
-- [[[Polleres2020]]] Polleres, Kamdar, Fernández, Javier David, Tudorache & Musen. https://epub.wu.ac.at/6371/1/IPM_workingpaper_02_2018.pdf[A more decentralized vision for Linked Data]. 2020
-- [[[ECRS]]] Grothoff, Grothoff, Horozov, & Lindgren. https://grothoff.org/christian/ecrs.pdf[An encoding for censorship-resistant sharing]. 2003
+[bibliography]
+=== Normative References
+
- [[[RFC2119]]] S. Bradner. https://tools.ietf.org/html/rfc2119[Key words for use in RFCs to Indicate Requirement Levels]. 1997
- [[[RFC4648]]] S. Josefsson. https://tools.ietf.org/html/rfc4648[The Base16, Base32, and Base64 Data Encodings]. 2006
-- [[[RFC7049]]] C. Bormann & P. Hoffman. https://tools.ietf.org/html/rfc7049[Concise Binary Object Representation (CBOR)]. 2013
- [[[RFC7693]]] M-J. Saarinen & J-P. Aumasson. https://tools.ietf.org/html/rfc7693[The BLAKE2 Cryptographic Hash and Message Authentication Code (MAC)]. 2015
- [[[RFC8439]]] Nir & Langley. https://tools.ietf.org/html/rfc8439[ChaCha20 and Poly1305 for IETF Protocols]. 2018
+- [[[RFC8141]]] Saint-Andre, Filament & Klensin, https://tools.ietf.org/html/rfc8141[Uniform Resource Names (URNs)]. 2017
+
+[bibliography]
+=== Informative References
+
+- [[[Polleres2020]]] Polleres, Kamdar, Fernández, Javier David, Tudorache & Musen. https://epub.wu.ac.at/6371/1/IPM_workingpaper_02_2018.pdf[A more decentralized vision for Linked Data]. 2020
+- [[[ECRS]]] Grothoff, Grothoff, Horozov, & Lindgren. https://grothoff.org/christian/ecrs.pdf[An encoding for censorship-resistant sharing]. 2003
- [[[Zooko2008]]] Zooko Wilcox-O'Hearn. https://tahoe-lafs.org/hacktahoelafs/drew_perttula.html[Drew Perttula and Attacks on Convergent Encryption]. 2008
+
+- [[[content-addressable-rdf]]] openEngiadina. https://openengiadina.net/papers/content-addressable-rdf.html[Content-addressable RDF]. 2020
+- [[[rdf-signify]]] openEngiadina. https://openengiadina.net/papers/rdf-signify.html[RDF Signify]. 2020
+- [[[RFC7049]]] C. Bormann & P. Hoffman. https://tools.ietf.org/html/rfc7049[Concise Binary Object Representation (CBOR)]. 2013
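
Reviewer note (not part of the patch): the `PAD`/`UNPAD` definitions and the `Blake2b-256(INPUT, HASHING-KEY)` helper shown as context in the hunks above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the function names `pad`, `unpad` and `blake2b_256` are hypothetical, and `unpad` is assumed to scan at most one full block from the end, since the padding can occupy an entire block when the input length is already a multiple of the block size.

```python
import hashlib
from typing import Optional


def pad(data: bytes, block_size: int) -> bytes:
    """Append 0x80 followed by m zero bytes so the result is a multiple of block_size."""
    # m lies in 0..block_size-1; the 0x80 marker is always added.
    m = -(len(data) + 1) % block_size
    return data + b"\x80" + b"\x00" * m


def unpad(data: bytes, block_size: int) -> bytes:
    """Strip the padding added by pad(); raise ValueError if it is malformed."""
    # Assumption: scan backwards over at most block_size bytes.
    limit = max(len(data) - block_size, 0)
    for i in range(len(data) - 1, limit - 1, -1):
        if data[i] == 0x80:
            return data[:i]
        if data[i] != 0x00:
            raise ValueError("invalid padding: unexpected non-zero byte")
    raise ValueError("invalid padding: no 0x80 marker found")


def blake2b_256(data: bytes, hashing_key: Optional[bytes] = None) -> bytes:
    """Blake2b with a 32-byte digest; keyed when hashing_key is given."""
    if hashing_key is None:
        return hashlib.blake2b(data, digest_size=32).digest()
    return hashlib.blake2b(data, digest_size=32, key=hashing_key).digest()
```

For instance, `unpad(pad(b"Hello world!", 1024), 1024)` returns the original bytes, and `blake2b_256(name.encode("utf-8"))` would correspond to the `KEY := Blake2b-256(...)` derivation described for the large-content test vectors. ChaCha20 is not available in the Python standard library, so the stream generation itself is left out of this sketch.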