Tutorial on How IPFS Works

IPFS is a peer-to-peer (p2p) file sharing system. Peers anywhere in the world can provide access to content by relaying it, storing it, or doing both. IPFS finds what you’re looking for by its content address rather than its location.

Understanding IPFS is based on three essential principles:

  1. Content addressing allows for unique identification
  2. Directed acyclic graphs (DAGs) are used to link content
  3. Distributed hash tables (DHTs) are used for content discovery

The IPFS ecosystem is enabled by these three concepts, which build on each other. Let’s begin with content addressing and unique content identification.

Content Addressing

IPFS employs content addressing to identify content based on what it contains rather than where it is located. You’re probably used to searching for items based on their content. When you go to the library to look for a book, you commonly ask for it by title; this is content addressing because you’re asking for what it is. If you wanted to find that book using location addressing, you’d say, “I want the book on the second level, first stack, third shelf from the bottom, fourth book from the left.” You’d be out of luck if that book was moved!

That issue is present on the internet as well as on your PC! Currently, content is found based on its location, such as:

  • https://en.wikipedia.org/wiki/Aardvark
  • /Users/Alice/Documents/term_paper.doc
  • C:\Users\Joe\My Documents\project_sprint_presentation.ppt

Every piece of content that uses the IPFS protocol, on the other hand, has a content identifier, or CID, which is derived from a hash of that content. The hash is unique to the content it came from, even though it may look short compared to the original content. If you’re unfamiliar with hashes, start with our introduction to cryptographic hashing.
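To make the idea concrete, here is a minimal sketch of a content-addressed store. It is not how IPFS itself is implemented: the `ContentStore` class is hypothetical, and a plain SHA-256 hex digest stands in for a real CID. It only shows that the address comes from the bytes, not from where they are kept.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the bytes themselves."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is a hash of the content, so nobody has to assign it.
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Re-hash on read: the address doubles as an integrity check.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
addr_a = store.put(b"Aardvarks are nocturnal mammals.")
addr_b = store.put(b"Aardvarks are nocturnal mammals.")
addr_c = store.put(b"Aardvarks are nocturnal mammals!")

print(addr_a == addr_b)  # identical bytes give the same address
print(addr_a == addr_c)  # a single changed byte gives a different address
```

Notice that moving the data to another machine would not change its address, which is exactly what the library example above is getting at.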

Many distributed systems use content addressing through hashes not only to identify content, but also to link it together. Everything from your code’s commits to the blockchains that run cryptocurrencies uses this method. The underlying data structures of these systems, however, are not always compatible.

The Interplanetary Linked Data (IPLD) project can help with this. IPLD allows data to be unified across distributed systems by translating between hash-linked data structures. IPLD provides libraries that allow you to combine pluggable modules (parsers for each type of IPLD node) to resolve a path, selector, or query across several linked nodes, allowing you to explore data regardless of the underlying protocol.

IPLD allows you to convert between content-addressable data structures in the following way: “Oh, you use Git? Don’t worry, I’ll be able to follow those links. Oh, you use Ethereum? No problem, I’ll follow those links as well!”

IPFS follows particular data-structure preferences and conventions. The IPFS protocol uses those conventions, together with IPLD, to get from raw content to an IPFS address that uniquely identifies content on the IPFS network.

The following section looks at how a DAG data structure can be used to incorporate links between content within that content address.

Directed Acyclic Graphs (DAGs)

Directed acyclic graphs, or DAGs, are a data structure used by IPFS and many other distributed systems. Specifically, these systems use Merkle DAGs, in which each node has a unique identifier that is a hash of the node’s contents. Does this ring a bell? It is the same CID concept from the preceding section. To put it another way, identifying a data object (such as a Merkle DAG node) by the value of its hash is content addressing. For a more in-depth discussion of Merkle DAGs, see our Merkle DAGs guide.

A Merkle DAG can be structured in a variety of ways; IPFS employs one that is optimized for representing directories and files. Git, for example, employs a Merkle DAG to store multiple versions of your repository.

IPFS generally splits your content into blocks before creating a Merkle DAG representation. Dividing it into blocks means that different parts of the file can come from different sources and be verified quickly. (If you’ve ever used BitTorrent, you’ll know that when you download a file, BitTorrent can fetch it from multiple peers at the same time; this is the same concept.)

Merkle DAGs are a bit of a “turtles all the way down” situation, in which everything has a CID. Assume you have a file with a CID that identifies it. What if that file is in a folder that contains several other files? Those files will have CIDs as well. What about the CID of that folder? It would be a hash of the CIDs of the files underneath (i.e., the folder’s content).

Those files, in turn, are made up of blocks, each of which has a CID. You can see how a DAG could be used to represent a file system on your PC. Hopefully, you can also see how Merkle DAGs begin to form. Take a look at the IPLD Explorer for a visual representation of this concept.
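The folder, file, and block relationship described above can be sketched as a toy Merkle DAG. This is an illustration under simplifying assumptions, not the real IPFS encoding: nodes are plain dictionaries serialized as JSON, identifiers are SHA-256 hex digests rather than real CIDs, and `file_node`, `dir_node`, and `node_id` are hypothetical helper names.

```python
import hashlib
import json


def node_id(node: dict) -> str:
    # Serialize with sorted keys so the same node always yields the same bytes.
    encoded = json.dumps(node, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def file_node(data: bytes, block_size: int = 4) -> dict:
    # A file node links to the hashes of the blocks that make it up.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return {"type": "file", "links": [hashlib.sha256(b).hexdigest() for b in blocks]}


def dir_node(entries: dict) -> dict:
    # A folder node links to the identifiers of the nodes it contains.
    return {"type": "dir", "links": {name: node_id(child) for name, child in entries.items()}}


readme = file_node(b"read me first")
notes = file_node(b"some notes")
folder = dir_node({"README": readme, "notes": notes})
root_before = node_id(folder)

# Changing one file changes that file's identifier and, through the link,
# the identifier of the folder above it. The untouched file keeps its identifier.
notes_v2 = file_node(b"some notes!")
folder_v2 = dir_node({"README": readme, "notes": notes_v2})
root_after = node_id(folder_v2)

print(root_before != root_after)
print(folder["links"]["README"] == folder_v2["links"]["README"])
```

Because every parent is a hash over its children’s identifiers, a change anywhere in the tree is visible at the root.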

Another advantage of Merkle DAGs and dividing content into blocks is that if you have two similar files, they can share parts of the Merkle DAG, i.e., parts of different Merkle DAGs can reference the same subset of data. When you update a website, for example, only the files that changed get new content addresses.

For everything else, your old and new versions can refer to the same blocks. This can make transmitting versions of enormous datasets (such as genomics research or meteorological data) more efficient because you just have to send the sections that have changed or are new, rather than having to create entirely new files each time.
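As a rough illustration of that sharing, the sketch below stores two versions of a small file in one block store and counts what actually has to be kept or sent. The fixed 8-byte block size, the `add_file` helper, and the use of SHA-256 hex digests in place of CIDs are all assumptions made for readability.

```python
import hashlib


def split_into_blocks(data: bytes, size: int = 8) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


def add_file(store: dict, data: bytes) -> list:
    """Store each block under its hash and return the list of block identifiers."""
    ids = []
    for block in split_into_blocks(data):
        block_id = hashlib.sha256(block).hexdigest()
        store[block_id] = block  # storing the same block twice is a no-op
        ids.append(block_id)
    return ids


store = {}
v1_ids = add_file(store, b"AAAAAAAABBBBBBBBCCCCCCCC")
v2_ids = add_file(store, b"AAAAAAAAXXXXXXXXCCCCCCCC")  # only the middle block changed

shared = set(v1_ids) & set(v2_ids)
to_send = set(v2_ids) - set(v1_ids)

print(len(store))    # 4 unique blocks stored, not 6
print(len(shared))   # 2 blocks are referenced by both versions
print(len(to_send))  # 1 new block needs to be transferred
```

One caveat of this simplified version: with fixed-size blocks, an edit that inserts or deletes bytes shifts every later block boundary, so fewer blocks end up shared than in this in-place example.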

To summarize, IPFS allows you to assign CIDs to content and link that content together in a Merkle DAG. Let’s move on to the next point: how you locate and move content.

Distributed Hash Tables (DHTs)

To discover which peers are hosting the content you’re looking for (discovery), IPFS employs a distributed hash table, or DHT. A hash table is a database that maps keys to values. A distributed hash table is one in which the table is split across all of the peers in a distributed network. To find content, you ask these peers.
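Here is a minimal sketch of the idea of a table split across peers. It runs in a single process and is not libp2p code: the `ToyDHT` class and peer names are hypothetical, and the rule used to decide which peer holds a key (the peer whose hashed ID is closest to the hashed key by XOR, in the spirit of Kademlia-style DHTs) is a simplification.

```python
import hashlib


def to_id(name: str) -> int:
    # Map peer names and keys into the same numeric ID space.
    return int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest(), "big")


class ToyDHT:
    def __init__(self, peer_names):
        # Each peer holds only its own slice of the overall table.
        self.tables = {name: {} for name in peer_names}
        self.peer_ids = {name: to_id(name) for name in peer_names}

    def responsible_peer(self, key: str) -> str:
        key_id = to_id(key)
        # The peer whose ID has the smallest XOR distance to the key owns it.
        return min(self.peer_ids, key=lambda name: self.peer_ids[name] ^ key_id)

    def put(self, key: str, value: str) -> None:
        self.tables[self.responsible_peer(key)][key] = value

    def get(self, key: str):
        return self.tables[self.responsible_peer(key)].get(key)


dht = ToyDHT(["peer-a", "peer-b", "peer-c", "peer-d"])
for i in range(20):
    dht.put(f"key-{i}", f"value-{i}")

print(dht.get("key-7"))
print({name: len(table) for name, table in dht.tables.items()})
```

No single peer needs the whole table; anyone who knows the rule can work out which peer to ask for a given key.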

The IPFS ecosystem’s libp2p project supplies the DHT and manages peers connecting and communicating with one another. (It’s worth noting that, like IPLD, libp2p can be used for various distributed systems besides IPFS.)

Once you know where your content is (or, more precisely, which peers are storing each of the blocks that make up the content you’re after), you use the DHT again to find the current location of those peers (routing). So, to get to the content, you use libp2p to query the DHT twice.
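The two lookups can be sketched with two small tables. This is a simplified, single-process picture: the dictionaries `provider_records` and `peer_records`, the function names, and the addresses (taken from documentation-only IP ranges) are all made up for illustration and do not reflect the libp2p API.

```python
# First kind of record: which peers provide a given block (discovery).
provider_records = {
    "cid-block-1": ["peer-a", "peer-b"],
    "cid-block-2": ["peer-b"],
}

# Second kind of record: where a given peer can currently be reached (routing).
peer_records = {
    "peer-a": ["/ip4/192.0.2.10/tcp/4001"],
    "peer-b": ["/ip4/198.51.100.7/tcp/4001"],
}


def find_providers(cid: str) -> list:
    return provider_records.get(cid, [])


def find_peer(peer_id: str) -> list:
    return peer_records.get(peer_id, [])


def locate(cid: str) -> dict:
    """Query 1 finds who has the block; query 2 finds where each of them is."""
    return {peer_id: find_peer(peer_id) for peer_id in find_providers(cid)}


print(locate("cid-block-1"))
print(locate("cid-block-2"))
```

Keeping the two records separate means a peer can change its network address without anyone having to republish which content it holds.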

You’ve discovered your content and the current location(s) of the peers that hold it. You must now connect to those peers and retrieve the content (exchange). To request blocks from and send blocks to other peers, IPFS presently employs a module named Bitswap.

Bitswap allows you to connect to the peer or peers who have the content you’re looking for, send them your wantlist (a list of all the blocks you want), and have them send you the blocks you requested. You can verify those blocks by hashing their content to obtain CIDs and comparing them to the CIDs you requested. You can also use these CIDs to deduplicate blocks if necessary.
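The request, verify, and deduplicate loop can be sketched as follows. This is a toy model and not the Bitswap wire protocol: `ToyPeer`, `LyingPeer`, and `fetch` are hypothetical names, peers are ordinary in-process objects, and SHA-256 hex digests stand in for CIDs.

```python
import hashlib


def block_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyPeer:
    def __init__(self, blocks):
        self.store = {block_id(b): b for b in blocks}

    def handle_wantlist(self, wantlist):
        # Send back whichever of the wanted blocks this peer holds.
        return [self.store[cid] for cid in wantlist if cid in self.store]


class LyingPeer(ToyPeer):
    def handle_wantlist(self, wantlist):
        # Misbehaving peer: answers every request with junk.
        return [b"garbage" for _ in wantlist]


def fetch(wantlist, peers):
    have = {}
    for peer in peers:
        # Only ask for blocks we are still missing (deduplication).
        remaining = [cid for cid in wantlist if cid not in have]
        if not remaining:
            break
        for block in peer.handle_wantlist(remaining):
            cid = block_id(block)
            # Verification: keep a block only if its hash is one we asked for.
            if cid in remaining:
                have[cid] = block
    return have


blocks = [b"block one", b"block two", b"block three"]
wantlist = [block_id(b) for b in blocks]
peers = [LyingPeer([]), ToyPeer(blocks[:2]), ToyPeer(blocks[1:])]

have = fetch(wantlist, peers)
print(len(have) == len(wantlist))
print(all(block_id(data) == cid for cid, data in have.items()))
```

Because each block is checked against the identifier that was requested, you do not need to trust the peer that sent it.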

Other content replication protocols are also being discussed, the most developed of which is Graphsync. A proposal to extend the Bitswap protocol with functionality around requests and responses is also under discussion.

SHA File Hashes Won’t Match Content IDs

You may be used to checking a file’s integrity by comparing SHA hashes, but SHA hashes will not match CIDs. Because IPFS divides files into blocks, each block has its own CID, and any parent nodes have separate CIDs of their own.

Merkle DAGs are self-verifying structures, and the DAG keeps track of all the content saved in IPFS as blocks, not files. See directed acyclic graph (DAG) for additional information.

See Content Identifiers are not Hashes for a full illustration of what occurs when you try to compare SHA hashes with CIDs.
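A short sketch can show two reasons for the mismatch. The first part assumes the CIDv1 layout as described in the multiformats specifications (a version byte, a codec byte, then a multihash, written in base32 with a `b` prefix) and uses the `raw` codec; what an actual IPFS node produces for a file depends on its chunking and encoding settings, so treat the output as illustrative. The second part uses a made-up `toy_root` that hashes the block hashes together, which is not the real node encoding but shows why a chunked file’s root differs from the file’s own SHA-256.

```python
import base64
import hashlib


def cid_v1_raw(data: bytes) -> str:
    digest = hashlib.sha256(data).digest()
    # Assumed layout: version 1, raw codec (0x55), then a multihash made of
    # the sha2-256 code (0x12), the digest length (0x20), and the digest.
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase base32: lowercase, no padding, prefixed with "b".
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


data = b"hello world"
sha_hex = hashlib.sha256(data).hexdigest()
cid = cid_v1_raw(data)

print(sha_hex)
print(cid)
print(cid == sha_hex)  # False: the CID wraps the digest in extra metadata

# The digest is still inside the CID and can be recovered by decoding it.
decoded = base64.b32decode(cid[1:].upper() + "=" * 6)
print(decoded[4:].hex() == sha_hex)

# For a file split into blocks, the root is a hash over links to the blocks,
# not over the file's bytes, so even the underlying digests differ.
blocks = [data[i:i + 4] for i in range(0, len(data), 4)]
toy_root = hashlib.sha256(b"".join(hashlib.sha256(b).digest() for b in blocks)).hexdigest()
print(toy_root == sha_hex)
```

In other words, a single-block CID is a digest with a self-describing prefix and a different text encoding, while a multi-block CID is a hash of a different object altogether.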

A Modular Paradigm

The IPFS ecosystem is made up of various modular libraries that support specific parts of any distributed system, as you may have noticed from this discussion. You can use any part of the stack on its own or combine the parts in new ways.

The IPFS ecosystem assigns CIDs to content and links that content together by generating IPLD Merkle DAGs. You can use a DHT provided by libp2p to find content, then open a connection to any provider of that content and download it over a multiplexed connection. All of this is held together by the middle of the stack, which consists of linked, unique identifiers; this is the foundation that IPFS is built on.
