Jameson Lopp is an engineer at BitGo, creator of Statoshi.info and founder of Bitcoinsig.com. In this piece, Lopp takes a deep dive into the claim that it is safe to remove Bitcoin's block size limit and instead rely on the existing SPV approach.

Could SPV Support a Billion Bitcoin Users?

A new claim has become immortalized in the bitcoin scaling debate.

We're told that it's safe to remove the block size limit because bitcoin can easily scale to huge blocks and support billions of users via the Simplified Payment Verification (SPV) method that already exists. The assumption is that SPV is highly scalable because SPV clients need to store, send, and receive very little data.

Let’s take a closer look at the problem from a number of different angles.

How does SPV work?

Satoshi described the high-level design of SPV in the bitcoin white paper, though it went unimplemented until Mike Hearn created BitcoinJ two years later.

The following is quoted from the Bitcoin white paper:

8. Simplified Payment Verification


It is possible to verify payments without running a full network node. A user only needs to keep a copy of the block headers of the longest proof-of-work chain, which he can get by querying network nodes until he's convinced he has the longest chain, and obtain the Merkle branch linking the transaction to the block it's timestamped in. He can't check the transaction for himself, but by linking it to a place in the chain, he can see that a network node has accepted it, and blocks added after it further confirm the network has accepted it.

As such, the verification is reliable as long as honest nodes control the network, but is more vulnerable if the network is overpowered by an attacker. While network nodes can verify transactions for themselves, the simplified method can be fooled by an attacker's fabricated transactions for as long as the attacker can continue to overpower the network. One strategy to protect against this would be to accept alerts from network nodes when they detect an invalid block, prompting the user's software to download the full block and alerted transactions to confirm the inconsistency. Businesses that receive frequent payments will probably still want to run their own nodes for more independent security and quicker verification.

The initial implementations of SPV were very simple: in bandwidth terms, they were no more efficient than downloading entire blocks, complete with all of their transactions.

They could save a great deal of disk space by discarding transactions irrelevant to the SPV client's wallet. BIP 37, released about 18 months later, specified Bloom filtering of transactions, so that compact Merkle proofs against the block header's Merkle root could demonstrate a transaction's inclusion in a block, as Satoshi described. This greatly reduced bandwidth usage as well.

When an SPV client syncs with the bitcoin network, it connects to one or more fully validating bitcoin nodes, determines the latest block at the tip of the chain, and then requests all of the block headers between the last block it synced and the chain tip, using a 'getheaders' message.

If the SPV client is only interested in the transactions relevant to its wallet, it constructs a Bloom filter from all of the addresses for which the wallet holds private keys and sends a 'filterload' message to its full node peer(s), asking them to return only matching transactions.

After syncing the block headers and loading any Bloom filter, the SPV client sends a 'getdata' message to request each (potentially filtered) block it missed since it was last online, in sequential order.

Once the client has finished syncing, if it stays connected to its full node peers it will receive 'inv' inventory announcements only for transactions that match its Bloom filter.
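
The heart of this exchange, from the client's side, is checking each returned transaction's Merkle branch against the Merkle root in the corresponding 80-byte header. Below is a minimal, self-contained sketch of that verification step in Python; the toy two-transaction block and the simplified byte ordering are illustrative assumptions, not real consensus serialization.

```python
import hashlib

def sha256d(data: bytes) -> bytes:
    """Bitcoin's double-SHA256 hash."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def verify_merkle_branch(txid: bytes, branch: list[tuple[bytes, bool]],
                         merkle_root: bytes) -> bool:
    """Walk a Merkle branch from a transaction up to the root.

    branch: one (sibling_hash, sibling_is_on_right) pair per tree level.
    Byte order is simplified here; real bitcoin txids are little-endian.
    """
    h = txid
    for sibling, sibling_is_right in branch:
        pair = h + sibling if sibling_is_right else sibling + h
        h = sha256d(pair)
    return h == merkle_root

# Toy two-transaction block: root = H(H(tx_a) + H(tx_b))
tx_a, tx_b = sha256d(b"tx_a"), sha256d(b"tx_b")
root = sha256d(tx_a + tx_b)
print(verify_merkle_branch(tx_a, [(tx_b, True)], root))  # True
```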

SPV client scaling

From the client's perspective, Bloom filters are a highly efficient way of finding relevant transactions in the blockchain while expending minimal CPU, bandwidth, and disk space.

Each bitcoin block header is only 80 bytes, so the entire eight-plus-year history of the blockchain amounts to just 38MB of headers at the time of writing. Each year (roughly 52,560 blocks) adds only about 4.2MB of headers, regardless of how large the blocks themselves become.
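
The arithmetic is straightforward; here is a quick sanity check in Python (the chain height is an approximation for mid-2017):

```python
HEADER_BYTES = 80
BLOCKS_PER_YEAR = 6 * 24 * 365        # one block every ~10 minutes = 52,560
CHAIN_HEIGHT = 470_000                # approximate height at time of writing

print(BLOCKS_PER_YEAR * HEADER_BYTES / 1e6)  # ~4.2 MB of new headers per year
print(CHAIN_HEIGHT * HEADER_BYTES / 1e6)     # ~37.6 MB for the whole chain
```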

The Merkle trees used to prove transaction inclusion in blocks also scale extremely well, because each 'layer' added to the tree doubles the total number of 'leaves' it can represent. Even a block containing millions of transactions doesn't need a very deep tree to compactly prove any single transaction's inclusion.

The Merkle tree data structure is so efficient that a tree of depth 24 can represent 16.7 million transactions — enough for an 8GB block — yet the Merkle proof for any one of those transactions remains under 1KB in size.
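
A quick back-of-the-envelope check of that claim, assuming 32-byte (SHA-256) hashes and ~500-byte transactions:

```python
HASH_BYTES = 32            # SHA-256 digest size
AVG_TX_BYTES = 500         # rough average transaction size

depth = 24
leaves = 2 ** depth                        # 16,777,216 transactions
block_gb = leaves * AVG_TX_BYTES / 1e9     # ~8.4 GB block
proof_bytes = depth * HASH_BYTES           # 768 bytes: one hash per level

print(leaves, round(block_gb, 1), proof_bytes)
```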

From the SPV client's point of view, then, everything looks rosy: the bitcoin network can scale to gigantic blocks and the amount of data an SPV client must process remains trivial — even for a phone on a 3G connection.

But alas, scaling the bitcoin network is not that simple!

SPV server scaling

While SPV is very efficient for the client, the same is not true for the server — that is, for the full nodes to which SPV clients send their requests. This approach exhibits poor scalability for several reasons.

A node must process very large amounts of data to return the results a client wants, and it must repeat that work for every block requested by every peer. Disk I/O quickly becomes a bottleneck.

Each SPV client must sync the entire blockchain from the point it last contacted the network — or, if it suspects it has missed some transactions, rescan the whole blockchain from the wallet's creation date. In the worst case that is around 150GB at the time of writing, and the full node must load every one of those blocks from disk, filter them, and return the results the client requested.

Since the blockchain is an append-only ledger, it only ever grows. And without a wide-ranging protocol change, blockchain pruning is incompatible with BIP 37 — nodes that advertise the NODE_BLOOM service bit are expected to be able to serve any block.

A BIP 37 SPV client can be deceived by omission. To mitigate this, SPV clients connect to multiple nodes (usually four), though this is no guarantee — Sybil attacks can partition an SPV client from the honest network. Connecting to multiple nodes also multiplies the client's load on the network by a factor of four.

For every SPV client that is synced to the tip of the blockchain, each incoming block and transaction must be individually filtered. This consumes a non-negligible amount of CPU time and must be repeated separately for every connected SPV client.

Crunching the numbers

Approximately 8,300 listening full nodes (nodes that accept incoming connections) are running at the time of writing, about 8,000 of which advertise the NODE_BLOOM service bit and are therefore capable of serving SPV clients. But how many SPV clients can the current population of listening full nodes reasonably support?

And what would it take for the network of full nodes to support billions of daily users, with blocks large enough to hold their transactions?

Bitcoin Core defaults to a maximum of 117 incoming connections, so those 8,000 nodes offer a theoretical maximum of about 936,000 available sockets on the network. Most of these sockets are already consumed today, however.

Each full node makes eight outbound connections to other full nodes by default. Bitcoin Core developer Luke-Jr's node-counting method gives a (very rough) estimate of 100,000 total nodes at the time of writing, 92,000 of which do not listen and therefore offer no sockets to SPV clients. Those 100,000 nodes consume 800,000 of the available sockets for their own outbound connections, leaving only 136,000 sockets available to SPV clients.

This leads me to conclude that approximately 85% of the network's available sockets are consumed by the mesh of connections between full nodes. (It's worth noting that Luke-Jr's estimation method cannot measure how long non-listening nodes stay online, though presumably some of them periodically disconnect and reconnect.)
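
The socket arithmetic, using the rough estimates above:

```python
LISTENING_NODES = 8_000        # nodes advertising NODE_BLOOM
MAX_INBOUND = 117              # Bitcoin Core default inbound slots
TOTAL_NODES = 100_000          # Luke-Jr's rough estimate
OUTBOUND_PER_NODE = 8          # default outbound connections

total_sockets = LISTENING_NODES * MAX_INBOUND          # 936,000
used_by_full_nodes = TOTAL_NODES * OUTBOUND_PER_NODE   # 800,000
left_for_spv = total_sockets - used_by_full_nodes      # 136,000

print(total_sockets, used_by_full_nodes, left_for_spv)
print(round(used_by_full_nodes / total_sockets, 2))    # ~0.85
```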

Data from statoshi.info shows an average of 100 connected full node peers (8 outbound, 92 inbound) and 25 connected SPV clients, which likewise suggests that roughly 80% of that node's sockets are consumed by other full nodes.

If we want even one billion SPV users to be able to use this system, there must be enough full node resources available to serve them — network sockets, CPU cycles, disk I/O, and so on. Can we get there?

To stress-test the claim that SPV can scale this far, let's make some conservative assumptions about one billion SPV users, each of whom:

  • Receives and sends one transaction per day
  • Syncs their wallet to the chain tip once per day
  • Queries four nodes when syncing, to reduce the chance of being deceived by omission

One billion transactions per day, if spread evenly across blocks (which they won't be), works out to about 7 million transactions per block. Because Merkle trees scale so well, proving a transaction's inclusion in such a block takes only 23 hashes: 736 bytes of proof data, plus an average of 500 bytes for the transaction itself.

Add roughly 12KB per day of block header data, and an SPV client would still use only about 20KB of data per day.
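
A quick check of the client-side numbers under these assumptions (the gap between the ~14KB computed here and the ~20KB figure above is presumably protocol overhead):

```python
import math

TX_PER_DAY = 1_000_000_000
BLOCKS_PER_DAY = 144
HASH_BYTES, HEADER_BYTES, AVG_TX_BYTES = 32, 80, 500

tx_per_block = TX_PER_DAY // BLOCKS_PER_DAY          # ~6.9 million
proof_hashes = math.ceil(math.log2(tx_per_block))    # 23 hashes
proof_bytes = proof_hashes * HASH_BYTES              # 736 bytes
headers_per_day = BLOCKS_PER_DAY * HEADER_BYTES      # ~11.5 KB

# Two transactions per day (one sent, one received) plus the headers:
daily_kb = (headers_per_day + 2 * (proof_bytes + AVG_TX_BYTES)) / 1000
print(round(daily_kb, 1))                            # ~14 KB of payload
```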

However, a billion transactions a day generates roughly 500GB of new blockchain data for full nodes to store and process. And each of the four full nodes an SPV client queries must read and filter that 500GB every time the client connects to check for the previous day's transactions.

Keep in mind that there are currently about 136,000 sockets available for SPV clients across the roughly 8,000 SPV-serving full nodes. If each SPV client uses four sockets, only 34,000 clients can sync against the network at any one time. If more than 34,000 users are online at once, the rest will receive connection errors when they open their wallets and try to sync to the chain tip.

Thus, on a network that can serve only 34,000 syncing users at any one time, supporting one billion SPV users syncing once a day would require roughly 29,400 cohorts of users to connect, sync, and disconnect each day — leaving each user about three seconds to sync the previous day's data.

This poses a bit of a challenge, as it would require each full node to continuously read and filter 500GB of data in three seconds — 167GB per second — for every connected SPV client. With 20 SPV clients per full node, that's 3,333GB per second. I'm not aware of any storage device with that kind of throughput, though it might be possible to build a huge RAID 0 array of high-end solid-state drives, each capable of roughly 600MB per second.

You'd need about 5,555 of these drives to reach the target throughput. At the time of writing they cost about $400 each and hold about 1TB — a terabyte being enough for only two days' worth of blocks on this theoretical network. So a new disk array would be needed every couple of days, at over $2.2 million each — more than $100 million to store a year's worth of block data while achieving the required throughput.
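
The arithmetic behind those figures, assuming 600MB/s SSDs at $400 apiece:

```python
DAILY_CHAIN_GB = 500                  # 1B tx/day * 500 bytes
SYNC_SECONDS = 3                      # ~86,400 s / 29,400 cohorts
CLIENTS_PER_NODE = 20
SSD_GBPS, SSD_COST_USD = 0.6, 400     # per-drive throughput and price

per_client_gbps = DAILY_CHAIN_GB / SYNC_SECONDS        # ~167 GB/s
per_node_gbps = per_client_gbps * CLIENTS_PER_NODE     # ~3,333 GB/s
drives = per_node_gbps / SSD_GBPS                      # ~5,555 drives
array_cost = round(drives) * SSD_COST_USD              # ~$2.2 million

print(round(per_client_gbps), round(per_node_gbps), round(drives), array_cost)
```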

Of course, we can tweak these assumptions a bit. Is there a scenario in which the cost of running a node stays reasonable?

Let's try this: suppose we had 100,000 full nodes running on cheaper, high-capacity spinning disks, and we could somehow get all of those full nodes to accept SPV clients, and somehow modify the full node software to support 1,000 connected SPV clients each.

That gives us 100 million sockets available for SPV clients — enough for 25 million SPV clients online simultaneously (at four sockets each) — so each SPV client gets 2,160 seconds per day to sync with the network. To meet that deadline, a full node must sustain a read rate of 231MB per second per SPV client, which means 231GB per second with 1,000 SPV clients connected.

A 7,200 RPM hard drive reads at about 220MB per second, so it would take a RAID 0 array of more than 1,000 such drives to achieve the target read speed.

At the time of writing, a 10TB drive costs about $400, so a $400,000 RAID array of these drives could hold 20 days' worth of blocks — around $7.2 million to store a year's worth of data while achieving the required read throughput. A comparatively affordable price!
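
And the arithmetic for this cheaper scenario:

```python
TOTAL_NODES = 100_000
SPV_SLOTS_PER_NODE = 1_000
SOCKETS_PER_CLIENT = 4
SPV_USERS = 1_000_000_000
DAILY_CHAIN_GB = 500
HDD_GBPS = 0.22                       # ~220 MB/s for a 7,200 RPM drive

concurrent = TOTAL_NODES * SPV_SLOTS_PER_NODE / SOCKETS_PER_CLIENT  # 25M
sync_seconds = 86_400 * concurrent / SPV_USERS                      # 2,160 s
per_client_gbps = DAILY_CHAIN_GB / sync_seconds                     # ~0.231 GB/s
per_node_gbps = per_client_gbps * SPV_SLOTS_PER_NODE                # ~231 GB/s
drives = per_node_gbps / HDD_GBPS                                   # ~1,050 drives

print(int(concurrent), int(sync_seconds), round(per_node_gbps), round(drives))
```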

At least two of these devices would need to be added almost every day!

It's worth noting that no one in their right mind would run a RAID 0 array with that many devices, because a single drive failure destroys the entire volume. Fault-tolerant RAID configurations are more expensive and perform worse. It also seems incredibly optimistic to expect 100,000 organizations to be willing to spend millions of dollars a year each to run a full node.

It is also important to note that these conservative estimates assume SPV clients can somehow coordinate to spread their syncing evenly throughout the day. In reality, there will be daily and weekly peaks and troughs of activity, so the network would need considerably more capacity than estimated to handle demand at peak times. Otherwise, SPV clients would simply fail to sync during peak usage.

Interestingly, it turns out that changing the number of sockets per node doesn't affect the total load on any given full node — it still has to process the same amount of data. What really matters in this equation is the ratio of full nodes to SPV clients and, of course, the size of the blocks the full nodes must process.

The conclusion seems inescapable: the cost of running a full node capable of serving SPV to a billion daily on-chain users would be enormous.

Finding middle ground

Given the above, it's clear that a billion on-chain transactions a day would put running a fully validating node out of reach of all but the wealthiest entities.

But what if we set aside these specific estimates and instead looked for a formula relating on-chain transaction throughput to the load placed on the network?

For the bitcoin network to support a target number of transactions per second (where each sustained transaction per second adds capacity for 86,400 new daily users), we can calculate the required disk read throughput per node as follows:
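
The formula itself appeared as a graphic in the original article and did not survive here; a reconstruction consistent with the assumptions listed below and with the figures elsewhere in this piece is:

$$
T_{\text{node}} = \frac{\mathrm{tps}^2 \times 86{,}400 \times \bar{s} \times c}{N}
$$

Here $\mathrm{tps}$ is the transaction rate, $\bar{s}$ is the average transaction size, $c$ is the number of sockets each SPV client uses, and $N$ is the number of SPV-serving full nodes. Notice that the total number of available sockets cancels out — matching the observation above — and that the required throughput grows with the square of the transaction rate.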

This gives the minimum disk read throughput a full node needs in order to serve SPV clients. Based on the characteristics of today's network and currently available technology, we can estimate node operating costs by treating disk throughput as the bottleneck — though there are certainly other resource constraints that would also drive up the cost of running a full node.

In the following calculation, I used these assumptions:

  • Average transaction size: 500 bytes, per statoshi.info
  • Total SPV users: one user per daily transaction
  • Sockets consumed per SPV client: the standard 4
  • Sockets made available to SPV clients by full nodes: 136,000, as calculated earlier
  • Disk throughput and storage cost: $400 10TB 7,200 RPM hard drives in a RAID 0 configuration

We can see that the required disk throughput is quite manageable at low transaction rates, but once the network passes roughly 100 transactions per second, a single drive no longer suffices and multiple disks must be striped together in a RAID array to achieve the necessary performance.

Unfortunately, the required disk throughput — and with it the total cost of running a node — grows quadratically with the transaction rate. The cost quickly becomes unaffordable for most people.

For reference, keep in mind that Visa processes approximately 2,000 transactions per second. For bitcoin to keep up with SPV demand at that rate, each node would need roughly $200,000 worth of disks. Note also that these figures hold the total node count constant at 8,000 — in reality, the node count would likely fall as costs rise, pushing the throughput requirement, and therefore the cost, of the remaining nodes up even faster.
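
Plugging Visa-scale throughput into the formula reconstructed above lands in the same ballpark:

```python
TPS = 2_000
AVG_TX_BYTES = 500
SOCKETS_PER_CLIENT = 4
NODES = 8_000
HDD_GBPS, HDD_COST_USD = 0.22, 400    # 220 MB/s, $400 per 10TB drive

t_node_gbps = TPS**2 * 86_400 * AVG_TX_BYTES * SOCKETS_PER_CLIENT / NODES / 1e9
drives = round(t_node_gbps / HDD_GBPS)
print(round(t_node_gbps), drives, drives * HDD_COST_USD)
# ~86 GB/s -> ~393 drives -> ~$157,000, the order of magnitude of the
# ~$200,000 figure cited above
```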

This looks like a compounding force pushing toward node centralization.

As I noted in my earlier piece on saving bitcoin's node network from centralization, the cost of running a node is one of the fundamental points of contention in the block size debate. Computing node operating costs is complex because so many variables are involved — the calculation above holds most of them constant and focuses only on disk I/O cost, but it gives us a basic feel for the problem.

An (admittedly unscientific) poll I ran a year ago suggested that 98% of node operators wouldn't pay more than $100 a month to run a node, even with a substantial investment in bitcoin. I'd be willing to bet that increasing bitcoin's on-chain transaction volume by one order of magnitude would drive away most full nodes, and that two orders of magnitude would eliminate 90% of nodes or more.

I believe it's safer to assume that only a handful of people would go to the trouble of building RAID arrays just to run a full node. In that case, the claim that such an increase doesn't matter to the average user falls apart, because there wouldn't be enough full node disk throughput — or sockets — left to serve SPV clients at all.

Other disadvantages of SPV

SPV is great for end users who don't require the security or privacy of a fully validating node. However, regardless of scalability, there are many reasons why a network composed mostly of SPV nodes would be problematic.

SPV makes major assumptions that result in significantly weaker security and privacy than a fully validating node:

  1. SPV users trust miners to correctly validate and enforce bitcoin's rules; they assume that the blockchain with the greatest cumulative proof of work is also a valid chain. The difference between the SPV and full node security models is covered in this article.
  2. SPV users assume that full nodes won't lie to them by omission. A full node can't lie about the existence of a transaction in a block (it can't fake a Merkle proof), but it can lie about the nonexistence of one simply by withholding it.
  3. Because SPV clients strive for efficiency, they request data only for their own transactions. This results in a huge loss of privacy.

Interestingly, Matt Corallo, co-author of BIP 37, regrets creating it:

“The biggest privacy issue in the system right now is the BIP37 SPV Bloom filter. I’m sorry I wrote that.”

SPV clients using BIP 37 Bloom filters have essentially zero privacy, even at unreasonably high false-positive rates. Jonas Nick (a security engineer at Blockstream) found that, given a single public key, he could identify 70% of a wallet's other addresses.

SPV clients can work around this poor privacy by splitting their Bloom filters across multiple peers, though this makes SPV scale even worse by putting more load on full nodes.

BIP 37 is also vulnerable to trivial denial-of-service attacks. Proof-of-concept code is available that can cripple full nodes by rapidly sending lots of inventory requests through specially constructed filters, causing continuous disk seeks and high CPU usage.

Peter Todd, the author of the proof-of-concept attack, explains:

“The bottom line is that you can consume a disproportionate amount of disk I/O bandwidth with very little network bandwidth.”

Even to this day, the fraud alerts Satoshi described in his white paper have yet to materialize. In fact, efforts to design them suggest that lightweight fraud proofs may not even be possible to implement.

For example, a fraud proof only works if you actually have the data required to prove the fraud — and if a miner doesn't publish that data, the proof can't be constructed. As a result, SPV clients do not offer the level of security Satoshi envisioned.

From a high-level perspective, a world of mostly SPV nodes would make consensus changes — such as raising the total coin supply cap, or even editing the ledger — much easier to pull off. Fewer fully validating nodes means more centralized enforcement of the consensus rules, and therefore less resistance to changing them. Some consider that a feature; many more consider it a flaw.

Potential improvements

SPV security and scalability could potentially be improved in several ways — fraud proofs, fraud hints, input proofs, spend proofs, and so on. But as far as I'm aware, these remain at the concept stage and are not yet ready for production development.

Committed Bloom filters could improve privacy, though there is a trade-off between a filter's size and its false-positive rate: too coarse, and the client downloads too many false-positive blocks; too fine, and the filter becomes impractically large for SPV clients to download.

This approach reduces the load on full node disk throughput, but the trade-off is increased bandwidth between the SPV client and the full node, since whole blocks must be transferred over the network when the filter matches.
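
To illustrate the size/false-positive trade-off, here is the standard Bloom filter sizing formula applied to a block from the billion-transaction-per-day scenario; this is a generic illustration, not the parameters of any specific proposal:

```python
import math

def bloom_filter_bytes(n_items: int, fp_rate: float) -> float:
    """Optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    return bits / 8

# A filter over a ~7-million-transaction block:
for p in (0.01, 0.001, 0.0001):
    print(p, round(bloom_filter_bytes(7_000_000, p) / 1e6, 1), "MB")
# 1% -> ~8.4 MB, 0.1% -> ~12.6 MB, 0.01% -> ~16.8 MB per block
```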

The recently proposed compact client-side block filtering removes the privacy concern, though it requires clients to download entire blocks (not necessarily over the p2p network) whenever a filter matches.

UTXO commitments could allow SPV clients to sync their current UTXO set — and therefore their wallet balance — without asking full nodes to scan the entire blockchain; the client would instead receive proofs that its UTXOs exist.

It might be possible to protect Bloom-filter-serving nodes against DoS attacks by requiring SPV clients to submit either proof of work (painful for battery-powered devices like phones) or a micropayment over a payment channel (a chicken-and-egg problem for clients that haven't received any money yet) — but neither offers a straightforward solution.

Disk read requirements for full nodes could also be reduced by better data indexing and by batching requests from SPV clients.

Ryan X. Charles points out in the comments below that using BIP 70's payment protocol to tell the payee directly which UTXO you sent them removes their need to use Bloom filters at all, since they can request that data directly from a full node. This is very efficient, if you're willing to accept the privacy trade-off.

As far as I can tell, there's still plenty of room for improvement — but many hurdles must be overcome to improve on-chain scalability.

Appropriate scaling solutions

If we set aside the many other issues with scaling block size — such as block propagation latency, UTXO set growth, initial block sync time, and security and privacy trade-offs — then it may be technically possible to scale bitcoin to hundreds of millions of daily on-chain users, so long as someone is willing to invest enormous resources in developing the improved infrastructure and software and in meeting the operational demands.

It seems unlikely that bitcoin will evolve organically in this way, however, because there are far more modest ways to scale the system. The most efficient form of scaling is already in use: centralized API providers. There tend to be huge trust and privacy trade-offs when using this approach, though many such relationships involve contractual agreements, which mitigates some of the risk.

For scaling in a trust-minimized way, layer 2 protocols such as Lightning offer greater efficiency, because large volumes of data are exchanged only among the few parties directly involved in a given set of transactions. You can think of it as the difference between a broadcast-to-everyone Ethernet segment and the routed IP layer — the internet couldn't scale without routing, and neither can a money network.

Although this approach to scaling is technically more complex than traditional centralized scaling and has many unique challenges to overcome, the upfront investment in researching and developing these routing protocols will pay huge dividends in the long run by reducing the load on the base network by an order of magnitude.

There's also a lot to explore between the two extremes:

  • Centralized custodial schemes with strong privacy, using Chaumian blind-signature tokens
  • Centralized non-custodial proof systems such as TumbleBit
  • Federated (semi-trusted multisig) sidechains
  • Miner-secured (semi-trusted) drivechains

I still believe that, in the long run, bitcoin will need bigger blocks.

Let's be patient, and strategic, about scaling the system as efficiently as possible while preserving its security and privacy.

An auditable, mildly decentralized PayPal could certainly be useful if operated in the interests of ordinary users, but it would not provide the financial sovereignty that bitcoiners enjoy today.

Thanks to Matt Corallo, Mark Erhardt and Peter Todd for their review and feedback on this article.

Disclosure: CoinDesk is a subsidiary of Digital Currency Group, which has an ownership stake in Blockstream.