I've been spending some time on the #bittorrent channel on Freenode, to see if I could get any good information from other bittorrent client developers. After some poor initial contact, I did get some good suggestions, as well as some tidbits of information I wasn't aware of.

One of the tidbits was that most of the bittorrent client developers frowned on the use of selective downloading for torrents, to the point where some refuse to implement it in their clients (apparently the latest mainline client doesn't even have it, though I haven't checked due to license issues).

This concerned me, as the DebTorrent client I'm working on will rely heavily on selective downloading to fetch only the packages the user wants to install. It's not clear to me that it will really be a problem though. I'm sure it will make it more difficult to find a rare package in a large swarm of peers, but that's to be expected, and is nicely solved by using an [[!HTTP download from a backup mirror|Jun-27-2006]]. This backup downloading could become extreme though, making the bittorrent-like download (peer-to-peer) mostly just an HTTP download (client-server). To see whether it will be a problem, I ran a simulation of different-sized swarms to determine how much unnecessary HTTP downloading would occur.
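To make that fallback concrete, here is a rough sketch of the per-package decision I have in mind (in Python; the `peer.has()` and `peer.fetch()` calls are hypothetical placeholders for the real availability and transfer code, not DebTorrent's actual API):

```python
import urllib.request

def download_package(package_path, connected_peers, mirror_url):
    # Ask each connected peer in turn; has()/fetch() stand in for the
    # real piece-availability and transfer logic.
    for peer in connected_peers:
        if peer.has(package_path):
            return peer.fetch(package_path)
    # No connected peer has the package: fall back to a plain HTTP
    # download from the backup mirror.
    with urllib.request.urlopen(mirror_url + "/" + package_path) as response:
        return response.read()
```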

First, some assumptions to make the simulation easier:

  • peers join and download sequentially and one at a time
  • peers never leave
  • peers download all the packages for their system at once (i.e. all fresh installs, no upgrades)
  • peers are all interested in the same version and architecture of packages
  • there are N total peers, and each can make C connections to other peers

With these assumptions, the simulation becomes quite simple. I used the popcon data to assign packages appropriately to the N peers. The peers download their assigned packages from the C previous peers, if possible, or otherwise by HTTP. I ran this multiple times, varying N and C, and calculating each time how much was downloaded through the debtorrent protocol, and how much through HTTP.
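For anyone who wants to reproduce this, the core of the simulation looks roughly like the following (a simplified sketch; `package_sizes` and `popcon_installs` are just my placeholders for however the popcon and archive data get loaded):

```python
import random

def simulate(N, C, package_sizes, popcon_installs):
    """N peers join one at a time, each connecting to at most C earlier peers."""
    peers = []                    # the set of packages each joined peer holds
    http_bytes = total_bytes = 0

    for _ in range(N):
        # Assign this peer's packages according to popcon popularity.
        wanted = {p for p, prob in popcon_installs.items()
                  if random.random() < prob}
        # Connect to (up to) C randomly chosen earlier peers.
        connections = random.sample(peers, min(C, len(peers)))
        available = set().union(*connections) if connections else set()

        for p in wanted:
            total_bytes += package_sizes[p]
            if p not in available:
                http_bytes += package_sizes[p]   # no connected peer has it: HTTP
        peers.append(wanted)

    return http_bytes, total_bytes, peers
```

Running something like this over a range of N and C and plotting `http_bytes / total_bytes` produces the kind of curves shown below.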

debtorrent wasted.png

This graph shows the percentage of the total download that used HTTP from a mirror. The optimal line shows the minimum possible (i.e. every peer is connected to every other peer), which corresponds to only the first copy of each package being downloaded using HTTP.
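The optimal value itself doesn't need a simulation: it's the size of one copy of every package anybody wanted, divided by the total bytes downloaded. Continuing the sketch above:

```python
def optimal_http_fraction(peers, package_sizes):
    # One unavoidable HTTP copy of each distinct package anybody wanted...
    one_copy = sum(package_sizes[p] for p in set().union(*peers))
    # ...divided by everything that was downloaded, by any method.
    total = sum(package_sizes[p] for wanted in peers for p in wanted)
    return one_copy / total
```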

This verifies my previous thinking: fracturing the download population into multiple small swarms results in lots of inefficiency and lots of HTTP downloading. Every effort will need to be made to keep the downloading population together. (This is especially difficult for unstable, where a new swarm could be created twice a day due to archive updates.) Swarms of 1,000 peers or more seem to be sufficient to minimize the HTTP downloading.

debtorrent unnecessary.png

This graph shows the amount of unnecessary HTTP downloading that occurred, which is just the difference between each line and the optimal one in the previous plot. It shows the danger of selective downloading: all of this unnecessary HTTP downloading happens because the swarm is too big for a peer's limited connections to find another peer that has the desired package, even though such peers do exist. Fortunately, this unnecessary downloading seems to approach a maximum as the swarm size increases.
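In terms of the sketch above, the unnecessary downloading is just the simulated HTTP fraction minus that optimal fraction:

```python
http_bytes, total_bytes, peers = simulate(N, C, package_sizes, popcon_installs)
unnecessary = http_bytes / total_bytes - optimal_http_fraction(peers, package_sizes)
```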

For the large swarm sizes needed to make the downloading efficient, and assuming a reasonable 100 connections per peer (bittorrent defaults to maintaining at least 40, and large swarms usually mean a large number of connections), we can expect to use HTTP for about 4% of the download. This number is surprisingly low due to the popularity distribution of packages: only the rare ones are hard to find, and they aren't downloaded very often. I think 4% is manageable in our situation, given that there is a backup method to find the rare packages, though it does show the need to have that backup method available. This is probably more of a problem for regular bittorrent clients, which have no backup method.