avatarRobert Gowty

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

10462

Abstract

it:800/16XAYVfIn4OQjxWuYT9uVPA.png"><figcaption>A simplified flowchart of the <code>rsync algorithm.</code></figcaption></figure><p id="75f1">I’ll now cover each of the five steps in detail.</p><h2 id="b09d">Step 1: Dividing data into blocks</h2><p id="4180">First, a list of files is compiled. For each file, the raw data are divided into <i>N</i> blocks of data of <i>M</i> bytes each. If the file cannot be evenly divided into <i>N</i> blocks of <i>M</i> bytes, the final block can be less than <i>M</i> bytes.</p><h2 id="18ba">Step 2: Sending computer computes 2 individual checksums for each block.</h2><p id="0db7">In order to determine <i>which</i> part of the file needs updating, <i>checksums </i>are used. A checksum is a simple code or sequence of numbers generated by processing input data with a function. This could be as simple — and unrealistic — as creating a checksum for the word “dog,” or, more realistically, creating a checksum for an entire disk image of an OS¹. Checksums are typically used for data-integrity measures, such as determining if a file or message was corrupted in transit. <code>rsync</code> harnesses the power of checksums to determine if the files on the sending and receiving sides are different. Two checksums are performed for each block of data: a <a href="https://rsync.samba.org/tech_report/node3.html">32-bit “rolling” checksum</a> and a more involved 128-bit checksum.</p><p id="17ff"><b>The rolling checksum:</b> The rolling checksum is constructed using an optimized version of the algorithm developed by Mark Adler: <a href="https://en.wikipedia.org/wiki/Adler-32">adler-32</a>. This algorithm is fast and requires little computation power, owing to the recurrence relations used to compute subsequent checksums, as we will soon see. This algorithm receives its “rolling” moniker from this use of recurrence relations, which uses data previously calculated (and, consequently, is the reason for the quick computation speed). This checksum is computed in two separate calculations. Equations (1) and (2) display the two, 16-bit calculations:</p><figure id="ba8b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*x7xyrrA91tDDv6zPb7HjKg.png"><figcaption></figcaption></figure><p id="a73b">Where <i>Bi</i> corresponds to byte <i>i</i>, the indices on α and <i>β</i> run from byte <i>j</i> to byte <i>k,</i> i.e., <i>Bj…Bk</i> (where <i>B1 ≤ Bj ≤ Bk ≤ Bn)</i>, and where mod is the modulo operator with modulus <i>M = 2¹⁶</i>.</p><p id="08ff">These 16-bit values are then combined into the final, 32-bit checksum via Equation (3).</p><figure id="ccd2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*CYF6FQox_Mg-htTiWDcqiw.png"><figcaption></figcaption></figure><p id="3b9a">As previously mentioned, recursive relations are employed [see Equation (3)], which greatly reduce the computation time, and help enable the checksums to be calculated in a “rolling” manner. After the first block of data, these relations are used until no more blocks are left.</p><figure id="0472"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*LGhNfr2x7CJNnRyK2fg6Kw.png"><figcaption></figcaption></figure><p id="4038"><b>The 128-bit checksum: </b>The second checksum calculated for each block is a more complex, 128-bit checksum, <a href="https://link.springer.com/chapter/10.1007/3-540-38424-3_22">message-digest 4 (MD4)</a> algorithm. This algorithm was developed by Ronald Rivest, and was primarily used as cryptographic hash function, originally used for security purposes, but due to vulnerabilities, has been replaced with the <a href="https://en.wikipedia.org/wiki/MD5">MD5 algorithm</a>. Computing this checksum is slightly more complicated, but entirely comprehensible.</p><p id="13e1">In the context of <code>rsync</code>, the MD4 algorithm proceeds by padding the data with a single “1” and filling the remaining positions with zeros, such that it reaches 448 bits in length (i.e., 512 – 64 = 448 bits). For cases in which the data are 448 bits long, 512 bits are appended to the data. Next, two 32-bit words are appended to the data, which are representations of the original data before any padding took place, bringing the total length to 512 bits.</p><p id="f90e">To begin the cryptographic process, four 32-bit words (<i>w1</i>, <i>w2</i>, <i>w3</i>, and <i>w4</i>) are initialized with the values specified in the documentation (see the <a href="https://dspace.mit.edu/bitstream/handle/1721.1/149165/MIT-LCS-TM-434.pdf?sequence=1">original paper</a>). Three functions are defined, which map 32-bit input values to an identically-sized output value. Sixteen such words are processed at a time, and the output is used to modify the original four 32-bit words (i.e., by simple addition). The final values of the four, 32-bit words are then combined, in reverse order (<i>w4w3w2w1</i>) to represent the final message digest, or checksum.</p><p id="2782">For more info on this important algorithm, see this <a href="https://nickthecrypt.medium.com/cryptography-hash-method-md4-message-digest-4-explained-with-python-f201b74f51d">fantastic article</a> by Nick from The Crypt.</p><p id="cfb5">These checksums are stored in a hash table (which is basically just a glorified dictionary) for later use. With the requisite checksums computed for each block, we are now ready for the third step.</p><h2 id="baac">Step 3: Receiving computer transmits checksums to sending computer</h2><p id="94d7">There isn’t much to comment on here, aside from the fact that the data are sent sequentially. This ensures that, by virtue of the order of the transmitted blocks, the sending machine can easily keep the order of the data for later use.</p><h2 id="de72">Step 4: Scan through checksums to see what data match</h2><p id="634d">This important step serves to prevent data presently on the receiving machine from being transmitted by the sending machine. Ultimately, the previously calculated checksums are compared with the checksums of the data on the sending machine. This is not the full picture, however, as this portion of the algorithm processes suspected matched data in three levels to determine where exactly the data are identical. The following levels are an abbreviated (and hopefully more comprehensible) discussion of those specified in the <code>rsync</code> documentation in the <a href="https://rsync.samba.org/tech_report/node4.html">Checksum searching</a> section:</p><p id="8ffa"><b>Level 1:</b> To begin, the receiving-side rolling checksums are passed through a hash function, thereby producing a 16-bit value for the entire list of checksums. These are stored in a hash table, for compactness and ease of access. For each “offset” <i>j</i> (which is just a fancy way of saying some byte <i>Bj</i>, where <i>1 ≤ j ≤ n</i>) in the file, the rolling checksum and hash are computed for the sending-side data. After this calculation the sending- and receiving-side hash values are compared; for non-null values, the algorithm moves to the next offset, <i>j + 1. </i>For null values the search is elevated to level 2.</p><p id="f593"><b>Level 2: </b>With a difference detected by the presence of a null value in the level one check, the list of receiving-side rolling checksum values are scanned and compared with the sending-side checksum for the current block. When a 32-bit checksum is found to match, the search proceeds to level 3, otherwise, the scan continues until a non-matching hash value is detected.</p><p id="6081"><b>Level 3: </b>To precisely determine the location of data that is common to both the sending- and receiving-side information, the 128-bit, MD4 checksum is calculated for the sending-side data and referenced to the 128-bit checksum value of the receiving-side data. If a match is found, the offset is tabulated for later use in step 5, after all blocks have been checked.</p><h2 id="a632">Step 5: Send instructions to recreate the data on the receiving computer</h2><p id="ba7f">After a match at level 3, the block offset number of the corresponding<b> receiving-side block</b> is then transmitted from the sending to the receiving machine. During the entire <code>rsync</code> process, these searches for matching data and the actual sending of the data is performed continuously. This search proceeds through all of the <i>N</i> blocks, with levels 1, 2, and 3 repeating if the specific criteria are met. If no matches are found, the raw data are transferred and inserted into file at the appropriate offset.</p><p id="d0f7">When <code>rsync</code> updates a file, the instructions sent during step 5 are used to populate a temporary file which, after the update is complete, is renamed to the appropriate file name.</p><p id="cbf0">As noted earlier, if you don’t want to rely on an examination of the file meta data and would prefer every file to be closely inspected by comparing checksums, one would use the <code>--checksum</code> option.</p><p id="b4c5">For an extremely in-depth look at the <code>rsync</code> algorithm, including its implementation in C, check out Norman Tuft’s <a href="https://www.cs.tufts.edu/~nr/rsync.html">page on the rsync algorithm</a> and the webpage <a href="https://rsync.samba.org/how-rsync-works.html">How Rsync Works: A Practical Overview</a>, as well as the <a href="https://rsync.samba.org/tech_report/tech_report.html">documentation</a> from the developers of <code>rsync</code>. Let’s now turn to the question at hand: why use <code>rsync</code>?</p><h1 id="c83d">How to use rsync</h1><p id="3b49"><code>rsync</code> can be used just like <code>cp</code>, where many of the same flags transfer over. The most basic use of <code>rsync</code> is shown below:</p><div id="ee9c"><pre><span class="hljs-meta prompt_">$ </span><span class="language-bash">rsync -azh --progress account@hostname:/source/dir/ destination/dir</span></pre></div><p id="8f44">Here, the <code>a</code> flag is for archive mode, which is a combination of many flags (specifically <code>-rlptgoD</code>), and specifies that all attributes of the data should be preserved (e.g., ownership, file creation time, permissions, etc.), and that files should be copied recursively. The <code>z</code> flag is for compression, which uses <code>zlib</code> as the compression utility. <b>Remember to use the compression flag responsibly! If you have enough bandwidth and your connection speed is reasonably fast, compression is usually

Options

unnecessary. Additionally, for local-to-local transfers, compression is useless for two reasons: any bottleneck is at the hardware level, and compression/decompression wastes CPU resources which could be spent on transferring/reconstructing data. </b>The <code>h</code> flag is the standard “human-readable” option which will display the post-transfer statistics in human-readable terms (e.g., kB, MB, etc.). Finally, I include the <code>--progress</code> option to display the current transfer statistics.</p><p id="2db0">For more simple data moving purposes, such as a local-to-local transfer, we can use the exact same command (sans the compression option):</p><div id="104d"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -a path/to/source/dir/* path/to/destination/dir</span></pre></div><p id="772b">Here, one can use either the relative (displayed above) or absolute paths.</p><h1 id="1f47">The rsync cookbook</h1><h2 id="6623">Local transfers</h2><p id="aff1"><b><i>Locally transferring a single file</i>: </b>By default, we always want to preserve the file attributes such as ownership, creation time, etc., so we use (<code>-a</code>):</p><div id="93e5"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -a path/to/src/dir/filename.file path/to/dest/dir</span></pre></div><p id="2085"><b><i>Locally transferring multiple files/subdirectories</i>:</b> To see what’s going on, we select verbose output (<code>-v</code>) with human-readable (<code>-h</code>) progress (<code>--progress</code>):</p><div id="9235"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -avh --progress path/to/src/dir/* path/to/dest/dir</span></pre></div><p id="bf62">To transfer files of a certain type (<code>*.txt</code> in this example), add an include rule:</p><div id="0eeb"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --include=<span class="hljs-string">".txt"</span> path/to/src/dir/ path/to/dest/dir</span></pre></div><p id="acaa">The opposite operation is possible using the exclude option, e.g.,</p><div id="d988"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --exclude=<span class="hljs-string">"*.json"</span> path/to/src/dir/* path/to/dest/dir</span></pre></div><p id="edf1"><b>For multiple file types:</b></p><div id="fe8e"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --exclude={<span class="hljs-string">".json"</span>, <span class="hljs-string">".xlsx"</span>}
path/to/src/dir/*
path/to/dest/dir</span></pre></div><p id="aa0c">One can also use include and exclude for directories:</p><div id="0076"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --exclude=<span class="hljs-string">"/path/to/src/dir/dir0"</span>\ path/to/src/dir/* \ path/to/dest/dir</span></pre></div><p id="5dbc">Finally, one can also specify multiple directories:</p><div id="3a34"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --exclude=<span class="hljs-string">"/path/to/src/dir/dir0"</span>
--exclude=<span class="hljs-string">"/path/to/src/dir/dir1"</span>
path/to/src/dir/*
path/to/dest/dir</span></pre></div><h2 id="3f7f">Local-to-Remote Transfers</h2><p id="4a23">Transferring <b>multiple files </b>to a remote server with preservation of file attributes (<code>-a</code>), compression (<code>-z</code>), and human-readable (<code>-h</code>) progress (<code>--progress</code>):</p><div id="9d93"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -azh --progress path/to/src/dir/* account@host:/path/to/dest/dir</span></pre></div><h2 id="b3ce">Remote-to-Local Transfers</h2><p id="ca36">Transferring <b>multiple files</b> from a remote server with preservation of file attributes (<code>-a</code>), compression (<code>-z</code>), and human-readable (<code>-h</code>) progress (<code>--progress</code>):</p><div id="4297"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -azh --progress account@host:/path/to/dest/dir/* path/to/src/dir/</span> </pre></div><h2 id="c7b4">Filter Rules</h2><p id="3004">To keep things more organized, we can collate our filter rules into a configuration file. The following example is a filter list for making a transfer for a few directories in single user’s account:</p><div id="dd8d"><pre>+ /Users/

  • /Users/username/
  • /Users/username/Desktop/
  • /Users/username/Desktop/** # include all files in ~/Desktop
  • /Users/username/Documents/**
  • /Users/username/Documents/**/* # include all files and subdirectories in ~/Documents
  • /Users/username/Photos/**
  • /Users/username/Downloads

  • /</pre></div><p id="48db">This information would be placed in some sort of plain text file. To distinguish my rsync filter files, I usually name them something like <code>backup_filters.conf</code>. Using filter files can be useful for many situations, especially when making backups with <code>cron</code>. <b>Note that each subdirectory must be specified if you wish to transfer the data in this subdirectory, including the trailing <code>- /</code>, as all data under <code>/</code> will be transferred! </b>While this is an extremely verbose method, is very accurate (e.g., you won’t unnecessarily transfer data). An example of a full command would look like:</p><div id="f778"><pre><span class="hljs-meta prompt_">$ </span><span class="language-bash">rsync -ah --progress --include={<span class="hljs-string">".tex"</span>, <span class="hljs-string">".pdf"</span>}
    --filter=<span class="hljs-string">"merge /path/to/rsync_filters.conf"</span> / /path/to/dest/dir</span></pre></div><p id="8f0a">Because using filter rules can be tricky, I’d recommend using the <code>--dry-run</code> option, which will prepare the file list so you can check for any issues before proceeding with the transfer.</p><p id="9fb1">For more information on the use of filters, see Will Haley’s article <a href="https://www.willhaley.com/blog/rsync-filters/">Rsync Filters</a>.</p><h1 id="2382">Final Note</h1><p id="e839">The next time you need to perform a serious data transfer or perform manual backups, consider <code>rsync</code>, a humble, rather unknown tool, that greatly expedites this process through its ingenious process of transferring only the data you need to transfer. This utility has saved me an immeasurable amount of time, and I hope that it can help you along your computing adventures.</p><h1 id="cfea">Appendix: Transfer speed investigation</h1><p id="e075">Because rsync intelligently determines which files have been altered, only the modified parts can be transferred. The time this algorithm saves is therefore inestimable. For example, I needed to make a backup of all of my LaTeX source files from my machine to an external SSD (USB 3.0). This directory contains several nested subdirectories and is, in total, 7.3 GB of data. Using the built-in <code>time</code> utility, I made a backup using <code>cp</code> and <code>rsync</code>. <b>The analysis that follows is simply to illustrate the timing characteristics of both utilities, and is by no means a statistically sound investigation.</b> The output of the <code>time</code> commands for both processes are provided below:</p><figure id="8994"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*q-OlPGJ2JCRQQU-v6Zr1qQ.png"><figcaption>Output of the time command for the cp command.</figcaption></figure><figure id="df2f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*K_syJLxNX0afxA5ipP3pdg.png"><figcaption>Output of the time command for the cp command.</figcaption></figure><p id="8e13">As mentioned before, every job needs the right tool, and here we can see that simply copying files to a new location is more suited to the <code>cp</code> command. Note that I used the <code>-a</code> flag in the <code>rsync</code> command which preserves all of the file attributes (creation time, ownership, etc.) and recursively copies subdirectories, whereas with the <code>cp</code> command I simply copied the files without preserving attributes. The <code>-a</code> flag is almost always necessary with most <code>rsync</code> applications as the algorithm determines if files need to be transferred based on their creation time¹. The skeptical reader will naturally wonder whether the additional effort of copying the file metadata makes a difference in transfer speeds. To determine if this was the case, I repeated the transfer using the <code>rsync -r</code> command, using the same source data. The output of this command is shown below:</p><figure id="67be"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*QY2ZgNVKlOdaCkuJN4JfWg.png"><figcaption></figcaption></figure><p id="4c3c">Surprisingly, it took roughly 3 seconds longer to perform the same transfer <i>without</i> including the metadata. To be sure this wasn’t a statistical anomaly, I timed the same command two more times (with new directories, of course):</p><figure id="31e2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*d9KNogiMJ8-rIMfSTV6NDQ.png"><figcaption></figcaption></figure><p id="d5e9">The average and standard deviation of these times are 71.35 s ± 0.64 s, 28.93 ± 0.23 s, and 40.16 ± 0.74 s for the real, user, and system times, respectively. While this is certainly not a statistically significant approach, we can conclude that it <i>may</i> be faster to transfer the file metadata. Ultimately, it took longer than the <code>cp</code> command, which highlights one of the weaknesses of <code>rsync</code>.</p><p id="b367">However, when updating the source directory with new files, <code>rsync</code> is clearly the superior tool because it only transfers the new data in modified and newly created files:</p><figure id="dfc1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*S4nmat3zh33b9KdPSJPEbg.png"><figcaption></figcaption></figure><p id="2a0b">For reference, this <code>rsync</code> command transferred 151 MB of new data in the <code>LaTeX</code>. Compared to the original <code>cp</code> command, which took roughly 50 s to execute, the <code>rsync</code> command took only 1.75 s to find and transfer the new data. Making backups with <code>rsync</code> is a serious time-saver.</p></article></body>

DELUDED CUSTODIANS DESERVE HOLIDAYS TOO, YOU KNOW.

Welcome to Regatta Day. Hobart, Tasmania.

Yes, it’s a Public Holiday about boats.

One of those cute wooden boats. I really should have cleaned the blood off the lens first. By the author.

Australians are renowned for their ridiculous public holidays. Holidays for horse races, holidays for boats, holidays simply for the sake of having a holiday.

But today is an extra special Regatta Day.

Queen Raine of the Southern Hemisphere (AKA Raine Lore) will be in attendance to review the Sail By and to oversee the public executions.

I’ve been sharpened my guillotine for weeks and I can’t wait to offer the crowd a colourful spectacle. It’s how I earned my nickname Robert “The Guillotine” Gowty.

It looks like the day is off to a bristling start with the first public execution.

Up first, Grimsby Hackney, for crimes against literature. Queen Raine sits atop the podium and a tense hush falls over the crowd as they await her decision. Thumbs up or thumbs down?

This is going to be a thumbs down for sure.

Wait, what’s happening? There’s a minor kerfuffle at the base of the podium as Annie Trevaskis pushes through the crowd to mount the podium. She whispers something in the Queen’s ear and pulls out her phone.

It looks like they might be doing some online shopping.

Annie’s now pointing at her t-shirt, recently acquired from the Grimsby Hackney Merchandise page. The Queen is pointing at the phone. From the reflection in her glasses I think she might have taken a shine to one of the mugs.

And then the signal. It’s a thumbs up. Damn. I’m going to have to release him.

Grimsby tells me to go fuck myself, up-ends a table full of lamingtons prepared by the Gormanston chapter of the Country Womens Association then staggers off into the crowd muttering something about Smillew.

Up next is the Queen of Regatta Day Paegent. Strangely, there is only one entry this year and it’s Queen Raine herself. She’s declared the winner and a cheer goes up through the crowd.

After the disappointment of Grimsby’s release the crowd is baying for blood.

This time Queen Raine doesn’t disappoint.

Next for public execution are half a dozen young ladies who look like they’re dressed up for a beauty paegent or something. It’s thumbs down across the board and the crowd goes wild.

And now it’s time for the King of Regatta Day Paegent. There is a bit more of a field for this one, but I’m still quietly confident of victory. Just a moment, MC Bob Merckel is making an announcement.

“If you’re a contestant in the Regatta Day King competition, in the next 20 minutes you have to withdrawal from the competition or go head to head Robert “The Guillotine” Gowty. You have to.”

Shortly, there’s another announcement over the public address system and it appears there have been a number of late withdrawals, as a gaggle of tuxedo-clad gents are seen running from the regatta grounds as fast as their legs will take them. I can’t imagine why. I appear to be the only remaining contestant.

It’s good to be King!

Finally, the highlight of the day. The Sail By.

Leading the way, it’s the AAV Rexinald Corgington with Rex at the helm and AAV Ziggy Stardust with Ziggy up front. They’re a crowd favourite following their recent victory over the killer penguins on Maatsuyker Island. Lucy Furr is up on the podium, next to Queen Raine, beaming.

They’re followed by what appears to be a small furry coracle, hardly bigger than an upturned hat.

The coracle appears to be manned by a man standing on his head. It’s none other than Smillew Rahcuef. He appears to be flailing a little. I hope he’s OK.

Someone on a barge has just barged into the Sail By. It looks like it might be Kristen Stark. She’s doing her stand-up routine. Oh, she’s telling the one about the axe murderer and the floral jigsaw puzzle. Hysterical. It’s going down very well with the Tasmanian crowd. Axes are very popular in this part of the world.

Now I can see what Smillew’s up to, he’s try to wrap his foot around the microphone lead in order to pull it from her hand. Physical humour at its very best.

Up next is Cristina Cattai who’s float is fashioned in the shape of a giant echidna. Beaming beatifically, she’s waving a giant mushroom at the crowd.

A stream of cute and crazy boats follows.

Finally, a special guest appearance by the Queen of the Northern Hemisphere, Ann James, in a sensational water chariot fashioned in the shape of the four wild horses of the apocalypse being pulled by seven swimming jackalopes.

The crowd is ecstatic and the chant goes up “More blood, more blood, more blood!”

Fortunately, the tuxedo dudes who were seen fleeing the regatta grounds have now been rounded up and the crowd executes a metachronal rhythmic thumbs down as the heads gently plop onto the well manicured lawn in front of the guillotine.

The poor last fellow, as his head dangles by a strand of flesh. After a busy day the blade has been blunted and he’s going to need a second go. Fortunately, the Commodore has just sounded the foghorn to signal the end of the Sail By and his gurgling screams have been muted.

So another Regatta Day, that paragon of good, clean family fun, comes to a close.

Doesn’t Tasmania sound like just the wackiest place? You may enjoy this one too.

This month’s story brought to you by a Deluded Custodians Writing Challenge

Smillewisinverted
Satire
Humour
Absurdism
Recommended from ReadMedium