Free AI web copilot to create summaries, insights and extended knowledge, download it at here
10462
Abstract
it:800/16XAYVfIn4OQjxWuYT9uVPA.png"><figcaption>A simplified flowchart of the <code>rsync algorithm.</code></figcaption></figure><p id="75f1">I’ll now cover each of the five steps in detail.</p><h2 id="b09d">Step 1: Dividing data into blocks</h2><p id="4180">First, a list of files is compiled. For each file, the raw data are divided into <i>N</i> blocks of data of <i>M</i> bytes each. If the file cannot be evenly divided into <i>N</i> blocks of <i>M</i> bytes, the final block can be less than <i>M</i> bytes.</p><h2 id="18ba">Step 2: Sending computer computes 2 individual checksums for each block.</h2><p id="0db7">In order to determine <i>which</i> part of the file needs updating, <i>checksums </i>are used. A checksum is a simple code or sequence of numbers generated by processing input data with a function. This could be as simple — and unrealistic — as creating a checksum for the word “dog,” or, more realistically, creating a checksum for an entire disk image of an OS¹. Checksums are typically used for data-integrity measures, such as determining if a file or message was corrupted in transit. <code>rsync</code> harnesses the power of checksums to determine if the files on the sending and receiving sides are different. Two checksums are performed for each block of data: a <a href="https://rsync.samba.org/tech_report/node3.html">32-bit “rolling” checksum</a> and a more involved 128-bit checksum.</p><p id="17ff"><b>The rolling checksum:</b> The rolling checksum is constructed using an optimized version of the algorithm developed by Mark Adler: <a href="https://en.wikipedia.org/wiki/Adler-32">adler-32</a>. This algorithm is fast and requires little computation power, owing to the recurrence relations used to compute subsequent checksums, as we will soon see. This algorithm receives its “rolling” moniker from this use of recurrence relations, which uses data previously calculated (and, consequently, is the reason for the quick computation speed). This checksum is computed in two separate calculations. Equations (1) and (2) display the two, 16-bit calculations:</p><figure id="ba8b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*x7xyrrA91tDDv6zPb7HjKg.png"><figcaption></figcaption></figure><p id="a73b">Where <i>Bi</i> corresponds to byte <i>i</i>, the indices on α and <i>β</i> run from byte <i>j</i> to byte <i>k,</i> i.e., <i>Bj…Bk</i> (where <i>B1 ≤ Bj ≤ Bk ≤ Bn)</i>, and where mod is the modulo operator with modulus <i>M = 2¹⁶</i>.</p><p id="08ff">These 16-bit values are then combined into the final, 32-bit checksum via Equation (3).</p><figure id="ccd2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*CYF6FQox_Mg-htTiWDcqiw.png"><figcaption></figcaption></figure><p id="3b9a">As previously mentioned, recursive relations are employed [see Equation (3)], which greatly reduce the computation time, and help enable the checksums to be calculated in a “rolling” manner. After the first block of data, these relations are used until no more blocks are left.</p><figure id="0472"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*LGhNfr2x7CJNnRyK2fg6Kw.png"><figcaption></figcaption></figure><p id="4038"><b>The 128-bit checksum: </b>The second checksum calculated for each block is a more complex, 128-bit checksum, <a href="https://link.springer.com/chapter/10.1007/3-540-38424-3_22">message-digest 4 (MD4)</a> algorithm. This algorithm was developed by Ronald Rivest, and was primarily used as cryptographic hash function, originally used for security purposes, but due to vulnerabilities, has been replaced with the <a href="https://en.wikipedia.org/wiki/MD5">MD5 algorithm</a>. Computing this checksum is slightly more complicated, but entirely comprehensible.</p><p id="13e1">In the context of <code>rsync</code>, the MD4 algorithm proceeds by padding the data with a single “1” and filling the remaining positions with zeros, such that it reaches 448 bits in length (i.e., 512 – 64 = 448 bits). For cases in which the data are 448 bits long, 512 bits are appended to the data. Next, two 32-bit words are appended to the data, which are representations of the original data before any padding took place, bringing the total length to 512 bits.</p><p id="f90e">To begin the cryptographic process, four 32-bit words (<i>w1</i>, <i>w2</i>, <i>w3</i>, and <i>w4</i>) are initialized with the values specified in the documentation (see the <a href="https://dspace.mit.edu/bitstream/handle/1721.1/149165/MIT-LCS-TM-434.pdf?sequence=1">original paper</a>). Three functions are defined, which map 32-bit input values to an identically-sized output value. Sixteen such words are processed at a time, and the output is used to modify the original four 32-bit words (i.e., by simple addition). The final values of the four, 32-bit words are then combined, in reverse order (<i>w4w3w2w1</i>) to represent the final message digest, or checksum.</p><p id="2782">For more info on this important algorithm, see this <a href="https://nickthecrypt.medium.com/cryptography-hash-method-md4-message-digest-4-explained-with-python-f201b74f51d">fantastic article</a> by Nick from The Crypt.</p><p id="cfb5">These checksums are stored in a hash table (which is basically just a glorified dictionary) for later use. With the requisite checksums computed for each block, we are now ready for the third step.</p><h2 id="baac">Step 3: Receiving computer transmits checksums to sending computer</h2><p id="94d7">There isn’t much to comment on here, aside from the fact that the data are sent sequentially. This ensures that, by virtue of the order of the transmitted blocks, the sending machine can easily keep the order of the data for later use.</p><h2 id="de72">Step 4: Scan through checksums to see what data match</h2><p id="634d">This important step serves to prevent data presently on the receiving machine from being transmitted by the sending machine. Ultimately, the previously calculated checksums are compared with the checksums of the data on the sending machine. This is not the full picture, however, as this portion of the algorithm processes suspected matched data in three levels to determine where exactly the data are identical. The following levels are an abbreviated (and hopefully more comprehensible) discussion of those specified in the <code>rsync</code> documentation in the <a href="https://rsync.samba.org/tech_report/node4.html">Checksum searching</a> section:</p><p id="8ffa"><b>Level 1:</b> To begin, the receiving-side rolling checksums are passed through a hash function, thereby producing a 16-bit value for the entire list of checksums. These are stored in a hash table, for compactness and ease of access. For each “offset” <i>j</i> (which is just a fancy way of saying some byte <i>Bj</i>, where <i>1 ≤ j ≤ n</i>) in the file, the rolling checksum and hash are computed for the sending-side data. After this calculation the sending- and receiving-side hash values are compared; for non-null values, the algorithm moves to the next offset, <i>j + 1. </i>For null values the search is elevated to level 2.</p><p id="f593"><b>Level 2: </b>With a difference detected by the presence of a null value in the level one check, the list of receiving-side rolling checksum values are scanned and compared with the sending-side checksum for the current block. When a 32-bit checksum is found to match, the search proceeds to level 3, otherwise, the scan continues until a non-matching hash value is detected.</p><p id="6081"><b>Level 3: </b>To precisely determine the location of data that is common to both the sending- and receiving-side information, the 128-bit, MD4 checksum is calculated for the sending-side data and referenced to the 128-bit checksum value of the receiving-side data. If a match is found, the offset is tabulated for later use in step 5, after all blocks have been checked.</p><h2 id="a632">Step 5: Send instructions to recreate the data on the receiving computer</h2><p id="ba7f">After a match at level 3, the block offset number of the corresponding<b> receiving-side block</b> is then transmitted from the sending to the receiving machine. During the entire <code>rsync</code> process, these searches for matching data and the actual sending of the data is performed continuously. This search proceeds through all of the <i>N</i> blocks, with levels 1, 2, and 3 repeating if the specific criteria are met. If no matches are found, the raw data are transferred and inserted into file at the appropriate offset.</p><p id="d0f7">When <code>rsync</code> updates a file, the instructions sent during step 5 are used to populate a temporary file which, after the update is complete, is renamed to the appropriate file name.</p><p id="cbf0">As noted earlier, if you don’t want to rely on an examination of the file meta data and would prefer every file to be closely inspected by comparing checksums, one would use the <code>--checksum</code> option.</p><p id="b4c5">For an extremely in-depth look at the <code>rsync</code> algorithm, including its implementation in C, check out Norman Tuft’s <a href="https://www.cs.tufts.edu/~nr/rsync.html">page on the rsync algorithm</a> and the webpage <a href="https://rsync.samba.org/how-rsync-works.html">How Rsync Works: A Practical Overview</a>, as well as the <a href="https://rsync.samba.org/tech_report/tech_report.html">documentation</a> from the developers of <code>rsync</code>. Let’s now turn to the question at hand: why use <code>rsync</code>?</p><h1 id="c83d">How to use rsync</h1><p id="3b49"><code>rsync</code> can be used just like <code>cp</code>, where many of the same flags transfer over. The most basic use of <code>rsync</code> is shown below:</p><div id="ee9c"><pre><span class="hljs-meta prompt_">$ </span><span class="language-bash">rsync -azh --progress account@hostname:/source/dir/ destination/dir</span></pre></div><p id="8f44">Here, the <code>a</code> flag is for archive mode, which is a combination of many flags (specifically <code>-rlptgoD</code>), and specifies that all attributes of the data should be preserved (e.g., ownership, file creation time, permissions, etc.), and that files should be copied recursively. The <code>z</code> flag is for compression, which uses <code>zlib</code> as the compression utility. <b>Remember to use the compression flag responsibly! If you have enough bandwidth and your connection speed is reasonably fast, compression is usually
Options
unnecessary. Additionally, for local-to-local transfers, compression is useless for two reasons: any bottleneck is at the hardware level, and compression/decompression wastes CPU resources which could be spent on transferring/reconstructing data. </b>The <code>h</code> flag is the standard “human-readable” option which will display the post-transfer statistics in human-readable terms (e.g., kB, MB, etc.). Finally, I include the <code>--progress</code> option to display the current transfer statistics.</p><p id="2db0">For more simple data moving purposes, such as a local-to-local transfer, we can use the exact same command (sans the compression option):</p><div id="104d"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -a path/to/source/dir/* path/to/destination/dir</span></pre></div><p id="772b">Here, one can use either the relative (displayed above) or absolute paths.</p><h1 id="1f47">The rsync cookbook</h1><h2 id="6623">Local transfers</h2><p id="aff1"><b><i>Locally transferring a single file</i>: </b>By default, we always want to preserve the file attributes such as ownership, creation time, etc., so we use (<code>-a</code>):</p><div id="93e5"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -a path/to/src/dir/filename.file path/to/dest/dir</span></pre></div><p id="2085"><b><i>Locally transferring multiple files/subdirectories</i>:</b> To see what’s going on, we select verbose output (<code>-v</code>) with human-readable (<code>-h</code>) progress (<code>--progress</code>):</p><div id="9235"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -avh --progress path/to/src/dir/* path/to/dest/dir</span></pre></div><p id="bf62">To transfer files of a certain type (<code>*.txt</code> in this example), add an include rule:</p><div id="0eeb"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --include=<span class="hljs-string">".txt"</span> path/to/src/dir/ path/to/dest/dir</span></pre></div><p id="acaa">The opposite operation is possible using the exclude option, e.g.,</p><div id="d988"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --exclude=<span class="hljs-string">"*.json"</span> path/to/src/dir/* path/to/dest/dir</span></pre></div><p id="edf1"><b>For multiple file types:</b></p><div id="fe8e"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --exclude={<span class="hljs-string">".json"</span>, <span class="hljs-string">".xlsx"</span>}
path/to/src/dir/*
path/to/dest/dir</span></pre></div><p id="aa0c">One can also use include and exclude for directories:</p><div id="0076"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --exclude=<span class="hljs-string">"/path/to/src/dir/dir0"</span>\
path/to/src/dir/* \
path/to/dest/dir</span></pre></div><p id="5dbc">Finally, one can also specify multiple directories:</p><div id="3a34"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -ah --progress --exclude=<span class="hljs-string">"/path/to/src/dir/dir0"</span>
--exclude=<span class="hljs-string">"/path/to/src/dir/dir1"</span>
path/to/src/dir/*
path/to/dest/dir</span></pre></div><h2 id="3f7f">Local-to-Remote Transfers</h2><p id="4a23">Transferring <b>multiple files </b>to a remote server with preservation of file attributes (<code>-a</code>), compression (<code>-z</code>), and human-readable (<code>-h</code>) progress (<code>--progress</code>):</p><div id="9d93"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -azh --progress path/to/src/dir/* account@host:/path/to/dest/dir</span></pre></div><h2 id="b3ce">Remote-to-Local Transfers</h2><p id="ca36">Transferring <b>multiple files</b> from a remote server with preservation of file attributes (<code>-a</code>), compression (<code>-z</code>), and human-readable (<code>-h</code>) progress (<code>--progress</code>):</p><div id="4297"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">rsync -azh --progress account@host:/path/to/dest/dir/* path/to/src/dir/</span> </pre></div><h2 id="c7b4">Filter Rules</h2><p id="3004">To keep things more organized, we can collate our filter rules into a configuration file. The following example is a filter list for making a transfer for a few directories in single user’s account:</p><div id="dd8d"><pre>+ /Users/
- /Users/username/
- /Users/username/Desktop/
- /Users/username/Desktop/** # include all files in ~/Desktop
- /Users/username/Documents/**
- /Users/username/Documents/**/* # include all files and subdirectories in ~/Documents
- /Users/username/Photos/**
-
/Users/username/Downloads
-
/</pre></div><p id="48db">This information would be placed in some sort of plain text file. To distinguish my rsync filter files, I usually name them something like <code>backup_filters.conf</code>. Using filter files can be useful for many situations, especially when making backups with <code>cron</code>. <b>Note that each subdirectory must be specified if you wish to transfer the data in this subdirectory, including the trailing <code>- /</code>, as all data under <code>/</code> will be transferred! </b>While this is an extremely verbose method, is very accurate (e.g., you won’t unnecessarily transfer data). An example of a full command would look like:</p><div id="f778"><pre><span class="hljs-meta prompt_">$ </span><span class="language-bash">rsync -ah --progress --include={<span class="hljs-string">".tex"</span>, <span class="hljs-string">".pdf"</span>}
--filter=<span class="hljs-string">"merge /path/to/rsync_filters.conf"</span> / /path/to/dest/dir</span></pre></div><p id="8f0a">Because using filter rules can be tricky, I’d recommend using the <code>--dry-run</code> option, which will prepare the file list so you can check for any issues before proceeding with the transfer.</p><p id="9fb1">For more information on the use of filters, see Will Haley’s article <a href="https://www.willhaley.com/blog/rsync-filters/">Rsync Filters</a>.</p><h1 id="2382">Final Note</h1><p id="e839">The next time you need to perform a serious data transfer or perform manual backups, consider <code>rsync</code>, a humble, rather unknown tool, that greatly expedites this process through its ingenious process of transferring only the data you need to transfer. This utility has saved me an immeasurable amount of time, and I hope that it can help you along your computing adventures.</p><h1 id="cfea">Appendix: Transfer speed investigation</h1><p id="e075">Because rsync intelligently determines which files have been altered, only the modified parts can be transferred. The time this algorithm saves is therefore inestimable. For example, I needed to make a backup of all of my LaTeX source files from my machine to an external SSD (USB 3.0). This directory contains several nested subdirectories and is, in total, 7.3 GB of data. Using the built-in <code>time</code> utility, I made a backup using <code>cp</code> and <code>rsync</code>. <b>The analysis that follows is simply to illustrate the timing characteristics of both utilities, and is by no means a statistically sound investigation.</b> The output of the <code>time</code> commands for both processes are provided below:</p><figure id="8994"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*q-OlPGJ2JCRQQU-v6Zr1qQ.png"><figcaption>Output of the time command for the cp command.</figcaption></figure><figure id="df2f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*K_syJLxNX0afxA5ipP3pdg.png"><figcaption>Output of the time command for the cp command.</figcaption></figure><p id="8e13">As mentioned before, every job needs the right tool, and here we can see that simply copying files to a new location is more suited to the <code>cp</code> command. Note that I used the <code>-a</code> flag in the <code>rsync</code> command which preserves all of the file attributes (creation time, ownership, etc.) and recursively copies subdirectories, whereas with the <code>cp</code> command I simply copied the files without preserving attributes. The <code>-a</code> flag is almost always necessary with most <code>rsync</code> applications as the algorithm determines if files need to be transferred based on their creation time¹. The skeptical reader will naturally wonder whether the additional effort of copying the file metadata makes a difference in transfer speeds. To determine if this was the case, I repeated the transfer using the <code>rsync -r</code> command, using the same source data. The output of this command is shown below:</p><figure id="67be"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*QY2ZgNVKlOdaCkuJN4JfWg.png"><figcaption></figcaption></figure><p id="4c3c">Surprisingly, it took roughly 3 seconds longer to perform the same transfer <i>without</i> including the metadata. To be sure this wasn’t a statistical anomaly, I timed the same command two more times (with new directories, of course):</p><figure id="31e2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*d9KNogiMJ8-rIMfSTV6NDQ.png"><figcaption></figcaption></figure><p id="d5e9">The average and standard deviation of these times are 71.35 s ± 0.64 s, 28.93 ± 0.23 s, and 40.16 ± 0.74 s for the real, user, and system times, respectively. While this is certainly not a statistically significant approach, we can conclude that it <i>may</i> be faster to transfer the file metadata. Ultimately, it took longer than the <code>cp</code> command, which highlights one of the weaknesses of <code>rsync</code>.</p><p id="b367">However, when updating the source directory with new files, <code>rsync</code> is clearly the superior tool because it only transfers the new data in modified and newly created files:</p><figure id="dfc1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*S4nmat3zh33b9KdPSJPEbg.png"><figcaption></figcaption></figure><p id="2a0b">For reference, this <code>rsync</code> command transferred 151 MB of new data in the <code>LaTeX</code>. Compared to the original <code>cp</code> command, which took roughly 50 s to execute, the <code>rsync</code> command took only 1.75 s to find and transfer the new data. Making backups with <code>rsync</code> is a serious time-saver.</p></article></body>