Free AI web copilot to create summaries, insights and extended knowledge, download it at here

2949

Abstract

me"</span>).unwrap();</pre></div><p id="d679">Then we could start writing to it line by line,</p><div id="65da"><pre>let mut logistic <span class="hljs-operator">=</span> Logistic(<span class="hljs-number">0.35</span>)<span class="hljs-comment">;</span></pre></div><div id="70d3"><pre><span class="hljs-meta prompt_">...</span></pre></div><div id="b25c"><pre>writeln!(<span class="hljs-name">file</span>, <span class="hljs-string">"{}"</span>, logistic.next().unwrap())<span class="hljs-comment">;</span></pre></div><p id="d98a">This approach works for small amounts of data but in our case, it would probably take much too long. Writing to a file line by line is a very inefficient I/O operation.</p><p id="d28a">For this reason, we first buffer as much as possible to write operations in memory and when this buffer overflows, it is getting written to the file in one I/O operation:</p> <figure id="3844"> <div> <div>

            <iframe class="gist-iframe" src="/gist/rust-play/c73cb5ec1ea2647b3ac18dd5a78fdc1e.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="0d36">The idea is to wrap the <code>File</code> with a <code>BufWriter::with_capacity</code>. Here we can specify how many bytes shall be buffered before doing the actual writing to the file.</p><p id="9a28">Also important is to call <code>file.flush().unwrap()</code> when using such buffering. This ensures that all data are getting written from the buffer to the file before the buffer is getting dropped.</p><h1 id="687f">External Sorting</h1><p id="fe84">The core method of the algorithm is the following:</p>
    <figure id="f62f">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/rust-play/3ee5599c0be60d89e19c6bf562b45f58.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="d5c5">Here we are reading portions that fit into memory from the generated large file. To open a file for obtaining a readable stream one just do,</p><div id="e99d"><pre>let <span class="hljs-keyword">file</span> = <span class="hljs-keyword">File</span>::<span class="hljs-meta">open</span>(<span class="hljs-string">'filename'</span>).unwrap();</pre></div><p id="dee6">But again, reading it line by line would result in very inefficient I/O operations. Therefore, we again do the buffering trick but this time the other way round:</p><div id="e634"><pre>let <span class="hljs-keyword">file</span> = BufReader::with_capacity(

10_000_000, <span class="hljs-keyword">File</span>::<span class="hljs-meta">open</span>(<span class="hljs-keyword">filename</span>).unwrap() );</pre></div><p id="22c4">Next, we buffer-read the file line by line into

Options

a vector. This we do until a certain amount of memory is used <code>max_mem_use</code> and then sort this portion of the list by calling <code>sort_and_write_to_file</code>. For convenience, this method in addition is writing the sorted portion into a temporary file.</p><p id="7c5f">The sorting used here is just Rust’s standard sorting method on a <code>Vec</code>(at the time, a merge-sort like algorithm):</p><div id="1ddb"><pre>v.sort_by(|<span class="hljs-type">a</span>, b| <span class="hljs-type">a</span>.partial_cmp(b).unwrap());</pre></div><p id="ba13">We need to use <code>partial_cmp</code> since on <code>f64</code> typed data only the <code>PartialOrd</code> trait is implemented (not the <code>Ord</code> ).</p><p id="c8e3">Still interesting is the method <code>merge</code>. Here buffered readable streams are being put onto each temporary data file and then they are getting read line by line (buffered) and compared with each other. The lowest value is written (buffered) to a result file. We only read the next line from a given stream in case its value turned out to be the lowest:</p> <figure id="96b4"> <div> <div>

            <iframe class="gist-iframe" src="/gist/rust-play/e63c223b20b04b0382aeb21fe19d0856.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="96df"><code>active_readers</code> is a vector holding <code>Line</code> -iterators of all buffered readable streams onto all the temporary data files (<code>tmp_file_names</code>). Then a vector named <code>value</code> is created that initially holds the first line from each temporary file. We repeatedly searched the lowest value in <code>values</code> and write the latter to the result file: <code>writeln!(file, “{}”, min_val);</code></p><p id="e310">In this case, we replace the corresponding index of <code>values</code> with the next line of the corresponding temporary file should it not have reached EOF.</p><p id="e0b0">This way, we ensure to produce an ascending sorted list in a ‘large’ result file.</p><p id="c0df">For convenience, let me paste the entire code I used to produce and sort an almost 20G file. In order to make this run on your system, you might need to adjust the global variables <code>BUFFER_CAPACITY</code> and <code>MAX_MEM_USE</code> to your system’s memory. However note, depending on these settings, this might produce a large number of temporary files.</p>
    <figure id="d23c">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/rust-play/1cc116516329d2af01e90e6a73a1a0cf.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="8ff9">Thanks for reading!</p></article></body>

How to Sort a 20G File in Rust

And know how to deal with file handling

The problem of how to sort a file that does not match into memory already has a standard solution called ‘external sort’.

This story is about giving some details on how one could implement such an external sorting algorithm with Rust.

Although there already exists a bunch of ready-to-use crates for this problem, implementation-wise it is very interesting to look at since one deals with many important concepts from standard file handling.

The Algorithm

Let us quickly recap how external sorting works. The implementation of a standard sorting algorithm requires the targeted list to be held in memory.

In case the available memory is not large enough, we first read the list in smaller portions that fit into memory. Each portion is sorted by using some sorting algorithm and subsequently written to a file dedicated for this portion. This leaves us with a number of files each containing a sorted list and each being able to fit in memory.

Finally, we put a readable stream onto each of these files and always pick the smallest element to be written into a ‘large’ outcome file. If you are not familiar with this merging strategy, you will easily understand the details from the code later.

Producing Test Data

In order to have a large file of about 20G in size, we write the first n iterations of the logistic regression within its chaotic region:

x_n+1 = r * x_n * (1 - x_n )

This is very suitable since it produces a large amount of non-repeating data.

The Iterator for this sequence is implemented like this:

In Rust, one can create a writable stream into a file by simply doing,

use std::fs::{remove_file, File};

let mut file = File::create("filename").unwrap();

Then we could start writing to it line by line,

let mut logistic = Logistic(0.35);

...

writeln!(file, "{}", logistic.next().unwrap());

This approach works for small amounts of data but in our case, it would probably take much too long. Writing to a file line by line is a very inefficient I/O operation.

For this reason, we first buffer as much as possible to write operations in memory and when this buffer overflows, it is getting written to the file in one I/O operation:

The idea is to wrap the File with a BufWriter::with_capacity. Here we can specify how many bytes shall be buffered before doing the actual writing to the file.

Also important is to call file.flush().unwrap() when using such buffering. This ensures that all data are getting written from the buffer to the file before the buffer is getting dropped.

External Sorting

The core method of the algorithm is the following:

Here we are reading portions that fit into memory from the generated large file. To open a file for obtaining a readable stream one just do,

let file = File::open('filename').unwrap();

But again, reading it line by line would result in very inefficient I/O operations. Therefore, we again do the buffering trick but this time the other way round:

let file = BufReader::with_capacity(
  10_000_000, File::open(filename).unwrap()
);

Next, we buffer-read the file line by line into a vector. This we do until a certain amount of memory is used max_mem_use and then sort this portion of the list by calling sort_and_write_to_file. For convenience, this method in addition is writing the sorted portion into a temporary file.

The sorting used here is just Rust’s standard sorting method on a Vec(at the time, a merge-sort like algorithm):

v.sort_by(|a, b| a.partial_cmp(b).unwrap());

We need to use partial_cmp since on f64 typed data only the PartialOrd trait is implemented (not the Ord ).

Still interesting is the method merge. Here buffered readable streams are being put onto each temporary data file and then they are getting read line by line (buffered) and compared with each other. The lowest value is written (buffered) to a result file. We only read the next line from a given stream in case its value turned out to be the lowest:

active_readers is a vector holding Line -iterators of all buffered readable streams onto all the temporary data files (tmp_file_names). Then a vector named value is created that initially holds the first line from each temporary file. We repeatedly searched the lowest value in values and write the latter to the result file: writeln!(file, “{}”, min_val);

In this case, we replace the corresponding index of values with the next line of the corresponding temporary file should it not have reached EOF.

This way, we ensure to produce an ascending sorted list in a ‘large’ result file.

For convenience, let me paste the entire code I used to produce and sort an almost 20G file. In order to make this run on your system, you might need to adjust the global variables BUFFER_CAPACITY and MAX_MEM_USE to your system’s memory. However note, depending on these settings, this might produce a large number of temporary files.

Thanks for reading!