Free AI web copilot to create summaries, insights and extended knowledge, download it at here
2949
Abstract
me"</span>).unwrap();</pre></div><p id="d679">Then we could start writing to it line by line,</p><div id="65da"><pre>let mut logistic <span class="hljs-operator">=</span> Logistic(<span class="hljs-number">0.35</span>)<span class="hljs-comment">;</span></pre></div><div id="70d3"><pre><span class="hljs-meta prompt_">...</span></pre></div><div id="b25c"><pre>writeln!(<span class="hljs-name">file</span>, <span class="hljs-string">"{}"</span>, logistic.next().unwrap())<span class="hljs-comment">;</span></pre></div><p id="d98a">This approach works for small amounts of data but in our case, it would probably take much too long. Writing to a file line by line is a very inefficient I/O operation.</p><p id="d28a">For this reason, we first buffer as much as possible to write operations in memory and when this buffer overflows, it is getting written to the file in one I/O operation:</p>
<figure id="3844">
<div>
<div>
<iframe class="gist-iframe" src="/gist/rust-play/c73cb5ec1ea2647b3ac18dd5a78fdc1e.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="0d36">The idea is to wrap the <code>File</code> with a <code>BufWriter::with_capacity</code>. Here we can specify how many bytes shall be buffered before doing the actual writing to the file.</p><p id="9a28">Also important is to call <code>file.flush().unwrap()</code> when using such buffering. This ensures that all data are getting written from the buffer to the file before the buffer is getting dropped.</p><h1 id="687f">External Sorting</h1><p id="fe84">The core method of the algorithm is the following:</p>
<figure id="f62f">
<div>
<div>
<iframe class="gist-iframe" src="/gist/rust-play/3ee5599c0be60d89e19c6bf562b45f58.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="d5c5">Here we are reading portions that fit into memory from the generated large file. To open a file for obtaining a readable stream one just do,</p><div id="e99d"><pre>let <span class="hljs-keyword">file</span> = <span class="hljs-keyword">File</span>::<span class="hljs-meta">open</span>(<span class="hljs-string">'filename'</span>).unwrap();</pre></div><p id="dee6">But again, reading it line by line would result in very inefficient I/O operations. Therefore, we again do the buffering trick but this time the other way round:</p><div id="e634"><pre>let <span class="hljs-keyword">file</span> = BufReader::with_capacity(
10_000_000, <span class="hljs-keyword">File</span>::<span class="hljs-meta">open</span>(<span class="hljs-keyword">filename</span>).unwrap()
);</pre></div><p id="22c4">Next, we buffer-read the file line by line into
Options
a vector. This we do until a certain amount of memory is used <code>max_mem_use</code> and then sort this portion of the list by calling <code>sort_and_write_to_file</code>. For convenience, this method in addition is writing the sorted portion into a temporary file.</p><p id="7c5f">The sorting used here is just Rust’s standard sorting method on a <code>Vec</code>(at the time, a merge-sort like algorithm):</p><div id="1ddb"><pre>v.sort_by(|<span class="hljs-type">a</span>, b| <span class="hljs-type">a</span>.partial_cmp(b).unwrap());</pre></div><p id="ba13">We need to use <code>partial_cmp</code> since on <code>f64</code> typed data only the <code>PartialOrd</code> trait is implemented (not the <code>Ord</code> ).</p><p id="c8e3">Still interesting is the method <code>merge</code>. Here buffered readable streams are being put onto each temporary data file and then they are getting read line by line (buffered) and compared with each other. The lowest value is written (buffered) to a result file. We only read the next line from a given stream in case its value turned out to be the lowest:</p>
<figure id="96b4">
<div>
<div>
<iframe class="gist-iframe" src="/gist/rust-play/e63c223b20b04b0382aeb21fe19d0856.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="96df"><code>active_readers</code> is a vector holding <code>Line</code> -iterators of all buffered readable streams onto all the temporary data files (<code>tmp_file_names</code>). Then a vector named <code>value</code> is created that initially holds the first line from each temporary file. We repeatedly searched the lowest value in <code>values</code> and write the latter to the result file: <code>writeln!(file, “{}”, min_val);</code></p><p id="e310">In this case, we replace the corresponding index of <code>values</code> with the next line of the corresponding temporary file should it not have reached EOF.</p><p id="e0b0">This way, we ensure to produce an ascending sorted list in a ‘large’ result file.</p><p id="c0df">For convenience, let me paste the entire code I used to produce and sort an almost 20G file. In order to make this run on your system, you might need to adjust the global variables <code>BUFFER_CAPACITY</code> and <code>MAX_MEM_USE</code> to your system’s memory. However note, depending on these settings, this might produce a large number of temporary files.</p>
<figure id="d23c">
<div>
<div>
<iframe class="gist-iframe" src="/gist/rust-play/1cc116516329d2af01e90e6a73a1a0cf.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="8ff9">Thanks for reading!</p></article></body>