Stefano Bosisio


Machine Learning and Rust (Part 3): Smartcore, Dataframe, and Linear Regression

In this tutorial: can we have pandas in Rust? What is smartcore?

Image by Cloris Ying on Unsplash

Welcome back to the third tutorial on Rust and its applications for ML! Today we’ll learn about Polars, a fantastic Rust package for dealing with dataframes and series, and about smartcore, which will become one of our best ML friends in Rust :)

What will you learn by the end of this tutorial?

  • How to deal with dataframes and series, and their operations, in Rust
  • How to implement linear regression with smartcore
  • How to mix smartcore and Polars in the same code

Dataframes in Rust

One of the most used data science objects is the dataframe. If you are a data scientist or a Pythonista, surely you have had to play with Pandas and dataframes. As the name suggests, these computational objects are data “inserted” into a tabular frame, which allows easy data-visualization and data-management operations. Rust has its own dataframe packages; one of them is Polars.

Polars is a fully parallel data processor, based on Apache Arrow, written by Ritchie Vink. This package has recorded speedy performance against popular dataframe packages such as data.table in R and Spark. The goal of Polars is to deal with data too big for Pandas and too small for Spark. Polars exists in two APIs: eager, where operations are executed immediately, as in Pandas, and lazy, which is optimized for queries and data joins. It is beyond the scope of this tutorial to deeply investigate Polars, but I will surely write something about it very soon, with nice benchmarks on complex data operations.

Let’s see what machine learning people can do with this package, so cargo new polars_learning! The datasets can be found here: https://github.com/Steboss/ML_and_Rust/tree/master/tutorial_2/datasets and the code here: https://github.com/Steboss/ML_and_Rust/tree/master/tutorial_2/polars_learning

The first thing is: what to do with Cargo.toml? Let’s add the first required dependency:

At the moment the most recent version of polars is 0.14.2
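A minimal Cargo.toml for this project might look like the following sketch (pinning the 0.14.2 release mentioned above; the package metadata is my assumption from the cargo new defaults):

```toml
[package]
name = "polars_learning"
version = "0.1.0"
edition = "2018"

[dependencies]
polars = "0.14.2"
```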

How can we read a csv file and get it back as a dataframe?

Ok, there’s a lot of info here. First let’s take a look at the imports:

  • Everything we need to deal with polars lives (more or less) in prelude, so we can either import everything from there with use polars::prelude::*; or import just what we need (e.g. CsvReader, DataType, DataFrame, …)
  • Then, use std::fs::File and use std::path::Path are used to read a given file, like iris.csv
  • Finally, use polars::prelude::SerReader contains the traits we need to allow CsvReader to work with the ::new method. Remember, we’re returning a dataframe from CsvReader, so there is no trailing ;

To read a csv file we can exploit CsvReader:

  • define the input file with let file = File::open(path).expect("Cannot open file.");
  • check whether headers are present with .has_header
  • and collect all the content with .finish(). The final output is a polars Result type wrapping a dataframe

Finally, Some is similar to Haskell’s Just (with None playing the role of Nothing): they are the variants of Rust’s Option enum, and here Some lets us read just the first 5 lines.
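Putting those pieces together, a reader helper might look like this sketch (assuming the polars 0.14 API; the function name read_csv follows the repository layout, and iris.csv is the dataset mentioned above):

```rust
use polars::prelude::*;
use std::fs::File;

/// Read a csv file into a Polars DataFrame (sketch, polars 0.14 API).
fn read_csv(path: &str) -> Result<DataFrame> {
    let file = File::open(path).expect("Cannot open file.");
    CsvReader::new(file)
        .infer_schema(None) // infer the column dtypes from the data
        .has_header(true)   // the first row contains the headers
        .finish()           // collect everything into a DataFrame; no trailing ;
}

fn main() {
    let df = read_csv("iris.csv").expect("Cannot read the dataframe");
    // head takes an Option, hence Some(5) to print the first 5 rows
    println!("{:#?}", df.head(Some(5)));
}
```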

After reading a dataframe:

we’d like to know a bit more about its size and shape.

Nothing too complicated here, the steps are similar to Python’s Pandas, as in this function:

We are returning nothing from this function, hence -> (). Once the dataframe is created we can get df.shape() and print it out with {:#?} for correct formatting. Furthermore, we can check the schema, as well as the dtype of each column, the width (the number of columns) and the height (the number of rows).
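As a sketch, such a function might look like this (the name deal_with_shape is my guess at the repository’s helper; the accessors are from the polars 0.14 DataFrame API):

```rust
use polars::prelude::*;

/// Print basic size and shape information about a dataframe (sketch).
fn deal_with_shape(df: &DataFrame) -> () {
    // shape returns (height, width), i.e. (rows, columns)
    println!("Shape: {:#?}", df.shape());
    // schema: column names together with their dtypes
    println!("Schema: {:#?}", df.schema());
    // dtypes only, one per column
    println!("Dtypes: {:#?}", df.dtypes());
    // width = number of columns, height = number of rows
    println!("Columns: {}, Rows: {}", df.width(), df.height());
}
```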

We can inspect columns and check how they look like:

With get_columns() we can read all the columns of the dataframe, while get_column_names() retrieves all the headers. Furthermore, we can iterate through the column values and, as in Python, print each column name along with its values.
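A sketch of that inspection, under the same polars 0.14 assumptions (the function name deal_with_columns is mine):

```rust
use polars::prelude::*;

/// Inspect the columns of a dataframe (sketch).
fn deal_with_columns(df: &DataFrame) {
    // all the columns as a slice of Series
    let columns = df.get_columns();
    // just the header names
    let names = df.get_column_names();
    // iterate, Python-style, over (name, values) pairs
    for (name, series) in names.iter().zip(columns.iter()) {
        println!("{}: {:?}", name, series);
    }
}
```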

Something more sexy now:

stacking dataframes together.

It often happens that we need to stack different dataframes, especially when we’re reading huge files (will this be a problem for Polars as well? We’ll get back to this very soon):

So what do we have?

  • vertical stacking with vstack, where we can concatenate two dataframes together (in Pandas this is pd.concat([df1, df2]))
  • We can extract a &Series from a column with df3.column("sepal.length").unwrap()
  • More importantly, we can extract a series to perform some operations on it, obtaining an owned Series as:

let sepal_length = df3.drop_in_place("sepal.length").unwrap();

  • Then, we can do some operations on sepal_length and re-add this back to the df as:

let _df4 = df3.insert_at_idx(0, sepal_length).unwrap();

(The underscore before df4 is there because in Rust, as a best practice, unused variables should be prefixed with _.)
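Putting the stacking steps together, a sketch might look like this (assuming polars 0.14; the function name stack_dataframes is mine):

```rust
use polars::prelude::*;

/// Stack two dataframes and play with a single column (sketch).
fn stack_dataframes(df1: &DataFrame, df2: &DataFrame) -> Result<()> {
    // vertical concatenation, like pd.concat([df1, df2]) in Pandas
    let mut df3 = df1.vstack(df2)?;
    // borrow a column as a &Series
    let sepal_col = df3.column("sepal.length")?;
    println!("{:?}", sepal_col);
    // remove the column from the dataframe, taking ownership of the Series
    let sepal_length = df3.drop_in_place("sepal.length")?;
    // ... do some operations on sepal_length here ...
    // then put it back as the first column
    let _df4 = df3.insert_at_idx(0, sepal_length)?;
    println!("{:#?}", df3.shape());
    Ok(())
}
```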

Now that you know how to deal with series,

let’s see how to actually perform operations on series.

For example, we want to log-transform a dataframe column:

  • Firstly, we can apply a closure directly to a column with apply_at_idx (line 22). This is a dataframe method: we give it the index of the column we want to modify and the operation as a closure, e.g. |s| s + 1 adds 1 to the column
  • Otherwise, we can act directly on the series. In this case we can use a mutable dataframe and a function numb_to_log: a) on line 4 we transform a series into a chunked array, namely a typed array, which allows us to apply closures to the data and collect results of type T. b) The chain drop_in_place.unwrap().rename() extracts the sepal.length series and renames it to log10.sepal.length. Then, .f64().unwrap() specifies the type of that series, and unwrap turns the series into a ChunkedArray. c) Finally, we can transform this array into a log10-array: cast::<Float64Type>() converts to the correct numerical type and returns a Result<>, so we have to unwrap to get an f64 array, and then apply(|s| s.log10()) log-transforms the numbers (log10 is a method on the f64 number type in Rust). d) Importantly, we can convert a ChunkedArray back into a series with into_series() and return it to the main function, adding the new series as a column with df.with_column()
  • The above was a good example of dealing directly with series and chunked arrays, but can we do the same directly on the dataframe? Surely we can (line 31): firstly, apply_at_idx tells the dataframe we want to apply an operation on a column; then we can map this column with |s|, transform it into a chunked array with s.f64().unwrap(), and finally apply the operation with apply(|t| t.log10()), where |t| refers to the column elements.
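Both routes can be sketched as follows (assuming polars 0.14; numb_to_log is the function named above, log_at_idx is my own name for the dataframe-level variant):

```rust
use polars::prelude::*;

/// Log10-transform the sepal.length column via a ChunkedArray (sketch).
fn numb_to_log(in_df: &mut DataFrame) -> Result<()> {
    // pull the column out, rename it, and view it as an f64 ChunkedArray
    let to_log10_column = in_df
        .drop_in_place("sepal.length")?
        .rename("log10.sepal.length")
        .f64()?                   // ChunkedArray<Float64Type>
        .cast::<Float64Type>()?   // make sure we really have f64
        .apply(|s| s.log10())     // element-wise log10
        .into_series();           // back to a Series
    // add the transformed series back as a new column
    in_df.with_column(to_log10_column)?;
    Ok(())
}

/// Same transformation, acting directly on the dataframe.
fn log_at_idx(df: &mut DataFrame) {
    df.apply_at_idx(0, |s| s.f64().unwrap().apply(|t| t.log10()))
        .expect("apply_at_idx failed");
}
```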

The very last thing I want to show you about polars relates to the Cargo.toml file, namely the concept of features. In Rust, a package can have additional features, let’s say extra capabilities, that can be enabled when needed. For example, polars has the capability of converting a numerical dataframe into an array with to_ndarray:

Features do not come into action unless explicitly added to Cargo.toml. This ensures the final compiled Rust package does not have a prohibitive size, and it helps to reduce the compilation time. We can see that polars has tons of additional features that can be enabled: https://github.com/pola-rs/polars/blob/master/polars/Cargo.toml

To add to_ndarray, which is contained in polars-core — the core for all the dataframe operations — we need to add this explicitly to the Cargo file:

Only in this way will we be able to use to_ndarray on a dataframe.
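A sketch of the relevant Cargo.toml section (the feature name ndarray is my assumption based on the polars Cargo.toml linked above):

```toml
[dependencies.polars]
version = "0.14.2"
features = ["ndarray"]
```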

Smartcore + Polars: Implement a Linear regression with dataframes!

Smartcore is a fairly new Rust package with lots of applications for machine learning and a very active community. The machine learning algorithms in smartcore range from classification to clustering to metrics for model evaluation (several more than rusty machine offers).

Furthermore, smartcore integrates nicely with different Rust algebra libraries, such as ndarray or nalgebra, which gives the package a lot more flexibility in handling different data types. Discussing with Lorenzo, an active contributor to smartcore, there is still a gap in implementing more general data parsing, so we’ll soon see quicker ways to integrate polars dataframes into smartcore; for now, we use this occasion to learn more about playing with data in Rust!

Now, let’s get our hands dirty with Smartcore: cargo new smartcore_linear_regression and codes can be found here: https://github.com/Steboss/ML_and_Rust/tree/master/tutorial_2/smartcore_linear_regression

First thing, let’s prepare the Cargo.toml with the dependencies and features needed:
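A sketch of the dependency section (the smartcore version 0.2.0 is my assumption for the release current at the time of writing; polars needs the ndarray feature for the conversion step below):

```toml
[dependencies]
polars = { version = "0.14.2", features = ["ndarray"] }
smartcore = "0.2.0"
```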

Second, let’s jump into main.rs and think of 4 steps for implementing linear regression:

  • read the input boston_dataset.csv and parse it into a Polars dataframe
  • extract the relevant training features and the target
  • convert features and target to smartcore’s DenseMatrix format
  • run LinearRegression :)

Among all the usual imports, we have to import the smartcore functions, namely LinearRegression, DenseMatrix, BaseMatrix, train_test_split and mean_squared_error. As we’ll see, smartcore requires as input a matrix type: DenseMatrix, which implements the BaseMatrix trait.

The first two steps are a good occasion to learn more new things from Rust:

  • in read_csv I have left the method with_delimiter, which may be useful. It’s not needed for this dataset; note, however, that with_delimiter wants bytes as input. For example, with a hash as the column separator: with_delimiter(b'#'). Notice we’re using ' and not quotation marks ", which would have raised an error
  • in feature_and_target, for the first time we’re returning 2 variables at the same time: simply enclose the variables and their types in parentheses ()
  • Here you can see how much handier polars is than a custom csv reader, since we can select exactly the columns we want. To select multiple columns we need to pass a Rust vec, thus vec![col1, col2, ...]
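These first two steps can be sketched as follows (assuming polars 0.14; the exact set of Boston columns selected here is my assumption, using the standard names of that dataset):

```rust
use polars::prelude::*;
use std::fs::File;

/// Read the Boston housing csv (sketch; with_delimiter left for reference).
fn read_csv(path: &str) -> Result<DataFrame> {
    let file = File::open(path).expect("Cannot open file.");
    CsvReader::new(file)
        .infer_schema(None)
        .has_header(true)
        // .with_delimiter(b'#') // only needed for non-comma separators
        .finish()
}

/// Return two values at once: the feature columns and the target column.
fn feature_and_target(in_df: &DataFrame) -> (Result<DataFrame>, Result<DataFrame>) {
    // select wants a Rust vec of column names
    let features = in_df.select(vec![
        "crim", "zn", "indus", "chas", "nox", "rm", "age",
        "dis", "rad", "tax", "ptratio", "black", "lstat",
    ]);
    let target = in_df.select(vec!["medv"]);
    (features, target)
}
```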

Step 3: convert our the feature dataframe and target column to a desired smartcore format DenseMatrix

  • As you can see, on line 7 we’re using ndarray to convert the dataframe to an array. From there we can initialise a zero matrix xmatrix.
  • This matrix is of type DenseMatrix with f64 numbers. To create the zero matrix, we can simply use BaseMatrix::zeros.
  • Then, paying a bit of attention to Rust types, we initialize two counters, one for the rows (row) and one for the columns (col), and we iterate through the array values.
  • The iteration proceeds as if features_res were a 1D array. At each step we set a value into xmatrix by DEREFERENCING THE BORROWED value with *val (so you won’t get the error: expected f64, found &f64). Finally, we can return the DenseMatrix wrapped in Ok

A similar approach is used to convert the target array to a vector (notice how inserting a value into a vector in Rust, with push, looks similar to C++’s push_back).

Finally, it is important to highlight the mut on all these variables, as we have initialized them to zero and then populated them. In the GitHub code I have also left, commented out, the procedure to do the same thing in main without using functions.
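Step 3 can be sketched like this (assuming polars 0.14 with the ndarray feature and the smartcore 0.2 BaseMatrix API; the function name convert_features_to_matrix is mine):

```rust
use polars::prelude::*;
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::linalg::BaseMatrix;

/// Convert a Polars dataframe into a smartcore DenseMatrix (sketch).
fn convert_features_to_matrix(in_df: &DataFrame) -> Result<DenseMatrix<f64>> {
    let nrows = in_df.height();
    let ncols = in_df.width();
    // dataframe -> 2D ndarray of f64 (requires the "ndarray" feature)
    let features_res = in_df.to_ndarray::<Float64Type>()?;
    // zero-initialised matrix that we will populate below
    let mut xmatrix: DenseMatrix<f64> = BaseMatrix::zeros(nrows, ncols);
    // iterate as if features_res were a 1D array, tracking row/col by hand
    let mut col: usize = 0;
    let mut row: usize = 0;
    for val in features_res.iter() {
        // dereference the borrowed value, otherwise: expected f64, found &f64
        xmatrix.set(row, col, *val);
        // advance the column; wrap to the next row at the end of each line
        col += 1;
        if col == ncols {
            col = 0;
            row += 1;
        }
    }
    Ok(xmatrix)
}
```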

And finally, linear regression and fitting in smartcore! This is super easy and it reminds me of the sklearn approach, making smartcore a fantastic library to work with. And we do not have to worry about train_test_split as we did with rusty_machine:
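A sketch of the fit-and-score step (assuming the smartcore 0.2 API; the function name fit_and_score and the 30% test split are my choices):

```rust
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::linear::linear_regression::LinearRegression;
use smartcore::metrics::mean_squared_error;
use smartcore::model_selection::train_test_split;

/// Fit and evaluate a linear regression (sketch, smartcore 0.2 API).
fn fit_and_score(x: &DenseMatrix<f64>, y: &Vec<f64>) {
    // hold out 30% of the data for testing, shuffling the rows
    let (x_train, x_test, y_train, y_test) = train_test_split(x, y, 0.3, true);
    // fit, very much in the sklearn spirit
    let model = LinearRegression::fit(&x_train, &y_train, Default::default())
        .unwrap();
    // predict on the held-out set and score with the mean squared error
    let preds = model.predict(&x_test).unwrap();
    println!("MSE: {:?}", mean_squared_error(&y_test, &preds));
}
```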

Simple as that!

READY, STEADY, GO!

Now we have everything set up to run our code. As always, you can run cargo run in your Rust folder. If everything has gone well you should see a Cargo.lock file and a target folder with our compiled code. Furthermore, cargo run will run our main.rs.

If you’re happy enough, you can build the entire package with cargo build --release, which will run further optimizations on our code, and voilà!

🎊🎊🎊

That’s all for the moment! Definitely a remarkable step forward in learning Rust today! Stay tuned for the next tutorial!

Please, feel free to send me an email for questions or comments at: [email protected]

Alternatively, you can contact me on Instagram: https://www.instagram.com/a_pic_of_science/

Rust
Software Engineering
Machine Learning
Learning To Code
Coding