Polars: A Superior Choice to Pandas

Keegan Fernandes
3 min read · Jan 21, 2023
Polars

Polars is a Python DataFrame library written in Rust, which keeps its overhead low. It is built on Arrow2, a native Rust implementation of the Apache Arrow columnar format. Polars stores data in columns and processes it in parallel, which lets us work with large amounts of data and addresses a major limitation of pandas.

In this tutorial, I will show how to use polars from the perspective of a data scientist, covering the common use cases you will encounter. Much of the API looks similar to pandas, although some operations, such as filtering, are written differently, as seen in the examples below.

Installation

We’ll begin by installing polars and importing the necessary libraries.

!pip install polars  # install the polars library (notebook syntax)
import polars as pl  # polars, aliased as pl
import plotly.express as px  # Plotly Express for plotting
import numpy as np
from sklearn.metrics import mean_squared_error as MSE

Loading the Data

We’ll be loading a CSV file stored on my Google Drive.

df = pl.read_csv("/content/drive/My Drive/polars_project/insurance_data.csv")  # same call as in pandas
df
A display of the data

Preprocessing

We now need to process the data. We can start by counting the null values in each column.

# checking the null values in the columns
for x in df.columns:  # df.columns returns a list of column names
    print("{} : {}".format(x, df[x].null_count()))

We now need to view the rows that contain null values. This part differs significantly from pandas. At the same time, I find the code developer-friendly: the expression syntax increases readability and standardizes the code, which addresses a common problem with machine learning libraries.

# a few columns have missing values, so let's take a look at those rows
df.filter(pl.col("age").is_null() | pl.col("region").is_null())  # filter() keeps the rows where the expression is true

Next, we’ll remove every row that contains a null value. The code is simple and similar to pandas' dropna().

k = df.drop_nulls()

No data science project is complete without visualizing the data. We’ll use Plotly, a data visualization library, to plot a histogram of the age column.

fig = px.histogram(k.select("age").to_series())  # select() returns a DataFrame; to_series() turns it into a Series
fig.show()

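Instead of dropping the rows, we can also fill the null values with the value from the previous row (a forward fill).

df = df.fill_null(strategy="forward")  # fill_null returns a new DataFrame, so we assign the result back to df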

Model Building

Now that the data is clean, we’ll start building a model. We’ll use XGBoost to predict the claim column of the dataset and RMSE (root mean squared error) to evaluate the model. Here we do not pass the polars DataFrame to the model directly, so we need to convert the data into NumPy arrays before feeding it in. Fortunately, polars gives us a method, to_numpy(), that converts a DataFrame into a NumPy array.

# Finally, we'll use a model to make predictions on the data
import xgboost as xg
from sklearn.model_selection import train_test_split

xgb_r = xg.XGBRegressor(objective="reg:squarederror", n_estimators=15, seed=123)
X = df.select(["PatientID", "age", "bmi", "bloodpressure", "children"])
y = df.select("claim")
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=43)

# Fitting the model
# The splits are still polars DataFrames, so we convert them to NumPy arrays with to_numpy()
xgb_r.fit(train_X.to_numpy(), train_y.to_numpy())

pred = xgb_r.predict(test_X.to_numpy())  # predict on the held-out test set

# RMSE computation
rmse = np.sqrt(MSE(test_y.to_numpy(), pred))
print("RMSE : % f" % (rmse))

Conclusion

Polars is still at an early stage and has a long way to go before becoming an industry standard; however, it has a robust framework and can meet the challenges of a data scientist's day-to-day work. Go to the official polars documentation to try it out yourself.

Keegan Fernandes

First-year MSc Data Science student. Writes data science and machine learning tutorials and about the impact these fields have on the world.