In-Class Assignment 23#

⚠️ Should only be run in Google Colab, where we have access to a GPU. ⚠️

Due by the end of the day, Wednesday 9 April, 2025

Exploring GPU Tools in Python - Numba and cuDF#

Learning Objectives#

  • Develop an understanding of the vocabulary of heterogeneous computing.

  • Distinguish between CPU and GPU memory hierarchy and management.

  • Deploy tools to the GPU, compare them, and evaluate possible speedups.

Credit: cuDF GitHub & GTC 2017 / Anaconda, Inc.

10 Minutes to RAPIDS cuDF’s pandas accelerator mode (cudf.pandas)#

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data with a DataFrame API in the style of pandas.
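
For a sense of that API, here is a minimal illustrative sketch of native cuDF (not part of the assignment; it simply mirrors the pandas calls you already know):

import cudf

gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})  # stored in GPU memory
print(gdf["a"].sum())  # computed on the GPU, mirroring pandas' Series.sum()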

cuDF now provides a pandas accelerator mode (cudf.pandas), allowing you to bring accelerated computing to your pandas workflows without requiring any code changes.

This notebook is a short introduction to cudf.pandas.

⚠️ Verify your setup#

First, we’ll verify that you are running with an NVIDIA GPU.

#!nvidia-smi  # this should display information about available GPUs

With our GPU-enabled Colab runtime active, we’re ready to go. cuDF is available by default in the GPU-enabled runtime.

If you’re interested in installing on other platforms, please visit https://rapids.ai/#quick-start to learn more.

#import cudf  # this should work without any errors

We’ll also install plotly-express for visualizing data.

Environment Note#

If you’re not running this notebook on Colab, you may need to reload the webpage for the plotly.express visualizations to work correctly.

#!pip install plotly-express

Download the data#

The data we’ll be working with is the Parking Violations Issued - Fiscal Year 2022 dataset from NYC Open Data.

We’re downloading a copy of this dataset from an S3 bucket hosted by NVIDIA to provide faster download speeds. Downloading should take about 30 seconds.

Data License and Terms#

As this dataset originates from the NYC Open Data Portal, it’s governed by their license and terms of use.

Are there restrictions on how I can use Open Data?#

Open Data belongs to all New Yorkers. There are no restrictions on the use of Open Data. Refer to Terms of Use for more information.

Terms of Use#

By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

Submitting City Agencies are the authoritative source of data available on NYC Open Data. These entities are responsible for data quality and retain version control of data sets and feeds accessed on the Site. Data may be updated, corrected, or refreshed at any time.

#!wget https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet

a. - Analysis using Standard Pandas (on Host/CPU only)#

First, let’s use Pandas to read in some columns of the dataset:

  1. Load the following columns from the downloaded dataset using pandas:

"Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"

  2. Print a random sample of 10 rows of this dataset using the .sample method in pandas (one possible completion is sketched after the cell below).

#import pandas as pd
# read 5 columns data:
#df = pd.read_parquet(
#    "nyc_parking_violations_2022.parquet",
#    columns=[###


# view a random sample of 10 rows:
#df.sample(###
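
One possible completion of the cell above (the column list comes from the task description, and .sample(10) draws 10 random rows):

import pandas as pd

# read only the 5 columns we need:
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description",
             "Vehicle Body Type", "Issue Date", "Summons Number"],
)

# view a random sample of 10 rows:
df.sample(10)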

Next, we’ll try to answer a few questions using the data.

b. - Which parking violation is most commonly committed by vehicles from various U.S. states?#

Each record in our dataset contains the state of registration of the offending vehicle and the type of parking offence. Let’s say we want to find the most common type of offence for vehicles registered in each state. We can do this in Pandas using a combination of value_counts and GroupBy.head:

  1. To answer this question, complete the groupby below. Pandas will display a condensed version of the result.

In the resulting output, which violation type has the highest count? Report its count, description, and state.

#(df[["Registration State", "Violation Description"]]  # get only these two columns
# .value_counts()  # get the count of offences per state and per type of offence
# .groupby(### complete here ###)  # group the counts by state
# .head(1)  # get the first row in each group (the type of offence with the largest count)
# .sort_index()  # sort by state name
# .reset_index()
#)

The code above uses method chaining to combine a series of operations into a single statement. You might find it useful to break the code up into multiple statements and inspect each of the intermediate results!
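
For example, the chain can be unrolled like this (assuming the groupby blank above is filled with "Registration State", as in the timing cell below):

# count offences per (state, violation) pair; the result is sorted descending by count:
counts = df[["Registration State", "Violation Description"]].value_counts()

# within each state, keep only the first (i.e., largest) count:
top_per_state = counts.groupby("Registration State").head(1)

# sort by state name and turn the index levels back into columns:
top_per_state.sort_index().reset_index()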

c. - Timing on CPU only#

Loading and processing this data took a little time. Let’s measure how long these pipelines take in Pandas:

  1. Add the %%time magic command to the cell below to measure how long our data query takes using CPU-only pandas.

What is the total wall time?

#df = pd.read_parquet(
#    "nyc_parking_violations_2022.parquet",
#    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
#)
#
#(df[["Registration State", "Violation Description"]]
# .value_counts()
# .groupby("Registration State")
# .head(1)
# .sort_index()
# .reset_index()
#)
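
With the magic added, the completed cell would look like this (%%time must be the very first line of the cell; it reports the wall time for the whole cell):

%%time

df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)

(df[["Registration State", "Violation Description"]]
 .value_counts()
 .groupby("Registration State")
 .head(1)
 .sort_index()
 .reset_index()
)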

d. - Using cudf.pandas#

Now, let’s re-run the Pandas code above with the cudf.pandas extension loaded.

Typically, you should load the cudf.pandas extension as the first step in your notebook, before importing any modules. Here, we explicitly restart the kernel to simulate that behavior.

More info about cudf.pandas is available in the RAPIDS documentation.

  1. Load cudf.pandas using the load_ext Python magic command.

  2. Run our same data query and compare the total wall time to CPU-only pandas.

Is cuDF faster for this query? If so, by how much?

#get_ipython().kernel.do_shutdown(restart=True)
## load `cudf.pandas` here
## time our same routine from above 

#import pandas as pd

#df = pd.read_parquet(
#    "nyc_parking_violations_2022.parquet",
#    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
#)

#(df[["Registration State", "Violation Description"]]
# .value_counts()
# .groupby("Registration State")
# .head(1)
# .sort_index()
# .reset_index()
#)
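
One possible completion, assuming a freshly restarted kernel. First, load the extension in its own cell, before importing pandas:

%load_ext cudf.pandas

Then time the same query in the next cell:

%%time

import pandas as pd

df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
)

(df[["Registration State", "Violation Description"]]
 .value_counts()
 .groupby("Registration State")
 .head(1)
 .sort_index()
 .reset_index()
)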

Understanding Performance#

cudf.pandas provides profiling utilities to help you better understand performance. With these tools, you can identify which parts of your code ran on the GPU and which parts ran on the CPU.

They’re accessible in the cudf.pandas namespace since the cudf.pandas extension was loaded above with load_ext cudf.pandas.
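
For example, here is a minimal profiling sketch (the %%cudf.pandas.profile cell magic summarizes, per operation, whether it ran on the GPU or fell back to the CPU; the exact report format can vary between RAPIDS releases):

%%cudf.pandas.profile

small_df = pd.DataFrame({"a": [0, 1, 2], "b": ["x", "y", "z"]})
small_df.groupby("a").count()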

Colab Note#

If you’re running in Colab, the first time you use the profiler it may take 10+ seconds due to Colab’s debugger interacting with the built-in Python function sys.settrace, which we use for profiling. For demo purposes, this isn’t an issue. Just run the cell again.

Using third-party libraries with cudf.pandas#

You can pass Pandas objects to third-party libraries when using cudf.pandas, just like you would when using regular Pandas.

Below, we show an example of using plotly-express to visualize the data we’ve been processing:

e. - Which states have more pickup trucks relative to other vehicles?#

  1. For fun, run the cell below to answer this question.

Which state has the most pickup trucks relative to other vehicles?

#import plotly.express as px
#
#df = df.rename(columns={
#    "Registration State": "reg_state",
#    "Vehicle Body Type": "vehicle_type",
#})

# vehicle counts per state:
#counts = df.groupby("reg_state").size().sort_index()
# vehicles with body type "PICK" (pickup truck):
#pickup_counts = df.where(df["vehicle_type"] == "PICK").groupby("reg_state").size()
# percentage of pickup trucks by state:
#pickup_frac = ((pickup_counts / counts) * 100).rename("% Pickup Trucks")
#del pickup_frac["MB"]  # Manitoba is a huge outlier!

# plot the results:
#pickup_frac = pickup_frac.reset_index()
#px.choropleth(pickup_frac, locations="reg_state", color="% Pickup Trucks", locationmode="USA-states", scope="usa")

Conclusion#

With cudf.pandas, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the cudf.pandas extension and run your existing code on a GPU!

To learn more, we encourage you to visit rapids.ai/cudf-pandas.

  1. Memory management is crucial when considering GPUs. If the time to copy the necessary data to the device is substantial and the data parallelism is low, the CPU will remain the better option, as the sketch below illustrates.
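
A minimal sketch of that copy cost using Numba (assumes a CUDA-capable GPU and a Numba build with CUDA support; the array size is an arbitrary choice):

import time

import numpy as np
from numba import cuda

x = np.random.rand(50_000_000)  # ~400 MB of float64 on the host

start = time.perf_counter()
d_x = cuda.to_device(x)   # copy the host array into GPU (device) memory
cuda.synchronize()        # wait for the transfer to complete
print(f"host-to-device copy took {time.perf_counter() - start:.3f} s")

If the computation you then launch on d_x does only a trivial amount of parallel work, this transfer time can easily dominate the total runtime.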

Check out some more ways to speed up Python code in the examples from Dr. Chi-kwan Chan and his recent workshop.