10. ICE: GPU Acceleration#
Let’s check out some tensor arithmetic on the CPU vs. the GPU!
We’ll present the same code for both TensorFlow and PyTorch.
%pip install -q torch
%pip install -q tensorflow
First, import and check the version.
We will also make sure we can access the CUDA GPU. This should always be your first step!
We’ll also set a random seed for the random operations. While this isn’t strictly necessary for this experiment, it’s good practice as it aids reproducibility.
import torch
print("PyTorch Version:", torch.__version__)
# Help with reproducibility of test
torch.manual_seed(2016)
if not torch.cuda.is_available():
    raise OSError("ERROR: No GPU found.")
import tensorflow as tf
print("TensorFlow Version:", tf.__version__)
# Help with reproducibility of test
tf.random.set_seed(2016)
# Make sure we can access the GPU
print("Physical Devices Available:\n", tf.config.list_physical_devices())
if not tf.config.list_physical_devices("GPU"):
    raise OSError("ERROR: No GPU found.")
10.1. Dot Product#
Dot products are extremely common tensor operations. They are used in deep neural networks and linear algebra applications.
A dot product is essentially just a bunch of multiplications and additions.
PyTorch provides the torch.tensordot() method. TensorFlow provides the tf.tensordot() method.
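As a quick sanity check (a small illustration, not part of the benchmark), the default contraction of two 2-D tensors is the same as multiplying element-wise and summing:

# Sanity check: tensordot over both axes of 2-D tensors is an
# element-wise multiply followed by a sum over all entries
x = torch.rand(3, 3)
y = torch.rand(3, 3)
print(torch.tensordot(x, y))  # contracts both dimensions by default
print((x * y).sum())          # same value, computed manually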
First, let’s define two methods to compute the dot product. One will take place on the CPU and the other on the GPU.
CPU Timing#
The CPU method is trivial!
# Compute the tensor dot product on CPU
def torch_cpu_dot_product(a, b):
    return torch.tensordot(a, b)
# Compute the tensor dot product on CPU
def tf_cpu_dot_product(a, b):
    with tf.device("/CPU:0"):
        product = tf.tensordot(a, b, axes=2)
    return product
GPU Timing#
For PyTorch, the GPU method has a bit more to it. We must:
Send the tensors to the GPU for computation. We call .to("cuda") on each tensor to move it to a particular device.
Wait for the GPU to finish. According to the docs, GPU ops execute asynchronously, so we need to call torch.cuda.synchronize() for precise timing.
For TensorFlow, the tf.device context manager makes this a bit simpler.
# Send the tensor to GPU then compute dot product
# synchronize() required for timing accuracy, see:
# https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution
def torch_gpu_dot_product(a, b):
    a_gpu = a.to("cuda")
    b_gpu = b.to("cuda")
    product = torch.tensordot(a_gpu, b_gpu)
    torch.cuda.synchronize()
    return product
def tf_gpu_dot_product(a, b):
    with tf.device("/GPU:0"):
        product = tf.tensordot(a, b, axes=2)
    return product
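One small aside, not needed for the timing itself: the PyTorch result stays on the GPU. To move it back into CPU memory, call .cpu() on it, as in this minimal sketch:

# The result of torch_gpu_dot_product lives on the GPU
result = torch_gpu_dot_product(torch.rand(4, 4), torch.rand(4, 4))
print(result.device)        # cuda:0
print(result.cpu().device)  # cpu, after copying back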
Running the Benchmark#
This section declares the start and stop tensor sizes for our test.
You can change SIZE_LIMIT and then run again; just know that at some point you will run out of memory!
Next, it does tests at several sizes within this range, doubling each time.
We use timeit.timeit() for the tests. It calls the function multiple times and returns the total elapsed time, which is more accurate than manually calling Python’s time functions and subtracting.
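For instance (a tiny illustration, separate from the benchmark itself), dividing the total by the run count gives the average time per call:

# timeit() returns the TOTAL seconds for `number` calls, not the average
total = timeit("sum(range(1000))", number=1000)
print("Average per call:", total / 1000, "seconds")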
Finally, results are saved into a list that’s then exported to a pandas DataFrame for easy viewing.
import pandas as pd
from timeit import timeit
SIZE_LIMIT = 10000  # where to stop
# This cell is PyTorch
tensor_size = 10 # start at size 10
torch_results = []
print("Running PyTorch with 2D tensors from", tensor_size, "to", SIZE_LIMIT, "square")
# Run the test
while tensor_size <= SIZE_LIMIT:
    # Random tensor_size x tensor_size arrays
    a = torch.rand(tensor_size, tensor_size, device="cpu")
    b = torch.rand(tensor_size, tensor_size, device="cpu")
    # Time the CPU operation
    cpu_time = timeit("torch_cpu_dot_product(a, b)", globals=globals(), number=50)
    # Time the GPU operation
    # First, send the data to the GPU once; this is called the warm up
    # Whether this warm up time matters depends on the application
    # We do it here so the one-time startup cost isn't mixed into the timed runs
    torch_gpu_dot_product(a, b)
    # Now we time the actual operation
    gpu_time = timeit("torch_gpu_dot_product(a, b)", globals=globals(), number=50)
    # Record the results
    torch_results.append(
        {
            "tensor_size": tensor_size * tensor_size,  # total number of elements
            "cpu_time": cpu_time,
            "gpu_time": gpu_time,
            "gpu_speedup": cpu_time / gpu_time,  # Greater than 1 means faster on GPU
        }
    )
    # Double tensor_size
    tensor_size = tensor_size * 2
# Done! Cast the results to a DataFrame and print
torch_results_df = pd.DataFrame(torch_results)
print("PyTorch Results:")
print(torch_results_df)
# This cell is TensorFlow
tensor_size = 10 # start at size 10
tf_results = []
print(
    "Running TensorFlow with 2D tensors from", tensor_size, "to", SIZE_LIMIT, "square"
)
# Run the test
while tensor_size <= SIZE_LIMIT:
    # Random tensor_size x tensor_size arrays
    with tf.device("/CPU:0"):
        a = tf.random.uniform((tensor_size, tensor_size))
        b = tf.random.uniform((tensor_size, tensor_size))
    # Time the CPU operation
    cpu_time = timeit("tf_cpu_dot_product(a, b)", globals=globals(), number=10)
    # Time the GPU operation
    # First, send the data to the GPU once; this is called the warm up
    # Whether this warm up time matters depends on the application
    # We do it here because timeit() runs the function multiple times anyway
    tf_gpu_dot_product(a, b)
    # Now we time the actual operation
    gpu_time = timeit("tf_gpu_dot_product(a, b)", globals=globals(), number=10)
    # Record the results
    tf_results.append(
        {
            "tensor_size": tensor_size * tensor_size,  # total number of elements
            "cpu_time": cpu_time,
            "gpu_time": gpu_time,
            "gpu_speedup": cpu_time / gpu_time,  # Greater than 1 means faster on GPU
        }
    )
    # Double tensor_size
    tensor_size = tensor_size * 2
# Done! Cast the results to a DataFrame and print
tf_results_df = pd.DataFrame(tf_results)
print("TensorFlow Results:")
print(tf_results_df)
Dot Product Results#
If you left the default sizes, you should see 10 rows of results. You’ll notice that with small tensors the CPU is faster than the GPU! This is also indicated by a gpu_speedup of less than 1.
But as the tensor sizes grow, the GPU overtakes the CPU for speed! 🏎️
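If you want to see the crossover point, here is a minimal sketch that plots the speedup curve. It assumes matplotlib is installed and uses the torch_results_df from above (swap in tf_results_df for TensorFlow):

# Optional: visualize GPU speedup vs. tensor size (assumes matplotlib is installed)
import matplotlib.pyplot as plt

plt.plot(torch_results_df["tensor_size"], torch_results_df["gpu_speedup"], marker="o")
plt.axhline(1.0, color="gray", linestyle="--")  # above this line, the GPU wins
plt.xscale("log")  # sizes double each step, so a log scale spaces them evenly
plt.xlabel("Tensor size (total elements)")
plt.ylabel("GPU speedup (cpu_time / gpu_time)")
plt.show()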
10.2. Another Tensor Operation#
Your task is to repeat this benchmark below, this time finding the minimum element in a 1D tensor. You only need to do it with one framework.
Use either torch.min() or tf.reduce_min().
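If you’re not sure where to start, here is a minimal PyTorch sketch of the two timing methods, mirroring the dot product ones above. The names torch_cpu_min and torch_gpu_min are made up for this sketch, and the benchmark loop is left to you:

# One possible starting point (PyTorch shown; TensorFlow is analogous)
def torch_cpu_min(a):
    return torch.min(a)

def torch_gpu_min(a):
    a_gpu = a.to("cuda")
    result = torch.min(a_gpu)
    torch.cuda.synchronize()  # wait for the async GPU op before timing ends
    return result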
# Define your methods here
# Conduct your benchmark here
Results#
Jot down some thoughts to yourself here about what you saw 📈
Commit this notebook to your fork, then complete the ICE on Gradescope.