Calculating Folder Size In The Lakehouse

This is more of a note to self than a blog post. When running test scenarios, I measure the size of a lakehouse by adding up the sizes of its folders (Files and Tables). The code I use is below. I'm sharing it so I don't have to search for it again, and in case others find it useful. It's nothing special: I use multiprocessing.Pool to calculate the sizes of many folders in parallel.

import os
import pandas as pd
from multiprocessing import Pool, cpu_count


def get_size_of_folder(folder_path):
    """
    fabric.guru  |  08-09-2023
    Calculate the total size of all files in the given folder.

    Args:
    - folder_path (str): Path to the folder.

    Returns:
    - tuple: (folder_path, size in MB)
    """
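    # Walk the whole tree and add up the size (in bytes) of every file found.
    total_size = sum(
        os.path.getsize(os.path.join(dirpath, f))
        for dirpath, _, filenames in os.walk(folder_path)
        for f in filenames
    )

    # Convert bytes to MB (1 MB = 1024 * 1024 bytes here).
    size_in_mb = total_size / (1024 * 1024)
    return folder_path, size_in_mb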


def get_folder_sizes(base_path):
    """
    fabric.guru  |  08-09-2023
    Get the sizes of all folders in the given base path.

    Args:
    - base_path (str): Base directory path.

    Returns:
    - DataFrame: DataFrame with columns 'Folder' and 'Size (MB)'.
    """
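    # Only immediate subfolders are measured; files sitting directly
    # in base_path are not counted.
    folders = [
        os.path.join(base_path, folder)
        for folder in os.listdir(base_path)
        if os.path.isdir(os.path.join(base_path, folder))
    ]

    # One worker process per CPU core; each worker sizes one folder at a time.
    with Pool(cpu_count()) as p:
        sizes = p.map(get_size_of_folder, folders)

    df = pd.DataFrame(sizes, columns=['Folder', 'Size (MB)'])
    return df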

## Returns a pandas DataFrame with two columns: 'Folder' and 'Size (MB)'.
## This scans the folders of the lakehouse mounted in the notebook.
## Use the File API path, not the ABFSS or https path.

df = get_folder_sizes("/lakehouse/default/Files")
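Since get_folder_sizes only measures the immediate subfolders of the base path, files sitting directly under Files are left out, and Tables has to be scanned in a separate call. For a single total across the whole lakehouse, something like the sketch below may help. It is not part of my original snippet: the helper names total_size_mb and lakehouse_size_mb are made up for illustration, the scan is single-process, and it assumes the default lakehouse is mounted at /lakehouse/default with Files and Tables underneath. Unreadable or vanished files are skipped rather than raising.

```python
import os


def total_size_mb(base_path):
    """Sum the size of every file under base_path, including top-level files."""
    total_bytes = 0
    for dirpath, _, filenames in os.walk(base_path):
        for name in filenames:
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # The file disappeared or cannot be read; skip it.
                continue
    return total_bytes / (1024 * 1024)


def lakehouse_size_mb(root="/lakehouse/default", sections=("Files", "Tables")):
    """Return a dict of size in MB per section, plus a 'Total' entry.

    The default root is an assumption about where the lakehouse is mounted.
    """
    sizes = {}
    for section in sections:
        path = os.path.join(root, section)
        # A section that does not exist counts as zero.
        sizes[section] = total_size_mb(path) if os.path.isdir(path) else 0.0
    sizes["Total"] = sum(sizes.values())
    return sizes
```

Calling lakehouse_size_mb() in a notebook with a default lakehouse attached returns a dictionary with the keys 'Files', 'Tables' and 'Total', all in MB. For a very large number of folders, the Pool-based version above is likely to be faster.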
