BZ2, Linux Command Line and Fabric

Apologies for the title that doesn't make much sense. Like many of you, I hadn't heard of bz2 until a few weeks ago. I've been running experiments and performance tests on Fabric, which involved working with 300 GB of data compressed with bz2. I've used gzip and lzma before, but bz2 was new to me. I downloaded the data to the Fabric lakehouse but couldn't open it in Spark or with Python's `bz2` module. Data Factory pipelines can read bz2 files as well, but that didn't work either. The data authors likely compressed it on Linux with a very high compression setting. I then realized that since Fabric spins up a Linux VM for Spark, I could use the Linux command-line utilities (if available) from a notebook. Sure enough, it worked.

!bzcat <file.bz2> | head -n 5

You have to use the File API path (under `/lakehouse/default/Files/`) because the lakehouse is mounted via blobfuse2. Using the command line, I was also able to decompress the file, save it as gzip, and load it in Spark.
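As a sketch of that bz2-to-gzip conversion: the file names below are placeholders (in Fabric you would point at your actual File API path, and prefix each command with `!` in a notebook cell). The example creates its own small bz2 file so it is self-contained.

```shell
# Create a small bz2 file so the example is self-contained;
# in Fabric you would already have the downloaded .bz2 file.
printf 'a,b\n1,2\n' > sample.csv
bzip2 -k sample.csv                 # writes sample.csv.bz2, keeps the original

# Stream-decompress and recompress as gzip, with no intermediate
# uncompressed file written to disk
bzcat sample.csv.bz2 | gzip > sample.csv.gz

# Check the round trip
gzip -dc sample.csv.gz
```

Streaming through a pipe matters here: with 300 GB of data you avoid materializing the uncompressed copy on disk.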

I'm fairly certain that 99.999% of you will never need to use this (myself included), but I thought I'd share it just in case it might help someone.

This also means that if you have CSV, JSON, or text files in your Files section and you just want to inspect or search them, you can use bash commands like grep, cat, awk, head, and column to do it quickly instead of loading them into a pandas or Spark dataframe.
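For instance, a quick row count and pattern search might look like this (the file name and pattern are made up for illustration; in Fabric you would use your own `/lakehouse/default/Files/...` path with a `!` prefix):

```shell
# Self-contained sample file standing in for a lakehouse CSV
printf 'region,amount\nwest,10\neast,20\nwest,5\n' > sales_sample.csv

wc -l < sales_sample.csv          # number of rows (including the header)
grep -c '^west' sales_sample.csv  # rows whose first column starts with "west"
```

Both commands stream the file, so they stay fast even on files too large to comfortably load into a dataframe.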

Below, I use head and awk to view the first 5 rows and columns of a CSV file:

!head -n 5 /lakehouse/default/Files/sales.csv | awk -F, '{ for (i=1; i<=5; i++) printf "%s,", $i; print "" }' | column -s, -t | less

Use json_pp to pretty-print and inspect JSON files and their tree structure.

This is a good reference for some of the commands you can use.
