ELI5: What Is Native Execution Engine in Fabric?

... or maybe more like ELI25

At MS Build, Microsoft announced the availability of the Native Execution Engine in Fabric. The product team will have detailed documentation and technical details, but I will attempt to provide an ELI5 version of what it means and how it works at a 30,000 ft level.

This is a very, very, very simplified version of how the Native Execution Engine works, with as few technical terms and as little jargon as possible, so excuse the generalization.

Spark runs on the Java Virtual Machine (JVM). JVM bytecode can run on any machine that has a JVM installed, irrespective of the hardware or the OS. The downside is that this is slower than code compiled directly into machine code (like C++) that the hardware can execute. Another problem is that the JVM takes a performance hit from how memory is managed during operations (garbage collection overhead). Both issues could be solved if Spark could convert Spark SQL directly into code that runs natively on the machine, e.g. C++ code. The Native Execution Engine (NEE) does exactly that: the Spark SQL code is translated into C++ to achieve higher execution speed. There are many ways to do this. Databricks does it with its proprietary Photon engine; Fabric uses the open-source Gluten + Velox architecture.

In vanilla Spark, the JVM sits between Spark SQL and the machine. In the case of NEE, there are two layers instead: Gluten and Velox.

Gluten is responsible for gluing Spark SQL to the native execution engine. It does so by taking the optimized query plan from Spark SQL and converting it into something called a Substrait plan. For the sake of simplification, think of Substrait as something like JSON: a portable, engine-agnostic way of describing a query plan (a gross generalization).
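To make that analogy concrete, here is a toy sketch of what an engine-agnostic plan could look like if serialized as JSON. This is NOT the real Substrait schema (real Substrait plans are Protobuf messages with a much richer structure); every field name below is invented purely for illustration.

```python
import json

# Invented, illustrative structure -- NOT the actual Substrait schema.
# The key idea: the plan describes *what* to compute (relations, expressions),
# independent of *which* engine (JVM Spark or native Velox) executes it.
toy_plan = {
    "relations": [
        {
            "aggregate": {
                "input": {"read": {"table": "sales"}},
                "groupings": ["region"],
                "measures": [{"function": "sum", "column": "amount"}],
            }
        }
    ]
}

serialized = json.dumps(toy_plan)   # hand this description to any engine...
roundtrip = json.loads(serialized)  # ...which deserializes and executes it
```

Any engine that understands this shared format can pick up the plan and run it, which is exactly the role Substrait plays between Spark's optimizer and Velox.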

Velox is a C++ library that executes code close to the machine. In our case, it takes the Substrait plan and performs joins, aggregations and other compute-heavy operations efficiently, eliminating the inefficiencies and overhead of the JVM. This "native execution" is what makes Spark code run faster than JVM-based vanilla Spark.
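A toy illustration of the underlying principle, in plain Python rather than Velox's C++: native engines process data column by column (contiguous arrays), which enables tight loops, SIMD, and cache-friendly memory access, instead of chasing one row object at a time the way a JVM operator does.

```python
# Toy illustration of row-at-a-time vs. columnar aggregation -- not Velox code.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 20},
    {"region": "east", "amount": 5},
]

# Row-at-a-time: touch every field of every row object (JVM-operator style).
row_total = 0
for row in rows:
    row_total += row["amount"]

# Columnar: pull the 'amount' column into one contiguous array first,
# then aggregate it in a single tight loop (native-engine style).
amount_column = [r["amount"] for r in rows]
col_total = sum(amount_column)

assert row_total == col_total  # same answer, very different memory access
```

Both produce the same result; the difference is purely in how the data is laid out and traversed, which is where the native speedup comes from.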

If Gluten is not able to convert an operation, it falls back to regular JVM Spark without any user intervention. No code changes are required by the developer: you write Spark/PySpark as you normally would. You can look at the Spark DAG (or event logs) to identify whether a particular operation was executed by the JVM or by NEE.
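As a sketch of that kind of check: in Gluten-enabled Spark, offloaded operators typically appear in the physical plan with modified names (Gluten commonly adds a "Transformer" suffix, e.g. `ProjectExecTransformer`), while fallbacks keep stock names like `ProjectExec`. The exact operator names vary by Gluten version, so treat the markers and the sample plan text below as assumptions; the helper itself just scans an `explain()`-style string.

```python
# Sketch: classify operators in an explain()-style plan string as native
# (Gluten/Velox) or JVM. The marker strings are version-dependent assumptions.
NATIVE_MARKERS = ("Transformer", "VeloxNative")  # assumed, not guaranteed names

def split_operators(plan_text: str):
    """Return (native_ops, jvm_ops) based on operator-name markers."""
    native, jvm = [], []
    for line in plan_text.splitlines():
        # Take the first token of each line as the operator name.
        op = line.strip().lstrip("+-*( ").split(" ")[0]
        if not op:
            continue
        (native if any(m in op for m in NATIVE_MARKERS) else jvm).append(op)
    return native, jvm

# Abbreviated, hypothetical plan text for illustration:
sample_plan = """
HashAggregateTransformer(keys=[region], functions=[sum(amount)])
  ProjectExecTransformer [region, amount]
    Scan parquet sales
"""

native_ops, jvm_ops = split_operators(sample_plan)
# Here the aggregate and project ran natively; the scan stayed on the JVM path.
```

In a real notebook you would feed it the output of `df.explain()` (or inspect the DAG in the Spark UI) instead of a hand-written string.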

You can enable it in a Fabric notebook by setting the Spark config below (where available). Note that this requires a custom pool, so the cluster start-up time will increase from roughly 10 seconds to about 2 minutes. While in preview, you will not pay for NEE execution (and I hope it stays that way beyond preview too, as a key Fabric differentiator).

%%configure -f
{
    "conf": {
        "spark.gluten.enabled": "true",
        "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager"
    }
}


If you want to go deeper, watch the recordings from VeloxCon on how other companies are using it.

I want to thank Estera Kot, Swinky Mann and Surbhi Vijayvargeeya at Microsoft and my colleague Miles Cole. Miles has an excellent blog on Photon you should check out.

Footnote: ELI5 means "Explain Like I'm 5 years old."


Support Sandeep Pawar by becoming a sponsor. Any amount is appreciated!