Azure Data Engineering Notebooks

Data keeps growing, and it is now one of the most important assets a company has. Learning to visualise, analyse and transform data is becoming a core skill for any software engineer.

I have recently been learning more about the latest data engineering capabilities available in the Azure platform. The first part of this post walks through a scenario in which I used a Notebook in Azure Synapse Analytics to interrogate some sample sales data. The second part shows a similar example using Azure Databricks Notebooks.

I really like Notebooks. They are a great tool to work with data and document what you are doing.

In this example, the Notebook presents sales data from files stored in a linked data lake. Notebooks support several languages; Python and SQL are the most common ones.
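
As a rough idea of what reading such files can look like in Python, here is a minimal sketch. It assumes the notebook already provides a `spark` session, and the container, storage account and folder names are hypothetical placeholders, not the ones from my workspace. Whether the CSV files have a header row is also an assumption you would need to check.

```python
def abfss_path(container, account, relative_path):
    """Build an abfss:// URI for a file or folder in Azure Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative_path.lstrip('/')}"


def read_sales_orders(spark, path, has_header=False):
    """Load CSV files into a Spark DataFrame.

    `spark` is the SparkSession supplied by the notebook environment.
    """
    return spark.read.load(path, format="csv", header=has_header)


# Hypothetical names: replace the container, account and folder with your own.
ORDERS_PATH = abfss_path("files", "mydatalake", "sales/orders/*.csv")
```

In a notebook cell, something like `df = read_sales_orders(spark, ORDERS_PATH)` followed by `df.show(10)` should print the first rows.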

If you want to learn more about this, you can check this online sample: https://microsoftlearning.github.io/DP-500-Azure-Data-Analyst/Instructions/labs/02-analyze-files-with-Spark.html

Now let’s look at the Azure Databricks example. I created a Notebook running on a single-node cluster in my Databricks workspace. From there I can query the online file with sample data: https://raw.githubusercontent.com/MicrosoftLearning/dp-203-azure-data-engineer/master/Allfiles/labs/23/adventureworks/products.csv
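
There are several ways to get a file like this into a DataFrame. The sketch below is one possibility and not necessarily the approach used in the lab: it downloads the CSV on the driver with the standard library, parses it, and hands the rows to Spark. The column names in `COLUMNS` and the view name `products` are assumptions, so check them against the real file before relying on them.

```python
import csv
import io
import urllib.request

PRODUCTS_URL = (
    "https://raw.githubusercontent.com/MicrosoftLearning/dp-203-azure-data-engineer/"
    "master/Allfiles/labs/23/adventureworks/products.csv"
)

# Assumed header of the file; verify it against the actual CSV.
COLUMNS = ["ProductID", "ProductName", "Category", "ListPrice"]


def download_text(url):
    """Fetch the CSV over HTTPS and return it as text (a leading BOM, if any, is dropped)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8-sig")


def parse_products(csv_text):
    """Turn CSV text into typed tuples ordered like COLUMNS."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (int(row["ProductID"]), row["ProductName"], row["Category"], float(row["ListPrice"]))
        for row in reader
    ]


def load_products(spark, url=PRODUCTS_URL):
    """Build a Spark DataFrame and register it as a temporary view for SQL queries."""
    df = spark.createDataFrame(parse_products(download_text(url)), schema=COLUMNS)
    df.createOrReplaceTempView("products")
    return df
```

Calling `load_products(spark)` in a cell returns the DataFrame, and the temporary view makes the same data reachable from SQL. Parsing on the driver is only reasonable for small files like this one.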

If you would like to learn more about this example, you can check the following online lab: https://microsoftlearning.github.io/dp-203-azure-data-engineer/Instructions/Labs/23-Explore-Azure-Databricks.html

The previous visualisation used the default graphics capabilities in Databricks Notebooks, but you can extend them with other plotting libraries such as matplotlib or seaborn.

Notice that Notebooks accept different programming languages, and even a combination of them, as in the example above, where the Spark Python library runs a SQL query.
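
To make the mix of Python and SQL concrete, here is a small sketch. It assumes a table or temporary view called `products` with `ProductID` and `Category` columns already exists, and that `spark` is the session provided by the notebook; those names are illustrative assumptions.

```python
# SQL text kept in a Python string; the table and column names are assumed.
CATEGORY_COUNT_SQL = """
    SELECT Category, COUNT(ProductID) AS ProductCount
    FROM products
    GROUP BY Category
    ORDER BY Category
"""


def count_products_by_category(spark):
    """Pass SQL text to Spark from Python; the result is a Spark DataFrame."""
    return spark.sql(CATEGORY_COUNT_SQL)


def road_bikes(spark):
    """Start from a SQL query, then keep refining with the DataFrame API."""
    return spark.sql("SELECT * FROM products").filter("Category = 'Road Bikes'")
```

Because `spark.sql` returns an ordinary DataFrame, the result can be filtered, joined or plotted in Python like any other.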

While matplotlib lets you create complex charts of many types, it can take quite a lot of code to get the best results. For this reason, many newer libraries have been built on top of matplotlib over the years to hide its complexity and enhance its capabilities. One such library is seaborn.
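
The difference is easier to see side by side. The sketch below draws the same bar chart twice, once with plain matplotlib and once with seaborn. The order data is invented for illustration, and the function names are my own; in a real notebook the totals would come from a DataFrame.

```python
from collections import defaultdict


def revenue_by_year(orders):
    """Sum quantity * unit_price per year from (year, quantity, unit_price) tuples."""
    totals = defaultdict(float)
    for year, quantity, unit_price in orders:
        totals[year] += quantity * unit_price
    return dict(sorted(totals.items()))


def plot_with_matplotlib(totals):
    """Bar chart with plain matplotlib: colours, labels and grid are all set by hand."""
    # Imported here so revenue_by_year has no plotting dependency.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 3))
    ax.bar([str(year) for year in totals], list(totals.values()), color="orange")
    ax.set_title("Revenue by Year")
    ax.set_xlabel("Year")
    ax.set_ylabel("Revenue")
    ax.grid(color="#95a5a6", linestyle="--", linewidth=1, axis="y", alpha=0.7)
    return fig


def plot_with_seaborn(totals):
    """The same chart with seaborn: the theme takes care of most of the styling."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set_theme(style="whitegrid")
    fig, ax = plt.subplots(figsize=(8, 3))
    sns.barplot(x=[str(year) for year in totals], y=list(totals.values()), ax=ax)
    ax.set(title="Revenue by Year", xlabel="Year", ylabel="Revenue")
    return fig


# Invented sample orders: (year, quantity, unit_price).
SAMPLE_ORDERS = [(2019, 2, 100.0), (2019, 1, 50.0), (2020, 3, 10.0)]
```

In a notebook cell, `plot_with_seaborn(revenue_by_year(SAMPLE_ORDERS))` followed by `plt.show()` should render the chart, provided seaborn is installed on the cluster.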

You can learn more about this example by following this online lab: https://microsoftlearning.github.io/dp-203-azure-data-engineer/Instructions/Labs/24-Analyze-Files-in-Azure-Databricks.html
