Learn Data Analysis with Julia

Image by author

Julia is another programming language similar to Python and R. It combines the speed of low-level languages like C with the simplicity of Python. Julia is gaining popularity in the field of data science, so if you’re looking to expand your portfolio and learn a new language, you’re in the right place.

In this tutorial, we will learn how to set up Julia for data science, load data, perform data analysis, and then visualize it. The tutorial is so simple that anyone, even a student, can start using Julia for data analysis in just 5 minutes.

1. Setting Up Your Environment

  1. Download Julia from julialang.org and install it.
  2. Set up Julia for Jupyter Notebook. Open a terminal (for example, PowerShell), type `julia` to start the Julia REPL, and then enter the following commands.

using Pkg
Pkg.add("IJulia")

  3. Launch Jupyter Notebook and start a new notebook with Julia as the kernel.
  4. Create a new code cell and run the following commands to install the necessary data science packages.

using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("Plots")
Pkg.add("Chain")
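
As a compact alternative to the four separate calls above, `Pkg.add` also accepts a vector of package names. The sketch below installs the same packages in one call and then lists the active environment so you can confirm they are present.

```julia
using Pkg

# Pkg.add accepts a vector of names, so one call installs all four packages.
Pkg.add(["DataFrames", "CSV", "Plots", "Chain"])

# List the packages in the active environment to confirm the install.
Pkg.status()
```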

2. Loading Data

For this example, we are using the Online Sales Dataset from Kaggle. It contains data on online sales transactions across different product categories.

We will load the CSV file into a DataFrame, which is similar to a pandas DataFrame.

using CSV
using DataFrames

# Load the CSV file into a DataFrame
data = CSV.read("Online Sales Data.csv", DataFrame)
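
After loading, it can help to check the shape and column names before going further. The sketch below is not run on the Kaggle file: it writes a tiny, hypothetical CSV (made-up rows, file name `sample_sales.csv`) to a temporary folder so that it runs on its own, then inspects the result with `size` and `names`.

```julia
using CSV
using DataFrames

# Hypothetical stand-in for the Kaggle file: write a tiny CSV to a temporary path.
path = joinpath(mktempdir(), "sample_sales.csv")
write(path, """
Product Category,Units Sold,Unit Price,Total Revenue
Electronics,2,999.99,1999.98
Books,3,15.99,47.97
Clothing,1,49.99,49.99
""")

sample = CSV.read(path, DataFrame)

println(size(sample))   # (number of rows, number of columns)
println(names(sample))  # column names as strings
```

Column names that contain spaces are kept as they are, which is why the rest of this tutorial refers to columns such as "Unit Price" with quoted names.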

3. Exploring Data

We will use the `first` function, the counterpart of pandas' `head`, to display the first 5 rows of the DataFrame.
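
A minimal sketch of that call. The `data` table here is a hypothetical toy DataFrame with made-up rows so the snippet runs on its own. In the tutorial, `data` is the DataFrame loaded with `CSV.read`.

```julia
using DataFrames

# Hypothetical toy rows standing in for the Kaggle data loaded with CSV.read.
data = DataFrame(
    "Product Category" => ["Electronics", "Books", "Clothing", "Books", "Sports", "Electronics"],
    "Units Sold" => [2, 3, 1, 4, 1, 2],
    "Unit Price" => [999.99, 15.99, 49.99, 26.99, 129.99, 249.0],
)

# first(df, n) returns the first n rows as a new DataFrame.
top_rows = first(data, 5)
```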

To generate a summary of the data, we will use the « describe » function.
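
A minimal sketch of `describe`, again on a hypothetical toy DataFrame with made-up values rather than the real dataset.

```julia
using DataFrames

# Hypothetical toy data; in the tutorial, `data` comes from CSV.read.
data = DataFrame(
    "Units Sold" => [2, 3, 1, 4],
    "Unit Price" => [999.99, 15.99, 49.99, 26.99],
)

# describe returns one row per column, with statistics such as
# mean, min, median, max, and the number of missing values.
stats = describe(data)
```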

Similar to Pandas DataFrame, we can display specific values by providing the row number and column name.
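
The original example for this step is not available, so the following is only a sketch of what such indexing can look like, using a hypothetical toy DataFrame. Rows are selected by number (or a range) and columns by name.

```julia
using DataFrames

# Hypothetical toy data; in the tutorial, `data` comes from CSV.read.
data = DataFrame(
    "Product Category" => ["Electronics", "Books", "Clothing", "Sports"],
    "Units Sold" => [2, 3, 1, 4],
)

# A single value: row 2 of the "Product Category" column.
value = data[2, "Product Category"]

# A sub-table: rows 1 to 3 of two selected columns.
subtable = data[1:3, ["Product Category", "Units Sold"]]
```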

4. Data Manipulation

We will use the `filter` function to keep only the rows that satisfy a condition. It takes a predicate function, which is applied to each row, and the DataFrame. Here we keep the rows where "Unit Price" is greater than 230.

filtered_data = filter(row -> row[:"Unit Price"] > 230, data)
last(filtered_data, 5)
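
DataFrames.jl also accepts a `column => predicate` pair in `filter`, which passes only that column's value to the predicate instead of the whole row. The sketch below shows the same condition written that way on a hypothetical toy DataFrame. The name `expensive` is just for illustration.

```julia
using DataFrames

# Hypothetical toy data; in the tutorial, `data` comes from CSV.read.
data = DataFrame(
    "Product Category" => ["Electronics", "Books", "Electronics"],
    "Unit Price" => [999.99, 15.99, 249.0],
)

# >(230) is shorthand for the function x -> x > 230.
expensive = filter("Unit Price" => >(230), data)
```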

We can also create a new column, much as in pandas. Here the new column is "Total Revenue" multiplied by 0.9, which corresponds to deducting a 10% tax.

data[!, :"Total Revenue After Tax"] = data[!, :"Total Revenue"] .* 0.9
last(data, 5)
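
If you prefer not to modify `data` in place, `transform` returns a new DataFrame with the extra column and leaves the original untouched. A minimal sketch on hypothetical toy values, reusing the same 0.9 factor:

```julia
using DataFrames

# Hypothetical toy data; in the tutorial, `data` comes from CSV.read.
data = DataFrame("Total Revenue" => [100.0, 200.0, 50.0])

# ByRow applies the function to each element of the source column.
with_tax = transform(
    data,
    "Total Revenue" => ByRow(x -> 0.9 * x) => "Total Revenue After Tax",
)
```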

We will now calculate the mean of "Total Revenue After Tax" for each "Product Category".

using Statistics

grouped_data = groupby(data, :"Product Category")
aggregated_data = combine(grouped_data, :"Total Revenue After Tax" => mean)
last(aggregated_data, 5)
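
By default, `combine` names the result column by joining the source column and the function name, here "Total Revenue After Tax_mean". If you want a different name, you can add a third element to the pair. The sketch below uses hypothetical toy values and an illustrative column name, "Mean Revenue After Tax". The rest of this tutorial keeps the default name.

```julia
using DataFrames
using Statistics

# Hypothetical toy data; in the tutorial, `data` comes from CSV.read.
data = DataFrame(
    "Product Category" => ["Electronics", "Books", "Electronics"],
    "Total Revenue After Tax" => [90.0, 10.0, 110.0],
)

# source column => function => name of the output column
aggregated = combine(
    groupby(data, "Product Category"),
    "Total Revenue After Tax" => mean => "Mean Revenue After Tax",
)
```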

5. Visualization

Plotting with the Plots package feels similar to Seaborn. Here we draw a bar chart of the aggregated data from the previous step, passing the X and Y columns followed by the title and axis labels.

using Plots

# Basic plot
bar(aggregated_data[!, :"Product Category"], aggregated_data[!, :"Total Revenue After Tax_mean"], title="Product Analysis", xlabel="Product Category", ylabel="Total Revenue After Tax Mean")

Electronics has the highest mean revenue after tax of all the product categories. The chart is clear and easy to read.

To generate a histogram, simply provide the column data along with the title and axis labels. Here we visualize how often each quantity of units sold occurs.

histogram(data[!, :"Units Sold"], title="Units Sold Analysis", xlabel="Units Sold", ylabel="Frequency")

It seems that the majority of people bought one or two items.

To save the visualization, we will use the `savefig` function.
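
A minimal sketch of saving a plot. The category values and the output file name `product_analysis.png` are hypothetical, and the file is written to a temporary folder so the snippet does not touch your working directory. The file extension determines the output format.

```julia
using Plots

# Hypothetical values standing in for the aggregated data.
categories = ["Electronics", "Books", "Clothing"]
mean_revenue = [100.0, 10.0, 40.0]

p = bar(categories, mean_revenue,
        title="Product Analysis",
        xlabel="Product Category",
        ylabel="Mean Revenue",
        legend=false)

# Save the plot object to a PNG file.
out = joinpath(mktempdir(), "product_analysis.png")
savefig(p, out)
```

Calling `savefig` with only a file name saves the most recently created plot.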

6. Creating a Data Processing Pipeline

Creating a proper data pipeline is necessary to automate data processing workflows, ensure data consistency, and enable scalable and efficient data analysis.

We will use the Chain package to chain together the functions used earlier: filtering on "Unit Price", grouping by "Product Category", and computing the mean "Total Revenue" for each category.

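using Chain
using DataFrames
using Statistics  # provides mean

# Example of a simple data processing pipeline:
# keep rows with a unit price above 230, group by category, average the revenue.
processed_data = @chain data begin
    filter(row -> row[:"Unit Price"] > 230, _)
    groupby(_, :"Product Category")
    combine(_, :"Total Revenue" => mean)
end
first(processed_data, 5)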
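
To reuse the pipeline on other tables, one option is to wrap it in a function. The sketch below is only an illustration: the function name `mean_revenue_by_category`, the keyword `min_unit_price`, and the toy rows are all hypothetical and not part of the original tutorial.

```julia
using Chain
using DataFrames
using Statistics

# Hypothetical helper: the same three steps, with the price threshold as a keyword.
function mean_revenue_by_category(df; min_unit_price = 230)
    @chain df begin
        filter(row -> row["Unit Price"] > min_unit_price, _)
        groupby(_, "Product Category")
        combine(_, "Total Revenue" => mean)
    end
end

# Hypothetical toy rows to try it out.
toy = DataFrame(
    "Product Category" => ["Electronics", "Books", "Electronics"],
    "Unit Price" => [999.99, 15.99, 249.0],
    "Total Revenue" => [1999.98, 47.97, 498.0],
)

result = mean_revenue_by_category(toy)
```

Because the threshold is a keyword argument, a call such as `mean_revenue_by_category(toy; min_unit_price = 10)` runs the same steps with a different cutoff.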

To save the processed DataFrame as a CSV file, we will use the `CSV.write` function.

CSV.write("output.csv", processed_data)

Conclusion

In my opinion, Julia is simpler and faster than Python. Many of the tools I am accustomed to in Python, such as Pandas, Seaborn, and Scikit-Learn, have close counterparts in Julia. So, why not learn a new language and start doing things better than your colleagues? Additionally, it may help you land a research-related job, as Julia has a following among researchers.

In this tutorial, we learned how to set up the Julia environment, load the dataset, perform powerful data analysis and visualization, and create a data pipeline to ensure reproducibility and reliability. If you want to learn more about Julia for data science, let me know so I can write even simpler tutorials for you guys.

Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves creating machine learning models. Currently, he focuses on content creation and writes technical blogs on machine learning and data science technologies. Abid holds a master’s degree in technology management and a bachelor’s degree in telecommunications engineering. His vision is to create an AI product using a graph neural network for students struggling with mental illness.
