Set up .NET for Apache Spark on your machine and build your first application.
Linux or Windows 64-bit operating system.
10 minutes + download/installation time
Use Apache Spark to count the number of times each word appears across a collection of sentences.
To start building .NET apps, download and install the .NET SDK (Software Development Kit).
You need to install additional dependencies on older versions of Windows. See Windows 7 / Vista / 8.1 / Server 2008 R2 / Server 2012 R2 for more information.
Once you've installed the SDK, open a new command prompt and run the following command:
Once you've installed the SDK, open a new terminal and run the following command:
dotnet
If the installation succeeded, you should see output similar to the following:
Usage: dotnet [options]
Usage: dotnet [path-to-application]
Options:
-h|--help Display help.
--info Display .NET information.
--list-sdks Display the installed SDKs.
--list-runtimes Display the installed runtimes.
path-to-application:
The path to an application .dll file to execute.
If everything looks good, select the Continue button below to go to the next step.
If you receive a 'dotnet' is not recognized as an internal or external command error, make sure you opened a new command prompt. If you can't resolve the issue, use the I ran into an issue button to get help fixing the problem.
If you receive a dotnet: command not found error, make sure you opened a new terminal window. If you can't resolve the issue, use the I ran into an issue button to get help fixing the problem.
.NET for Apache Spark runs on the 64-bit version of the Windows operating system; the instructions from this point onwards assume a 64-bit environment.
Apache Spark is downloaded as a compressed .tgz file. You'll need 7-zip to extract the file.
If you already have an alternative extraction program installed, you can use that instead.
Apache Spark requires the Java SE Development Kit (JDK) 8 or 11. This tutorial uses version 8, but you can use version 11 if you already have that installed.
Once you've installed the JDK, open a new command prompt and run the following command:
java -version
If the command runs and prints your Java version information, you're good to go.
If the command fails and you can't resolve the issue, use the I ran into an issue button to get help fixing the problem.
Apache Spark requires the Java Development Kit (JDK).
Open a terminal and run the following command to install the Java SE Development Kit (JDK); on recent Ubuntu releases, the default-jdk package installs OpenJDK 11:
sudo apt-get install default-jdk
Once the previous command completes, run the following command:
java -version
If the command runs and prints your Java version information, you're good to go.
If the command fails and you can't resolve the issue, use the I ran into an issue button to get help fixing the problem.
Apache Spark is downloaded as a .tgz file.
Download Apache Spark 3.0.1
Open a new command prompt in administrator mode and run the following commands to set the environment variables used to locate Apache Spark:
setx HADOOP_HOME C:\bin\spark-3.0.1-bin-hadoop2.7\
setx SPARK_HOME C:\bin\spark-3.0.1-bin-hadoop2.7\
setx /M PATH "%PATH%;%HADOOP_HOME%;%SPARK_HOME%\bin"
Once you've installed everything and set your environment variables, open a new command prompt or terminal and run the following command:
spark-submit --version
If the command runs and prints version information, you can move to the next step. If you receive a 'spark-submit' is not recognized as an internal or external command error, make sure you opened a new command prompt.
In your terminal, move to the folder that contains the file you just downloaded, then run the following commands:
mkdir ~/bin
tar xvf spark-3.0.1-bin-hadoop2.7.tgz --directory ~/bin
Set up the required environment variables by running the following commands:
export SPARK_HOME=~/bin/spark-3.0.1-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
source ~/.bashrc
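Note that the export commands above apply only to the current shell session, and that sourcing ~/.bashrc restores the variables in later sessions only if the exports have been added to that file. The following sketch shows the idea, using a stand-in file named demo_bashrc so the effect is easy to see; in practice you would append the same two lines to ~/.bashrc itself:

```shell
# Sketch: persist the Spark variables by appending them to a shell profile.
# A stand-in file (demo_bashrc) is used here for illustration; in practice
# the target would be ~/.bashrc.
profile=./demo_bashrc
cat >> "$profile" <<'EOF'
export SPARK_HOME=~/bin/spark-3.0.1-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
EOF

# Load the profile and confirm the variable resolves to a full path.
. "$profile"
echo "SPARK_HOME=$SPARK_HOME"
```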
Run the following command to make sure Spark is installed correctly:
spark-submit --version
Download the Microsoft.Spark.Worker release from the .NET for Apache Spark GitHub repository:
Download .NET for Apache Spark (v1.0.0)
.NET for Apache Spark requires WinUtils to be installed alongside Apache Spark.
Once winutils.exe downloads, copy it into C:\bin\spark-3.0.1-bin-hadoop2.7\bin.
Run the following command to set the DOTNET_WORKER_DIR environment variable. This is used by .NET apps to locate .NET for Apache Spark.
setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker-1.0.0"
Finally, double-check that you can run spark-shell from your command line before you move to the next section. Press Ctrl+D to quit the Spark shell.
.NET for Apache Spark is downloaded as a .tgz file.
Download .NET for Apache Spark (v1.0.0)
In your terminal, move to the folder that contains the file you just downloaded, then run the following command:
tar xvf Microsoft.Spark.Worker.netcoreapp3.1.linux-x64-1.0.0.tar.gz --directory ~/bin
Run the following command to set the DOTNET_WORKER_DIR environment variable. This is used by .NET apps to locate .NET for Apache Spark.
export DOTNET_WORKER_DIR="$HOME/bin/Microsoft.Spark.Worker-1.0.0"
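Be aware that a tilde inside double quotes is not expanded by the shell, so writing the path with $HOME guarantees the variable holds a fully expanded path. A quick sanity check:

```shell
# Use $HOME rather than a quoted tilde (~), which the shell would leave
# unexpanded inside double quotes.
export DOTNET_WORKER_DIR="$HOME/bin/Microsoft.Spark.Worker-1.0.0"

# The printed value should be an absolute path with no literal '~'.
echo "DOTNET_WORKER_DIR=$DOTNET_WORKER_DIR"
```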
In your command prompt, run the following command to create your app:
In your terminal, run the following command to create your app:
dotnet new console -f netcoreapp3.1 -o mySparkApp
Then, navigate to the new directory created by the previous command:
cd mySparkApp
The dotnet command creates a new application of type console for you. The -o parameter creates a directory named mySparkApp where your app is stored and populates it with the required files. The cd mySparkApp command puts you into the newly created app directory.
To use .NET for Apache Spark in an app, you need to add the Microsoft.Spark package to your project. In your command prompt, run the following command:
To use .NET for Apache Spark in an app, you need to add the Microsoft.Spark package to your project. In your terminal, run the following command:
dotnet add package Microsoft.Spark --version 1.0.0
Your app will process a file containing lines of text. Create an input.txt file in your mySparkApp directory, containing the following text:
Hello World
This .NET app uses .NET for Apache Spark
This .NET app counts words with Apache Spark
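If you prefer to stay in the terminal, the file can also be created in one step with a heredoc (a sketch; any text editor works just as well):

```shell
# Sketch: create input.txt from the command line instead of an editor.
cat > input.txt <<'EOF'
Hello World
This .NET app uses .NET for Apache Spark
This .NET app counts words with Apache Spark
EOF

# Confirm the file has the expected three lines.
wc -l input.txt
```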
Open Program.cs in any text editor and replace all of the code with the following:
using Microsoft.Spark.Sql;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Spark session.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("word_count_sample")
                .GetOrCreate();

            // Create the initial DataFrame from the text file.
            DataFrame dataFrame = spark.Read().Text("input.txt");

            // Count words: split each line on spaces, explode into one row
            // per word, then group by word and count.
            DataFrame words = dataFrame
                .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                .Select(Functions.Explode(Functions.Col("words"))
                    .Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Functions.Col("count").Desc());

            // Show the results.
            words.Show();

            // Stop the Spark session.
            spark.Stop();
        }
    }
}
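For intuition, the split / explode / group / count / sort pipeline above does roughly the same thing as this classic shell one-liner over the sample input (a rough analogue only; Spark distributes the work across executors, and ties may sort differently):

```shell
# Sketch: the same word-count logic as the Spark pipeline, expressed as a
# shell pipeline: split lines into words, then group, count, and sort.
printf '%s\n' \
  'Hello World' \
  'This .NET app uses .NET for Apache Spark' \
  'This .NET app counts words with Apache Spark' |
  tr ' ' '\n' | sort | uniq -c | sort -rn
```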
Navigate to the directory that contains your application, then run the following command to build it:
dotnet build
Run the following command to submit your application to run on Apache Spark. Make sure input.txt exists in your mySparkApp directory:
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp3.1\microsoft-spark-3-0_2.12-1.0.0.jar dotnet bin\Debug\netcoreapp3.1\mySparkApp.dll
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin/Debug/netcoreapp3.1/microsoft-spark-3-0_2.12-1.0.0.jar dotnet bin/Debug/netcoreapp3.1/mySparkApp.dll
If your app runs successfully, you should see the data written to the console:
... logging ...
+------+-----+
| word|count|
+------+-----+
| .NET| 3|
|Apache| 2|
| app| 2|
| This| 2|
| Spark| 2|
| World| 1|
|counts| 1|
| for| 1|
| words| 1|
| with| 1|
| Hello| 1|
| uses| 1|
+------+-----+
... logging ...
You may encounter an error that Spark failed to delete a jar file from a temporary directory. This is a known error and won't affect the output of your app.
Congratulations, you've built and run your first .NET for Apache Spark app!
If you want to keep learning more about .NET for Apache Spark, visit the following official documentation:
.NET for Apache Spark documentation
If you want to learn more about Apache Spark, you can check out the following documentation: