.NET for Apache® Spark™
A free, open-source, and cross-platform big data analytics framework
Supported on Windows, Linux, and macOS
Apache Spark™ is a general-purpose distributed processing engine for analytics over large data sets, typically terabytes or petabytes of data. Apache Spark can be used for batch processing, real-time stream processing, machine learning, and ad hoc queries.
Processing tasks are distributed over a cluster of nodes, and data is cached in memory to reduce computation time.
Apache Spark is often used for high-volume data preparation pipelines, such as extract, transform, and load (ETL) processes that are common in data warehousing.
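As an illustration of such a pipeline, the following is a minimal C# sketch of an extract, transform, and load job. The file names (orders.csv, order_totals.parquet) and the column names (customer_id, amount) are hypothetical and chosen only for this example; the sketch also assumes the Microsoft.Spark NuGet package is referenced and the program is launched through spark-submit.

```csharp
using Microsoft.Spark.Sql;

class EtlSketch
{
    static void Main(string[] args)
    {
        var spark = SparkSession.Builder().AppName("etl_sketch").GetOrCreate();

        // Extract: read a CSV file with a header row (hypothetical input path)
        DataFrame orders = spark.Read()
            .Option("header", true)
            .Option("inferSchema", true)
            .Csv("orders.csv");

        // Transform: keep rows with a positive amount, then total the amount per customer
        // ("amount" and "customer_id" are assumed column names)
        DataFrame totals = orders
            .Filter(orders["amount"] > 0)
            .GroupBy("customer_id")
            .Sum("amount");

        // Load: write the result as Parquet, replacing any previous output
        totals.Write().Mode(SaveMode.Overwrite).Parquet("order_totals.parquet");

        spark.Stop();
    }
}
```

Writing to a columnar format such as Parquet is one common choice for the load step; a JDBC sink or another file format could be used instead, depending on the warehouse.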
Apache Spark can process large streams of data in real time, for example to monitor streams of sensor data or to analyze financial transactions to detect fraud.
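A minimal C# sketch of stream processing with Structured Streaming might look like the following. It assumes a text source listening on localhost port 9999 (for example, one started with a tool such as netcat), which is a stand-in for a real sensor or transaction feed, and it keeps a running count of the words received.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;
using static Microsoft.Spark.Sql.Functions;

class StreamingSketch
{
    static void Main(string[] args)
    {
        var spark = SparkSession.Builder().AppName("streaming_sketch").GetOrCreate();

        // Read lines of text from a socket as an unbounded DataFrame
        // (host and port are assumptions made for this example)
        DataFrame lines = spark.ReadStream()
            .Format("socket")
            .Option("host", "localhost")
            .Option("port", "9999")
            .Load();

        // Split each line into words and keep a running count per word
        DataFrame counts = lines
            .Select(Explode(Split(lines["value"], " ")).Alias("word"))
            .GroupBy("word")
            .Count();

        // Print the full, updated result table to the console after every micro-batch
        StreamingQuery query = counts.WriteStream()
            .OutputMode("complete")
            .Format("console")
            .Start();

        // Block until the query is stopped or fails
        query.AwaitTermination();
    }
}
```

The socket source is intended for experimentation only; a production job would more likely read from a durable source such as Kafka or a file directory.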
Apache Spark can reduce the cost and time involved in building machine learning models through distributed processing of data preparation and model training, in the same program.
Modern business often requires analyzing large amounts of data in an exploratory manner. Apache Spark is well suited to the ad hoc nature of the required data processing.
The .NET bindings for Spark are written on the Spark interop layer, designed to provide high performance bindings to multiple languages.
.NET for Apache Spark is compliant with .NET Standard—a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code.
Kick-start your journey into big data analytics with this introductory video series about .NET for Apache Spark! Learn all about .NET for Apache Spark and how it brings the world of big data to the .NET ecosystem.
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Create a Spark session
var spark = SparkSession
    .Builder()
    .AppName("word_count_sample")
    .GetOrCreate();

// Create a DataFrame
DataFrame dataFrame = spark.Read().Text("input.txt");

// Manipulate and view data
var words = dataFrame.Select(Split(dataFrame["value"], " ").Alias("words"));

words.Select(Explode(words["words"])
    .Alias("word"))
    .GroupBy("word")
    .Count()
    .Show();
open Microsoft.Spark.Sql

// Create a Spark session
let spark =
    SparkSession.Builder()
        .AppName("word_count_sample")
        .GetOrCreate()

// Create a DataFrame
let df = spark.Read().Text("input.txt")

// Manipulate and view data
let words = df.Select(Functions.Split(df.["value"], " ").Alias("words"))

words.Select(Functions.Explode(words.["words"]).Alias("word"))
    .GroupBy("word")
    .Count()
    .Show()
.NET for Apache Spark gives you APIs for using Apache Spark from C# and F#. With the .NET APIs you can access all aspects of Apache Spark, including Spark SQL (for working with structured data) and Spark Streaming.
Total execution time (seconds) for all 22 queries in the TPC-H benchmark (lower is better). Data sourced from an internal run of the TPC-H benchmark, using warm execution on Ubuntu 16.04. For benchmark methodology and detailed results, see .NET for Apache Spark performance.
.NET for Apache Spark is designed for high performance and performs well on the TPC-H benchmark.
The TPC-H benchmark consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance.
public void Run(string[] args)
{
    var spark = SparkSession.Builder().AppName("Sample").GetOrCreate();

    // Register a user-defined function (UDF) that runs C# code
    spark.Udf().Register<string, bool>("MyUDF", (text) => Sentiment(text));

    // CreateOrReplaceTempView returns void, so call it on the DataFrame separately
    var df = spark.Read().Csv("data.csv");
    df.CreateOrReplaceTempView("Tweets");

    // Call the UDF from a Spark SQL query and display the result
    var sqlDf = spark.Sql("SELECT _c0, MyUDF(_c1) FROM Tweets");
    sqlDf.Show();
}

public static bool Sentiment(string text)
{
    // Use ML.NET model to predict positive or negative sentiment of text
    var predictionEngine = GetPredictionEngine();
    var result = predictionEngine.Predict(new Tweet {Text = text});
    return result.Prediction;
}
type Tweet = { Text: string }

let getSentiment (tweet: Tweet) =
    // Use ML.NET model to predict positive or negative sentiment of text
    let predictionEngine = GetPredictionEngine()
    let result = predictionEngine.Predict(tweet)
    result.Prediction

let run (args: string[]) =
    let spark = SparkSession.Builder().AppName("Sample").GetOrCreate()

    // Register a user-defined function (UDF) that runs F# code
    spark.Udf().Register<string, bool>("MyUDF", fun text -> getSentiment { Text = text })

    // Load the data and expose it to Spark SQL as a temporary view
    let df = spark.Read().Csv("data.csv")
    df.CreateOrReplaceTempView("Tweets")

    // Call the UDF from a Spark SQL query and display the result
    let sqlDf = spark.Sql("SELECT _c0, MyUDF(_c1) FROM Tweets")
    sqlDf.Show()
.NET for Apache Spark lets you re-use all the knowledge, skills, code, and libraries you already have as a .NET developer.
Your data processing code can also utilize the large ecosystem of libraries available to .NET developers, such as Newtonsoft.Json, ML.NET, MathNet.Numerics, NodaTime, and more.
.NET for Apache Spark can be used on Linux, macOS, and Windows, just like the rest of .NET.
.NET for Apache Spark is available by default in Azure HDInsight, and can be installed in Azure Databricks, Azure Kubernetes Service, AWS Databricks, AWS EMR, and more.
.NET for Apache Spark is part of the open-source .NET platform that has a strong community of contributors from more than 3,700 companies.
.NET is free, and that includes .NET for Apache Spark. There are no fees or licensing costs, including for commercial use.
Our step-by-step tutorial will help you get .NET for Apache Spark running on your computer.
Apache Spark, Spark, and Apache are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.