
.NET for Apache® Spark™

A free, open-source, and cross-platform big data analytics framework


Supported on Windows, Linux, and macOS

What is Apache Spark?

Apache Spark™ is a general-purpose distributed processing engine for analytics over large data sets, typically terabytes or petabytes of data. Apache Spark can be used for batch processing, real-time streams, machine learning, and ad hoc queries.

Processing tasks are distributed over a cluster of nodes, and data is cached in memory to reduce computation time.

Data preparation

Apache Spark is often used for high-volume data preparation pipelines, such as extract, transform, and load (ETL) processes that are common in data warehousing.
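
As a rough illustration of such a pipeline, here is a minimal C# sketch of an extract, transform, and load step with .NET for Apache Spark. The file names (orders.csv, orders_clean.parquet) and the column names (order_id, amount) are hypothetical and chosen only for this example; it assumes the Microsoft.Spark package is referenced and the program is launched through spark-submit.

```csharp
using Microsoft.Spark.Sql;

class EtlSketch
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("etl_sketch").GetOrCreate();

        // Extract: read a CSV file with a header row (hypothetical path and columns)
        DataFrame orders = spark.Read()
            .Option("header", "true")
            .Option("inferSchema", "true")
            .Csv("orders.csv");

        // Transform: keep rows with a positive amount and only the columns we need
        DataFrame cleaned = orders
            .Filter(orders["amount"] > 0)
            .Select("order_id", "amount");

        // Load: write the result as Parquet, replacing any previous output
        cleaned.Write().Mode("overwrite").Parquet("orders_clean.parquet");

        spark.Stop();
    }
}
```

Each stage is an ordinary DataFrame operation, so Spark can plan and distribute the whole pipeline across the cluster.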

Real-time processing

Large streams of data can be processed in real-time with Apache Spark, such as monitoring streams of sensor data or analyzing financial transactions to detect fraud.
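
The following C# sketch shows one way a stream might be consumed with Spark Structured Streaming from .NET. It assumes a text socket source on localhost port 9999 where each line is an event name; the host, port, and app name are illustrative assumptions, not part of the original page.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;

class StreamingSketch
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("streaming_sketch").GetOrCreate();

        // Read lines from a socket; the socket source exposes one string column named "value"
        DataFrame events = spark.ReadStream()
            .Format("socket")
            .Option("host", "localhost")
            .Option("port", "9999")
            .Load();

        // Keep a running count per distinct event value
        DataFrame counts = events.GroupBy("value").Count();

        // Print the full, updated result table to the console after each micro-batch
        StreamingQuery query = counts.WriteStream()
            .OutputMode("complete")
            .Format("console")
            .Start();

        query.AwaitTermination();
    }
}
```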

Machine learning

Apache Spark can reduce the cost and time involved in building machine learning models by distributing both data preparation and model training within the same program.

Interactive query

Modern business often requires analyzing large amounts of data in an exploratory manner. Apache Spark is well suited to the ad hoc nature of this kind of data processing.

.NET for Apache Spark is built on the same interop layer as SparkR and PySpark.

What is .NET for Apache Spark?

The .NET bindings for Spark are built on the Spark interop layer, which is designed to provide high-performance bindings to multiple languages.

.NET for Apache Spark is compliant with .NET Standard—a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code.

.NET for Apache Spark 101

Kick-start your journey into big data analytics with this introductory video series about .NET for Apache Spark! Learn all about .NET for Apache Spark and how it brings the world of big data to the .NET ecosystem.

// C#
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Create a Spark session
var spark = SparkSession
    .Builder()
    .AppName("word_count_sample")
    .GetOrCreate();

// Create a DataFrame
DataFrame dataFrame = spark.Read().Text("input.txt");

// Manipulate and view data
var words = dataFrame.Select(Split(dataFrame["value"], " ").Alias("words"));

words.Select(Explode(words["words"]).Alias("word"))
    .GroupBy("word")
    .Count()
    .Show();

// F#
open Microsoft.Spark.Sql

// Create a Spark session
let spark =
    SparkSession.Builder()
        .AppName("word_count_sample")
        .GetOrCreate()

// Create a DataFrame
let df = spark.Read().Text("input.txt")

// Manipulate and view data
let words = df.Select(Functions.Split(df.["value"], " ").Alias("words"))

words.Select(Functions.Explode(words.["words"]).Alias("word"))
     .GroupBy("word")
     .Count()
     .Show()
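
To make the word count pipeline easier to follow, here is a small local analogue written with LINQ on in-memory strings. It is only a sketch of what the Split, Explode, GroupBy, and Count steps compute; it does not use Spark, and the class and method names (LocalWordCount, CountWords) are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalWordCount
{
    // Mirrors Split + Explode + GroupBy("word") + Count() on in-memory lines
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines) =>
        lines
            .SelectMany(line => line.Split(' '))        // Split each line, then flatten (Explode)
            .GroupBy(word => word)                       // GroupBy("word")
            .ToDictionary(g => g.Key, g => g.Count());   // Count() per group

    public static void Main()
    {
        var lines = new[] { "the quick brown fox", "the lazy dog" };
        foreach (var pair in CountWords(lines))
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

The Spark version expresses the same steps as DataFrame operations, which lets the work be split across a cluster instead of running in a single process.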



Apache Spark with C# or F#

.NET for Apache Spark gives you APIs for using Apache Spark from C# and F#. With the .NET APIs, you can access all aspects of Apache Spark, including Spark SQL (for working with structured data) and Spark Streaming.

Get started with .NET for Apache Spark

To complete all 22 queries in the TPC-H benchmark, .NET took 406 seconds, Python took 433 seconds, and Scala took 375 seconds.

Total execution time (seconds) for all 22 queries in the TPC-H benchmark (lower is better). Data sourced from an internal run of the TPC-H benchmark, using warm execution on Ubuntu 16.04. For benchmark methodology and detailed results, see .NET for Apache Spark performance.
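
To put the quoted totals in proportion, the short C# sketch below computes each language's total time relative to a baseline. The numbers are the ones given above; the class and method names (TpchComparison, RelativeTo) are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TpchComparison
{
    // Total seconds for all 22 TPC-H queries, as quoted in the text
    public static readonly IReadOnlyDictionary<string, double> TotalSeconds =
        new Dictionary<string, double>
        {
            ["Scala"] = 375,
            [".NET"] = 406,
            ["Python"] = 433,
        };

    // Ratio of one language's total time to the baseline's total time
    public static double RelativeTo(string language, string baseline) =>
        TotalSeconds[language] / TotalSeconds[baseline];

    public static void Main()
    {
        foreach (var language in TotalSeconds.Keys)
        {
            double ratio = RelativeTo(language, "Scala");
            Console.WriteLine(string.Format(
                CultureInfo.InvariantCulture, "{0}: {1:F2}x the Scala time", language, ratio));
        }
    }
}
```

On these figures, the .NET total is about 8% above the Scala total and about 6% below the Python total.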

High performance

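.NET for Apache Spark is designed for high performance. In an internal run of the TPC-H benchmark, it completed all 22 queries in 406 seconds, compared with 433 seconds for Python and 375 seconds for Scala.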

The TPC-H benchmark consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance.

.NET for Apache Spark performance

// C#
using Microsoft.Spark.Sql;

public void Run(string[] args)
{
    var spark = SparkSession.Builder().AppName("Sample").GetOrCreate();

    // Register a user-defined function (UDF) that runs C# code
    spark.Udf().Register<string, bool>("MyUDF", (text) => Sentiment(text));

    // CreateOrReplaceTempView returns void, so keep the DataFrame in its own variable
    var df = spark.Read().Csv("data.csv");
    df.CreateOrReplaceTempView("Tweets");

    var sqlDf = spark.Sql("SELECT _c0, MyUDF(_c1) FROM Tweets");
    sqlDf.Show();
}

public static bool Sentiment(string text)
{
    // Use ML.NET model to predict positive or negative sentiment of text
    // (GetPredictionEngine and Tweet are placeholders defined elsewhere)
    var predictionEngine = GetPredictionEngine();
    var result = predictionEngine.Predict(new Tweet {Text = text});
    return result.Prediction;
}

// F#
open Microsoft.Spark.Sql

type Tweet = { Text: string }

// Defined before run, because F# requires a function to be declared before it is used
let sentiment (text: string) : bool =
    // Use ML.NET model to predict positive or negative sentiment of text
    // (GetPredictionEngine is a placeholder defined elsewhere)
    let predictionEngine = GetPredictionEngine()
    let result = predictionEngine.Predict({ Text = text })
    result.Prediction

let run (args: string[]) =
    let spark = SparkSession.Builder().AppName("Sample").GetOrCreate()

    // Register a user-defined function (UDF) that runs F# code
    spark.Udf().Register<string, bool>("MyUDF", fun text -> sentiment text)

    // CreateOrReplaceTempView returns unit, so keep the DataFrame in its own binding
    let df = spark.Read().Csv("data.csv")
    df.CreateOrReplaceTempView("Tweets")

    let sqlDf = spark.Sql("SELECT _c0, MyUDF(_c1) FROM Tweets")
    sqlDf.Show()

Leverage the .NET ecosystem

.NET for Apache Spark lets you reuse the knowledge, skills, code, and libraries you already have as a .NET developer.

Your data processing code can also utilize the large ecosystem of libraries available to .NET developers, such as Newtonsoft.Json, ML.NET, MathNet.Numerics, NodaTime, and more.

Build anywhere. Run anywhere.

.NET for Apache Spark can be used on Linux, macOS, and Windows, just like the rest of .NET.

.NET for Apache Spark is available by default in Azure HDInsight, and can be installed in Azure Databricks, Azure Kubernetes Service, AWS Databricks, AWS EMR, and more.


Open-source and free

.NET for Apache Spark is part of the open-source .NET platform that has a strong community of over 60,000 contributors from more than 3,700 companies.

.NET is free, and that includes .NET for Apache Spark. There are no fees or licensing costs, including for commercial use.

Visit .NET for Apache Spark on GitHub

Ready to Get Started?

Our step-by-step tutorial will help you get .NET for Apache Spark running on your computer.


Apache Spark, Spark, and Apache are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.