
SigParser uses ML.NET to detect "non-human" e-mails

Customer: SigParser
Products & services: ML.NET, Office 365
Industry: Software / Telecommunications
Organization size: Small (<100 employees)
Country: USA

SigParser is an API and service that automates the tedious (and often expensive) process of adding to and maintaining customer relationship management (CRM) systems. SigParser extracts contact information, such as names, e-mail addresses, and phone numbers, from e-mail signatures and feeds all that information as contacts into CRM systems or databases.

The SigParser application lets you provide a sample email and preview the metadata it is able to determine about the email.

Business problem

When SigParser processes e-mails for a company, many of them are non-human (e.g., newsletters, payment notifications, and password resets). The sender information from these types of e-mails should not show up in contact lists or be pushed into a CRM system. SigParser therefore decided to use machine learning to predict whether an e-mail message is "spammy looking."

Take the following notification e-mail from a forum as an example. The sender of this e-mail isn't a contact that should show up in a CRM, so a machine learning model predicts that "isSpammyLookingEmailMessage" is true:

The sample e-mail comes from a noreply e-mail address and contains generated information about unread notifications.

SigParser classifies the sample e-mail as a "spammy looking e-mail message" using its ML.NET model.

Why ML.NET?

When the team at SigParser decided to use machine learning, they originally tried R; however, they found it very difficult to maintain and to integrate with their API, which is built with .NET Core.

Paul Mendoza, CEO and founder of SigParser, said that R "was just too disconnected from the development process. With R we were generating all the constants and then we would copy and paste those into .NET and then try the model out for real and learn it didn't quite work and have to repeat. This was too slow."

Thus, they turned to ML.NET to bring everything into one application.

"With ML.NET, we're able to train the model and then immediately test it inside of our code. This makes shipping new changes faster because all the tooling was together in one place."

Paul Mendoza, CEO and founder of SigParser

Impact of ML.NET

Moving from R to ML.NET has brought a 10x productivity improvement. Additionally, until SigParser moved to ML.NET, they used only one machine learning model; since the conversion, they have six machine learning models for various aspects of e-mail parsing. This increase came about because ML.NET makes it possible to experiment with new machine learning ideas quickly and show the results in the application.

Solution architecture

Data processing

SigParser first used the well-known Enron dataset to train their model. When they realized that it was quite outdated, they instead labeled a couple thousand e-mails from their own e-mail accounts (in keeping with GDPR compliance) as either human or non-human and used these as the training dataset.

Machine learning features

SigParser's ML.NET model has two features (used to make the prediction "IsHumanEmail"):

  • HasUnsubscribes — True if an e-mail has an "unsubscribe" or "opt out" in the e-mail body
  • EmailBodyCleaned — Normalizes the HTML e-mail body to make the e-mail language agnostic and to remove any personally identifiable information
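
The source does not describe how SigParser actually computes these two features, so the following is only a minimal sketch under stated assumptions. It assumes that "language agnostic" can be approximated by keeping the HTML tag structure and replacing every run of text with a placeholder, which also drops any personal data contained in the text. The class EmailFeatureSketch, its method names, and the TEXT placeholder are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; not SigParser's actual implementation.
public static class EmailFeatureSketch
{
    // Assumed phrases: "unsubscribe", "opt out", "opt-out", "optout" (case-insensitive).
    private static readonly Regex UnsubscribePattern =
        new Regex(@"\bunsubscribe|\bopt[\s-]?out\b", RegexOptions.IgnoreCase);

    // Matches either an HTML tag (capturing its name) or a run of text between tags.
    private static readonly Regex HtmlTokenPattern =
        new Regex(@"<\s*(?<close>/?)\s*(?<tag>[a-zA-Z][a-zA-Z0-9]*)[^>]*>|(?<text>[^<]+)");

    // True if the raw body mentions an unsubscribe or opt-out phrase.
    public static bool HasUnsubscribes(string body) =>
        !string.IsNullOrEmpty(body) && UnsubscribePattern.IsMatch(body);

    // Keeps lowercase tag names (attributes removed) and replaces each
    // non-blank text run with the placeholder "TEXT".
    public static string CleanBody(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in HtmlTokenPattern.Matches(html ?? string.Empty))
        {
            if (m.Groups["tag"].Success)
            {
                tokens.Add("<" + m.Groups["close"].Value
                    + m.Groups["tag"].Value.ToLowerInvariant() + ">");
            }
            else if (!string.IsNullOrWhiteSpace(m.Groups["text"].Value))
            {
                tokens.Add("TEXT");
            }
        }
        return string.Join(" ", tokens);
    }
}
```

In this sketch HasUnsubscribes has to run on the raw body, because CleanBody discards the wording that the phrase check depends on.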

Machine learning algorithm

These two features are fed into a binary FastTree trainer, a boosted decision tree algorithm for classification scenarios, and the output is the prediction of whether the e-mail was sent by a "real human" or by an automated source. SigParser currently processes millions of e-mails per month with this ML.NET model.


// Create the ML.NET context shared by all operations.
var mlContext = new MLContext();

// Hold out 20% of the labeled e-mails for evaluation.
var (trainData, testData) = mlContext.BinaryClassification.TrainTestSplit(mlContext.CreateStreamingDataView(totalSampleSet), testFraction: 0.2);

// Featurize the cleaned body text, combine it with HasUnsubscribes, and train FastTree on the result.
var pipeline = mlContext.Transforms.Text.FeaturizeText("EmailBodyCleaned", "EmailHTMLFeaturized")
    .Append(mlContext.Transforms.Concatenate("Features", "HasUnsubscribes", "EmailHTMLFeaturized"))
    .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumn: "IsHumanEmail", featureColumn: "Features"));

Console.WriteLine("Fitting data");
var fitResult = pipeline.Fit(trainData);

// Score the held-out e-mails and report accuracy.
Console.WriteLine("Evaluating metrics");
var metrics = mlContext.BinaryClassification.Evaluate(fitResult.Transform(testData), label: "IsHumanEmail");
Console.WriteLine("Accuracy: " + metrics.Accuracy);

// Save the trained model so the application can load it for predictions.
using (var stream = File.Create(emailParsingPath + "EmailHTMLTypeClassifier.zip"))
{
    mlContext.Model.Save(fitResult, stream);
}
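
The listing above uses calls from a pre-1.0 ML.NET release (for example CreateStreamingDataView and the labelColumn parameter). As a rough sketch only, assuming ML.NET 1.x or later with the Microsoft.ML and Microsoft.ML.FastTree packages, the same pipeline plus a prediction step might look like the following. The classes EmailSample, EmailPrediction, and EmailClassifierSketch are hypothetical, and HasUnsubscribes is assumed to be stored as a float so that it can be concatenated with the text features.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input row; column names follow the listing above.
public class EmailSample
{
    public bool IsHumanEmail { get; set; }        // label
    public float HasUnsubscribes { get; set; }    // assumed encoding: 1 if a phrase was found, else 0
    public string EmailBodyCleaned { get; set; }
}

// Hypothetical output row; PredictedLabel and Probability are the
// standard output columns of a calibrated binary classifier.
public class EmailPrediction
{
    public bool PredictedLabel { get; set; }
    public float Probability { get; set; }
}

public static class EmailClassifierSketch
{
    // Trains on the given samples, prints accuracy, and saves the model.
    public static ITransformer TrainAndSave(
        MLContext mlContext, IEnumerable<EmailSample> samples, string modelPath)
    {
        IDataView data = mlContext.Data.LoadFromEnumerable(samples);
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

        // Note the argument order: output column first, then input column.
        var pipeline = mlContext.Transforms.Text
            .FeaturizeText("EmailHTMLFeaturized", "EmailBodyCleaned")
            .Append(mlContext.Transforms.Concatenate(
                "Features", "HasUnsubscribes", "EmailHTMLFeaturized"))
            .Append(mlContext.BinaryClassification.Trainers.FastTree(
                labelColumnName: "IsHumanEmail", featureColumnName: "Features"));

        ITransformer model = pipeline.Fit(split.TrainSet);

        var metrics = mlContext.BinaryClassification.Evaluate(
            model.Transform(split.TestSet), labelColumnName: "IsHumanEmail");
        Console.WriteLine("Accuracy: " + metrics.Accuracy);

        mlContext.Model.Save(model, data.Schema, modelPath);
        return model;
    }

    // Loads a saved model and classifies a single e-mail.
    public static EmailPrediction Predict(
        MLContext mlContext, string modelPath, EmailSample email)
    {
        ITransformer model = mlContext.Model.Load(modelPath, out DataViewSchema inputSchema);
        var engine = mlContext.Model
            .CreatePredictionEngine<EmailSample, EmailPrediction>(model);
        return engine.Predict(email);
    }
}
```

A PredictedLabel of false would correspond to the non-human case, whose sender should be kept out of the contact list. The evaluation step needs both classes present in the held-out split, so a sketch like this is only meaningful with a reasonably sized labeled set.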

SigParser uses ML.NET's data transformations and algorithms for multiple machine learning solutions, including the spam detection model described above. These solutions have enabled SigParser to automatically export the correct contact information from e-mail signatures to customer databases, bypassing time-consuming and error-prone manual contact data entry.
