Microsoft Introduces SynapseML, an Open-Source Library to Develop Scalable Machine Learning Pipelines

By Abhishek Jadhav

Freelance Tech Writer

January 03, 2022


Developing large-scale machine learning solutions has always been a struggle.

Even with a suitable architecture, highly scalable machine learning pipelines require combining a number of infrastructure platforms and frameworks that weren't built to work together seamlessly. Regardless of their experience in developing ML applications, most developers find orchestrating several ML tools difficult. Microsoft's SynapseML aims to solve these issues.

SynapseML is an open-source library built from the ground up to construct massively scalable machine learning pipelines. It extends Apache Spark's capabilities to better address the needs of large-scale machine learning applications. For example, SynapseML can be used from any Spark-compatible language, such as Python, Scala, R, Java, .NET, and C#. The library uses a distributed programming approach to divide a machine learning task across thousands of machines while keeping GPUs and CPUs fully utilized. More importantly, all of this is handled through a single API that can rely on frameworks like LightGBM or XGBoost underneath.

Microsoft’s SynapseML Simplifies Distributed Machine Learning

Writing an error-free application for a distributed system can be tedious and time-consuming, especially for the distributed evaluation of a deep network. The model must be sent to hundreds of machines, and the data must be queued so that the data readers work together and keep the GPUs at full capacity. When new machines join the cluster, they need copies of the model, and the data readers must adapt to share work with them. Lost work must also be recomputed, resources must be verified as correctly freed after use, and progress must be tracked.

Frameworks exist that can manage all these tasks, but comparing against a different machine learning model typically requires creating a new cluster or environment. These training systems also aren't meant to serve or deploy models, so separate inference and streaming infrastructure is needed.
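To make the bookkeeping described above more concrete, here is a minimal sketch in plain Python. It is not SynapseML or Spark code: a thread pool stands in for a cluster, and the names `partition`, `run_distributed`, and `flaky_double` are hypothetical helpers invented for this illustration. It shows only three of the concerns mentioned, namely splitting work across workers, recomputing lost work, and releasing resources afterwards.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_parts):
    """Split rows into num_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def run_distributed(rows, score_partition, num_workers=4, max_attempts=3):
    """Score every partition, re-running any partition whose worker fails."""
    parts = partition(rows, num_workers)
    results = [None] * len(parts)
    pending = list(range(len(parts)))
    attempts = 0
    # The context manager shuts the pool down, so worker resources are freed.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while pending and attempts < max_attempts:
            attempts += 1
            futures = {i: pool.submit(score_partition, parts[i]) for i in pending}
            failed = []
            for i, future in futures.items():
                try:
                    results[i] = future.result()
                except Exception:
                    failed.append(i)  # lost work: schedule it for recomputation
            pending = failed
    if pending:
        raise RuntimeError(f"partitions {pending} failed after {max_attempts} attempts")
    # Flatten per-partition outputs into one list, preserving input order.
    return [y for part in results for y in part]


# Hypothetical scoring function that loses one partition on its first attempt.
_seen = set()
_lock = threading.Lock()


def flaky_double(part):
    key = tuple(part)
    with _lock:
        first_time = key not in _seen
        _seen.add(key)
    if first_time and part and part[0] == 0:
        raise ConnectionError("simulated worker loss")
    return [2 * x for x in part]


scores = run_distributed(list(range(10)), flaky_double)
```

Even this toy version needs retry loops, ordering logic, and cleanup, which hints at why doing the same across hundreds of real machines is hard to get right by hand.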

[Image Credit: Microsoft]

SynapseML provides a standardized API for interacting with various machine learning frameworks, and it is data-, platform-, and language-agnostic. The API is scalable and can be used for streaming and serving applications. The library is intended for developers who want to concentrate on the high-level structure of their data and tasks rather than on the implementation details and quirks of various ML ecosystems and databases. This improves the developer experience by letting engineers work with a variety of ML tools and frameworks in less time.

A differentiating factor of SynapseML is that it provides simple APIs for pre-built intelligent services, such as Azure Cognitive Services. This helps businesses and researchers solve large-scale artificial intelligence problems. Developers can use SynapseML to integrate over 45 different cutting-edge machine learning services directly into their systems and databases. The latest release adds support for distributed form recognition, dialogue transcription, and translation. These services can analyze a wide range of documents, transcribe multi-speaker conversations in real time, and translate text into more than 100 languages.

SynapseML adds new networking features to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. The Spark Serving project, backed by a Spark cluster, enables high-throughput web services with sub-millisecond latency for production deployment. SynapseML also brings data science and deep learning tools to the Spark ecosystem, including Open Neural Network Exchange (ONNX), LightGBM, the Cognitive Services, Vowpal Wabbit, and OpenCV.
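The idea of one interface over many backends can be illustrated with a small sketch. It borrows the fit/transform convention that Spark ML pipelines use (an estimator is fitted to data and returns a model that transforms new data), but everything else here is an assumption made for illustration: `MeanCenterer`, `MaxScaler`, and `train_and_apply` are toy stand-ins written in plain Python, not SynapseML classes.

```python
from typing import List, Protocol, Sequence


class Model(Protocol):
    def transform(self, rows: Sequence[float]) -> List[float]: ...


class Estimator(Protocol):
    def fit(self, rows: Sequence[float]) -> Model: ...


class _Shift:
    """Fitted model that subtracts a learned offset."""

    def __init__(self, offset: float) -> None:
        self.offset = offset

    def transform(self, rows: Sequence[float]) -> List[float]:
        return [x - self.offset for x in rows]


class MeanCenterer:
    """Toy backend #1: learns the mean of the training data."""

    def fit(self, rows: Sequence[float]) -> _Shift:
        return _Shift(sum(rows) / len(rows))


class _Scale:
    """Fitted model that divides by a learned factor."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform(self, rows: Sequence[float]) -> List[float]:
        return [x / self.factor for x in rows]


class MaxScaler:
    """Toy backend #2: learns the largest absolute value in the training data."""

    def fit(self, rows: Sequence[float]) -> _Scale:
        return _Scale(max(abs(x) for x in rows))


def train_and_apply(estimator: Estimator, train: Sequence[float],
                    new: Sequence[float]) -> List[float]:
    """Calling code depends only on fit/transform, not on the backend."""
    return estimator.fit(train).transform(new)


centered = train_and_apply(MeanCenterer(), [1.0, 2.0, 3.0], [4.0])
scaled = train_and_apply(MaxScaler(), [1.0, 2.0, 4.0], [2.0])
```

Swapping one backend for another changes a single argument, which is the kind of developer experience a standardized API is meant to offer.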

[Image Credit: Microsoft]

SynapseML's Extensive Compatibility with Open Neural Network Exchange

The SynapseML library lets developers use models from many ML ecosystems for tasks that the existing Cognitive Services cannot handle. With just a few lines of code, developers can scale a wide range of classical and deep learning models. The ONNX-Spark integration distributes ONNX models to worker nodes, batches and buffers the incoming data for high throughput, and schedules work on hardware accelerators automatically.
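The batching step mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the ONNX-Spark implementation: `mini_batches`, `batched_inference`, and `square_batch` are hypothetical names, and `square_batch` is a stand-in for a real model call that accepts many rows at once.

```python
from itertools import islice


def mini_batches(rows, batch_size):
    """Group an incoming stream of rows into lists of at most batch_size."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def batched_inference(rows, predict_batch, batch_size=3):
    """Call the model once per batch, then emit one prediction per input row."""
    for batch in mini_batches(rows, batch_size):
        yield from predict_batch(batch)


def square_batch(batch):
    # Stand-in for a model that scores a whole batch in a single call.
    return [x * x for x in batch]


predictions = list(batched_inference(range(7), square_batch))
```

With seven rows and a batch size of three, the stand-in model is called three times instead of seven. Amortizing per-call overhead in this way is what lets batching raise throughput on accelerators.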

[Image Credit: Microsoft]

Integrating ONNX with Spark enables developers to scale deep learning models. It also helps distribute inference across a wide range of ML ecosystems. Models from TensorFlow, scikit-learn, Core ML, LightGBM, XGBoost, H2O, and PyTorch can be converted to ONNX with ONNXMLTools, which makes them available for fast, distributed inference with SynapseML. Furthermore, with the ONNX Model Hub, developers can deploy over 120 cutting-edge pre-trained models with ease, spanning domains such as vision, object detection, face analysis, and style transfer.

Final Thoughts

Microsoft has linked the framework to the Azure Synapse Analytics platform, giving the SynapseML release expanded capabilities. This means the library will be offered as a native Azure service with enterprise support. SynapseML is an intriguing initiative aimed at reducing fragmentation in the market for machine learning tools and frameworks. It will be fascinating to see how the ML community reacts to and adopts it.

For further details, check out Microsoft's SynapseML documentation page.

Abhishek Jadhav is an engineering student, freelance tech writer, RISC-V Ambassador, and leader of the Open Hardware Developer Community.
