Data Science and the Go Programming Language
Why Northwestern SPS Incorporates the Go Programming Language into the Master’s in Data Science
Comments by Tom Miller, Faculty Director of Northwestern’s Data Science Program.
Years ago, as a student of applied statistics at the University of Minnesota, I learned a lesson about programming in academia. At the start of the course, the professor said,
"I don't care what language you use for assignments, as long as you do your own work."
I had experience with Fortran but was teaching myself Pascal, trying to adopt a structured programming style.
Taking the professor at his word, I programmed the first assignment in Pascal while my classmates used Fortran. The first assignment comes due. I walk my paper (a program listing) to the front of the room and hand it to the professor. He looks at it quizzically and asks, "What's this?"
I explain, "It’s Pascal. You told us we could program in any language we like, as long as we do our own work."
To which the professor says, "Pascal. I don't read Pascal. I only read Fortran."
Lesson learned: Academics are not especially open to new programming languages.
FORTRAN
Fortran was developed by John Backus at IBM and introduced in 1957. When you hear its name, think “formula translation.” Fortran is well-suited for numeric calculations, as needed for scientific and engineering applications. Fortran has seen a resurgence recently, perhaps due to the computational demands of large data sets and supercomputing.
PASCAL
Designed by Niklaus Wirth, a Swiss computer scientist, and introduced in 1970, Pascal is a derivative of ALGOL. Pascal was aligned with a movement toward structured programming at many universities in the 1970s and 80s. Variations on Pascal have been used for systems programming at Apple and Microsoft.
Data science students at most universities today would have a similar experience if they were to submit assignments in Go, Rust, or any other contemporary language rather than Python.
With machine learning applications and AI, Python rules the day. Data scientists might feel content sailing along in a Python boat with life preservers such as NumPy, pandas, scikit-learn, and TensorFlow by their sides.
But watch out. Today’s data oceans are choppy. Sharks are approaching.
Recall the words of Chief Brody to Quint in the movie Jaws: "You’re gonna need a bigger boat." I would suggest that a bigger, faster boat be built with Go.
GO (GOLANG)
Go was developed by three Google computer scientists: Robert Griesemer, Rob Pike, and Ken Thompson. It retains much of the performance advantage of C, while being easier and safer to work with than C. Go was introduced in 2009 and has become a widely used systems programming language at Google. For mission-critical systems in many organizations, Go is replacing C/C++, C#, Java, and Python. Go is sometimes called “Golang” to distinguish it from the Go board game and to provide a more reliable term in search engines.
Data Science Careers: The Why of Go
In a presentation entitled “The Why of Go,” Carmen Andoh traced the development of computer languages from 1980 through 2017. She made a convincing argument for using Go in large programming projects. Her argument rings true today.
- Go Is Machine Efficient. It beats languages that are interpreted as well as languages that depend on virtual machines.
- Python joined the computer scene more than thirty years ago, before the prevalence of multicore processors. Python is an interpreted language, and its standard implementation (CPython) has long relied on a global interpreter lock that lets only one thread execute Python bytecode at a time. That makes Python poorly suited for systems that demand concurrent processing.
- Data scientists may be writing in Python, but for compute-intensive tasks it is C or C++ that does the work. Python is just the “glue” that holds the pieces of the machine learning boat together.
- It does not take long to find examples of benchmarks demonstrating the advantages of Go over Python and R, the leading languages in data science.
C, C++, AND C#
C was developed by Dennis Ritchie at Bell Labs and introduced in 1972. Because it provides low-level access to memory and maps easily to machine instructions, C has been a popular systems programming language for many years. C has performance advantages over most other programming languages. C++ adds object-oriented extensions to C, while retaining much of C's structure and performance advantages. C#, developed at Microsoft, borrows C-style syntax but is a separate object-oriented language that runs on the .NET runtime.
Concurrent processing (never an easy task) is an intrinsic feature of Go.
Go offers a rich set of tools for taking advantage of today’s multicore digital computers. Data science needs languages and systems that can handle the demands of today’s data-driven, data-intensive world. Data science needs Go.
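To make the point about built-in concurrency concrete, here is a minimal sketch that sums a slice of numbers across several goroutines using only the standard library. The function name parallelSum and its workers parameter are illustrative assumptions, not part of any course material or library.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelSum splits xs into roughly equal chunks, sums each chunk in its
// own goroutine, and then combines the partial sums. The workers parameter
// is a hypothetical tuning knob for this sketch.
func parallelSum(xs []float64, workers int) float64 {
	if workers < 1 {
		workers = 1
	}
	// One slot per goroutine, so no two goroutines write the same element.
	partial := make([]float64, workers)
	chunk := (len(xs) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		if lo >= len(xs) {
			break // fewer chunks than workers
		}
		hi := lo + chunk
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(slot int, part []float64) {
			defer wg.Done()
			for _, x := range part {
				partial[slot] += x
			}
		}(w, xs[lo:hi])
	}
	wg.Wait() // block until every goroutine has finished

	total := 0.0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	xs := make([]float64, 1000)
	for i := range xs {
		xs[i] = float64(i + 1)
	}
	fmt.Println(parallelSum(xs, 4)) // sum of 1 through 1000
}
```

The go keyword starts each worker, and sync.WaitGroup waits for all of them. Channels would be an equally idiomatic way to collect the partial results.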
Go Is Programmer Efficient. Python is often touted as easy to learn. But I would argue that Go is easier to learn than Python. Go is simplicity by design, a language with only twenty-five keywords. Go is easy to read, easy to use, and easy to maintain over time.
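The keyword count can be checked from within Go itself. The following sketch relies on the standard go/token package, whose Token type has an IsKeyword method. The helper name goKeywords and the loop bound of 256 are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"go/token"
)

// goKeywords returns every token that the standard go/token package
// classifies as a keyword.
func goKeywords() []string {
	var kws []string
	// Token values are small consecutive integers. The bound of 256 is a
	// generous assumption for this sketch, not a documented limit.
	for t := token.Token(0); t < 256; t++ {
		if t.IsKeyword() {
			kws = append(kws, t.String())
		}
	}
	return kws
}

func main() {
	kws := goKeywords()
	fmt.Println(len(kws), kws)
}
```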
Let’s be happy that the leaders of the Go community are reluctant to add new features. Donald Knuth had the right idea. When he released version 3 of TeX, he declared that there would be no new features, only bug fixes. And with each bug-fix release, the version number would gain another digit of π (pi): 3.1, 3.14, 3.141, and so on.
A mantra of Go programmers: “Keep it simple. Keep it running.”
Go has a well-defined structure with formatting utilities to ensure a common style across programmers. Go has automated memory management (garbage collection), protecting programmers from many memory leaks and memory-management errors. Go is safer than C and C++. Go modules also promote safety, ensuring that the right packages are incorporated into each build at compile time. Go keeps track of software versions as the software stack grows. Think of software development as a game of Jenga. We want to access the blocks at the bottom of the stack, while ensuring that the entire stack does not collapse. Go lets us do this.
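As a small illustration of the formatting utilities, the sketch below passes deliberately messy source text through the standard library's go/format package, which applies the same style rules as the gofmt tool. The helper name tidy and the sample source string are made up for this example.

```go
package main

import (
	"fmt"
	"go/format"
)

// tidy runs source text through the standard library's formatter.
func tidy(src string) (string, error) {
	out, err := format.Source([]byte(src))
	if err != nil {
		return "", err // the input was not syntactically valid Go
	}
	return string(out), nil
}

func main() {
	// Hypothetical, badly spaced input.
	messy := "package main\nfunc   add(a int,b int)int{return a+b}\n"
	clean, err := tidy(messy)
	if err != nil {
		fmt.Println("format failed:", err)
		return
	}
	fmt.Print(clean)
}
```

Because every programmer's code passes through the same formatter, style debates mostly disappear from code review.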
Go Simplifies the Software Stack. What about the software stack, the infrastructure?
When Python (even bolstered by C or C++) is not up to the task, data scientists turn to other languages and systems. Here is a so-called solution to Python’s performance problems:
To implement high-performance solutions, data scientists turn to Spark, which is built on Scala, which depends on the Java Virtual Machine. And to provide easy access, these well-meaning data scientists add PySpark to the mix. Is this the best way to address Python’s performance problems? No.
Consider a simpler software stack. It’s Go, just Go.
At the GopherCon 2021 conference, Daniel Whitenack showed how to implement machine learning and artificial intelligence solutions with Go. We can do this today.
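To give a sense of what modeling in plain Go can look like, here is a minimal sketch of least-squares simple linear regression using only the standard library. It is not taken from the GopherCon presentation. The function name fitLine and the sample data are assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// fitLine returns the least-squares intercept and slope for y ≈ a + b*x.
func fitLine(x, y []float64) (intercept, slope float64, err error) {
	n := len(x)
	if n < 2 || n != len(y) {
		return 0, 0, errors.New("need two or more paired observations")
	}
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/float64(n), sumY/float64(n)

	// Accumulate the cross-product and the sum of squares about the means.
	var sxy, sxx float64
	for i := range x {
		dx := x[i] - meanX
		sxy += dx * (y[i] - meanY)
		sxx += dx * dx
	}
	if sxx == 0 {
		return 0, 0, errors.New("x has no variation")
	}
	slope = sxy / sxx
	intercept = meanY - slope*meanX
	return intercept, slope, nil
}

func main() {
	x := []float64{1, 2, 3, 4}
	y := []float64{3, 5, 7, 9} // made-up data lying exactly on y = 1 + 2x
	a, b, err := fitLine(x, y)
	if err != nil {
		fmt.Println("fit failed:", err)
		return
	}
	fmt.Printf("intercept=%.2f slope=%.2f\n", a, b)
}
```

For larger problems, a numerical library would likely replace the hand-written loops, but the structure of the program stays the same.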
Go represents the quintessential systems programming language for today’s multicore, digital computers.
Go Is Widely Used in Industry. Companies value the safety, simplicity, and performance of Go. They also recognize Go’s strengths as a backend systems programming environment. Go is well-suited for developing web and database servers, application programming interfaces, and microservices. Go is well-suited for implementing scalable, high-performance systems.
Beginning with Google, the birthplace of Go, many companies rely on Go for large, mission-critical systems. If Go is good enough for Google, Netflix, Uber, Dropbox, PayPal, American Express, Capital One, Salesforce, Zillow, and many others, then Go is good enough for the rest of us.
If Go can provide an effective platform for building Docker, Kubernetes, Prometheus, Grafana, Pachyderm, Terraform, CrowdStrike, Dgraph, CockroachDB, Aerospike, and a diverse array of distributed systems and cloud-native microservices, then Go can be an effective platform for building data science applications.
Computer science and data science educators should learn from industry. They should add Go to their courses. This is what we are doing at Northwestern.
Three Languages for Data Science Careers
Using Go for data science does not imply that we must give up the good things that R and Python provide. We can be multilingual.
It is not hard to imagine projects for which a data scientist might explore data with R, develop models with Python, and implement systems in Go.
This figure shows the three languages for data science ranking among the top eight computer languages worldwide, according to the Institute of Electrical and Electronics Engineers (IEEE):
Among the three languages for data science, Go is the newest. Go is trending upward and offers substantial job opportunities.
Master's in Data Science Programming Languages
Northwestern’s data science program appreciates the strengths of the three languages for data science.
- R, with numerous packages for analytics and modeling, is well-regarded by applied statisticians. It is an excellent choice for scientific programming and applied research. R is especially good for exploring and visualizing data.
- Python is currently the most popular computer language in data science. It is especially strong in natural language processing and serves as the primary client to deep learning platforms. Python provides a feature-rich environment for developing models.
- Go is a systems programming language designed for today's multi-processor computers. It is well-suited for implementing scalable, high-performance systems for data science.
MS Data Science Online Courses Add Go
Students at Northwestern gain experience with these three languages and can tailor their studies to one language or another. Most courses in the Analytics and Modeling specialization have R as the primary language. Most courses in Artificial Intelligence use Python. And the Data Engineering specialization is moving to Go.
This year Northwestern faculty are introducing Go as the primary language in five data science courses:
- MSDS 431-DL Data Engineering with Go
- MSDS 432-DL Foundations of Data Engineering
- MSDS 434-DL Analytics Application Engineering
- MSDS 436-DL Analytics Systems Engineering
- MSDS 459-DL Knowledge Engineering
The work of data science does not end with data exploration and model development. To improve business processes, we must implement models in functioning systems. Data science and data engineering go hand in hand, putting data science into practice.
Data science plus data engineering plus Go is a winning combination.
Many Northwestern faculty members accept programming assignments in any of the three languages for data science. It is not hard to imagine a professor saying, "I don’t care if you program in Go, Python, or R, as long as you do your own work."
Organizations need data science, and data science needs Go.
Want to hear Tom Miller make the argument for Go in data science? Check out his GopherCon 2021 presentation.