What are the best tools for sifting through large data sets? We spoke with data hackers about their favorite languages and toolkits for core data analysis.

R language

On a list like this, if R is number two, nothing else deserves to be number one. Since 1997 it has gained worldwide popularity as a free alternative to expensive statistical software such as Matlab and SAS.

In the past few years, R has become the darling of data science, known not only to nerdy statisticians but also to Wall Street traders, biologists, and Silicon Valley developers. Companies as varied as Google, Facebook, Bank of America, and the New York Times use R, and its commercial adoption continues to spread.

R’s appeal is simple and obvious. With just a few lines of code, you can sift through complex data sets, run the data through advanced modeling functions, and produce polished charts to represent the numbers. It has been likened to a supercharged version of Excel.

R’s greatest asset is the vibrant ecosystem that has developed around it: the R community is constantly adding new packages and features to an already rich set. It is estimated that more than 2 million people use R, and a recent poll showed it to be by far the most popular language for data science, used by 61% of respondents (followed by Python at 39%).

Increasingly, it is making a presence on Wall Street. Bank analysts once pored over Excel files late into the night; now R is increasingly used for financial modeling, particularly as a visualization tool, says Niall O’Connor, vice president at Bank of America. “R makes our mundane tables stand out,” he said.

R’s growing maturity has made it a language of choice for data modeling, although some say its capabilities fall short when companies need to build large-scale products, and that other languages are starting to usurp it for that work.

“R is better suited for sketching and outlining than for detailed construction,” says Michael Driscoll, CEO of Metamarkets. “You won’t find R at the heart of Google’s page ranking or Facebook’s friend recommendation algorithm. Engineers will prototype in R, then hand off to models written in Java or Python.”

Back in 2010, Paul Butler famously used R to create his Facebook map of the world, a testament to the language’s rich visualization capabilities, although he doesn’t use R as much as he used to.

“R is becoming increasingly obsolete because of its slowness and unwieldiness in handling large data sets,” Butler said.

So what does he use instead? Read on.

Python

If R is a neurotic, lovable geek, Python is its easygoing, flexible cousin. Python has quickly gained mainstream appeal as a more practical language, combining R’s ability to quickly mine complex data with the ability to build real products. It is intuitive and much easier to learn than R, and its ecosystem has grown dramatically in recent years, making it capable of much of the statistical analysis once reserved for R.

“It’s a step forward for the industry. In the last two years, there has been a very noticeable shift from R to Python,” Butler said.

In data processing there is often a trade-off between scale and sophistication, and Python emerges as a compromise. IPython Notebook and NumPy can serve as a scratchpad for lighter work, while Python itself is a powerful tool for medium-scale data processing. Python’s rich data community is another advantage, offering a vast array of toolkits and functions.
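
As a rough illustration of that scratchpad-style use (the numbers below are invented for the example), a few lines of NumPy are enough to turn a small price series into summary statistics:

```python
import numpy as np

# Hypothetical daily price series -- illustrative data only
prices = np.array([101.0, 102.5, 101.8, 103.2, 104.0, 103.5])

# Simple daily returns: (p[t] - p[t-1]) / p[t-1]
returns = np.diff(prices) / prices[:-1]

# Summary statistics in a couple of lines
mean_return = returns.mean()
volatility = returns.std()
print(f"mean daily return: {mean_return:.4f}, volatility: {volatility:.4f}")
```

In an IPython Notebook, each of these steps would typically live in its own cell, with the intermediate arrays inspected interactively along the way.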

Bank of America uses Python to build new products and interfaces within the bank’s infrastructure, as well as to crunch financial data. “Python is so broad and flexible that people want it,” O’Donnell said.

However, Python is not the highest-performance language, and only occasionally can it power large-scale core infrastructure, Driscoll said.

Julia

Although the vast majority of data science today is carried out in R, Python, Java, MatLab, and SAS, other languages survive in the cracks, and Julia is a rising star worth watching.

Julia is still widely considered too obscure for industry use, but data hackers talk excitedly about its potential to displace R and Python. Julia is a high-level, extremely fast, expressive language. It is faster than R, scales better than Python, and is fairly simple to learn.

“It’s growing step by step. Eventually, you’ll be able to do everything in Julia that you can do in R and Python,” Butler said.

So far, though, adoption of Julia has been hesitant because the language is still young. The Julia data community is in its early stages, and more packages and tools will need to be added before it can compete with R and Python.

“It’s still young, but it’s making waves and very promising,” Driscoll said.

Java

Java, and Java-based frameworks, form the skeleton of Silicon Valley’s biggest tech companies. “If you look at Twitter, LinkedIn, and Facebook, Java is the underlying language for all of their data engineering infrastructure,” Driscoll said.

Java does not provide the same quality of visualization as R and Python, and it is not the best choice for statistical modeling. But if you are moving past prototyping and need to build large systems, Java is often your best bet.

Hadoop and Hive

A host of Java-based tools have been developed to meet the huge demand for data processing. Hadoop, the preferred Java-based framework for batch data processing, has ignited real enthusiasm: it is slower than some other processing tools, but exceptionally accurate, so it is widely used for back-end analysis. It pairs well with Hive, a query framework that runs on top of it.
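
The batch model Hadoop implements at scale can be sketched in a few lines of plain Python; this is only a toy word count under the map/shuffle/reduce pattern, with no Hadoop involved and made-up input data:

```python
from collections import defaultdict

# Toy input: each string stands in for one line of a large input file
lines = ["kafka storm hadoop", "hadoop hive", "storm hadoop"]

# Map step: emit (word, 1) pairs, as a Hadoop mapper would
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle step: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce step: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'kafka': 1, 'storm': 2, 'hadoop': 3, 'hive': 1}
```

Hadoop’s value is that it runs exactly this pattern across thousands of machines and terabytes of input, with the shuffle handled by the framework; Hive then lets analysts express such jobs as queries instead of hand-written map and reduce code.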

Scala

Scala is another JVM language, and like Java it is increasingly used for large-scale machine learning and for building high-level algorithms. It is expressive and can support robust systems.

“Building in Java is like building in steel, whereas Scala is like working with clay: you can put it into a kiln and turn it into steel,” Driscoll said.

Kafka and Storm

So what about when you need quick, real-time analysis? Kafka will be your best friend. It has been around for about five years, but has only recently become a popular framework for stream processing.

Kafka, created inside LinkedIn, is an ultra-fast messaging system. Kafka’s drawback? It is almost too fast: operating in real time introduces its own errors, and it occasionally misses things.

“There’s a trade-off between precision and speed,” Driscoll said. “So all the big tech companies in Silicon Valley use two pipelines: Kafka or Storm for real-time processing, and Hadoop for batch systems, which are slow but super accurate.”

Storm, another framework written in Scala, is gaining a lot of traction in Silicon Valley for stream processing. It was acquired by Twitter, which no doubt benefits greatly from fast event processing.

Honorable mention

MatLab

MatLab has long been popular, and despite its expensive price tag, it is still widely used in some very specific fields: intensive machine learning, signal processing, image recognition, to name a few.

Octave

Octave is very similar to MatLab, but it is free. However, it is rarely seen outside academic signal-processing circles.

Go

Go is another upstart making waves. Developed by Google and loosely derived from C, it is gaining ground on competitors such as Java and Python for building robust infrastructure.