R is for Research, Python is for Production
Matt Dancho and Jarrell Chalmers, 2021-2-18
Translator: Zhang Jingxin
Reprinted in: R&Python Data Science
About the author:
Matt Dancho is the founder of Business Science (www.business-science.io), a consulting firm that helps organizations apply data science to business applications. He is the author of the R packages tidyquant and timetk and has worked in business and financial analysis within data science since 2011. Matt holds master’s degrees in business and engineering and has extensive experience in business intelligence, data mining, time series analysis, statistics, and machine learning.
R and Python are both great. This article covers some of the advantages of each language by showing the major developments in each ecosystem.
1. R for research
If I had to use one word to describe R, it would be tidyverse. It helps you complete your research tasks (wrangling data, visualizing results, and iterating from idea to code) in a way that is stress-free and, more to the point, enjoyable. The ultimate R cheat sheet helps explain why R is used for research.
If you are starting to learn R, the tidyverse is the ideal place to begin. It is a standardized collection of packages and tools with a consistent, structured programming interface, whereas base R is significantly more complex and less user-friendly.
We can find many smaller R packages that solve specific problems, but here are the most important R packages:
dplyr & ggplot2
Two powerful packages that help you make day-to-day decisions are dplyr and ggplot2, which are ideal for data wrangling and visualization. These are the two most important skills that a data scientist or data analyst can have.
Rmarkdown
Without a doubt, one of R’s most distinctive advantages is Rmarkdown, a framework for creating reproducible reports, presentations, blogs, journals, and more. Imagine running a report that produces an easy-to-share HTML page or PDF for your team. It is definitely a better way to work than clicking hundreds of times in Excel every Monday morning.
Shiny
Shiny is another R framework, used for creating interactive web applications. One of its best features is that it puts data science tools in the hands of the non-data-focused members of your team, who can use them to make decisions through an easy-to-use GUI (graphical user interface). Imagine your team getting together for a Monday afternoon planning meeting, having reviewed the previous week’s reports created in Rmarkdown, and running simulations together in a Shiny web application to decide where to head next.
Where R is growing
Next, if you turn to the “Special Topics” page, you can see the growing R ecosystem. Below are the key features that set the R ecosystem apart from the Python ecosystem.
You can see that R has been extended to:
- Time series and forecasting: modeltime and timetk
- Financial analysis (and other areas): tidyquant, quantmod
- Network analysis and visualization: tidygraph and ggraph
- Text analysis: tidytext and recipes
- Geospatial analysis and visualization: thematic maps
- Machine learning: H2O, tidymodels, and mlr3verse
Translator’s note: the original listed mlr3; I think mlr3verse is the better choice here.
What else is missing from R?
There is an obvious gap when it comes to production. R has Shiny (apps) and Plumber (APIs, not shown), but automation tools such as Airflow and the cloud software development kits (SDKs) are primarily available in Python.
R summary
Because of the tidyverse, R is really special when it comes to research: it simplifies data wrangling and visualization. To be frank, mastering the tidyverse makes you 3 to 5 times more productive when working with data in R.
2. Why is Python great?
Python is amazing too, but for different reasons. Take a Python package like OpenCV, which is for computer vision.
This is a real advantage of Python, because we can use OpenCV for really cool things like object detection.
But how much does it affect my daily work? About zero. Why is that? Because I’m a business analyst and a data scientist who works with SQL databases. I’m more interested in how Python can help me mine information more effectively and put the results into production.
Let’s look at the Python ecosystem with the ultimate Python cheat sheet (note that this is different from the R cheat sheet shown earlier).
As you can see, basically everything related to importing, cleaning, and processing data is done by the pandas package. So what is pandas? It is an object-oriented tool for manipulating data in Python.
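To make "object-oriented" concrete, here is a minimal sketch, assuming a small made-up orders table (the column names and values are hypothetical, not from the cheat sheet). Cleaning and processing are both methods called on the DataFrame object, and each call returns a new object.

```python
import pandas as pd

# Hypothetical order data; column names and values are invented for illustration.
orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3, 3],
        "amount": [20.0, None, 15.5, 30.0, 12.5],
    }
)

# Cleaning: drop the row whose amount is missing.
clean = orders.dropna(subset=["amount"])

# Processing: add a derived column (the 10% rate is an arbitrary example value).
clean = clean.assign(amount_with_tax=clean["amount"] * 1.1)

print(clean)
```

The original `orders` object is left unchanged; `dropna` and `assign` both hand back new DataFrames.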
Pandas vs Tidyverse
Although programmers love pandas, business analysts may not initially be comfortable with the object-oriented (Python-style) approach, in which data frames carry their own methods:
customer_counts_df = df.groupby('customer_id').value_counts()
Everything in Python is an object, and we call methods on it (such as groupby and value_counts). This call doesn’t look too bad. However, we usually need to chain many more processing steps, and the code then becomes harder to write, less readable, and more complex.
The tidyverse in R, by contrast, uses a different syntax: the pipe operator (%>%). This reads much like SQL and like the data processing flow that users picture in their heads.
customer_counts_tbl <- df %>%
    group_by(customer_id) %>%
    summarize(count = n())
This tidy data processing workflow makes it easier for data analysts to extend a series of operations to 10 steps or more. Remember, the challenge is not typing the code but turning your ideas into code. This is where the tidyverse is really strong.
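For comparison, here is a rough pandas counterpart to the dplyr pipeline above, written as a method chain. It is only a sketch: the sample data is made up, the sorting steps are added to show a longer chain, and wrapping the chain in parentheses is one common style rather than the only way to write it.

```python
import pandas as pd

# Hypothetical data: one row per order, identified only by customer.
df = pd.DataFrame({"customer_id": [101, 102, 101, 103, 101, 102]})

customer_counts_df = (
    df.groupby("customer_id")               # like group_by(customer_id)
    .size()                                 # like summarize(count = n())
    .reset_index(name="count")              # turn the result back into a data frame
    .sort_values("count", ascending=False)  # extra step: largest customers first
    .reset_index(drop=True)                 # renumber the rows after sorting
)

print(customer_counts_df)
```

Each method returns a new object, so the steps read top to bottom much like the piped R version, although the reader has to know which object type each step returns.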
The main advantage of Python is Production ML
OK, so why is Python useful for business? As it turns out, its strengths lie in machine learning and production!
As you can see, Python has well-established tools for production ML:
- Automation: Airflow, Luigi
- Cloud: AWS, Google Cloud, and Azure software development kits (SDKs)
- Machine learning: scikit-learn
- Deep learning and computer vision: PyTorch, TensorFlow, MXNet, OpenCV
- NLP: spaCy, NLTK
These production-oriented tools make it easier to work with the people who handle cloud and operations work as part of a larger IT team, because they are already using Python. There is no need to add R and its dependencies to the production system.
Python summary
If you can get over the pandas learning curve, Python can be a great tool. Most IT teams know Python, so your code will fit neatly into their workflow. Just be aware that your productivity in research may be 3 to 5 times lower than that of your R peers, because of the boost the tidyverse gives them.
Which language should you learn?
This decision can be challenging, as both Python and R have clear advantages.
- R is phenomenal for research: making visualizations, finding insights in data, generating reports, and building MVP-level applications with Shiny. From concept (idea) to execution (code), R users can often complete these tasks 3 to 5 times faster than Python users, which makes research work very efficient.
- Python is extraordinary for production: integrating a machine learning model into a production system where your IT infrastructure relies on automation tools like Airflow or Luigi.
Why not learn both R and Python?