William Chen is a data scientist at Quora, where he has helped the company grow and share knowledge with the world. After graduating from Harvard with a double major in statistics and applied mathematics, he went straight into data science, making him one of the first students to take a full data science curriculum in school and move directly into the field after graduation. Before joining Quora full time, he worked as a data intern at Quora and Etsy. He loves telling stories with data and shares his knowledge widely on Quora.
William is also a co-author of Interviews with Data Scientists.
Can you tell us a little bit about your journey into data science?
In my first year at Harvard, I wanted to study math, but ended up taking Joe Blitzstein’s Statistics 110. That class changed the way I think about uncertainty and everyday things, and taught me the value of intuition and communication. Influenced by that course, I switched my major to statistics the next year.
In my sophomore year, I began looking for internships, hoping to put some of my knowledge of probability and statistics to use. At that time, I had only theoretical knowledge and little experience applying it. I was surprised when Etsy offered me an internship as a data analyst. It was my first attempt to use data to improve a business. The internship helped me grow in many ways, honed my skills, and turned me into a budding data scientist.
Etsy is a metrics-driven company, and I could clearly see that its most important business decisions relied on A/B testing. Because statistics were discussed so often, I learned about the common techniques and some of the potential pitfalls of a data-driven technology company.
Etsy presents its data beautifully (D3 dashboards and polished slide decks). In a corporate environment that valued visualization, I taught myself ggplot2 and started making my own plots. I learned a lot during that internship. It was the first step in my career as a data scientist.
After finishing my internship at Etsy, I started my junior year. That year, I returned to Harvard and became a teaching assistant for Statistics 110.
By helping people solve their probability problems, I realized that teaching statistics could improve my communication and storytelling skills. It was also fun, and I became more comfortable sharing what I had learned.
During my junior year, I also started taking more computer science courses, and I realized how important they were to data science. If you don’t have strong enough programming skills to implement your statistical ideas, you are limited in what you can do. I realized that a successful data scientist needs both statistics and computer science, so I tried to become an expert at their intersection by taking courses in both.
In my junior year, I also applied for internships, hoping to use my statistical and programming skills to help a company make better decisions. I received an internship offer from Quora and accepted it, even though I knew nothing about the product at the time.
At Quora, I was exposed to a larger codebase and learned more about software engineering. I took my projects seriously and thought hard about them. I worked on projects tied to the company’s new growth plans, and I loved how much freedom Quora gave its employees and how much it trusted them. I loved the people and the product, so I decided to return to Quora full time after graduation.
In my senior year, I continued to study statistics and various programming tools, and completed my graduation thesis.
Why did you choose statistics over computer science in the first place?
I put a lot of time into Statistics 110 and a bunch of other statistics classes. I loved them, so there was no reason for me to choose another major!
During my internship at Etsy, I saw firsthand how limited I would be if I could only do statistics and could not program. That summer, I spent a lot of time learning to analyze data in R.
I took about the same number of statistics and computer science courses in my junior and senior years. The computer science courses let me do statistical analysis more efficiently. I chose courses that either helped me apply statistics better (machine learning, parallel programming, web development, data science) or simply covered interesting mathematical topics (data structures and algorithms, economics and computer science).
My main interest is still statistics, but I value computer science very much because it allows me to do more complex analysis, generate visualizations, process large amounts of data at once, and automate a lot of my work so that I can focus on the most interesting problems.
I even applied for a second degree in computer science in the first semester of my senior year. I had just met its requirements (entirely by accident), so I figured I might as well apply, since it took nothing more than getting some paperwork stamped.
Could you tell us more about some of the difficult problems you encountered during your internship?
One of the exciting things about working for a data-centric tech company is that there are so many potential projects to tackle. There is a lot of data to analyze, and there are never enough data scientists to really dig into all of it. My main challenge during my internships, especially at Quora, was figuring out how to prioritize the many things I was working on, particularly when I had several projects going at once.
At Quora, I realized that I couldn’t do everything at once, which is how I had done things at school. I needed to prioritize the work that would have the most impact on the company. If I spent too much time on a particular piece of software, I might not have enough time for growth initiatives that could have a higher impact.
What do you think of people saying that “data science is the intersection of mathematics, statistics and computer science”? How would you weight each of them?
I think the programming and software engineering part is really important, because you may want to implement a model yourself, write dashboards, and extract data in some really novel way. You will be responsible for moving your own data. You will be the person with end-to-end, full-stack capabilities, from extracting the data to producing a report and presenting it to the company.
The Pareto principle is in full effect here. Eighty percent of the time is spent fetching data, cleaning it up, and writing code for the analysis. I found this to be true during my internships (especially since I was just starting out). Good coding skills are especially important here: they save you a lot of time and make you less likely to get frustrated.
Let me stress this: getting the data and figuring out what to do with it takes a lot of time, and it usually doesn’t require any statistical knowledge. This part is mostly about using software engineering techniques to clean up the data, or writing efficient query code to move and analyze your data in the database. Programming is really important here.
One interesting thing to note is that the statistics used in data science are quite different from the statistics you read about in research papers. Companies choose statistical methods for their speed, interpretability and reliability rather than for theoretical perfection.
While the statistics and math companies use may not be sophisticated, a solid background in math and statistics is still important when you need to distinguish real insights from bogus results. In addition, solid fundamentals and experience will give you a better intuition for how to solve the company’s more intractable problems. You may have a better intuitive explanation of why a metric suddenly drops, or why people suddenly choose your product.
Another benefit of a strong background in statistics and mathematics is what it contributes to communication. The more you understand the underlying mechanics and principles of a statistical method or algorithm, the better you can clarify what you’re doing and communicate it to the rest of the team. Most of your job as a data scientist is to show people what you think will have a big impact in the future, and communication is essential to that.
Some data science positions require a very strong background in statistics or machine learning. They may require you to develop feed automation or other recommendation engines, or to know time series analysis, basic machine learning techniques, linear regression, causal inference, and so on. Many types of data require more advanced statistical methods to analyze.
In my observation, the balance between computer science, statistics and math depends on your position.
What do you think about the fact that the majority of people joining data science today have PhDs?
Data science is a new field, and employers are looking for people with the skills to become data scientists. Because the field is brand new and few people have experience in it, employers have to look for people who show potential. PhD students with a computational or quantitative research background are usually a good choice because they have already done a lot of research and data work. PhD and master’s students with experience in data processing often already have many of the qualities a data scientist needs: the ability to learn quickly, ask questions, and be flexible.
I think companies will start to recruit more and more undergraduates as data scientists, and in the next five to ten years there will be more people with the skills the field needs. I met many sophomores at Harvard who wanted to become data scientists, just as I did when I was a sophomore. I think they see it as a promising and exciting career direction, and I see it the same way.
Right now, plenty of MOOCs (massive open online courses) offer courses and certificates, and universities around the world are launching their first data science courses. For example, Harvard’s first data science course and its first predictive modeling course appeared in the 2013-2014 academic year. These courses are the perfect starting point for undergraduates who want to learn about data.
If you want to hire data scientists today, there really aren’t many people with experience, so those with PhDs and master’s degrees are good candidates. That could change over the next five to ten years as more undergraduates graduate with the data science skills employers require.
There are already data science specializations on Coursera, and at Harvard, Joe Blitzstein and Hanspeter Pfister teach a data science course. Joe is the professor who taught my favorite statistics class.
In the spring of 2014, Harvard offered a course on predictive modeling built around Kaggle competitions. Such courses are the perfect starting point for undergraduate students who want to work in the data field.
If you could go back to your college days, what would you focus more on? Was there anything that you felt was overlooked?
I think my biggest regret about my course selection in college is that I didn’t take a programming course in my freshman year. Programming is so important in data science. Outside of giant companies like Google or Amazon, which may need specialized statisticians, there are few pure statistician positions that involve no coding. You can’t run away from it.
When it comes to the term “data science”, many people worry or claim that the field is overhyped. What do you think of this view?
The hype around data science is definitely overdone right now, as it is around the cloud and the mobile/local/social craze. However, just because something is overhyped doesn’t mean it’s not important. I think the hype and the bubble will go away in the next few years, but the importance of data science will not.
Do you think the need for data scientists will die out as software tools improve?
Personally, I like all the new software tools. I think the job of a data scientist will change over the next few years as these tools get better and better.
I don’t think the demand for data scientists will diminish, though, because we’ll always need people who can interpret results and distill insights into actionable plans to improve the business. Data science is never short of hard questions, and someone always has to interpret results and communicate ideas. I think that’s what data science is all about: turning data into actionable conclusions that can be used to improve products and businesses.
Software tools may make some of the work data scientists do today obsolete, as startups offer comprehensive enterprise-level solutions and commoditize certain data-related tasks. But even with new tools, we will still need data scientists to apply human judgment in using them. You’ll still need your data scientists to look at the results and consider how they can directly help your company grow.
How much domain expertise does it take to be a good data scientist? To what extent do you need to understand what people are doing online? Does that help you develop new products?
At Quora, I worked on a project that involved understanding user engagement. As an avid Quora user myself, I thought hard about this question. One advantage of having domain knowledge is that you can form better hypotheses about what you’re curious about before you even look at the data. You can then look at the data to get a better intuition for why those hypotheses were right or wrong. Domain expertise and the intuition that comes with it are helpful, especially if the model is complex or needs to be presented to an internal audience. Domain expertise helps you tell meaningful stories that explain what drives human behavior in your product. This is very different from some of the data sets on Kaggle, some of which don’t even have column names (for privacy reasons), so you don’t fully understand the data you’re analyzing.
When you were looking for a job, you were choosing between quantitative finance and data science, and you finally chose data science. Why? What were the considerations behind this decision?
I think quantitative finance and data science are both good choices. I was pretty sure data science was the right choice for me because I’m excited to see how technology can change the world and make everything work better. I felt like I wanted to be a part of it. To do that, I felt I needed to be part of a technology company with a large customer base, where I could help build a product that helps people get things done.
I also really enjoy the teaching and communication aspects of data science, something I discovered when I was a teaching assistant for Statistics 110 at Harvard. There’s a lot of teaching and communication in data science. In quant finance, you work behind the scenes and simply report what you’ve done.
I want to be an evangelist for data-driven ideas and convince people that data is useful. I think the tech industry has a lot of potential. Data is very new to tech and very old to finance. It is exciting to be involved in data science while the field is still in its infancy. I want to work with more people to use technology to make people’s lives better.
This article is excerpted from Interviews with Data Scientists
Interviews with Data Scientists
By Carl Shan et al.
The book gathers in-depth interviews with 25 world-renowned data scientists, collecting their wisdom, experience, guidance and advice from a range of perspectives. Each interview is an in-depth exchange covering the whole path from novice, through building up the necessary knowledge, to becoming an effective data scientist.
By reading these interviews, you can form a broad understanding of data science, gain a deeper appreciation of the data scientist’s role, and draw on the interviewees’ experience in your own growth and career.