What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge or insights from data in structured or unstructured form.
From scientific discovery to business intelligence, data science is changing our world. The dissemination of nearly all information in digital form, the proliferation of sensors, breakthroughs in machine learning and visualization, and dramatic improvements in cost, bandwidth, and scalability are combining to create enormous opportunity.
The field also presents enormous challenges, thanks to the relentless increase in the volume, velocity, and variety of information ripe for mining and analysis.
It employs concepts and techniques from mathematics, statistics, information science, and computer science, in particular from machine learning, classification, cluster analysis, data mining, databases, and visualization.
“Data scientist” has become a popular occupation, with the Harvard Business Review dubbing it “The Sexiest Job of the 21st Century” and McKinsey & Company projecting a global excess demand of 1.5 million new data scientists.
What do Data Scientists do?
Data scientists use their data and analytical ability to:
- find and interpret rich data sources
- manage large amounts of data despite hardware, software, and bandwidth constraints
- merge data sources
- ensure consistency of datasets
- create visualizations to aid in understanding data
- build mathematical models using the data
- present and communicate the data insights/findings.
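As a rough illustration of two of the tasks above, merging data sources and checking that they are consistent, here is a minimal sketch in plain Python. The records, the field names (customer_id, amount, city), and the helper merge_sources are all hypothetical and made up for this example; in practice a library such as pandas would usually do this work.

```python
# Hypothetical records from two separate sources, both keyed by customer_id.
orders = [
    {"customer_id": 1, "amount": 25.0},
    {"customer_id": 2, "amount": 40.0},
    {"customer_id": 4, "amount": 15.0},
]
customers = [
    {"customer_id": 1, "city": "Austin"},
    {"customer_id": 2, "city": "Boston"},
    {"customer_id": 3, "city": "Chicago"},
]


def merge_sources(left, right, key):
    """Inner-join two lists of dicts on `key`.

    Returns the merged rows plus the keys from `left` that had no
    match in `right`, which is a simple consistency check.
    """
    right_by_key = {row[key]: row for row in right}
    merged, unmatched = [], []
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            unmatched.append(row[key])
        else:
            merged.append({**match, **row})
    return merged, unmatched


merged, unmatched = merge_sources(orders, customers, "customer_id")
print(merged)     # orders enriched with the customer's city
print(unmatched)  # order keys with no matching customer record
```

Reporting the unmatched keys instead of silently dropping them makes inconsistencies between the two sources visible before any modelling starts.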
They are often expected to produce answers in days rather than months, to work by exploratory analysis and rapid iteration, and to produce and present results with dashboards.
How to Build Your Profile for MS in Data Science?
Thinking of pursuing an MS in Data Science (or Machine Learning)?
Head to the home of data science and machine learning: Kaggle competitions!
Kaggle is a platform for predictive modelling and analytics competitions in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing the data. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task and it is impossible to know at the outset which technique or analyst will be most effective.
Kagglers come from a wide variety of backgrounds, including fields such as computer science, computer vision, biology, medicine, and even glaciology. The community also includes many of the world’s best-known researchers, including members of IBM Watson’s Jeopardy-winning team and the team working on Google’s DeepMind. Many of these researchers publish papers in peer-reviewed journals based on their performance in Kaggle competitions.
How Do Kaggle Competitions Work?
- Companies and organizations prepare the data and a description of the problem. Kaggle frames the competition, anonymizes the data, and integrates the winning model into the host’s operations.
- Participants, like you, experiment with different techniques and compete against each other to produce the best models. Work is shared publicly through Kaggle Scripts to achieve a better benchmark and to inspire new ideas. Submissions are made through Scripts or through private manual upload. For most competitions, submissions are scored immediately (based on their predictive accuracy relative to a hidden solution file) and summarized on a live leaderboard.
- After the deadline passes, the host company pays the prize money for the winning solution. Many companies recruit participants based on their place on the leaderboard, final score, and submitted scripts.
- Alongside its public competitions, Kaggle also offers private competitions limited to Kaggle’s top participants.
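The scoring step described above, where each submission is compared against a hidden solution file and ranked on a leaderboard, can be sketched as follows. This is only an illustration, not Kaggle's actual implementation: the team names, ids, labels, and the accuracy helper are hypothetical, and plain accuracy stands in for whatever metric a given competition uses.

```python
def accuracy(submission, solution):
    """Fraction of solution ids whose predicted label is correct.

    Ids missing from the submission count as wrong.
    """
    correct = sum(
        1 for row_id, label in solution.items()
        if submission.get(row_id) == label
    )
    return correct / len(solution)


# Hypothetical hidden solution file: row id -> true label.
hidden_solution = {101: 1, 102: 0, 103: 1, 104: 0}

# Hypothetical submissions from three teams.
submissions = {
    "team_a": {101: 1, 102: 0, 103: 0, 104: 0},
    "team_b": {101: 1, 102: 0, 103: 1, 104: 0},
    "team_c": {101: 0, 102: 0, 103: 1},  # row 104 is missing
}

# Score every submission and sort best-first to form the leaderboard.
leaderboard = sorted(
    ((accuracy(preds, hidden_solution), team)
     for team, preds in submissions.items()),
    reverse=True,
)

for score, team in leaderboard:
    print(f"{team}: {score:.2f}")
```

Because participants never see the solution file, they can only improve their score by building a model that generalizes to the hidden labels.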
What Kaggle competition should a beginner start with?
I’d start with the tutorials first just to make sure you have a good grasp of the primary tools and techniques that most people use: https://www.kaggle.com/wiki/Home
Afterwards, Titanic: Machine Learning from Disaster is a good competition to start with. It will prepare you with the fundamentals of data science: the data size is manageable, the problem is interesting, and the computational requirements are minimal.
If you haven’t decided on your weapon of choice, I would suggest that you start with R. An R tutorial is available for Titanic: Machine Learning from Disaster. Follow this up with the Python tutorial for the same competition.
Since your objective is learning, the most important place for you is the Kaggle forum. There is a ton of valuable information buried in those posts: what worked, what didn’t work, the issues others are facing, interesting patterns and visualizations, and neat tricks. I find it to be the best “practical” data science guide out there.
Once you have a sound footing, maybe in a couple of weeks, the next step would be to try something with text data like Sentiment Analysis on Movie Reviews.
Add to that a competition that uses audio and/or video data. There could be a few running, or you can always dig up old ones such as Challenges in Representation Learning: Facial Expression Recognition Challenge and The Marinexplore and Cornell University Whale Detection Challenge.
Career in Data Science
A career in Data Science involves statistics, mathematics, business, economics and Computer Science.
After a Master’s in Data Science, you can work in various sectors such as finance, healthcare, consulting, retail or consumer products – basically any field where there is lots of data and there is a requirement to analyze large data sets to develop custom models and algorithms to drive business solutions.
In Data Science, the primary focus is on applications rather than research. You use some knowledge from Computer Science (data structures, deep learning, computer vision, natural language processing, machine learning) in your data science role.
Typical employers include Walmart, Tesla, Intuit, Collective Health and numerous financial/trading companies on Wall Street.
The average salary for a job in Data Science in the US is about $113,000 as per Glassdoor. Another source – Payscale – puts the median salary at about $93,000.
Let’s have a look at the application of data science in different fields.
#1 Data Science in Retail
With online commerce, retail data is increasing exponentially in terms of its volume, the velocity at which it is generated, and its value for the insights and profit it could offer. As per McKinsey’s report on Big Data, retailers using big data analytics could raise their operating margins by as much as 60 percent.
The following points are a few of the applications of big data in retail:
- Customer Experience: Personalized recommendation based on purchase history, sentiment analysis, predictive analytics for improving customer experience across all channels and devices
- Merchandising: Improving layout, product placement and promotional display, identifying cross-selling opportunities
- Marketing: Location-based personalized offers on mobile phones, real-time pricing, better targeted campaigns
- Supply chain logistics: Real-time inventory tracking and management, demand-driven forecasting, route optimization and efficient GPS-enabled transportation
#2 Data Science in Health Care
In the US, health care expenses represented 17.6% of GDP in 2013, with an annual spend of $2.6 trillion. Out of this, $600 billion was consumed by waste and fraud. By 2020, health care’s share of GDP is estimated to rise to nearly 20%.
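To put the waste and fraud figure in proportion, the short calculation below works out its share of total spending. It uses only the two dollar amounts quoted above; the variable names are made up for the example.

```python
# Figures quoted in the paragraph above, in billions of US dollars.
total_spend_billion = 2600.0      # $2.6 trillion annual health care spend
waste_fraud_billion = 600.0       # $600 billion consumed by waste and fraud

# Share of total spend lost to waste and fraud, as a percentage.
waste_share_pct = round(waste_fraud_billion / total_spend_billion * 100, 1)

print(f"Waste and fraud share of spend: {waste_share_pct}%")
```

On these figures, a little under a quarter of every health care dollar is consumed by waste and fraud.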
Big Data has the potential to help physicians make better decisions across the board – from personalized treatments to preventive care, while, at the same time, slashing the cost of providing health care services.
The following list details some of the applications of big data in health care:
- Personalized medicine: Create a personalized treatment plan based on individual biology using data from various sources, including clinical trials, electronic medical records, online patient networks, and genomics research.
- Genomics: Inexpensive DNA sequencing and next-generation genomic technologies are changing the way health care providers do business. Providers are gaining a better understanding of the genetic bases of drug response and disease by combining genomic data with other data in disease research.
- Predictive analytics and preventive measures: For example, Mount Sinai Medical Center reduced its readmission rate, Texas Health identified high-risk patients to offer them customized interventions, and Methodist Health System predicted which patients will need high-cost care in the future.
- Patient monitoring and home devices: Wearable body sensors, tracking everything from heart rate to testosterone to body water, can take patients’ vital signs every minute of the day. Personal ECG heart monitors, medical monitoring devices, and mobile applications are cropping up daily.
#3 Data Science in Finance
There has been a flood of financial data in recent times from various sources such as social media activity, mobile interactions, server logs, real-time market feeds, customer service records, transaction details and, of course, information from existing databases.
The following list details some of the applications of big data in finance:
- Sentiment analysis: Use natural-language processing, text analysis and computational linguistics to discover what people really think.
- Automated credit risk management: Alibaba has successfully used big data to offer loans to entrepreneurial online vendors without any collateral by using their transaction records, customer ratings, shipping records and a host of other information.
- Real-time analytics: Helps in fighting financial fraud, improving credit ratings and providing more accurate pricing
- Predictive analytics: For example, predicting whether certain customers are likely to pay off their credit cards, based on the demographic characteristics of their neighborhoods.
#4 Big Data in Telecom
Mind Commerce, a market research firm, predicts that the big-data-driven telecom analytics market will grow by nearly 50 percent from 2014 to 2019 and forecasts that by the end of 2019, the market will be up to $5.4 billion in annual revenue.