Data science is an interdisciplinary field focused on extracting meaningful information from large sets of data. To discover hidden patterns, Data Scientists use math, science, data analysis, algorithms, and systems to identify opportunities for increased efficiency, productivity, and profitability.
In simpler terms, data science uses math and technology to analyze structured and unstructured data to find ways to be more productive and profitable. To find those patterns, a Data Scientist spends a lot of time collecting, cleaning, modeling, and examining data from numerous angles, some of which have not been looked at before.
Essentially, data science work is about knowledge creation: it uses state-of-the-art techniques and tools from computer science and statistics to turn a mess of data into knowledge that an organization can use to inform its business practices.
Among the most noteworthy techniques a Data Scientist uses are predictive causal analytics, prescriptive analytics, and machine learning. The first, predictive causal analytics, uses data to predict the likelihood of different possible outcomes of a future event. Prescriptive analytics goes a step further, suggesting a range of different actions based on those possibilities, with an eye toward optimizing outcomes. Machine learning, unlike the two techniques just mentioned, is not the “what” but the “how” of data science: it’s the practice of using data-based algorithms that improve automatically based on past experiences – essentially learning to do their job better – to discover patterns and make predictions.
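To make the difference between predicting and prescribing concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the customer history, the candidate actions, and the profit and cost figures are all hypothetical. The predictive step estimates how likely a purchase is, and the prescriptive step picks the action with the best expected payoff given that estimate.

```python
def predict_purchase_probability(history):
    """Predictive step: estimate the chance of a purchase from past outcomes.

    `history` is a list of 1s (purchase) and 0s (no purchase). A smoothed
    frequency is used so that an empty history gives 0.5 instead of an error.
    """
    return (sum(history) + 1) / (len(history) + 2)


def recommend_action(probability, actions):
    """Prescriptive step: choose the action with the highest expected profit.

    Each action maps to (lift, profit_if_purchase, cost), where `lift`
    multiplies the predicted purchase probability. All values are made up.
    """
    def expected_profit(item):
        _, (lift, profit, cost) = item
        adjusted = min(1.0, probability * lift)
        return adjusted * profit - cost

    best_name, _ = max(actions.items(), key=expected_profit)
    return best_name


# Hypothetical data: 2 purchases in 8 past visits.
history = [1, 0, 0, 1, 0, 0, 0, 0]
actions = {
    "do_nothing": (1.0, 40.0, 0.0),
    "send_coupon": (1.5, 35.0, 1.0),
    "phone_call": (2.0, 40.0, 15.0),
}

probability = predict_purchase_probability(history)
best_action = recommend_action(probability, actions)
print(probability, best_action)
```

With these invented numbers the estimate is 0.3, and sending a coupon has the highest expected profit, so it is the recommended action.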
And yet, to answer the question “what is data science” in the real world, it is worth understanding that the data science process involves much more than simply using computers to crunch numbers. In fact, Data Scientists may be heavily involved in the decision-making process across departments, which means that, practically speaking, data science also involves collaborating with others, and especially knowing how to communicate important findings to other people.
History of Data Science
Modern data science, and interest in big data more broadly, really picked up in the mid-1990s, when Business Week published a cover story on “database marketing,” noting that companies were collecting large amounts of data about their customers and using it to predict how likely those customers were to buy a product and to craft marketing messages that would make them more likely to do so.
Two years later, members of the International Federation of Classification Societies met for their biennial meeting, and for the very first time, “data science” was included in the title of the conference (“Data science, classification and related methods”). The same year, an influential paper titled “From Data Mining to Knowledge Discovery in Databases” was published, and the following year the journal Data Mining and Knowledge Discovery was launched. Also in 1997, C. F. Jeff Wu delivered an inaugural lecture for the H. C. Carver Chair in Statistics at the University of Michigan in which he called for statistics to be renamed data science and statisticians to be renamed Data Scientists.
In 2002, the Data Science Journal launched, followed by the Journal of Data Science the next year. And 2007 saw the establishment of the Research Center for Dataology and Data Science in Shanghai.
Still, those who weren’t plugged into data science trends might have been taken aback when, in 2009, Google Chief Economist Hal Varian told the McKinsey Quarterly that “the sexy job in the next 10 years will be statisticians.” Time has proven him right. You’d be hard-pressed to find a successful company that isn’t pouring money into finding creative and efficient ways to harness the power of big data, and Data Scientists are at the core of that.
Benefits of Data Science
Research shows that companies that truly embrace data-driven decision-making are more productive, profitable, and efficient than the competition. Data science technologies are crucial to helping organizations identify the right problems and opportunities while helping to form a clear picture of customer and client behavior and needs, employee and product performance, and potential future issues.
Data science benefits include:
- It removes the guesswork and provides actionable insights. Companies make better decisions powered by data and quantifiable evidence.
- Business intelligence helps companies better understand their place in the market. Data science projects can help companies analyze the competition, explore historical examples, and make numbers-based recommendations.
- It can be leveraged to identify top talent. Lurking in big data are lots of insights about productivity, employee efficiency, and overall performance. Data can also be used to recruit and train talent.
- You’ll get to know everything about your target audience, client, or consumer. Everyone is generating and collecting data now, and companies that don’t properly invest in data science simply collect more data than they know what to do with. Insights into the behavior, priorities and preferences of past or potential customers or clients are invaluable, and they’re simply waiting for a qualified Data Scientist to discover.
Data Science Techniques
There are many different techniques within data science, including:
- Designing, building, optimizing, maintaining, and managing the infrastructure that supports data collection and the flow of data throughout an organization.
- Cleaning and transforming data.
- Extracting usable data from a larger data set, which sometimes also involves cleaning and transforming the raw data.
- Using algorithms and machine learning techniques to estimate the likelihood of various possible future outcomes based on data analysis.
- Automating analytical model building so that systems can learn from data, discover patterns, and make decisions without much human intervention.
- Using data visualization tools to create visual elements (including graphs, maps, and charts) that present insights in an accessible way, so audiences can understand the trends, outliers, and patterns found in data.
- Natural language processing: teaching computers to understand words and text in a way that is similar to how people do.
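As one illustration of the cleaning and transforming step listed above, the following Python sketch tidies a small batch of customer records. The record layout (`email` and `spend` fields), the sample values, and the function name are hypothetical, and real pipelines typically involve many more rules than this.

```python
def clean_records(raw_records):
    """Drop duplicate or unusable rows and normalize the rest."""
    cleaned = []
    seen_emails = set()
    for record in raw_records:
        # Normalize the email so " Ana@Example.com " and "ana@example.com" match.
        email = (record.get("email") or "").strip().lower()
        if not email or email in seen_emails:
            continue  # skip rows with no email and repeated customers
        # Strip currency formatting before converting spend to a number.
        spend_text = str(record.get("spend", "")).replace("$", "").replace(",", "")
        try:
            spend = float(spend_text)
        except ValueError:
            continue  # skip rows whose spend cannot be parsed
        seen_emails.add(email)
        cleaned.append({"email": email, "spend": spend})
    return cleaned


# Hypothetical raw data with the usual problems: stray whitespace,
# inconsistent casing, duplicates, missing values, and unparseable numbers.
raw_records = [
    {"email": " Ana@Example.com ", "spend": "$1,200.50"},
    {"email": "ana@example.com", "spend": "300"},
    {"email": None, "spend": "45"},
    {"email": "bo@example.com", "spend": "n/a"},
    {"email": "cy@example.com", "spend": 80},
]

cleaned_records = clean_records(raw_records)
print(cleaned_records)
```

Of the five raw rows, only two survive: the first record for Ana and the record for Cy, both with lowercase emails and numeric spend values.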
Data Science vs Data Mining
Where data science is a broad field, data mining describes an array of techniques within data science for extracting otherwise obscure or unknown information from a database. Data mining is a step in the process known as “knowledge discovery in databases” or KDD, and like other forms of mining, it’s all about digging for something valuable. Since data mining can be viewed as a part of the data science lifecycle, there is naturally some overlap; data mining also includes such steps as data processing, statistical analysis, and pattern recognition, as well as data visualization, machine learning, and data transformation.
Here are some of the key differences between data science and data mining:
|Data science|Data mining|
|---|---|
|A broad field that includes machine learning, artificial intelligence, predictive causal analytics, and prescriptive analytics|A subset of data science that includes data cleaning, statistical analysis, and pattern recognition, and sometimes includes data visualization, machine learning, and data transformation|
|Deals with all kinds of data, including both structured and unstructured data|Deals primarily with structured data, not unstructured data|
|Aims to build data-centric products and make data-driven decisions|Aims to take data from various sources and make it usable|
|Communicated through data visualizations| |
Data Science vs Machine Learning
It might be easiest to view machine learning as a part of data science. Machine learning frees Data Scientists from the tedious task of sifting through massive volumes of data by using complex algorithms and problem-solving methods including supervised and unsupervised learning, regression, classification, clustering, and neural networks.
Examples of machine learning are all around you. Facebook, for instance, uses machine learning models to analyze your past behavior to present content and notifications in line with your interests. Similarly, when Netflix somehow recommends a show you’d love to binge-watch, it’s an example of machine learning.
Perhaps the simplest example of machine learning in motion lies in how it approaches the task of recognizing handwriting. To train a machine with examples of correct input-output pairs – which is called supervised machine learning – the computer is shown images of handwritten numbers alongside the correct labels for those digits. The computer then tries to figure out the shared characteristics of each digit, and gradually picks up on the patterns between the images and the labels.
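The idea of learning from labeled examples can be sketched in a few lines of Python. This is a toy stand-in and not how production handwriting recognition works: it uses a nearest-centroid classifier (average the examples for each label, then pick the closest average), and the 3x3 pixel "images" are invented for illustration.

```python
def train_centroids(examples):
    """Average the pixel vectors of each label to get one prototype per digit."""
    sums, counts = {}, {}
    for pixels, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(pixels)
            counts[label] = 0
        sums[label] = [s + p for s, p in zip(sums[label], pixels)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]]
        for label in sums
    }


def predict(centroids, pixels):
    """Return the label whose prototype is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((c - p) ** 2 for c, p in zip(centroids[label], pixels))

    return min(centroids, key=distance)


# Hypothetical training set: flattened 3x3 images (1 = ink, 0 = blank)
# paired with the correct label for each image.
training_examples = [
    ([1, 1, 1, 1, 0, 1, 1, 1, 1], 0),  # a ring shape, labeled 0
    ([1, 1, 1, 1, 0, 1, 1, 1, 0], 0),  # the same ring with one pixel missing
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], 1),  # a vertical bar, labeled 1
    ([1, 1, 0, 0, 1, 0, 0, 1, 0], 1),  # a bar with a small flag on top
]

centroids = train_centroids(training_examples)

# Two images the model has not seen before.
unseen_zero = [0, 1, 1, 1, 0, 1, 1, 1, 1]
unseen_one = [0, 1, 0, 0, 1, 0, 0, 1, 1]
print(predict(centroids, unseen_zero), predict(centroids, unseen_one))
```

Neither test image appears in the training set, yet each is assigned the expected label because it sits closer to the average of the right group. Real systems use far larger images, many more examples, and more capable models such as neural networks.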
Generally, machine learning is effective for solving problems that are statistical or probabilistic in nature, deeply complex, and that can be adequately handled with an approximate solution. For instance, detecting credit card fraud checks all of those boxes: solutions are probabilistic because a determination won’t be made until a company reaches its customer; the rules around fraud are complex; and approximate solutions are adequate since we’re simply flagging transactions for further review.
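The "flag for further review" idea can be illustrated with a deliberately simple rule in Python. This is a hand-written statistical check and not a trained machine learning model, and the function name, the threshold of 3 standard deviations, and the transaction amounts are all assumptions for the sake of the example.

```python
from statistics import mean, stdev


def flag_for_review(past_amounts, new_amount, threshold=3.0):
    """Flag a transaction whose amount is far from the customer's usual spending.

    Requires at least two past amounts, since a standard deviation
    cannot be computed from a single value.
    """
    typical = mean(past_amounts)
    spread = stdev(past_amounts)
    if spread == 0:
        # Every past amount was identical, so any different amount stands out.
        return new_amount != typical
    z_score = abs(new_amount - typical) / spread
    return z_score > threshold


# Hypothetical purchase history for one customer.
past_amounts = [20, 25, 22, 30, 18, 27, 24, 26]

ordinary_flagged = flag_for_review(past_amounts, 28)
unusual_flagged = flag_for_review(past_amounts, 400)
print(ordinary_flagged, unusual_flagged)
```

The output is a flag, not a verdict: a flagged transaction is only queued for a person or a follow-up check, which is why an approximate answer is acceptable here.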
Although many of the more advanced machine learning tools do require some experience and know-how, the basics can still be impactful for those looking to dig deeper. Many supervised and unsupervised learning models are implemented in R and Python, and straightforward models like linear or logistic regression can be used to perform informative machine-learning tasks.
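As a small example of how approachable the basics are, the sketch below fits a straight line with ordinary least squares using only built-in Python. The advertising and sales figures are invented, and the function name is hypothetical; in practice an existing library implementation would usually be preferred over hand-rolled code like this.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor,
    # which cancels out in the ratio below).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend (x) and units sold (y).
ad_spend = [1, 2, 3, 4, 5]
units_sold = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, units_sold)
forecast = slope * 6 + intercept  # predicted units sold at a spend of 6
print(slope, intercept, forecast)
```

Because the invented data lies exactly on a line, the fit recovers a slope of 2 and an intercept of 1, giving a forecast of 13 for a spend of 6. Real data is noisy, so the fitted line would only approximate the points.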