What is data really? Data is information about the world and its individuals that is collected and analyzed to aid decision making. Although data today is often associated with helpful visualizations, such as charts and infographics, it is important to understand how it has evolved historically.
Around 3200 BC, when writing was first being developed in Mesopotamia, scribes recorded data from daily life - such as tax and crop information - to improve their accounting and agricultural systems.
As both the natural and mathematical sciences advanced and better technology was introduced, mathematical statistics transformed into something more powerful: data science.
Data science combines what we tend to think of as traditional statistics and computer science in order to analyze large amounts of data and find new ways of doing so.
While data analytics utilizes mathematical knowledge from a statistics course in order to model data, data science functions mainly as a discipline that is used to extract information and draw new insight from large amounts of data. Some of the basic skills you will need to acquire in order to learn data science are:
- Calculus I & II
- Linear Algebra
- Probability Theory
- Mathematical Statistics
- Computer Science
There are a number of online resources that can help you get started with learning or refining these fundamental skills. Sites like Superprof offer a wide variety of tutors who give both online and face-to-face lessons in statistical maths and data-driven computer science.
What is Big Data and Big Data Analytics?
It is impossible to define data science without first defining big data. The first breakthrough in big data goes back to the Second World War and the group of computers known as Colossus.
Used from 1943 to 1945 by British cryptographers, these innovative machines deciphered codes used by the Nazi regime. While the term big data wasn’t coined until about 50 years later by John Mashey at Silicon Graphics, big data - and the large datasets that characterize it - is still used in the same way. That is, computer programs and advanced algorithms are applied to analyze extremely large amounts of data in order to find valuable patterns and trends.
Today, big data can be analyzed with a number of software tools, and the analysis is performed by individuals, governments and companies. Hadoop, released in 2005, was the first free, open-source software that businesses and retailers could use to store large amounts of data and deliver better search results to their customers.
Some of the most popular companies on the internet use Hadoop, including Facebook, Twitter, Amazon and LinkedIn. While Hadoop is widely seen as a revolutionary product in the field of big data and business analytics, there are now a number of other programs available, like Spark, that are equally powerful.
The mainstream definition of big data comes from Doug Laney’s 2001 formulation, which is now known as the three “V’s”:
- Volume: the idea that companies and governments amass large amounts of data from a multitude of sources, including social media and business transactions
- Variety: the notion that data comes in a variety of forms, including text, audio and emails
- Velocity: the idea that data flows in at a remarkable speed and needs to be processed efficiently and rapidly
The most important extensions of this definition include innovation, which needs to be used in order to apply all three V’s effectively, and decision making, which is the ultimate result of processing big data.
What Does a Data Scientist Do?
In 2012, the Harvard Business Review pronounced data scientist to be the “sexiest job of the 21st century.” This statement was spurred not only by big data’s increasing role in businesses and governments but also by hard numbers: two of LinkedIn’s 2018 Top 5 emerging jobs were data scientist specializations. The question is, what do data scientists actually do?
In 1962, John W. Tukey wrote “The Future of Data Analysis,” in which he began to wonder whether his central role as a statistician was shifting towards what he called data analysis.
While professions in data analysis are common today, the same ambiguities that plagued the burgeoning discipline in the late 20th century are now being replicated in today’s hottest field: data science. While there are many definitions of what exactly data scientists do, there are a few key roles they fill, which include:
- Collecting, cleaning and transforming often unstructured, large amounts of data
- Utilizing computer programming languages in order to do so (most often SAS, SPSS, Python, R)
- Finding patterns in and modelling the big data of enterprises and governments in order to increase profit or improve strategies
- Inventing better ways of dealing with big data
- Automating these processes
The last point is an important implication of data science. While many businesses hire data scientists to discover profitable ways to convert their large stores of customer data into better business practices, other sectors look to data scientists to drive the future. The skills that are most commonly linked to this drive towards automation are machine learning and deep learning.
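To make the first few roles in the list above more concrete, here is a minimal sketch of collecting, cleaning and transforming data in Python, using only the standard library. The records, the field names (`customer`, `amount`) and the cleaning rules are all made up for illustration; real projects usually involve far larger datasets and dedicated libraries.

```python
# Hypothetical raw records, as they might arrive from a messy export
raw_records = [
    {"customer": " Alice ", "amount": "12.50"},
    {"customer": "BOB", "amount": "n/a"},
    {"customer": "alice", "amount": "7.25"},
    {"customer": "", "amount": "3.00"},
]


def clean(records):
    """Normalize names and drop rows with a missing name or unusable amount."""
    cleaned = []
    for record in records:
        name = record["customer"].strip().title()
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        if not name:
            continue  # skip rows with no customer name
        cleaned.append({"customer": name, "amount": amount})
    return cleaned


def total_per_customer(records):
    """Transform cleaned rows into one total per customer."""
    totals = {}
    for record in records:
        name = record["customer"]
        totals[name] = totals.get(name, 0.0) + record["amount"]
    return totals


totals = total_per_customer(clean(raw_records))
print(totals)
```

Wrapping the steps in functions is also a small nod to the last role in the list: once the cleaning rules are written down as code, they can be rerun automatically on every new batch of data.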
What are Machine Learning and Deep Learning?
Once imagined as a far-off achievement of the future, artificial intelligence, or AI, is now used daily in everything from digital assistants to face recognition in photos. While many tend to think that AI belongs only to computer science, it also draws heavily on data science. Two important examples of this are machine learning and deep learning.
One of the first attempts at defining machine learning was made by Alan Turing. In 1950, in his seminal paper “Computing Machinery and Intelligence,” Turing sought to discover whether machines could actually “think.”
A year after this paper was published, Marvin Minsky and Dean Edmonds introduced the first artificial neural network machine, a simulation loosely modelled on the neurons of the human brain. Since then, machine learning has come to mean computers receiving information and learning from that data themselves, becoming more accurate as they continue to “learn.”
While machine learning is widely considered to be a subset of the field of AI, deep learning is seen as a subset of machine learning. Taking automation a step further, the goal of deep learning is for machines to be able to recognize objects and concepts without being programmed to do so.
One famous example of both machine and deep learning involves cats. In machine learning, a machine can be trained to recognize a cat through a series of algorithmic steps. Deep learning, by contrast, suggests that a machine exposed to a massive number of cat images will eventually learn to recognize a cat on its own.
In other words, one involves manipulating data in order for machines to predict outcomes, and the other involves introducing vast amounts of data so that machines can predict outcomes. Needless to say, both of these processes are moves toward predictive analytics.
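The idea of a machine learning from examples in order to predict outcomes can be sketched in a few lines of Python. This is only a toy illustration, not how production systems work: the measurements are invented, the two features (weight and ear length) are assumptions chosen for the example, and the method shown is a simple nearest-centroid classifier rather than deep learning.

```python
from statistics import mean


def train(examples):
    """Learn one average feature vector (centroid) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Made-up measurements: (weight in kg, ear length in cm)
training_data = [
    ((4.0, 6.0), "cat"),
    ((5.0, 7.0), "cat"),
    ((3.5, 6.5), "cat"),
    ((20.0, 12.0), "dog"),
    ((30.0, 14.0), "dog"),
    ((25.0, 10.0), "dog"),
]

model = train(training_data)
print(predict(model, (4.5, 6.2)))  # an animal the model has not seen before
```

Nobody wrote a rule such as "cats weigh less than 10 kg" here. The program derived its own summary of each animal from the labelled examples, and feeding it more examples would refine that summary.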
How to Become a Data Scientist?
Fortunately, whether you just want to learn more about the subject or are looking to refine the skills you’ve already acquired, there are plenty of sources to help people interested in becoming data scientists.
For those who are looking either for undergraduate or graduate data science programs that will suit them best, it is important to keep two things in mind:
- Is data science right for you?
- What data science jobs are out there?
When deciding which data science program is right for you, it is important to carefully compare the courses offered in each stage of the program. While some data science programs are heavily math-based and lean more on quantitative, theoretical applications, other programs are geared towards data-driven business intelligence.
One action that can help you decide which program will be meaningful is browsing job offers in data science that seem exciting and relevant to you and your interests. If you notice required skills that keep recurring in the job listings you are interested in, write them down and later compare them to the university programs you have selected. In this way, you can get a better sense of which program will not only be right for you but will help you find a job after you graduate.
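If you are already comfortable with a little Python, the tallying described above can itself be done with code. The sketch below assumes you have typed the skills from each listing into a list by hand; the listings and skill names are hypothetical, and `recurring_skills` is a helper invented for this example.

```python
from collections import Counter

# Hypothetical skills noted down from three job listings
job_listings = [
    ["Python", "SQL", "machine learning"],
    ["R", "SQL", "statistics"],
    ["Python", "SQL", "statistics", "Spark"],
]


def recurring_skills(listings, min_count=2):
    """Return (skill, count) pairs for skills named in at least min_count listings."""
    counts = Counter()
    for listing in listings:
        # Lowercase and de-duplicate so each listing counts a skill only once
        counts.update({skill.lower() for skill in listing})
    return [(skill, n) for skill, n in counts.most_common() if n >= min_count]


for skill, n in recurring_skills(job_listings):
    print(f"{skill}: {n} listings")
```

The skills at the top of the output are the ones to look for first when comparing course lists.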
If you are already a professional seeking to switch paths, one important step is to decide whether data science is right for you. To do so, it can also be useful to look at job listings that pique your interest and note the skills they commonly require.
If these skills sound like something you would like to acquire, and they don’t already overlap with your current profession, start seeking out ways to master them. Some common capabilities required of data scientists are:
- Analytical skills
- Data preparation
Fortunately, sites like Superprof offer a wide range of tutors in these fields, who can be found quickly and easily. For more information, data sets and data analysis tools, you can also check out online resources like Kaggle and the data science toolbox.