Understanding Data Science
- Data science is the field of study that combines mathematics and statistics to extract meaningful insights from data.
- It deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions.
- Data science uses math and statistics, specialized programming, advanced analytics, Artificial Intelligence (AI), and machine learning to extract insights from data and guide decision-making and strategic planning.
- Data science includes extraction, preparation, analysis, visualization, and maintenance of information.
- It involves cross-disciplinary knowledge from computer science, statistics, and mathematics.
The Data Science Lifecycle
The lifecycle of data science has five distinct stages:
- Capture – Data acquisition, entry, signal reception, data extraction. This stage involves gathering raw structured and unstructured data.
- Maintain – Data warehousing, cleansing, staging, processing, and architecture. This stage puts raw data into a usable form.
- Process – Data mining, clustering/classification, modeling, summarization. Data scientists examine the data's patterns, ranges, and biases to judge its usefulness for predictive analysis.
- Analyze – Exploratory/confirmatory analysis, predictive analysis, regression, text mining, qualitative analysis.
- Communicate – Data reporting, visualization, business intelligence, and decision-making. Results are presented in charts, graphs, and reports.
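The five stages above can be sketched as a simple pipeline. This is a minimal illustration with hypothetical stage functions and made-up records, not a real framework:

```python
# A minimal sketch of the five-stage lifecycle as a pipeline.
# All stage functions and records are illustrative placeholders.

def capture():
    # Capture: gather raw structured and unstructured records.
    return [{"id": 1, "reading": "42"}, {"id": 2, "reading": "17"}]

def maintain(raw):
    # Maintain: cleanse and stage raw data into a usable form.
    return [{"id": r["id"], "reading": int(r["reading"])} for r in raw]

def process(staged):
    # Process: summarize patterns for later analysis.
    readings = [r["reading"] for r in staged]
    return {"min": min(readings), "max": max(readings)}

def analyze(summary):
    # Analyze: derive a finding from the processed summary.
    return summary["max"] - summary["min"]

def communicate(result):
    # Communicate: report the result to stakeholders.
    return f"Observed range: {result}"

print(communicate(analyze(process(maintain(capture())))))  # Observed range: 25
```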
Data Science Tools
- Data Analysis: SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner
- Data Warehousing: Informatica, AWS Redshift, Wega
- Data Visualization: Jupyter, Tableau, Cognos, RAW
- Machine Learning: Spark MLlib, Mahout, Azure ML Studio
Phases of Data Analysis
- Data Requirements – Define what data is needed; it can then be collected, curated, and stored from different sources.
- Example: An application tracking the sleeping patterns of dementia patients requires several sensors: sleep data, heart rate, electro-dermal activity, and user activity. These are mandatory requirements for diagnosis. The data must also be categorized as numerical or categorical and stored in the right format.
- Data Collection – Data from multiple sources should be stored in the correct format and transferred to the right personnel. It can be collected from objects, events, and sensors.
- Data Processing – Preprocessing involves selecting and organizing datasets before analysis. Tasks include exporting datasets, structuring them, and formatting them correctly.
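As a sketch of this structuring step, raw string records can be parsed into typed fields before analysis. The field names and values below are illustrative, not from a real dataset:

```python
# Structure raw CSV-style rows into typed records before analysis.
# Field names and values are illustrative, not from a real dataset.

raw_rows = [
    "patient_id,heart_rate,activity",
    "p01,72,walking",
    "p02,65,sleeping",
]

def structure(rows):
    header = rows[0].split(",")
    records = []
    for line in rows[1:]:
        record = dict(zip(header, line.split(",")))
        # Format fields correctly: heart_rate is numerical,
        # activity stays categorical (a string label).
        record["heart_rate"] = int(record["heart_rate"])
        records.append(record)
    return records

print(structure(raw_rows))
```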
- Data Cleaning – Even preprocessed data may not be ready for analysis. It must be checked for incompleteness, duplicates, errors, and missing values.
- This involves record matching, finding inaccuracies, removing duplicates, and filling missing values.
- Example: Using outlier detection for quantitative data.
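A minimal sketch of these cleaning tasks, using the interquartile-range (IQR) rule for the outlier-detection example. The readings and thresholds are illustrative:

```python
import statistics

# Illustrative quantitative readings with a missing value, a duplicate, and an outlier.
readings = [70, 72, None, 72, 68, 71, 69, 300]

# Fill missing values with the median of the known values.
known = [r for r in readings if r is not None]
median = statistics.median(known)
filled = [median if r is None else r for r in readings]

# Remove duplicates while preserving order.
deduped = list(dict.fromkeys(filled))

# Flag outliers with the IQR rule: keep values in [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(deduped, n=4)
iqr = q3 - q1
cleaned = [r for r in deduped if q1 - 1.5 * iqr <= r <= q3 + 1.5 * iqr]
print(cleaned)  # the 300 reading is dropped as an outlier
```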
- Exploratory Data Analysis (EDA) – At this stage, the actual message in the data is understood. Various data transformation techniques may be required.
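One common transformation at this stage is a log transform, which compresses a skewed numeric range so patterns stand out. A minimal sketch with made-up values:

```python
import math
import statistics

# Illustrative right-skewed values (e.g., session durations in seconds).
values = [1, 2, 2, 3, 5, 8, 120]

# Summary statistics reveal the skew: the mean is pulled far above the median.
print(statistics.mean(values), statistics.median(values))

# A log transform compresses the long tail, making the distribution easier to explore.
transformed = [math.log(v) for v in values]
print([round(t, 2) for t in transformed])
```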
- Modeling and Algorithms – Models or formulas represent relationships among variables.
- Example: Total price of pens = UnitPrice × Quantity. Here, Total is the dependent variable, while UnitPrice and Quantity are independent variables.
- In general: Data = Model + Error.
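The pen-price model and the Data = Model + Error decomposition can be written out directly. The unit price and observed totals below are made up for illustration:

```python
# Model: Total price = UnitPrice * Quantity.
# Observed totals include small recording errors (illustrative values).
unit_price = 2.5
observations = [(4, 10.2), (10, 24.9), (6, 15.1)]  # (quantity, observed total)

def total_price(quantity):
    # The model's predicted total for `quantity` pens.
    return unit_price * quantity

for quantity, observed in observations:
    model = total_price(quantity)
    # Data = Model + Error, so Error = Data - Model (the residual).
    error = observed - model
    print(f"qty={quantity} data={observed} model={model} error={error:+.2f}")
```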
- Data Product – Any software that takes data as input, produces outputs, and feeds results back to users.
- Example: A recommendation system suggesting products based on purchase history.
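A minimal sketch of such a data product: suggestions based on which products are co-purchased in the purchase history. The catalog and baskets are made up:

```python
from collections import Counter
from itertools import combinations

# Illustrative purchase histories: each inner list is one customer's basket.
baskets = [
    ["pen", "notebook"],
    ["pen", "notebook", "eraser"],
    ["notebook", "eraser"],
    ["pen", "ruler"],
]

# Count how often each pair of products is bought together.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(set(basket)), 2):
        pair_counts[(a, b)] += 1

def recommend(product, k=2):
    # Suggest the products most often co-purchased with `product`.
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return [item for item, _ in scores.most_common(k)]

print(recommend("pen"))
```

Real recommendation systems use far richer signals (ratings, embeddings, collaborative filtering), but the feedback loop is the same: purchase data in, suggestions out.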
- Communication – Disseminating results to stakeholders for business intelligence. Data visualization techniques such as tables, charts, and diagrams are used.
Applications of Data Science
- Healthcare
- Gaming
- Image Recognition
- Recommendation Systems
- Fraud Detection
- Speech Recognition
- Airline Route Planning
- Virtual Reality