Understanding Data Science

  • Data science is an interdisciplinary field that combines mathematics, statistics, specialized programming, advanced analytics, Artificial Intelligence (AI), and machine learning to extract meaningful insights from data.
  • It deals with vast volumes of data, using modern tools and techniques to find hidden patterns, derive meaningful information, and guide decision-making and strategic planning.
  • Data Science includes extraction, preparation, analysis, visualization, and maintenance of information.
  • It involves cross-disciplinary knowledge from computer science, statistics, and mathematics.

The Data Science Lifecycle

The lifecycle of data science has five distinct stages:

  1. Capture – Data acquisition, entry, signal reception, data extraction. This stage involves gathering raw structured and unstructured data.
  2. Maintain – Data warehousing, cleansing, staging, processing, and architecture. This stage puts raw data into a usable form.
  3. Process – Data mining, clustering/classification, modeling, summarization. Scientists examine data patterns, ranges, and biases for predictive analysis.
  4. Analyze – Exploratory/confirmatory analysis, predictive analysis, regression, text mining, qualitative analysis.
  5. Communicate – Data reporting, visualization, business intelligence, and decision-making. Results are presented in charts, graphs, and reports.
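The five stages above can be sketched as a minimal pipeline of functions. This is an illustrative toy, not a real system: the record source, the stage bodies, and the summary statistic are all hypothetical placeholders chosen to show how each stage hands its output to the next.

```python
def capture():
    # Capture: gather raw records (here, a hard-coded stand-in for a data source)
    return [{"id": 1, "value": "42"}, {"id": 2, "value": None}]

def maintain(raw):
    # Maintain: cleanse and stage the raw data -- drop records missing a value
    return [r for r in raw if r["value"] is not None]

def process(staged):
    # Process: convert staged records into a usable numeric form
    return [int(r["value"]) for r in staged]

def analyze(values):
    # Analyze: compute a simple summary statistic
    return sum(values) / len(values)

def communicate(result):
    # Communicate: report the finding in a readable form
    return f"Average value: {result}"

print(communicate(analyze(process(maintain(capture())))))
```

Each stage only consumes the previous stage's output, which mirrors how the lifecycle moves data from raw capture to a communicated result.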

Data Science Tools

  • Data Analysis: SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner
  • Data Warehousing: Informatica, AWS Redshift, Wega
  • Data Visualization: Jupyter, Tableau, Cognos, RAW
  • Machine Learning: Spark MLlib, Mahout, Azure ML Studio

Phases of Data Analysis

  1. Data Requirements – Before any data is gathered, the analysis must specify what data is needed, from which sources it will be collected and curated, and in what form it will be stored.

    • Example: An application tracking the sleeping pattern of dementia patients requires several sensor inputs: sleep data, heart rate, electrodermal activity, and user activity. These are mandatory requirements for diagnosis. Data must also be categorized as numerical or categorical and stored in the right format.
  2. Data Collection – Data from multiple sources should be stored in the correct format and transferred to the right personnel. It can be collected from objects, events, and sensors.

  3. Data Processing – Preprocessing involves selecting and organizing datasets before analysis. Tasks include extracting datasets, structuring them, and formatting them correctly.

  4. Data Cleaning – Even preprocessed data is not necessarily ready for analysis; it must still be checked for incompleteness, duplicates, errors, and missing values.

    • This involves record matching, finding inaccuracies, removing duplicates, and filling missing values.
    • Example: Using outlier detection for quantitative data.
  5. Exploratory Data Analysis (EDA) – At this stage, the data is explored to understand what it actually says; various data transformation techniques may be needed to reveal its structure.

  6. Modeling and Algorithms – Models or formulas represent relationships among variables.

    • Example: Total price of pens = UnitPrice × Quantity. Here, the total price is the dependent variable, while UnitPrice and Quantity are independent variables.
    • In general: Data = Model + Error.
  7. Data Product – Any software using data inputs to produce outputs and give feedback.

    • Example: A recommendation system suggesting products based on purchase history.
  8. Communication – Disseminating results to stakeholders for business intelligence. Data visualization techniques such as tables, charts, and diagrams are used.
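The cleaning step (phase 4) can be illustrated with a short stdlib-only sketch covering the tasks listed above: removing duplicates, filling a missing value, and flagging outliers in quantitative data. The sensor readings, the median fill, and the 1.5 × IQR outlier rule are illustrative assumptions, not prescriptions.

```python
import statistics

# Hypothetical sensor readings with a duplicate, a missing value, and an outlier
readings = [10.1, 10.1, 9.8, None, 10.3, 200.0, 10.0]

# Record matching / duplicate removal: drop exact duplicates, preserving order
deduped = list(dict.fromkeys(readings))

# Fill the missing value with the median of the known readings
known = [x for x in deduped if x is not None]
filled = [statistics.median(known) if x is None else x for x in deduped]

# Outlier detection for quantitative data: the common 1.5 * IQR fence
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
outliers = [x for x in filled if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)  # the 200.0 reading is flagged
```

Median filling is used here rather than the mean because the mean itself would be pulled toward the outlier before it has been removed.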
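The "Data = Model + Error" idea from phase 6 can be made concrete with the pen-price example: observed totals deviate slightly from UnitPrice × Quantity, and a least-squares fit recovers the unit price while the residuals capture the error term. The quantities and noisy totals below are made up for illustration.

```python
quantities = [1, 2, 3, 4, 5]
observed_totals = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly 2.0 per pen, plus noise

# Least-squares slope for a line through the origin: sum(x*y) / sum(x*x)
unit_price = (
    sum(q * t for q, t in zip(quantities, observed_totals))
    / sum(q * q for q in quantities)
)

# Residuals are the "Error" term: Data - Model
residuals = [t - unit_price * q for q, t in zip(quantities, observed_totals)]
print(round(unit_price, 3))
```

Here the model is the fitted line, the data are the observed totals, and what the line cannot explain is left in the residuals.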
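A data product (phase 7) such as the recommendation example can be sketched in a few lines: suggest items that other customers with overlapping purchase histories bought, which the current user does not yet own. The customers, items, and co-purchase scoring rule are toy assumptions.

```python
from collections import Counter

# Hypothetical purchase histories
purchases = {
    "alice": {"pen", "notebook", "ink"},
    "bob":   {"pen", "notebook"},
    "carol": {"pen", "stapler"},
}

def recommend(user):
    """Rank items bought by customers who share a purchase with `user`."""
    owned = purchases[user]
    counts = Counter()
    for other, items in purchases.items():
        if other != user and owned & items:  # shares at least one purchase
            counts.update(items - owned)     # count only items the user lacks
    return [item for item, _ in counts.most_common()]

print(recommend("bob"))
```

The input is a user's purchase history, the output is a ranked suggestion list, and new purchases feed back into `purchases` to refine future output, matching the definition of a data product above.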


Applications of Data Science

  • Healthcare
  • Gaming
  • Image Recognition
  • Recommendation Systems
  • Fraud Detection
  • Speech Recognition
  • Airline Route Planning
  • Virtual Reality