Significance of ds



SIGNIFICANCE OF EDA

➤ In science, economics, engineering, and marketing, large amounts of data are stored in electronic databases. Decisions should be made based on collected data.

➤ Datasets with many data points are hard to interpret without computer programs. To gain insights and support further decisions, data mining is performed, which comprises several analysis processes.

➤ Exploratory Data Analysis (EDA) is the first step in data mining. It helps to visualize and understand the data and to form hypotheses for further analysis. EDA summarizes the data and surfaces insights for the next steps without making prior assumptions about the data.

➤ Data scientists use EDA to understand what types of models and hypotheses can be developed. Its main components are summarizing data, statistical analysis, and visualization.

➤ Python tools for EDA:

  • pandas – summarizing
  • SciPy – statistical analysis
  • Matplotlib, Plotly – visualization
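
As a rough illustration of how these tools divide the work, the sketch below uses pandas to summarize, SciPy for a statistical test, and Matplotlib for a chart. The data values, the column names (`group`, `value`), and the choice of a two-sample t-test are assumptions made only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Hypothetical measurements for two groups (illustrative values only).
df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [5.1, 4.9, 5.4, 5.0, 5.2, 6.3, 6.1, 6.6, 6.0, 6.4],
})

# pandas: summarize the values per group (count, mean, std, quartiles).
summary = df.groupby("group")["value"].describe()
print(summary)

# SciPy: two-sample t-test comparing the two group means.
a = df.loc[df["group"] == "a", "value"]
b = df.loc[df["group"] == "b", "value"]
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Matplotlib: bar chart of the group means.
fig, ax = plt.subplots()
ax.bar(summary.index, summary["mean"])
ax.set_xlabel("group")
ax.set_ylabel("mean value")
```

In an interactive session, `plt.show()` or `fig.savefig(...)` would display or store the chart.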

STEPS IN EDA

  1. Problem Definition

    • Define the business problem before extracting insights.
    • Tasks include:
      o defining objectives
      o defining deliverables
      o outlining roles and responsibilities
      o checking the current status of the data
      o defining the timetable and performing a cost/benefit analysis
    • Based on this, an execution plan is created.
  2. Data Preparation

    • Prepare the dataset before analysis.
    • Tasks include:
      o defining data sources
      o defining schemas and tables
      o understanding characteristics of data
      o cleaning dataset
      o deleting irrelevant data
      o transforming data
      o dividing data into chunks for analysis
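
Several of the preparation tasks above can be sketched with pandas. This is a minimal example under stated assumptions: the inline CSV, its column names, the decision that `internal_note` is irrelevant, and the chunk size of two rows are all hypothetical.

```python
import io
import pandas as pd

# Hypothetical raw export; in practice this would come from a file or database.
raw_csv = io.StringIO(
    "order_id,region,amount,internal_note\n"
    "1,north,120.5,ok\n"
    "2,south,,check\n"
    "2,south,,check\n"
    "3,north,87.0,ok\n"
    "4,east,40.25,ok\n"
)
df = pd.read_csv(raw_csv)

# Understand the characteristics of the data: column types and missing values.
print(df.dtypes)
print(df.isna().sum())

# Clean the dataset: drop exact duplicate rows, then rows with no amount.
clean = df.drop_duplicates().dropna(subset=["amount"])

# Delete irrelevant data: the free-text note is assumed not to matter here.
clean = clean.drop(columns=["internal_note"])

# Transform data: store the region as a categorical column.
clean["region"] = clean["region"].astype("category")

# Divide the data into chunks of two rows each for analysis.
chunk_size = 2
chunks = [clean.iloc[i:i + chunk_size] for i in range(0, len(clean), chunk_size)]
print(len(chunks), "chunks")
```

Whether rows with missing values should be dropped or filled depends on the problem defined in step 1; dropping is used here only to keep the sketch short.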
  3. Data Analysis

    • Involves descriptive statistics and analysis.

    • Tasks include:
      o summarizing data
      o finding hidden correlations
      o identifying relationships
      o developing predictive models
      o evaluating models and calculating accuracies

    • Techniques used for summarization:
      o Summary Tables
      o Graphs
      o Descriptive Statistics
      o Inferential Statistics
      o Correlation Statistics
      o Searching
      o Grouping
      o Mathematical Models
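
A few of the summarization techniques listed above (descriptive statistics, grouping into a summary table, searching, and correlation) might look like this in pandas. The sales records and column names are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical sales records used only for illustration.
# Revenue is exactly 10 x units in this toy data, so the correlation is 1.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "east", "east"],
    "units": [10, 14, 7, 9, 12, 16],
    "revenue": [100.0, 140.0, 70.0, 90.0, 120.0, 160.0],
})

# Descriptive statistics for the numeric columns.
print(df[["units", "revenue"]].describe())

# Grouping and summary table: total and mean revenue per region.
summary_table = df.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary_table)

# Searching: select the rows that match a condition.
large_orders = df[df["units"] > 10]
print(large_orders)

# Correlation statistics: Pearson correlation between units and revenue.
corr = df["units"].corr(df["revenue"])
print(f"correlation = {corr:.2f}")
```

Inferential statistics and mathematical models are left out of this sketch; they usually build on summaries like these.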

  4. Development and Representation of Results

    • Present results to stakeholders in an easy-to-understand form.
    • Use graphs, summary tables, maps, diagrams.
    • Graphical techniques include:
      o Scatter plots
      o Character plots
      o Histograms
      o Box plots
      o Residual plots
      o Mean plots
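
Three of the graphical techniques above (scatter plot, histogram, box plot) can be drawn side by side with Matplotlib. This is only a sketch: the data, the column names (`hours`, `score`), and the bin count are assumptions, and the remaining plot types are not shown.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: hours studied and exam score for eight students.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 73, 79, 85],
})

# One figure with three panels, one per plot type.
fig, (ax_scatter, ax_hist, ax_box) = plt.subplots(1, 3, figsize=(12, 4))

# Scatter plot: relationship between two variables.
ax_scatter.scatter(df["hours"], df["score"])
ax_scatter.set_xlabel("hours")
ax_scatter.set_ylabel("score")
ax_scatter.set_title("Scatter plot")

# Histogram: distribution of a single variable.
ax_hist.hist(df["score"], bins=4)
ax_hist.set_xlabel("score")
ax_hist.set_title("Histogram")

# Box plot: median, quartiles, and spread of a single variable.
ax_box.boxplot(df["score"])
ax_box.set_ylabel("score")
ax_box.set_title("Box plot")

fig.tight_layout()
```

To share the result with stakeholders, the figure could be written to an image file with `fig.savefig(...)`.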