Data transformation


Exploratory Data Analysis

1.9 Data Transformation Techniques

Data transformation techniques convert raw data into a clean, ready-to-use dataset. Which techniques are applied depends on the project or data pipeline.


1. Data Smoothing

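  • Removes noise from data so that underlying trends are easier to identify.
  • Techniques:
    • Clustering: Group similar values into clusters; values that fall outside every cluster are treated as outliers.
    • Binning: Sort the data, divide it into bins, and replace each value with its bin mean, bin median, or nearest bin boundary.
    • Regression: Fit a function that relates attributes, then use the fitted values in place of the noisy ones.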
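Of the three techniques, binning is the easiest to show in a few lines. The following is a minimal sketch of smoothing by bin means, assuming equal-frequency bins and a list length that divides evenly by the number of bins; the function name `smooth_by_bin_means` and the sample `prices` are illustrative, not taken from the notes.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumption: len(values) is divisible by n_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
smoothed = smooth_by_bin_means(prices, 3)
print(smoothed)
```

Each bin of three sorted values collapses to a single representative value, which removes small fluctuations while keeping the overall upward trend.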

2. Attribute Construction (Feature Construction)

  • Creates new features from existing attributes.
  • Example: From impressions and cost, create CPM (cost per mille, i.e. cost per thousand impressions).
  • Helps compare performance across records using a single metric.
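A constructed feature is usually a simple formula over existing columns. The sketch below derives CPM, which conventionally means cost per thousand impressions, from cost and impressions; the `campaigns` records and the `cpm` helper are hypothetical.

```python
def cpm(cost, impressions):
    """Cost per thousand impressions ("mille" means thousand)."""
    if impressions == 0:
        return None  # CPM is undefined when nothing was shown
    return cost * 1000 / impressions


# Hypothetical campaign records.
campaigns = [
    {"name": "A", "cost": 50.0, "impressions": 20000},
    {"name": "B", "cost": 120.0, "impressions": 80000},
]

# Attach the new feature to each record.
for campaign in campaigns:
    campaign["cpm"] = cpm(campaign["cost"], campaign["impressions"])

for campaign in campaigns:
    print(campaign["name"], campaign["cpm"])
```

Campaign B costs more in total but has the lower CPM, which is the kind of comparison the single derived metric makes possible.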

3. Data Generalization

  • Converts low-level attributes into high-level attributes using a concept hierarchy.
  • Example: Street → City → State → Country.
  • Useful for categorical data with a large number of distinct values.
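One way to picture a concept hierarchy is as a lookup table from each value to its parent. The sketch below assumes such a table (the `PARENT` dictionary and its entries are made up for illustration) and climbs it a chosen number of levels.

```python
# Hypothetical concept hierarchy: each value maps to the next level up.
PARENT = {
    "Market Street": "San Francisco",
    "Lombard Street": "San Francisco",
    "Broadway": "New York City",
    "San Francisco": "California",
    "New York City": "New York",
    "California": "USA",
    "New York": "USA",
}


def generalize(value, levels=1):
    """Climb `levels` steps up the hierarchy, stopping at the top."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value


streets = ["Market Street", "Lombard Street", "Broadway"]
cities = [generalize(s) for s in streets]
states = [generalize(s, levels=2) for s in streets]
countries = [generalize(s, levels=3) for s in streets]
print(cities, states, countries)
```

Three distinct street values shrink to two cities and one country, which is how generalization reduces the number of distinct categories.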

4. Data Aggregation

  • Summarizes raw data into a compact form.
  • Example: Calculate the average, sum, minimum, and maximum over a given time period.
  • Types: Time aggregation (summaries over a time period) and spatial aggregation (summaries over a location or region).
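Time aggregation can be sketched as grouping records by a period key and computing summary statistics per group. The example below assumes daily records with ISO-formatted dates and groups them by month; the `daily_sales` data is invented for illustration.

```python
from collections import defaultdict

# Hypothetical daily sales records: (ISO date, amount).
daily_sales = [
    ("2024-01-05", 100.0),
    ("2024-01-20", 300.0),
    ("2024-02-03", 50.0),
    ("2024-02-17", 150.0),
    ("2024-02-28", 250.0),
]

# Group amounts by month; the first 7 characters of an ISO date are "YYYY-MM".
by_month = defaultdict(list)
for date, amount in daily_sales:
    by_month[date[:7]].append(amount)

# Reduce each group to a compact summary.
summary = {
    month: {
        "sum": sum(amounts),
        "mean": sum(amounts) / len(amounts),
        "min": min(amounts),
        "max": max(amounts),
    }
    for month, amounts in by_month.items()
}
print(summary)
```

Spatial aggregation would follow the same pattern with a region or location as the grouping key instead of the month.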

5. Data Discretization

  • Converts continuous data into intervals.
  • Example: Age → Youth, Middle-aged, Senior.
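  • Methods: Equal-width, Equal-frequency, MDLP (entropy-based, using the Minimum Description Length Principle).
  • Reduces the number of distinct values, which can make algorithms more efficient.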
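The two unsupervised methods differ in what they hold constant: equal-width fixes the size of each interval, equal-frequency fixes the number of values per interval. A rough sketch of both follows, assuming the values are not all identical; the function names, the sample `ages`, and the mapping of bin numbers to labels are illustrative. MDLP is supervised and is not shown.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index using intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumption: hi > lo
    # The maximum value would fall into bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bin indices so each bin holds about the same number of values.
    Note: tied values may end up in different bins in this simple version."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


ages = [18, 22, 25, 31, 40, 47, 52, 66, 78]  # illustrative data
labels = ["Youth", "Middle-aged", "Senior"]

by_width = [labels[b] for b in equal_width_bins(ages, 3)]
by_freq = [labels[b] for b in equal_frequency_bins(ages, 3)]
print(by_width)
print(by_freq)
```

With this data the equal-width bins are uneven in size (four, three, and two values) because the ages are denser at the low end, while the equal-frequency bins hold three values each.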

6. Data Normalization

  • Scales attribute values into a small common range (for example 0 to 1) so that attributes are comparable.
  • Methods:
    • Min-Max Normalization → Linear transformation of the original range onto a new range.
    • Z-Score Normalization → Based on the mean and standard deviation.
    • Decimal Scaling → Moves the decimal point of the values.
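  • Prevents attributes with large ranges from dominating those with small ranges, which improves the performance of distance-based and gradient-based algorithms.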
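The three methods can be written as short formulas. The sketch below is one possible implementation under stated assumptions: the values are not all equal, the z-score uses the population standard deviation, and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The function names and the `incomes` data are illustrative.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)  # assumption: hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """v' = (v - mean) / standard deviation"""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation, assumed nonzero
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """v' = v / 10**j, with j the smallest integer such that max(|v'|) < 1"""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


incomes = [200, 400, 600, 800, 1000]  # illustrative data
scaled = min_max(incomes)
standardized = z_score(incomes)
shifted = decimal_scaling(incomes)
print(scaled)
print(standardized)
print(shifted)
```

Min-max and decimal scaling bound the result to a fixed range, while the z-score centers the data at 0 with unit standard deviation and leaves the range open.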

7. Data Integration

  • Combines data from different sources into one unified view.
  • Sources: Databases, Data cubes, Flat files.
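  • Approaches: Tight coupling (data is extracted from the sources, transformed, and loaded into a single store such as a data warehouse) and Loose coupling (data stays in the source systems and queries are translated to run against them directly).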
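At its simplest, integration means reconciling keys across sources and joining on them. The sketch below combines rows shaped like a database query result with a flat file, represented here as an in-memory CSV string; all names and data (`customers`, `orders_csv`, `cust_id`) are hypothetical.

```python
import csv
import io

# Source 1: rows as they might come back from a database query.
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ben"},
    {"customer_id": 3, "name": "Chen"},
]

# Source 2: a flat file, held in memory so the example is self-contained.
orders_csv = "cust_id,total\n1,250.00\n3,99.50\n"

# Parse the flat file and convert its text fields to proper types.
totals = {
    int(row["cust_id"]): float(row["total"])
    for row in csv.DictReader(io.StringIO(orders_csv))
}

# Unified view: a left join on the customer key. The two sources name the
# key differently (customer_id vs cust_id), which is reconciled here.
unified = [
    {**customer, "total": totals.get(customer["customer_id"])}
    for customer in customers
]

for row in unified:
    print(row)
```

Customers with no matching order keep a `total` of `None`, so the unified view never silently drops records from the first source.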

8. Data Manipulation

  • Alters or organizes data to make it readable and usable.
  • Helps identify patterns and generate insights.
  • Example: Grouping continuous age values into intervals for easier analysis.
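The age example can be sketched by mapping each age to a ten-year interval and counting how many values fall into each one. The `ages` list and the `decade_interval` helper are illustrative, and ten-year intervals are an assumed choice.

```python
from collections import Counter

ages = [23, 35, 31, 47, 52, 29, 64, 38]  # illustrative data


def decade_interval(age):
    """Map an age to a ten-year interval label such as "30-39"."""
    start = age // 10 * 10
    return f"{start}-{start + 9}"


# Sort first so the intervals appear in ascending order, then count.
counts = Counter(decade_interval(a) for a in sorted(ages))

for interval, count in counts.items():
    print(interval, count)
```

The grouped counts make the concentration of values in the 30-39 interval visible at a glance, which is harder to see in the raw list.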