Monday 29 July 2019

Exploratory Data Analysis

EDA objective:

We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.

Types of Data

  1. Nominal: Labels with no inherent order.
  2. Ordinal: Qualitative values with an order. Only limited statistics apply. No meaningful "true zero".
  3. Interval Scale: Values with an order and meaningful distances between them. Discrete data such as year.
  4. Ratio Scale: Real numbers with a meaningful true zero.
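
The difference between the interval and ratio scales can be illustrated with a small sketch. The values and the `shift_origin` helper are made up for illustration: on an interval scale such as year the zero point is arbitrary, so moving it leaves differences unchanged but changes ratios.

```python
def shift_origin(values, offset):
    """Re-express values on a scale whose zero point is moved by `offset`."""
    return [v + offset for v in values]

years = [1000, 2000]                 # interval scale: the zero point is a convention
shifted = shift_origin(years, 500)   # hypothetical scale with a different origin

diff_before = years[1] - years[0]
diff_after = shifted[1] - shifted[0]
ratio_before = years[1] / years[0]
ratio_after = shifted[1] / shifted[0]

print(diff_before, diff_after)       # differences survive the shift
print(ratio_before, ratio_after)     # ratios do not, so "twice as large" has no meaning
```

On a ratio scale the zero is not a convention, so there is no such shift to apply and ratios keep their meaning.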

Types of Data Presentation:

There are three modes of data presentation: textual, tabular, and diagrammatic.

Diagrammatic presentation:

Histogram:

  1. Datatype: Discrete/Continuous Data
  2. Used to approximate the probability density function (frequency density).
  3. No space between bins, to indicate the continuous nature of the values.
  4. Indicates the distribution of the data across the bins.
  5. The frequency in each bin is proportional to the area of the bar.
  6. Uni-variate Analysis
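
The point about area can be checked with a minimal sketch. The sample values, the bin edges, and the `histogram_density` helper are assumptions made for illustration. Frequency density is taken as count / (n × bin width), so with unequal bin widths it is the area of a bar, not its height, that equals the share of observations.

```python
def histogram_density(values, edges):
    """Return (counts, densities) for the bins defined by sorted `edges`."""
    n = len(values)
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [lo, hi); the last bin also includes its upper edge.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[i + 1]):
                counts[i] += 1
                break
    densities = [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]
    return counts, densities

values = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.0, 6.5, 9.0]  # made-up sample
edges = [0, 2, 4, 10]                                         # unequal widths on purpose

counts, densities = histogram_density(values, edges)
areas = [d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities)]

print(counts)      # observations per bin
print(densities)   # bar heights
print(sum(areas))  # total area of the bars
```

The total area comes out as 1 (up to floating-point rounding), which is what lets the histogram stand in for a probability density.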

Bar Chart:

  1. Datatype: Categorical Data
  2. Shows the relation between two variables.
  3. Multi-variate analysis
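
As a rough illustration, a bar chart pairs each category with one number, here a count. The labels below are invented, and the text rendering is only a stand-in for a plotting library.

```python
from collections import Counter

labels = ["red", "blue", "red", "green", "blue", "red"]  # made-up categorical data
counts = Counter(labels)

# One bar per category; bars are separated because categories are not continuous.
for category, count in counts.most_common():
    print(f"{category:<6} {'#' * count} ({count})")
```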

Scatter (Multi-Variate):

  1. Datatype: Continuous Vs Continuous
  2. Relation (linear or non-linear) between a pair of variables.
  3. Visual representation of correlation.
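
A minimal sketch of why the plot matters alongside the correlation number. The data and the `pearson_r` helper are illustrative assumptions. The Pearson coefficient measures only linear association, so a clear U-shaped relation can still give a coefficient of zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
linear = [2 * v + 1 for v in x]      # points on a straight line
curved = [(v - 3) ** 2 for v in x]   # symmetric U-shape around x = 3

r_linear = pearson_r(x, linear)
r_curved = pearson_r(x, curved)
print(r_linear, r_curved)
```

The first coefficient is 1 and the second is 0, even though `curved` depends on `x` exactly. A scatter plot would show both relations at a glance.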

Boxplot:

  1. Datatype: Continuous
  2. The points outside the whiskers are designated as outliers.
  3. SIQR (semi-interquartile range) is half the interquartile range: (Q3 - Q1) / 2.
  4. Most of the data is expected in the range median ± 3 SIQR.
     What is its relation to the Normal Distribution?
  5. Uni-variate Analysis
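
The quantities above can be computed with a short sketch, which also offers one way to approach the question about the normal distribution. The sample is made up, and the 1.5 × IQR whisker rule is an assumption (Tukey's common convention), since these notes do not say which rule defines the whiskers.

```python
from statistics import NormalDist, quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up sample with one extreme value

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
siqr = iqr / 2

# Assumed whisker rule: fences 1.5 * IQR beyond the box edges.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < low_fence or v > high_fence]

# Range from the notes: median ± 3 SIQR.
lo, hi = q2 - 3 * siqr, q2 + 3 * siqr
inside = [v for v in data if lo <= v <= hi]
print(q1, q2, q3, siqr, outliers, len(inside))

# Relation to the normal distribution: the quartiles of a standard normal
# sit at about ±0.6745, so SIQR is about 0.6745 standard deviations.
std_normal = NormalDist()
siqr_in_sigmas = std_normal.inv_cdf(0.75)
k = 3 * siqr_in_sigmas
coverage = std_normal.cdf(k) - std_normal.cdf(-k)
print(round(siqr_in_sigmas, 4), round(k, 2), round(coverage, 3))
```

For normally distributed data, median ± 3 SIQR is therefore close to mean ± 2 standard deviations and covers a little over 95% of the values, which is the sense in which "most of the data" falls in that range.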

Side-by-Side Boxplot (Multi-Variate):

  1. Data type: Continuous vs Categorical
  2. One box plot for each level of the categorical variable (for example Ozone vs Month). The plot is descriptive only; it is not a statistical test.
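
A side-by-side box plot is built from one summary per level of the categorical variable. The sketch below groups invented ozone readings by month (they are not taken from any real dataset) and computes a reduced summary for each group.

```python
from collections import defaultdict
from statistics import median

# Made-up (month, ozone) pairs, used only to show the grouping step.
readings = [
    ("May", 18), ("May", 23), ("May", 41),
    ("Jul", 59), ("Jul", 64), ("Jul", 97),
    ("Sep", 20), ("Sep", 23), ("Sep", 44),
]

groups = defaultdict(list)
for month, ozone in readings:
    groups[month].append(ozone)

# (minimum, median, maximum) per month: the skeleton of one box per level.
summary = {month: (min(v), median(v), max(v)) for month, v in groups.items()}
for month, stats in summary.items():
    print(month, stats)
```

Comparing the medians across months is what the plot does visually. Deciding whether the difference is more than noise needs a test.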

Tabular presentation:

Contingency Tables (Multi-variate analysis):

  1. Data type: Categorical Data Vs Categorical Data
  2. Relation between categorical variables (count and count %).
  3. Contingency tables can be analysed for association between rows and columns using the chi-squared test.
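
The chi-squared statistic for a contingency table can be worked through by hand. The 2 × 2 counts below are invented for illustration. Expected counts assume independence: row total × column total / grand total.

```python
observed = [[30, 10],   # made-up counts: rows and columns are two categorical variables
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count in each cell if rows and columns were independent.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Count % within each row, as in point 2 above.
row_pct = [[100 * o / total for o in row] for row, total in zip(observed, row_totals)]

print(expected, round(chi2, 3), dof, row_pct)
```

With one degree of freedom the 5% critical value is 3.841, so a statistic of about 16.7 would point to an association between rows and columns in this made-up table.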

Problems

  1. The randomness of the data presents a problem.
  2. Missing values need treatment.
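
Two common treatments for missing values are dropping them and filling them in. The sketch below shows both on made-up data. Using the median as the fill value is an assumption chosen for illustration, not a method prescribed by these notes.

```python
from statistics import median

raw = [41, None, 12, 18, None, 28]   # made-up readings, None marks a missing value

# Option 1: drop the missing observations.
observed = [v for v in raw if v is not None]

# Option 2: impute with the median of the observed values.
fill = median(observed)
imputed = [fill if v is None else v for v in raw]

print(len(raw) - len(observed), "missing")
print(observed)
print(imputed)
```

Dropping shrinks the sample, while imputing keeps its size but understates the spread, so the choice depends on how many values are missing and why.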

Cheat Sheet (Linear Data):

Continuous (One Variable):
  • Histogram. Helps to identify the intervals and ranges. No stats can be applied.
  • Box Plot. Helps to identify outliers. Stats can be applied.
  • Density Plots.

Continuous Vs Categorical (Multi Variable):
  • Side-by-Side Box Plot, if the categorical variable has a limited number of levels. Descriptive only, not a statistical test.
  • Z Test if the sample size >= 30
  • T Test if the sample size < 30

Continuous Vs Continuous (Multi Variable):
  • Scatter Plot. Helps to see whether the relation is linear or non-linear.

Categorical Vs Categorical (Multi Variable):
  • Two Way Table
  • Pivot Table in Excel
  • Chi-Square Test
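
The sample-size rule for the Z test and T test in the cheat sheet can be sketched as follows. The `two_sample_statistic` helper and the data are hypothetical. The sketch assumes a comparison of two group means with unequal variances and applies the threshold to the smaller group. It returns only the test statistic, since a p-value needs the reference distribution (normal for Z, Student's t for T).

```python
import math
from statistics import mean, variance

def two_sample_statistic(a, b):
    """Return (test_name, statistic) for the difference in means of a and b."""
    # Standard error of the difference, using each group's sample variance.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    stat = (mean(a) - mean(b)) / se
    # Rule of thumb from the cheat sheet: Z test for n >= 30, T test otherwise.
    name = "z" if min(len(a), len(b)) >= 30 else "t"
    return name, stat

small_a = [5.1, 4.9, 5.6, 5.8, 6.0]   # made-up measurements, group A
small_b = [4.1, 4.4, 4.0, 4.6, 4.9]   # made-up measurements, group B
small_name, small_stat = two_sample_statistic(small_a, small_b)

large_a = list(range(30))             # 30 made-up values per group
large_b = list(range(1, 31))
large_name, large_stat = two_sample_statistic(large_a, large_b)

print(small_name, round(small_stat, 3))
print(large_name, round(large_stat, 3))
```

The statistic has the same form in both cases. What changes with the sample size is the distribution it is compared against.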