Data Fundamentals

Learning Objectives:
Data Manipulation, Merging Files, Data Transformation, Visualization

Data for class use: (1) US States, (2) US County Data

In most situations, the phenomenon that we are studying (our dependent variable) will be recorded in the database as either a continuous or a nominal (discrete) variable. For example, we may be interested in examining the closing stock price of AAPL or unit sales of the iPhone X, both of which will be recorded in the data as continuous variables.

Likewise, our data investigation may involve analyzing a phenomenon (dependent variable) that is discrete. For example, we may be testing click-through rates, where each observation in the data is recorded as Click/No-click. Similarly, variables like customer retention/defection and brand or platform choice (iPhone vs. Galaxy, or Android vs. iOS) are discrete in nature.

Most software/algorithms can easily handle dependent variables of any nature (including the unstructured data we will examine later). However, it is our job to recognize the nature of a variable and instruct our software on how to treat it. For example, our variable of interest may be coded as (1, 2, 3), where 1 = Android, 2 = iOS, 3 = Windows. Here the numeric coding is simply a label for the type of operating system, but unless instructed otherwise, the software would treat it as numeric.
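The class uses JMP's point-and-click interface, but the idea is language-neutral. A minimal Python sketch with hypothetical codes shows the right and wrong way to treat a nominal variable:

```python
from collections import Counter

# Hypothetical OS-choice column coded numerically; the numbers are labels, not magnitudes.
codes = [1, 2, 2, 3, 1, 2]
labels = {1: "Android", 2: "iOS", 3: "Windows"}

# Wrong treatment: averaging the codes yields a meaningless number like 1.83.
# Right treatment: map the codes to labels and tabulate frequencies.
os_choice = [labels[c] for c in codes]
counts = Counter(os_choice)
print(counts)  # a frequency table is the appropriate summary for a nominal variable
```

The same logic applies in JMP when you set a column's modeling type to Nominal rather than Continuous.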

The analytical approach taken to summarize, visualize, model, evaluate (and eventually deploy) our analysis will depend critically on this simple but important point.

Part 1: Summarizing Variables

Let's take two variables and summarize them: (1) Red_Blue_2016, (2) Per_Trump. The first variable is an indicator for whether Trump or Clinton won a particular county. The second is a continuous variable for the % of total votes in a county that Trump received.

   Per_Trump            Red_Blue   
 Min.   :0.04087   Blue_2016: 528  
 1st Qu.:0.54687   RED_2016 :3055  
 Median :0.66925                   
 Mean   :0.63716                   
 3rd Qu.:0.75600                   
 Max.   :0.94585                   
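The summary above comes from our stats software. For intuition, the same five-number summary (plus the mean) can be computed in plain Python; the vote shares below are illustrative, not the real county data:

```python
import statistics

# Illustrative vote shares (the real Per_Trump column has ~3,583 county rows)
per_trump = [0.62, 0.71, 0.45, 0.80, 0.55, 0.67, 0.30, 0.74]

q1, median, q3 = statistics.quantiles(per_trump, n=4)  # quartile cut points
summary = {
    "Min": min(per_trump), "1st Qu.": q1, "Median": median,
    "Mean": statistics.mean(per_trump), "3rd Qu.": q3, "Max": max(per_trump),
}
print(summary)
```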

Time & Geography

We should always pay attention to time and geography variables. In this case we have only cross-sectional data (each county has a single row). Let's summarize the variables above on a map.

You can drill down to County

Part 2: Data Transformation

Next we will look at some important “variable transformations” that are used in empirical analysis. We will be using these data transformations throughout the semester.

2.1: Log Transformation

Consider the variable population density. It measures how many people live in a county relative to its area. For example, rural counties tend to be large in area but have small populations, so their population density is low. In contrast, Los Angeles, Cook (Chicago), and New York (Manhattan) are densely populated counties with very large populations relative to their geographical area. Given the highly skewed distribution of population density, we will use a natural log transformation.

In JMP, right-click on a column and then choose New Formula Column–>Transform–>Log. This will create a new column that is the natural log of the original variable.
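Outside of JMP, the transformation is one line. A small sketch with made-up densities shows how the log compresses a heavily skewed scale:

```python
import math

# Illustrative densities spanning four orders of magnitude (people per sq. mile)
pop_density = [5.2, 48.0, 1200.0, 27000.0]
log_density = [math.log(d) for d in pop_density]  # natural log, as in JMP's Log transform

# The raw values span a factor of ~5,000; the logs sit in a narrow, comparable range.
print([round(v, 2) for v in log_density])
```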

2.2: Discretize a continuous variable

Sometimes it is useful to take a continuous variable and put it into discrete buckets. This is particularly useful for capturing “non-linearity” in your data. For example, in the customer management literature, the 80-20 rule suggests that 80% of a firm’s revenue (or profits) is generated by its best 20% of customers. To check this, we may divide total customer revenue into quintiles (5 equal buckets of 20% each). Similarly, decile analysis (dividing a continuous variable into 10 equal groups) is a common tactic in business analysis.

Here we will take the population density variable and put it into 20 buckets. There are approximately 3,000 counties in the US, so we are putting roughly 150 counties into each of 20 buckets, ordered from least to most densely populated. Let's plot the average % Trump and % Clinton votes by population density bucket. See a similar graph for the 2012 election in the Washington Post.

In JMP, we will be using an “add-in” to do this. You can download the add-in from this link. Its usage will be shown in class.
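The bucketing logic itself is simple: rank the values, then split the ranks into equal-count groups. A stdlib-Python sketch (with made-up densities and five quintiles instead of the 20 buckets used in class):

```python
def quantile_buckets(values, n_buckets):
    """Assign each value to one of n_buckets equal-count groups (1 = lowest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * n_buckets // len(values) + 1
    return buckets

# Ten toy density values into quintiles (the class exercise uses 20 buckets instead)
density = [3, 950, 12, 47, 4400, 210, 8, 88, 1600, 29]
print(quantile_buckets(density, 5))
```

Each county keeps its row; only a new bucket column is added, which we can then use as a discrete X in plots or models.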

2.3: Scaling or Standardizing a variable

A standardized variable (sometimes called a z-score) is a variable that has been re-scaled to have a mean of zero and a standard deviation of one. To create a standardized variable, for each row we subtract the mean of the original variable and divide by its standard deviation. Later in the course we will cover techniques like Principal Component Analysis (PCA) and clustering. In these methods, our input (and sometimes the output) will be scaled columns.

In JMP, right-click on a column, then choose New Formula Column–>Distributional–>Standardize. This will create a new column with a mean of zero and a standard deviation of 1.
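The definition above translates directly into code. A short Python sketch with made-up incomes:

```python
import statistics

def standardize(xs):
    """Rescale a column to mean 0 and (sample) standard deviation 1."""
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

income = [38_000, 52_000, 61_000, 47_000, 90_000]  # made-up median incomes
z = standardize(income)
print([round(v, 2) for v in z])  # same ordering as income, but centered and unit-scaled
```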

Let's create two new variables for standardized (or scaled) Population Density and Median_income. The table below shows the first 6 rows of the data.

To get an intuition for this, suppose we run the following 2 regressions:

\(\text{Model 1: } PerTrump = \alpha + \beta_{1} \cdot PopulationDensity + \beta_{2} \cdot MedianIncome\)

\(\text{Model 2: } PerTrump = \alpha + \beta_{1} \cdot ScaledPopulationDensity + \beta_{2} \cdot ScaledMedianIncome\)

Here we are running a very simple model of the % of votes Trump received in a county on the population density and median income of the county. Compare the output of the model when we use the original (raw) data vs. the scaled variables. Note that although the coefficients change (since they depend on the scale of the variables), the t-statistics and R-square are identical in the two models. Scaling the variables lets us judge which variable is more important directly from the coefficients (which we cannot do if the variables are not standardized).

                                     Dependent variable:     
                                     Raw           Scaled    
                                     (1)            (2)      
Pop_Density                        -0.00002                  
                                t = -15.297***               
Median_income                      -0.00000                  
                                t = -11.000***               
Scaled_Popdensity                                  -0.039    
                                               t = -15.297***
Scaled_Income                                      -0.028    
                                               t = -11.000***
Constant                            0.755          0.637     
                                t = 72.545***  t = 253.765***
Observations                        3,583          3,583     
R2                                  0.104          0.104     
Adjusted R2                         0.104          0.104     
Residual Std. Error (df = 3580)     0.150          0.150     
F Statistic (df = 2; 3580)        208.260***     208.260***  
Note:                             *p<0.1; **p<0.05; ***p<0.01
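The invariance shown in the table is easy to verify numerically. A one-predictor Python sketch (made-up x and y, ordinary least squares written out by hand) shows that standardizing the predictor rescales the slope by the predictor's standard deviation but leaves R-square untouched:

```python
import statistics

def simple_ols(x, y):
    """Intercept, slope, and R^2 of a one-predictor least-squares fit."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

x = [1.0, 2.0, 4.0, 7.0, 9.0]   # raw predictor (made-up)
y = [2.1, 2.9, 5.2, 8.1, 9.8]   # outcome (made-up)
mx, sx = statistics.mean(x), statistics.stdev(x)
z = [(xi - mx) / sx for xi in x]  # standardized predictor

_, b_raw, r2_raw = simple_ols(x, y)
_, b_std, r2_std = simple_ols(z, y)
# The slope is rescaled by stdev(x); the fit quality (R^2) is unchanged.
```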

2.4: Variable Recode

Finally, we will look at “recoding” a variable. This is particularly important when you have a large number of categorical variables (for example, in survey-type data). For this example, let's recode the Census_Region variable from 4 categories into a dummy variable for “South” vs. “Non-South”.

In JMP, right-click on a variable and select “Recode”. See this video for details.
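In code, this recode is a one-line mapping. A Python sketch with a made-up region column, producing both the text recode and the 0/1 dummy we would use in a regression:

```python
# Census_Region has four categories; collapse it to a South / Non-South dummy.
region = ["South", "Midwest", "Northeast", "South", "West"]  # made-up rows
south = ["South" if r == "South" else "Non-South" for r in region]
south_dummy = [1 if r == "South" else 0 for r in region]  # numeric version for modeling
print(south, south_dummy)
```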

Part 3: Relationship between variables

Consider a simple situation where the focal variable of interest (call it Y) is continuous or discrete. Suppose we are examining its relationship with a second variable (call it X). Consider the following four cases in the context of our current county-level data:
1. Continuous Y (% Trump votes in county) and Continuous X (% with College degree in county)
2. Continuous Y (% Trump votes in county) and Discrete X (Quintiles of counties by College)
3. Discrete Y (Red or Blue county?) and continuous X (% with College degree in county)
4. Discrete Y (Red or Blue county?) and Discrete X (Quintiles of counties by College)

In JMP: for all four cases, we use Analyze–>Fit Y by X.

3.1: Continuous Y & Continuous X

Here we can simply plot the variables and examine correlations. We can also run a regression of Y on X.
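The correlation itself is just the Pearson formula. A self-contained Python sketch, with made-up county pairs of % college-educated vs. % Trump vote:

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Made-up pairs: counties with more college graduates show lower Trump shares
college = [0.12, 0.18, 0.25, 0.33, 0.41, 0.52]
per_trump = [0.78, 0.72, 0.66, 0.55, 0.47, 0.35]
r = pearson_r(college, per_trump)
print(round(r, 3))  # strongly negative for this toy data
```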

We can extend the correlation graph above by repeating it for some other variable. Here, let's look at this correlation for counties by Census Region.

3.2: Continuous Y and Discrete X

Let's create a new variable ‘Education Quintiles’ based on the % of the population with a college degree in a county. Here we are putting US counties into five 20% groups based on what % of the population in that county has a college degree.
The analysis here is a mean comparison (ANOVA or regression).

                        Dependent variable:    
edu_quint2                     0.004           
edu_quint3                     0.001           
edu_quint4                   -0.028***         
edu_quintTop 20%             -0.192***         
Constant                     0.680***          
Observations                   3,583           
R2                             0.225           
Adjusted R2                    0.225           
Residual Std. Error      0.140 (df = 3578)     
F Statistic          260.301*** (df = 4; 3578) 
Note:               *p<0.1; **p<0.05; ***p<0.01

(The corresponding one-way ANOVA gives F = 260.301 with df = 4 and 3,578, p < 0.001, matching the regression above.)
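The quantity being compared across education groups is just the group mean. A small Python sketch with made-up quintile and vote-share columns:

```python
from collections import defaultdict
from statistics import mean

# Toy columns: education quintile and % Trump vote per county (made-up)
edu_quint = [1, 1, 2, 2, 5, 5]
per_trump = [0.70, 0.66, 0.69, 0.67, 0.50, 0.46]

by_group = defaultdict(list)
for q, share in zip(edu_quint, per_trump):
    by_group[q].append(share)
group_means = {q: mean(v) for q, v in sorted(by_group.items())}
print(group_means)  # ANOVA/regression tests whether these means differ
```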

3.3: Discrete Y & Discrete X

The visualization of two discrete variables uses a Mosaic Chart. For a statistical test, we could use a Chi-square test (contingency table) or run a regression (e.g., logistic regression).
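For intuition on the contingency-table test, here is the Pearson chi-square statistic computed by hand for a 2x2 table with made-up counts (JMP reports the same statistic with its p-value):

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total  # under independence
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Rows: Red / Blue county; columns: low / high college share (made-up counts)
observed = [[20, 10], [10, 20]]
print(round(chi_square_2x2(observed), 3))
```

A statistic of zero means the observed counts exactly match the counts expected under independence.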

3.4: Discrete Y and Continuous X

Run a logistic regression.
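In class, logistic regression is fit through the JMP interface. Purely for intuition, here is a minimal sketch of a one-predictor logistic regression fit by plain gradient descent, on made-up data where Red counties have lower college shares (not a production estimator):

```python
import math

def fit_logistic(x, y, lr=2.0, steps=10_000):
    """One-predictor logistic regression fit by gradient descent on the log-loss."""
    a = b = 0.0  # intercept, slope
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))  # predicted P(y=1)
            grad_a += (p - yi) / n
            grad_b += (p - yi) * xi / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Made-up data: Red county (1) vs. fraction of population with a college degree
college = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
red = [1, 1, 1, 1, 0, 0, 0, 0]
a, b = fit_logistic(college, red)

def predict(xi):
    """Fitted probability that a county is Red given its college share."""
    return 1.0 / (1.0 + math.exp(-(a + b * xi)))
```

The fitted slope is negative here, so the predicted probability of a Red county falls as the college share rises, which is the pattern the model output would show.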

Appendix: JMP Video Tutorial

JMP Tutorial Beginner (Approx 15 min): Watch it here

Data Cleaning in JMP (long 60 min video but very useful): Watch it here

Vishal Singh