Overview and topic description for D3M class at NYU Stern
“Data are widely available; what is scarce is the ability to extract wisdom from them.” (Hal Varian, UC Berkeley and Chief Economist, Google)
The quote above summarizes the main theme of this course. In every aspect of our daily lives, from the way we work, shop, and communicate to the way we socialize, we are both consuming and creating vast amounts of information. More often than not, these daily activities leave a trail of digitized data that is stored, mined, and analyzed by firms. The general goal of these data-driven initiatives is to generate intelligence that informs business decisions or public policies. For example, customer transaction databases provide vast amounts of high-quality data that allow firms to understand customer behavior and customize business tactics to increasingly fine segments, or even segments of one. The value of such data-driven policies has been demonstrated in many fields, including language translation, product recommendation, autonomous driving, and healthcare.
Successful implementation of any data-powered business logic requires a cross-functional team effort comprising data scientists, technology engineers, and, most importantly, translators of data into actionable strategies, i.e. managers like yourself. There is, however, a growing gulf between data scientists (who tend to be statisticians and computer scientists) and business managers, aggravated further by the plethora of machine learning algorithms and the technical jargon surrounding them. The objectives of this course are to bridge this gap by (1) exposing you to the language of data science, (2) providing hands-on experience with cutting-edge tools and techniques used in practice, and (3) exposing you to a wide variety of empirical contexts to instill an intuition for D3M, i.e. how to generate insights from large volumes of data.
For our first session we will go over the vocabulary of data analytics, which includes understanding: (1) types of data (e.g. cross-sectional vs. panel, time-series, geospatial), (2) types of variables (e.g. numeric, categorical, text, images), and (3) an overview of modern approaches to business analytics, ranging from simple visualizations to cutting-edge machine learning algorithms. We will also go over course logistics and take a look at two simple yet powerful software packages (JMP & Tableau) that we will be using throughout the semester.
Our second topic covers experimental design, often regarded as the “gold standard” for making causal inferences. We will discuss issues in the design of experiments and work through three hands-on case studies illustrating controlled experiments, A/B testing, and “natural” experiments. We will also look at some of the latest algorithms proposed by Google & Uber to help answer counterfactual, or what-if, questions.
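To make the A/B-testing idea concrete, here is a minimal sketch of comparing two ad variants with a two-proportion z-test. The conversion counts below are purely invented for illustration, and the normal approximation is computed by hand rather than with a statistics library:

```python
# A/B test sketch: did variant B convert better than variant A?
# Counts are hypothetical illustrative numbers, not real campaign data.
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z statistic, two-sided p-value) for H0: rate_a == rate_b."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# 200/5000 conversions for A vs. 260/5000 for B (invented numbers)
z, p = two_proportion_z(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up counts the test rejects equal conversion rates at conventional significance levels; with smaller samples the same observed lift would not be significant, which is exactly the sample-size question experimental design addresses.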
Any phenomenon we study (e.g. sales of our product, click-through rates for our ad campaign) depends on a large number of factors. For example, sales may depend on price, advertising, competitor pricing, etc. Regression-based models help us isolate the impact of different variables, i.e. they provide marginal estimates. They allow us to answer questions like (with caveats): how much additional web traffic can I expect if I spend $X on a Facebook ad, holding constant other factors that may affect web traffic? In addition, regression is a useful tool for forecasting, although better predictive algorithms exist, as we shall see later.
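As a sketch of the marginal-estimate idea, here is a simple one-variable regression of web visits on ad spend, fit by ordinary least squares. The spend and traffic numbers are invented for illustration; a real analysis would use multiple regression so the slope holds other traffic drivers constant:

```python
# OLS sketch: slope = expected extra visits per additional dollar of ad
# spend (the "marginal estimate"). All numbers are invented toy data.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]      # daily ad spend ($)
visits   = [120.0, 150.0, 195.0, 210.0, 260.0] # daily web visits

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(visits) / n

# OLS slope = cov(x, y) / var(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, visits))
         / sum((x - mean_x) ** 2 for x in ad_spend))
intercept = mean_y - slope * mean_x
print(f"visits ≈ {intercept:.1f} + {slope:.2f} * ad_spend")
```

Here each extra dollar of spend is associated with about 3.4 extra visits in this toy data; the usual caveats about causality from the experiments topic still apply.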
A large component of analytics is focused on developing predictive models. These range from simple forecasting tools to advanced machine learning algorithms that are deployed in real time (e.g. deciding which ad to show to a customer). This topic provides an overview of core ideas in predictive modeling, such as training/test data, popular algorithms for variable selection and modeling, metrics used to pick the appropriate model, and a framework for eventual deployment.
Unlike predictive modeling, where we target a specific outcome, we are often confronted with situations where the primary objective is exploratory data analysis: finding hidden patterns or groupings in data. This session looks at some important data-reduction techniques (PCA and clustering algorithms) that are used extensively in applications ranging from psychometric analysis to product positioning and market segmentation.
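As a sketch of the clustering idea, here is a bare-bones k-means on invented two-dimensional data containing two obvious groups. Real segmentation work would use a library or tool implementation (e.g. the clustering platforms in JMP), but the alternate-and-update logic is the same:

```python
# Bare-bones k-means: alternate between assigning points to the nearest
# center and moving each center to the mean of its group.
# Points and starting centers are invented toy values.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            # squared Euclidean distance to each center
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # move each center to the mean of the points assigned to it
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) for g in groups]
    return centers, groups

points = [(1, 1), (1, 2), (2, 1),    # one tight "segment" near the origin
          (8, 8), (8, 9), (9, 8)]    # a second segment far away
centers, groups = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
```

With these starting centers the algorithm converges immediately to the two group means; in practice results depend on initialization, which is why library implementations restart from several random seeds.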
Increasingly, the data generated in the modern world is unstructured. This includes text (e.g. consumer reviews) as well as images, audio, and video from social media platforms such as Instagram, Twitter & FB. This section provides an overview of recent advances in analyzing such unstructured data. We will look at core ideas in text analytics such as sentiment analysis and topic modeling. Applications include Twitter mining, Wikidata, and audio analytics using the Spotify API.
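A minimal illustration of lexicon-based sentiment scoring, the simplest form of sentiment analysis: count positive versus negative words in a review. The tiny word lists and example reviews below are invented; practical work uses much larger lexicons or trained models:

```python
# Lexicon-based sentiment sketch: score = (#positive - #negative) words.
# The lexicon and example reviews are tiny invented illustrations.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"terrible", "hate", "poor", "bad"}

def sentiment(text):
    """Return a crude sentiment score; > 0 suggests a positive review."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

pos_score = sentiment("I love this product, excellent quality")
neg_score = sentiment("terrible battery and poor service")
print(pos_score, neg_score)
```

Even this crude counter separates the two reviews; topic modeling, covered in the same session, goes further by discovering what the reviews are about, not just their tone.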