Role Overview
As a Hybrid Data Scientist, you will sit at the intersection of large-scale data engineering and advanced statistical methodology. You will own the end-to-end lifecycle of Incremental Reach and Audience Measurement products: from architecting Python-based data pipelines to implementing Bayesian and machine learning models that quantify the lift of Digital media over a Linear TV baseline.
Key Responsibilities
1. Advanced Statistical Modeling (The "Science" Side)
Incremental Reach Frameworks (illustrative sketches follow this list):
Small-N Datasets: Implement Bayesian Model Averaging (BMA) to cycle through candidate regression specifications, providing robust coefficients and credible intervals when study data is limited.
Large-Scale Prediction: Deploy Gradient Boosted Regression Trees (GBM) to identify non-linear patterns and rank the impact of "Reach Drivers" (Media Weight, On-Target %, Frequency).
Audience Deduplication: Use Maximum Entropy (MaxEnt) models to estimate unique audience reach across fragmented platforms by reconciling census and panel data.
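Below are minimal, self-contained sketches of the three approaches above. They use synthetic data and hypothetical driver and column names (media_weight, on_target_pct, frequency), so treat them as illustrations of the techniques rather than production code.

First, a BMA sketch: enumerate candidate regression subsets, weight each fit by a BIC approximation of its posterior model probability, then average coefficients and report per-driver inclusion probabilities.

```python
# Bayesian Model Averaging over all driver subsets for a small-N study.
# BIC-based weights approximate posterior model probabilities; the driver
# names and data are synthetic placeholders.
import itertools

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
drivers = ["media_weight", "on_target_pct", "frequency"]
n = 25                                            # a deliberately small study
X = rng.normal(size=(n, len(drivers)))
y = 2.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

models = []
for k in range(1, len(drivers) + 1):
    for subset in itertools.combinations(range(len(drivers)), k):
        design = sm.add_constant(X[:, list(subset)])
        models.append((subset, sm.OLS(y, design).fit()))

# Weight each model by exp(-BIC/2), normalized (a standard BMA approximation).
bics = np.array([fit.bic for _, fit in models])
weights = np.exp(-(bics - bics.min()) / 2.0)
weights /= weights.sum()

for j, name in enumerate(drivers):
    coef = sum(w * fit.params[1 + subset.index(j)]
               for w, (subset, fit) in zip(weights, models) if j in subset)
    pip = sum(w for w, (subset, _) in zip(weights, models) if j in subset)
    print(f"{name}: averaged coef = {coef:.2f}, inclusion prob = {pip:.2f}")
```

The inclusion probabilities printed here are the Posterior Inclusion Probabilities referenced under Experimental Design below.

Second, a GBM sketch for the large-scale side, assuming LightGBM is installed: fit a boosted-tree regressor to a non-linear reach response and rank the drivers by importance.

```python
# Rank hypothetical Reach Drivers with gradient boosted trees (LightGBM).
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(5000, 3)),
                 columns=["media_weight", "on_target_pct", "frequency"])
# A non-linear response: saturating media-weight effect plus a quadratic term.
y = (np.tanh(X["media_weight"]) + 0.3 * X["on_target_pct"] ** 2
     + rng.normal(scale=0.1, size=len(X)))

gbm = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05).fit(X, y)
print(pd.Series(gbm.feature_importances_, index=X.columns)
        .sort_values(ascending=False))            # most influential drivers first
```

Third, a toy MaxEnt deduplication: with only per-platform reach as constraints, the maximum-entropy joint exposure distribution reduces to the independence estimate, which the optimizer recovers numerically.

```python
# Maximum-entropy deduplication across two platforms. p indexes the joint
# exposure states [neither, digital only, tv only, both]; the reach figures
# stand in for assumed panel/census inputs.
import numpy as np
from scipy.optimize import minimize

r_tv, r_dig = 0.40, 0.25

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))                  # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: p[2] + p[3] - r_tv},    # P(TV exposed)
    {"type": "eq", "fun": lambda p: p[1] + p[3] - r_dig},   # P(Digital exposed)
]
res = minimize(neg_entropy, np.full(4, 0.25),
               bounds=[(0.0, 1.0)] * 4, constraints=constraints)
print("deduplicated reach:", 1 - res.x[0])        # ~0.55
print("independence check:", r_tv + r_dig - r_tv * r_dig)
```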
Additional Frameworks (illustrative sketches follow this list):
Mixed-Effect Models: Use Hierarchical/Multilevel modeling to account for nested data (e.g., campaigns nested within specific industry verticals).
Causal Lift: Apply Synthetic Control Methods to measure incremental shifts in behavior for campaigns with fixed timeframes where a clean control group is unavailable.
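Hedged sketches of both frameworks, again on synthetic data with hypothetical names. First, a random-intercept model with statsmodels' MixedLM, letting each industry vertical carry its own baseline:

```python
# Hierarchical / mixed-effect model: campaigns nested within verticals.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_verticals, campaigns_per = 8, 40
vertical = np.repeat(np.arange(n_verticals), campaigns_per)
vertical_shift = rng.normal(scale=0.8, size=n_verticals)[vertical]
media_weight = rng.normal(size=vertical.size)
lift = (1.0 + 0.6 * media_weight + vertical_shift
        + rng.normal(scale=0.3, size=vertical.size))

df = pd.DataFrame({"lift": lift, "media_weight": media_weight,
                   "vertical": vertical.astype(str)})
fit = smf.mixedlm("lift ~ media_weight", df, groups=df["vertical"]).fit()
print(fit.summary())    # fixed media_weight effect + between-vertical variance
```

Second, a bare-bones synthetic control: learn convex weights over untreated donor markets that reproduce the treated market's pre-period trajectory, then read the lift off the post-period gap.

```python
# Synthetic control with a simplex constraint on donor weights (SLSQP).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
T_pre, T_post, n_donors = 30, 10, 12
donors = rng.normal(size=(T_pre + T_post, n_donors)).cumsum(axis=0)
treated = (donors @ rng.dirichlet(np.ones(n_donors))
           + rng.normal(scale=0.2, size=T_pre + T_post))
treated[T_pre:] += 1.5                     # simulated campaign effect

def pre_period_gap(w):
    return np.sum((treated[:T_pre] - donors[:T_pre] @ w) ** 2)

res = minimize(pre_period_gap, np.full(n_donors, 1.0 / n_donors),
               bounds=[(0.0, 1.0)] * n_donors,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
synthetic = donors @ res.x
print("estimated lift:", (treated[T_pre:] - synthetic[T_pre:]).mean())  # ~1.5
```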
2. Data Engineering & Pipeline Architecture (The "Engineering" Side)
Python-Centric ETL: Architect and maintain robust data pipelines using Python (Pandas, PySpark) to ingest, clean, and harmonize data from Linear TV logs and Digital ad servers (see the harmonization sketch after this list).
Feature Engineering: Automate the extraction of Base Drivers (GRP, Reach Efficiency, Seasonality) and Custom Drivers (Share of Voice, Flighting) into a schema ready for supervised learning.
Productionization: Wrap statistical models into production-grade APIs or scheduled containers (Docker/Airflow) to ensure repeatable and scalable measurement (see the Airflow sketch after this list).
Cloud Operations: Manage large-scale datasets within cloud data warehouses such as Snowflake running on AWS or GCP, optimizing SQL queries for high-performance analytics.
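A minimal Pandas harmonization sketch covering the ETL and feature-engineering items above. The schemas, the campaign, and the target-universe size are all invented; a real pipeline would pull from TV log files and ad-server exports (or PySpark equivalents at scale).

```python
# Harmonize Linear TV and Digital logs into one schema, then derive GRPs.
import pandas as pd

tv = pd.DataFrame({   # toy stand-in for a Linear TV log extract
    "air_date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
    "campaign_id": ["c1", "c1", "c1"],
    "spot_impressions": [1_200_000, 800_000, 950_000],
})
digital = pd.DataFrame({   # toy stand-in for an ad-server export
    "event_date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "campaign_id": ["c1", "c1"],
    "impressions": [400_000, 450_000],
})

tv_h = tv.rename(columns={"air_date": "date", "spot_impressions": "impressions"})
tv_h["channel"] = "linear_tv"
dg_h = digital.rename(columns={"event_date": "date"})
dg_h["channel"] = "digital"
unified = pd.concat([tv_h, dg_h], ignore_index=True)

# Base driver: GRPs = impressions / target universe * 100.
TARGET_UNIVERSE = 120_000_000              # assumed target-population size
daily = (unified.groupby(["campaign_id", "channel", "date"], as_index=False)
                ["impressions"].sum())
daily["grp"] = daily["impressions"] / TARGET_UNIVERSE * 100
print(daily)
```

For the productionization item, a skeletal Airflow DAG that scores fresh data on a daily schedule. The DAG id and callable body are illustrative, and the `schedule` argument assumes Airflow 2.4+ (older versions use `schedule_interval`).

```python
# Daily scheduled scoring task wrapped in an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def score_latest_data():
    """Load the persisted model, pull yesterday's features, write predictions."""
    ...  # hypothetical scoring logic lives here

with DAG(dag_id="incremental_reach_scoring",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",
         catchup=False) as dag:
    PythonOperator(task_id="score", python_callable=score_latest_data)
```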
3. Experimental Design & Methodology
Control/Test Logistics: Design scientifically valid Control and Test groups, ensuring proper randomization or using Propensity Score Matching to mitigate selection bias (a matching sketch follows this list).
Variable Importance: Provide stakeholders with Posterior Inclusion Probabilities to identify which media levers (Duration, Weight, etc.) most consistently drive incremental reach.
Cross-Media Calibration: Reconcile Linear TV's "One-to-Many" metrics with Digital's "One-to-One" tracking to provide a unified view of the consumer.
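A short Propensity Score Matching sketch for the control/test item above: model exposure probability from hypothetical covariates, then pair each exposed user with the unexposed user nearest in propensity score.

```python
# Propensity score matching: logistic propensity model + 1-NN matching.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 4))             # e.g. age, income, tv_hours, ...
exposed = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))  # selection on X

ps = LogisticRegression().fit(X, exposed).predict_proba(X)[:, 1]
test_idx = np.flatnonzero(exposed == 1)
pool_idx = np.flatnonzero(exposed == 0)

nn = NearestNeighbors(n_neighbors=1).fit(ps[pool_idx].reshape(-1, 1))
_, match = nn.kneighbors(ps[test_idx].reshape(-1, 1))
control_idx = pool_idx[match.ravel()]      # matched control group

print("mean propensity, test vs. matched control:",
      round(ps[test_idx].mean(), 3), round(ps[control_idx].mean(), 3))
```

Matching with replacement, as here, keeps every exposed user but can reuse controls; caliper or without-replacement variants trade bias against variance.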
Experience: 3-6 years of statistical model development; mastery of Python (specifically for data manipulation and ML) and advanced SQL. Experience with PySpark or Dask for distributed computing is a plus.
Statistical Mastery: Proven experience with GBMs (XGBoost/LightGBM) and Bayesian frameworks (e.g., PyMC, Stan, or R's BMA package), alongside the broader data science toolkit.
Media Knowledge: Understanding of Linear TV vs. Digital dynamics, including Reach/Frequency, GRPs, and Deduplication logic.
Education: Bachelor’s or Master’s in a quantitative field (Statistics, Computer Science, Economics) or equivalent professional experience.