I'm Mahdi, a Product and Data lead building products in the data space; I'm currently a Product Manager at Sifflet. Before transitioning to product, I spent seven years designing and building petabyte-scale data platforms, wearing different hats along the way (data engineer, tech lead, data architect, and MLOps engineer). I'm passionate about open-source projects and enjoy working with data and designing scalable solutions. You can also read my content on Medium and via the Data Espresso newsletter.
- Apache Spark (and the larger Databricks ecosystem): I used it on a daily basis for nearly five years, so we know each other pretty well.
- dbt: It's the tool I currently work with the most. At Zendesk, I added dbt to our data stack and worked on defining and implementing standards, frameworks, and automation to better leverage it at scale. (Article from the Zendesk Engineering blog)
- Snowflake: I was part of the core team that handled the transition from BigQuery to Snowflake at Zendesk.
- AWS Ecosystem: Worked with it for three years on various data and ML projects (mostly Glue, EMR, Athena, ECS, SageMaker, Redshift, and the AWS CI/CD stack).
- GCP Ecosystem: Worked with it for three years, mostly on BigQuery and GKE.
- Hadoop: Worked with Hadoop data lakes for two and a half years (it was the ecosystem that first introduced me to distributed systems and the paradigms/concepts behind them).
- Other notable projects/tools: Apache Superset, Apache Airflow, Apache Zeppelin, Apache Hive, Dremio, Jupyter, and D3.js.
- Languages I'm fluent in: Python, Java, and SQL.
- Other languages I've used in the past: C++, C#, JavaScript (Angular, Node.js), and HTML+CSS.
- IaC: Terraform and CloudFormation.
- End-to-End Batch Data Pipeline with Spark: A series of four projects that I authored for Manning Publications as part of their liveProjects platform. The series goes through the different steps of building an end-to-end Big Data pipeline. Learners get to use Apache Spark, Delta Lake, and Apache Superset.
- Building an End-to-End Open-Source Modern Data Platform: Proposes an exhaustive design (accompanied by the necessary Infrastructure-as-Code) to build a modern data platform solely using open-source projects and the resources offered by cloud providers.
- Writing design docs for data pipelines: Exploring the what, why, and how of design docs for data components — and why they matter.
- Navigating Your Career Transition in Tech: A Practical Roadmap: A practical guide to a successful career pivot in tech: from making the decision to thriving in your new role.
- Data Modeling Techniques for the Post-Modern Data Stack: A set of generic techniques and principles to design a robust, cost-efficient, and scalable data model for your post-modern data stack.
- Navigating Your Data Platform’s Growing Pains: A Path from Data Mess to Data Mesh: A set of strategies and guiding principles to efficiently scale your data platform while maximizing its business impact.
- A Simple (Yet Effective) Approach to Implementing Unit Tests for dbt Models: Proposes an innovative unit-testing approach for dbt models that relies on standards and dbt best practices.
- Creating Notebook-based Dynamic Dashboards: A design (accompanied by a POC) in which notebooks are leveraged to generate dynamic dashboards, to support a Google-like metadata search engine.
- Data Innovation Summit 2023: The Data Engineer's Guide to Data Quality Testing: The Fun, Easy, and Scalable Way
- Big Data Expo 2022: A Practical Case Study for Data Engineers: Performing Data Quality at Scale
- The Modern Data Show (S01E02): The third wave of data technologies