AI Engineer & Data Engineer · Toronto, ON

Scott Shi.

AI Engineer & Data Engineer with 10+ years across Canada's top financial institutions. Currently building production-grade AI systems — LLM agents, RAG pipelines, multi-agent orchestrators — while architecting Azure Databricks platforms ingesting 40M+ records daily at TD Bank.

Scott Shi — Data Scientist II · TD Bank
10+
Years in Data
40M
Records / Day
5+
Enterprise Clients
7
Certifications
Open to full-time roles — Toronto · Remote · Open to U.S. relocation
Stack
PySpark · Azure Databricks · Delta Lake · Apache Airflow · Python · SQL · ADLS Gen2 · AWS Glue · Azure Synapse · LLaMA 3.3 · RAG · ChromaDB · DuckDB · MLflow · Docker

About

01
Scott Shi

I'm Scott Shi, an AI Engineer & Data Engineer at TD Bank, Toronto, currently building production-grade AI systems alongside enterprise data platforms.

10+ years across Canada's top financial institutions (TD, Sun Life, Desjardins) and the NFL, delivering pipelines, cloud migrations, and ML feature stores that run at real scale.

On the AI side, I'm actively building LLM-based agents, RAG pipelines, and multi-agent orchestrators.

Building Now
DataScope — a SQL agent (LLaMA 3.3 70B + DuckDB), RAG pipeline (ChromaDB + ONNX), and multi-agent orchestrator. FastAPI · Streamlit · MLflow · Docker · GitHub Actions.
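A minimal sketch of DataScope's text-to-SQL loop, assuming an in-memory DuckDB database and a stubbed model call; the real agent calls LLaMA 3.3 70B and wraps this in the FastAPI service, and the table, data, and query here are illustrative:

    import duckdb

    def generate_sql(question: str, schema: str) -> str:
        # Stand-in for the LLM call (LLaMA 3.3 70B in the real project).
        # Hard-coded here so the sketch runs end to end without an API key.
        return "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total DESC"

    con = duckdb.connect()  # in-memory database for the demo
    con.execute("CREATE TABLE sales (region VARCHAR, amount DOUBLE)")
    con.execute("INSERT INTO sales VALUES ('East', 120.0), ('West', 80.0), ('East', 40.0)")

    schema = str(con.execute("DESCRIBE sales").fetchall())
    sql = generate_sql("Which region has the highest sales?", schema)
    print(con.execute(sql).fetchall())  # the agent summarises this result for the user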
Core Strength
End-to-end data systems — from raw ingestion through Delta Lake transformation to LLM-powered agents. Equally comfortable tuning Spark jobs or shipping production AI.
Background
MSc Economics, University of Guelph (2014). 8 professional certifications including Databricks Professional, AWS Data Engineer, and Microsoft AI Leader.

Skills

02
AI / ML
  • LLMs (LLaMA 3.3, Groq)
  • RAG Pipelines
  • Multi-agent Systems
  • ChromaDB / ONNX
  • MLflow
  • FastAPI / Streamlit
Data Engineering
  • PySpark / Spark SQL
  • Python / SQL / T-SQL
  • Apache Airflow
  • Delta Lake / Parquet
  • SAS (Advanced)
  • SSIS / ADF
Cloud & Data
  • Azure Databricks
  • ADLS Gen2 / Synapse
  • AWS S3 / Glue
  • Azure Data Factory
  • DuckDB
  • Databricks Workflows
DevOps & Tools
  • Docker / Compose
  • GitHub Actions / CI/CD
  • Git / Bitbucket
  • Tableau / Alteryx
  • Azure DevOps / JIRA
  • Unix Shell Scripting

Certifications

03
Databricks Certified Data Engineer Professional
Databricks
Verify →
Databricks Certified Data Engineer Associate
Databricks
Verify →
AWS Certified Data Engineer Associate
Amazon Web Services
Verify →
AWS Cloud Practitioner
Amazon Web Services
Verify →
SAS Certified Advanced Programmer for SAS 9
SAS Institute
Verify →
SAS Certified Base Programmer for SAS 9
SAS Institute
Verify →
Microsoft Certified: AI Transformation Leader
Microsoft
Verify →
Discrete Mathematics (Verified)
Coursera
Verify →

Experience

04
2024
Data Scientist II
TD Bank · Toronto, ON · Dec 2024 – Present

• Architected and delivered an end-to-end data platform on Azure Databricks (PySpark / Delta Lake) ingesting 40M+ records daily, improving pipeline SLA adherence from 94% to 99.8%.
• Engineered a storage optimization layer using Delta Lake partitioning, Z-ordering, and compaction that cut compute costs by 35% and query latency by 40%.
• Built a self-serve analytics framework enabling 5+ business teams to independently access curated datasets, reducing ad-hoc requests by 60%.
• Drove data governance across 50+ tables, establishing lineage documentation and data contracts.

Azure Databricks · PySpark · Delta Lake · Python · ADLS Gen2 · Data Governance
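
A minimal sketch of that storage optimization layer, assuming a Databricks notebook where a SparkSession named spark already exists; the table, partition column, and Z-order column are illustrative:

    # Illustrative table and column names.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS curated.transactions (
            txn_id STRING, account_id STRING, amount DECIMAL(18,2), txn_date DATE
        )
        USING DELTA
        PARTITIONED BY (txn_date)  -- date filters prune whole partitions
    """)

    # Compact small files and co-locate rows on a frequent filter column so
    # point lookups and range scans read fewer files.
    spark.sql("OPTIMIZE curated.transactions ZORDER BY (account_id)")

    # Remove data files no longer referenced by the Delta log (default retention applies).
    spark.sql("VACUUM curated.transactions")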
2022
Data Engineer
Adastra Corporation · Toronto, ON · Mar 2022 – Dec 2024

Delivered enterprise data projects across TD Bank, BMO, and the NFL:

NFL — Operational Data Engineer (Nov 2024 – Dec 2024)

• Monitored and maintained daily data ingestion processes across multiple platforms serving NFL operational reporting.
• Troubleshot ingestion failures, latency issues, and pipeline bottlenecks.
• Implemented automated checks, alerts, and validation frameworks to improve reliability and reduce downtime.
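
A minimal sketch of the kind of automated ingestion check described above, assuming a PySpark session and illustrative table, column, and threshold names:

    from datetime import datetime, timedelta

    def check_ingestion(spark, table, ts_col, min_rows, max_lag_hours):
        """Return human-readable failures for one ingested table."""
        failures = []
        df = spark.table(table)
        rows = df.count()
        if rows < min_rows:
            failures.append(f"{table}: only {rows} rows loaded (expected >= {min_rows})")
        latest = df.agg({ts_col: "max"}).collect()[0][0]  # assumes UTC, timezone-naive timestamps
        if latest is None or latest < datetime.utcnow() - timedelta(hours=max_lag_hours):
            failures.append(f"{table}: stale data, latest {ts_col} = {latest}")
        return failures  # the scheduler alerts on any non-empty list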

TD Bank — ED&A Frontier (Cloud Data Engineer)

• Designed a full-cycle distributed ETL platform on Azure Databricks processing 5+ TB of financial data daily.
• Led and mentored a 3-engineer squad.
• Achieved a 60% reduction in Spark job execution time through AQE tuning, shuffle optimization, and right-sized clusters.
• Reduced storage costs 25% via partition pruning and Parquet compression.
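
A minimal sketch of the AQE and shuffle settings involved in that tuning; the values shown are illustrative, since the real numbers were sized per workload and cluster:

    # Adaptive Query Execution: re-plan joins and partition counts at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions

    # Shuffle parallelism and broadcast threshold, sized to the cluster and data volume.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))  # broadcast dims up to 64 MB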

BMO — ASAP Machine Learning Platform (Mar 2022 – May 2023)

• Built 8 ML feature pipelines powering ASAP credit risk models over billions of customer records within tight latency SLAs.
• Achieved 50% processing speed improvement via Spark tuning, broadcast joins, and caching.
• Developed a modular PySpark library and Airflow DAG orchestration that reduced pipeline failure rate by 70%.
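
A minimal sketch of the broadcast-join and caching pattern used in those feature pipelines; the table, dataframe, and column names are illustrative:

    from pyspark.sql import functions as F
    from pyspark.sql.functions import broadcast

    txns = spark.table("curated.card_transactions")  # large fact table
    accounts = spark.table("curated.account_dim")    # small dimension table

    # Broadcasting the small side avoids shuffling billions of transaction rows;
    # the enriched frame is reused by several aggregations, so cache it once.
    enriched = txns.join(broadcast(accounts), "account_id").cache()

    features = (
        enriched.groupBy("customer_id")
                .agg(F.sum("amount").alias("total_spend"),
                     F.countDistinct("merchant_id").alias("distinct_merchants"))
    )
    features.write.mode("overwrite").saveAsTable("features.credit_risk_spend")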

TD Bank — DaaS Cloud Migration (Jun 2024 – Oct 2024)

• Led root-cause analysis and remediation of 20+ data discrepancies between on-prem and cloud, achieving 100% reconciliation accuracy across billions of records.
• Migrated legacy SSIS and T-SQL pipelines to modular PySpark, reducing reporting errors by 45% and cutting pipeline execution time by 30%.
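
A minimal sketch of the reconciliation checks behind that figure, assuming both sides are readable as Spark tables; the key and amount columns are illustrative, and the real process also compared schemas and business-rule aggregates:

    from pyspark.sql import functions as F

    def reconcile(spark, legacy_table, cloud_table, key_col, amount_col):
        """Compare row counts and per-key totals between legacy and migrated tables."""
        legacy, cloud = spark.table(legacy_table), spark.table(cloud_table)
        counts = (legacy.count(), cloud.count())

        # One row per key on each side, then keep keys whose totals disagree or are missing.
        l_sum = legacy.groupBy(key_col).agg(F.sum(amount_col).alias("legacy_total"))
        c_sum = cloud.groupBy(key_col).agg(F.sum(amount_col).alias("cloud_total"))
        mismatches = (
            l_sum.join(c_sum, key_col, "full_outer")
                 .where((F.col("legacy_total") != F.col("cloud_total"))
                        | F.col("legacy_total").isNull()
                        | F.col("cloud_total").isNull())
        )
        return counts, mismatches  # sign-off only when counts match and mismatches is empty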

PySpark · Azure Databricks · SSIS → PySpark Migration · Airflow · ML Feature Pipelines · T-SQL
2021
Senior Data Engineer
Desjardins · Montreal, QC · Nov 2021 – Mar 2022

• Built and maintained complex multi-source data pipelines serving 5+ internal product teams, resolving scalability bottlenecks and reducing ad-hoc query SLA breaches by 40%.
• Conducted data profiling and quality analysis across TB-scale datasets, implementing cleansing and enrichment strategies aligned with enterprise data integrity standards.
• Collaborated end-to-end with project managers and stakeholders across Agile delivery cycles.

Data Pipelines · Data Quality · Agile · SQL · TB-scale Data
2018
Data Engineer / Data Integrator
Sun Life Financial · Toronto, ON · Oct 2018 – Nov 2021

• Engineered and scaled the Centralized Campaign View platform using SAS and Spark, consolidating TB-scale data from 8+ digital channels to support 20+ Agile Marketing Pods.
• Developed Spark ingestion pipelines handling terabytes of customer interaction data for personalized campaign targeting of 3M+ policyholders.
• Automated batch jobs via SAS Macros, cutting manual processing by 50% and supporting $100M+ in annual marketing spend.

SAS · Apache Spark · SparkSQL · Python · Campaign Analytics
2015
Data Analyst
JF Insurance Agency Group Inc. · Toronto, ON · May 2015 – Oct 2018

• Built SAS/SQL ETL scripts to extract and transform data from multiple sources, enabling post-campaign analytics (ROI, response rate, incremental revenue) that informed strategic budget allocation.
• Delivered campaign performance reports using KPIs — ROI, response rate, new subscriber growth — that drove data-backed decisions across sales and marketing leadership.

SAS · SQL · ETL · Campaign Analytics

Projects

05
02 / Cloud Migration · TD Bank

DaaS Cloud Migration & Reconciliation

• Led the on-prem to Azure cloud migration for TD Bank's critical data infrastructure, achieving 100% reconciliation accuracy across systems serving billions of records.
• Migrated legacy SSIS packages and T-SQL stored procedures to modular PySpark pipelines, reducing reporting errors by 45% and cutting execution time by 30%.
• Maintained CI/CD hygiene via Bitbucket with zero-downtime deployments.

Azure Databricks · SSIS Migration · PySpark · T-SQL · CI/CD
Case Study →

Insights

06
Delta Lake

How We Cut TD Bank's Query Latency by 40% with Z-Ordering and Compaction

Most teams use Delta Lake but skip the tuning. Z-ordering, compaction strategies, and partition pruning are where the real performance gains live — here's exactly what we did.

10 min read · Delta Lake · Azure Databricks
ML Engineering

Building ML Feature Pipelines That Actually Hold Up in Production

After building production ML feature pipelines for major financial institutions, here's what makes the difference between pipelines that work in dev and ones that hold up under real SLAs.

12 min read · PySpark · ML Features
Data Migration

100% Reconciliation: Lessons from a Billion-Record On-Prem to Cloud Migration

Cloud migrations fail at data reconciliation. Here's the systematic approach — data profiling, iterative validation, Alteryx-based guardrails — that got us to perfect accuracy at TD Bank.

9 min read · Cloud Migration · Data Quality

Let's Connect

07

Always happy to connect — whether it's to talk data and AI, swap ideas on something you're building, or just say hello. No agenda needed.

Email
scottxinshi@gmail.com
GitHub
github.com/scottxinshi
LinkedIn
linkedin.com/in/scott-xin-shi
Location
Toronto, ON · Open to Remote