
End-to-End Phishing URL Detection System with MLOps Pipeline

A production-grade MLOps pipeline for network security threat detection — automating data ingestion from MongoDB through validation, transformation, model training with full experiment tracking, and real-time inference via a containerised FastAPI service deployed on AWS EC2 with GitHub Actions CI/CD.

  • 0 manual steps from data to deployed model

  • Full experiment reproducibility via MLflow + DagsHub

  • Auto CI/CD deploy on every commit via GitHub Actions

System Overview
[Figure: system overview diagram]

1. The Problem

Phishing and network-based threats remain among the most common and damaging attack vectors, yet most detection systems rely on static rule sets or manually retrained models that degrade quickly as attacker behaviour evolves. The core engineering challenge was building a system that could ingest raw network/URL feature data reliably, enforce data quality automatically, train and track models reproducibly across runs, and serve real-time predictions through a deployable API, all without manual intervention at any stage of the pipeline. Without a proper MLOps foundation, model updates require brittle ad-hoc scripts, experiments are untraceable, and deploying a new model version means downtime.

2. The Approach

We designed and implemented a modular, end-to-end MLOps pipeline with clearly separated stages: data ingestion from MongoDB, schema-based data validation, feature transformation, scikit-learn model training, and FastAPI-based inference serving. Each stage is encapsulated as an independent pipeline component with its own configuration entity and artifact output, making individual stages testable and replaceable. MLflow experiment tracking is connected to DagsHub for remote run storage and comparison. The final model and preprocessing artifacts are saved to Amazon S3 and pulled at inference time. A GitHub Actions workflow builds the Docker image on every push to main, pushes it to Amazon ECR, and deploys it to an AWS EC2 instance, giving the system a fully automated path from code commit to live prediction endpoint.
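
To make the "configuration entity in, artifact out" idea concrete, here is a minimal sketch of what one stage might look like. The class names, field names, and the injected fetch_records callable are illustrative assumptions rather than the project's actual code; in the real pipeline the records would come from MongoDB.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass(frozen=True)
class DataIngestionConfig:
    """Configuration entity: everything the stage needs to run (hypothetical fields)."""
    output_dir: Path
    file_name: str = "raw.csv"


@dataclass(frozen=True)
class DataIngestionArtifact:
    """Artifact: the only thing downstream stages receive from this stage."""
    raw_file_path: Path
    row_count: int


class DataIngestion:
    def __init__(self, config: DataIngestionConfig,
                 fetch_records: Callable[[], Iterable[dict]]):
        self.config = config
        # Injected data source; a MongoDB query in production, a stub in tests.
        self.fetch_records = fetch_records

    def run(self) -> DataIngestionArtifact:
        records = list(self.fetch_records())
        if not records:
            raise ValueError("ingestion returned no records")
        self.config.output_dir.mkdir(parents=True, exist_ok=True)
        path = self.config.output_dir / self.config.file_name
        with path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(records[0].keys()))
            writer.writeheader()
            writer.writerows(records)
        return DataIngestionArtifact(raw_file_path=path, row_count=len(records))


def fake_fetch():
    # Stand-in for the MongoDB read; column names are made up for illustration.
    return [
        {"url_length": 54, "has_ip_address": 0, "label": 1},
        {"url_length": 17, "has_ip_address": 0, "label": 0},
    ]


if __name__ == "__main__":
    config = DataIngestionConfig(output_dir=Path(tempfile.mkdtemp()))
    print(DataIngestion(config, fake_fetch).run())
```

Because the stage only depends on its config and an injected data source, it can be exercised in isolation with a stub, and the next stage only needs the artifact's fields, not the stage's internals.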

Technical Architecture

1. Storage Layer: MongoDB for raw network/URL feature data; Amazon S3 for processed datasets, trained model artifacts, and preprocessing pipelines

2. Data Ingestion Layer: Automated component that reads from MongoDB and exports structured data for downstream pipeline stages; schema defined in schema.yaml and enforced via data_schema components

3. Data Validation Layer: Schema-based validation that checks feature types, value ranges, and dataset integrity before allowing training to proceed; failures halt the pipeline with structured exceptions

4. ML Training Layer: scikit-learn model trainer with MLflow experiment tracking and DagsHub as the remote tracking server; all hyperparameters, metrics, and artifacts logged per run for full reproducibility

5. Inference Layer: FastAPI application (app.py) exposing GET /train to trigger the full pipeline and POST /predict to accept a CSV upload and return phishing/benign predictions; containerised with Docker and stored in Amazon ECR

6. CI/CD Layer: GitHub Actions workflow that builds the Docker image, authenticates to AWS, pushes to ECR, and deploys to the target EC2 instance on every push to main

7. Monitoring Layer: Drift tracking and model performance monitoring dashboard built in Power BI; application logs centralised for observability

8. Orchestration Layer: Apache Airflow DAGs for scheduling pipeline runs and coordinating data movement between stages
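
The Data Validation Layer in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: the schema is written inline as a dict (in the project it lives in schema.yaml), and the column names, ranges, and the DataValidationError class are hypothetical.

```python
class DataValidationError(Exception):
    """Structured exception: carries every problem found, not just the first."""

    def __init__(self, problems):
        self.problems = list(problems)
        super().__init__("; ".join(self.problems))


# Hypothetical schema; the real column names and bounds would come from schema.yaml.
SCHEMA = {
    "url_length": {"type": int, "min": 1, "max": 2048},
    "has_ip_address": {"type": int, "min": 0, "max": 1},
    "label": {"type": int, "min": 0, "max": 1},
}


def validate_rows(rows, schema=SCHEMA):
    """Check column presence, types, and value ranges; raise if anything is off."""
    problems = []
    if not rows:
        problems.append("dataset is empty")
    for i, row in enumerate(rows):
        missing = sorted(schema.keys() - row.keys())
        extra = sorted(row.keys() - schema.keys())
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        if extra:
            problems.append(f"row {i}: unexpected columns {extra}")
        for name, rule in schema.items():
            if name not in row:
                continue
            value = row[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, rule["type"]):
                problems.append(f"row {i}: {name} has type {type(value).__name__}")
                continue
            if not rule["min"] <= value <= rule["max"]:
                problems.append(f"row {i}: {name}={value} outside "
                                f"[{rule['min']}, {rule['max']}]")
    if problems:
        # Halting here keeps bad data from ever reaching the training stage.
        raise DataValidationError(problems)
    return len(rows)
```

Collecting all problems before raising means a single failed run reports the full extent of the schema mismatch instead of one column at a time.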

Results

  • Fully automated pipeline from raw MongoDB data to a live prediction endpoint with zero manual steps between stages

  • Reproducible model training with complete experiment history tracked in MLflow and DagsHub across all runs

  • Containerised deployment on AWS EC2 via ECR with automated CI/CD — new model versions reach production on every push to main

  • Schema-enforced data validation prevents corrupt or mismatched data from silently degrading model quality

  • Power BI monitoring dashboard enables ongoing drift tracking and performance visibility post-deployment

  • Modular pipeline architecture allows individual components (ingestion, validation, training) to be updated or swapped independently
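
The write-up does not say which drift statistic feeds the monitoring dashboard. As one common option, the sketch below computes a Population Stability Index (PSI) between a training-time feature sample and a recent production sample; the function name, bin count, and smoothing constant are assumptions for illustration.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a recent sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    training = list(range(100))
    live = [v + 50 for v in training]  # simulated upward shift in the feature
    print(population_stability_index(training, training))
    print(population_stability_index(training, live))
```

A widely used rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold is a convention and should be tuned per feature.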

Key Insights

Separating each pipeline stage into its own component with a typed configuration entity and artifact output is the single most important structural decision in an MLOps project — it's what makes individual stages independently testable, observable, and replaceable without breaking the rest of the pipeline.

Schema-based data validation is not optional: without it, silent schema drift from upstream data sources becomes the most common and hardest-to-debug cause of model performance degradation in production.

Connecting MLflow to a remote tracking server (DagsHub) from day one costs almost nothing to set up but pays back enormously — local-only experiment tracking is effectively no tracking at all once you need to compare runs across machines or team members.
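
As a rough sketch of how small that setup is: the snippet below points MLflow at a DagsHub-hosted tracking server and logs one run. The owner/repo values and the helper names are placeholders, and credentials are assumed to be supplied through the MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD environment variables rather than hard-coded.

```python
def dagshub_tracking_uri(owner: str, repo: str) -> str:
    """Build the MLflow tracking URI for a DagsHub repository."""
    return f"https://dagshub.com/{owner}/{repo}.mlflow"


def log_training_run(tracking_uri: str, params: dict, metrics: dict,
                     run_name: str = None) -> None:
    """Log one training run's hyperparameters and metrics to a remote server."""
    import mlflow  # imported lazily so the helper above works without mlflow installed

    mlflow.set_tracking_uri(tracking_uri)
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)


if __name__ == "__main__":
    # Placeholder owner/repo; requires mlflow plus valid credentials in the environment.
    uri = dagshub_tracking_uri("some-user", "some-repo")
    log_training_run(uri, {"n_estimators": 100}, {"f1_score": 0.0}, run_name="example")
```

The metric value in the usage example is a dummy, not a result from this project.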

GitHub Actions + ECR + EC2 is a lightweight but fully production-capable CI/CD path for containerised ML services; the main operational risk is environment variable management across GitHub Secrets, EC2 instance config, and the running container.
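
One way to reduce that risk is to have the service check its configuration once at startup and fail loudly if anything is missing. The sketch below assumes a hypothetical list of variable names; the project's actual variables are not given in this write-up.

```python
import os

# Hypothetical variable names, for illustration only.
REQUIRED_ENV_VARS = (
    "MONGO_DB_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
)


class MissingConfigError(RuntimeError):
    """Raised at startup when required configuration is absent or empty."""


def load_settings(environ=os.environ, required=REQUIRED_ENV_VARS):
    """Return the required settings, or raise one error naming every missing variable."""
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise MissingConfigError(
            "missing environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in required}
```

Calling load_settings() before the API starts serving turns a mismatch between GitHub Secrets, the EC2 instance, and the container into an immediate, named failure at deploy time instead of a runtime error on the first request.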

Tech Stack

Python, scikit-learn, FastAPI, MLflow, DagsHub, Apache Airflow, MongoDB, Amazon S3, Amazon ECR, Amazon EC2, Docker, GitHub Actions, Power BI
