Welcome to my GitHub space! I'm a passionate technologist specializing in data science, machine learning, and big data engineering. I transform complex data into actionable insights and build scalable solutions to solve real-world problems.
- Exploring cutting-edge technologies and methodologies
- Collaborating on open-source projects and innovative solutions
- Creating impact through data-driven decision making
Feel free to reach out for discussions on data science, ML/AI projects, or the latest tech trends. Let's build something amazing together!
Flight Delay Prediction System
End-to-end ML pipeline for predicting flight delays using weather data
- Technologies: Apache Spark, Scala, MLflow, Docker, GCP Dataproc
- ML Techniques: PCA feature engineering, Random Forest, k-fold cross-validation
- Results: 85.8% accuracy with complete CI/CD deployment pipeline
SVM Optimization Learning
Advanced implementation of Support Vector Machine optimization techniques
- Focus: Linear and non-linear separability scenarios
- Methods: Hinge loss, ramp loss, hard margin optimization
- Output: Comparative analysis with performance visualizations on 2D synthetic datasets
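As a rough illustration of the two losses named above, here is a minimal sketch in plain Python. The function names and the clipping parameter `s` are assumptions made for this example and are not taken from the repository.

```python
def hinge_loss(margin: float) -> float:
    """Hinge loss for a margin m = y * f(x): max(0, 1 - m)."""
    return max(0.0, 1.0 - margin)


def ramp_loss(margin: float, s: float = 0.0) -> float:
    """Ramp loss: the hinge loss clipped at 1 - s.

    `s` is a hypothetical clipping parameter; points with margin <= s
    all pay the same bounded cost, which limits the pull of outliers.
    """
    return min(1.0 - s, hinge_loss(margin))
```

For a badly misclassified point (margin of -3), the hinge loss keeps growing while the ramp loss stays capped, which is the usual motivation for comparing the two.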
Honey Production Analysis & Forecasting
Time series analysis and predictive modeling of US honey production (1998-2012)
- Objective: Forecast honey production trends for upcoming years
- Repository: Part of AI Projects Collection
PageRank on Apache Spark
Scalable PageRank algorithm implementation with multi-scale Wikipedia graph analysis
- Technologies: Scala, Apache Spark, GCP Dataproc, GitHub Actions
- Scale: From wiki-chti (5K pages, 40K edges) to wiki-fr (400K pages, 5M edges)
- Optimization: Performance comparison between baseline and partition-optimized implementations
- Analysis: Interactive Jupyter notebooks for comparative performance metrics
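The project itself runs on Scala and Spark. The sketch below only illustrates the underlying PageRank iteration on a small in-memory graph, in plain Python. The function name, the damping factor of 0.85, the iteration count, and the handling of pages without outgoing links are assumptions for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [linked pages]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(ranks[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}
        for source, targets in links.items():
            if not targets:
                continue
            # Each page splits its damped rank equally among its links.
            share = damping * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks
```

Calling `pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})` returns one score per page, with the scores summing to 1.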
LLM-GreenTune: Eco-Efficient Language Models
Sustainable LLM optimization through distillation, fine-tuning, and compression techniques
- Distillation: Llama-3.2-3B → 1B student model (temperature-scaled softmax T=2.0, α=0.85)
- Fine-tuning: LoRA (r=16, α=16) + QLoRA with 4-bit NF4 quantization on financial Q&A (7K samples)
- Compression: Magnitude pruning + GPTQ quantization achieving 67% memory reduction
- RAG System: SEC 10-K API, FAISS vector DB, HuggingFace embeddings for financial documents
- Performance: 85%+ accuracy retention, evaluated with ROUGE, BLEU, and perplexity
- Deployment: Production-ready Gradio chatbot for real-time financial Q&A
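To show how the temperature and weighting values in the Distillation bullet typically enter a loss, here is a minimal single-example sketch in plain Python. It assumes the common formulation in which `alpha` weights the soft (teacher-matching) term and `1 - alpha` weights the hard-label cross-entropy. The project may combine the terms differently, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.85):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    soft_student = softmax(student_logits, temperature)
    soft_teacher = softmax(teacher_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(p * math.log(p / q)
             for p, q in zip(soft_teacher, soft_student) if p > 0)
    # Standard cross-entropy against the ground-truth label (T = 1).
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard
```

The `temperature ** 2` factor is the usual rescaling that keeps the soft-term gradients comparable in magnitude as the temperature changes.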
H&M Fashion Recommendation Pipeline
End-to-end recommendation system for personalized fashion suggestions
- Dataset: 31M+ transactions, 1.4M customers, 105K articles
- Algorithm: LightFM with collaborative filtering (WARP/BPR loss functions)
- Approach: Hybrid model combining collaborative and content-based features
- Optimization: Grid search hyperparameter tuning
- Deployment: Streamlit interface for real-time predictions
Electric Vehicle Charging Stations Analysis
Comprehensive analysis of EV charging infrastructure
- Technologies: Python, pandas, data visualization libraries
- Analysis: Station distribution, usage patterns, and infrastructure insights
Common Crawl Domain Graph Analysis
Large-scale analysis of web domain relationships from Common Crawl dataset
- Technologies: Apache Spark, Hadoop
- Scale: Processing petabytes of web crawl data
- Focus: Domain graph structure and connectivity patterns
Spark Connected Components Finder
Distributed graph algorithm implementation for finding connected components
- Algorithm: Connected Components Finder (CCF)
- Framework: Apache Spark for distributed processing
- Application: Large-scale graph analysis and network clustering
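The repository implements this on Spark. As a rough single-machine illustration of the iterate-and-deduplicate idea behind CCF, the plain-Python sketch below propagates the smallest node id through each component until nothing changes. The function names are hypothetical, and a Python set stands in for the deduplication step.

```python
from collections import defaultdict


def ccf_iterate(pairs):
    """One CCF round: point every node and its neighbours at the local minimum."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    new_pairs = set()  # the set removes duplicate pairs
    changed = False
    for key, values in adjacency.items():
        smallest = min(values)
        if smallest < key:
            new_pairs.add((key, smallest))
            for value in values:
                if value != smallest:
                    new_pairs.add((value, smallest))
                    changed = True
    return new_pairs, changed


def connected_components(edges):
    """Map each node to the smallest node id in its component."""
    pairs = set(edges)
    changed = True
    while changed:
        pairs, changed = ccf_iterate(pairs)
    labels = {node: component for node, component in pairs}
    # The minimum node of each component labels itself.
    for component in set(labels.values()):
        labels[component] = component
    return labels
```

In this sketch, nodes that appear in no edge are not part of the input and therefore receive no label.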
Languages: Python • Java • Go • Bash • PowerShell • SQL
Data Formats: YAML • JSON
Frameworks: Scikit-learn • MLflow • LightFM • HuggingFace Transformers
Deep Learning: LoRA • QLoRA • Model Distillation • Quantization (GPTQ, NF4)
Techniques: SVM • Random Forest • PCA • Cross-validation • Time Series Forecasting • Recommendation Systems
RAG & Vector DBs: FAISS • LangChain • Semantic Search
Processing: Apache Spark • Hadoop • Scala
Platforms: Google Cloud Platform (Dataproc) • Databricks
Algorithms: PageRank • Connected Components • Graph Analysis
Tools: Pandas • Jupyter • Data Visualization
Containerization: Docker • Podman
CI/CD: Jenkins • GitHub Actions
Automation: Ansible
Cloud: Google Cloud Platform (Dataproc, Compute Engine)
Deployment: Gradio • Streamlit
OS: Ubuntu • Gentoo
Tools: systemd • Bash scripting • Network Configuration
Virtualization: VirtualBox
Protocols: TCP/IP • DNS • DHCP • HTTP/S
Security: Wireshark
Automation: Ansible
Version Control: Git • GitHub • GitLab
IDEs: VSCode • PyCharm • Vim
Documentation: Markdown • Sphinx