Open Source Learning Path · 2026–2028

Zero to Data Scientist

From absolute zero to Agentic AI — a practical, skills-driven roadmap with 9 levels, 150+ topics, and hands-on projects. Built for the community, by the community.

K
Level 0
🧱
Foundation
Python · DSA · Mathematics — The Bedrock of Everything
📁 Explore Folder2–3 Months
🐍 Python Programming
Variables & Data Types
int, float, str, bool, type casting — the atoms of all programs
Control Flow
if/else, while, for loops, break/continue — program logic
Functions
def, return, *args, **kwargs, scope (LEGB rule)
Data Structures
Lists, Tuples, Sets, Dictionaries — comprehensions & nesting
OOP Foundations
Classes, objects, inheritance, polymorphism, dunder methods
Advanced Concepts
Decorators, Generators, Closures, Context Managers
Error Handling
try/except/finally, custom exceptions — robust code
Environment
Pip, Virtual Environments (venv/conda), requirements.txt
🔗 Data Structures & Algorithms
Linear DS
Arrays, Strings, Linked Lists, Stacks & Queues
Hashing
HashMap, HashSet, collision handling, O(1) lookups
Non-Linear DS
Binary Trees, BST, Graphs (BFS/DFS), Heaps
Algorithms
Sorting, Binary Search, Recursion, Dynamic Programming
Complexity
Big-O notation, time and space complexity analysis
📐 Mathematics for Data Science
Linear Algebra
Vectors, Matrices, Eigenvalues, SVD, Dot Products
Calculus
Derivatives, Chain Rule, Gradients, Partial Derivatives
Probability
Bayes' Theorem, Conditional Probability, Random Variables
Statistics
Hypothesis Testing, p-values, Normal Distribution, Correlation
Flagship Project Idea

Build a "CLI Budget & Expense Intelligence" — A comprehensive terminal app using OOP principles. Implement data persistence with JSON, custom exceptions for error handling, and a module-based structure. Tech: Python, OOP, JSON, and Pytest.

Level 1
🧹
Data Preprocessing
NumPy · Pandas · Data Science Theory
📁 Explore Folder1.5–2 Months
🔢 NumPy — Numerical Computing
Array Creation
np.array(), zeros(), ones(), arange(), linspace(), random.randn()
Indexing & Slicing
Boolean indexing, fancy indexing, multidimensional slicing
Broadcasting
Vectorization, shape compatibility rules, memory efficiency
Linear Algebra
Matrix multiplication (np.matmul), dot products, inverse, determinants
🐼 Pandas — Data Manipulation
DataFrames & Series
Creation, indexing (loc/iloc), dtypes, and statistics (describe/info)
Data Cleaning
Handling nulls (fillna/dropna), duplicates, and string operations
GroupBy & Aggs
Pivot tables, multi-indexing, aggregation functions, transformation
Time Series
Date conversion, resampling, rolling windows, shift/diff operations
Merging Data
Inner/Outer/Left/Right joins, concatenations, overlapping keys
📚 DS Theory Fundamentals
Preprocessing
Scaling (MinMax/Standard), encoding (One-Hot/Label), outliers
Feature Engineering
Interaction terms, binning, polynomial features, log transformation
DS Lifecycle
CRISP-DM overview, problem definition, data acquisition
Flagship Project Idea

Build a "Real-Estate Analytics Pipeline" — Use NumPy and Pandas to clean a raw housing dataset. Handle missing locations, engineer features like "Price per SqFt", and perform deep segment analysis. Tech: Pandas, NumPy, and Matplotlib.

Level 2
📊
Data Visualization
Matplotlib · Seaborn · Power BI · Tableau
📁 Explore Folder3–4 Weeks
🎨 Theory & Principles of Visualization
Why Visualize?
Communicating data stories, pattern discovery, stakeholder reporting
Chart Selection Framework
When to use bar, line, scatter, pie, heatmap, histogram, box plot
Gestalt Principles
Proximity, similarity, continuity — how humans perceive charts
Data-Ink Ratio
Edward Tufte's principles — remove chartjunk, maximize clarity
Exploratory vs Explanatory
EDA for yourself vs polished charts for stakeholders
📈 Matplotlib
Figure & Axes API
fig, ax = plt.subplots() — OOP approach is the right way
Basic Charts
Line, bar, barh, scatter, histogram, pie, stem, step
Customization
Colors, markers, linestyles, labels, titles, legends, grid
Subplots & Layouts
plt.subplots(m,n), gridspec, tight_layout
Annotations
ax.annotate(), ax.text(), arrows, highlighting data points
Saving & Export
savefig(), DPI, formats (PNG, SVG, PDF)
🌊 Seaborn
Distribution Plots
histplot, kdeplot, ecdfplot, rugplot
Relational Plots
scatterplot, lineplot — hue, size, style parameters
Categorical Plots
barplot, boxplot, violinplot, stripplot, swarmplot, countplot
Matrix Plots
heatmap, clustermap — correlation matrices
Pair & Joint Plots
pairplot(), jointplot() — EDA power tools
FacetGrid
Multi-panel plots by category — storytelling with subsets
Themes
set_theme(), set_style(), set_palette() — publication-quality plots
📊 Power BI (Fundamentals)
Power BI Architecture
Desktop, Service, Mobile; data flow concepts
Data Connections
Import vs DirectQuery, connecting Excel/SQL/APIs
Power Query (ETL)
Data transformations without code — M language basics
DAX Basics
SUM, CALCULATE, FILTER, ALL — measures vs calculated columns
Visualizations
Charts, slicers, maps, KPI cards, conditional formatting
Dashboard Building
Report vs Dashboard, publishing to Power BI Service
📉 Tableau (Fundamentals)
Tableau Interface
Dimensions vs Measures, Rows/Columns shelf, Marks card
Connecting Data
Live vs Extract connections, joining data sources
Chart Types
Show Me panel, building bar/line/scatter/map charts
Filters & Parameters
Context filters, dynamic parameters, filter hierarchy
Calculated Fields
Basic formulas, LOD expressions (FIXED, INCLUDE, EXCLUDE)
Dashboard Design
Layouts, actions, publishing to Tableau Public
Flagship Project Idea

Build a "Global Trade & Economics Storyteller" — An interactive multi-page dashboard. Ingest real-world data from World Bank or Kaggle, perform advanced cleaning, and build a suite of interconnected charts highlighting global trends. Tech: Python, Seaborn, Plotly/Dash, or Tableau/PowerBI.

Level 3
🗄️
DBMS, Data Engineering & Cloud
SQL · DBMS · Big Data · Cloud Computing
📁 Explore Folder2–3 Months
🗃️ SQL — Basic to Advanced
DDL Commands
CREATE, ALTER, DROP, TRUNCATE — table schema management
DML Commands
SELECT, INSERT, UPDATE, DELETE with conditions
Filtering & Sorting
WHERE, AND/OR/NOT, BETWEEN, IN, LIKE, IS NULL, ORDER BY
Aggregations
GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX
JOINs
INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF join — with diagrams
Subqueries
Correlated vs non-correlated, EXISTS, IN with subqueries
Window Functions
ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, PARTITION BY — CRITICAL for DS
CTEs
WITH clause, recursive CTEs, chaining CTEs for readability
Indexes
B-tree, composite, unique indexes — query optimization
Views & Stored Procedures
Virtual tables, reusable logic, parameterized procedures
Transactions & ACID
BEGIN, COMMIT, ROLLBACK, SAVEPOINT, isolation levels
SQL for Data Analysis
Date functions, string functions, CASE WHEN, COALESCE, NULLIF
🏛️ DBMS — Zero to Full Flash
Database Fundamentals
DBMS vs RDBMS vs NoSQL, 3-schema architecture, data abstraction
ER Modeling
Entities, attributes, relationships, cardinality, ER diagrams
Normalization
1NF, 2NF, 3NF, BCNF — eliminating redundancy
Relational Algebra
Select, Project, Join, Union, Difference — theoretical foundation
Indexing & Hashing
Primary vs secondary, clustered vs non-clustered, hash-based
Query Optimization
Execution plans, cost estimation, EXPLAIN/ANALYZE
Concurrency Control
Locks, deadlocks, 2PL, MVCC, isolation levels
NoSQL Databases
Document (MongoDB), Key-Value (Redis), Column (Cassandra), Graph (Neo4j)
CAP Theorem
Consistency, Availability, Partition tolerance trade-offs
🌊 Big Data & Data Lakes
Big Data Concepts
5Vs: Volume, Velocity, Variety, Veracity, Value; batch vs stream
Hadoop Ecosystem
HDFS architecture, MapReduce paradigm, YARN, Hive, HBase
Apache Spark
RDDs, DataFrames, SparkSQL, PySpark basics, lazy evaluation, DAG
Data Lake Architecture
Raw/Curated/Consumption zones, Delta Lake, Apache Iceberg
Data Warehousing
Star schema, snowflake schema, fact/dimension tables, OLAP vs OLTP
ETL vs ELT
Traditional ETL vs modern ELT pipelines, dbt basics
☁️ Cloud Computing (Zero to Intermediate)
Cloud Fundamentals
IaaS, PaaS, SaaS; public, private, hybrid cloud; on-premise vs cloud
AWS for Data Science
S3 (storage), EC2 (compute), RDS, Redshift, SageMaker, Lambda basics
GCP for Data Science
BigQuery, Cloud Storage, Vertex AI, Dataflow (Apache Beam)
Azure for Data Science
Azure Data Factory, Azure ML, Blob Storage, Synapse Analytics
Docker Basics
Containers vs VMs, Dockerfile, images, containers — packaging ML models
Cloud Storage Patterns
Object vs block storage, data lifecycle policies, cost optimization
AWS Free Tier GCP Free Tier Azure Student Credits
Flagship Project Idea

Build a "Data Warehouse Migration & Analytics Engine" — Architect a schema for a million-scale database, implement complex window functions for user behavior analysis, and build an ETL bridge to BigQuery or Snowflake. Tech: PostgreSQL, dbt, Apache Spark, and GCP/AWS.

Level 4
🤖
Machine Learning
Lifecycle · Algorithms · Supervised · Unsupervised · RL
📁 Explore Folder3–4 Months
🔄 ML Lifecycle
Problem Framing
Business problem → ML problem mapping, success metrics
Data Collection
Sources, APIs, scraping, synthetic data
EDA & Preprocessing
Exploring patterns, cleaning, feature engineering
Model Selection
Baseline models, complexity vs performance trade-off
Training & Validation
Train/val/test split, k-fold CV, stratified sampling
Evaluation
Metrics, confusion matrix, ROC-AUC, bias-variance
Hyperparameter Tuning
Grid Search, Random Search, Bayesian Optimization, Optuna
Deployment
Saving models (pickle, joblib), serving with Flask/FastAPI
Monitoring
Data drift, model drift, retraining triggers
📈 Supervised Learning
AlgorithmTypeWhen to UseKey Concepts
Linear RegressionRegressionContinuous target, linear relationshipOLS, coefficients, R², MSE, RMSE
Logistic RegressionClassificationBinary/multiclass, interpretableSigmoid, log-loss, decision boundary
Decision TreesBothNon-linear, interpretableGini/Entropy, depth, pruning, CART
Random ForestBothTabular data, robust to noiseBagging, feature importance, OOB error
Gradient BoostingBothCompetitions, tabular dataXGBoost, LightGBM, CatBoost, learning rate
SVMBothHigh-dimensional, small datasetsMargin, kernel trick, C & gamma params
KNNBothBaseline, recommendationDistance metrics, K choice, scalability limits
Naive BayesClassificationNLP, spam detectionBayes theorem, conditional independence
🔍 Unsupervised Learning
K-Means Clustering
Centroid-based, elbow method, inertia, silhouette score
DBSCAN
Density-based, handles arbitrary shapes, noise detection
Hierarchical Clustering
Agglomerative, dendrogram, linkage methods
PCA
Variance explained, scree plot, 2D/3D projection
t-SNE & UMAP
High-dim visualization, embedding exploration
Anomaly Detection
Isolation Forest, LOF, One-Class SVM
Association Rules
Apriori, FP-Growth, support/confidence/lift — market basket
🎮 Reinforcement Learning (Basics)
Core Concepts
Agent, environment, state, action, reward, policy, value function
Markov Decision Process
MDP formulation, Bellman equation
Q-Learning
Q-table, exploration vs exploitation, epsilon-greedy
Deep Q-Network
DQN basics, experience replay, target network
Policy Gradient
REINFORCE algorithm, actor-critic intuition
Applications
Games (OpenAI Gym), robotics, recommendation systems, trading
📏 Model Evaluation Metrics
Classification Metrics
Accuracy, Precision, Recall, F1-Score, ROC-AUC, PR-AUC, MCC
Regression Metrics
MSE, RMSE, MAE, MAPE, R², Adjusted R²
Bias-Variance Tradeoff
Underfitting vs overfitting — the fundamental ML dilemma
Cross-Validation
k-Fold, Stratified, LOOCV, TimeSeriesCV
Scikit-learn XGBoost LightGBM Optuna MLflow
Flagship Project Idea

Build a "Loan Default Predictor" — A complete ML pipeline using XGBoost/LightGBM. Perform deep EDA, handle imbalanced data, tune hyperparameters with Optuna, and serve the model via FastAPI. Tech: Scikit-learn, Optuna, FastAPI, and Docker.

Level 5
💼
Applied Data Science
Business Domains · DS Lifecycle · Fintech Applications
📁 Explore Folder1–2 Months
🏢 Business Domain Knowledge
DomainDS Use CasesKey Metrics
E-CommerceRecommendation engines, churn prediction, demand forecasting, A/B testingCVR, ARPU, NPS, CLV
FinTechCredit scoring, fraud detection, risk modeling, algorithmic tradingAUC-ROC, KS stat, Gini, Default Rate
HealthcareDisease prediction, medical imaging, drug discovery, patient segmentationSensitivity, Specificity, AUC
ManufacturingPredictive maintenance, quality control, supply chain optimizationOEE, MTBF, Defect Rate
MarketingCustomer segmentation, attribution modeling, sentiment analysis, RFMCTR, ROAS, CAC, LTV
LogisticsRoute optimization, delivery prediction, inventory managementOn-time %, Cost/delivery
🔄 Data Science Lifecycle (Deep Dive)
Business Understanding
Stakeholder interviews, problem definition, success criteria, KPIs
Data Understanding
Data inventory, profiling, quality assessment, gap analysis
Data Preparation
Cleaning pipelines, feature stores, train/test strategy
Modeling
Baseline → iterate → ensemble → final model selection
Evaluation
Business metric alignment, fairness, explainability
Deployment & Monitoring
MLOps, CI/CD for ML, model versioning, A/B testing live
💰 Data Science in FinTech
Credit Risk Modeling
PD, LGD, EAD, scorecard development, FICO-like models, Basel requirements
Fraud Detection
Imbalanced classification, transaction pattern analysis, graph-based fraud
Algorithmic Trading
Time series forecasting, momentum strategies, backtesting frameworks
Customer Analytics
CLV prediction, churn modeling, next-best-action
Regulatory Compliance
Model explainability (SHAP, LIME), bias auditing, GDPR in ML
Alternative Data
Satellite imagery, social sentiment, telco data for underwriting
🔬 Advanced DS Aspects
Experimental Design
A/B testing, hypothesis testing, power analysis, multi-armed bandit
Causal Inference
Correlation vs causation, DAGs, instrumental variables, DID
Time Series Analysis
ARIMA, SARIMA, Prophet, trend/seasonality decomposition
NLP Basics
Text preprocessing, TF-IDF, word embeddings (Word2Vec), sentiment
Model Explainability
SHAP values, LIME, feature importance, partial dependence plots
Flagship Project Idea

Build an "E-commerce Growth Engine" — A comprehensive system that predicts customer churn, estimates Lifetime Value (CLV), and performs RFM Segmentation to drive marketing strategy. Tech: Pandas, Lifetimes, Streamlit, and SQL.

Level 6
🧠
Deep Learning
ANN · CNN · RNN · Transformers · GANs
📁 Explore Folder3–4 Months
🔩 Neural Network Fundamentals
Biological Inspiration
Neurons, synapses → perceptrons, activation functions analogy
Perceptron & MLP
Single neuron, multi-layer, input/hidden/output layers
Activation Functions
ReLU, Sigmoid, Tanh, Leaky ReLU, GELU, Softmax — when to use each
Forward Propagation
Matrix multiplications through layers, computing predictions
Loss Functions
MSE, Cross-Entropy, Binary CE, Huber — choosing right loss
Backpropagation
Chain rule in depth, gradient flow, vanishing/exploding gradients
Optimizers
SGD, Momentum, RMSprop, Adam, AdaGrad, AdamW — intuition
Regularization
L1/L2, Dropout, Batch Normalization, Early Stopping, Weight Decay
🖼️ CNN — Convolutional Neural Networks
Convolution Operation
Filters/kernels, feature maps, padding, stride — visual intuition
Pooling Layers
Max pooling, average pooling, global average pooling
Classic Architectures
LeNet → AlexNet → VGG → ResNet → EfficientNet evolution
Transfer Learning
Pre-trained weights, fine-tuning, feature extraction — CRITICAL skill
Object Detection
YOLO, SSD, Faster R-CNN — bounding boxes, IoU, mAP
Image Segmentation
FCN, U-Net for semantic/instance segmentation
📝 RNN & Sequential Models
Vanilla RNN
Hidden state, temporal dependencies, BPTT, vanishing gradient problem
LSTM
Cell state, forget/input/output gates — solving long-term dependencies
GRU
Simplified LSTM, reset & update gates, efficiency
Bidirectional RNNs
Forward + backward pass, context from both directions
Seq2Seq
Encoder-decoder, attention mechanism genesis, machine translation
⚡ Transformers & Attention
Attention Mechanism
Query, Key, Value — scaled dot-product attention, intuition
Multi-Head Attention
Parallel attention heads, concatenation, projection
Positional Encoding
Why RNNs know order; sinusoidal encoding for Transformers
Transformer Architecture
Encoder-decoder blocks, LayerNorm, FFN sublayers, residual connections
BERT & GPT
Encoder-only vs decoder-only; masked LM vs causal LM pre-training
Vision Transformer (ViT)
Patches as tokens, applying Transformers to images
🎨 GANs & Generative Models
GAN Architecture
Generator vs Discriminator, minimax game, training dynamics
Training Challenges
Mode collapse, training instability, Wasserstein GAN
GAN Variants
DCGAN, conditional GAN, CycleGAN, StyleGAN
VAE
Variational Autoencoders, latent space, reparameterization trick
Diffusion Models
DDPM, forward/reverse process, noise prediction — foundation of Stable Diffusion
PyTorch TensorFlow/Keras OpenCV Hugging Face CUDA / GPU
Flagship Project Idea

Build a "Real-Time Vision Edge Intelligence" system. Combine OpenCV for frame processing with a custom CNN/MobileNet for tasks like Hand Gesture Control, Face Mask Detection, or Driver Drowsiness Monitoring. Tech: PyTorch/TensorFlow, OpenCV, and Mediapipe.

Level 7
Generative AI
LLMs · Prompt Engineering · RAG · Fine-tuning · SLM
📁 Explore Folder3–4 Months
🧬 LLMs — Foundation
LLM Architecture
Decoder-only transformer, tokenization (BPE, WordPiece), context window
Training Process
Pre-training (next-token prediction), scale laws, compute requirements
RLHF
Reinforcement Learning from Human Feedback — how GPT/Claude are aligned
Model Families
GPT-4/4o, Claude 3.5, Gemini 1.5, Llama 3, Mistral, Phi-3, Qwen
Multimodal LLMs
Vision-language models, audio LLMs, tool-use capable models
Open vs Closed Models
Proprietary APIs vs open-weight models — trade-offs for production
💬 Prompt Engineering
Prompt Anatomy
System prompt, user prompt, assistant prefill, context injection
Zero/Few-Shot
Zero-shot prompting, few-shot examples, in-context learning
Chain-of-Thought
CoT prompting, step-by-step reasoning, "think step by step"
ReAct Pattern
Reasoning + Acting loops, tool integration in prompts
Structured Output
JSON mode, XML tags, function calling, output parsers
Prompt Optimization
DSPy, automatic prompt optimization, evaluation-driven iteration
🔌 LLM Integration
OpenAI API
Chat completions, embeddings, function calling, streaming
LangChain
Chains, memory, tools, callbacks, LCEL (LangChain Expression Language)
LlamaIndex
Data indexing, query engines, node parsers, retrievers
Hugging Face
transformers library, pipeline(), model hub, Inference API
Local LLMs
Ollama, llama.cpp, Hugging Face local inference, quantized models
Vector Databases
Pinecone, Weaviate, ChromaDB, FAISS, Qdrant — embeddings + search
📚 RAG — Retrieval Augmented Generation
RAG Architecture
Indexing pipeline, retrieval pipeline, generation pipeline
Document Processing
Chunking strategies, overlap, metadata extraction
Embedding Models
text-embedding-3, BGE, E5, sentence-transformers — choosing embeddings
Retrieval Strategies
Semantic search, BM25, hybrid search, re-ranking
Advanced RAG
HyDE, query expansion, multi-query retriever, RAPTOR
RAG Evaluation
RAGAS framework, faithfulness, relevance, context recall metrics
🎯 Fine-Tuning LLMs
When to Fine-Tune?
Prompt engineering → RAG → fine-tune decision tree
Instruction Fine-Tuning
SFT (Supervised Fine-Tuning), dataset formats (Alpaca, ShareGPT)
LoRA & QLoRA
Parameter-efficient fine-tuning, rank decomposition, 4-bit quantization
Training Frameworks
Unsloth, TRL (Hugging Face), Axolotl — practical tools
DPO
Direct Preference Optimization — alignment without reward model
Evaluation
Benchmark datasets, MMLU, MT-Bench, task-specific evals
🔧 Build Your Own SLM with PyTorch
Tokenizer from Scratch
BPE tokenizer training using tiktoken / sentencepiece
GPT Architecture in PyTorch
Token embeddings, positional encoding, Transformer blocks, LM head
Training Loop
DataLoader, optimizer, gradient accumulation, mixed precision (AMP)
Text Generation
Greedy, top-k, top-p (nucleus) sampling, temperature
Scaling
Flash Attention, gradient checkpointing, ZeRO optimization
Reference Project
Karpathy's nanoGPT — build a 124M param model on your dataset
Flagship Project Idea

Build a "Multimodal Healthcare Intelligence System" — An engine that ingests medical PDFs, scans (images), and audio notes. Implement Hybrid Search, Re-ranking, and Citations. Use a "Self-RAG" loop to verify medical facts before responding. Tech: LlamaIndex, Qdrant, OpenAI, and a Next.js Dashboard.

Level 8
🤖
Agentic AI
AI Agents · MCP · Workflows · Deployment · n8n
📁 Explore Folder2–3 Months (Ongoing)
🧩 AI Agents — Foundation
What is an AI Agent?
LLM + memory + tools + planning — agents vs chatbots vs pipelines
Agent Architecture
Perception → reasoning → action loop, tool calling, self-reflection
Tool Use
Function calling, tool schemas, results injection, error handling
Memory Systems
In-context (buffer), external (vector store), episodic, semantic memory
Planning
ReAct, Plan-and-Execute, Tree of Thoughts, self-consistency
Multi-Agent Systems
Agent roles, orchestrator-worker pattern, communication protocols
🌐 Agent Use Cases by Domain
DomainAgent Use CaseTools Involved
Data ScienceAutoEDA agents, code-gen for analysis, self-correcting ML pipelinesPython executor, database tool, plotting
FinTechAutomated report generation, portfolio analysis, compliance checkingMarket data API, SQL, PDF generator
HealthcareMedical record summarization, clinical decision support, literature reviewPubMed search, EHR API, OCR
E-CommerceCustomer support agent, product research, price monitoringWeb search, CRM, email tool
EducationPersonalized tutoring, content generation, student assessmentKnowledge base, quiz generator, progress tracker
DevOps / SREIncident response agents, log analysis, auto-remediationCloudWatch, PagerDuty, shell executor
🔌 MCP — Model Context Protocol
What is MCP?
Anthropic's open standard for connecting LLMs to data sources and tools universally
MCP Architecture
MCP Host (Claude/app), MCP Client, MCP Server — transport layer
MCP Primitives
Resources (data), Tools (functions), Prompts (templates)
Building MCP Servers
Python/TS SDK, exposing database, file system, APIs as MCP tools
MCP Ecosystem
GitHub, Slack, Postgres, Brave Search, filesystem MCP servers
Why MCP Matters
Replaces N×M integrations with N+M — standardization in agentic AI
⚙️ Agent Frameworks
LangGraph
Graph-based agent workflows, cycles, branching, state management — production-grade
CrewAI
Role-based multi-agent teams, tasks, crew orchestration
AutoGen
Microsoft's conversational multi-agent framework, code execution agents
OpenAI Swarm
Lightweight handoff-based multi-agent pattern
Semantic Kernel
Enterprise-grade orchestration, plugin system, planner
🔄 n8n & Workflow Automation
n8n Fundamentals
Self-hosted workflow automation, node-based visual programming
AI Agent Node
LLM integration in n8n, tool nodes, memory nodes
Trigger Types
Webhook, schedule, event-based, form triggers
Integration Nodes
Slack, Gmail, Notion, Airtable, HTTP, database nodes
Sub-workflows
Modular agent design, calling workflows from workflows
Error Handling
Error workflows, retry logic, fallback paths
🚀 Agent Deployment
FastAPI for Agents
Exposing agents as REST APIs, streaming responses (SSE)
Docker + Cloud Deploy
Containerizing agents, deploying to AWS/GCP/Azure
Agent Observability
LangSmith, Langfuse, tracing, token tracking, debugging
Safety & Guardrails
NeMo Guardrails, Constitutional AI, input/output validation
Rate Limiting & Cost
Token budgets, caching (GPTCache, semantic cache), cost control
Human-in-the-Loop
Approval gates, interrupt nodes, human feedback integration
LangGraph CrewAI n8n LangSmith Langfuse FastAPI Docker
Flagship Project Idea

Build an "Autonomous Market-Research & Investment Swarm" — Orchestrate a crew of specialized agents (Lead Researcher, Financial Analyst, Strategic Advisor) using LangGraph. Empower agents with tools via MCP to fetch market data and browse the web. The swarm autonomously generates deep-dive reports. Tech: CrewAI, LangGraph, n8n, and Langfuse.

What to Avoid

Mistakes to Avoid on Your Journey to Becoming a Data Scientist

01

Tutorial Hell

Watching 100 tutorials without building anything. Always code along, always build projects.

02

Skipping Math

Using sklearn without understanding what's inside. Leads to cargo-cult ML — dangerous in production.

03

Tool Hopping

Switching frameworks every week. Go deep on PyTorch first before jumping to JAX or MXNet.

04

Skipping EDA

Jumping to models without understanding the data. 70% of DS work is data, not models.

05

Ignoring Business Context

A perfect model that doesn't solve the business problem is worthless. Always start with "why."

06

No Version Control

Not using Git from day 1 is big mistake. Every project, every experiment must be version controlled.

Open Source

Ready to Contribute?

This is a Open Source Project. I build this project to help Beginner to Learn Real World Skills from Scratch.if you have any suggestions or you want to contribute to this project, you can contribute to this project.

Level 0: Foundation Level 1: Preprocessing Level 2: Visualization Level 3: DBMS & Cloud Level 4: Machine Learning Level 5: Applied DS Level 6: Deep Learning Level 7: Generative AI Level 8: Agentic AI