Data Science Cheat Sheet: Core Concepts, Tools & Code

pandas has over 200 methods. Most data scientists use about 15 of them in the majority of their work. This data science cheat sheet cuts through the noise — covering the Python syntax, statistical concepts, and machine learning fundamentals you need to have sharp before a job interview or your first real project.

This isn't a textbook index. It's a reference for the things that come up constantly: reshaping DataFrames, choosing between statistical tests, remembering when to use a random forest versus logistic regression, and not blanking on a SQL window function at 9am.

Python & pandas: The Core of Any Data Science Cheat Sheet

Python dominates data science for practical reasons — the ecosystem is extensive, the syntax is readable, and the community has produced libraries that cover almost every analytical task. The operations below appear in nearly every project.

Essential pandas Operations

  • Load data: df = pd.read_csv('file.csv')
  • Inspect: df.shape, df.info(), df.describe()
  • Select columns: df[['col1', 'col2']]
  • Filter rows: df[df['col'] > 100]
  • Group and aggregate: df.groupby('category')['value'].mean()
  • Handle missing values: df.dropna() or df.fillna(0)
  • Merge DataFrames: pd.merge(df1, df2, on='id', how='left')
  • Pivot table: df.pivot_table(values='sales', index='region', columns='month', aggfunc='sum')
  • Apply a function: df['col'].apply(lambda x: x * 2)
  • Sort: df.sort_values('col', ascending=False)
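
As a rough illustration of how several of these operations combine, here is a minimal sketch on a made-up sales table. The DataFrames, column names, and values are all hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical sales records; names and numbers are illustrative only.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "month": ["Jan", "Feb", "Jan", "Jan", "Feb"],
    "sales": [100, 150, 200, 50, 300],
    "rep_id": [1, 2, 3, 3, 4],
})
reps = pd.DataFrame({"rep_id": [1, 2, 3], "rep_name": ["Ana", "Ben", "Cy"]})

# Filter rows with a boolean condition.
big_sales = sales[sales["sales"] > 100]

# Group and aggregate: mean sale per region.
mean_by_region = sales.groupby("region")["sales"].mean()

# Left merge keeps every row of `sales`; rep_id 4 has no match, so rep_name is NaN.
merged = pd.merge(sales, reps, on="rep_id", how="left")

# Pivot: one row per region, one column per month, summed sales in each cell.
pivot = sales.pivot_table(values="sales", index="region", columns="month", aggfunc="sum")

print(mean_by_region)
print(merged.sort_values("sales", ascending=False))
print(pivot)
```

Checking merged["rep_name"].isna().sum() after a left merge is a quick way to see how many rows failed to match.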

NumPy Basics

  • Create array: np.array([1, 2, 3])
  • Array math: np.mean(arr), np.std(arr), np.sum(arr)
  • Reshape: arr.reshape(3, 4)
  • Random sampling: np.random.choice(arr, size=100, replace=False)
  • Dot product: np.dot(a, b)
  • Boolean indexing: arr[arr > 5]
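
The NumPy calls above can be tried together in a short sketch. One substitution to note: it samples through np.random.default_rng, the newer seeded Generator interface, rather than np.random.choice, so the result is reproducible. The array contents are arbitrary.

```python
import numpy as np

arr = np.arange(1, 13)             # the integers 1 through 12
grid = arr.reshape(3, 4)           # 3 rows x 4 columns; total size must stay 12

col_means = grid.mean(axis=0)      # mean of each column
above_five = arr[arr > 5]          # boolean mask keeps only elements greater than 5

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b)                 # 1*4 + 2*5 + 3*6 = 32

# Seeded generator for reproducibility; replace=False means no value is drawn twice.
rng = np.random.default_rng(0)
sample = rng.choice(arr, size=5, replace=False)

print(grid)
print(col_means, above_five, dot, sample)
```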

Statistics: The Data Science Cheat Sheet for Non-Coders

Strong data scientists understand the math behind their tools. These are the statistical concepts that appear most often in practice — either because you need them directly or because an interviewer will ask about them.

Descriptive Statistics

  • Mean: Sum divided by count. Sensitive to outliers.
  • Median: Middle value when sorted. More robust to outliers than mean.
  • Mode: Most frequent value. Used for categorical data.
  • Variance: Average squared deviation from the mean. Divide by n for a full population, or by n - 1 when estimating from a sample.
  • Standard deviation: Square root of variance. Same units as the original data.
  • IQR (Interquartile Range): Q3 minus Q1. A common rule flags as outliers any values more than 1.5 × IQR below Q1 or above Q3.
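
To see why the median and IQR are considered robust, here is a small sketch on made-up numbers containing one extreme value. The 1.5 × IQR fence is a widely used convention rather than a fixed law, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Made-up data with one extreme value (95).
values = np.array([12, 14, 15, 15, 16, 18, 19, 95])

mean = values.mean()                 # pulled upward by the extreme value
median = np.median(values)           # barely affected by it
pop_var = values.var()               # divides by n (population variance)
sample_var = values.var(ddof=1)      # divides by n - 1 (sample variance)
std = values.std(ddof=1)             # same units as the data

# Quartiles and the conventional 1.5 * IQR fences.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"mean={mean}, median={median}, IQR={iqr}, outliers={outliers}")
```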

Hypothesis Testing

  • Null hypothesis (H₀): The default claim — usually "no effect" or "no difference."
  • Alternative hypothesis (H₁): What you're testing for.
  • p-value: Probability of seeing your result (or one more extreme) if H₀ is true. Below 0.05 means you reject H₀ at the 5% significance level. It is not the probability that H₀ is true.
  • Type I error: False positive — rejecting H₀ when it's actually true.
  • Type II error: False negative — failing to reject H₀ when it's false.
  • t-test: Compare means between two groups. Use when the population standard deviation is unknown, which matters most when sample sizes are small.
  • Chi-square test: Test for a relationship between two categorical variables.
  • ANOVA: Compare means across three or more groups simultaneously.
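
A hedged sketch of two of these tests, assuming SciPy is installed. The measurements and the contingency table are invented for illustration, and the t-test shown is Welch's variant (equal_var=False), which does not assume the two groups share a variance.

```python
import numpy as np
from scipy import stats

# Invented measurements for two groups.
group_a = np.array([50, 52, 47, 55, 49, 51, 53, 48])
group_b = np.array([58, 61, 57, 63, 60, 59, 62, 56])

# Welch's two-sample t-test: H0 says the two group means are equal.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
alpha = 0.05
reject_h0 = bool(result.pvalue < alpha)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.6f}, reject H0: {reject_h0}")

# Chi-square test of independence on an invented 2x2 table of counts
# (rows: two groups, columns: two outcomes).
table = np.array([[30, 10], [15, 25]])
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```

For three or more groups, scipy.stats.f_oneway runs a one-way ANOVA in the same style.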

Common Probability Distributions

  • Normal: Bell curve. Defined by mean and standard deviation. Many natural phenomena approximate this shape.
  • Binomial: Count of successes in n independent trials, each with probability p.
  • Poisson: Count of events in a fixed time period. Used for rare-event modeling.
  • Uniform: Equal probability for all outcomes within a range.
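
One way to build intuition for these distributions is to sample from each and compare the sample mean with the theoretical one. The parameters below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(7)   # seeded so the draws are reproducible
n_draws = 100_000

samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=n_draws),
    "binomial": rng.binomial(n=10, p=0.3, size=n_draws),
    "poisson": rng.poisson(lam=2.0, size=n_draws),
    "uniform": rng.uniform(low=0.0, high=1.0, size=n_draws),
}

# Theoretical means: normal = loc, binomial = n * p, Poisson = lam, uniform = (low + high) / 2
theoretical_means = {"normal": 0.0, "binomial": 3.0, "poisson": 2.0, "uniform": 0.5}

for name, draws in samples.items():
    print(f"{name:9s} sample mean = {draws.mean():.3f}, theoretical = {theoretical_means[name]}")
```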

Machine Learning Algorithm Quick Reference

Picking the right algorithm is less about memorizing formulas and more about understanding tradeoffs. The lists below map common problem types to practical solutions.

Supervised Learning

  • Linear regression: Predict a continuous output. Assumes a linear relationship. Highly interpretable.
  • Logistic regression: Binary classification. Outputs probabilities. Strong baseline — always try it first.
  • Decision tree: Splits data on feature thresholds. Interpretable but overfits easily on its own.
  • Random forest: Ensemble of decision trees. More accurate, less interpretable. Good general-purpose choice.
  • Gradient boosting (XGBoost, LightGBM): Builds trees sequentially to correct prior errors. Top performer on tabular data.
  • Support vector machine (SVM): Finds an optimal boundary between classes. Works well in high dimensions.
  • K-nearest neighbors (KNN): Classifies based on similarity to k neighbors. No training phase; slow at inference on large datasets.
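
The "baseline first" advice can be sketched with scikit-learn, assuming it is installed. The dataset is synthetic (generated by make_classification) and the hyperparameters are illustrative defaults, so the scores say nothing about real-world performance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data; sizes are arbitrary.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

If the more complex model does not clearly beat the logistic regression baseline, the simpler and more interpretable model is usually the better starting point.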

Unsupervised Learning

  • K-means clustering: Groups data into k clusters by minimizing within-cluster variance. You must choose k upfront.
  • DBSCAN: Density-based clustering. Handles irregular shapes and marks noise points as outliers.
  • PCA (Principal Component Analysis): Reduces dimensionality while preserving maximum variance.
  • t-SNE: Dimensionality reduction for visualization only. Not suitable as input for downstream models.
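
To make "preserving maximum variance" concrete, here is a minimal PCA sketch built on NumPy's SVD rather than a library PCA class. The data is synthetic: the second column is nearly a multiple of the first, so a single component should capture most of the variance. Features are centered but not rescaled, which is an assumption that suits this toy data only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: column 2 is roughly 2 * column 1, column 3 is independent noise.
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# PCA via SVD of the centered data; rows of `components` are the principal axes.
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

explained_variance = singular_values**2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal components (3 dimensions down to 2).
projected = centered @ components[:2].T

print("explained variance ratio:", np.round(explained_ratio, 3))
print("projected shape:", projected.shape)
```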

Model Evaluation Metrics

  • Accuracy: Correct predictions / total predictions. Misleading on imbalanced datasets.
  • Precision: True positives / (true positives + false positives). Use when false positives are costly (e.g., spam filters).
  • Recall: True positives / (true positives + false negatives). Use when false negatives are costly (e.g., disease screening).
  • F1 score: Harmonic mean of precision and recall. Use when you need one metric for imbalanced data.
  • AUC-ROC: Area under the ROC curve. Measures discrimination ability across all decision thresholds.
  • RMSE: Root mean squared error. For regression. Penalizes large errors more heavily than MAE.
  • MAE: Mean absolute error. More interpretable than RMSE. Less sensitive to outliers.
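
The formulas above are short enough to compute by hand, which is a useful check before relying on a library. The labels and predictions below are made up for the sketch.

```python
import numpy as np

# Made-up binary labels and predictions (1 = positive class).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Made-up regression targets and predictions.
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 7.0])
errors = actual - predicted

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors**2))   # squaring weights the larger error more heavily

print(f"accuracy={accuracy}, precision={precision}, recall={recall}, f1={f1}")
print(f"MAE={mae}, RMSE={rmse:.3f}")
```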

SQL Quick Reference

SQL is the language most data scientists use daily — it's how you extract data before it ever reaches Python. These patterns appear in nearly every analytics workflow.

  • Basic select: SELECT col1, col2 FROM table WHERE condition;
  • Aggregation: SELECT category, COUNT(*), AVG(value) FROM table GROUP BY category;
  • Joins: SELECT * FROM a LEFT JOIN b ON a.id = b.id;
  • Window functions: SELECT *, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM employees;
  • CTE: WITH cte AS (SELECT ...) SELECT * FROM cte;
  • Subquery: SELECT * FROM table WHERE id IN (SELECT id FROM other WHERE ...);
  • Case statement: SELECT CASE WHEN score > 90 THEN 'A' ELSE 'B' END FROM grades;
  • Date truncation: DATE_TRUNC in PostgreSQL, DATETRUNC in SQL Server 2022 and later (DATEPART extracts a single component rather than truncating), DATE_FORMAT in MySQL. Syntax varies by database.
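
To practice the CTE and window-function patterns without setting up a database server, one option is Python's built-in sqlite3 module. This sketch assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The employees rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
INSERT INTO employees VALUES
  ('Ana', 'eng', 120), ('Ben', 'eng', 100), ('Cy', 'eng', 120),
  ('Dee', 'ops', 90),  ('Eli', 'ops', 80);
""")

# The CTE ranks salaries within each department; the outer query orders the result.
query = """
WITH ranked AS (
    SELECT name, dept, salary,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
    FROM employees
)
SELECT name, dept, salary, salary_rank
FROM ranked
ORDER BY dept, salary_rank, name;
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

RANK() leaves a gap after ties: the two top 'eng' salaries both get rank 1 and the next gets rank 3. DENSE_RANK() would assign 2 instead.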

Top Courses to Go Deeper

A cheat sheet gets you oriented. Courses provide the repetition needed to retain this under pressure. These are the highest-rated options for the core areas covered above.

Python for Data Science, AI & Development (IBM on Coursera)

IBM's course applies pandas, NumPy, and data visualization directly to real datasets through hands-on labs — the applied version of the Python section above. Rated 9.8.

Tools for Data Science (Coursera)

Covers the practical stack — Jupyter notebooks, RStudio, Git, and cloud-based environments — before employers expect you to know them. Useful if you're coming from Excel or BI tools. Rated 9.8.

Introduction to Data Analytics (Coursera)

Builds from data literacy through statistical thinking and visualization without assuming a math background. Structured well for career changers. Rated 9.8.

Process Data from Dirty to Clean (Coursera)

Focuses on data cleaning and validation — the work that consumes the majority of actual data science time — with reproducible workflows in SQL and spreadsheets rather than just theory. Rated 9.8.

Analyze Data to Answer Questions (Coursera)

Bridges knowing SQL syntax and using it to answer real business questions, with structured exercises that closely mirror what analyst interviews test. Rated 9.8.

Python Data Science (edX)

More academic in structure than the Coursera options; covers statistical inference and ML fundamentals using Python with a stronger emphasis on the underlying theory. Rated 9.7.

FAQ

What should a data science cheat sheet include?

The most useful ones cover: Python and pandas syntax for data manipulation, descriptive and inferential statistics, SQL for data extraction, machine learning algorithm tradeoffs, and model evaluation metrics. The exact emphasis depends on your role — a data analyst leans more on SQL and statistics, while an ML engineer needs more depth on algorithms and model deployment patterns.

Is there an official pandas cheat sheet?

The pandas development team maintains a PDF cheat sheet available through the official documentation at pandas.pydata.org. The DataCamp and Dataquest cheat sheets are also widely used and cover pandas, NumPy, scikit-learn, and matplotlib in compact formats. A search for either by name will turn them up.

How do you actually memorize data science concepts?

You mostly don't — and working data scientists don't expect to. The goal is knowing what's possible and where to look, not recalling every parameter from memory. Build core syntax through regular use: write actual queries and scripts rather than just reading examples. For statistics formulas specifically, spaced repetition tools like Anki work well if you want to internalize them before an exam or interview.

What topics come up most in data science job interviews?

It varies by company and seniority, but the most consistently tested areas are: SQL (window functions, aggregations, self-joins), probability fundamentals (Bayes' theorem, distributions, expected value), Python data manipulation, and conceptual ML (not just what algorithms do, but when to use each and what can go wrong). At tech companies, product sense questions — "how would you measure success for this feature?" — appear frequently and catch people off guard.

Do I need calculus and linear algebra for data science?

For most applied data science and analyst roles, no — you won't derive backpropagation by hand in production. You do need enough linear algebra to understand matrix operations (relevant to neural networks and PCA) and enough calculus to understand optimization and gradient descent conceptually. For ML engineering or research roles, go deeper. For analytics, statistics and SQL matter more than calculus.

What's the real difference between data science, data analytics, and machine learning?

These overlap significantly in practice. Roughly: data analytics focuses on interpreting historical data to inform decisions — SQL-heavy, BI-tool-heavy. Data science adds predictive modeling and statistical inference. Machine learning engineering focuses on building and deploying production models at scale. Most job postings mix these responsibilities, so the title boundaries are considerably less clean than they appear.

Bottom Line

No cheat sheet replaces working through real data problems. What it does is reduce the friction of getting started and stop you from losing time to syntax lookups during the parts of the work that require actual thinking.

If you're early in your learning: pick one section from this cheat sheet — pandas, SQL, or statistics — and complete a structured course on it before moving to the next. Broad exposure without depth produces portfolios full of tutorials and no real projects.

If you're preparing for interviews: the SQL and machine learning algorithm sections are where most people fall short. Practice window functions until the syntax is automatic. Be ready to explain precision versus recall to someone without a technical background.

The courses above — particularly IBM's Python for Data Science and Analyze Data to Answer Questions — cover the applied versions of what's in this reference. They're worth completing if you want to move from recognizing concepts to using them correctly under pressure.
