Naive Bayes Classifier Calculator & Visual Guide

Build a simple naive Bayes classifier with binary (present/absent) features: set class priors and conditional probabilities, classify an observation, and see the posterior probabilities together with the Bayes’ rule steps.

Naive Bayes · Bayes’ theorem · Posterior probabilities · Discrete/Bernoulli features · Machine learning

1. Configure the naive Bayes model structure

Choose how many classes and binary features your model has. Features are treated as present/absent; for each class you will specify \(P(\text{feature present} \mid \text{class})\).

E.g. spam vs not spam, positive/neutral/negative sentiment, etc.

Features are binary (present/absent) in this tool.

Input mode

This calculator uses probabilities as input. To go from counts to probabilities, divide the number of training examples in which the feature is present by the total number of examples in that class. For sparse data you may want to use Laplace smoothing (see the training section below).
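For example (with made-up counts): if 30 of 40 training emails labelled spam contain the word “offer”, then \(P(\text{offer present} \mid \text{spam}) = 30/40 = 0.75\).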

2. Enter priors and conditional probabilities

For each class \(C_k\), enter a prior \(P(C_k)\). If all priors are left at 0 or empty, the calculator assumes equal priors. For each feature \(X_j\) and class \(C_k\), enter \(P(X_j = \text{present} \mid C_k)\).
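A minimal sketch of this default, assuming empty fields are treated as zero; whether non-zero priors are renormalized to sum to 1 is an assumption here, not something stated above:

```python
import numpy as np

def resolve_priors(raw_priors):
    """Fall back to equal priors when every entry is empty (None) or zero."""
    p = np.array([v if v is not None else 0.0 for v in raw_priors], dtype=float)
    if p.sum() == 0.0:
        return np.full(len(p), 1.0 / len(p))
    return p / p.sum()  # assumed normalization of user-entered priors

# e.g. all priors left empty -> equal priors for three classes
print(resolve_priors([None, None, None]))  # [0.333... 0.333... 0.333...]
```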


Each cell is \(P(\text{feature present} \mid \text{class})\) in [0, 1]. The probability of absence is \(1 - p\).

To avoid zero-probability issues, probabilities of exactly 0 or 1 are internally clipped slightly toward the interior of (0, 1) when computing log-likelihoods.
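A minimal sketch of this kind of clipping (the exact margin used by the calculator is not documented; 1e-9 here is an assumption):

```python
import numpy as np

EPS = 1e-9  # assumed margin; the calculator's exact value is not stated

def safe_log(p):
    """Clip probabilities away from exactly 0 and 1 before taking the log."""
    return np.log(np.clip(p, EPS, 1.0 - EPS))
```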

3. Describe the observation to classify

Use the toggles to indicate which features are present in the observation. Features left off are treated as absent.

Naive Bayes classifier in a nutshell

A naive Bayes classifier uses Bayes’ theorem to assign a class label to an observation. For a discrete feature vector \(x = (x_1, \dots, x_d)\) and classes \(C_1, \dots, C_K\), the classifier predicts the class with the largest posterior probability \[ P(C_k \mid x) = \frac{P(C_k) P(x \mid C_k)}{P(x)}. \] Since \(P(x)\) is the same for every class, we only need to compare \[ P(C_k \mid x) \propto P(C_k) P(x \mid C_k). \]
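As a concrete example with made-up numbers and two classes: if \(P(C_1) = 0.6\), \(P(C_2) = 0.4\), \(P(x \mid C_1) = 0.05\) and \(P(x \mid C_2) = 0.10\), the unnormalized scores are \(0.6 \times 0.05 = 0.03\) and \(0.4 \times 0.10 = 0.04\), so \(P(C_1 \mid x) = 0.03/0.07 \approx 0.43\) and \(P(C_2 \mid x) = 0.04/0.07 \approx 0.57\); the classifier predicts \(C_2\) even though \(C_1\) has the larger prior.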

The “naive” conditional independence assumption

The difficult part in high dimensions is modelling the joint likelihood \(P(x \mid C_k)\). Naive Bayes makes a simplifying – and often unrealistic – assumption: conditional independence of features given the class:

\[ P(x \mid C_k) = P(x_1, \dots, x_d \mid C_k) \approx \prod_{j=1}^{d} P(x_j \mid C_k). \]

For binary features (present/absent) this becomes especially simple. If we denote \(X_j \in \{0,1\}\) and write \[ \theta_{jk} = P(X_j = 1 \mid C_k), \] then \[ P(x_j \mid C_k) = \begin{cases} \theta_{jk}, & \text{if } x_j = 1,\\[4pt] 1 - \theta_{jk}, & \text{if } x_j = 0. \end{cases} \]
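A one-line sketch of this case split in Python (the function name is illustrative):

```python
def bernoulli_likelihood(x_j, theta_jk):
    """P(X_j = x_j | C_k) for a binary feature, with theta_jk = P(X_j = 1 | C_k)."""
    return theta_jk if x_j == 1 else 1.0 - theta_jk
```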

Posterior computation in this calculator

The calculator computes an unnormalized log-score for each class:

\[ s_k = \log P(C_k) + \sum_{j=1}^{d} \log P(x_j \mid C_k). \]

To avoid numerical underflow, we work in log space and then subtract the maximum score before exponentiating:

\[ \tilde p_k = \exp(s_k - s_{\max}), \quad P(C_k \mid x) = \frac{\tilde p_k}{\sum_{\ell=1}^{K} \tilde p_\ell}. \]

This yields stable posterior probabilities that sum to 1, even when there are many features.
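A minimal NumPy sketch of these two steps (the function name and the clipping margin `eps` are illustrative, not the calculator's actual implementation):

```python
import numpy as np

def naive_bayes_posteriors(priors, theta, x, eps=1e-9):
    """Posterior P(C_k | x) for binary features, computed in log space.

    priors : shape (K,),   P(C_k) for each class
    theta  : shape (K, d), theta[k, j] = P(X_j = 1 | C_k)
    x      : shape (d,),   observed 0/1 feature vector
    """
    priors = np.clip(np.asarray(priors, dtype=float), eps, 1.0)
    theta = np.clip(np.asarray(theta, dtype=float), eps, 1.0 - eps)
    x = np.asarray(x)

    # log P(x_j | C_k): log(theta) where x_j = 1, log(1 - theta) where x_j = 0
    log_lik = x * np.log(theta) + (1 - x) * np.log(1.0 - theta)

    # unnormalized log-scores s_k = log P(C_k) + sum_j log P(x_j | C_k)
    scores = np.log(priors) + log_lik.sum(axis=1)

    # subtract the maximum score before exponentiating (numerical stability)
    unnorm = np.exp(scores - scores.max())
    return unnorm / unnorm.sum()

# e.g. two classes, two binary features, observation with only feature 1 present
print(naive_bayes_posteriors(priors=[0.5, 0.5],
                             theta=[[0.9, 0.2],
                                    [0.3, 0.7]],
                             x=[1, 0]))
```

Here `x` is the 0/1 vector of toggles from step 3, and the index of the largest returned posterior (e.g. via `np.argmax`) is the predicted class.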

From counts to probabilities (training a naive Bayes model)

In practice, you usually start from a labelled dataset. For a binary feature \(X_j\) and class \(C_k\), let:

  • \(n_{jk}^{(1)}\): number of training examples in class \(C_k\) with feature \(X_j = 1\)
  • \(n_{k}\): total number of training examples in class \(C_k\)

A simple estimate of the conditional probability is:

\[ \hat \theta_{jk} = \frac{n_{jk}^{(1)}}{n_k}. \]

To avoid zero probabilities, many implementations use Laplace (add-one) smoothing:

\[ \hat \theta_{jk}^{(\text{Laplace})} = \frac{n_{jk}^{(1)} + \alpha}{n_k + 2\alpha}, \] where \(\alpha = 1\) is a common choice for Bernoulli features.
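A minimal sketch of these estimates (the function name is illustrative; `X` and `y` stand for a hypothetical binary feature matrix and label vector):

```python
import numpy as np

def estimate_bernoulli_naive_bayes(X, y, alpha=1.0):
    """Estimate priors P(C_k) and theta_jk = P(X_j = 1 | C_k) from labelled 0/1 data.

    X     : shape (n, d), binary feature matrix
    y     : shape (n,),   class labels
    alpha : Laplace smoothing parameter (alpha = 1 is add-one smoothing)
    """
    X = np.asarray(X)
    y = np.asarray(y)
    classes = np.unique(y)

    # prior: fraction of training examples in each class
    priors = np.array([(y == k).mean() for k in classes])

    # smoothed estimate (n_jk + alpha) / (n_k + 2 * alpha) for each feature and class
    theta = np.array([
        (X[y == k].sum(axis=0) + alpha) / ((y == k).sum() + 2 * alpha)
        for k in classes
    ])
    return classes, priors, theta
```

The resulting priors and \(\hat\theta_{jk}\) values can be typed directly into the tables in step 2.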

FAQ – naive Bayes classifier

How does naive Bayes compare to logistic regression?

Both are probabilistic classifiers, but they model different quantities. Naive Bayes models the class-conditional likelihoods \(P(x \mid C_k)\) and priors \(P(C_k)\), then uses Bayes’ theorem to obtain the posterior. Logistic regression directly models \(P(C_k \mid x)\) as a linear function of \(x\) passed through the logistic link. Logistic regression does not assume feature independence, but it usually requires more data and iterative optimization to train.

What are typical applications of naive Bayes?

Classic applications include spam filtering, document classification, sentiment analysis and simple medical triage. Because the model is extremely fast to train and to evaluate, it is often used as a baseline for large-scale text and recommender systems.

What are the main limitations of naive Bayes?

The independence assumption is often violated: correlated features can lead to over-confident probabilities. In addition, naive Bayes with simple likelihoods (e.g. Bernoulli or Gaussian) may underfit complex class-conditional structures. It is therefore not a universal solution but a strong baseline and a good choice when data are limited and interpretability and speed are important.

Can I use this calculator for continuous features?

This specific implementation assumes binary features (present/absent). In practice, continuous naive Bayes models (e.g. Gaussian naive Bayes) are common: each feature is modelled as a normal distribution conditional on the class. To use this calculator with continuous features, you would need to first discretize them into binary indicators (for example, high vs low).
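A minimal sketch of such a discretization (the threshold is a modelling choice you make per feature):

```python
import numpy as np

def binarize(values, threshold):
    """Map a continuous feature to a present (1) / absent (0) indicator."""
    return (np.asarray(values, dtype=float) > threshold).astype(int)

# e.g. treat "high temperature" as present above a hypothetical cut-off of 38.0
binarize([36.5, 39.2, 37.8], threshold=38.0)  # -> array([0, 1, 0])
```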