An implementation of a simple machine learning algorithm, the Naive Bayes classifier, used to judge whether a given list of strings is abusive or not.

## Usage:

This program was written for my Android diary app *Whisper*, a private diary whose highlight feature judges whether a diary entry is positive or not and gives corresponding feedback.

The diary app *Whisper* is on GitHub.

The code inside *Whisper* lags slightly behind this repository.

## Theory foundation:

The Naive Bayes classifier is a simple and effective classification method based on Bayes' theorem.

### Bayes theorem:

```
P(A|B) = P(B|A) P(A) / P(B)
```

Now let A be a specific category C from a collection C1, C2, ..., Cm, and let B be a combination of n features F1, F2, ..., Fn of a certain individual.

What we want to do is deduce which class an individual belongs to from his/her set of features. We can apply Bayes' theorem to the classification procedure by finding the class C that maximizes the expression below:

```
P(C|F1F2...Fn) = P(F1F2...Fn|C)P(C) / P(F1F2...Fn)
```

Here P(F1F2...Fn) can be left out, since it has the same value no matter how C changes. The question reduces to maximizing:

```
P(F1F2...Fn|C)P(C)
```
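As a tiny numeric sanity check (the numbers here are invented for illustration), comparing `P(F|C)P(C)` across classes already picks out the most likely one, since the shared denominator `P(F)` cancels:

```java
public class BayesExample {
    public static void main(String[] args) {
        // Hypothetical numbers: two classes C1, C2 with priors 0.6 / 0.4,
        // and one feature F with P(F|C1) = 0.2, P(F|C2) = 0.7.
        double pC1 = 0.6, pC2 = 0.4;
        double pFgivenC1 = 0.2, pFgivenC2 = 0.7;
        // The denominator P(F) is the same for both classes, so comparing
        // P(F|C)P(C) is enough to pick the more likely class.
        double score1 = pFgivenC1 * pC1; // 0.12
        double score2 = pFgivenC2 * pC2; // 0.28
        System.out.println(score1 > score2 ? "C1" : "C2"); // prints "C2"
    }
}
```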

### Naive Bayes theorem:

If we take a step further and assume all n features are independent, we are using the Naive Bayes assumption, and the question is simplified again:

```
P(F1F2...Fn|C)P(C) = P(F1|C)P(F2|C) ... P(Fn|C)P(C)
```

We can estimate every *P* on the right side of the equation from our training set, and the problem is solved.
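Each per-word conditional probability can be estimated by simple counting. Below is a minimal sketch (the method name and sample data are my own, not from the project) of estimating `P(word|class)` from the documents of one class:

```java
import java.util.*;

public class ConditionalProb {
    // Estimate P(word|class) as count(word in class docs) / totalWords(class docs).
    static double pWordGivenClass(String word, List<List<String>> docsOfClass) {
        int count = 0, total = 0;
        for (List<String> doc : docsOfClass) {
            for (String w : doc) {
                total++;
                if (w.equals(word)) count++;
            }
        }
        return (double) count / total;
    }

    public static void main(String[] args) {
        // Illustrative "abusive" training documents.
        List<List<String>> abusive = Arrays.asList(
            Arrays.asList("stupid", "dog"),
            Arrays.asList("stupid", "worthless", "garbage"));
        // "stupid" appears 2 times out of 5 words total.
        System.out.println(pWordGivenClass("stupid", abusive)); // prints 0.4
    }
}
```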

## Key points and code:

### Overview:

### Train data to dictionary:

```
for (List<String> data : trainData) {
```
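A minimal self-contained sketch of what this step might look like in full (my reconstruction, not the project's actual code): collect every distinct word across the training documents into a dictionary:

```java
import java.util.*;

public class Vocabulary {
    // Build the dictionary: every distinct word seen in the training data,
    // in first-seen order (LinkedHashSet preserves insertion order).
    static List<String> buildVocab(List<List<String>> trainData) {
        Set<String> vocab = new LinkedHashSet<>();
        for (List<String> data : trainData) {
            vocab.addAll(data);
        }
        return new ArrayList<>(vocab);
    }

    public static void main(String[] args) {
        List<List<String>> trainData = Arrays.asList(
            Arrays.asList("sad", "my", "friend"),
            Arrays.asList("my", "dog", "is", "cute"));
        System.out.println(buildVocab(trainData));
        // prints [sad, my, friend, dog, is, cute]
    }
}
```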

### Train data to train vectors:

```
for (int i = 0; i < trainData.get(j).size(); i++) {
```
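This step maps each document onto the dictionary. A self-contained sketch (my reconstruction; names are illustrative) that counts how often each dictionary word appears in a document:

```java
import java.util.*;

public class WordVector {
    // Convert a document into a count vector over the dictionary:
    // vec[i] = number of times vocab.get(i) occurs in the document.
    static List<Integer> toVector(List<String> vocab, List<String> doc) {
        List<Integer> vec = new ArrayList<>(Collections.nCopies(vocab.size(), 0));
        for (String word : doc) {
            int idx = vocab.indexOf(word);
            if (idx >= 0) vec.set(idx, vec.get(idx) + 1); // ignore unseen words
        }
        return vec;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("sad", "my", "friend", "dog");
        System.out.println(toVector(vocab, Arrays.asList("my", "dog", "my")));
        // prints [0, 2, 0, 1]
    }
}
```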

### Train vectors to feature vector:

```
pAbusive = (double) (count) / trainDocNum;
```

We used a function called `laplace` here for Laplace smoothing, and we also floor the denominator at a fixed value so it never becomes too small and causes unpredictable results.
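A minimal sketch of add-one (Laplace) smoothing (the method name and constants are illustrative, not the project's): starting the numerator at 1 means a word never seen in a class still gets a small nonzero probability, instead of zeroing out the whole product:

```java
public class Laplace {
    // Add-one smoothing: (count + 1) / (total + vocabSize), so that
    // P(word|class) is never exactly zero for an unseen word.
    static double smoothed(int wordCount, int totalWords, int vocabSize) {
        return (double) (wordCount + 1) / (totalWords + vocabSize);
    }

    public static void main(String[] args) {
        // An unseen word (count 0) still gets a small nonzero probability.
        System.out.println(smoothed(0, 10, 5)); // 1/15
        System.out.println(smoothed(3, 10, 5)); // 4/15
    }
}
```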

### Classify the given sentence:

```
for (int x = 0; x < testVec.size(); x++) {
    p0 += p0Vec.get(x) * testVec.get(x);
    p1 += p1Vec.get(x) * testVec.get(x);
}
p1 += Math.log(pAbusive);
p0 += Math.log(1 - pAbusive);
if (p0 > -0.4 || p1 > -1.4) {
    System.out.println("Neutral Words!");
} else if (p1 > p0) {
    System.out.println("Positive Words!");
} else if (p0 > p1) {
    System.out.println("Negative Words!");
}
```

Maybe you noticed that we add logarithms instead of multiplying probabilities. That is to avoid underflow, because the individual probabilities can be very small.
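A quick demonstration of why the log trick matters (illustrative numbers, not from the project): multiplying a few hundred small probabilities underflows a `double` to zero, while summing their logs stays well within range:

```java
public class LogSpace {
    public static void main(String[] args) {
        // Multiplying 200 probabilities of 1e-4 gives 1e-800,
        // which underflows a double to exactly 0.0.
        double product = 1.0;
        for (int i = 0; i < 200; i++) product *= 1e-4;
        System.out.println(product); // prints 0.0

        // The equivalent sum of logs is perfectly representable.
        double logSum = 0.0;
        for (int i = 0; i < 200; i++) logSum += Math.log(1e-4);
        System.out.println(logSum); // a finite value near -1842
    }
}
```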

### No need to train every time

I also added a store function so we don't need to retrain on the raw data every time: when a saved dictionary and feature vectors exist, we just parse the txt file and use them, which is much faster and more convenient.
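A minimal sketch of such a store/load round trip (the file format and names here are my own, not necessarily what the project uses): write the trained vector as one space-separated line and parse it back on the next run:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

public class Persist {
    // Save a trained vector as a single space-separated line of doubles.
    static void save(Path file, List<Double> vec) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (double v : vec) sb.append(v).append(' ');
        Files.write(file, sb.toString().trim().getBytes());
    }

    // Parse the saved line back into a vector, skipping retraining.
    static List<Double> load(Path file) throws IOException {
        List<Double> vec = new ArrayList<>();
        for (String tok : new String(Files.readAllBytes(file)).split("\\s+")) {
            vec.add(Double.parseDouble(tok));
        }
        return vec;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("naivebayes", ".txt");
        save(file, Arrays.asList(-2.5, -3.1));
        System.out.println(load(file)); // prints [-2.5, -3.1]
    }
}
```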

## See the output:

Take the simple sentence "Sad my friend is gone" as an example: the algorithm will use the dictionary to convert this sentence into a vector and calculate p0 and p1 using the feature vectors. This is the console output:

```
p0=-29.07335725097412 p1=-34.40233568665568 pA=0.421406667326145
Negative Words!
[sad, my, friend, is, gone]
```

## View the code on GitHub:

https://github.com/FengyangZhang/NaiveBayesian