Text Processing Using Natural Language Processing or NLTK


The main function takes three arguments, the third of which is optional. The first argument is the name of a text file. The second argument is either the string "listing" or the name of a second text file. The third argument is a Boolean (True or False) and defaults to False.
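As a rough sketch of that interface, the function below only decides which mode it is in and which files are involved. The parameter names textfile1, arg2 and normalize are assumptions made for illustration, and the return value is a stand-in for the real profiling work.

```python
def main(textfile1, arg2, normalize=False):
    """Hypothetical entry point: profile one text, or compare two texts."""
    # arg2 is either the literal string "listing" or the name of a second file.
    if arg2 == "listing":
        mode = "listing"
        files = [textfile1]
    else:
        mode = "compare"
        files = [textfile1, arg2]
    # A full implementation would build one profile per file at this point.
    return mode, files, normalize
```

Calling main("sample1.txt", "listing") selects the listing mode with normalization off, while main("sample1.txt", "sample2.txt", True) selects a comparison of two normalized profiles.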

The output is a pairwise comparison of words and their counts.
A More Detailed Specification:
1.    A sentence is a sequence of words ending in a full stop, comma or exclamation mark, which is followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance) or by white space (a space, tab or new-line character). For example:
This is some text. This is yet more text
2.    The program counts the occurrences of certain words (conjunctions), drawn from the following list:


3.    The program counts hyphens, single quotes, dashes and semicolons. Not every single-quote mark should be counted, however: apostrophes inside words such as “shouldn’t” or “won’t” need to be told apart from single quotes that surround a word.

Note: Some of the texts we will use include a double hyphen, i.e. "--". This is to be treated as a space character.

4.    The function also reports how many words occur per sentence and how many sentences occur per paragraph. Paragraphs are recognized by the blank line that ends them, i.e. by a trailing new-line character ('\n').

5.    If normalize is set to True, the profile is normalized: except for the words-per-sentence and sentences-per-paragraph values, every value is divided by the number of words in the respective text. (If two texts are given and the user sets normalize to True, both profiles must be normalized.)

6.    If the second argument is the string "listing", the profile is printed to the console once processing is complete. If, on the other hand, the second argument is a text file, the distance between the two profiles is computed with the standard distance formula:


An example is based on two files, sample1.txt and sample2.txt, both excerpts taken from "Life on the Mississippi", by Mark Twain.
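Points 2, 3 and 5 of the specification can be sketched roughly as follows, without the re module. This is one possible reading, not a reference solution: build_profile and split_words are hypothetical names, CONJUNCTIONS is a placeholder subset rather than the specification's list, only the hyphen, semicolon and straight single quote are counted, and a single quote is counted only when it sits between two letters.

```python
from collections import Counter

# Placeholder subset only; the specification's own conjunction list is not used here.
CONJUNCTIONS = ("and", "but", "or")


def split_words(text):
    """Lower-case the text, treat "--" as a space, split on white space
    and trim punctuation from both ends of each word."""
    words = []
    for raw in text.lower().replace("--", " ").split():
        word = raw.strip(".,;:!?\"'()-")
        if word:
            words.append(word)
    return words


def build_profile(text, normalize=False):
    """Return a dict of conjunction and punctuation counts for one text."""
    words = split_words(text)
    counts = Counter(words)
    profile = {c: counts[c] for c in CONJUNCTIONS}

    cleaned = text.replace("--", " ")  # a double hyphen acts as a space
    profile[";"] = cleaned.count(";")
    profile["-"] = cleaned.count("-")

    # Assumption: a single quote counts only as an apostrophe inside a word.
    apostrophes = 0
    for i in range(1, len(cleaned) - 1):
        if cleaned[i] == "'" and cleaned[i - 1].isalpha() and cleaned[i + 1].isalpha():
            apostrophes += 1
    profile["'"] = apostrophes

    if normalize and words:
        # Divide every count by the number of words in the text.
        profile = {key: value / len(words) for key, value in profile.items()}
    return profile
```

Under these assumptions, the apostrophe in "won't" is counted while the quotes around 'might' are not. Curly apostrophes would need to be added to the check if the input files contain them.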

Some Text Files to Examine
Here are some files for you to try out. All of the texts, apart from "Kangaroo", were obtained from Project Gutenberg. The Project Gutenberg files originally end with a long block of text containing the Project Gutenberg license and terms of use. I have linked the Gutenberg terms and license here rather than leaving them in the texts, because they could affect the profiles.

Things to avoid
Do not use regular expressions or the re module. Do not assume that the input files passed as arguments will only ever be text or csv files. Make sure the input() function is never called: the code is tested automatically, and a call to input() can make the program hang while it waits for a value, so the code may end up not being tested at all.
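With regular expressions ruled out, sentence and paragraph boundaries can still be found with plain string methods. The sketch below is one way to do it, under stated assumptions: count_sentences, split_paragraphs and sentence_stats are hypothetical names, the set of sentence-ending marks is passed in by the caller rather than fixed here, paragraphs are taken to be separated by a blank line, and only straight quotation marks are recognized after an end mark.

```python
def count_sentences(text, end_marks):
    """Count sentences: an end mark followed by a quote, white space or end of text."""
    count = 0
    pending = False  # True once the current sentence holds a letter or digit
    for i, ch in enumerate(text):
        if ch.isalnum():
            pending = True
        elif ch in end_marks and pending:
            nxt = text[i + 1] if i + 1 < len(text) else " "
            if nxt.isspace() or nxt in "\"'":
                count += 1
                pending = False
    # A trailing fragment without an end mark still counts as a sentence.
    return count + (1 if pending else 0)


def split_paragraphs(text):
    """Split on blank lines and drop empty pieces."""
    return [p for p in text.split("\n\n") if p.strip()]


def sentence_stats(text, end_marks):
    """Return words per sentence and sentences per paragraph for one text."""
    paragraphs = split_paragraphs(text)
    sentences = sum(count_sentences(p, end_marks) for p in paragraphs)
    words = len(text.split())
    return {
        "words_per_sentence": words / sentences if sentences else 0.0,
        "sentences_per_paragraph": sentences / len(paragraphs) if paragraphs else 0.0,
    }
```

Keeping end_marks as a parameter means the same code can be adjusted to whatever set of marks the specification requires, and nothing here reads from input(), so it is safe under automated testing.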

Important: Write a def main() function that drives all of the remaining functionality. The automated tests call this function, so it must be present for the test cases to pass.

How will the program behave?

The program starts by calling the main function. The predefined libraries are imported first, such as nltk for text analysis and pandas for file handling. Once the function is called, the data is tokenized, either into words or into sentences, and every word is converted to a single case (lower or upper) so that the comparison is not case-sensitive. Stop words and punctuation (commas, full stops and so on) are then separated out. Alphanumeric tokens are removed, since they carry little meaning and distort the word statistics. Finally, each word is ranked by how many times it occurs, and each word is given the average number of times it occurs per paragraph or per sentence.
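The steps above (tokenize, lower-case, drop stop words and punctuation, rank by frequency) can be sketched with the standard library alone. This is a minimal illustration, not the article's implementation: rank_words is a hypothetical name, STOP_WORDS is a tiny illustrative set, and any token that is not purely alphabetic is dropped, which also discards contractions such as "won't". With NLTK installed and its data downloaded, nltk.word_tokenize and nltk.corpus.stopwords could take the place of the hand-rolled tokenizer and stop-word set.

```python
from collections import Counter
import string

# Illustrative stop words only; a real run would use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def rank_words(text, stop_words=STOP_WORDS):
    """Return (word, count) pairs, most frequent first."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)  # trim leading/trailing punctuation
        # Keep purely alphabetic tokens that are not stop words.
        if token.isalpha() and token not in stop_words:
            tokens.append(token)
    return Counter(tokens).most_common()
```

For instance, rank_words("The cat and the dog. The cat is in room 42b!") ranks "cat" first with a count of 2, followed by "dog" and "room" with one occurrence each; "42b" is dropped because it mixes letters and digits.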