NLTK Tokenization
NLTK provides two tokenization methods: nltk.word_tokenize(), which splits a given text into words, and nltk.sent_tokenize(), which splits it into sentences.
NLTK Word Tokenizer: nltk.word_tokenize()
The syntax of this method is given below.
tokens = nltk.word_tokenize(text)
where text is the string provided as input. word_tokenize() returns a list of strings (words), which can be stored in tokens.
Example – Word Tokenizer
In the following example, we will learn how to divide a given text into tokens at word level.
example.py – Python Program
import nltk
# download nltk packages
# for tokenization
nltk.download('punkt')
# input string
text = """Sun rises in the east."""
# tokenize text to words
tokens = nltk.word_tokenize(text)
# print tokens
print(tokens)
Output
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\TutorialKart\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
['Sun', 'rises', 'in', 'the', 'east', '.']
punkt is the package required for tokenization. You may download it using the NLTK download manager, or programmatically using nltk.download('punkt').
NLTK Sentence Tokenizer: nltk.sent_tokenize()
The syntax of this method is given below.
tokens = nltk.sent_tokenize(text)
where text is the string provided as input. sent_tokenize() returns a list of strings (sentences), which can be stored in tokens.
Example – Sentence Tokenizer
In this example, we will learn how to divide a given text into tokens at sentence level.
example.py – Python Program
import nltk
# download nltk packages
# for tokenization
nltk.download('punkt')
# input string
text = """Sun rises in the east.
Sun sets in the west."""
# tokenize at sentence level
tokens = nltk.sent_tokenize(text)
# print tokens
print(tokens)
Output
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\TutorialKart\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
['Sun rises in the east.', 'Sun sets in the west.']