NLTK Tokenization

NLTK provides two tokenization methods: nltk.word_tokenize(), which splits the given text into tokens at the word level, and nltk.sent_tokenize(), which splits it into tokens at the sentence level.

NLTK Word Tokenizer: nltk.word_tokenize()

The usage of this method is shown below.

 tokens = nltk.word_tokenize(text)

where

  • text is the string provided as input.
  • word_tokenize() returns a list of strings (words) which can be stored as tokens.

Example – Word Tokenizer

In the following example, we will learn how to divide the given text into tokens at the word level.

example.py – Python Program

import nltk

# download nltk packages
# for tokenization
nltk.download('punkt')

# input string
text = """Sun rises in the east."""

# tokenize text to words
tokens = nltk.word_tokenize(text)

# print tokens
print(tokens)

Output

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\TutorialKart\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Sun', 'rises', 'in', 'the', 'east', '.']
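
Note that word_tokenize() treats punctuation marks as separate tokens, as you can see with the trailing '.' above. It also splits contractions into their component parts. The snippet below is a minimal sketch demonstrating this behaviour on an example sentence of our own; the output shown in the comments is what we would expect from the Treebank-style word tokenizer used by NLTK.

import nltk

nltk.download('punkt')

# an example sentence containing punctuation and contractions
text = "Don't stop, it isn't over!"

print(nltk.word_tokenize(text))
# expected output: punctuation and contraction parts become separate tokens
# ['Do', "n't", 'stop', ',', 'it', 'is', "n't", 'over', '!']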

punkt is the package required for tokenization. You can download it using the NLTK download manager or programmatically with nltk.download('punkt').
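
If you do not want to call nltk.download('punkt') on every run, you can first check whether the package is already available locally. The following is a minimal sketch of that pattern using nltk.data.find(), which raises a LookupError when the resource is missing.

import nltk

# download punkt only if it is not already installed locally
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')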

NLTK Sentence Tokenizer: nltk.sent_tokenize()

The usage of this method is shown below.
 tokens = nltk.sent_tokenize(text)

where

  • text is the string provided as input.
  • sent_tokenize() returns a list of strings (sentences) which can be stored as tokens.

Example – Sentence Tokenizer

In this example, we will learn how to divide the given text into tokens at the sentence level.

example.py – Python Program

import nltk

# download nltk packages
# for tokenization
nltk.download('punkt')

# input string
text = """Sun rises in the east.

Sun sets in the west."""

# tokenize at sentence level
tokens = nltk.sent_tokenize(text)

# print tokens
print(tokens)

Output

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\TutorialKart\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Sun rises in the east.', 'Sun sets in the west.']

As with word tokenization, sentence tokenization requires the punkt package, which you can download in the same way.
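
The two tokenizers are often used together: first split the text into sentences with sent_tokenize(), then split each sentence into words with word_tokenize(). The following is a minimal sketch of that pattern, reusing the example text from above.

import nltk

nltk.download('punkt')

text = """Sun rises in the east.
Sun sets in the west."""

# tokenize into sentences, then each sentence into words
for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))

# expected output:
# ['Sun', 'rises', 'in', 'the', 'east', '.']
# ['Sun', 'sets', 'in', 'the', 'west', '.']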