## Python: How to Calculate Word Lengths in a String
Analyzing text is a common task in natural language processing (NLP), data cleaning, and text analytics. One fundamental operation is splitting a string into individual words and calculating the length of each word.
This tutorial demonstrates how to write a clean, efficient Python program to count the length of each word in a given string and map the results into a structured format.
---
### Method Overview
To calculate the length of each word in a string, we follow a simple three-step process:
1. **Tokenization**: Split the input string into a list of individual words using the `.split()` method.
2. **Length Calculation**: Iterate through the list of words and calculate the length of each word using the built-in `len()` function.
3. **Mapping**: Pair each word with its corresponding length. We can store this mapping in a dictionary where the keys are the words and the values are their respective lengths.
---
### Code Example
Here is a complete Python implementation using a list comprehension together with the built-in `zip()` and `dict()` functions:
```python
def get_word_lengths(text_string):
    # Split the string into a list of words based on whitespace
    words = text_string.split()
    # Calculate the length of each word
    lengths = [len(word) for word in words]
    # Combine words and lengths into a dictionary
    return dict(zip(words, lengths))


# Example string
text = "Hello world this is a test"
result = get_word_lengths(text)
print(result)
```
#### Output:
```python
{'Hello': 5, 'world': 5, 'this': 4, 'is': 2, 'a': 1, 'test': 4}
```
---
### Code Explanation
1. **`text_string.split()`**: Called with no arguments, the `.split()` method splits on runs of consecutive whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace. This converts our raw string into a list of words: `['Hello', 'world', 'this', 'is', 'a', 'test']`.
2. **`[len(word) for word in words]`**: This is a **list comprehension**. It iterates through each word in the `words` list, calculates its length using `len()`, and returns a new list of integers: `[5, 5, 4, 2, 1, 4]`.
3. **`zip(words, lengths)`**: The `zip()` function pairs elements from the `words` list and the `lengths` list position by position. It returns a lazy iterator that yields tuples: `('Hello', 5)`, `('world', 5)`, and so on.
4. **`dict(...)`**: The `dict()` constructor converts the zipped pairs into a key-value dictionary, making it easy to look up the length of any specific word.
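To make the four steps above concrete, the following sketch prints each intermediate value. The variable names (`messy`, `pairs`, `pair_list`, `mapping`) are illustrative choices and not part of the original function.

```python
text = "Hello world this is a test"

# Step 1: split on whitespace; extra spaces, tabs, and newlines are ignored
words = text.split()
messy = "  Hello\tworld \n".split()
print(messy)  # ['Hello', 'world']

# Step 2: one length per word, in the same order
lengths = [len(word) for word in words]
print(lengths)  # [5, 5, 4, 2, 1, 4]

# Step 3: zip() is lazy, so wrap it in list() to inspect the pairs
pairs = zip(words, lengths)
pair_list = list(pairs)
print(pair_list[:2])  # [('Hello', 5), ('world', 5)]

# Step 4: dict() accepts any iterable of (key, value) pairs
mapping = dict(pair_list)
print(mapping["test"])  # 4
```

Note that a `zip` object can only be consumed once, which is why the sketch stores the result of `list(pairs)` before passing it to `dict()`.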
---
### Alternative & Optimized Approaches
While the `zip()` approach is readable, Python offers more concise ways to achieve the same result.
#### 1. Using Dictionary Comprehension (Recommended)
You can combine the splitting and mapping steps into a single line using a **dictionary comprehension**, which builds the dictionary directly without the intermediate `lengths` list or the `zip()` call:
```python
text = "Hello world this is a test"
# One-liner dictionary comprehension
word_lengths = {word: len(word) for word in text.split()}
print(word_lengths)
# Output: {'Hello': 5, 'world': 5, 'this': 4, 'is': 2, 'a': 1, 'test': 4}
```
#### 2. Handling Punctuation
In real-world scenarios, strings often contain punctuation marks (like commas, periods, or exclamation points). If you do not strip them, they will be counted as part of the word length (e.g., `"world,"` would have a length of 6 instead of 5).
You can clean the text using the `string.punctuation` constant from the standard `string` module, which contains the ASCII punctuation characters:
```python
import string


def clean_word_lengths(text_string):
    # Remove punctuation from the text
    cleaned_text = text_string.translate(str.maketrans('', '', string.punctuation))
    # Generate the word length dictionary
    return {word: len(word) for word in cleaned_text.split()}


text_with_punctuation = "Hello, world! This is a test."
print(clean_word_lengths(text_with_punctuation))
# Output: {'Hello': 5, 'world': 5, 'This': 4, 'is': 2, 'a': 1, 'test': 4}
```
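One side effect of the `translate()` approach is that it also removes punctuation inside words, so `"don't"` becomes `"dont"` and `"well-known"` becomes `"wellknown"`. If that is not what you want, a possible variation is to strip punctuation only from the ends of each token with `str.strip()`. The function name `word_lengths_keep_inner` below is hypothetical, and this is a sketch that assumes ASCII punctuation only.

```python
import string


def word_lengths_keep_inner(text_string):
    """Map each word to its length, stripping punctuation only at word edges."""
    result = {}
    for token in text_string.split():
        # strip() removes leading/trailing characters found in the given set
        word = token.strip(string.punctuation)
        if word:  # skip tokens that were nothing but punctuation, e.g. "--"
            result[word] = len(word)
    return result


print(word_lengths_keep_inner("Don't panic, it's a well-known test!"))
# Output: {"Don't": 5, 'panic': 5, "it's": 4, 'a': 1, 'well-known': 10, 'test': 4}
```

Whether apostrophes and hyphens should count toward a word's length depends on your use case, so choose the variant that matches how you define a "word".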
---
### Considerations
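* **Duplicate Words**: Because dictionary keys must be unique, a word that appears multiple times in the input string produces only one entry. The stored length is still correct, since every occurrence of the same word has the same length, but the number of occurrences and their positions are lost.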
* **Case Sensitivity**: Words with different casing (e.g., `"Test"` and `"test"`) will be treated as separate keys in the dictionary. If you want case-insensitive results, convert the string to lowercase using `.lower()` before splitting: `text.lower().split()`.
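
Both considerations can be illustrated with a small sketch. If you need one entry per occurrence, a list of `(word, length)` tuples preserves duplicates and order. The helper `word_length_pairs` and its `ignore_case` parameter are hypothetical names introduced here for illustration.

```python
def word_length_pairs(text_string, ignore_case=False):
    """Return one (word, length) tuple per word occurrence, in order."""
    if ignore_case:
        text_string = text_string.lower()
    return [(word, len(word)) for word in text_string.split()]


sample = "The test is the Test"

# A list keeps every occurrence, including repeated words
print(word_length_pairs(sample))
# Output: [('The', 3), ('test', 4), ('is', 2), ('the', 3), ('Test', 4)]

# Converting to a dict collapses duplicates; lowercasing merges case variants
print(dict(word_length_pairs(sample)))
# Output: {'The': 3, 'test': 4, 'is': 2, 'the': 3, 'Test': 4}
print(dict(word_length_pairs(sample, ignore_case=True)))
# Output: {'the': 3, 'test': 4, 'is': 2}
```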