试卷中怎样查找重复题目

查找重复项的快速解决方案

在深入了解细节之前，让我们先从一些快速、实用的解决方案开始。以下是在列表中查找重复项的三种常见方法：

# Using a set to find duplicates def find_duplicates_set(lst): seen = set() duplicates = set() for item in lst: if item in seen: duplicates.add(item) seen.add(item) return list(duplicates) # Using a dictionary to count occurrences def find_duplicates_dict(lst): count_dict = {} for item in lst: count_dict[item] = count_dict.get(item, 0) 1 return [item for item, count in count_dict.items() if count > 1] # Using list comprehension with count() def find_duplicates_count(lst): return list(set(x for x in lst if lst.count(x) > 1))

每种方法都有其自身的优点——让我们探讨一下何时使用每种方法。

了解基于集合的方法

基于集合的方法通常是查找重复项的最直接且最节省内存的方法：

numbers = [1, 2, 2, 3, 4, 4, 5] seen = set() duplicates = set() for num in numbers: if num in seen: duplicates.add(num) seen.add(num) print(list(duplicates)) # Output: [2, 4]

该方法的工作原理是：
1. 创建一个空集来跟踪已看到的项目2. 为重复项创建另一组3. 将每个项目添加到“seen”并检查我们之前是否遇到过它4. 如果我们以前见过它，那么它是重复的

主要优点是集合查找的时间复杂度为 O(1)，使得这种方法对于大型列表来说非常快。然而，它需要额外的内存来存储集合。

使用词典获取更多信息

当您需要知道每个项目出现了多少次时，字典方法会发挥作用：

def find_duplicates_with_counts(lst): count_dict = {} # Count occurrences for item in lst: count_dict[item] = count_dict.get(item, 0) 1 # Filter and format results duplicates_info = { item: count for item, count in count_dict.items() if count > 1 } return duplicates_info # Example usage items = ['apple', 'banana', 'apple', 'cherry', 'banana', 'banana'] result = find_duplicates_with_counts(items) print(result) # Output: {'apple': 2, 'banana': 3}

此方法为您提供有关重复项的更多详细信息：
- 每个项目出现多少次- 如果需要，可以轻松修改以跟踪位置- 对于数据分析任务有用

现实示例：清理客户数据

以下是查找重复客户记录的实际示例：

class Customer: def __init__(self, id, name, email): self.id = id self.name = name self.email = email def __eq__(self, other): return self.email.lower() == other.email.lower() def __hash__(self): return hash(self.email.lower()) def find_duplicate_customers(customers): seen = set() duplicates = [] for customer in customers: if customer in seen: duplicates.append(customer) seen.add(customer) return duplicates # Example usage customers = [ Customer(1, "John Doe", "john@example.com"), Customer(2, "Jane Smith", "jane@example.com"), Customer(3, "John Doe", "JOHN@EXAMPLE.COM"), # Duplicate email ] duplicates = find_duplicate_customers(customers) for dup in duplicates: print(f"Duplicate customer: {dup.name} ({dup.email})")

此示例演示如何：
- 处理不区分大小写的匹配- 使用自定义对象- 实现正确的“__eq__”和“__hash__”方法- 处理真实世界的数据场景

查找嵌套结构中的重复项

有时您需要在更复杂的数据结构中查找重复项：

def find_nested_duplicates(data): from collections import defaultdict # Helper function to convert lists/tuples to hashable format def make_hashable(obj): if isinstance(obj, (list, tuple)): return tuple(make_hashable(item) for item in obj) elif isinstance(obj, dict): return tuple(sorted((k, make_hashable(v)) for k, v in obj.items())) return obj # Track items and their positions positions = defaultdict(list) def process_item(item, path=[]): hashable_item = make_hashable(item) positions[hashable_item].append(path[:]) if isinstance(item, dict): for key, value in item.items(): process_item(value, path [key]) elif isinstance(item, (list, tuple)): for i, value in enumerate(item): process_item(value, path [i]) process_item(data) return {k: v for k, v in positions.items() if len(v) > 1} # Example usage nested_data = { 'a': [1, 2, [3, 4]], 'b': [3, 4], 'c': [1, 2, [3, 4]] } duplicates = find_nested_duplicates(nested_data) for item, locations in duplicates.items(): print(f"Item {item} found at paths: {locations}")

这个高级示例展示了如何：
- 处理嵌套结构（列表、字典、元组）- 跟踪重复项的位置- 将复杂结构转换为可散列格式- 处理递归数据结构

性能考虑因素

不同的方法有不同的性能特点：

import time import random def benchmark_duplicate_methods(size): # Generate test data test_list = [random.randint(1, size//2) for _ in range(size)] # Test set-based method start = time.time() duplicates_set = find_duplicates_set(test_list) set_time = time.time() - start # Test dictionary method start = time.time() duplicates_dict = find_duplicates_dict(test_list) dict_time = time.time() - start # Test count method start = time.time() duplicates_count = find_duplicates_count(test_list) count_time = time.time() - start return { 'set_method': set_time, 'dict_method': dict_time, 'count_method': count_time } # Run benchmark with different sizes sizes = [1000, 10000, 100000] for size in sizes: print(f"\nBenchmark with {size} items:") results = benchmark_duplicate_methods(size) for method, time_taken in results.items(): print(f"{method}: {time_taken:.4f} seconds")

关键绩效洞察：
- 基于集合的方法：最适合重复很少的大型列表- 字典方法：当您需要计数或位置时很好- List.count() 方法：简单但对于大型列表来说速度较慢- 内存使用量随着 set/dict 方法的独特项目的增加而增加

处理重复项的技巧

1. 根据您的需求选择合适的方法：
— 使用集合进行简单的重复检测— 当您需要计数或位置时使用字典— 对小列表或一次性操作使用列表理解

2. 考虑内存使用情况：

# Memory-efficient approach for large lists def find_duplicates_generator(lst): seen = set() for item in lst: if item in seen: yield item seen.add(item) # Usage with large lists large_list = list(range(1000000)) [1, 2, 3] duplicates = list(find_duplicates_generator(large_list))

3、特殊情况处理：

def find_duplicates_robust(lst): from collections import defaultdict counts = defaultdict(list) # Track both values and positions for i, item in enumerate(lst): counts[item].append(i) # Return items with positions where they appear more than once return { item: positions for item, positions in counts.items() if len(positions) > 1 } # Handles special cases like None, float('nan'), etc. mixed_list = [1, None, 1.0, None, float('nan'), float('nan')] print(find_duplicates_robust(mixed_list))

本指南涵盖了在 Python 列表中查找重复项的基本方面，从基本方法到高级场景。这些示例旨在实用并适应现实世界的情况，而解释不仅可以帮助您理解如何使用这些方法，还可以帮助您理解为什么以及何时使用每种方法。