Commit 6ba3714

add top k frequent words
1 parent c692662 commit 6ba3714

2 files changed, +140 -0 lines changed

zh-hans/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -220,6 +220,7 @@
 * [Kth Smallest Number in Sorted Matrix](data_structure/kth_smallest_number_in_sorted_matrix.md)
 * [Big Data](bigdata/README.md)
 * [Top K Frequent Words (Map Reduce)](bigdata/top_k_frequent_words_map_reduce.md)
+* [Top K Frequent Words](bigdata/top_k_frequent_words.md)
 * [Problem Misc](problem_misc/README.md)
 * [Nuts and Bolts Problem](problem_misc/nuts_and_bolts_problem.md)
 * [String to Integer](problem_misc/string_to_integer.md)
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
---
difficulty: Medium
tags:
- Pocket Gems
- Hash Table
- Amazon
- Priority Queue
- Bloomberg
- Yelp
- Heap
- Uber
- EditorsChoice
title: Top K Frequent Words
---

# Top K Frequent Words

## Problem

### Metadata

- tags: Pocket Gems, Hash Table, Amazon, Priority Queue, Bloomberg, Yelp, Heap, Uber, EditorsChoice
- difficulty: Medium
- source(lintcode): <https://www.lintcode.com/problem/top-k-frequent-words/>
- source(leetcode): <https://leetcode.com/problems/top-k-frequent-words/>

### Description

Given a list of words and an integer k, return the k most frequent words in the list.

#### Notice

Order the returned words by frequency, with the most frequent word first. If two words have the same frequency, the one that comes first in alphabetical order comes first.

#### Example

Given

    [
        "yes", "lint", "code",
        "yes", "code", "baby",
        "you", "baby", "chrome",
        "safari", "lint", "code",
        "body", "lint", "code"
    ]

for k = `3`, return `["code", "lint", "baby"]`.

for k = `4`, return `["code", "lint", "baby", "yes"]`.

#### Challenge

Do it in O(n log k) time and O(n) extra space.

55+
## 题解
56+
57+
输出出现频率最高的 K 个单词并对相同频率的单词按照字典序排列。如果我们使用大根堆维护,那么我们可以在输出结果时依次移除根节点即可。这种方法虽然可行,但不可避免会产生不少空间浪费,理想情况下,我们仅需要维护 K 个大小的堆即可。所以接下来的问题便是我们怎么更好地维护这种 K 大小的堆,并且在新增元素时剔除的是最末尾(最小)的节点。
58+
59+
### Java
60+
61+
```java
62+
public class Solution {
63+
/**
64+
* @param words: an array of string
65+
* @param k: An integer
66+
* @return: an array of string
67+
*/
68+
public String[] topKFrequentWords(String[] words, int k) {
69+
// write your code here
70+
if (words == null || words.length == 0) return words;
71+
if (k <= 0) return new String[0];
72+
73+
Map<String, Integer> wordFreq = new HashMap<>();
74+
for (String word : words) {
75+
wordFreq.putIfAbsent(word, 0);
76+
wordFreq.put(word, wordFreq.get(word) + 1);
77+
}
78+
79+
PriorityQueue<KeyFreq> pq = new PriorityQueue<KeyFreq>(k);
80+
for (Map.Entry<String, Integer> entry : wordFreq.entrySet()) {
81+
KeyFreq kf = new KeyFreq(entry.getKey(), entry.getValue());
82+
if (pq.size() < k) {
83+
pq.offer(kf);
84+
} else {
85+
KeyFreq peek = pq.peek();
86+
if (peek.compareTo(kf) <= 0) {
87+
pq.poll();
88+
pq.offer(kf);
89+
}
90+
}
91+
}
92+
93+
int topKSize = Math.min(k, pq.size());
94+
String[] topK = new String[topKSize];
95+
for (int i = 0; i < k && !pq.isEmpty(); i++) {
96+
topK[i] = pq.poll().key;
97+
}
98+
99+
// reverse array
100+
for (int i = 0, j = topKSize - 1; i < j; i++, j--) {
101+
String temp = topK[i];
102+
topK[i] = topK[j];
103+
topK[j] = temp;
104+
}
105+
106+
return topK;
107+
}
108+
109+
class KeyFreq implements Comparable<KeyFreq> {
110+
String key;
111+
int freq;
112+
113+
public KeyFreq(String key, int freq) {
114+
this.key = key;
115+
this.freq = freq;
116+
}
117+
118+
@Override
119+
public int compareTo(KeyFreq kf) {
120+
if (this.freq != kf.freq) {
121+
return this.freq - kf.freq;
122+
}
123+
124+
return kf.key.compareTo(this.key);
125+
}
126+
}
127+
}
128+
```
129+
130+
131+
### 源码分析
132+
133+
使用 Java 自带的 PriorityQueue 来实现堆,由于需要定制大小比较,所以这里自定义类中实现了 `Comparable``compareTo` 接口,另外需要注意的是这里原生使用了小根堆,所以我们在覆写 `compareTo` 时需要注意字符串的比较,相同频率的按照字典序排序,即优先保留字典序较小的字符串,所以正好和 freq 的比较相反。最后再输出答案时,由于是小根堆,所以还需要再转置一次。此题的 Java 实现中,使用的 PriorityQueue 并非线程安全,实际使用中需要注意是否需要用到线程安全的 PriorityBlockingQueue
134+
135+
对于 Java, 虽然标准库中暂未有定长的 PriorityQueue 实现,但是我们常用的 Google guava 库中其实已有类似实现,见 [MinMaxPriorityQueue](https://google.github.io/guava/releases/snapshot/api/docs/com/google/common/collect/MinMaxPriorityQueue.html) 不必再自己造轮子了。
136+
137+
### 复杂度分析
138+
139+
堆的插入删除操作,定长为 K, n 个元素,故时间复杂度约 $$O(n \log K)$$, 空间复杂度为 $$O(n)$$.
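
The bound can be spelled out slightly further. As a sketch, assume hash-map operations and string comparisons cost $$O(1)$$, and write $$m$$ for the number of distinct words (a symbol introduced here, not used in the original), with $$2 \le K \le m \le n$$. Counting takes one pass over the input, each distinct word triggers at most one removal and one insertion on a heap of size at most $$K$$, and the final output polls $$K$$ entries:

$$
T(n) = O(n) + O(m \log K) + O(K \log K) \subseteq O(n \log K)
$$

For space, the frequency map holds $$m$$ entries and the heap holds at most $$K$$:

$$
S(n) = O(m) + O(K) \subseteq O(n)
$$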
