Benchmarking Keyword Extraction Algorithms in Python


I have been looking for an effective algorithm for the keyword extraction task. The goal is to find one that extracts keywords efficiently, balancing extraction quality against execution time, since my data corpus is quickly growing to millions of rows. One of my main requirements is that the extracted keywords are meaningful on their own: even out of context, they should convey a definite meaning.

This article tests and experiments with several well-known keyword extraction algorithms on a corpus of 2,000 documents.

Libraries used

I used the following Python libraries for the research:

- rake-nltk for the RAKE algorithm
- yake for YAKE
- pke for PositionRank, SingleRank, MultipartiteRank, and TopicRank
- keybert for KeyBERT
- spaCy for grammar-based keyword validation
- Pandas and Matplotlib, along with other general-purpose libraries

Experimental procedure

The benchmark works as follows:

  

We first import the dataset containing our text data. Then we create a separate function implementing the extraction logic of each algorithm:

algorithm_name(text: str) → [keyword1, keyword2, ..., keywordn]

Then we create a function that extracts keywords from the entire corpus:

  extract_keywords_from_corpus(algorithm, corpus) → {algorithm, corpus_keywords, elapsed_time}

Next, we use Spacy to define a matcher object that returns true or false, deciding whether a keyword is meaningful for our task.

Finally, we wrap everything in a function that outputs the final report.

Dataset

I am using a dataset of small texts from the internet. Here is a sample:

  ['To follow up from my previous questions. . Here is the result!\n',

'European mead competitions?\nI’d love some feedback on my mead, but entering the Mazer Cup isn’t an option for me, since shipping alcohol to the USA from Europe is illegal. (I know I probably wouldn’t get caught/prosecuted, but any kind of official record of an issue could screw up my upcoming citizenship application and I’m not willing to risk that).\n\nAre there any European mead comps out there? Or at least large beer comps that accept entries in the mead categories and are likely to have experienced mead judges?', 'Orange Rosemary Booch\n', 'Well folks, finally happened. Went on vacation and came home to mold.\n', 'I’m opening a gelato shop in London on Friday so we’ve been up non-stop practicing flavors - here’s one of our most recent attempts!\n', "Does anyone have resources for creating shelf stable hot sauce? Ferment and then water or pressure can?\nI have dozens of fresh peppers I want to use to make hot sauce, but the eventual goal is to customize a recipe and send it to my buddies across the States. I believe canning would be the best way to do this, but I'm not finding a lot of details on it. Any advice?", 'what is the practical difference between a wine filter and a water filter?\nwondering if you could use either', 'What is the best custard base?\nDoes someone have a recipe that tastes similar to Culver’s frozen custard?', 'Mold?\n'

Most of it is food-related. We will use a sample of 2,000 documents to test our algorithms.
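Loading and sampling the corpus could look like this (a minimal sketch; the file and column names are assumptions, since the loading code is not shown in the article):

import pandas as pd

# hypothetical file and column names - the article does not show the loading step
df = pd.read_csv("food_texts.csv")
corpus = df["text"].sample(2000, random_state=42).tolist()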

We have not preprocessed the text, since some of the algorithms base their results on stop words and punctuation.

Algorithms

Let's define the keyword extraction functions.

import yake
import pke
from rake_nltk import Rake
from keybert import KeyBERT

# initiate BERT outside of functions
bert = KeyBERT()

# 1. RAKE
def rake_extractor(text):
    """
    Uses Rake to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    r = Rake()
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:5]

# 2. YAKE
def yake_extractor(text):
    """
    Uses YAKE to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = yake.KeywordExtractor(lan="en", n=3, windowsSize=3, top=5).extract_keywords(text)
    results = []
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 3. PositionRank
def position_rank_extractor(text):
    """
    Uses PositionRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    # define the valid parts of speech to occur in the graph
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.PositionRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos, maximum_word_number=5)
    # weight the candidates using the sum of their word scores, computed
    # via a random walk biased by the position of the words in the document.
    # In the graph, nodes are words (nouns and adjectives only) that are
    # connected if they occur within a window of 3 words.
    extractor.candidate_weighting(window=3, pos=pos)
    # get the 5 highest-scored candidates as keyphrases
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 4. SingleRank
def single_rank_extractor(text):
    """
    Uses SingleRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.SingleRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting(window=3, pos=pos)
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 5. MultipartiteRank
def multipartite_rank_extractor(text):
    """
    Uses MultipartiteRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    # build the Multipartite graph and rank candidates using random walk;
    # alpha controls the weight adjustment mechanism, see TopicRank for
    # the threshold/method parameters
    extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 6. TopicRank
def topic_rank_extractor(text):
    """
    Uses TopicRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 7. KeyBERT
def keybert_extractor(text):
    """
    Uses KeyBERT to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = bert.extract_keywords(text, keyphrase_ngram_range=(3, 5), stop_words="english", top_n=5)
    results = []
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

Each extractor takes the text as an argument and returns a list of keywords. Usage is very simple.
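For example, on one of the documents from the sample above (output will vary by library version, so none is shown here):

text = "What is the best custard base? Does someone have a recipe that tastes similar to Culver's frozen custard?"

print(rake_extractor(text))     # top 5 RAKE phrases
print(yake_extractor(text))     # top 5 YAKE keywords
print(keybert_extractor(text))  # top 5 KeyBERT keyphrases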

Note: for some reason, I could not initialize all the extractor objects outside their functions. Whenever I did, TopicRank and MultipartiteRank threw errors. This is not ideal performance-wise, but the benchmark still works.

  

We restricted the acceptable grammatical patterns by passing pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}: together with Spacy, this ensures that almost all keywords are sensible choices from a human-language perspective. We also want the keywords to contain at least three words, just to get more specific keywords and avoid overly generic ones.
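The matcher code itself is not part of this excerpt; a minimal sketch of the idea, assuming a simple POS-based pattern (the pattern and helper name below are illustrative, not the author's exact code):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# assumed pattern: one or more tokens whose POS is in the allowed set
matcher.add("POS_PATTERN", [[{"POS": {"IN": ["NOUN", "PROPN", "ADJ", "ADV"]}, "OP": "+"}]])

def is_meaningful(keyword):
    """Return True if the whole keyword matches the allowed POS pattern."""
    doc = nlp(keyword)
    # require a match that spans the entire keyword, not just a fragment
    return any(start == 0 and end == len(doc) for _, start, end in matcher(doc))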

Extracting keywords from the entire corpus

Now let's define a function that applies a single extractor to the entire corpus while outputting some information.

import logging
import time

def extract_keywords_from_corpus(extractor, corpus):
    """This function uses an extractor to retrieve keywords from a list of documents"""
    extractor_name = extractor.__name__.replace("_extractor", "")
    logging.info(f"Starting keyword extraction with {extractor_name}")
    corpus_kws = {}
    start = time.time()
    # logging.info(f"Timer initiated.")
    # the excerpt cuts off at this point; the remainder is a minimal
    # reconstruction matching the {algorithm, corpus_keywords, elapsed_time}
    # signature described earlier
    for idx, text in enumerate(corpus):
        corpus_kws[idx] = extractor(text)
    elapsed_time = time.time() - start
    logging.info(f"Finished keyword extraction with {extractor_name} in {elapsed_time:.2f} seconds")
    return {
        "algorithm": extractor.__name__,
        "corpus_keywords": corpus_kws,
        "elapsed_time": elapsed_time,
    }
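Putting it all together, a benchmark driver would simply run every extractor through this function (a sketch following the procedure outlined earlier; the final report function is not shown in this excerpt):

extractors = [
    rake_extractor,
    yake_extractor,
    position_rank_extractor,
    single_rank_extractor,
    multipartite_rank_extractor,
    topic_rank_extractor,
    keybert_extractor,
]

reports = [extract_keywords_from_corpus(extractor, corpus) for extractor in extractors]
for report in reports:
    print(report["algorithm"], f"took {report['elapsed_time']:.2f} seconds")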
