QNLP  v1.0
QNLP.proc.VectorSpaceModel.VSM_pc Class Reference

Public Member Functions

def __init__ (self)
 
def tokenize_corpus (self, corpus, proc_mode=0, stop_words=True, use_spacy=False)
 

Data Fields

 pc
 

Private Member Functions

def _get_token_position (self, tagged_tokens, token_type)
 

Detailed Description

Definition at line 27 of file VectorSpaceModel.py.

Constructor & Destructor Documentation

◆ __init__()

def QNLP.proc.VectorSpaceModel.VSM_pc.__init__(self)

Definition at line 28 of file VectorSpaceModel.py.

28      def __init__(self):
29          self.pc = pc
30  
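The pc assigned here is assumed to be the QNLP.proc.process_corpus module, imported at module level in VectorSpaceModel.py (the References entry for tokenize_corpus() below points at QNLP.proc.process_corpus.remove_stopwords()). A minimal sketch of that assumption:

    # Sketch only: the exact import alias used by VectorSpaceModel.py is an assumption.
    import numpy as np
    import QNLP.proc.process_corpus as pc

    vsm = VSM_pc()
    # self.pc now exposes the module helpers used by the methods below,
    # e.g. vsm.pc.nltk, vsm.pc.remove_stopwords and the tag matcher vsm.pc.tg.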

Member Function Documentation

◆ _get_token_position()

def QNLP.proc.VectorSpaceModel.VSM_pc._get_token_position(self, tagged_tokens, token_type)
private
Tracks the positions at which a tagged element is found in
the tokenised corpus list; useful for comparing pairwise distances.
If the key doesn't already exist, an array with a single
position is added. Otherwise, the new token position is
appended to the existing array.

Definition at line 89 of file VectorSpaceModel.py.

 89      def _get_token_position(self, tagged_tokens, token_type):
 90          """ Tracks the positions at which a tagged element is found in
 91          the tokenised corpus list; useful for comparing pairwise distances.
 92          If the key doesn't already exist, adds an array with a
 93          single position. Otherwise, appends the new token
 94          position to the existing array.
 95          """
 96          token_dict = {}
 97          for pos, token in enumerate(tagged_tokens):
 98              if self.pc.tg.matchables(token_type, token[1]):
 99                  if token_dict.get(token[0]) is None:
100                      token_dict.update( { token[0] : np.array([pos]) } )
101                  else:
102                      token_dict.update( { token[0] : np.append(token_dict.get(token[0]), pos) } )
103          return token_dict
104  

Referenced by QNLP.proc.VectorSpaceModel.VSM_pc.tokenize_corpus().
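A hypothetical call illustrating the returned structure: a dict mapping each matching token to a NumPy array of its positions. The example tags are standard Penn Treebank tags and are assumed to satisfy pc.tg.matchables() for pc.tg.Noun:

    # Hypothetical usage; 'NN' is assumed to match the Noun tag type.
    tagged = [('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
              ('the', 'DT'), ('cat', 'NN')]
    positions = vsm._get_token_position(tagged, vsm.pc.tg.Noun)
    # positions == {'cat': array([0, 4])}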


◆ tokenize_corpus()

def QNLP.proc.VectorSpaceModel.VSM_pc.tokenize_corpus(self, corpus, proc_mode=0, stop_words=True, use_spacy=False)
Rewrite of pc.tokenize_corpus that additionally tracks the positions
of basis words in the token list, to improve later pairwise distance calculations.

Definition at line 31 of file VectorSpaceModel.py.

31      def tokenize_corpus(self, corpus, proc_mode=0, stop_words=True, use_spacy=False):
32          """
33          Rewrite of pc.tokenize_corpus that additionally tracks the positions
34          of basis words in the token list, to improve later pairwise distance calculations.
35          """
36          token_sents = []
37          token_words = [] # Individual words
38          tags = [] # Words and respective tags
39          tagged_tokens = []
40  
41          if not use_spacy:
42              token_sents = self.pc.nltk.sent_tokenize(corpus) # Split on sentences
43  
44              for s in token_sents:
45                  tk = self.pc.nltk.word_tokenize(s)
46                  if not stop_words:
47                      tk = self.pc.remove_stopwords(tk, self.pc.sw)
48                  token_words.extend(tk)
49                  tags.extend(self.pc.nltk.pos_tag(tk))
50  
51              if proc_mode != 0:
52                  if proc_mode == 's':
53                      s = self.pc.nltk.SnowballStemmer('english', ignore_stopwords = not stop_words)
54                      token_words = [s.stem(t) for t in token_words]
55                  elif proc_mode == 'l':
56                      wnl = self.pc.nltk.WordNetLemmatizer()
57                      token_words = [wnl.lemmatize(t) for t in token_words]
58  
59              tagged_tokens = self.pc.nltk.pos_tag(token_words)
60  
61              #spacy_tokenizer = English()
62          else: # using spacy
63              spacy_pos_tagger = spacy.load("en_core_web_sm")
64              #spacy_pos_tagger.max_length = 2000000 # Uses approx 1GB memory for each 100k tokens; assumes large memory pool
65              for s in spacy_pos_tagger(corpus):
66                  if not stop_words and s.is_stop:
67                      continue
68                  else:
69                      text_val = s.text
70                      if proc_mode != 0:
71                          if proc_mode == 's':
72                              raise Exception("Stemming not currently supported by spacy")
73                          elif proc_mode == 'l':
74                              text_val = s.lemma_
75  
76                      text_val = text_val.lower()
77                      token_words.append(text_val)
78                      tags.append((text_val, s.pos_))
79              tagged_tokens = tags
80  
81          nouns = self._get_token_position(tagged_tokens, self.pc.tg.Noun)
82          verbs = self._get_token_position(tagged_tokens, self.pc.tg.Verb)
83  
84          count_nouns = { k:(v.size, v) for k,v in nouns.items() }
85          count_verbs = { k:(v.size, v) for k,v in verbs.items() }
86  
87          return {'verbs':count_verbs, 'nouns':count_nouns, 'tk_sentence':token_sents, 'tk_words':token_words}
88  
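A hypothetical end-to-end call. Each entry of 'nouns' and 'verbs' maps a token to a (count, positions) pair, so the pairwise distance comparisons mentioned in _get_token_position() reduce to a broadcasting operation on the position arrays (the corpus and token names below are illustrative only):

    import numpy as np

    corpus = "The cat sat on the mat. The cat chased the dog."
    result = vsm.tokenize_corpus(corpus, proc_mode=0, stop_words=True)

    # e.g. result['nouns']['cat'] == (2, array([1, 8])) for the corpus above
    _, n_pos = result['nouns']['cat']
    _, v_pos = result['verbs']['sat']

    # Minimum pairwise distance between occurrences of the noun and the verb.
    min_dist = np.abs(n_pos[:, None] - v_pos[None, :]).min()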

References QNLP.proc.VectorSpaceModel.VSM_pc._get_token_position(), QNLP.proc.VectorSpaceModel.VSM_pc.pc, and QNLP.proc.process_corpus.remove_stopwords().


Field Documentation

◆ pc

QNLP.proc.VectorSpaceModel.VSM_pc.pc

Handle to the QNLP.proc.process_corpus module, assigned in __init__() and used by the tokenisation methods.

The documentation for this class was generated from the following file:

VectorSpaceModel.py