QNLP  v1.0
QNLP.proc.VectorSpaceModel.VSM_pc Class Reference

Public Member Functions

def __init__ (self)
 
def tokenize_corpus (self, corpus, proc_mode=0, stop_words=True, use_spacy=False)
 

Data Fields

 pc
 

Private Member Functions

def _get_token_position (self, tagged_tokens, token_type)
 

Detailed Description

Definition at line 27 of file VectorSpaceModel.py.

Constructor & Destructor Documentation

◆ __init__()

def QNLP.proc.VectorSpaceModel.VSM_pc.__init__(self)

Definition at line 28 of file VectorSpaceModel.py.

28      def __init__(self):
29          self.pc = pc
30  
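The pc assigned here is assumed to be the QNLP.proc.process_corpus module, imported at module level in VectorSpaceModel.py (the References entry for tokenize_corpus() below points at QNLP.proc.process_corpus.remove_stopwords()). A minimal sketch of that assumption:

    # Sketch only: the exact import alias used by VectorSpaceModel.py is an assumption.
    import numpy as np
    import QNLP.proc.process_corpus as pc

    vsm = VSM_pc()
    # self.pc now exposes the module helpers used by the methods below,
    # e.g. vsm.pc.nltk, vsm.pc.remove_stopwords and the tag matcher vsm.pc.tg.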

Member Function Documentation

◆ _get_token_position()

def QNLP.proc.VectorSpaceModel.VSM_pc._get_token_position(self, tagged_tokens, token_type)
private
Tracks the positions at which a tagged element is found in
the tokenised corpus list; useful for comparing pairwise distances.
If the key doesn't already exist, an array with a single
position is added. Otherwise, the new token position is
appended to the existing array.

Definition at line 89 of file VectorSpaceModel.py.

 89      def _get_token_position(self, tagged_tokens, token_type):
 90          """ Tracks the positions at which a tagged element is found in
 91          the tokenised corpus list; useful for comparing pairwise distances.
 92          If the key doesn't already exist, adds an array with a
 93          single position. Otherwise, appends the new token
 94          position to the existing array.
 95          """
 96          token_dict = {}
 97          for pos, token in enumerate(tagged_tokens):
 98              if self.pc.tg.matchables(token_type, token[1]):
 99                  if token_dict.get(token[0]) is None:
100                      token_dict.update( { token[0] : np.array([pos]) } )
101                  else:
102                      token_dict.update( { token[0] : np.append(token_dict.get(token[0]), pos) } )
103          return token_dict
104  

Referenced by QNLP.proc.VectorSpaceModel.VSM_pc.tokenize_corpus().
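A hypothetical call illustrating the returned structure: a dict mapping each matching token to a NumPy array of its positions. The example tags are standard Penn Treebank tags and are assumed to satisfy pc.tg.matchables() for pc.tg.Noun:

    # Hypothetical usage; 'NN' is assumed to match the Noun tag type.
    tagged = [('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
              ('the', 'DT'), ('cat', 'NN')]
    positions = vsm._get_token_position(tagged, vsm.pc.tg.Noun)
    # positions == {'cat': array([0, 4])}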


◆ tokenize_corpus()

def QNLP.proc.VectorSpaceModel.VSM_pc.tokenize_corpus(self, corpus, proc_mode=0, stop_words=True, use_spacy=False)
Rewrite of pc.tokenize_corpus that additionally tracks the positions
of basis words in the token list, to improve later pairwise distance calculations.

Definition at line 31 of file VectorSpaceModel.py.

31      def tokenize_corpus(self, corpus, proc_mode=0, stop_words=True, use_spacy=False):
32          """
33          Rewrite of pc.tokenize_corpus that additionally tracks the positions
34          of basis words in the token list, to improve later pairwise distance calculations.
35          """
36          token_sents = []
37          token_words = [] # Individual words
38          tags = [] # Words and respective tags
39          tagged_tokens = []
40  
41          if not use_spacy:
42              token_sents = self.pc.nltk.sent_tokenize(corpus) # Split on sentences
43  
44              for s in token_sents:
45                  tk = self.pc.nltk.word_tokenize(s)
46                  if not stop_words:
47                      tk = self.pc.remove_stopwords(tk, self.pc.sw)
48                  token_words.extend(tk)
49                  tags.extend(self.pc.nltk.pos_tag(tk))
50  
51              if proc_mode != 0:
52                  if proc_mode == 's':
53                      s = self.pc.nltk.SnowballStemmer('english', ignore_stopwords = not stop_words)
54                      token_words = [s.stem(t) for t in token_words]
55                  elif proc_mode == 'l':
56                      wnl = self.pc.nltk.WordNetLemmatizer()
57                      token_words = [wnl.lemmatize(t) for t in token_words]
58  
59              tagged_tokens = self.pc.nltk.pos_tag(token_words)
60  
61              #spacy_tokenizer = English()
62          else: # using spacy
63              spacy_pos_tagger = spacy.load("en_core_web_sm")
64              #spacy_pos_tagger.max_length = 2000000 # Uses approx 1GB memory for each 100k tokens; assumes large memory pool
65              for s in spacy_pos_tagger(corpus):
66                  if not stop_words and s.is_stop:
67                      continue
68                  else:
69                      text_val = s.text
70                      if proc_mode != 0:
71                          if proc_mode == 's':
72                              raise Exception("Stemming not currently supported by spacy")
73                          elif proc_mode == 'l':
74                              text_val = s.lemma_
75  
76                      text_val = text_val.lower()
77                      token_words.append(text_val)
78                      tags.append((text_val, s.pos_))
79              tagged_tokens = tags
80  
81          nouns = self._get_token_position(tagged_tokens, self.pc.tg.Noun)
82          verbs = self._get_token_position(tagged_tokens, self.pc.tg.Verb)
83  
84          count_nouns = { k:(v.size, v) for k,v in nouns.items() }
85          count_verbs = { k:(v.size, v) for k,v in verbs.items() }
86  
87          return {'verbs':count_verbs, 'nouns':count_nouns, 'tk_sentence':token_sents, 'tk_words':token_words}
88  
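A hypothetical end-to-end call. Each entry of 'nouns' and 'verbs' maps a token to a (count, positions) pair, so the pairwise distance comparisons mentioned in _get_token_position() reduce to a broadcasting operation on the position arrays (the corpus and token names below are illustrative only):

    import numpy as np

    corpus = "The cat sat on the mat. The cat chased the dog."
    result = vsm.tokenize_corpus(corpus, proc_mode=0, stop_words=True)

    # e.g. result['nouns']['cat'] == (2, array([1, 8])) for the corpus above
    _, n_pos = result['nouns']['cat']
    _, v_pos = result['verbs']['sat']

    # Minimum pairwise distance between occurrences of the noun and the verb.
    min_dist = np.abs(n_pos[:, None] - v_pos[None, :]).min()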

References QNLP.proc.VectorSpaceModel.VSM_pc._get_token_position(), QNLP.proc.VectorSpaceModel.VSM_pc.pc, and QNLP.proc.process_corpus.remove_stopwords().


Field Documentation

◆ pc

QNLP.proc.VectorSpaceModel.VSM_pc.pc

Handle to the QNLP.proc.process_corpus module, assigned in __init__() and used by the tokenisation methods.

The documentation for this class was generated from the following file:

VectorSpaceModel.py