My Progress


15th May

To start off, I created a Python notebook to demonstrate some basic functionality of the NLTK library and the use of WordNet synsets that come along with it.
For an input sentence
"""At eight o'clock on Thursday morning, Arthur didn't feel very good."""
the basic functionality of the NLTK Tokenizer was demonstrated.
After tokenizing the sentence and tagging the tokens with their parts of speech, named entities can be extracted.
Then I practiced with the WordNet corpus, with the help of the CorpusReader available in the NLTK library. I wanted to extract synsets from the corpus to see how they work, and to use the definition available for every lemma to create my dataset for training the definition encoder later on.

17th May

I have been looking into practical implementations of ideas related to learning concept and entity representations that go beyond word embeddings. That brought me to this paper.
It discusses two types of evaluation tasks that I will work on to get a better understanding: Analogical Reasoning and Concept Learning.
Typically, analogies take the form "a is to b as c is to _?", where a, b, and c are present in the model. Here, a similarity function can be used to answer these analogical reasoning questions.

d = arg max Sim( vec(d), vec(b) - vec(a) + vec(c) )

Here, d can be predicted by using Sim(), which is the similarity function.
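As a concrete illustration, here is a minimal, self-contained sketch of this prediction rule over a toy vocabulary (the vectors are made-up values, not real embeddings):

```python
from math import sqrt

# Toy embeddings standing in for real pre-trained vectors (hypothetical values).
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.9, 0.0],
    "woman": [0.5, 0.1, 0.8],
    "queen": [0.9, 0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def solve_analogy(a, b, c):
    """Return the word d maximizing Sim(vec(d), vec(b) - vec(a) + vec(c))."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    # Exclude the query words themselves, as is standard for this task.
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))
```

With real embeddings, the argmax runs over the whole vocabulary rather than a four-word dictionary.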

21st May

The Google Analogy Task dataset contains 19,544 questions divided into semantic analogies and syntactic analogies. The questions file contains these pairs, divided into various categories.
London England Berlin Germany
So, I will be using this file to evaluate the pre-trained embeddings on these question-answer pairs. To do this, I take the first three words, "London England Berlin", predict the fourth word, and check whether it matches the answer, "Germany". The file only contains single words, no multi-word phrases, so there is still a need for a better dataset to evaluate wiki embeddings. I am planning to extract the wiki embeddings and train the model on them by using anchor text.

24th May

I pushed the scripts to the repository along with the updated README.md and the requirements.txt.
It currently includes the scripts to train a FastText model. Usage of the scripts has been explained in the README.md file.

Pre-Training and Evaluating Analogy

The script pre-train.py takes the following arguments:
  • Files
    • Input File, -i: Clean wiki dump
    • Output File, -o: Saved model
  • Model Hyperparameters
    • Vector Size, -s: Defines the embedding size
    • Skipgram, -sg: Decides whether the model uses skipgram or CBOW
    • Loss Function, -hs: Decides whether the loss function is Hierarchical Softmax or Negative Sampling
    • Epochs, -e: Number of epochs
$ mkdir model
$ python src/pre-train.py -i data/fil9 -o model/pre_wiki -s 300 -sg 1 -hs 1 -e 5
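A minimal sketch of how the argument parsing behind these flags might look, assuming the standard argparse module (the long option names and the `build_parser` helper are illustrative, not taken from the actual script):

```python
import argparse

def build_parser():
    # Hypothetical sketch of pre-train.py's CLI, matching the flags shown above.
    parser = argparse.ArgumentParser(description="Pre-train embeddings on a wiki dump.")
    parser.add_argument("-i", "--input", required=True, help="clean wiki dump")
    parser.add_argument("-o", "--output", required=True, help="path to save the model")
    parser.add_argument("-s", "--size", type=int, default=300, help="embedding size")
    parser.add_argument("-sg", "--skipgram", type=int, default=1,
                        help="1 = skipgram, 0 = CBOW")
    parser.add_argument("-hs", "--softmax", type=int, default=1,
                        help="1 = hierarchical softmax, 0 = negative sampling")
    parser.add_argument("-e", "--epochs", type=int, default=5, help="number of epochs")
    return parser
```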

Next, we try to see how these pre-trained embeddings perform on the Google Analogy Task. For this, we have the analogy.py script.

$ python analogy.py -i data/questions-words.txt -m model/pre_wiki
Question: high is to higher as great is to?
Answer: greater
Predicted: greater
Question: glendale is to arizona as akron is to ?
Answer: ohio
Predicted: ohio
Question: ethical is to unethical as comfortable is to ?
Answer: uncomfortable
Predicted: comfortably
Question: netherlands is to dutch as brazil is to ?
Answer: brazilian
Predicted: brazilian
Question: free is to freely as happy is to ?
Answer: happily
Predicted: happily
Question: luanda is to angola as monrovia is to ?
Answer: liberia
Predicted: liberia

27th May

This week I am working on the most efficient way to use the dump, as well as to extract entities from it. The benchmark model will train on the word embeddings and the possible multi-word entity embeddings in the same way. This paper has some great insights on how to represent complex characteristics of a word, and it also provides potential evaluation tasks that can be used for our purpose.

30th May

The current task is to extract entities from the corpus, the entity descriptions, and the WordNet synsets. Synsets are nouns, verbs, adjectives, and adverbs grouped into sets of cognitive synonyms, each expressing a distinct concept, and they are linked by semantic and lexical relations.
>>> for sen in wn.all_synsets():
...     print(sen.lemmas()[0].name(), ":", sen.definition())
...     break
able : (usually followed by `to') having the necessary means or skill or know-how or authority to do something

Next up, I need to improve the Mention Extraction of the entities from the corpus. For this, I will be looking into the JSON files available. These wiki dumps seem to have all the possible surface forms of a given entity.

1st June

The JSON dump containing all Wikipedia entities was downloaded from here. The files are more than 20 GB in size and will only be used to create the dictionary of entities along with their possible surface forms in the corpus. The dump stores all the entities as a single JSON array, and the surface forms are extracted from the labels and aliases of each entry.
{
  "labels": {
    "en": {
      "language": "en",
      "value": "New York City"
    }
  },
  "aliases": {
    "en": [
      {
        "language": "en",
        "value": "NYC"
      },
      {
        "language": "en",
        "value": "New York"
      }
    ]
  }
}

Thus, for an entity like "New_York", I will create a dictionary that saves all the surface forms for this entity and then performs Mention Extraction on the corpus.
Any mention of "New York City", "New York" or "NYC" is detected and changed to "entity/New_York".
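A minimal sketch of this replacement step, assuming a small hand-written surface-form dictionary (the real one is built from the JSON dump):

```python
import re

# Hypothetical surface-form dictionary for a single entity.
surface_forms = {"New York City": "New_York", "New York": "New_York", "NYC": "New_York"}

def inject_entities(text, forms):
    """Replace every surface form with its entity tag, trying longer forms first
    so that "New York City" is not partially matched as "New York"."""
    pattern = "|".join(re.escape(s) for s in sorted(forms, key=len, reverse=True))
    return re.sub(pattern, lambda m: "entity/" + forms[m.group(0)], text)
```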

4th June

Working with surface forms introduces a new problem: a single surface form can represent different entities in different contexts, so it is difficult to simply replace an instance with one particular entity. Here is an example that explores this problem in detail:
Let us say that we are dealing with the surface form "rock", then we have the following possible entities that are present in the dictionary:
Dwayne 'The Rock' Johnson and many more.
So, this kind of ambiguity in aliases leads to a major problem in identifying the particular entity that the surface form represents. Thus, simply replacing the surface form in the corpus will lead to unpredictable results while evaluating the entity embeddings. As we can see, we need to perform a specific task here known as "Disambiguation". A single surface form could essentially represent a place, a person, a genre of music or simply a substance. The only thing we can use to distinguish between them is the context in which the surface form is present.
So, the next thing we need to look into is an NER tool.

6th June

The blog post by OpenAI explains their system for discovering types for Entity Disambiguation. This aligns with our plan, which makes use of the descriptions of entities from Wikipedia articles. I will follow the same approach as deeptype to extract entities along with the best-suited type for each surface form, depending on the context and on the description obtained from the abstract of the Wikipedia article.

8th June

Using a similar approach to last year, I am going to tag all the surface forms with their respective entities.
After the mentions have been extracted, a simple tag can be added to distinguish an entity from the rest of the corpus.

11th June

The scripts work with the labels, descriptions, and aliases in the following way:
Each surface form is stored as a key whose value is the entity it represents:
dictionary[surface form] = entity
For example, the dictionary may contain the following entries:
dictionary['NYC'] = 'New_York_City'
The surface forms are extracted from the values of JSON dictionary entry "aliases", and the entity is obtained from the value of the entry "labels".
Finally, the dictionary containing all the entities maps each entity to its description:
dictionary['New_York_City'] = 'largest city in New York and the United States of America'
The applicability of such descriptions in our model is fairly straightforward: they help identify the relations between different entities. This means that the semantics of an entity can be established as:
New_York_City instance_of City, part_of United_States. These are the key properties, along with a mapping from all Wikipedia article names to their Wikidata IDs.
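Putting the pieces together, here is a minimal sketch of how both dictionaries might be built from one Wikidata-style record (the field names follow the JSON excerpt above; everything else is illustrative):

```python
import json

# One Wikidata-style record, inlined for the sketch.
record = json.loads("""
{
  "labels": {"en": {"language": "en", "value": "New York City"}},
  "descriptions": {"en": {"language": "en",
    "value": "largest city in New York and the United States of America"}},
  "aliases": {"en": [{"language": "en", "value": "NYC"},
                     {"language": "en", "value": "New York"}]}
}
""")

surface_to_entity = {}
entity_to_description = {}

label = record["labels"]["en"]["value"]
entity = label.replace(" ", "_")        # "New York City" -> "New_York_City"
surface_to_entity[label] = entity
for alias in record.get("aliases", {}).get("en", []):
    surface_to_entity[alias["value"]] = entity
entity_to_description[entity] = record["descriptions"]["en"]["value"]
```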


15th June

I evaluated the model using the Analogy Task to establish a benchmark score and the results for the evaluation are as follows:

Questions: 19,544
Semantic: 8,869
Syntactic: 10,675

i) Model Hyperparameters ( FastText )
   Vector Size: 200
   Epochs: 3
   Skipgram using Hierarchical Softmax

Final Accuracy: 51.14%

ii) Model Hyperparameters ( FastText )
    Vector Size: 300
    Epochs: 5
    Skipgram using Hierarchical Softmax

Final Accuracy: 54.63%


iii) Model Hyperparameters ( Word2Vec )
    Vector Size: 300
    Epochs: 5
    Skipgram using Hierarchical Softmax


Final Accuracy: 55.78%



The FastText model, which uses character n-grams to generate embeddings, performs better on the syntactic questions than on the semantic country:capital questions. This makes sense, because breaking words into character n-grams does not preserve the semantic information of those words in the embeddings.
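For reference, FastText builds a word's vector from its character n-grams (by default lengths 3 to 6, plus the whole word), wrapping the word in boundary markers first. A minimal sketch for n = 3:

```python
def char_ngrams(word, n=3):
    """Character n-grams as FastText forms them: the word is wrapped in
    boundary markers '<' and '>' before the n-grams are taken."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
```

Since "high" and "higher" share the n-grams "&lt;hi", "hig", and "igh", morphologically related forms end up with similar vectors, which is what helps on the syntactic questions.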


17th June

This week, my work is focused on incorporating the deeptype system into the WikiDetector.py script. There are no pre-trained models available so I will have to train it myself. Here, the procedure followed is the same as they described in the repository.
First, the surface forms are loaded from the dictionary.


20th June

The JSON dump will be processed by the script entity.py, which works in a similar fashion to MakeDictionary.py. Along with this, I have also begun working on the evaluation task of Named Entity Recognition, in which I will use the CoNLL-2003 dataset.
For now, I will focus my work on the current pipeline. The evaluations for all the baselines will be dealt with during the later weeks. I shall progress as mentioned in my proposal, working out the model, and the hyperparameter tuning.


22nd June

The first thing needed for the current pipeline is the hash map between Wikipedia article titles and their abstracts. I will create this dictionary and store it in a JSON file. The abstracts can be obtained from this dump. The format of this XML dump is quite straightforward.

<feed>
<doc>
<title>Wikipedia: Anarchism</title>
<url>https://en.wikipedia.org/wiki/Anarchism</url>
<abstract>Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies,"ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies.</abstract>
</doc>
</feed>
Using some simple regular expressions, I will store the mapping as:
dictionary['title'] = 'abstract'
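A minimal sketch of this mapping with a simple regular expression, assuming the &lt;doc&gt; layout shown above (the feed is inlined here for illustration):

```python
import re

# A tiny stand-in for the abstracts XML dump.
feed = """<feed>
<doc>
<title>Wikipedia: Anarchism</title>
<url>https://en.wikipedia.org/wiki/Anarchism</url>
<abstract>Anarchism is a political philosophy that advocates self-governed societies.</abstract>
</doc>
</feed>"""

dictionary = {}
for title, abstract in re.findall(
        r"<title>Wikipedia: (.*?)</title>.*?<abstract>(.*?)</abstract>",
        feed, re.DOTALL):
    dictionary[title] = abstract
```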
While training the model, the target variable will be the entity embedding, "anarchism", and the input will be its abstract. After training against these pre-trained entity embeddings, the model will be able to predict entity embeddings on the fly for out-of-vocabulary entities with the help of their abstracts.
x_train contains all the vectors for abstracts.
y_train contains the vector embedding for the target entity.


23rd June

I finished training all the baselines discussed above. The current corpus contains 243,425 articles extracted using the extract.py script.
venv ❯ python src/package/extract.py ../data/enwik9 -l -o ../data/output
Once the dump was cleaned, I trained it using Word2Vec as well as FastText to compare the results. These constituted the baselines with entities injected into the corpus.


24th June

Currently, I am trying to figure out which format of descriptions will be best suited for the LSTM. The aim is to train the model so that it predicts the output entity embedding given the input embeddings of its description. The two possible dictionaries that can be used are:
-> Using the abstracts from the Wikipedia dump containing all wiki articles and their abstracts.
-> Using the same corpus in which the entities were already injected to generate this dictionary.
The second approach appears to be better in terms of preserving the semantic relations of a given entity with the other entities present in the description of the said entity.


25th June

The baselines are on the cloud. The pre-trained embeddings, the processed corpus and the descriptions extracted in both aforementioned methods have been pushed to Dropbox. I will share the link along with the information regarding each and every file.

After the XML dump has been processed and the extracted text is in the output directory, we need to further find the surface forms in the corpus and then inject entities wherever we find these surface forms. This is done with the help of three scripts: MakeDictionary.py, CheckPerson.py and WikiDetector.py
venv ❯ python src/MakeDictionary.py ../data/output/ 
venv ❯ python src/CheckPerson.py ../data/text
venv ❯ python src/WikiDetector.py ../data/output/
Next up, the scripts description.py and combine.py are used to extract all the abstracts from the dump and combine the files generated by the Wiki extractor respectively. combine.py uses the first line of the resource article as the description.
venv ❯ python src/description.py ../data/enwiki/enwiki-latest-abstract.xml ../data/abstracts.json
venv ❯ python src/combine.py ../data/output ../data/filedocs9 ../data/descriptions.json


27th June

Currently, you can have a look at the Dropbox folder wherein I have pushed the entire data directory of my project. Here is a list of all the files to get a better understanding of each one of them:
|-Corpus
| |-descriptions.json
| |-entity9
| |-fil9
| |-abstracts.json
| |-filedocs9
|
|-EntityFastText
| |-entity_fasttext_n300
|
|-EntityWord2Vec
| |-entity_w2v_n300
|
|-WordFastText
| |-wiki_text_n300
|
|-TextWord2Vec
| |-wiki_w2v_n300


29th June

The current progress regarding the pipeline that was mentioned in the original proposal can be reviewed in the open pull request. The PR is a work in progress. I am training the model with different parameters to see what works and what doesn't. The heuristic remains the same:
I first created the dictionary that contained the entity title along with its description. This description was then encoded into its embedding form which is to be used by the Definition Encoder to output a possible entity for the target.


1st July

The mappings, once loaded from the descriptions, are variable in length. However, the input sequences to the network must have the same length. To deal with this, I pad the encoded descriptions with zeros so that all of them are of equal length.
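A minimal sketch of this zero-padding step in plain Python (a stand-in for library utilities such as Keras' pad_sequences):

```python
def pad_sequences(seqs, maxlen=None, value=0):
    """Right-pad each encoded description with `value` up to a common length,
    truncating any sequence longer than `maxlen`."""
    maxlen = maxlen or max(len(s) for s in seqs)
    return [list(s)[:maxlen] + [value] * (maxlen - len(s)) for s in seqs]
```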


3rd July

As of now, after training the model with different parameters and architectures, I have found that training on the descriptions is most effective with a Recurrent Neural Network. The LSTM network works as well as the RNN, but its outputs contain resources that are more related to the n-grams and less related to the semantics of the target entity.


10th July

As of this week, we have commenced the second evaluation for the program. I have been working on cleaning up the code: it was styled, proper documentation was added wherever necessary, and all Python code now complies with the PEP 8 style guide. The README for this repo was updated, and all the changes made in the repo can still be viewed in the open PR to the main repository.
Next and final phase is to write an evaluation scheme. The problem with that is to identify the task that will help us accomplish the required evaluations of the system.


18th July

The encoder is working, and the next step is to design an evaluation scheme for the predicted embeddings of all the entities based on their abstracts. Once we have the database of all the embeddings, we can run a simple check of whether the output vector for a given entity shows any similarity to the vector that represents the class to which that entity belongs. Here is an example to show what I mean.
In this example, I have taken the entity "agatha_christie".
For this entity, the following abstract was encoded, "dame agatha  mary clarissa christie lady mallowan dame_commander_of_the_order_of_the_british_empire née miller 15 september 1890 – 12 january 1976 was an english writer she is known for her 66 detective novels and 14 short story collections particularly those revolving around her fictional detectives hercule_poirot and miss_marple christie also wrote the worlds longest-running play a murder mystery the_mousetrap and six romance_novel under the name mary westmacott in 1971 she was elevated to dame_commander_of_the_order_of_the_british_empire dbe for her contribution to literature".
Finally, when this description of the entity was provided to the model as input, an output vector was obtained. This output vector was then fed to the model containing the pre-trained embeddings trained on FastText, and the most similar vectors were printed along with the similarity score, which in this case was the cosine similarity between the two vectors.
"[('writer', 0.7674428224563599), ('wallace_stevens', 0.7610886693000793), ('tragicomedy', 0.7577036619186401), ('robert_silverberg', 0.7564967274665833), ('fredric_jameson', 0.7519606351852417), ('sangster', 0.7509233355522156), ('novelist', 0.750686526298523), ('atlas_shrugged', 0.7377546429634094), ('lester_del_rey', 0.7359971404075623), ('writer-editor', 0.7347295880317688)]".
Our approach to evaluating such embeddings is quite straightforward: a simple vanilla scoring function checks the similarity between the predicted vector for the entity and the vector representing the ontological class that contains this entity. In this example, we can see that the vector predicted for "agatha_christie" shows some similarity to the vector that represents the entity "writer", which in this case is also the class enclosing entities such as the one we predicted.
The vanilla scoring function will collect all the ontological classes for the DBpedia entities, map those entities to their respective classes, and finally evaluate the predicted embeddings based on the similarity between the two vectors. I will set a threshold that gives us the results with some confidence.
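A minimal sketch of such a vanilla scoring function, with hypothetical class memberships and vectors standing in for the DBpedia data and the trained models:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Hypothetical class membership and class vectors; real ones would come
# from the DBpedia ontology and the pre-trained embeddings.
entity_class = {"agatha_christie": "writer"}
class_vec = {"writer": [0.8, 0.6, 0.1]}

def score(entity, predicted_vec, threshold=0.7):
    """Vanilla scoring: the prediction counts as correct when the predicted
    vector is similar enough to the vector of the entity's ontological class."""
    return cosine(predicted_vec, class_vec[entity_class[entity]]) >= threshold
```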
