d2jsp Forums > Off-Topic > Computers & IT > Programming & Development > Chat Ai
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Mar 27 2016 12:31pm
Can someone help me find this corpus: http://arxiv.org/abs/1603.06807

It should vastly improve my ability to try to get my AI working, but I can't find a download link anywhere.

I found Microsoft's 1.4k corpus though, which I might download.
Member
Posts: 1,995
Joined: Jun 28 2006
Gold: 7.41
Mar 27 2016 01:21pm
From the abstract:

Quote
In this paper we present the 30M Factoid Question-Answer Corpus, an enormous question answer pair corpus produced by applying a novel neural network architecture on the knowledge base Freebase to transduce facts into natural language questions.


They are using Freebase, and the way it sets up its relationships, to apply their template-based factoids and generate question-answer pairs. That said, the paper describes their process for producing this. I don't believe the corpus exists as a download; they have made the process available for implementation. At least that is what I gathered after reading through the entire paper.

http://arxiv.org/pdf/1603.06807v1.pdf
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Mar 27 2016 01:38pm
Quote (Minkomonster @ Mar 27 2016 03:21pm)
From the abstract:



They are using Freebase, and the way it sets up its relationships, to apply their template-based factoids and generate question-answer pairs. That said, the paper describes their process for producing this. I don't believe the corpus exists as a download; they have made the process available for implementation. At least that is what I gathered after reading through the entire paper.

http://arxiv.org/pdf/1603.06807v1.pdf


Ah, from the way they worded it I thought there was a corpus available to download.

I did find this: http://research.microsoft.com/en-us/downloads/88c0021c-328a-4148-a158-a42d7331c6cf/

and this: https://github.com/deepmind/rc-data

I will be looking into both eventually, so it's not like I came out with nothing.
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Apr 6 2016 07:30pm
I opened a ticket with imgur to see if I can obtain their comment database for public use.

They said they would add it to their feature list, but there is no saying how long it will take them to get around to it, if ever.

So now I am slowly writing a spider for their API to pull comments off images. For example, here is the partial output from a random image:

Code
>But what does this offer to a casual computer guy, whos only skill is browsing imgur and porn?
>free shit!!
>This guy gets it.
>Does he get free 'shit' ?
>He gets the 'shit' without even knowing it,
>????????????
>Wtf tried to do emojis xD
>*slowly slides back down to the aquarium*




>As a British citizen, I have a legal obligation to download this without the need for it

>Adding huge black bars on top and bottom of pictures.
>You monster
>Not the gum drop buttons!


>Makes porn look better without increasing cost, one would think
>Dank meme making
>"you rang?" -thousands of us.
>-dozens of us

>Photoshopping yourself into the porn and posting to imgur
>You can use up some of that valuable disk space before it expires!
>You can put Hillary's face on anything you like.
>Noise removal on amateur pictures.
>. Same
>Photo editing, for example making color photos look great in monochrome
>well i mean you can already reach a decent chunk of internttable shit from google.. google. thats what they offer.
>Photoshop ur crush in a porno
>Good question.


As you can see, I have managed to abstract the comments into a tree-like format. For training, a comment at level N will be the input and its reply at level N+1 will be the response. For example:

Code
>"you rang?" -thousands of us.
>-dozens of us


The first comment is from the 2nd level, and the second is from the 3rd level. The former is the input that will trigger the latter as output.
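The level N to level N+1 pairing can be sketched in a few lines. This is a minimal illustration in plain Ruby, not the crawler's actual code; the `training_pairs` name and the sample tree (including the placeholder top-level comment) are assumptions, and the only thing taken from the thread is that each node has a "comment" and a "children" list.

```ruby
# Walk a comment tree depth-first and return [input, response] pairs, where
# the input is a comment at level N and the response is a direct reply at N+1.
# Top-level comments have no parent, so they only ever appear as inputs.
def training_pairs(nodes, parent = nil)
  nodes.flat_map do |node|
    own = parent ? [[parent, node["comment"]]] : []
    own + training_pairs(node["children"], node["comment"])
  end
end

# Hypothetical tree; the real nesting comes from the API response.
tree = [
  { "comment" => "(some top-level comment)",
    "children" => [
      { "comment" => "\"you rang?\" -thousands of us.",
        "children" => [
          { "comment" => "-dozens of us", "children" => [] }
        ] }
    ] },
  { "comment" => "Good question.", "children" => [] }
]

training_pairs(tree).each do |input, response|
  puts "#{input} => #{response}"
end
```

A comment with several replies simply produces several pairs sharing the same input, which is the one-to-many case.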

This will all be fed into my keyword-learning brain once I can scrape a sufficient amount of data.

Here's the source of the crawler so far. Right now it simply fetches and parses a hard-coded ID from their API:

Code
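require "http/client"
require "json"

# Print each comment prefixed with ">" and indented by its depth in the tree,
# then recurse into its replies. A blank line is printed after each level.
def printComments(hash, level)
  hash.each do |child|
    level.times { print " " }
    print ">"
    puts child["comment"]
    if !child["children"].as_a.empty?
      printComments(child["children"], level + 1)
    end
  end
  puts
end

# The client ID here is a placeholder.
headers = HTTP::Headers{"Authorization" => "Client-ID 123456"}

# Fetch the comments for one hard-coded gallery image.
response = HTTP::Client.get("https://api.imgur.com/3/gallery/hRV78Jr/comments", headers)

parsedJson = JSON.parse response.body

printComments(parsedJson["data"], 0)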


What needs to happen now is to get this data into a SQL database that I can populate and then learn from. It will likely be a simple schema of "id Integer Autoincrement, input TEXT, response TEXT" with a one-to-many mapping, and I will go row by row to learn the database. This means the recursive function will likely evolve into something like "buildDatabase(hash, parentComment)" so that I can still traverse the hash but also have access to the parent's comment and pair it with the child's comment.

This post was edited by AbDuCt on Apr 6 2016 07:31pm
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Apr 7 2016 06:56pm
More or less hacked together something that parses the imgur API for collecting comments, and then adds them to the database in input:response pairs.

I will need to write throttling code that reads their rate-limit API so that I don't get rate limited, and then I will need to figure out the best way to collect image IDs to scrape comments from. Using the front page is okay, but new images do not show up there very often. Whereas in the user-submitted section, if I parse for comments too soon there will be no comments on the image yet.
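One way the throttling could look, as a rough sketch in plain Ruby: decide how long to sleep from the rate-limit headers on the last response. The header names `X-RateLimit-UserRemaining` and `X-RateLimit-UserReset` are taken from my memory of Imgur's API documentation and should be checked against the live API; `seconds_to_wait`, `min_remaining`, and `pause` are made-up names for illustration.

```ruby
# Given the headers of the previous response (as a Hash of strings), return
# how many seconds to sleep before the next request.
# Assumptions: "X-RateLimit-UserRemaining" is the number of credits left and
# "X-RateLimit-UserReset" is a Unix timestamp for when they are restored.
# Note that a plain Hash lookup is case-sensitive, unlike real HTTP headers.
def seconds_to_wait(headers, now: Time.now.to_i, min_remaining: 10, pause: 1.0)
  remaining = headers["X-RateLimit-UserRemaining"]
  reset_at  = headers["X-RateLimit-UserReset"]

  # No rate-limit information: fall back to a fixed polite pause.
  return pause if remaining.nil? || reset_at.nil?

  if remaining.to_i <= min_remaining
    # Nearly out of credits: wait until the window resets.
    [reset_at.to_i - now, pause].max
  else
    pause
  end
end
```

In a crawl loop this would be called after every request, for example `sleep seconds_to_wait(last_headers)`, so the spider slows down before the API starts rejecting requests.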

Code
require "sqlite3"
require "http/client"
require "json"

# Walk the comment tree and store every (parent comment, reply) pair.
# Top-level comments have no parent, so they are only stored as inputs.
def collectComments(db, imageJson, parentComment = "")
  imageJson.each do |parent|
    # Recurse first so replies to this comment are paired with it.
    if !parent["children"].as_a.empty?
      collectComments(db, parent["children"], parent["comment"])
    end

    if parentComment != ""
      childComment = parent["comment"]
      db.execute("Insert Into Comments Values (?, ?)", parentComment.to_s, childComment.to_s)
    end
  end
end

db = SQLite3::Database.new "comments.db"
# "If Not Exists" lets the crawler be re-run against an existing comments.db.
db.execute "Create Table If Not Exists Comments (input TEXT, response TEXT)"
db.execute "Create Index If Not Exists CommentsIndex on Comments (input, response)"

uri = URI.parse "https://api.imgur.com/"
client = HTTP::Client.new uri
# The client ID here is a placeholder.
response = client.get("/3/gallery/hRV78Jr/comments", HTTP::Headers{"Authorization" => "Client-ID 1234565"})

parsedJson = JSON.parse response.body

collectComments(db, parsedJson["data"])




This post was edited by AbDuCt on Apr 7 2016 06:58pm
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Apr 22 2016 02:10pm
Been a while.

I scraped Imgur for comments for a while and got over 3 million input:output pairs.

I trained the AI with it and, as I figured, got nonsensical output.

I will have to tweak and redo the logic or find a new method. I thought that if I took the keywords from the input, I could find related sentences by making outputs from the input keywords. Makes sense, right? Guess not.

Unless I broke something when I patched the keyword generation methods I got. I had to fix a few bugs where it enters an infinite loop if words repeat themselves or when a sentence has only 1 keyword. It could very well be that my patches broke the thing, but unfortunately there are no unit tests for it, so I'll have to debug it later to see if it's actually functioning properly.
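For comparison, here is one simple reading of the keyword-matching idea from this post, written in plain Ruby. It is only an assumption about how such a lookup could work, not the patched code being discussed: `keywords`, `build_index`, and `best_response` are hypothetical names, and the keyword extractor is a crude stand-in. De-duplicating keywords and handling the empty and single-keyword cases explicitly sidesteps the two situations described above.

```ruby
require "set"

# Stand-in keyword extractor: lowercase words of 3+ letters, de-duplicated so
# repeated words cannot be counted (or looped over) more than once.
def keywords(text)
  text.downcase.scan(/[a-z']+/).select { |word| word.length >= 3 }.uniq
end

# Map each keyword to the ids (array positions) of the pairs whose input
# contains it. pairs is an array of [input, response] strings.
def build_index(pairs)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  pairs.each_with_index do |(input, _response), id|
    keywords(input).each { |word| index[word] << id }
  end
  index
end

# Return the response whose input shares the most keywords with the query,
# or nil when no stored input shares any keyword with it.
def best_response(pairs, index, query)
  scores = Hash.new(0)
  keywords(query).each do |word|
    index.fetch(word, []).each { |id| scores[id] += 1 }
  end
  return nil if scores.empty?

  best_id, _score = scores.max_by { |_id, score| score }
  pairs[best_id][1]
end

pairs = [
  ["That is a very cute dog", "Thanks! She really is a doll."],
  ["What a great movie", "Mind blown"]
]
index = build_index(pairs)
puts best_response(pairs, index, "cute dog dog dog").inspect
puts best_response(pairs, index, "movie").inspect
```

Because each function takes plain strings and arrays, edge cases such as a repeated word or a one-keyword sentence can be checked directly in small unit tests.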

I wonder if there are any other frameworks that learn from input:output pairings, besides prefabbed databases like AIML. I am not quite sure how I would use a neural network to learn from the pairings.
Member
Posts: 13,425
Joined: Sep 29 2007
Gold: 0.00
Warn: 20%
Apr 25 2016 06:36pm
Code
Input > Hello
Output > I never asked for this.
Input > Hello
Output > Single Female Lawyer Fighting For Her Clients Wearing Sexy Mini Skirts and Being Self-Reliant
Input > That sounds like a good movie
Output > Mind blown
Input > My mind was blown as well
Output > THANKS! She really is a doll. I hate it when someone crosses the street to avoid her. She's SO friendsly http://i.imgur.com/IFSlxIf.jpg
Input > That is a very cute dog.
Output > I like you. He was indeed human, which is what sets the characteristic in stone.
Input > Your dog was human?
Output > Ok, cool story bro.
Input > You're the one telling me the story...
Output > Same! So glad we have Floor Jansen now. =)
Input > What is a floor Jansen?
Output > This needs to become a quote
Input > No it doesn't. Tell me what a floor Jansen is.
Output > I've never seen something, nor worried about something as grim as this. I'm hoping it won't be a reality.
Input >


The data it learned from is 3.5 million scraped Imgur comments from their top images. This data set is uncleaned and the output reflects that. I think if I use Imgur's tagging function in their API I could tailor the AI to a more specific personality or knowledge set.

I revised my keywording and rolled my own. I sanitize and stem the input (removing punctuation, expanding contractions, putting all words in the same tense), then remove English stop words to cut useless noise from the text. The remaining words become input tokens that generalize what has been said, and I use those tokens to link to a response.
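A rough sketch of that kind of pipeline in plain Ruby follows. It is an illustration under assumptions, not the code described above: the `CONTRACTIONS` and `STOP_WORDS` tables are tiny made-up samples, `crude_stem` is a naive suffix stripper standing in for a real stemmer, and `input_tokens` is a hypothetical name.

```ruby
# Tiny illustrative tables; a real run would use much larger lists.
CONTRACTIONS = {
  "don't" => "do not", "doesn't" => "does not", "i'm" => "i am",
  "you're" => "you are", "it's" => "it is", "can't" => "can not"
}.freeze

STOP_WORDS = %w[
  a an and are as be do does i is it me my no not of that the to was what
  you your
].freeze

# Naive suffix stripper standing in for a real stemmer. It only approximates
# "putting all words in the same tense" and will mangle some words.
def crude_stem(word)
  stem = word.sub(/(ing|ed|es|s)\z/, "")
  stem.length >= 3 ? stem : word
end

# Lowercase, expand known contractions, drop punctuation, remove stop words,
# stem what is left, and de-duplicate.
def input_tokens(text)
  lowered  = text.downcase
  expanded = lowered.gsub(/[a-z]+'[a-z]+/) { |word| CONTRACTIONS.fetch(word, word) }
  words    = expanded.gsub(/[^a-z\s]/, " ").split
  words.reject { |word| STOP_WORDS.include?(word) }
       .map { |word| crude_stem(word) }
       .uniq
end

puts input_tokens("That sounds like a good movie").inspect
puts input_tokens("No it doesn't. Tell me what a floor Jansen is.").inspect
```

Contractions are expanded before punctuation is stripped so that the apostrophe is still there to match on; doing it the other way round would turn "doesn't" into "doesn t".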

It kind of works, but the problem I am seeing is that with bad data the brain becomes bad. This is a problem with all chat AI though so it is not like I did anything new here.


I was doing some reading on conversational neural networks, but have found no proofs of concept yet that I could modify or play with. I don't think I am well versed enough in the math to attempt to implement one myself, although if I use the Treat gem I could possibly pull it off. Treat offers normalization, matrixing, and other functions that allow NLP to be applied to neural networks.