Detect Duplicate Files In Two Different Directory? > Script
Member
Posts: 29,345
Joined: Mar 27 2008
Gold: 504.69
Feb 24 2016 02:35pm
I am in the process of combining my movie collection with someone else's.
Is there any way I can script something that goes through both directories and finds matching file names, matching on all or part of the name?

Example:

"Home Movie" should match "Home Movie CD1"
Member
Posts: 1,995
Joined: Jun 28 2006
Gold: 7.41
Feb 24 2016 04:14pm
First thing that comes to mind is PowerShell and its -like operator.
Member
Posts: 10,812
Joined: Oct 15 2009
Gold: Locked
Warn: 20%
Feb 24 2016 04:44pm
Just an example in python
Code
import difflib

movie1 = 'Home Movie'
movie2 = 'Home Movie CD1'
# ratio() returns a similarity score between 0.0 and 1.0
print(difflib.SequenceMatcher(None, movie1, movie2).ratio())

This is an 83% match (a ratio of about 0.83). The problem is that you will have to define a cutoff for what counts as an acceptable match and what does not.
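
To go from one pair of strings to two folders, something like the following could work. It is only a sketch under a few assumptions: the folders are flat (no subfolders), extensions are ignored, names are lowercased before comparing, and `find_similar`, `stems` and the 0.8 cutoff are made-up names and values for illustration.

```python
import difflib
import os


def stems(directory):
    """Return the file names in `directory` without their extensions."""
    return [
        os.path.splitext(name)[0]
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]


def find_similar(dir_a, dir_b, cutoff=0.8):
    """Yield (name_a, name_b, score) for each cross-folder pair scoring >= cutoff."""
    names_b = stems(dir_b)
    for a in stems(dir_a):
        for b in names_b:
            # Lowercase both names so that case differences do not lower the score.
            score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= cutoff:
                yield a, b, score
```

Calling `list(find_similar(folder1, folder2))` compares every name in one folder against every name in the other, so it gets slow for very large collections, and the cutoff still has to be tuned by hand.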
Member
Posts: 29,345
Joined: Mar 27 2008
Gold: 504.69
Feb 24 2016 05:13pm
Quote (Azrad @ Feb 24 2016 06:44pm)
Just an example in python
Code
import difflib

movie1 = 'Home Movie'
movie2 = 'Home Movie CD1'
# ratio() returns a similarity score between 0.0 and 1.0
print(difflib.SequenceMatcher(None, movie1, movie2).ratio())

This is an 83% match (a ratio of about 0.83). The problem is that you will have to define a cutoff for what counts as an acceptable match and what does not.


Yeah but I don't want to have to define the names of duplicates. I want to find them.
Powershell sounds like a path I should look into.
Member
Posts: 10,812
Joined: Oct 15 2009
Gold: Locked
Warn: 20%
Feb 24 2016 05:21pm
Quote (ROM @ Feb 24 2016 04:13pm)
Yeah but I don't want to have to define the names of duplicates.

Maybe what I said was confusing. This calculates a number that represents how close the two names are to each other. In the example the number was 0.83 (or 83%). You might tell the program that any pair scoring greater than 0.80 should be considered a match. However, there will be mistakes. For example:

The Thing (1982)
vs
john carpenter's the thing

is seen as a 33% match, but (to a human being) they are the same thing
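
One possible way to soften this kind of miss is to clean the titles up and compare whole words instead of raw characters. This is a rough sketch, not a fix for every case: `normalize` and `contains_score` are hypothetical helper names, and the cleanup rules (lowercasing, dropping a four-digit year in parentheses, treating punctuation as spaces) are assumptions about how the files are named.

```python
import re


def normalize(title):
    """Lowercase, drop a parenthesized year such as (1982), turn punctuation into spaces."""
    title = title.lower()
    title = re.sub(r"\(\d{4}\)", " ", title)
    title = re.sub(r"[^a-z0-9]+", " ", title)
    return title.strip()


def contains_score(a, b):
    """Fraction of the shorter title's words that also appear in the longer title."""
    words_a = set(normalize(a).split())
    words_b = set(normalize(b).split())
    shorter, longer = sorted((words_a, words_b), key=len)
    if not shorter:
        return 0.0
    return len(shorter & longer) / len(shorter)


print(contains_score("The Thing (1982)", "john carpenter's the thing"))
```

With these rules every word of the shorter title is found in the longer one, so the pair is flagged. The trade-off is more false positives: a short title such as "The Thing" will also match any longer title that happens to contain both words.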


Member
Posts: 161,550
Joined: Oct 18 2006
Gold: 4.03
Warn: 20%
Mar 2 2016 01:06am
Quote (ROM @ Feb 24 2016 04:13pm)
Yeah but I don't want to have to define the names of duplicates. I want to find them.
Powershell sounds like a path I should look into.


how does your code "know"?

you know they're the same thing

Quote (Azrad @ Feb 24 2016 04:21pm)
Maybe what I said was confusing. This calculates a number that represents how close the two names are to each other. In the example the number was 0.83 (or 83%). You might tell the program that any pair scoring greater than 0.80 should be considered a match. However, there will be mistakes. For example:

The Thing (1982)
vs
john carpenter's the thing

is seen as a 33% match, but (to a human being) they are the same thing


like this for example, because of years of history in the subject
does your code magically know?
Member
Posts: 1,158
Joined: Oct 5 2010
Gold: 0.00
Mar 6 2016 05:07am
Total Commander should be able to do this for you. The trial version is free.
Member
Posts: 12,786
Joined: May 17 2013
Gold: 4,010.00
Mar 6 2016 06:25am
You could generate MD5 hashes for each movie you are checking and compare those. If two files are 100% identical, their hashes will match and you can just keep one of them.
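
A minimal sketch of that idea, assuming flat folders and using Python's standard `hashlib`. The names `md5_of` and `identical_files` are made up for illustration. Files are read in chunks so large movies do not have to fit in memory.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identical_files(dir_a, dir_b):
    """Return (path_a, path_b) pairs whose contents hash to the same MD5."""
    seen = {}
    for name in os.listdir(dir_a):
        path = os.path.join(dir_a, name)
        if os.path.isfile(path):
            seen.setdefault(md5_of(path), []).append(path)
    pairs = []
    for name in os.listdir(dir_b):
        path = os.path.join(dir_b, name)
        if os.path.isfile(path):
            for match in seen.get(md5_of(path), []):
                pairs.append((match, path))
    return pairs
```

This only finds byte-for-byte copies, regardless of file name. Two different rips of the same movie will have different hashes.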
Member
Posts: 62,215
Joined: Jun 3 2007
Gold: 9,039.20
Mar 6 2016 11:42am
Quote (Klexmoo @ Mar 6 2016 06:25am)
You could generate md5 hashes for each movie you are checking and compare those. If the files are 100% identical, they will match and you can just keep one of them.


Lol, that would only catch byte-for-byte duplicate files, which is pretty useless if you know anything about how movies are ripped. What he most likely wants is for movie titles to be matched for similarity.

Azrad posted the solution already
Member
Posts: 10,812
Joined: Oct 15 2009
Gold: Locked
Warn: 20%
Mar 6 2016 06:01pm
Another idea is to have the program collect the names of potential matches as before (with, say, a 50% cutoff), then display these to a human and let them make the final decision.
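
Roughly, that could look like the sketch below. The function names `candidate_pairs` and `review` and the y/N prompt are assumptions for illustration. The `ask` parameter defaults to the built-in `input` so the prompt can be swapped out.

```python
import difflib


def candidate_pairs(names_a, names_b, cutoff=0.5):
    """Return (score, name_a, name_b) tuples at or above the cutoff, best matches first."""
    pairs = []
    for a in names_a:
        for b in names_b:
            score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= cutoff:
                pairs.append((score, a, b))
    return sorted(pairs, reverse=True)


def review(pairs, ask=input):
    """Show each candidate pair to a person and keep the ones they confirm."""
    confirmed = []
    for score, a, b in pairs:
        answer = ask(f"{a!r} vs {b!r} ({score:.0%}) same movie? [y/N] ")
        if answer.strip().lower().startswith("y"):
            confirmed.append((a, b))
    return confirmed
```

A low cutoff means more pairs to click through, but fewer real duplicates slip past unnoticed.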