Now Playing
AVForums Super Veteran
For a variety of reasons I've found that over the years as my tagging, filing and backup conventions have changed I've sometimes ended up duplicating albums in my collection, principally because of changes to folder names and locations. I had a little time to do some deduping and thought I'd share my methods, 1) so it's there for me to reference when I need to, and 2) so others could benefit.
What I'm sharing basically culminates in a table listing of folders containing the same number of FLAC files with identical audio content to one or more of the other listed folders. This then leaves you to manually delete the FLAC content in the duplicated album should you wish to (for obvious reasons I'm not automating this part, feel free to do so yourself).
To get the job done I've used some simple terminal commands (Linux based, so Windows users will have to find the Windows equivalent) coupled with a little SQLite scripting.
Here goes...
1. Firstly, from the root folder of your music collection generate a text file produced from listing files, their path and FLAC md5sum using bash terminal:
2. Add the necessary delimiters to be able to import the file into a SQLite table
3. Now create a database in SQLite, add a table "audio" having text fields __path, __md5sig, __dirpath and __filename and import the csv into the audio table
4. Run the necessary SQL scripts:
The table __dirpaths_with_same_content will list all folders in your collection whose FLAC audio content is identical to one or more folders in your collection. The file is sorted so that folders with identical FLAC audio content are grouped together so as to ease navigating your file system.
The table __dirpaths_with_FLACs_to_kill represents a subset of __dirpaths_with_same_content. Deleting the FLAC content in the corresponding folders will eliminate the duplication. It goes without saying in choosing duplicates to add to the killfile no regard has been had to tagging etc?use contents of __dirpaths_with_FLACs_to_kill at your own risk.
IMPORTANT NOTE: Please do not post questions / concerns regarding FLAC files having different filenames, being housed in different folder names or having different file sizes and thus having incorrect results. FLAC files produced by a FLAC specification compliant FLAC encoder have an internal md5sum of the audio stream - this md5sum remains the same regardless of the level of compression applied and is independent of any metadata present in the FLAC header.
What I'm sharing basically culminates in a table listing of folders containing the same number of FLAC files with identical audio content to one or more of the other listed folders. This then leaves you to manually delete the FLAC content in the duplicated album should you wish to (for obvious reasons I'm not automating this part, feel free to do so yourself).
To get the job done I've used some simple terminal commands (Linux based, so Windows users will have to find the Windows equivalent) coupled with a little SQLite scripting.
Here goes...
1. Firstly, from the root folder of your music collection generate a text file produced from listing files, their path and FLAC md5sum using bash terminal:
Code:
find -type f -name \*.flac -print0 | xargs -0 -n1 metaflac --with-filename --show-md5sum > md5sums.txt
2. Add the necessary delimiters to be able to import the file into a SQLite table
Code:
echo __path:__md5sig:__filename:__dirpath > import.csv && sed ':a;N;$!ba;s/\n/::\n/g' md5sums.txt >> import.csv
3. Now create a database in SQLite, add a table "audio" having text fields __path, __md5sig, __dirpath and __filename and import the csv into the audio table
Code:
DROP TABLE IF EXISTS audio;
DROP TABLE IF EXISTS __dirpath_content_concat__md5sig;
DROP TABLE IF EXISTS __dirpaths_with_same_content;
DROP TABLE IF EXISTS __dirpaths_with_FLACs_to_kill;
CREATE TABLE audio (
__path TEXT UNIQUE,
__md5sig TEXT,
__filename TEXT,
__dirpath TEXT
);
4. Run the necessary SQL scripts:
Code:
-- derive filename from the full file path
UPDATE audio
SET __filename = [replace](__path, rtrim(__path, [replace](__path, '/', '') ), '');
Code:
-- derive __dirpath from the full file path
UPDATE audio
SET __dirpath = substr(__path, 1, length(__path) - length([replace](__path, rtrim(__path, [replace](__path, '/', '') ), '') ) );
Code:
-- create table in which to store concatenated __md5sig for all __dirnames in duplicates table
DROP TABLE IF EXISTS __dirpath_content_concat__md5sig;
CREATE TABLE __dirpath_content_concat__md5sig (
__dirpath TEXT,
concat__md5sig TEXT
);
Code:
-- populate table with __dirpath and concatenated __md5sig of all files associated with __dirpath (note order by __md5sig to ensure concatenated __md5sig is consistently generated irrespective of physical record sequence).
INSERT INTO __dirpath_content_concat__md5sig (
__dirpath,
concat__md5sig
)
SELECT __dirpath,
group_concat(__md5sig, '|')
FROM audio
GROUP BY __dirpath
ORDER BY __md5sig;
Code:
--now write the duplicate records into a separate table listing all __dirname's that have identical FLAC contents
DROP TABLE IF EXISTS __dirpaths_with_same_content;
CREATE TABLE __dirpaths_with_same_content (
__dirpath TEXT,
concat__md5sig TEXT
);
Code:
INSERT INTO __dirpaths_with_same_content (
__dirpath,
concat__md5sig
)
SELECT __dirpath,
concat__md5sig
FROM __dirpath_content_concat__md5sig
WHERE concat__md5sig IN (
SELECT concat__md5sig
FROM __dirpath_content_concat__md5sig
GROUP BY concat__md5sig
HAVING count( * ) > 1
)
ORDER BY concat__md5sig, __dirpath;
Code:
-- populate table listing directories in which FLAC files should be deleted as they're duplicates
DROP TABLE IF EXISTS __dirpaths_with_FLACs_to_kill;
CREATE TABLE __dirpaths_with_FLACs_to_kill (
__dirpath TEXT,
concat__md5sig TEXT
);
INSERT INTO __dirpaths_with_FLACs_to_kill (
__dirpath,
concat__md5sig
)
SELECT __dirpath,
concat__md5sig
FROM __dirpaths_with_same_content
WHERE rowid NOT IN (
SELECT min(rowid)
FROM __dirpaths_with_same_content
GROUP BY concat__md5sig
);
The table __dirpaths_with_same_content will list all folders in your collection whose FLAC audio content is identical to one or more folders in your collection. The file is sorted so that folders with identical FLAC audio content are grouped together so as to ease navigating your file system.
The table __dirpaths_with_FLACs_to_kill represents a subset of __dirpaths_with_same_content. Deleting the FLAC content in the corresponding folders will eliminate the duplication. It goes without saying in choosing duplicates to add to the killfile no regard has been had to tagging etc?use contents of __dirpaths_with_FLACs_to_kill at your own risk.
IMPORTANT NOTE: Please do not post questions / concerns regarding FLAC files having different filenames, being housed in different folder names or having different file sizes and thus having incorrect results. FLAC files produced by a FLAC specification compliant FLAC encoder have an internal md5sum of the audio stream - this md5sum remains the same regardless of the level of compression applied and is independent of any metadata present in the FLAC header.