You are here:

C++/Checking for Duplicates


Hello Ralph,
I have a program that loads file names and their respectives paths into memory on a per directory basis.  If the user selects another directory, the files in that directory get appended to the list.  However, if the user reselects a directory they've already loaded, then I end up with duplicate filenames, which isn't a huge problem persay, but it just looks bad.  I know the brute force way to check for duplicates, but I'm trying to avoid lots of string comparisons.  Is there a faster algorithm for checking for duplicate entries in a list?  

As far as I know there is no faster way if the collection is unsorted. Also, even if the collection is sorted then the best results will be obtained if the collection type has the capability for random access to existing elements - such as an array or vector. With elements in a list - as in a linked list - you cannot access element n without accessing elements first ... n-1 or last ... n + 1 first.

So assuming you have a sorted collection then you can use algorithms such as a binary chop (or binary search) to more quickly search for a match. In your case you are in fact searching for a non-match, which of course is always the worst case in terms of operations performed.

To perform a binary chop you first compare the value you a looking for with the middle element in the collection. If there is no match then the value is either less or greater than the middle element. If less you compare against the element at the quarter point, and if greater you compare against the element at the three quarter point. You achieve this by halving the size of the 'stride' (initially set to half the number of elements in the collection) and subtracting or adding this value to the position of the current element. You repeat this procedure until the stride size reaches zero. You might be interested to know that the C++ standard library, apart from having various collection types such as std::list and std::vector, also has many algorithms for operating on sequences within these collections, such as std::find (for finding values in unordered sequences), and std::binary_search (for finding values in ordered collections), and of course the likes of std::sort. Note that some collection types have in-built support for some algorithms - such as std::list::sort as they can do a better job with additional internal knowledge for the collection type in question.

Of course you will now be wondering if it will be worth it as you have to keep the collection sorted all the time, which will involve more work! An alternative approach would be to use an associative container type that will not allow the insertion of duplicates and is also efficient at inserting new items. One such obvious candidate for this task is the C++ standard library type std::set, which keeps the collection sorted and does not allow duplicates to be inserted.

If you are unfamiliar with the C++ standard library's facilities I suggest you spend time learning about it and that you obtain a good reference such as "The C++ Standard Library A Tutorial and Reference" by Nicolai M. Josuttis (I use my copy all the time!). There is some online documentation here on the containers and algorithms etc.:

Oh, and if you are doing awkward file system stuff then you might like to look at the Boost file system library. See:, and

One final thought: A linear search through a vector (1D array) of items consisting of a single contiguous chunk of memory can be quicker than a more direct search using a node-based collection such as a set consisting of many chunks of memory throughout the address range if the vector fits wholly in the cache of the processor!

Hope this is of help.


All Answers

Answers by Expert:

Ask Experts


Ralph McArdell


I am a software developer with more than 15 years C++ experience and over 25 years experience developing a wide variety of applications for Windows NT/2000/XP, UNIX, Linux and other platforms. I can help with basic to advanced C++, C (although I do not write just-C much if at all these days so maybe ask in the C section about purely C matters), software development and many platform specific and system development problems.


My career started in the mid 1980s working as a batch process operator for the now defunct Inner London Education Authority, working on Prime mini computers. I then moved into the role of Programmer / Analyst, also on the Primes, then into technical support and finally into the micro computing section, using a variety of 16 and 8 bit machines. Following the demise of the ILEA I worked for a small company, now gone, called Hodos. I worked on a part task train simulator using C and the Intel DVI (Digital Video Interactive) - the hardware based predecessor to Indeo. Other projects included a CGI based train simulator (different goals to the first), and various other projects in C and Visual Basic (er, version 1 that is). When Hodos went into receivership I went freelance and finally managed to start working in C++. I initially had contracts working on train simulators (surprise) and multimedia - I worked on many of the Dorling Kindersley CD-ROM titles and wrote the screensaver games for the Wallace and Gromit Cracking Animator CD. My more recent contracts have been more traditionally IT based, working predominately in C++ on MS Windows NT, 2000. XP, Linux and UN*X. These projects have had wide ranging additional skill sets including system analysis and design, databases and SQL in various guises, C#, client server and remoting, cross porting applications between platforms and various client development processes. I have an interest in the development of the C++ core language and libraries and try to keep up with at least some of the papers on the ISO C++ Standard Committee site at


©2016 All rights reserved.