Introducing dedupe-csv: a command-line tool to bulk remove duplicates

As data continues to grow exponentially, SEO specialists are having to process and analyse datasets of ever-increasing proportions.

Large files that contain hundreds of thousands (or millions!) of rows can make even basic tasks, such as removing duplicate values, extremely challenging. Spreadsheet software, which uses a GUI, simply isn’t built to handle data on this scale – and it isn’t always possible to prefilter data when exporting.

Thankfully, programmatic solutions are available to help, but these usually require knowledge of a programming language. The command line offers an alternative approach, but the syntax is far from intuitive, and the features are more limited.

In the absence of a perfect solution, we took it upon ourselves to create our own: dedupe-csv.


What is the dedupe-csv tool?

Available on NPM, dedupe-csv is a command-line tool that reads a CSV file, scans for duplicates and exports the unique entries to a new CSV.

Based on pandas’ drop_duplicates function, it provides a rich set of features with intuitive syntax and the convenience of a command-line utility.

How to use dedupe-csv

As with our last NPM module, Pivot-Table-JS, make sure you’ve installed Node.js and then add the dedupe-csv package:

npm i -g dedupe-csv

Once installed, the tool can be used from the terminal of your choosing: Command Prompt, PowerShell or the Node.js command prompt on Windows; iTerm2 or Terminal on macOS/Linux.

Just refer to the package name, followed by the CSV file you wish to deduplicate:

dedupe-csv file='data.csv'

It’s important to note that the file parameter is resolved relative to your terminal’s current working directory. For example, if your terminal is open in Documents and your CSV file is in a subdirectory called project, you’ll need to include the subdirectory in the path:

dedupe-csv file='project/data.csv'

The command will read the file, look for identical rows and remove the duplicates, keeping the first occurrence of each.

When finished, it will write your results to a new file (data_deduped.csv for our data.csv input), leaving the original file untouched.
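
If you’re comfortable with Python, this default behaviour maps directly onto pandas’ drop_duplicates, which the tool is based on. As a rough sketch (illustrative only, not the tool’s actual source):

import pandas as pd

df = pd.read_csv('data.csv')
# pandas defaults to keep='first', so the first occurrence of each duplicate row survives
df.drop_duplicates().to_csv('data_deduped.csv', index=False)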

Other options

Remove duplicates by column

If you want to fine-tune the way the tool finds duplicates, you can use the column option to restrict the comparison to a specific column (or columns), identified by header.

Here, we have a CSV called data.csv in which the brand column contains two duplicated values, the style column is identical on every row, and rating has no duplicates.

brand,style,rating
Yum,cup,1
Yum,cup,2
Foo,cup,6
Foo,cup,8

If we want to check for duplicates within the brand column only, we can run this command:

dedupe-csv file='data.csv' column='brand'

This will produce a file called data_deduped.csv containing:

brand,style,rating
Yum,cup,1
Foo,cup,6
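
In pandas terms, the column option corresponds to the subset parameter of drop_duplicates; a rough Python equivalent would be:

import pandas as pd

df = pd.read_csv('data.csv')
# compare rows on the brand column only; other columns are ignored when matching
df.drop_duplicates(subset=['brand']).to_csv('data_deduped.csv', index=False)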

To evaluate duplicates across a combination of columns, add as many headers as you need, separated by commas.

So, given this data.csv:

brand,style,rating
Yum,cup,1
Yum,cup,2
Foo,cup,3
Foo,pack,1
Foo,pack,5

You can enter:

dedupe-csv file='data.csv' column='brand, style'

This will return:

brand,style,rating
Yum,cup,1
Foo,cup,3
Foo,pack,1
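
The pandas analogue simply extends the subset list (again, an illustrative sketch):

import pandas as pd

df = pd.read_csv('data.csv')
# a row is a duplicate only when brand and style both match an earlier row
df.drop_duplicates(subset=['brand', 'style']).to_csv('data_deduped.csv', index=False)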

Remove duplicates but keep last value

In all of the examples so far, we’ve kept the first entry in each set of duplicates. If we want to keep the last entry instead, we can specify the keep option.

For example, given this data.csv:

brand,style,rating
Yum,cup,1
Yum,cup,2
Foo,pack,1
Foo,pack,5

running this command:

dedupe-csv file='data.csv' column='brand' keep='last'

will return a data_deduped.csv containing:

brand,style,rating
Yum,cup,2
Foo,pack,5
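
This mirrors the keep parameter of pandas’ drop_duplicates; roughly:

import pandas as pd

df = pd.read_csv('data.csv')
# keep='last' retains the final occurrence of each duplicate rather than the first
df.drop_duplicates(subset=['brand'], keep='last').to_csv('data_deduped.csv', index=False)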

It’s my hope that the SEO and wider digital marketing community find this module a useful addition to their toolkit.

If you have any questions about using the module or want to learn more about how Builtvisible build custom solutions to help businesses maximise their marketing spend, please get in touch.
