Kirix Support Forums

filter duplicates

Postby no1donjuan on Fri Oct 23, 2009 8:01 am

Ken/Ben

I am just getting my head around the program. I have a large list of emails in one column and first names in another, and I want to filter out any duplicate emails.

Is there a simple way of doing this task?

thx

tim
no1donjuan
Registered User
 
Posts: 1
Joined: Fri Oct 23, 2009 7:57 am

Re: filter duplicates

Postby Ken on Fri Oct 23, 2009 9:25 am

Hi Tim,

If it is just a two-column table, I'd probably just do the following:

1) Select the grouping tool (sigma icon or Data > Groups > Group Records).
2) Drag in your "Email" field so that it says "Group By" Email.
3) Drag in a <count> field.
4) Click OK.
5) In the resulting table, sort Descending on the Count field.
6) Now you'll see which emails are duplicated in your list. Tile the two tables vertically so you can see both at once, then use the quick filter on your main table to find the email addresses you know are duplicated. You can then delete the ones you don't want.
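
For anyone who wants to see the logic of the grouping steps outside of Strata, here is a rough Python sketch. It is not Strata's API or scripting language; the function name `duplicate_counts` and the sample rows are made up for illustration, and it assumes the data is already loaded as (email, firstname) pairs. It counts rows per email, keeps the emails that appear more than once, and sorts them by count, descending.

```python
from collections import Counter


def duplicate_counts(rows):
    """Return (email, count) pairs for emails seen more than once,
    sorted with the highest count first (hypothetical helper)."""
    # "Group By" email and add a count, as in steps 2 and 3.
    counts = Counter(email for email, _firstname in rows)
    # Keep only the emails that are actually duplicated.
    dupes = [(email, n) for email, n in counts.items() if n > 1]
    # Sort descending on the count, as in step 5.
    dupes.sort(key=lambda pair: pair[1], reverse=True)
    return dupes


# Made-up sample data: (email, firstname)
rows = [
    ("tim@example.com", "Tim"),
    ("ann@example.com", "Ann"),
    ("tim@example.com", "Timothy"),
    ("bob@example.com", "Bob"),
    ("ann@example.com", "Ann"),
    ("tim@example.com", "Tim"),
]

print(duplicate_counts(rows))
# [('tim@example.com', 3), ('ann@example.com', 2)]
```

Note that this compares emails exactly as typed, so "Tim@Example.com" and "tim@example.com" would count as different addresses unless you normalize them first.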

A longer version of this is here: http://www.kirix.com/stratablog/removin ... -form-data

Hope that helps,
ken
Ken Kaczmarek
Kirix Support Team
Ken
Kirix Support Team
 
Posts: 147
Joined: Mon Dec 19, 2005 10:36 am

Re: filter duplicates

Postby Andrew S on Sat Jun 26, 2010 2:00 pm

Is there a way to automate the process? I have 20 million records of which 5 million are duplicate entries. Is there a way to have the program remove the duplicates so that I am not filtering and deleting them manually?
Andrew S
Registered User
 
Posts: 1
Joined: Sat Jun 26, 2010 1:05 pm

Re: filter duplicates

Postby Ken on Mon Jun 28, 2010 11:34 am

Hi Andrew,

If you're only looking for distinct email addresses, I'd do the following:

1. File > New > Query
2. Drag in your 20 million record table from the project tree into the upper section of the query dialog.
3. Select the fields you want in the output table, by highlighting them in the table and dragging them down to the bottom section of the query.
4. Find your "email" field, and then, in the Function section, select "Group By".

Your resulting table will show you all the fields you selected, grouped by the email field (so you should end up with about 15 million records). See this help page for more info about the query builder: http://www.kirix.com/help/docs/creating_queries.htm

Please note that this query selects the first record it finds for each email address and brings that record's other fields along to the output table. So, if a later record for the same email has different data in it (say, the name field is "Jon" instead of "John" as in the first record), you won't see it; the query doesn't combine any of the other data fields.
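
To illustrate the "first record wins" behavior, here is a small Python sketch of the same idea. It is only an approximation of what the query does, not Strata code; the function name `keep_first_per_email` and the sample rows are hypothetical, and it assumes exact, case-sensitive matching on the email value.

```python
def keep_first_per_email(rows):
    """Keep the first (email, firstname) row seen for each email
    and drop every later row with the same email (hypothetical helper)."""
    seen = set()
    unique = []
    for email, firstname in rows:
        if email not in seen:
            seen.add(email)
            unique.append((email, firstname))
    return unique


# Made-up sample data: the second John row spells the name differently.
rows = [
    ("john@example.com", "John"),
    ("amy@example.com", "Amy"),
    ("john@example.com", "Jon"),
]

print(keep_first_per_email(rows))
# [('john@example.com', 'John'), ('amy@example.com', 'Amy')]
```

The "Jon" row is dropped without any merging, which is the caveat described above. If you need the other fields reconciled, that has to be done before or after the de-duplication step.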

Best,
ken
Ken Kaczmarek
Kirix Support Team
