I suppose it's possible to handle this in bash, but I figured Perl would be easier.
And since I don't know Perl, I had to teach myself as I went.
First, we have to prepare our domains. Assuming one domain per line, I sort the file by its reversed strings, so my incoming domain text file:
Code: Select all
efa-project.com
test.efa-project.org
efa-project.org
demo.efa-project.org
my.efa-project.com
a.org
b.org
a.a.org
will look like this:
Code: Select all
a.org
a.a.org
b.org
efa-project.org
demo.efa-project.org
test.efa-project.org
efa-project.com
my.efa-project.com
meaning all my subdomains appear immediately after my top domain
I accomplish this with the following shell command:
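For reference, here is the sort step on its own, the same rev|sort|rev pipeline used in full further down. The sample file is created inline with printf purely so the snippet stands alone:

```shell
# write the sample domain list from above, one domain per line
printf '%s\n' efa-project.com test.efa-project.org efa-project.org \
    demo.efa-project.org my.efa-project.com a.org b.org a.a.org > domains-in.txt

# reverse each line, sort, reverse back: subdomains now sort
# immediately after their parent domain
rev domains-in.txt | sort | rev
```

rev comes from util-linux and should be available on most Linux systems.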
Now the algorithm is simple. Assuming the file is ordered as above, I grab the first line, then compare it with the next line. If the next line repeats the previous kept domain, or ends with a dot followed by it (i.e., it is a subdomain of it), I exclude it; otherwise I keep it and it becomes my new "previous line". Repeat until end of file. (Checking for the leading dot matters: a plain substring test would also throw away an unrelated domain like xa.org just because it contains a.org.) Like so:
Code: Select all
#!/usr/bin/perl
use strict;
use warnings;

my $pline = <>;
exit 0 unless defined $pline;    # nothing to do on empty input
chomp $pline;
print "$pline\n";

while ( my $nline = <> ) {
    chomp $nline;
    # keep the line unless it repeats the previous kept domain or is a
    # subdomain of it; \Q...\E stops the dots in the domain from acting
    # as regex wildcards
    if ( $nline ne $pline && $nline !~ /\.\Q$pline\E$/ ) {
        print "$nline\n";
        $pline = $nline;
    }
    # otherwise skip it and move on
}
Make the script executable with chmod u+x process.pl, then run it like so:
Code: Select all
$ cat domains-in.txt | rev | sort | rev | ./process.pl
a.org
b.org
efa-project.org
efa-project.com
Will this work in all cases? I don't know; I haven't tested it fully. But it's a start.
Is this the best solution? No idea. You could use awk. It's possible to do it with sed, though sed's syntax is awkward for this kind of thing, and bash (v4 and greater?) is an option since it has decent string handling. However, this seems to work, so you can take it from there.
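To illustrate the awk route just mentioned, here's a rough sketch of the same idea as process.pl, with the sample data recreated inline so the snippet stands alone. Treat it as a starting point, not a tested drop-in replacement:

```shell
# sample input, same domains as above
printf '%s\n' efa-project.com test.efa-project.org efa-project.org \
    demo.efa-project.org my.efa-project.com a.org b.org a.a.org > domains-in.txt

# same reversed sort, then keep a line only if it is not the previous
# kept domain and does not end in "." followed by it; p starts empty,
# so the first line is always kept
rev domains-in.txt | sort | rev |
    awk '$0 != p && substr($0, length($0) - length(p)) != "." p { print; p = $0 }'
```

This prints the same four top domains as the Perl version on the sample data.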
Good luck.