The other day, I happened to use one of my classes that manages preferences (from a property file). Pretty common, I know. So common that Eclipse popped up the completion window with 11 (eleven) possible classes with that very same name. The class is named Node (because it builds a tree, and I severely lack imagination).
Wait. ELEVEN just in the path of a single project: I may not be the only one with imagination issues. How many Node classes are there in the world?
Pretty much impossible to know, but I figured that checking the whole Maven repository might give a rough idea. So I FTP'ed the whole maven2 central repository to grab the jars (that's for SCIENCE, guys!!), and wrote a script.
156, that's the number of classes that are called Node. But Node is also used as a morpheme in 4856 classes, out of the 340K in the whole maven repository. In other words, 1 out of 70 classes contains the morpheme Node. WOW. I'm feeling less lonely. Thanks, brothers!
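To make "morpheme" concrete, here is a minimal sketch of the idea: split each CamelCase class name at its uppercase letters and count the pieces. The sample names are made up for illustration, and the split is simplified to treat only A-Z as a boundary, so take it as a toy rather than the exact counting rule.

```shell
#!/bin/bash
# Split CamelCase class names into morphemes, one per output line.
# Simplification (an assumption of this sketch): only A-Z starts a new morpheme.
split_morphemes() {
    LC_ALL=C awk '{ gsub(/[A-Z]/, "\n&"); sub(/^\n/, ""); print }'
}

# Hypothetical sample names, not taken from the repository data.
printf '%s\n' Node TreeNode NodeImpl XmlNodeList |
    split_morphemes | sort | uniq -c | sort -rn
```

On this sample, Node comes out on top with a count of 4, since it appears once in each of the four names.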
So I extended the script to get the full figures, and created the Top 100 Java Morphemes with it.
So 1 class out of 22 contains the morpheme Impl. Haha, so much for those Java-bashers who complain that we over-engineer with interfaces and abstract classes: this is factual proof that Java coders can also implement real classes, at least 4.43% of the time! Yay!
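The "1 out of N" figures are easy to sanity-check. A small sketch, assuming the rounded total of 340K classes quoted above (the helper names one_in and pct_to_one_in are made up for this example):

```shell
#!/bin/bash
# "1 out of N": total number of classes divided by the number of matches.
one_in() {
    LC_ALL=C awk -v total="$1" -v hits="$2" 'BEGIN { printf "%.1f\n", total / hits }'
}

# Convert a percentage into the equivalent "1 out of N".
pct_to_one_in() {
    LC_ALL=C awk -v pct="$1" 'BEGIN { printf "%.1f\n", 100 / pct }'
}

one_in 340000 4856    # Node morpheme; 340000 is the rounded total
pct_to_one_in 4.43    # Impl morpheme
```

This prints 70.0 and 22.6, so "1 out of 70" holds for Node, and 4.43% is a shade better than 1 in 23, rounded here to 22.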
The same Top 100 in a Web Cloud:
Think it looks familiar? Of course, that's what most of your code looks like. Now let's check the classnames, are we doing better?
The Token morpheme may not be the most used morpheme to create a classname (only #86), but it's the most common classname. We sure like parsing.
My smartest readers will probably notice some salient data
Regarding *Util* classes, the chart comes out as expected: if you're intending to create a StringUtil or FileUtil class, chances are there are already some out there that provide what you need, so don't miss the chance to duplicate one and create your own!
Here's the script used to calculate those numbers. Note that extra care was taken to avoid counting duplicated classnames from the same project (the same fully-qualified classname in different jars, or re-jarred copies) and duplicates inside the same project: some developers create myproject.v1.SomeClass and myproject.v2.SomeClass, which are just copypasta and don't qualify as two distinct classes.
#!/bin/bash
# Where the jars are located
JARFOLDER=ftp
if [[ "clean" == "$1" ]]
then
    echo "Cleaning workfiles..."
    rm -f jars.lst allclasses.lst sorted.lst
    exit 0
fi
if [ ! -f jars.lst ]
then
    echo "Creating jar list..."
    find "$JARFOLDER" -type f | grep '\.jar$' >jars.lst
else
    echo "Jar list already exists, skipping"
fi
CURRENT=1
TOTAL=$(wc -l <jars.lst)
echo "found" $TOTAL "jar files"
#
# Extract the list of classes from the jars
# (skipping META-INF entries, schema.system.* and nested '$' classes)
# rm -f allclasses.lst
if [ ! -f allclasses.lst ]
then
    while read -r LINE ; do
        zipinfo -1 "$LINE" | grep -v '^META-INF/' | grep '\.class$' | grep -v '^schema.system.' | grep -v '\$' | cut -d. -f1 | tr / . | sort | uniq >>allclasses.lst
        echo -n -e "done " $CURRENT "/" $TOTAL \\r
        CURRENT=$((CURRENT+1))
    done <jars.lst
else
    echo allclasses.lst already exists, skipping.
fi
#
# Each line is rewritten as "f.q.d ClassName" (the first 3 package elements
# plus the classname), the lines are sorted and uniquified, and then only
# the classname is output.
# This prevents classnames duplicated within the same project from being
# counted several times, but keeps them as duplicates if they appear to come
# from different projects (which is the case if the first 3 elements of the
# full classname are distinct).
#
# For instance, we don't want project.v1.SomeClass and project.v2.SomeClass
# to be counted as 2, because they're from the same project and are likely
# to be copypasta.
#
if [ ! -f sorted.lst ]
then
    echo "Sorting the classes"
    awk -F. 'length($NF) > 2 { printf("%s.%s.%s %s\n", $1, $2, $3, $NF) }' allclasses.lst | sort | uniq | cut -d' ' -f2 | sort >sorted.lst
fi
echo "found" $(wc -l <sorted.lst) "distinct classes, now sorting..."
#
# Counts the occurrences of each class name, and outputs a CSV line
# containing the short classname and its appearance count.
# The END block flushes the last group, which would otherwise never be printed.
echo doing classnames...
awk 'BEGIN { FS = "."; count = 1; last = "" } { C = $NF; if (last == C) { count++ } else { if (last != "") { printf("%s;%d\n", last, count) } last = C; count = 1 } } END { if (last != "") { printf("%s;%d\n", last, count) } }' sorted.lst | sort -r -n -t ';' -k 2 >classnames.csv
#
# Extract each word contained in the class name, sort and count, then
# output a CSV file similar to the one created above.
# Note: FS="" (one field per character) is an awk extension supported by
# gawk; POSIX leaves an empty FS undefined.
echo doing morphemes...
cat sorted.lst | awk 'BEGIN{FS="";}{for(i=1;i<=NF;i++){if ($i == toupper($i) && i>1) {printf("\n");} printf("%s",$i);} printf("\n");}' | \
sort | awk 'BEGIN { FS = "."; count = 1; last = "" } { C = $NF; if (last == C) { count++ } else { if (last != "" && length(last) > 1) { printf("%s;%d\n", last, count) } last = C; count = 1 } } END { if (length(last) > 1) { printf("%s;%d\n", last, count) } }' | sort -r -n -t ';' -k 2 >morphemes.csv
#
# All of the above, but just for google :-)
echo doing google...
grep '^com\.google\.' allclasses.lst | awk -F. 'length($NF) > 2 { printf("%s.%s.%s %s\n", $1, $2, $3, $NF) }' | sort | uniq | cut -d' ' -f2 | awk 'BEGIN{FS="";}{for(i=1;i<=NF;i++){if ($i == toupper($i) && i>1) {printf("\n");} printf("%s",$i);} printf("\n");}' | \
sort | awk 'BEGIN { FS = "."; count = 1; last = "" } { C = $NF; if (last == C) { count++ } else { if (last != "" && length(last) > 1) { printf("%s;%d\n", last, count) } last = C; count = 1 } } END { if (length(last) > 1) { printf("%s;%d\n", last, count) } }' | sort -r -n -t ';' -k 2 >google-morphemes.csv
#
# Now some fun with common classes
# (same counting as for classnames.csv, restricted to names matching a pattern)
echo doing base64...
grep "Base64" sorted.lst | awk 'BEGIN { FS = "."; count = 1; last = "" } { C = $NF; if (last == C) { count++ } else { if (last != "") { printf("%s;%d\n", last, count) } last = C; count = 1 } } END { if (last != "") { printf("%s;%d\n", last, count) } }' | sort -r -n -t ';' -k 2 >base64-classnames.csv
echo doing String...
grep "String" sorted.lst | awk 'BEGIN { FS = "."; count = 1; last = "" } { C = $NF; if (last == C) { count++ } else { if (last != "") { printf("%s;%d\n", last, count) } last = C; count = 1 } } END { if (last != "") { printf("%s;%d\n", last, count) } }' | sort -r -n -t ';' -k 2 >string-classnames.csv
echo doing Log...
grep "Log" sorted.lst | awk 'BEGIN { FS = "."; count = 1; last = "" } { C = $NF; if (last == C) { count++ } else { if (last != "") { printf("%s;%d\n", last, count) } last = C; count = 1 } } END { if (last != "") { printf("%s;%d\n", last, count) } }' | sort -r -n -t ';' -k 2 >log-classnames.csv
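If you want to see the de-duplication rule at work without downloading a whole repository, here is a small sketch on made-up class names. The function names dedupe_by_project and count_names are mine, the package names are hypothetical, and the counting uses sort | uniq -c rather than the hand-written awk counter in the script; it also skips the script's filter on very short class names.

```shell
#!/bin/bash
# Reduce fully-qualified names to "first.three.elements ClassName",
# drop duplicates within a project, then keep only the short class name.
dedupe_by_project() {
    awk -F. '{ printf "%s.%s.%s %s\n", $1, $2, $3, $NF }' | sort -u | cut -d' ' -f2
}

# Count each name and print "name;count", most frequent first.
count_names() {
    sort | uniq -c | awk '{ printf "%s;%d\n", $2, $1 }' | sort -t ';' -k2,2nr -k1,1
}

# Hypothetical input: the two org.example.tree copies share their first
# three package elements, so they count as a single Node.
printf '%s\n' \
    org.example.tree.v1.Node \
    org.example.tree.v2.Node \
    com.acme.graph.Node \
    com.acme.graph.Edge |
    dedupe_by_project | count_names
```

This prints Node;2 followed by Edge;1: three Node classes went in, but the v1/v2 pair collapses into one.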