
Scary fact: 1 java class out of 70 is a Node!

12/04/09

02:36:27 pm, by nogunner
Categories: Misc


The other day, I happened to use one of my classes that manages preferences (from a properties file). Pretty common, I know. So common that Eclipse popped up the completion window with 11 (eleven) possible classes with that very same name. The class is called Node (because it builds a tree, and I severely lack imagination).

Wait. ELEVEN just in the path of a single project: I may not be the only one with imagination issues. How many Node classes are there in the world?

Pretty much impossible to know, but I figured that checking the whole Maven repository might give a rough idea. So I FTP'ed the whole maven2 central repository to grab the jars (that's for SCIENCE, guys!!), and wrote a script.

156 Node classes

156: that's the number of classes called exactly Node. But Node is also used as a morpheme in 4856 class names, out of the roughly 340K in the whole Maven repository. In other words, about 1 class out of 70 contains the morpheme Node. WOW. I'm feeling less lonely. Thanks, brothers!
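
For the record, the "1 out of 70" figure is plain division on the two numbers above. A minimal sketch, assuming the 340K total is taken as exactly 340000 (it is rounded in the post, so the result is approximate); the function names ratio and percent are made up for this example:

```shell
#!/bin/bash
# Figures quoted in the post. 340K is a rounded total, so treat the
# results as approximations.
NODE_CLASSES=4856
TOTAL_CLASSES=340000

# Prints N such that "1 class out of N" matches, rounded to an integer.
ratio() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf("%.0f\n", total / part) }'
}

# Prints the same share as a percentage with two decimals.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf("%.2f\n", 100 * part / total) }'
}

echo "1 class out of $(ratio "$NODE_CLASSES" "$TOTAL_CLASSES") contains Node"
echo "that is $(percent "$NODE_CLASSES" "$TOTAL_CLASSES")% of all classes"
```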

So I extended the script to get the whole figures, and created the Top 100 Java Morphemes with it.

Top 100 Java Morphemes

So roughly 1 class out of 22 contains the morpheme Impl. Haha, so much for those Java-bashers who complain that we over-engineer with interfaces and abstract classes: this is factual proof that Java coders can also implement real classes, at least 4.43% of the time! Yay!

The same Top 100 in a Web Cloud:

Think it looks familiar? Of course, that's what most of your code looks like. Now let's check the classnames, are we doing better?

Top 40 Java Classnames

The Token morpheme may not be the most used morpheme to create a classname (only #86), but it's the most common classname. We sure like parsing.

My smartest readers will probably notice some salient data:

  • There are 105 Base64 classes (not taking into account all the Base64Encoder, Base64Code, Base64Decoder, Base64Utils, etc).
  • There are 99 StringUtils and 65 StringUtil. Besides the fact that Java programmers tend to prefer plurals, it probably says something about the lack of String manipulation methods in the standard classes.
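
Figures like these can be pulled out of the "name;count" CSV files that the script further down produces. A small sketch, not part of the original script: the helper name sum_matching is hypothetical, and the three sample lines simply restate the counts quoted above.

```shell
#!/bin/bash
# Sums the counts of every name matching a regular expression in a
# "name;count" CSV read from stdin. Prints 0 when nothing matches.
sum_matching() {
    awk -F';' -v re="$1" '$1 ~ re { total += $2 } END { print total + 0 }'
}

# Sample lines restating the counts from the list above; in practice the
# input would be classnames.csv.
printf '%s\n' 'Base64;105' 'StringUtils;99' 'StringUtil;65' |
    sum_matching '^StringUtils?$'
```

The pattern is anchored on purpose, so that names such as Base64Utils or StringUtilsTest would not be folded into the total.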

Utility classes

Regarding *Util* classes, the chart comes out as expected: if you're intending to create a StringUtil or FileUtil class, chances are there are already some out there that provide what you need, so don't miss the chance to duplicate them and create your own!

Methodology and script

Here's the script used to calculate those numbers. Note that extra care was taken to avoid counting the same classname several times for a single project: identical fully-qualified classnames shipped in different jars (or re-jarred) count once, and so do duplicates inside one project. Some developers create myproject.v1.SomeClass and myproject.v2.SomeClass, which are just copypasta and don't qualify as two distinct classes.
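
To make that de-duplication rule concrete before the full listing, here is a rough sketch of it in isolation. The function name dedupe_per_project and the sample class names are made up for illustration; the awk expression mirrors the one used in the script.

```shell
#!/bin/bash
# Reads fully-qualified class names on stdin and prints one short class
# name per distinct (first 3 name elements, class name) pair.
# Names shorter than 3 characters are ignored, as in the script.
dedupe_per_project() {
    awk -F. 'length($NF) > 2 { printf("%s.%s.%s %s\n", $1, $2, $3, $NF) }' |
        sort -u | cut -d' ' -f2
}

# Made-up sample: the two org.example.app Node classes collapse into one,
# while the com.acme.tree Node still counts separately.
printf '%s\n' \
    org.example.app.v1.Node \
    org.example.app.v2.Node \
    com.acme.tree.Node \
    org.example.app.v1.Leaf | dedupe_per_project | sort | uniq -c
```

One caveat of this approach: the prefix is fixed at three elements, so a version segment that sits inside those first three elements (as in a two-element package like myproject.v1) still makes the names look distinct. The full script follows.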

#!/bin/bash

# Where the jars are located
JARFOLDER=ftp

if [[ "clean" == "$1" ]]
then
    echo "Cleaning workfiles..."
    rm -f jars.lst allclasses.lst sorted.lst
    exit 0
fi


if [ ! -f jars.lst ]
then
    echo "Creating jar list..."
    find "$JARFOLDER" -type f -name '*.jar' >jars.lst
else
    echo "Jar list already exists, skipping"
fi

CURRENT=1
TOTAL=$(wc -l <jars.lst)
echo "found" $TOTAL "jar files"

#
# Extract the list of classes from the jars
# rm -f allclasses.lst

if [ ! -f allclasses.lst ]
then
 while read -r LINE ; do
    # Skip META-INF and schema/system entries and nested classes (name contains $),
    # then turn path/to/Name.class into path.to.Name
    zipinfo -1 "$LINE" | grep -v '^META-INF/' | grep '\.class$' | grep -v '^schema.system.' | grep -v '\$' | cut -d. -f1 | tr '/' '.' | sort -u >>allclasses.lst
    printf 'done %s / %s\r' "$CURRENT" "$TOTAL"
    CURRENT=$((CURRENT+1))
 done <jars.lst
else
    echo allclasses.lst already exists, skipping.
fi

#
# Each line is rewritten as "a.b.c ClassName" (first 3 elements of the
# fully-qualified name + short classname), then sorted and uniquified, and
# only the classname is output.
# This prevents a classname duplicated inside one project from being counted
# several times, but keeps it as a duplicate when it comes from different
# projects (which is the case if the first 3 elements of the full classname
# are distinct).
#
# For instance, we don't want project.v1.SomeClass and project.v2.SomeClass
# to be counted as 2, because they're from the same project and are likely
# to be copypasta.
#
if [ ! -f sorted.lst ]
then
  echo "Sorting the classes"
  # Classnames shorter than 3 characters are ignored
  awk -F. 'length($NF) > 2 { printf("%s.%s.%s %s\n", $1, $2, $3, $NF) }' allclasses.lst | sort -u | cut -d' ' -f2 | sort >sorted.lst
fi
echo "found" $(wc -l <sorted.lst) "distinct classes, now counting..."

#
# Helper: counts the occurrences of each name read from stdin (one name per
# line, already sorted) and outputs CSV lines "name;count", most frequent
# first. $1 is the minimum name length to report (default 1).
# The END block emits the last group, which would otherwise be dropped.
count_names() {
    awk -F. -v min="${1:-1}" '
        function emit() { if (last != "" && length(last) >= min) printf("%s;%d\n", last, count) }
        { C = $NF; if (last == C) { count++ } else { emit(); last = C; count = 1 } }
        END { emit() }' | sort -r -n -t ';' -k 2
}

#
# Helper: splits each name read from stdin into its words, one per line
# (a new word starts at every character that equals its own uppercase form).
split_morphemes() {
    awk 'BEGIN{FS="";}{for(i=1;i<=NF;i++){if ($i == toupper($i) && i>1) {printf("\n");} printf("%s",$i);} printf("\n");}'
}

#
# Counts the occurrences of each class name, and outputs a CSV line
# containing the short classname and its appearance count.
echo doing classnames...
count_names <sorted.lst >classnames.csv

#
# Extract each word contained in the class name, sort and count, then
# output a CSV file similar to the one created above (single-character
# words are dropped).
echo doing morphemes...
split_morphemes <sorted.lst | sort | count_names 2 >morphemes.csv

#
# All of the above, but just for google :-)
echo doing google...
grep '^com\.google\.' allclasses.lst | awk -F. 'length($NF) > 2 { printf("%s.%s.%s %s\n", $1, $2, $3, $NF) }' | sort -u | cut -d' ' -f2 | \
     split_morphemes | sort | count_names 2 >google-morphemes.csv

#
# Now some fun with common classes
echo doing base64...
grep "Base64" sorted.lst | count_names >base64-classnames.csv

echo doing String...
grep "String" sorted.lst | count_names >string-classnames.csv

echo doing Log...
grep "Log" sorted.lst | count_names >log-classnames.csv
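
To see what the morpheme split does to an actual name, here is a small standalone sketch of the same rule. It is a restatement, not the script's exact code: the function name split_morphemes_demo is made up, and it walks the name with substr() instead of relying on an empty field separator (which POSIX leaves unspecified), so it should behave the same across awk implementations.

```shell
#!/bin/bash
# Splits each name on stdin into words, one per line. A new word starts
# before every character (except the first) that equals its own uppercase
# form: capital letters, but also digits and underscores.
split_morphemes_demo() {
    awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (i > 1 && c == toupper(c)) out = out "\n"
            out = out c
        }
        print out
    }'
}

printf '%s\n' StringUtils NodeImpl Base64Encoder | split_morphemes_demo
```

Note the side effect on digits and acronyms: Base64Encoder comes out as Base, 6, 4, Encoder, and the script then drops the single-character pieces when counting morphemes.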
