
09/18/12

Permalink 05:41:00 pm, by nogunner, 534 words, English (US)
Categories: Javascript

Fullproof: A javascript search engine library for the browser

For a side project of mine, I stumbled upon the fact that there is no great way to provide good-quality text search for an offline web application.

Document searching has always been a server-side thing, but now that we have the technology to deliver offline applications that work stand-alone, or that synchronize with their server whenever an internet connection pops up, full-text search looks more and more like a missing feature.

The usual way javascript developers provide search (sketched below) is to:
1. iterate through all the documents with a regex pattern,
2. collect the matching documents, and
3. display the results.
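
A minimal Java sketch of that naive scan (the class and method names are mine, for illustration only):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// The naive approach: run a regex over every document for each query.
// It works, but the cost grows linearly with the corpus, and it only
// matches the exact string the user typed.
public class NaiveSearch {
    public static List<String> search(List<String> documents, String query) {
        Pattern p = Pattern.compile(Pattern.quote(query), Pattern.CASE_INSENSITIVE);
        List<String> results = new ArrayList<>();
        for (String doc : documents) {
            if (p.matcher(doc).find()) {
                results.add(doc);
            }
        }
        return results;
    }
}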

I'll admit that's a kind of fulltext search, but it lacks a few things compared to traditional search engines:

- It does not scale: the more documents there are, the longer the scan takes. But let's pretend that's not a big deal, as we're not supposed to store a lot of documents in a web application, right? (OK, that's not right at all, but whatever, let's suppose it is.)

- The search quality is awful. It can only find documents that match exactly what I typed. But what if I make a small typo? What if I use a plural in my query instead of the singular form that is stored in my document? And what about Unicode: what happens if my document stores its text in decomposed NFKD form, but my browser gives me text in NFC normalized form? Sure, that's not a big deal for English, but for most languages that use diacritical marks, it definitely is (a quick illustration follows).
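
To see the normalization problem in isolation, here's a tiny Java illustration using the standard java.text.Normalizer (modern JavaScript exposes the same operation as String.prototype.normalize()):

import java.text.Normalizer;

// The same "é" can be one precomposed code point (NFC) or a base letter
// plus a combining accent (NFD/NFKD); a raw comparison sees two different strings.
public class NormalizeDemo {
    public static void main(String[] args) {
        String decomposed = "cafe\u0301";  // "café" as 'e' + combining acute accent
        String composed = "caf\u00e9";     // "café" as a precomposed character
        System.out.println(decomposed.equals(composed));  // false
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(composed));                       // true
    }
}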

So what the users really need is something like this:

[screenshot: the Fullproof color-search demo]

Here comes Fullproof, a javascript library that provides the full component stack of traditional search engines:
- Text analyzers: text parsing and token normalization using several chainable algorithms: removal of duplicate letters, lowercasing, Unicode support, Metaphone, the Porter stemmer, etc.
- Several HTML5-enabled stores: WebSQL, IndexedDB, or plain memory, with automatic selection of the best available store.
- A boolean search engine and a scoring engine. If the difference between the two is unclear: a boolean engine intersects the per-token result sets (a document matches only if it contains every token of the query), while a scoring engine merges all the results and sorts them by score. Google is of the scoring kind, while boolean search is used in, well, a lot of domain-specific engines where you only want the results that contain all the tokens of the query (both strategies are sketched below).
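
Here's a minimal, library-agnostic sketch of the two strategies (plain Java; this is not Fullproof's actual API):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EngineSketch {

    // Boolean: a document matches only if it appears in the result set of
    // every query token (set intersection). Assumes at least one token.
    static Set<Integer> booleanSearch(List<Set<Integer>> perTokenResults) {
        Set<Integer> hits = new HashSet<>(perTokenResults.get(0));
        for (Set<Integer> tokenResults : perTokenResults) {
            hits.retainAll(tokenResults);
        }
        return hits;
    }

    // Scoring: merge all the result sets and rank the documents by their
    // accumulated score (union, then sort).
    static List<Integer> scoringSearch(List<Map<Integer, Double>> perTokenScores) {
        final Map<Integer, Double> merged = new HashMap<>();
        for (Map<Integer, Double> scores : perTokenScores) {
            for (Map.Entry<Integer, Double> e : scores.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(merged.keySet());
        ranked.sort((a, b) -> Double.compare(merged.get(b), merged.get(a)));
        return ranked;
    }
}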

Although I'm pleased with the quality of the search it provides, I'm not so pleased with the current state of the offline storage implementations in browsers. Even though IndexedDB has a great spec, it's still unclear how quota limits are managed on the browser side: Chrome allows both IndexedDB and WebSQL, while Firefox dropped WebSQL, and using IndexedDB there requires an authorization from the user, which is great, or rather would be great if it were done right. However, browsers are evolving at an amazing pace nowadays, and I'm pretty confident offline HTML applications are the next great thing happening to the web.

You can use and fork Fullproof on GitHub: http://reyesr.github.com/fullproof/

By the way, the screenshot above is from the Fullproof examples; you can try it online: http://reyesr.github.com/fullproof/examples/colors/colors.html

04/12/10

Permalink 07:44:10 pm, by nogunner, 346 words, English (US)
Categories: Misc

UpFrost deployed

I open-sourced and published today this Java persistence library that I've been developing and using for some months now.
Although I started working on it after switching from Hibernate to something lighter for a project where memory footprint was an important constraint, I ended up using it on most of my projects, and I was happy enough with it to share it as open source.

The key points of the library are:

  • JPA-annotation based. Although it's not JPA at all, it uses a subset of JPA annotations to describe the Java/SQL mapping. I just hope people won't be misled into thinking this is a JPA library.
  • SQL: I like SQL, I deeply think it's an amazingly powerful query language, and I've always been reluctant to use anything else (read: HQL). The point is that I like to test my complex queries in MySQL Workbench and just copy-paste them into my Java project, and that's really easy to do with plain SQL.
  • No XML. Yuck, I really wanted to avoid it; it's what kept me from going back to iBatis. Now that iBatis 3 has annotations, I guess most of my reluctance should fade away, but I also like the idea of using standard JPA annotations and having my business objects compatible with both JPA and UpFrost, provided I stick to the subset (see the sketch below).
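
For illustration, here's what such a mapping looks like with standard JPA annotations (the entity is hypothetical, and the exact annotation subset UpFrost supports is described in its documentation):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Hypothetical entity: plain JPA annotations describe the Java/SQL
// mapping, so the same class remains usable with a full JPA provider.
@Entity
@Table(name = "users")
public class User {
    @Id
    @Column(name = "id")
    private long id;

    @Column(name = "email")
    private String email;

    // getters and setters omitted
}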

There's nothing disruptive in this library, but it gathers all the concepts I like and excludes all the painful stuff that degrades my productivity (XML, mostly, but I've also stopped fighting with HQL joins).

Because I wanted to avoid writing a website, I used the Maven "site" feature, and although I wish there were out-of-the-box support for DocBook or some other standard documentation format (rather than this APT thing), the feature is handy and flexible enough to let me customize the layout the way I wanted. The site:deploy goal is the killer feature, imho: it deploys the whole site to a remote server with a minimal amount of configuration work.

Anyway, here's the site: http://www.upfrost.org

01/29/10

Permalink 05:24:46 pm, by nogunner, 293 words, English (US)
Categories: Misc

Benchmarking Gandi, OVH, Slicehost, and Linode VPS

I started thinking about VPS benchmarking when I was looking for a new server and found out that Gandi provides a lot of useful information on their benchmark page; they use the UnixBench program to evaluate the processor-equivalence of their shares.

Having trust issues myself ^^, I ran the program on a bunch of VPSes I manage:

  • Gandi (of course)
  • OVH, another famous French hosting provider
  • Slicehost
  • Linode
  • Myself, as I own a few Linux computers at home


So here are the results (the higher the score, the better):

[chart: UnixBench scores per provider]

To get those results, I ran the benchmarks at several times of day (to account for timezone effects on the load of European vs. US servers) and always kept the lowest values. In any case, I never saw more than a 10% variation between runs taken at different hours.

For the details, check the full table:

[table: full UnixBench results per provider]

NOTES:

  • Performance is strongly dependent on the host and the time of day, so your mileage may vary. However, all the results I provide were run several times to check their consistency; I usually got differences of less than 0.3 on the final score.
  • OVH's RPS are not VPSes: they are real servers, but they use network disks. Consequently, the caveat above does not apply to them, except for disk usage (shared over their network).


I was really surprised by Linode's strong performance, but keep in mind that all of these providers are excellent and each offers a very different kind of service, while this benchmark only scores CPU and disk (not bandwidth, additional services, reliability, or capacity).

I couldn't test every possible configuration and flavour of these providers, but hopefully this gives a good picture of their relative performance.

12/04/09

Permalink 02:36:27 pm, by nogunner, 1119 words, English (US)
Categories: Misc

Scary fact: 1 Java class out of 70 is a Node!

The other day, I happened to use one of my classes that manages preferences (from a property file). Pretty common, I know. So common that Eclipse popped up the completion window with 11 (eleven) possible classes with that very same name. My class is called Node (because it builds a tree, and I severely lack imagination).

Wait. ELEVEN, just in the classpath of a single project: I may not be the only one with imagination issues. How many Node classes are there in the world?

Pretty much impossible to know, but I figured that scanning the whole Maven repository might give a rough idea. So I ftp'ed the whole maven2 central repository to grab the jars (that's for SCIENCE, guys!!) and wrote a script.

156 Node classes

156: that's the number of classes simply called Node. But Node is also used as a morpheme in 4856 classes, out of the 340K in the whole Maven repository. In other words, 1 class out of 70 contains the morpheme Node. WOW. I'm feeling less lonely. Thanks, brothers!

So I extended the script to compute the full figures, and created the Top 100 Java Morphemes with it.

Top 100 Java Morphemes

So 1 class out of 22 contains the morpheme Impl. Haha, so much for those Java-bashers who complain we over-engineer with interfaces and abstract classes: this is factual proof that Java coders also implement real classes, at least 4.43% of the time! Yay!

The same Top 100 as a word cloud:

[word cloud of the Top 100 morphemes]

Think it looks familiar? Of course: that's what most of your code looks like. Now let's check the classnames, are we doing better?

Top 40 Java Classnames

The morpheme Token may only rank #86 among morphemes used to create classnames, but Token itself is the most common classname. We sure like parsing.

My smartest readers will probably notice some salient data:

  • There are 105 Base64 classes (not taking into account all the Base64Encoder, Base64Code, Base64Decoder, Base64Utils, etc).
  • There are 99 StringUtils and 65 StringUtil. Besides the fact that Java programmers tend to prefer plurals, it probably says something about the lack of String manipulation methods in the standard library.

Utilities classes

Regarding *Util* classes, the chart comes out as expected: if you're intending to create a StringUtil or FileUtil class, chances are there are already some out there providing what you need, so don't miss the chance to duplicate them and create your own!

Methodology and script

Here's the script used to compute those numbers. Note that extra care was taken to avoid counting duplicate classnames from the same project (the same fully-qualified classname in different jars, or re-jarred, and duplicates inside the same project -- some developers create myproject.v1.SomeClass and myproject.v2.SomeClass, which are just copypasta and don't qualify as two distinct classes).

#!/bin/bash

# Where the jars are located
JARFOLDER=ftp

if [[ "clean" == "$1" ]]
then
    echo "Cleaning workfiles..."
    rm -f jars.lst allclasses.lst sorted.lst
    exit 0
fi


if [ ! -f jars.lst ]
then
    echo "Creating jar list..."
    find "$JARFOLDER" -type f | grep '\.jar$' >jars.lst
else
    echo "Jar list already exists, skipping"
fi

CURRENT=1
TOTAL=`cat jars.lst | wc -l`
echo "found" $TOTAL "jar files"

#
# Extract the list of classes from the jars
# rm -f allclasses.lst

if [ ! -f allclasses.lst ]
then
 cat jars.lst | while read LINE ; do
    # Keep only compiled classes (no META-INF entries, no inner classes),
    # strip the .class extension, and turn the path into a dotted classname.
    zipinfo -1 "$LINE" | grep -v '^META-INF/' | grep '\.class$' | grep -v '^schema.system.' | grep -v '\$' | cut -d. -f1 | tr '/' '.' | sort | uniq >>allclasses.lst
    echo -n -e "done " $CURRENT "/" $TOTAL  \\r
    CURRENT=$((CURRENT+1))
 done
else
    echo allclasses.lst already exists, skipping.
fi

#
# Each line contains "x.y.z ClassName" (the first 3 package elements plus
# the classname), then the list is sorted and uniquified, and only the
# classname is output. This prevents classnames duplicated within the same
# project from being counted several times, but keeps them as duplicates
# when they appear to come from different projects (which is assumed to be
# the case when the first 3 elements of the fully-qualified name differ).
#
# For instance, we don't want project.v1.SomeClass and project.v2.SomeClass
# to be counted as 2, because they're from the same project and are likely 
# to be copypasta.
#
if [ ! -f sorted.lst ]
then
  echo "Sorting the classes"
  cat <allclasses.lst | awk "BEGIN{FS=\".\";}{ if (length(\$NF)>2) {printf(\"%s.%s.%s %s\n\",\$1,\$2,\$3,\$NF);}}" | sort | uniq | cut -d' ' -f2 | sort >sorted.lst
fi
echo "found" `cat sorted.lst | wc -l` "distinct classes, now sorting..."

#
# Counts the occurrences of each classname and outputs a CSV line
# containing the short classname and its count.
echo doing classnames... 
cat sorted.lst | awk "BEGIN{FS=\".\"; count=1; last=\"\";}{ C=\$NF; if (last==C) { count++ } else { if (last != \"\") {printf(\"%s;%d\n\", last, count);} last=C; count=1; }}" | sort -r -n -t ';' -k 2 >classnames.csv

#
# Split each classname into its CamelCase words (one per line), sort and
# count them, then output a CSV file similar to the one created above.
echo doing morphemes...
cat sorted.lst | awk 'BEGIN{FS="";}{for(i=1;i<=NF;i++){if ($i == toupper($i) && i>1) {printf("\n");} printf("%s",$i);} printf("\n");}' | \
     sort | awk "BEGIN{FS=\".\"; count=1; last=\"\";}{ C=\$NF; if (last==C) { count++ } else { if (last != \"\" && length(last)>1) {printf(\"%s;%d\n\", last, count);} last=C; count=1; }}" | sort -r -n -t ';' -k 2 >morphemes.csv

#
# All of the above, but just for google :-)
echo doing google...
cat <allclasses.lst | grep '^com\.google\.' | awk "BEGIN{FS=\".\";}{ if (length(\$NF)>2) {printf(\"%s.%s.%s %s\n\",\$1,\$2,\$3,\$NF);}}"| sort | uniq | cut -d' ' -f2 | awk 'BEGIN{FS="";}{for(i=1;i<=NF;i++){if ($i == toupper($i) && i>1) {printf("\n");} printf("%s",$i);} printf("\n");}' | \
     sort | awk "BEGIN{FS=\".\"; count=1; last=\"\";}{ C=\$NF; if (last==C) { count++ } else { if (last != \"\" && length(last)>1) {printf(\"%s;%d\n\", last, count);} last=C; count=1; }}" | sort -r -n -t ';' -k 2 >google-morphemes.csv

#
# Now some fun with common classes
echo doing base64...
cat sorted.lst | grep "Base64" | awk "BEGIN{FS=\".\"; count=1; last=\"\";}{ C=\$NF; if (last==C) { count++ } else { if (last != \"\") {printf(\"%s;%d\n\", last, count);} last=C; count=1; }}" | sort -r -n -t ';' -k 2 >base64-classnames.csv

echo doing String...
cat sorted.lst | grep "String" | awk "BEGIN{FS=\".\"; count=1; last=\"\";}{ C=\$NF; if (last==C) { count++ } else { if (last != \"\") {printf(\"%s;%d\n\", last, count);} last=C; count=1; }}" | sort -r -n -t ';' -k 2 >string-classnames.csv

echo doing Log...
cat sorted.lst | grep "Log" | awk "BEGIN{FS=\".\"; count=1; last=\"\";}{ C=\$NF; if (last==C) { count++ } else { if (last != \"\") {printf(\"%s;%d\n\", last, count);} last=C; count=1; }}" | sort -r -n -t ';' -k 2 >log-classnames.csv

06/18/09

Permalink 02:41:44 pm, by nogunner, 740 words, English (US)
Categories: Misc

Jdbc Jutsu Saves Heap Memory Against Wicket's Stateful Ajax

Hey, I hope Quentin Tarantino won't steal my title and make a movie out of it! Anyway, Wicket's implementation of Ajax is so good that I just stopped using anything else to make my web clients communicate with my servers. There's one thing, however, that prevents the fine scaling of Wicket sites, namely memory consumption. Wicket by itself does not consume that much memory, but if you want to use the sweet Ajax components, you're stuck with stateful pages. And stateful pages plus lots of visitors usually imply memory issues: it's not just your own data in the WebSession object that is kept in memory, but also the current page hosting the Ajax components, and the n last pages visited. If your pages are big, the sessions are big; that's the point.

Add to that a specific need to make sessions last longer than usual, so that users can stop using the site, come back after a long period of time, and still find their session available, and you'd soon be dreading every new visitor.

So, to reconcile long sessions, limited-memory servers, and stateful Wicket, the best possible solution, besides asking your users not to tell their friends, is to apply some JDBC jutsu to the session management of your servlet container. At least, that's how I solved my issue.

Tomcat provides a JDBCStore for its PersistentManager class (which is responsible for managing sessions): this Java jutsu just saved my server's memory from going out of control. Unlike the default org.apache.catalina.session.StandardManager, which keeps everything in your precious heap, the PersistentManager provides a lot of flexibility regarding session storage. For instance, a typical configuration keeps sessions in memory but passivates them to a database after a few minutes of inactivity (instead of consuming all that good memory for idle or disconnected users until their sessions expire).

The Tomcat documentation lacks a few examples, so here's mine: I wanted unlimited sessions, with idle sessions passivated to the JDBC database after 120 seconds of inactivity (the uppercase values are the ones to customize).

  <Host ...>
    <Context path="/MYPATH" docBase="MY-APPLICATION.WAR">
      <!-- PersistentManager: keep sessions in memory, but swap any session
           idle for more than maxIdleSwap seconds out to the Store below.
           The -1 values mean "no limit" / "disabled". -->
      <Manager className="org.apache.catalina.session.PersistentManager" saveOnRestart="true"
               maxIdleSwap="120"
               minIdleSwap="-1"
               maxActiveSessions="-1"
               maxIdleBackup="-1">
        <!-- JDBCStore: where passivated sessions are written -->
        <Store className="org.apache.catalina.session.JDBCStore"

               driverName="com.mysql.jdbc.Driver"
               connectionURL="jdbc:mysql://localhost/MYAPP"
               connectionName="DATABASE-USERNAME"
               connectionPassword="DATABASE-PASSWORD"

               sessionTable="tomcat_sessions"
               sessionIdCol="session_id"
               sessionValidCol="valid_session"
               sessionMaxInactiveCol="max_inactive"
               sessionLastAccessedCol="last_access"
               sessionAppCol="app_name"
               sessionDataCol="session_data"

               checkInterval="60"
               />
      </Manager>
    </Context>
  </Host>
Note: remember to put your JDBC driver jar in Tomcat's lib/ folder, or the connection won't work.

Then, in the database and schema specified in the configuration above, create the following table (from the Tomcat manual at http://tomcat.apache.org/tomcat-6.0-doc/config/manager.html):

create table tomcat_sessions (
  session_id     varchar(100) not null primary key,
  valid_session  char(1) not null,
  max_inactive   int not null,
  last_access    bigint not null,
  app_name       varchar(255),
  session_data   mediumblob,
  KEY kapp_name(app_name)
);

Well, that's it: no more Wicket-related memory issues.

Update after Thyzz's comment (see below)

Wicket can actually use either its original HttpSessionStore (which stores everything in the servlet HTTP session) or its newer SecondLevelCacheSessionStore, which stores the PageMap on disk. Below are some measurements I made comparing memory usage with and without Tomcat's JDBCStore:


[chart: heap usage over time for each session store, with and without the JDBCStore]

To get those metrics, I used a web application with a moderate memory usage, and changed it so that the session store could be either the HttpSessionStore or the SecondLevelCacheSessionStore. Additionally, I added a new byte[1024*500] allocation to the session object to make memory consumption artificially higher (purely for this test configuration; applications whose sessions have a really low memory footprint behave totally differently, and are unlikely to have any memory-related issue). Then I ran a script on another computer that looped over an HTTP request for the front page (and all the resources of that page).
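
Concretely, the ballast looked like this hypothetical helper (the attribute name and size are just what I used for the test):

import javax.servlet.http.HttpSession;

// Hypothetical test helper: park ~500 KB of dead weight in each session
// so that memory pressure shows up quickly under load.
public class SessionBallast {
    public static void attach(HttpSession session) {
        session.setAttribute("ballast", new byte[1024 * 500]);
    }
}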

As a result, both HttpSessionStore and SecondLevelCacheSessionStore eventually end up with an OutOfMemoryError. Memory usage is slightly better with the SecondLevelCacheSessionStore, but I did not test HTTP latency; a real benchmark would need to compare both memory and speed. At least the figures show that, with the JDBCStore, the memory issue is prevented.
