Open Source Machine Vision Object Detection and Ranging

GroG · December 5, 2008, 11:09pm

This is a basic demonstration of software which I have written which does a form of machine vision for object detection and ranging. I will continue to develop it for use with loki advanced robotic platform and other robots. The software is written in java and is theoretically capable of running on any platform capable of supporting the java virtual machine. Unfortunately my software will not run on microcontrollers (yet) :D

Setup

This particular setup consists of 2 X 3Com Homeconnect Webcams I bought off of Ebay about 4 years ago. They are fairly good cameras for the price ($12 to $15) and more importantly they have Linux drivers readily available. The computer is a donated Intel Celeron with a 2.60GHz CPU. The OS is Fedora Core 8 with a variety of other fun free software packages. Everything is shifted red in the images because I had to remove an internal cloudy blue filter from inside the cameras. Previously the cameras were extremely blurry. The software should be able to handle any webcam as long as it has a Linux driver. I have not written a Window$ interface (yet). In fact by today's standards my cameras would be rated as very poor quality. They have a fairly slow frame rate, sometimes as low as 3 fps. They have a maximum color resolution of 320 X 240 pixels. I continue to use these cameras for 2 reasons. 1. They are cheap. 2. I believe that as I develop the software for the worse possible sensor - it will work much better for better cameras. I have done some obstacle avoidance tests and routines before, but for this test I will only be exercising object detection and the beginnings of some simple ranging.

2 webcams on motorized mount

Computer: Intel Celeron with a 2.60GHz CPU

Cameras: 2 X 3Com Homeconnect Webcam for $15

Software: A Java program I wrote which uses some Java Advanced Imaging

Operating System: Fedora Core 8

Target: Duplo blocks and a variety of other objects

Background: small light beige fabric on the floor, my wife let me borrow it because she watched part of STeven’s video and said to me, “Geeze, that is clean, you better not put our basement carpet online!”

Definitions and Descriptions

Here is a list of definitions and descriptions relating to the video in the 4 applets displays. Although, a web page and an applet is being utilized for the display, any application which can show a series of jpeg images could be connect to the robot. I am using GWT for a control panel of loki, so an applet display seemed to be a logical choice. I can view and control from anywhere on the internet (Bwah Ha Ha).

Bounding Box: a rectangle drawn around all the tiles of an object

Object Number: an arbitrary number assigned to a object

Center of Object: center of object derived by the bounding box width and height divided by 2

X, Y Top Corner Position: top corner pixel location of the bounding box. The X (horizontal component) will become important inStereo Vision when determining Horizontal Disparity. This will help in ranging the distance of objects.

Area Size: total size of tiled area in pixels

Average Color: average color of all tiles grouped with this object

Number of Tiles: number of tiles grouped with this object

Color Description: at the moment it can only handle red green or blue and this is done finding which value of the average color is the highest, I might add “martha stewart’s winter wire color” as another option as soon as I figure out what RGB value that is.

English Object Description:this is derived by a threshold of how many tiles fill the bounding box. If it is enough then it will consider it a “box thingy”. The bounding box could be changed into other templates. For example, “triangle thingy” or “circle thingy”. With range added, dimensions can also be calculated. With that comes a higher order of conceptualization of “box thingy”. It could be a “12 inch X 18 inch black box thingy 4 feet away”. This could match with “Computer Screen” or some other appropriate data item. As you can see, this can be in Danish or English.

Overview of Algorithm

I did some previous experiments with a Sobel and Canny operators but my implementation seems to have brought out a significant amount of noise. I would like to revisit these algorithms at a later time.

What I am currently doing, I think, is some form of sparse color blob detection?

Here are the individual parts:

recursively split based on a max color deviation from a sparse sample
begin grouping on a tile-size threshold
recursively group adjacent tiles based on a color threshold
match objects based on size color and other variables
range object in a 2 camera system with horizontal disparity of matched objects

Splitting

I have done tried some experimentation with edge detectors. It was seemed rather noisey for measurements and appeared to increase the correspondence problem. After thinking about it, if my objective was to match objects in two extremely divergent quality cameras, I would try to match the “largest colored area” first. I believe this might be referred to as Blob Detection. I also figured that trying to determine the as many characteristics of an object possible, in one camera, would help me find the corresponding object in the other camera. I mean, if you know its a “Shoe” in one camera picture, you can find the “Shoe” in the other camera without much difficulty. So, the more data I processed and refined the object in one camera, the higher the possibility of matching it correctly in the second camera. So the first part of this algorithm was to split the screen up into tiles of color.

If a color difference threshold is met the tile is divided into 4 quadrants tiles, the 4 tiles are recursively split or ignored depending on this threshold. I created indexes for the tiles’ corners. This was an optimization to search adjacent rectangles. In order to support this, the splitting need to be divided equally divided into 4 parts.

This means the 320 X 240 image can be divided 4 recursive times. The width can be divided a couple more times equally – that is why the smallest tiles are rectangular rather than square.

Grouping

After splitting the second algorithm would group selected tiles together.

The particular algorithm I used will not begin to group anything below a certain size-threshold, which I think is one of the biggest short comings at the moment. I will probably change this in the near future. Once a tile has been identified as being “big enough” adjacent tiles are searched and if the color difference is within a certain amount, the adjacent tiles are joined to the object. This is done recursively until all the corners of the grouped tiles have been checked.

Filtering

There is not much filtering going on in this experiment. I filtered the background object out, because it would put a red dot in the center of the screen with a bunch of statistics, and I did not want that displayed at the same time other objects in the view were being processed. It’s a very lame filter, I just grab any object above a certain brightness, since in this scenario the background was bright. And as you might have noticed, that is why certain other object were also filtered out. The white cat hair would be an example, also part of my hand was filtered out because of this silly background filter.

Ranging

I have not finished this ranging section, because I keep finding things which might aid in reducing the correspondence problem in single camera mode. But I did put a few indicators in so I could begin some calculations. The points on any corner of the bounding box could be used to derive range information. The X position is used to find the horizontal disparity and using trig or a simple look-up table, the range can be calculated. In this particular instance the distance to the Red bricks is 38.1 cm (15") and the distance to the Green bricks is 25.4 cm (10").

To Do

Optimizations - there are many serious optimizations that could be done, but at the moment on limited time and having too much fun being the mad scientist I am saving that for later.

Other Algorithms - The data coming from a video camera is huge. On my crud cameras there is 320 X 240 X 3 X 3 bits flowing per second. There is a vast amount of interpreting which could be done. So many algorithms, so little time. Humans, after 6 billion years of evolutions seem to have perfected it on many different levels. For example, human’s multi-dimensional understanding of objects and our understanding of perspective will allow us to easily range, navigate, and identify aspects of the environment around us even with an eye missing! If I only had 6 billion years. Here is a list of some other algorithms I am interested in.

Determining location and direction of light source
Handling reflections and shadows better
Revisit Sobel and Canny operators as another processing unit
Template / neural network identification (see below)
Determining distance through Jittering (moving the camera back and forth on a horizontal plane to generate a 3D map)
Make a little motor which can move the cameras further apart or closer together, the further apart they are the better at ranging long distances

Identification - I am interested in being able to identify objects. I was thinking of using templates stored in a database. I intend to hook up a neural network to resolve the statistical complexity when the robot finds something in the environment and tries to match the appropriate template. An interface where the robot drives around and locates an object in its environment. It will process against the neural network and come up with a guess. At this point I will be able to sadistically train it :D. It will drive around and pick up a sock on the floor and ask “Towel?”. I then will punish… I mean train by saying “No” … Then it will make another template to be stored in the database… And hopefully I won’t find it wiping down the table with my socks.

Console to manage Inputs and Outputs - The software currently can manage inputs and outputs of video data streams. In this experiment the video stream was put in 4 queues. Two of the video queues were left undisturbed while the other two were fed through the splitter, the grouping, and the filter before being converted into jpegs and sent to the applets. The streams can be forked apart or joined back together. At the moment there is no console to construct or tweak parameters of the algorithms. So that is on my todo list too…

3D & 2D mapping - creating, storing, recalling, and comparing maps in a database or internet (google maps, IFRIT, etc)

Context creation - the idea that every single frame does not need to be processed and thrown away. But to store the useful results of processing in a database to be referred to later.

Trac Project and SVN repository - create an open source project so anyone who wants to try it or improve it can…

Splitting and Grouping Processor seperated - at the moment they are in the same processor, but they should be seperated since it may be desirable to split and send to a different processor ie. Canny Sobel or something else.

https://www.youtube.com/watch?v=_2Wj1uj1I9k

robologist · December 6, 2008, 2:58am

**This is **
scary and cool all at the same time. Lot of cool info, and it will be great to see more as you develop it. There seems to be a lot of seeming complexity in video processing, and I hope to pick it up someday, so thanks for the info on your approach.

TheCowGod · December 6, 2008, 4:22am

Awesome, thanks. I’ll

Awesome, thanks. I’ll definitely be keeping an eye on this and reading it more closely when I’m ready to start playing with vision. I’m glad I’ll have someone to talk to who has already done a good bit of work with it

Dan

rik · December 6, 2008, 6:21am

Blaze that trail Grog!

Excellent writing and very informative.

What a treat to find this article on my breakfast table this morning. Wished I had brought that second cup right away. This node is a two-cupper! Responding will take another one. gone to kitchen

Thanks for blazing a trail for us to follow. When we are ready for machine vision that is. I understand you too are following other people’s trails. Could you tell a bit more about those? You explained which software you are expanding on. Which functions were already available to you and which ones did you program yourself? still too hot

I researched that “martha stewart’s winter wire color” for you: I believe it to be RGB #b0d9f7. That would be blue for your code.

One question and one nitpick. Why not place left/right images left and right in the interface or video? And you are not really spitting on your project, are you? Recursively or otherwise?!

8ik

GroG · December 6, 2008, 10:32am

Thanks Rik,As someone who’s

Thanks Rik,

As someone who’s articles I really enjoy, I take that as a great complement. In person I’m not such a chatty guy, but there are a few topics which I can blather on and on about… earthships as an example… but I digress…

Brilliant! left on the left side, right on the right… geeze why didn’t i think of that. I had a multi-threading issue and just recently figured it out… So left and right could not exist at the same time. Now that it works, the displays can be positioned by moving a little html around. I uploaded another test. It has “martha stewart’s winter wire color” in the column header. Thanks for the research! The test shows. 1. I’m still pretty unhappy with the initial condition of minimum area to start object/tile grouping. When the red Duplo bock falls over, its not picked up again because of its smaller size. 2. Thin wire and glass does not get picked up… maybe I’ll get thicker frames

What, who’s not ready? In regards to functions, I have not found an open-source Java machine vision library for robots, that is why I am interested in starting one. The largest organized collection of open-source code for machine vision I believe is the OpenCV. I have not looked into it at any real depth, but have noted it as a reference to investigate sometime. It’s written in C++, and by trade I’m a C++ developer, but there really appears to be more well organized Java open source projects and libraries than C++. For example, here, and neural nets, the lists go on and on. I am using JAI for writing to the applet screen. For the Canny and Sobel filters I read what was on wikipedia and attempted my own implementations, the same goes for the colored blob splitting/grouping stuff I’m doing now, ie. I’m stumbling through, trying to make useful libraries, which is pretty dang fun

GroG · December 6, 2008, 10:36am

Scary? Wait till I try to

Scary? Wait till I try to use it to clean the house :D. Sure, its the least I could do considering the amount of electronic info I welch off of you!

GroG · December 6, 2008, 10:39am

Sure Dan, maybe Bully Bot

Sure Dan, maybe Bully Bot might benefit with what to target?

rik · December 6, 2008, 11:09am

not ready

Me. I am not yet ready to take this on. At the moment I am converting from picaxe to linksys. My first attempt is to make the damn led blink. Again. Then come the wheels, then the eyes.

So, I understand that you found a way to feed a stream of individual pictures into your java program and that the program is interpreting those on an individual basis (for now) and spits (!) out info about the blobs (in two different formats: altered pictures and raw data).

And that all the streaming and stuff is premade libs, but the interpreting is home made?

8ik

TheCowGod · December 6, 2008, 12:45pm

Yup, I think that’ll be one

Yup, I think that’ll be one of my first uses for this sort of thing

Dan

GroG · December 6, 2008, 2:44pm

If you have a webcam and a

If you have a webcam and a computer you could start now, but I understand your priorities. I will need to get back to H-Bridge details soon again and a few other electronic parts…

Another wonderful thing about Linux is its directness and elegant simplicity. Yes, if you can find a driver for your webcam in Linux, you just need to open a file. e.g. /dev/video0 or /dev/video1 with windows the API is alot more convoluted, opening device contexts yattah yattah and such - i would have provided a link on how to do it but could not find one

So, the Java program just opens the video file and the program is interpeting those individual frames into temporary data structures, e.g. an Object with an ArrayList of Tiles. Here’s a ascii flow diagram:

/dev/video0 --> Java open file --> byte Array read --> GroG Image Queue --> Fork Queue --> GroG’s attempt of Spitter / Grouper Algorithm --> Image converted to JPEG using JAI --> Object details written to display using JAI

Input and Output is managed by a Control Framework which I created.
It’s the same thing which controls Lokibot’s other electronic outputs. I have a Motor service for example which control Loki’s motors.

In this experiment I’m using the “ImageProcessor” service.

Conceptually the framework is kindof like Matlab which in turn kind of like M$robotics, which is all about managing inputs and outputs…

Tell me when I should stop

rik · December 6, 2008, 3:03pm

that’ll do grog

I totally understand the /dev design in Linux, or at least the appeal of it. I am currently manipulating /dev/gpio/out to make my led blink. Linux is my thing. Programming is not. I plan on using perl. Don’t speak no fancy languages like java or C/C++.

That won’stop me though. I plan to do as we all do in here. Use whatever I’ve got lying around the house. Routers, pizza boxes, tape, soldering iron, tea kettle, my wit and my perl. Sans webcam 8-(

8ik

GroG · December 6, 2008, 4:04pm

Righty-O Linux is Great!

Programming is not your thing? Perl? What? It’s all the same man! At some point they all become machine code for a processor. You have already done this. Perl Java C++ Python Lisp all the same sematics just different syntax…

$motor = GenericMotor(“neck”);  # create a new motor named "neck"

Remember, a Language Thingy by any other name would smell as sweet…

For completeness I have re-added the first code snippet in decent resolution

rik · December 6, 2008, 4:31pm

agreed

Sure the syntax is the only real diff. But for now I am not ready to grow my lingo bulge any further. (… he lied, while trying to hack the gpio.c equiv of helloworld.c using a cross compiler for the very first time in his life!)

With your helpfull hints in the margin above, I can read the java code. Even the oo is not scaring me.

By “my thing” I meant to say “my profession”. Dammit Jim, I’m a sysadmin, not a coder! Or something to that effect.

For the record: I love programming. And *nix appeals to me because of its culture.

8ik

GroG · December 7, 2008, 11:57pm

Hi OddBot,Interesting

Hi OddBot,

Interesting ideas. The first attempt which I worked on used a sample from the left camera and pushed it across the right camera image. It would do a subtraction of the luminence values and if the two images were in alignment, theoretically the difference would be 0. I have a lame video of that here . The top is the difference between the left and right images. The bottom is an image from the right camera, it is being pushed across an image from the the left camera and the difference is at the top.

The blob detection did a better job so far on alighning correct edges. I think it is worthwhile trying to extrapolate data that quantify something as an “object” - It should help in ranging too, but there are alot of other reasons “objectifying” will be useful.

Thanks for the suggestio. I will try to make a image processor with some of the optimizations which you suggest. I have played a little with a blur filter before. Some blur filters use averages, recently I started using “sparse” sampling - which I think is a nifty optimization. Instead of summing all the pixels in an area and dividing by the pixel count, sparse sampling just sums every nth pixel then divides by the count. It is being used in the splitting algorithm mentioned in the walkthrough.

I am trying to collect and implement many processes so that they can be chained togther and experimented with. Quite possibly, the processes should change depending on the environment. e.g. navigating will need something very quick and not cpu-intensive, but say when the robot stops and is trying to find something - a more intensive, higher detail process can start.

I was thinking of a Jitter mechanism which moves a single camera back and forth in a horizontal plane, this would avoid the correspondence problem which stereo vision has. I put a threaded rod on a small motor shaft and made a rail for one of the webcams, the threaded rod would slide the camera back and forth. Not being very pleased with the first iteration, I have taken it apart … but I think there is potential (the travel of the webcam was too rough/crude… redesigning)… so much to do … so little time

GroG · December 8, 2008, 2:10am

Yes, I have been stalking
Yes, I have been stalking around for one. But, unfortunately all of them appear to be serving printing jobs
Maybe someone at work will give me an "Office Space" victim…

fritsl · December 8, 2008, 6:53am

Hats off, Grog. Completely
Hats off, Grog. Completely off. Hair and some skin off. I want to come over and play. Lord give us time and money. And a ticket to Grog’s house!

GroG · December 9, 2008, 12:24am

Thanks Frits,Come on over…

Thanks Frits,

Come on over… hope you like salmon, blackberries & beer…
If all goes well, you will be remotely driving around in Portland, then again … at the rate I seem to be finishing things, I might see you in person first

GroG · December 9, 2008, 8:42am

That’s what we raise cattle

That’s what we raise cattle for

edgee · December 20, 2008, 5:04am

amazing
This is truly amazing. Ive started working on image processing using processing 1.0. Its no where near as good as this though.

GroG · December 20, 2008, 9:22am

Thanks Edgee,What is

Thanks Edgee,

What is “processing 1.0” ? If you have something which can run a java vm I’d be glad to give you all the source code - although I have not formally released it yet (still needs work).

GroG