Interview
The following is a (fictitious) interview with Sal Khan from Khan Academy. As in, he interviewed me, not the other way around.
Sal: Hello.
Me: Hi.
Sal: Thanks for your time.
Me: No worries, it's only $130 an hour.
Sal: Tell me a little bit about yourself.
Me: My name is Ravit Sharma.
Sal: Tell me more about yourself.
Me: Okay then. I just woke up about 10 minutes ago and am writing this blog post to get it out of the way for the day. Yesterday I learned stuff, so I might as well flex my knowledge. I learned about different algorithms for object detection, including YOLO, RCNN, and Mask RCNN. Let's go through them one at a time, starting with YOLO, which stands for You Only Look Once. I'll be discussing YOLO v3 since it is the most recent version. Here's how it works. The image is divided into an SxS grid of cells. The algorithm takes a sliding-window-style approach: at each grid cell, and for each of 3 anchor boxes, it outputs the probability that an object is present, the class probabilities, and the bounding box. The entire network is over a hundred layers deep, and it makes predictions at 3 different scales, which helps it detect objects both small and large. Intersection over Union (IoU) is then used to discard overlapping duplicate boxes (non-maximum suppression) and keep the best bounding box for each object. While YOLO is great for real-time object detection and rarely mistakes background for an object (unlike RCNN), its bounding box predictions are generally less accurate.
Okay, now let's talk about RCNN. RCNN, which stands for Region-based CNN, has two parts. In the first, selective search is used to propose regions that might contain objects. In the second, each region is passed into an AlexNet to extract features and then an SVM for classification, after which regression is used to tighten the bounding box. In Fast RCNN, the three models (the AlexNet, the SVM classifier, and the regression model) are combined into one model, making the training process easier. In addition, convolution is performed once on the whole image and the computation is shared across regions, as opposed to being repeated for each region. And then Faster RCNN came along (very original names btw) and replaced Selective Search (a slow and static algorithm) with a region proposal network. First, a convolutional neural network generates a feature map, from which the region proposals are generated. Interestingly, the same feature map is then used to classify the proposed regions.
Then Mask RCNN came along, which has another output... Excuse me, I have really bad allergies.
Sal: It's ok, I understand.
Me: Anyhow, there are three outputs. In addition to the classification and the bounding box, Mask RCNN also outputs a pixel mask for each instance of an object (instance segmentation), unlike semantic segmentation, which labels every pixel with a class but does not differentiate between separate instances of that class. Anyway, this is the part where you ask how this is implemented.
Sal: How is this implemented?
Me: Well, Sal, that's a great question, and it's wonderful that we have curious thinkers like you in this world. A fully convolutional network (consisting of only convolution layers) is added as an extra branch on top of a regular Faster RCNN to predict a pixel mask for each detected object. A great dataset to train on is COCO (Common Objects in Context), which contains painstakingly annotated objects. Yesterday, I also tried to learn about the Fourier Transform, but I got lost pretty quickly. All I got was that somehow you transform from one domain to the inverse domain, and somehow this involves adding up multiple sine functions and Euler's Formula. I also went outside yesterday and did a dead hang for 90 seconds, which was pretty tough. Oh my, this is so fun I've lost track of time. Unfortunately, I've got a meeting to attend with Zuckerberg, and then lunch with Jeff Dean. You know, the usual.
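(A quick aside for readers, since Sal never asked for code. This is only a toy sketch of the difference between instance and semantic segmentation, not how Mask RCNN actually computes anything. The 4x4 "image", the two cats, and the helper name to_semantic are all made up by me for illustration.)

```python
# Toy 4x4 "image" containing two separate cats; every value here is invented.
# Instance segmentation: one binary mask per detected object.
cat_a = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
cat_b = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
instance_masks = [cat_a, cat_b]


def to_semantic(masks):
    """Collapse per-instance masks into a single per-pixel class map (1 = cat)."""
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [int(any(m[r][c] for m in masks)) for c in range(width)]
        for r in range(height)
    ]


# Semantic segmentation only knows "cat or not cat" for each pixel.
semantic_map = to_semantic(instance_masks)

# The instance view still remembers that there are two distinct cats.
num_instances = len(instance_masks)
```

(The semantic map marks all eight cat pixels the same way, so you can no longer tell where one cat ends and the next begins. The list of instance masks keeps that information.)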
Sal: Thanks for your time.
Me: No problem. If you ever need any advice or help for your nonprofit, feel free to shoot me an email.
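P.S. For anyone who wants the IoU part of the interview in code: below is a minimal sketch of Intersection over Union, plus greedy non-maximum suppression, which is one common way IoU gets used to pick a single box per object. I'm assuming boxes are (x1, y1, x2, y2) corner coordinates and a 0.5 overlap threshold. The function names iou and nms and the example boxes are mine, not taken from any particular YOLO implementation.

```python
# Boxes are (x1, y1, x2, y2) corner coordinates; this layout is an assumption.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    # Visit boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes plus one far away (made-up numbers).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the first two boxes overlap with an IoU of 81/119 (about 0.68), so the lower-scoring one is dropped, and the third box survives because it overlaps nothing.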