Hello Programmers, In this post, you will learn how to solve HackerRank Build a Stack Exchange Scraper Solution. This problem is a part of the Regex HackerRank Series.
One more thing to add, don’t straight away look for the solutions, first try to solve the problems by yourself. If you find any difficulty after trying several times, then look for the solutions. We are going to solve the Regex HackerRank Solutions using CPP, JAVA, PYTHON, JavaScript & PHP Programming Languages.

HackerRank Build a Stack Exchange Scraper Solution
Problem
Stack Exchange is an information power-house, which contains libraries of crowdsourced problems (with answers) across a large number of topics which are as diverse as electronics, cooking , programming, etc.
We are greatly interested in crawling and scraping as many questions, as we can, from stack-exchange. This is an example of a question library page from stack–exchange.
Your task will be, to scrape the questions from each library page, in the order in which they are listed. You will be provided with the markup of question listing pages, from which you need to detect:
(1) Identifier (2) Question text (which is on the Hyperlink to the question) (3) How long ago the question was asked.
The Markup in the Test Cases will be similar to the sample fragment shown below. Please note, that since this markup is real markup from the website, it is likely to contain some stray control and escape characters, unexpected whitespaces and newlines.
Sample Markup Fragment
<div class="question-summary" id="question-summary-80407"> <div class="statscontainer"> <div class="statsarrow"></div> <div class="stats"> <div class="vote"> <div class="votes"> <span class="vote-count-post "><strong>2</strong></span> <div class="viewcount">votes</div> </div> </div> <div class="status answered"> <strong>1</strong>answer </div> </div> <div class="views " title="60 views"> 60 views </div> </div> <div class="summary"> <h3><a href="/questions/80407/about-power-supply-of-opertional-amplifier" class="question-hyperlink">about power supply of opertional amplifier</a></h3> <div class="excerpt"> I am constructing an operational amplifier as shown in the following figure. I use a batter as supplier for the OP Amp and set it up as a non-inverting amp circuit. I saw that the output was clipped ... </div> <div class="tags t-op-amp"> <a href="/questions/tagged/op-amp" class="post-tag" title="show questions tagged 'op-amp'" rel="tag">op-amp</a> </div> <div class="started fr"> <div class="user-info "> <div class="user-action-time"> asked <span title="2013-08-27 21:49:14Z" class="relativetime">11 hours ago</span> </div> <div class="user-gravatar32"> <a href="/users/17060/user1285419"><div class=""><img src="https://www.gravatar.com/avatar/08ee68b20a4eceff26f7eee99b708c08?s=32&d=identicon&r=PG" alt="" width="32" height="32"></div></a> </div> <div class="user-details"> <a href="/users/17060/user1285419">user1285419</a><br> <span class="reputation-score" title="reputation score" dir="ltr">165</span><span title="5 bronze badges"><span class="badge3"></span><span class="badgecount">5</span></span> </div> </div> </div> </div> </div> <div class="question-summary" id="question-summary-80405"> <div class="statscontainer"> <div class="statsarrow"></div> <div class="stats"> <div class="vote"> <div class="votes"> <span class="vote-count-post "><strong>4</strong></span> <div class="viewcount">votes</div> </div> </div> <div class="status answered-accepted"> <strong>2</strong>answers </div> </div> <div class="views " title="64 views"> 64 views </div> </div> <div class="summary"> <h3><a href="/questions/80405/5v-regulator-power-dissipation" class="question-hyperlink">5V Regulator Power Dissipation</a></h3> <div class="excerpt"> I am using a 5V regulator (LP2950) from ON Semiconductor. I am using this for USB power and I'm feeding in 9V from an adapter. USB requires maximum of 500mA right? So the maximum power dissipation in ... </div> <div class="tags t-voltage-regulator t-surface-mount t-heatsink t-5v t-power-dissipation"> <a href="/questions/tagged/voltage-regulator" class="post-tag" title="show questions tagged 'voltage-regulator'" rel="tag">voltage-regulator</a> <a href="/questions/tagged/surface-mount" class="post-tag" title="show questions tagged 'surface-mount'" rel="tag">surface-mount</a> <a href="/questions/tagged/heatsink" class="post-tag" title="show questions tagged 'heatsink'" rel="tag">heatsink</a> <a href="/questions/tagged/5v" class="post-tag" title="show questions tagged '5v'" rel="tag">5v</a> <a href="/questions/tagged/power-dissipation" class="post-tag" title="show questions tagged 'power-dissipation'" rel="tag">power-dissipation</a> </div> <div class="started fr"> <div class="user-info "> <div class="user-action-time"> asked <span title="2013-08-27 21:39:31Z" class="relativetime">11 hours ago</span> </div> <div class="user-gravatar32"> <a href="/users/10082/david-norman"><div class=""><img src="https://www.gravatar.com/avatar/8b073417e471077280b3fc5ff2eaf1f7?s=32&d=identicon&r=PG" alt="" width="32" height="32"></div></a> </div> <div class="user-details"> <a href="/users/10082/david-norman">David Norman</a><br> <span class="reputation-score" title="reputation score" dir="ltr">322</span><span title="3 silver badges"><span class="badge2"></span><span class="badgecount">3</span></span><span title="10 bronze badges"><span class="badge3"></span><span class="badgecount">10</span></span> </div> </div> </div> </div> </div>
Output Format
The output file should contain N lines, where N is the number of questions you have identified in the provided fragment.Each line contains the identifier, question text and (relative) time when the question was asked (with no leading or trailing spaces surrounding each section); separated by semi-colons. The information about the questions in the output file should match with the ordering in the original markup.
Sample Output
80407;about power supply of operational amplifier;11 hours ago
80405;5V Regulator Power Dissipation;11 hours ago
Explanation
The given markup fragment points to two questions on electronics.stackexchange.com (at the time the markup was noted).
The first question has ID 80407, it is “about power supply of operational amplifier” and it was asked “11 hours ago” (relative to the time when this markup was noted). Search for these values in the given markup fragment to gain a better understanding of where we identified these values from. The second question has ID 80405, it is about “5V Regulator Power Dissipation”, and it was asked “11 hours ago” (relative to the time when this markup was noted).
A Note Regarding the Test Cases
The markup in the test cases will resemble the markup fragment provided above, however, each markup fragment might contain a larger number of questions embedded in it. A markup fragment will have no more than 100 questions embedded in it.
HackerRank Build a Stack Exchange Scraper Solution in Cpp
#include <stdio.h> #include <string.h> const char * p1 = "question-summary-"; const char * p2 = "question-hyperlink"; const char * p3 = "relativetime"; void setPalavra(char * in, char * out, int k, int size); bool letra(char a); int main() { char ent[1010], aux[1010]; int size; char saida[3][1010]; while( gets(ent) != NULL ) { size = strlen(ent); ent[size++] = '.'; for(int i=0; i<size; i++) if(letra(ent[i])) { setPalavra(ent, aux, i, size); if(!strcmp(aux, p1)) { int a = 0; for(int j=i+17; ent[j] != '\"'; j++) { saida[0][a++] = ent[j]; } saida[0][a] = 0; } else if(!strcmp(aux, p2)) { int a = 0; for(int j=i+20; ent[j] != '<'; j++) { saida[1][a++] = ent[j]; } saida[1][a] = 0; } else if(!strcmp(aux, p3)) { int a = 0; for(int j=i+14; ent[j] != '<'; j++) { saida[2][a++] = ent[j]; } saida[2][a] = 0; printf("%s;%s;%s\n", saida[0], saida[1], saida[2]); } i += strlen(aux)-1; } } } void setPalavra(char * in, char * out, int k, int size) { int a = 0; for(int i=k; i<size; i++) { if(letra(in[i])) { out[a++] = in[i]; } else { out[a] = 0; return; } } } bool letra(char a) { if((a >= 'a' && a <= 'z') || (a >= 'A' && a <= 'Z') || a == '-') return true; else return false; }
HackerRank Build a Stack Exchange Scraper Solution in Java
import java.io.*; import java.util.*; import java.text.*; import java.math.*; import java.util.regex.*; public class Solution { public static void main(String[] args) { /* Enter your code here. Read input from STDIN. Print output to STDOUT. Your class should be named Solution. */ Scanner in = new Scanner(new BufferedInputStream(System.in)); String format1 = "<a href=\"/questions/([0-9]+).*>(.*)</a>"; String format2 = ".*class=\"relativetime\">(.*)</span>"; Pattern pattern1 = Pattern.compile(format1); Pattern pattern2 = Pattern.compile(format2); ArrayList<String>ID = new ArrayList<String>(); ArrayList<String>question = new ArrayList<String>(); ArrayList<String>time = new ArrayList<String>(); while(in.hasNext()){ String assessed = in.nextLine(); Matcher match = pattern1.matcher(assessed); Matcher match2 = pattern2.matcher(assessed); while(match.find()){ match.groupCount(); ID.add(match.group(1)); question.add(match.group(2)); } while(match2.find()){ match2.groupCount(); time.add(match2.group(1)); } } for(int j = 0;j<ID.size();j++){ System.out.println(ID.get(j) + ";"+question.get(j)+";" + time.get(j)); } } }
HackerRank Build a Stack Exchange Scraper Solution in Python
import sys import re s=sys.stdin.read() pQ=[] pI=[] pT=[] patternQuestion='<a.* class="question-hyperlink">.*</a>' for x in re.findall(patternQuestion,s): pQ.append(re.sub("<[^>]*>","",x)) patternId='[^<]*id="question-summary-[0-9]*'; for x in re.findall(patternId,s): x=re.sub('div class="question-summary" id="question-summary-',"",x) pI.append(x) patternTime='<.*relativetime.*'; for x in re.findall(patternTime,s): x=re.sub('<[^>]*>',"",x) pT.append(x) for x in xrange(len(pT)): print pI[x]+";"+pQ[x]+";"+pT[x]
HackerRank Build a Stack Exchange Scraper Solution in JavaScript
'use strict'; function processData(input) { var lines = input.split('\n').join(' '); var questionREStr = '<\\s*a[^>]+href="/questions/([0-9]+)/[^"]*"[^>]*>([^<]*)<'; var timeREStr = '<\\s*span[^>]+class="relativetime"[^>]*>([^<]*)<'; var re = new RegExp('(?:' + questionREStr + '|' + timeREStr + ')', 'ig'); var res = []; var arr = null; while ((arr = re.exec(lines)) != null) { if (arr[1] !== undefined && arr[2] !== undefined) { res.push({id: arr[1], text: arr[2].trim() }); } if (arr[3] !== undefined) { res[res.length - 1].time = arr[3].trim(); } } res.forEach(function (o) { console.log(o.id + ';' + o.text + ';' + o.time); }); } process.stdin.resume(); process.stdin.setEncoding("ascii"); var _input = ""; process.stdin.on("data", function (input) { _input += input; }); process.stdin.on("end", function () { processData(_input); });
HackerRank Build a Stack Exchange Scraper Solution in PHP
<?php $f = fopen( 'php://stdin', 'r' ); $markup = ""; while( $line = fgets( $f ) ) $markup .= $line; fclose( $f ); $matches = array(); $regEx = '/class="question-summary" id="question-summary-([0-9]*)">.*class="question-hyperlink">(.*)<\/a>.*class="relativetime">(.*)<\/span>/siU'; preg_match_all( $regEx, $markup, $matches ); foreach( $matches[ 1 ] as $key => $id ) print $id . ";" . $matches[ 2 ][ $key ] . ";" . $matches[ 3 ][ $key ] . "\n"; ?>
Disclaimer: This problem (Build a Stack Exchange Scraper) is generated by HackerRank but the solution is provided by Chase2learn. This tutorial is only for Educational and Learning purposes.