[tw] Importing Wikipedia Dumps

Discussion:

Richard Smith

2015-01-07 06:07:58 UTC

Hi All. I'm wondering if anybody has tried importing (largish amounts of)
Wikipedia data into TiddlyWiki?

I can use BJ's excellent TiddlyClip to import individual pages but I wonder
if there's a way to get larger chunks of Wikipedia?

It's possible to download offline dumps of various Wikipedia projects
(http://en.wikipedia.org/wiki/Wikipedia:Database_download) but I'm not sure
what the best format would be for getting the stuff into TW in a nice clean
way. Any ideas?

Regards,
Richard

--
You received this message because you are subscribed to the Google Groups "TiddlyWiki" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tiddlywiki+***@googlegroups.com.
To post to this group, send email to ***@googlegroups.com.
Visit this group at http://groups.google.com/group/tiddlywiki.
For more options, visit https://groups.google.com/d/optout.

Richard Smith

2015-01-07 06:14:20 UTC

OK. That was a bit lazy. I searched after I posted and found a similar
recent thread.
https://groups.google.com/forum/#!searchin/tiddlywiki/wikipedia/tiddlywiki/BefZrA4BpqQ/-XLsXOaav5wJ

I'll contextualise my question a little better.

I have recently been corresponding with someone who works for the Wikipedia
"Offline Content Generator" project (
http://www.mediawiki.org/wiki/Offline_content_generator) and I want to ask
him if it's possible to add a widget/filter (?) to get content in a
TiddlyWiki-friendly format. What should I ask for? :)

Regards,
Richard

PMario

2015-01-07 08:37:06 UTC

Wikipedia databases can be downloaded [1] in XML format, which needs some
post-processing before it can be used with TW. ...
BUT the problem here is size:

These files expand to multiple terabytes of text. Please only download
these if you know you can cope with this quantity of data. Go to Latest
Dumps <http://dumps.wikimedia.org/enwiki/latest/> and look out for all the
files that have 'pages-meta-history' in their name.

So IMO this is a no go!
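
For a limited subset, the XML does not have to be loaded whole. A minimal
sketch, assuming Python and a file in the usual MediaWiki export layout
(<page> elements holding a <title> and a revision <text>): the parser streams
the file and keeps only the titles it was asked for. The function names, the
wanted_titles parameter and the "wikipedia" tag are illustrative and not part
of any TiddlyWiki or MediaWiki tool.

```python
import io
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_tiddlers(xml_source, wanted_titles):
    """Yield one tiddler dict per <page> whose title is in wanted_titles."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        text = ""
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        if title in wanted_titles:
            # "wikipedia" is a hypothetical tag chosen for this sketch.
            yield {"title": title, "text": text, "tags": "wikipedia"}
        elem.clear()  # drop the finished page's children to limit memory use


# Tiny stand-in for a real dump file, so the sketch runs without a download.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>TiddlyWiki</title><revision><text>A '''wiki'''.</text></revision></page>
  <page><title>Other</title><revision><text>Skipped.</text></revision></page>
</mediawiki>"""

tiddlers = list(iter_tiddlers(io.BytesIO(SAMPLE.encode("utf-8")), {"TiddlyWiki"}))
payload = json.dumps(tiddlers, indent=2)
```

The text is still MediaWiki markup at this point, so the result is only a raw
container for the article source.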

-----------

I think, the most promising way is the wikipedia api:
http://www.mediawiki.org/wiki/API
or export: http://en.wikipedia.org/wiki/Special:Export
or http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot#APIs_for_bots

Special:Export seems to use XML only, so imo the API and the API for bots
are the options here.
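
A rough sketch of the API route, assuming Python and the MediaWiki query API
with prop=revisions and formatversion=2. It builds the request URL and turns a
response into tiddler dicts; the helper names are made up for this example,
and the response below is a hand-written stand-in rather than real output.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build a query URL asking for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
        "titles": title,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def pages_to_tiddlers(response):
    """Convert a decoded JSON response into a list of tiddler dicts."""
    tiddlers = []
    for page in response.get("query", {}).get("pages", []):
        revisions = page.get("revisions")
        if not revisions:
            continue  # the page is missing or has no readable revision
        content = revisions[0]["slots"]["main"]["content"]
        tiddlers.append({"title": page["title"], "text": content})
    return tiddlers


# Hand-written response in the assumed formatversion=2 shape.
SAMPLE_RESPONSE = {
    "query": {
        "pages": [
            {
                "title": "TiddlyWiki",
                "revisions": [{"slots": {"main": {"content": "'''TiddlyWiki''' is a wiki."}}}],
            },
            {"title": "No such page", "missing": True},
        ]
    }
}

url = build_query_url("TiddlyWiki")
tiddlers = pages_to_tiddlers(SAMPLE_RESPONSE)
```

Fetching the URL and decoding the JSON is left out so that the sketch runs
without network access.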

If Special:Export could create CSV or JSON, it could be used directly by TW
with drag-and-drop import. ... but ...

There is still a syntax problem. MediaWiki syntax is completely different
from the TW syntax. ... With an export / import mechanism, you'll also need
to export / import the "meta structure", e.g. tags, fields and maybe
relations ...

So imo it can be done for a limited amount of data, but there is some work
that needs to be done.

---------------

The TW json format for 2 tiddlers would look like this:

[
  {
    "created": "20150107082527588",
    "text": "some text for tiddler 1 with an internal link to [[tiddler 2]]",
    "title": "tiddler 1",
    "tags": "tag1 tag2",
    "modified": "20150107082619778",
    "field1": "some text for field1"
  },
  {
    "created": "20150107082624952",
    "text": "some text for tiddler 2 with an internal link to [[tiddler 1]]",
    "title": "tiddler 2",
    "tags": "tag1 tag2",
    "modified": "20150107082705968",
    "field1": "some more text for field1",
    "field2": "text for field2"
  }
]

A file that contains this text can be imported directly into TW out of the
box.
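
A file in this shape could be generated by a short script. This is only a
sketch, assuming Python; make_tiddler and tw_timestamp are names invented
here. The timestamp helper produces the 17-digit form used in the example
(YYYYMMDDHHMMSS plus milliseconds, in UTC), and tags containing spaces are
wrapped in double square brackets.

```python
import json
from datetime import datetime, timezone


def tw_timestamp(moment):
    """Format a timezone-aware datetime as YYYYMMDDHHMMSSmmm in UTC."""
    moment = moment.astimezone(timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


def make_tiddler(title, text, tags=(), when=None, **extra_fields):
    """Build one tiddler dict; extra keyword arguments become custom fields."""
    stamp = tw_timestamp(when or datetime.now(timezone.utc))
    tiddler = {"title": title, "text": text, "created": stamp, "modified": stamp}
    if tags:
        # Tags with spaces are written as [[tag name]] in the tags string.
        tiddler["tags"] = " ".join(f"[[{t}]]" if " " in t else t for t in tags)
    tiddler.update(extra_fields)
    return tiddler


when = datetime(2015, 1, 7, 8, 25, 27, 588000, tzinfo=timezone.utc)
tiddlers = [
    make_tiddler("tiddler 1", "links to [[tiddler 2]]", ["tag1", "tag two"], when,
                 field1="some text for field1"),
    make_tiddler("tiddler 2", "links to [[tiddler 1]]", ["tag1"], when),
]
payload = json.dumps(tiddlers, indent=2)  # write this string to tiddlers.json
```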

----------

CSV format looks like this:

"title","text","modified","created","field1","field2","tags"
"tiddler 1","some text for tiddler 1 with an internal link to [[tiddler 2]]","20150107082619778","20150107082527588","some text for field1","","tag1 tag2"
"tiddler 2","some text for tiddler 2 with an internal link to [[tiddler 1]]","20150107082705968","20150107082624952","some more text for field1","text for field2","tag1 tag2"

I don't know, how to import this file.
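
One possible workaround is to convert the CSV into the JSON form outside of
TW. A minimal sketch, assuming Python and a header row that names the fields
as in the sample; csv_to_tiddlers is a made-up helper name.

```python
import csv
import io
import json


def csv_to_tiddlers(csv_text):
    """Turn CSV text with a header row into a list of tiddler dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    tiddlers = []
    for row in reader:
        # Drop empty cells so each tiddler only carries fields it really has.
        tiddlers.append({field: value for field, value in row.items() if value})
    return tiddlers


SAMPLE_CSV = '''"title","text","modified","created","field1","field2","tags"
"tiddler 1","links to [[tiddler 2]]","20150107082619778","20150107082527588","some text for field1","","tag1 tag2"
"tiddler 2","links to [[tiddler 1]]","20150107082705968","20150107082624952","some more text for field1","text for field2","tag1 tag2"
'''

tiddlers = csv_to_tiddlers(SAMPLE_CSV)
payload = json.dumps(tiddlers, indent=2)  # save as a .json file for import
```

Dropping empty cells matches the JSON example, where tiddler 1 simply has no
field2.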

-------------

Important:
- Tiddler fields are dynamic.
- Number of Tags is dynamic
- TW Wiki Syntax is completely different, so some conversion would need to
be done.
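
To give an idea of the conversion step, here is a deliberately small sketch,
assuming Python. It covers only headings, piped links, bold and italics.
Templates, references, tables and everything else in MediaWiki markup are not
handled, and mediawiki_to_tw5 is a name invented for this example.

```python
import re

BOLD_MARK = "\x00"  # temporary placeholder; assumed not to occur in the text


def mediawiki_to_tw5(text):
    """Convert a few MediaWiki constructs to their TW5 wikitext equivalents."""

    # Headings: "== Title ==" (level 2) becomes "!! Title".
    def heading(match):
        return "!" * len(match.group(1)) + " " + match.group(2)

    text = re.sub(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", heading, text,
                  flags=re.MULTILINE)

    # Piped links: MediaWiki is [[Target|label]], TW5 is [[label|Target]].
    text = re.sub(r"\[\[([^\]|]+)\|([^\]]+)\]\]", r"[[\2|\1]]", text)

    # Bold uses three quotes in MediaWiki and two in TW5. Protect it first so
    # the italics rule (two quotes -> //...//) does not touch it.
    text = re.sub(r"'''(.+?)'''",
                  lambda m: BOLD_MARK + m.group(1) + BOLD_MARK, text)
    text = re.sub(r"''(.+?)''", r"//\1//", text)
    return text.replace(BOLD_MARK, "''")


source = ("== History ==\n"
          "'''TiddlyWiki''' is an ''open-source'' [[Wiki|wiki]].")
converted = mediawiki_to_tw5(source)
```

Plain links such as [[Wiki]] are the same in both syntaxes and pass through
unchanged.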

I hope that helps.

have fun!
mario

[1] http://en.m.wikipedia.org/wiki/Wikipedia:Database_download

PMario

2015-01-07 08:44:40 UTC

tiddlers.json
I can't upload tiddlers.csv :/
-m

PMario

2015-01-07 08:45:24 UTC

tiddlers.csv

Tobias Beer

2015-01-07 09:18:17 UTC

Hi Richard,

importing (largish amounts of) wikipedia data into TiddlyWiki

The first response that pops off my synapses reading such a proposal is:
*Why?* and again: *Why?*

As Mario hints, you will need...

- field mapping
- for importing
- content conversion
- formatting adaptors in TW5 to show MediaWiki-style markup

I don't think there's anything from within tw5 yet and...
the idea doesn't sound utterly compelling to me.

A Wikipedia import for individual pages, ok, but large dumps?

...if it's possible to add a widget/filter (?)
to get content in a TiddlyWiki-friendly format.
What should I ask for? :)

It would be terrific if someone created such a thing, but you gotta wonder
though:
Why would they ever want to invest so much effort into that?

So, what's the actual incentive / goal you have in mind other than
"because it would be cool" for TiddlyWiki?

What do you want to do with all this "stuff" in TiddlyWiki?

Best wishes, Tobias.

Andreas Hahn

2015-01-07 16:28:48 UTC

Post by Tobias Beer
So, what's the actual incentive / goal you have in mind other than
"because it would be cool" for TiddlyWiki?
What do you want to do with all this "stuff" in TiddlyWiki?

Well, I can think of several reasons:

- To take ownership of the information you need. (i.e. have them on YOUR
computer where they belong to YOU, this is a core idea behind TW)
- To have them readily available in the format you work with.
- To fully incorporate a specific piece of information in a specific
revision into your wiki without relying on external sources (which may
and will change over time).
- To search them << obviously

For most TW users, it will be obvious that the amount of data we are
talking about here makes it unfeasible to actually "Import" the data
into a TiddlyWiki. Instead I imagine that, after you converted the data
into a TW-like format, you would need:

- A suitable storage WITH a TiddlyWeb interface on top of it (this is
comparatively easy actually).
- A suitable lazy loading mechanism within your client TW.
- A proper search mechanism, since the default TW one will not do the job.
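
As a very rough illustration of the last two points, assuming Python on the
storage side: a "skinny" listing (every field except the text) is what a lazy
loading client could fetch first, and a small inverted index stands in for a
real search backend. All names here are hypothetical and nothing in this
sketch is an existing TiddlyWeb or TW5 API.

```python
import re
from collections import defaultdict


def skinny(tiddler):
    """Return the tiddler without its text, for a lightweight listing."""
    return {field: value for field, value in tiddler.items() if field != "text"}


def build_index(tiddlers):
    """Map each lowercase word to the set of titles whose text contains it."""
    index = defaultdict(set)
    for tiddler in tiddlers:
        for word in re.findall(r"\w+", tiddler.get("text", "").lower()):
            index[word].add(tiddler["title"])
    return index


def search(index, query):
    """Return the titles whose text contains every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


store = [
    {"title": "TiddlyWiki", "text": "A single page wiki", "tags": "wikipedia"},
    {"title": "Wiki", "text": "A wiki is a website", "tags": "wikipedia"},
]
listing = [skinny(t) for t in store]  # sent to the client up front
index = build_index(store)            # kept on the storage side
hits = search(index, "single wiki")
```

The full text of a tiddler would then be requested only when it is opened.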

/Andreas

Tobias Beer

2015-01-07 19:08:53 UTC

For a simple offline Wikipedia, perhaps use:

http://www.kiwix.org

Post by Andreas Hahn
To take ownership of the information you need. (i.e. have them on YOUR
computer where they belong to YOU, this is a core idea behind TW)

I wouldn't think of it as ownership. More like a copy, perhaps an offline
backup.

To have them readily available in the format you work with.

Kiwix will give you that, I think... there is also an Android app if you
want it.

Post by Andreas Hahn
To fully incorporate a specific piece of information in a specific
revision into your wiki without relying on external sources (which may and
will change over time).

I believe you can access and refer to specific revisions on Wikipedia; you
don't need to point to the latest revision or make a copy of it.

To search them

Wikipedia is quite good at that. Of course, always on the latest content,
which makes sense to me. Kiwix works too..

Post by Andreas Hahn
For most TW users, it will be obvious that the amount of data we are
talking about here makes it unfeasible to actually "Import" the data into a
TiddlyWiki.

For everyone, really. A standalone TiddlyWiki is clearly not designed for
that.

Post by Andreas Hahn
Instead I imagine that, after you converted the data into a TW-like
format...

Ok, so that's the thing Richard is proposing... some (command-line)
interface that primarily does precisely that.

Post by Andreas Hahn
A suitable storage WITH a TiddlyWeb interface on top of it (this is
comparatively easy actually).

In terms of getting to show anything, mostly flat, sure... in terms of
getting the relations and intricacies of fields and templates as on
Wikipedia, that's not at all "easy".

Post by Andreas Hahn
A suitable lazy loading mechanism within your client TW.

A proper search mechanism, since the default TW one will not do the job.

I guess, both of these go hand in hand. So, it would be some yet to be
invented server-side search and indexing module doing the heavy lifting.

I am still keen to hear of a compelling reason to do all that for large
dumps of Wikipedia, rather than individual articles.

Best wishes, Tobias.

Richard Smith

2015-01-08 09:16:40 UTC

Post by Tobias Beer
*Why?* and again: *Why?*

Hi Tobias,

My goal is to package information to be accessible completely offline (in
an environment where there is no possibility of an internet connection). I
would like to curate a collection which is suitable for a target audience
of young children and am also interested in taking foreign language
materials and using them as a starting point for a personal learning wiki.

The fact that TW is 'stand-alone' is only one of the great things about it.
The thing I like the most is that it's so easy to (re-)compose content (and
also build custom UI), which is why I find it interesting to use it as a
container for largish data sets.

One possibility, I guess, would be to adapt the idea of TiddlyClip so that
it can be given a list of pages and then fetch the content for all of them
in a systematic manner.
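
The list-driven fetch could be organised along these lines. A sketch only,
assuming Python and the MediaWiki query API, which accepts several titles in
one request separated by "|". The batch size of 50 is an assumption about the
per-request limit for ordinary clients, and batched and batch_urls are
illustrative names.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # assumed per-request title limit for ordinary API clients


def batched(titles, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def batch_urls(titles):
    """Return one query URL per batch of page titles."""
    urls = []
    for batch in batched(titles):
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "format": "json",
            "formatversion": "2",
            "titles": "|".join(batch),
        }
        urls.append(API_ENDPOINT + "?" + urlencode(params))
    return urls


# A curated reading list would go here; these titles are placeholders.
reading_list = [f"Page {n}" for n in range(120)]
urls = batch_urls(reading_list)
```

Each URL can then be fetched in turn, with a pause between requests, and the
responses converted to tiddlers.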

Thanks (all) for the useful suggestions

Regards,
Richard

RichShumaker

2015-01-09 05:11:21 UTC

I agree with Tobias on why repeat the repeating if something exists already
and can work.
I can also say that on more than one occasion when asked why did you do
that?
My response has been 'because I can.'

I have worked with HTML TW5 dealing with larger data sets and have not
enjoyed it.
So from personal experience I would avoid it.
Although I am about to explore using node.js and see how I can break my
browser.
To be clear, TW itself is only broken if you get a red box of death or some
other error from the code.
When you put a 55 MB PDF in a single tiddler, view it, and the browser
chokes, that's the browser.

With all of that said, I can honestly say, I would like to use TiddlyMap(?
- New TaskGraph name) with the Wikipedia to visually navigate the data in a
different way.
See relationships that I may have missed when they were words.
Also it would be interesting to link my current data set into the Wikipedia
data sets.
Obviously you have an area of interest that may say something like this on
Wikipedia
*TiddlyWiki* is an open-source <http://en.wikipedia.org/wiki/Open_source> single
page application <http://en.wikipedia.org/wiki/Single_page_application> wiki
<http://en.wikipedia.org/wiki/Wiki>. A single HTML
<http://en.wikipedia.org/wiki/HTML> file contains CSS
<http://en.wikipedia.org/wiki/Cascading_Style_Sheets>, JavaScript
<http://en.wikipedia.org/wiki/JavaScript>, and the content. The content is
divided into a series of components, or *Tiddlers*. A user is encouraged to
read a TiddlyWiki by following links rather than sequentially scrolling
down the page.
and you want to say more but still link into what may already exist.
Hey, that just struck a chord. Why don't we rewrite what we think
TiddlyWiki is and then press go on having it at Wikipedia (until someone
changes it).
I read the description from Wikipedia
<http://en.wikipedia.org/wiki/TiddlyWiki> that I posted above and thought
'true but lacking'.

Okay, back on track: from my point of view I agree with Tobias, but I also
see why someone would want this. I would use it if it were made.

Rich Shumaker